The Kung Fu of releasing often but safely for high-load systems
Release Often, Release Safely
Sergejus Barinovas (@sergejusb)
http://sergejus.blogas.lt
This is not a theoretical presentation
This presentation is based on real-life experience
Successful software workflow
You released software
You got customers
You got even more customers
Your software cannot go down
Dilemma: Innovative or Stable?
Innovative: frequent (bi-weekly) releases of new features, but a higher risk of bugs and downtime
Stable: higher uptime and better customer perception, but only seasonal releases of new features
We wanted both …
… to be innovative and agile while staying as stable as possible
Stability in our terms
99.999% uptime for serving ads
2 datacenters + clouds
500 M requests / day
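To put these targets in perspective, here is a small worked calculation (an illustration, not part of the original talk). A 99.999% uptime target leaves roughly 5.26 minutes of downtime per year, and 500 M requests per day averages out to about 5,800 requests per second; real peaks are presumably higher, since the sketch assumes load is spread evenly over the day. The function names are made up for the example.

```python
# Rough error-budget and load arithmetic for the figures above:
# 99.999% uptime and 500 M requests per day.

SECONDS_PER_DAY = 24 * 60 * 60            # 86,400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY  # ignores leap years


def allowed_downtime_seconds(uptime: float, period_seconds: int) -> float:
    """Downtime permitted by an uptime target over the given period."""
    return (1.0 - uptime) * period_seconds


def average_rps(requests_per_day: int) -> float:
    """Average requests per second, assuming load is spread evenly over the day."""
    return requests_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    uptime = 0.99999
    per_year = allowed_downtime_seconds(uptime, SECONDS_PER_YEAR)
    per_day = allowed_downtime_seconds(uptime, SECONDS_PER_DAY)
    print(f"Allowed downtime: {per_year / 60:.2f} min/year, {per_day:.2f} s/day")
    print(f"Average load: {average_rps(500_000_000):,.0f} requests/s")
```

With five nines, a single failed release that takes a few minutes to roll back can consume the whole yearly budget, which is why the following slides focus on catching problems before they reach every server.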
Let’s learn the Kung Fu of releasing often
and safely
Challenges we ha(d/ve)
Detect issues in production as soon as possible
Test new features in production while reducing impact for customers
Roll out new features in a controlled manner
Detect issues in production ASAP
Monitoring
Choose your monitoring system carefully
It took us about 1 year (Zabbix)
First list all your possible monitoring use cases
Prepare your software for monitoring
Logging is a must-have!
Performance / SLA counters help you measure and understand your software better
Create a clear baseline to compare against after releases
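Comparing against a baseline can be as simple as checking whether a counter's average after a release has risen by more than some tolerance over its pre-release average. The sketch below assumes counters where higher means worse (response time, error rate); the counter names, the 10% tolerance and the `regressed_counters` helper are hypothetical, and a real setup would pull the samples from the monitoring system rather than from literal lists.

```python
from statistics import mean


def regressed_counters(baseline: dict[str, list[float]],
                       after_release: dict[str, list[float]],
                       tolerance: float = 0.10) -> list[str]:
    """Return the names of counters whose post-release mean exceeds the
    pre-release baseline mean by more than `tolerance` (relative).
    Assumes higher values are worse for every counter passed in."""
    regressed = []
    for name, base_samples in baseline.items():
        new_samples = after_release.get(name)
        if not base_samples or not new_samples:
            continue  # no data to compare; a missing counter is a separate alert
        if mean(new_samples) > mean(base_samples) * (1.0 + tolerance):
            regressed.append(name)
    return sorted(regressed)


if __name__ == "__main__":
    # Hypothetical counter samples collected before and after a release.
    before = {"response_ms": [12, 14, 13], "errors_per_min": [0.5, 0.4, 0.6]}
    after = {"response_ms": [18, 19, 17], "errors_per_min": [0.5, 0.5, 0.5]}
    print(regressed_counters(before, after))  # ['response_ms']
```

A relative tolerance keeps one rule usable across counters with very different scales; noisier counters would likely need a larger tolerance or more samples.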
Detect issues in production ASAP
Automated functional tests
Designed to detect end-user issues, unlike unit and integration tests
Cover UI / business logic
Still not as many as we want (Selenium UI / C#)
Ongoing process of unifying automated QA tests
Run after each release and on a periodic basis
Very important if you have more than one server
Huge time saver if tests are repetitive
Though unit tests help in finding bugs during coding, they are even more vital as the software evolves!
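Running the same functional checks after each release and on a schedule is what makes them valuable once there is more than one server: every server gets exactly the same checks. Below is a minimal sketch of that loop, assuming each check is a function that takes a server name and returns True on success. The check functions are stand-ins invented for illustration; real ones would drive a browser (for example through Selenium) or call the server's public endpoints, and scheduling is left to whatever job runner is already in place.

```python
from typing import Callable, Iterable


def run_smoke_tests(servers: Iterable[str],
                    checks: dict[str, Callable[[str], bool]]) -> dict[str, list[str]]:
    """Run every check against every server and return
    {server: [names of failed checks]} for servers with at least one failure.
    A check that raises an exception is counted as a failure."""
    failures: dict[str, list[str]] = {}
    for server in servers:
        failed = []
        for name, check in checks.items():
            try:
                ok = check(server)
            except Exception:
                ok = False
            if not ok:
                failed.append(name)
        if failed:
            failures[server] = failed
    return failures


# Stand-in checks for illustration only.
def homepage_loads(server: str) -> bool:
    return server != "web2"  # pretend web2 is broken


def ad_is_served(server: str) -> bool:
    if server == "web3":
        raise TimeoutError("no response")  # pretend web3 times out
    return True


if __name__ == "__main__":
    report = run_smoke_tests(
        ["web1", "web2", "web3"],
        {"homepage_loads": homepage_loads, "ad_is_served": ad_is_served},
    )
    print(report)  # {'web2': ['homepage_loads'], 'web3': ['ad_is_served']}
```

Treating exceptions as failures (instead of letting them abort the run) means one broken server does not hide the results for the others.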
Test new features in production
Even an ideal staging environment is not the same as production
Before starting to roll out a new feature, it is important to check its:
Resource consumption: CPU / RAM / HDD / IO / Network
Performance impact on existing functionality: response times / SLA
Stability: errors / memory leaks
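Memory leaks tend to show up in production as steady growth rather than as outright errors. One simple way to flag this (a technique chosen for illustration, not one the talk describes) is to fit a least-squares line through periodic memory samples from a server running the new feature and alert when the slope exceeds a budget. The 60-second sampling interval and 50 MB/hour threshold below are arbitrary assumptions.

```python
def slope_per_second(samples: list[float], interval_s: float = 60.0) -> float:
    """Least-squares slope of evenly spaced samples, in units per second."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = [i * interval_s for i in range(n)]
    x_mean = sum(xs) / n
    y_mean = sum(samples) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, samples))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den


def looks_like_leak(memory_mb: list[float], interval_s: float = 60.0,
                    max_growth_mb_per_hour: float = 50.0) -> bool:
    """True if memory grows faster than the allowed budget (hypothetical threshold)."""
    return slope_per_second(memory_mb, interval_s) * 3600 > max_growth_mb_per_hour


if __name__ == "__main__":
    # One sample per minute: +1 MB per minute is about 60 MB per hour.
    print(looks_like_leak([100, 101, 102, 103]))  # True
    print(looks_like_leak([100, 100, 100, 100]))  # False
```

Fitting a line is less sensitive to a single garbage-collection spike than comparing the first and last samples, but it still needs a long enough window to separate a leak from normal warm-up.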
Test new features in production
Use Case #1:
Safely roll out a new feature that integrates into the core data collection pipeline
Test new features in production
Dark releases
Work best with brand-new features
Release the new feature to one or several servers
The new feature gets real load, but is not available to customers
Have an automated rollback package in case something goes wrong
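One way to give a feature real load while keeping it invisible to customers is to run it in the shadow of the existing code path: the old path still produces the response, the new path runs on the same input, and its result and any errors are only recorded. The sketch below assumes a `dark_enabled` setting that is true only on the servers carrying the dark release; the function and parameter names are hypothetical.

```python
import logging

log = logging.getLogger("dark_release")


def handle_request(request, old_feature, new_feature, dark_enabled: bool):
    """Serve `request` with the existing code path. When `dark_enabled` is set
    (e.g. only on the server(s) carrying the dark release), also run the new
    feature on the same input so it sees real load, but discard its result so
    customers never see it. Failures in the new path are logged, never raised."""
    result = old_feature(request)
    if dark_enabled:
        try:
            new_feature(request)  # result intentionally ignored
        except Exception:
            log.exception("dark feature failed for request %r", request)
    return result


if __name__ == "__main__":
    # The new path always fails here, yet the customer still gets the old result.
    print(handle_request(3, lambda r: r * 2, lambda r: 1 / 0, dark_enabled=True))  # 6
```

Running the dark path synchronously adds its latency to every request on that server; a production version would more likely hand the work to a background queue, and would also watch the resource counters mentioned earlier.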
Test new features in production
Dark release notes from our release plan
Release Date | Release Type | Team | Project/Product | Release Notes
2011.08.03 | Dark | RnD | Topic Modelling | Final part of the Topic Model Storage dark release. Changes to the pullTransactions procedure on all Collect servers. Enabled for Danish, Swedish and English.
2011.08.02 | Dark | RnD | Topic Modelling | Part 2 of the Topic Model Storage dark release. Changes to the pullTransactions procedure on the Collect2 server. Enabled for Danish only.
2011.08.01 | Dark | RnD | Topic Modelling | Part 1 of the Topic Model Storage dark release. SQL part of the Administration and Collect servers (apart from the pullTransactions procedure, which will be in part 2). Windows service part of Proc03, including integration with Amazon.
Test new features in production
Use Case #2:
Safely migrate to the new SQL connection pooling mechanism
Test new features in production
Feature flags and switches
Work both for brand-new features and for updates
A feature can be switched on / off at any time
if (FeatureEnabled) then …
if (UseNewLogic) then … else …
Can affect existing customers
Possible to test each server one by one by switching the feature on / off
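The `if (UseNewLogic) then … else …` pattern can be backed by a small per-server switch store. Below is a minimal sketch, assuming switches are loaded from a JSON document that each server reads and that operators can edit. The `FeatureSwitches` class, the `UseNewConnectionPooling` switch name and the string return values are placeholders invented for illustration, loosely modelled on the connection pooling use case.

```python
import json


class FeatureSwitches:
    """Hypothetical per-server on/off switches loaded from a JSON config."""

    def __init__(self, config_json: str):
        self._flags: dict[str, bool] = {
            name: bool(value) for name, value in json.loads(config_json).items()
        }

    def enabled(self, name: str) -> bool:
        # Unknown switches default to off, so a missing entry means "old behaviour".
        return self._flags.get(name, False)

    def set(self, name: str, value: bool) -> None:
        self._flags[name] = value


def get_connection(switches: FeatureSwitches) -> str:
    # Mirrors the slide's `if (UseNewLogic) then … else …` pattern;
    # the strings stand in for real connection objects.
    if switches.enabled("UseNewConnectionPooling"):
        return "connection from new pool"
    return "connection from old pool"


if __name__ == "__main__":
    switches = FeatureSwitches('{"UseNewConnectionPooling": false}')
    print(get_connection(switches))  # connection from old pool
    switches.set("UseNewConnectionPooling", True)  # flip on this server only
    print(get_connection(switches))  # connection from new pool
```

Defaulting unknown switches to off means a server with a stale or incomplete config falls back to the proven code path rather than the new one.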
Test new features in production
Use Case #3:
Safely migrate to the brand-new intelligent targeting subsystem
Test new features in production
Valves
Very similar to switches
A feature can receive anywhere from 0% to 100% of real load
Very handy for gradually rolling out new features on each server one by one
So far they have helped us a lot, though they require extra development effort
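A valve can be implemented by hashing a stable key (for example a user or request identifier) into one of 100 buckets and letting the feature handle only the buckets below the current percentage. This is one common way to build it, not necessarily how the talk's valves work. Hashing makes the decision sticky: a given key keeps seeing the same variant, and opening the valve from 20% to 50% only adds keys, it never removes any. The `in_valve` name and the per-feature `salt` are assumptions.

```python
import hashlib


def in_valve(key: str, percent: int, salt: str = "") -> bool:
    """Return True if `key` falls into the first `percent` of 100 buckets.
    The same key always maps to the same bucket, so raising `percent`
    only ever adds keys to the feature. Use a different `salt` per feature
    so that separate valves do not select the same users."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


if __name__ == "__main__":
    users = [f"user{i}" for i in range(1000)]
    share = sum(in_valve(u, 30, salt="new-targeting") for u in users) / len(users)
    print(f"{share:.0%} of users get the new feature")  # roughly 30%
```

If stickiness does not matter for a feature, a per-request random draw against the percentage is simpler, but then the same user may see both versions.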
Test new features in production
Caveats we have hit so far
Make sure you can turn features on / off without affecting connected users
Create a simple interface to display the current status of all switches and valves on each affected server
Secure access to switches and valves
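Two of these caveats can be sketched together. Pinning switch values when a user connects means flipping a switch later only affects new sessions, and a consistent snapshot of all switches is what a simple status page needs. The classes below are hypothetical and run in a single process; a real deployment would share state across servers and put authentication in front of `set`, which this sketch does not attempt.

```python
import threading


class SwitchBoard:
    """Hypothetical thread-safe store of switches (on/off) and valves (0-100)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._values: dict[str, object] = {}

    def set(self, name: str, value: object) -> None:
        with self._lock:
            self._values[name] = value

    def snapshot(self) -> dict[str, object]:
        # Copy under the lock so callers get a consistent view.
        with self._lock:
            return dict(self._values)


class Session:
    """Pins switch values when a user connects, so flipping a switch later
    affects only new sessions, not users who are already connected."""

    def __init__(self, board: SwitchBoard) -> None:
        self.flags = board.snapshot()


def status_report(server: str, board: SwitchBoard) -> str:
    """One plain-text status line per server for a simple admin page."""
    items = ", ".join(f"{k}={v}" for k, v in sorted(board.snapshot().items()))
    return f"{server}: {items}"


if __name__ == "__main__":
    board = SwitchBoard()
    board.set("UseNewTargeting", False)
    connected = Session(board)           # existing user
    board.set("UseNewTargeting", True)   # operator flips the switch
    print(connected.flags["UseNewTargeting"])       # False: existing session unchanged
    print(Session(board).flags["UseNewTargeting"])  # True: new sessions pick it up
    print(status_report("ads01", board))            # ads01: UseNewTargeting=True
```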
Controlling roll-out of new feature
Switches and valves enable a very smooth and controlled roll-out
Partial roll-out to different datacenters / clouds: different datacenters / clouds have different versions of the feature released
Redirect all traffic to the new or old version of the feature
Controlling roll-out of new feature
Future research: application-level load balancing
The load balancer can act as a switch / valve without us actually programming the load distribution logic
Ability to automatically redirect users to the new version of the application while preserving the old one
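The idea behind application-level load balancing is that the routing decision moves out of the application code. As an illustration of what such a load balancer would do internally (a hypothetical sketch, not a description of any particular product), the router below keeps the old version deployed, sends a configurable share of requests to the new version, and can send everything back to the old version in one step. The `VersionRouter` class and backend names are invented for the example.

```python
import random


class VersionRouter:
    """Hypothetical router that keeps the old application version deployed
    and sends a configurable share of requests to the new version."""

    def __init__(self, old_backends, new_backends, new_share=0.0, rng=None):
        if not old_backends:
            raise ValueError("the old version must stay available")
        self.old_backends = list(old_backends)
        self.new_backends = list(new_backends)
        self.new_share = new_share  # 0.0 = all old, 1.0 = all new
        self._rng = rng or random.Random()

    def pick(self) -> str:
        # random() is in [0, 1), so new_share=1.0 always selects the new pool.
        use_new = bool(self.new_backends) and self._rng.random() < self.new_share
        pool = self.new_backends if use_new else self.old_backends
        return self._rng.choice(pool)

    def rollback(self) -> None:
        """Redirect all traffic back to the old version."""
        self.new_share = 0.0


if __name__ == "__main__":
    router = VersionRouter(["app-v1-a", "app-v1-b"], ["app-v2-a"], new_share=0.1)
    print([router.pick() for _ in range(5)])  # mostly v1 backends
```

Unlike a sticky hash-based valve, this picks per request; a real load balancer would typically add session affinity so a user does not bounce between versions.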
Summary
A monitoring system is very important, but your software must be prepared for it
Automated functional tests are functional monitoring of your software
Switches and valves are a very powerful concept for testing in production and for roll-outs, but they require extra development and maintenance time
Dark releases and partial roll-outs are the most cost-effective safety mechanisms
Thanks! Questions?
Sergejus Barinovas (@sergejusb)
http://sergejus.blogas.lt