Release Often Release Safely

Kung-Fu of releasing often but safely for high loaded systems

Sergejus Barinovas (@sergejusb)

http://sergejus.blogas.lt

This is not a theoretical presentation

This presentation is based on real-life experience

Successful software workflow

You released software

You got customers

You got even more customers

Your software cannot go down

Dilemma: Innovative or Stable?

Innovative: frequent (bi-weekly) releases of new features, but a higher risk of bugs and downtime

Stable: higher uptime and better customer perception, but only seasonal releases of new features

We wanted both …

… be innovative and agile while staying as stable as possible

Stability in our terms

99.999% uptime for serving ads

2 datacenters + clouds

500 M requests / day

Let’s learn the Kung Fu of releasing often and safely

Challenges we ha(d/ve)

Detect issues in production as soon as possible

Test new features in production while reducing impact for customers

Roll out new features in a controlled manner

Detect issues in production ASAP

Monitoring

Choose your monitoring system carefully. It took us about 1 year (Zabbix)

First list all your possible monitoring use cases

Prepare your software for monitoring

Logging is a must-have!

Performance / SLA counters help to measure and understand software better

Create a clear baseline to compare with after releases
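
The baseline idea can be sketched roughly as follows. This is only an illustration, not the setup used in the talk: the counter names, the 20% tolerance and the `compare_to_baseline` helper are all assumptions made for the example, and the numbers are invented.

```python
from dataclasses import dataclass


@dataclass
class Regression:
    counter: str
    baseline: float
    current: float
    change_pct: float


def compare_to_baseline(baseline, current, tolerance_pct=20.0):
    """Return counters whose value grew more than tolerance_pct over the baseline.

    Both arguments map a counter name to a "lower is better" value
    (for example response time or error rate).
    """
    regressions = []
    for counter, base_value in baseline.items():
        if counter not in current or base_value <= 0:
            continue  # nothing meaningful to compare against
        change_pct = (current[counter] - base_value) / base_value * 100.0
        if change_pct > tolerance_pct:
            regressions.append(
                Regression(counter, base_value, current[counter], change_pct)
            )
    return regressions


# Invented numbers: baseline recorded before the release, current after it.
baseline = {"avg_response_ms": 40.0, "errors_per_min": 2.0, "cpu_pct": 35.0}
after_release = {"avg_response_ms": 43.0, "errors_per_min": 5.0, "cpu_pct": 36.0}

for r in compare_to_baseline(baseline, after_release):
    print(f"{r.counter}: {r.baseline} -> {r.current} (+{r.change_pct:.0f}%)")
```

In practice the two dictionaries would be filled from whatever the monitoring system exports; the comparison itself stays this simple.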

Detect issues in production ASAP

Automated functional tests

Designed to detect end-user issues, unlike unit and integration tests

UI / business logic

Still not as many as we want (Selenium UI / C#)

Ongoing process of unifying automated QA tests

Run after each release and on a periodic basis

Very important if you have > 1 server

Huge time saver if tests are repetitive

Though unit tests help in finding bugs during coding, they are more vital when software evolves!
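
Running the same end-user checks against every server could look something like the sketch below. It is written in Python rather than the Selenium / C# stack mentioned on the slide, and the server names, paths and the `run_functional_checks` helper are hypothetical. The `fetch` function is injected so a fake can stand in for real HTTP or browser calls.

```python
def run_functional_checks(servers, checks, fetch):
    """Run every check against every server and collect the failures.

    servers: list of server names
    checks:  list of (name, path, predicate) tuples; predicate takes the
             response body and returns True when the end-user behaviour is OK
    fetch:   callable (server, path) -> response body
    """
    failures = []
    for server in servers:
        for name, path, predicate in checks:
            try:
                ok = predicate(fetch(server, path))
            except Exception as exc:  # a crashed check counts as a failure
                failures.append((server, name, f"error: {exc}"))
                continue
            if not ok:
                failures.append((server, name, "unexpected response"))
    return failures


# Fake fetch standing in for real requests; "web2" is made to time out.
def fake_fetch(server, path):
    if server == "web2" and path == "/ad":
        raise TimeoutError("no response")
    return "<html>banner</html>" if path == "/ad" else "OK"


checks = [
    ("ad is served", "/ad", lambda body: "banner" in body),
    ("health page", "/health", lambda body: body == "OK"),
]

for failure in run_functional_checks(["web1", "web2"], checks, fake_fetch):
    print(failure)
```

Because the result is a list of (server, check, reason) tuples, the same run can be scheduled periodically and fed into monitoring as well as executed right after a release.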

Test new features in production

Even an ideal staging environment is not equal to the production environment

Before starting to roll out a new feature, it is important to check its:

Resource consumption: CPU / RAM / HDD / IO / Network

Performance impact on existing functionality: response times / SLA

Stability: errors / memory leaks

Test new features in production

Use Case #1:

Safely roll out a new feature that integrates into the core data collection pipeline

Test new features in production

Dark releases

Works best with brand new features

Release the new feature to one or several servers

The new feature gets real load, but is not available to customers

Have an automated rollback package in case something goes wrong
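
One way to picture a dark release in code is shown below. This is a minimal sketch under assumptions, not the implementation from the talk: `handle_request`, `serve_ad` and `store_topic_model` are hypothetical names, and a real system would more likely run the dark path asynchronously so that it adds no latency.

```python
import logging

logger = logging.getLogger("dark_release")


def handle_request(request, current_logic, dark_logic, dark_enabled=True):
    """Serve the request with the current logic; exercise the dark feature on the side."""
    response = current_logic(request)
    if dark_enabled:
        try:
            dark_logic(request)  # gets real load, result is never shown to the customer
        except Exception:
            # A failing dark feature must never break the customer-facing path.
            logger.exception("dark feature failed")
    return response


# Hypothetical stand-ins for the existing and the dark-released code paths.
stored = []


def serve_ad(request):
    return f"ad for {request}"


def store_topic_model(request):
    stored.append(request)


print(handle_request("user-1", serve_ad, store_topic_model))
print(stored)
```

The customer always receives the response of the existing logic; the new code only observes the traffic, which is what makes its resource consumption and stability measurable before anyone depends on it.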

Test new features in production

Dark release notes from our release plan

Release Date | Release Type | Team | Project/Product | Release Notes

2011.08.03 | Dark | RnD | Topic Modelling | Final part of the Topic Model Storage dark release. Changes to pullTransactions procedure on all Collect servers. Enabled for Danish, Swedish and English languages.

2011.08.02 | Dark | RnD | Topic Modelling | Part 2 of the Topic Model Storage dark release. Changes to pullTransactions procedure on Collect2 server. Enabled for Danish language only.

2011.08.01 | Dark | RnD | Topic Modelling | Part 1 of the Topic Model Storage dark release. SQL part of Administration and Collect servers (apart from pullTransactions procedure, this will be in part 2). Windows service part of Proc03 including integration with Amazon.

Test new features in production

Use Case #2:

Safely migrate to the new SQL connection pooling mechanism

Test new features in production

Feature flags and switches

Work both for brand new features and for updates

A feature can be switched on / off at any time

if (FeatureEnabled) then …

if (UseNewLogic) then … else …

Can affect existing customers

Possible to test each server one by one by switching the feature on / off
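
The switch pattern could be sketched as follows. The `FeatureSwitches` class and the `use_new_connection_pool` switch name are assumptions made for this example (loosely modelled on Use Case #2), not code from the talk.

```python
import threading


class FeatureSwitches:
    """In-memory on/off switches that can be flipped while the server is running."""

    def __init__(self, defaults):
        self._lock = threading.Lock()
        self._state = dict(defaults)

    def is_enabled(self, name):
        with self._lock:
            return self._state.get(name, False)  # unknown switches are off

    def set(self, name, enabled):
        with self._lock:
            self._state[name] = bool(enabled)


switches = FeatureSwitches({"use_new_connection_pool": False})


def open_connection():
    # Mirrors the "if (UseNewLogic) then ... else ..." pattern.
    if switches.is_enabled("use_new_connection_pool"):
        return "connection from new pool"
    return "connection from old pool"


print(open_connection())
switches.set("use_new_connection_pool", True)
print(open_connection())
```

Keeping the old branch in place is what makes the switch a safety mechanism: turning it off is an instant rollback that needs no redeployment.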

Test new features in production

Use Case #3:

Safely migrate to the brand-new intelligent targeting subsystem

Test new features in production

Valves

Very similar to switches

A feature can get from 0% to 100% of real load

Very handy for gradually rolling out new features on each server one by one

So far they have helped us a lot, though they require extra development effort
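
A valve can be thought of as a switch with a percentage. The sketch below is one possible shape, assuming random per-request sampling; the `Valve` class and the targeting example are hypothetical and not taken from the talk.

```python
import random


class Valve:
    """Sends a configurable percentage (0-100) of calls to the new code path."""

    def __init__(self, percent=0, rng=None):
        self._rng = rng or random.Random()
        self.percent = 0
        self.set_percent(percent)

    def set_percent(self, percent):
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self.percent = percent

    def is_open(self):
        # random() is in [0.0, 1.0), so 0 never opens and 100 always opens.
        return self._rng.random() < self.percent / 100


targeting_valve = Valve(percent=0, rng=random.Random(42))


def pick_ad(request):
    if targeting_valve.is_open():
        return "new targeting"
    return "old targeting"


for percent in (0, 10, 100):
    targeting_valve.set_percent(percent)
    hits = sum(pick_ad(i) == "new targeting" for i in range(10_000))
    print(f"{percent}% valve: {hits / 100:.1f}% of requests used the new subsystem")
```

Random sampling means the same user may hit the old and the new path on consecutive requests. If that matters, hashing a stable user identifier into a 0-99 bucket is a common alternative.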

Test new features in production

Caveats we had so far

Make sure you can turn features on / off without affecting connected users

Create a simple interface to display the current status of all switches and valves on each affected server

Secure access to switches and valves
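
The status interface does not have to be elaborate. As an illustration only, the hypothetical `render_status` helper below prints the settings of every server as a plain-text table; the server and setting names are made up, and booleans are treated as switches while integers are treated as valve percentages.

```python
def render_status(servers):
    """Format switch and valve settings of every server as a plain-text table.

    servers maps a server name to a dict of setting name -> value, where a
    bool is a switch and an int is a valve percentage.
    """
    names = sorted({name for settings in servers.values() for name in settings})
    header = ["server"] + names
    rows = [header]
    for server in sorted(servers):
        row = [server]
        for name in names:
            value = servers[server].get(name)
            if value is None:
                row.append("-")  # setting not present on this server
            elif isinstance(value, bool):
                row.append("on" if value else "off")
            else:
                row.append(f"{value}%")
        rows.append(row)
    widths = [max(len(row[i]) for row in rows) for i in range(len(header))]
    return "\n".join(
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in rows
    )


status = {
    "web1": {"new_pool": True, "targeting_valve": 25},
    "web2": {"new_pool": False, "targeting_valve": 0},
}
print(render_status(status))
```

A single overview like this makes it obvious when servers have drifted apart, for example a valve left at 25% on one machine after a test.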

Controlling roll-out of new feature

Switches and valves enable a very smooth and controlled roll-out

Partial roll-out to different datacenters / clouds

Different datacenters / clouds have different versions of the feature released

Redirect all traffic to the new or the old version of the feature

Controlling roll-out of new feature

Future research: application level load balancing

A load balancer can act as a switch / valve without actually programming the load distribution logic

Ability to automatically redirect users to the new version of the application while preserving the old one

Summary

A monitoring system is very important, but your software should be prepared for it

Automated functional tests are functional monitoring of your software

Switches and valves are a very powerful concept for testing in production and for roll-outs, but they require extra development and maintenance time

Dark releases and partial roll-outs are the most cost-effective safety mechanisms

Thanks! Questions?

Sergejus Barinovas (@sergejusb)

http://sergejus.blogas.lt
