Ways to minimise performance risks in continuous delivery


Adriaan Thomas, 4 June 2013

https://intranet.xebia.com/confluence/display/knowledge/Ways+to+minimise+performance+risks+in+continuous+delivery




INTRODUCTION

OBJECTIVE

Put working software into production as quickly as possible, whilst minimising the risk of load-related problems:

• Bad response times

• Lack of capacity

• Availability too low

• Excessive system resource use

Within the context of websites.

TRADITIONAL APPROACH

Load testing through simulation

http://www.flickr.com/photos/danramarch/4423023837

DECIDE WHAT TO TEST

• Focus on busiest instant

• Model most-hit functionality

• Extrapolate to expected load

• Look at production traffic

• Or attempt an educated guess
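One way to find the busiest instant in production traffic is to count requests per second and scale the observed peak to the expected load. The sketch below assumes the access-log timestamps have already been parsed into `datetime` objects; the function names, the sample data and the growth factor are illustrative assumptions, not something from the talk.

```python
from collections import Counter
from datetime import datetime


def peak_requests_per_second(timestamps):
    """Return (busiest second, number of requests in that second)."""
    per_second = Counter(ts.replace(microsecond=0) for ts in timestamps)
    if not per_second:
        raise ValueError("no requests given")
    return max(per_second.items(), key=lambda item: item[1])


def extrapolate(peak_rps, growth_factor):
    """Scale the observed peak to the load the test should simulate."""
    return peak_rps * growth_factor


# Hypothetical sample: three requests in one second, one in the next.
sample = [
    datetime(2013, 6, 4, 19, 37, 40, 100000),
    datetime(2013, 6, 4, 19, 37, 40, 500000),
    datetime(2013, 6, 4, 19, 37, 40, 900000),
    datetime(2013, 6, 4, 19, 37, 41, 200000),
]
busiest, peak = peak_requests_per_second(sample)
target = extrapolate(peak, growth_factor=2.0)  # assumption: traffic doubles
```

The growth factor is exactly the kind of educated guess the slide mentions, so it is worth recording where the number came from.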

DECIDE ON SCOPE

Component test

Chain test

Full environment test

• Test coverage

• Level of certainty

• Number of systems

• Amount of work

SET UP TEST DATA

• Usually starts as a copy from production

• Or an educated guess at what people will enter

• Render anonymous

• Make tests deterministic

• Synchronise between all systems

http://www.flickr.com/photos/22168167@N00/3889737939/

DECIDE ON STRATEGY

One or more of:

• Scalability test

• Stress test

• Endurance test

• Regression test

• Resilience test

http://www.flickr.com/photos/timjoyfamily/5935279962/

DECIDE ON TEST DURATION

(which is tricky)

http://www.flickr.com/photos/wwarby/3297205226

PROVIDE HARDWARE

http://www.flickr.com/photos/s_w_ellis/2681151694/

Copy of production?

Only one copy?

Virtualisation?

Sharing between teams?

INTEGRATE INTO PIPELINE

Unit test: very fast

Functional integration test: fast

Load test: takes longer


PERMANENT LOAD TESTING

Daytime: constant load, teams inspect impact of changes

Nighttime: Endurance test

Weekends: refresh test data

http://www.flickr.com/photos/renaissancechambara/5106171956/

RESPONSE TIME

DNS lookup (www.xebia.com)

Time to first byte + loading HTML

Time to render

Time to document complete

Browser CPU use

Bandwidth

# connections to a single host

http://www.webpagetest.org/result/130522_FG_10SC/1/details/

SSL handshake

Parse times

Blocking client code

http://www.xebia.com

IMPACT OF THE BROWSER

http://www.browserscope.org

CLEAR REQUIREMENTS

Response time

Fail: 10  Now: 3.5  Goal: 1

Intention: Users get a response quickly so that they are happy and spend more money.

Stakeholder: Marketing dept.

Scale: 95th percentile of “document complete” response times, in seconds, measured over one minute.

Metric: Page load times as reported by our RUM tool.

Inspired by Tom Gilb, Competitive Engineering
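A requirement written this way can be checked mechanically. The sketch below computes the 95th percentile of one minute of "document complete" times with the nearest-rank method and compares it with the Fail and Goal levels from the slide. The function names, the sample values and the choice of nearest-rank are assumptions for illustration; a RUM tool may calculate percentiles differently.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def assess(load_times_s, fail=10.0, goal=1.0):
    """Classify one minute of page load times against the Fail and Goal levels."""
    p95 = percentile(load_times_s, 95)
    if p95 >= fail:      # assumption: reaching the Fail level counts as failing
        return p95, "fail"
    if p95 <= goal:
        return p95, "goal met"
    return p95, "acceptable, goal not met"


# Hypothetical one-minute sample of page load times in seconds.
minute = [0.8, 0.9, 1.1, 1.2, 1.3, 1.4, 1.6, 2.0, 2.4, 3.5]
p95, verdict = assess(minute)
```

With this sample the 95th percentile lands on the slowest request, which matches the "Now: 3.5" level: not failing, but not yet at the goal.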

WebPageTest: first view + repeat view (median of 3)

95th percentile response times from access logs

ADJUST REQUIREMENTS DUE TO LACK OF REAL BROWSERS

Playground to test changes

No impact on real users

Less pressure

More work

Guesswork and extrapolation

Can take a significant amount of time

More hardware

THINGS WILL BREAK...

... in spite of your best efforts

http://www.flickr.com/photos/jmarty/1239950166/

SO INSTEAD WE SHOULD FOCUS ON FAST RECOVERY

http://www.flickr.com/photos/19107136@N02/8386567228/

“MTTR is more important than MTBF*”

John Allspaw

* for most types of F
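One way to see why recovery time deserves attention is the standard steady-state availability formula, MTBF / (MTBF + MTTR). The sketch below uses made-up numbers (hours) purely for illustration.

```python
def availability(mtbf_hours, mttr_hours):
    """Fraction of time the system is up, given mean time between failures and mean time to recover."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures, not measurements from the talk.
baseline = availability(mtbf_hours=100, mttr_hours=2)
fewer_failures = availability(mtbf_hours=200, mttr_hours=2)   # failures half as often
faster_recovery = availability(mtbf_hours=100, mttr_hours=1)  # recovery twice as fast
```

Under this formula, halving MTTR yields the same availability as doubling MTBF, so the comparison comes down to which of the two is cheaper to achieve.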

[Chart: 99th percentile response time (s), 0 to 2.0, against test duration]

MTBF LEADS TO FUD

TTR

Time → TTD, find cause (RCA), write & test fix, build (compile), deploy (deploy & test), validate

• Monitoring

• Alerts

• Skills

• Organisation

• Culture

• Maintainability

• Simple architecture

• Fast workstations

• Good tooling

• Able to quickly test locally

• Automation

• Fast build server

• Efficient tests

• Monitoring

• Automation

• Flexible architecture

DEMING FEEDBACK LOOPS

Plan

Do

Study

Act

OODA LOOPS

Observe

Orient

Decide

Act

AVOID TEST-ONLY MEASUREMENTS

SIMPLE ARCHITECTURE

THE ONLY THING THAT MATTERS IS WHAT HAPPENS IN PRODUCTION

Everything else is an assumption.

DEPLOYING CHANGES

http://www.flickr.com/photos/39463459@N08/5083733600

BLUE-GREEN DEPLOYMENTS

Amazon Route 53

Version n: Elastic Load Balancer, Instances

Version n+1: Elastic Load Balancer, Instances

DARK LAUNCHING

[Diagram: Web page, DB, Weather SP]
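A minimal sketch of a dark launch, assuming the page is served from the DB while a new weather service call is exercised under production load without its result being shown. The function and parameter names are hypothetical, and the two back ends are passed in as plain callables so the example stays self-contained.

```python
import time


def render_page(load_from_db, fetch_weather, timings):
    """Serve the page from the DB and exercise the new weather call without showing it."""
    page = load_from_db()
    start = time.perf_counter()
    try:
        fetch_weather()      # result deliberately discarded: the feature is dark
        outcome = "ok"
    except Exception:
        outcome = "error"    # a failing dark call must never break the page
    timings.append((outcome, time.perf_counter() - start))
    return page


def failing_weather():
    raise RuntimeError("weather service down")


timings = []
page = render_page(lambda: "<html>home</html>", failing_weather, timings)
```

In this sketch the dark call runs inline, so it still adds latency to the page; running it asynchronously or with a short timeout would be a reasonable refinement.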

FEATURE TOGGLES
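A feature toggle can be as small as a named switch checked at the point where old and new code paths diverge. The sketch below keeps the toggles in memory; the class, the `new-search` toggle name and the `search` function are hypothetical, and a real setup would more likely read toggles from configuration.

```python
class FeatureToggles:
    """In-memory toggle store, for illustration only."""

    def __init__(self, enabled=()):
        self._enabled = set(enabled)

    def enable(self, name):
        self._enabled.add(name)

    def disable(self, name):
        self._enabled.discard(name)

    def is_enabled(self, name):
        return name in self._enabled


def search(query, toggles):
    """Route to the new code path only while its toggle is on."""
    if toggles.is_enabled("new-search"):  # hypothetical toggle name
        return f"new search results for {query!r}"
    return f"old search results for {query!r}"


toggles = FeatureToggles()
before = search("shoes", toggles)
toggles.enable("new-search")
after = search("shoes", toggles)
toggles.disable("new-search")  # switching off is the fast recovery path
rolled_back = search("shoes", toggles)
```

Because switching off needs no build or deploy, a toggle shortens time to recover when the new path misbehaves under load.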

CANARY RELEASING

0% 100%
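Moving from 0% to 100% requires a stable way to decide which users see the new version. One common approach, sketched here under the assumption that each request carries a user identifier, is to hash the identifier into one of 100 buckets and compare it with the current rollout percentage. The function names are illustrative.

```python
import hashlib


def bucket(user_id):
    """Map a user id to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def serves_canary(user_id, percentage):
    """True if this user should get the new version at the given rollout percentage."""
    return bucket(user_id) < percentage


users = [f"user-{n}" for n in range(1000)]  # hypothetical user ids
at_ten_percent = [u for u in users if serves_canary(u, 10)]
```

Because the bucket depends only on the user id, a user who is in the canary at 10% stays in it as the percentage is raised, and setting the percentage back to 0 removes everyone at once.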

PRODUCTION-IMMUNE SYSTEMS
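One reading of a production immune system is an automated check that compares the new version with the current one and triggers a rollback when the new version is clearly worse. The sketch below uses error rates only; the thresholds, parameter names and the decision rule are assumptions chosen for illustration.

```python
def should_roll_back(baseline_errors, baseline_requests,
                     canary_errors, canary_requests,
                     max_ratio=2.0, floor=0.01, min_requests=100):
    """Decide whether the canary's error rate is bad enough to roll back."""
    if canary_requests < min_requests or baseline_requests == 0:
        return False  # too little data to judge
    baseline_rate = baseline_errors / baseline_requests
    canary_rate = canary_errors / canary_requests
    # Tolerate up to max_ratio times the baseline rate, but never less than the floor,
    # so a baseline with zero errors does not make a single canary error fatal.
    tolerated = max(baseline_rate * max_ratio, floor)
    return canary_rate > tolerated


# Hypothetical counts taken over the same time window.
decision = should_roll_back(baseline_errors=10, baseline_requests=1000,
                            canary_errors=5, canary_requests=200)
```

The same shape works for response times or business metrics such as revenue per minute; the point is that the comparison runs automatically after every deployment.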

CONTROLLED LOAD TESTING

[Diagram: Amazon Route 53, Elastic Load Balancer, Instances, RDS DB Instance, RDS DB Instance Read Replica]

MONITORING

http://www.flickr.com/photos/smieyetracking/5609671098/

Technical metrics

• CPU use

• Memory use

• TPS

• Response times

• etc.

Process metrics

• # bugs

• MTTR, MTTD

• Time from idea to live on site

• etc.

Business metrics

• Revenue

• # unique visitors

• etc.

MEASURE IMPACT OF CHANGES

tail -f access_log | alstat.pl -i10 -n10 -stt

Hits  Hits%   TPS  AvgTmTk  TTmTk%  AvgRSize  RSize%
2013-06-04 19:37:40 (08)
  14   0.1%   1.4    1.652    5.7%      2691    0.2%  POST 200 /login.do
  14   0.1%   1.4    0.918    3.2%      3739    0.3%  GET  200 /home.do
  14   0.1%   1.4    0.879    3.1%      3185    0.2%  POST 200 /order.do
   7   0.1%   0.7    0.807    1.4%      1974    0.1%  POST 200 /account.do
   4   0.0%   0.4    0.735    0.7%      3228    0.1%  GET  200 /products.do
   5   0.0%   0.5    0.697    0.9%       969    0.0%  POST 200 /settings.do
   9   0.1%   0.9    0.687    1.5%      1827    0.1%  POST 200 /changeorder.do
  27   0.2%   2.7    0.649    4.3%      2997    0.4%  POST 200 /newpasswd.do
  15   0.1%   1.5    0.580    2.2%      2488    0.2%  GET  200 /offer.do
  95   0.9%   9.5    0.520   12.2%      4801    2.3%  GET  200 /search.do
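The same kind of per-request summary can be sketched in a few lines. This is not how alstat.pl is implemented; it is an illustration that assumes the access-log entries for one interval have already been parsed into `(method, status, url, seconds)` tuples, and all names and sample values are made up.

```python
from collections import defaultdict


def summarise(entries, interval_s):
    """Aggregate one interval of parsed access-log entries per (method, status, url)."""
    totals = defaultdict(lambda: [0, 0.0])  # key -> [hits, total seconds]
    for method, status, url, seconds in entries:
        stats = totals[(method, status, url)]
        stats[0] += 1
        stats[1] += seconds
    rows = [
        {"request": key, "hits": hits,
         "tps": hits / interval_s, "avg_time": total / hits}
        for key, (hits, total) in totals.items()
    ]
    # Slowest requests first, so regressions after a deployment stand out.
    return sorted(rows, key=lambda row: row["avg_time"], reverse=True)


# Hypothetical entries collected over a 10-second interval.
entries = [
    ("POST", 200, "/login.do", 1.5),
    ("POST", 200, "/login.do", 2.5),
    ("GET", 200, "/search.do", 0.5),
    ("GET", 200, "/search.do", 0.5),
    ("GET", 200, "/search.do", 0.5),
    ("GET", 200, "/search.do", 0.5),
]
rows = summarise(entries, interval_s=10)
```

Comparing these rows just before and just after a deployment gives a quick view of which requests became slower.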

MEASURE LATENCY

Avg. response times front end vs backend

Number of calls

SMALL DEPLOYMENTS

http://www.flickr.com/photos/rbulmahn/4925464931/

GO/NO-GO MEETINGS

• What are the biggest fears?

• How can we measure this?

• What can be done if it does happen?

RETROSPECTIVES

How can we prevent a failure from happening again?

How can we detect it earlier?

Was there only one root cause?

http://www.flickr.com/photos/katerha/8380451137

INTRODUCE OUTAGES

Chaos monkey

Game day exercises

http://www.flickr.com/photos/frostnova/440551442/

CULTURE

• Dev and Ops work together on providing information.

• Assumptions are dangerous; try to eliminate as many as possible.

• Small changes are easier to fix than large ones.

• Deploy during office hours so everyone is available in case problems occur.

• All information, including business metrics, should be accessible to everyone.

CLAMS

Culture

Lean

Automation

Measurement

Sharing

SIMPLE, FLEXIBLE ARCHITECTURE

• If the site goes down often, its architecture is probably at fault

• Avoid fragile systems

• Resilience is key

• Scalable (redundancy is not waste)

• Prefer many small systems to a few large ones

• State is a “hot brick”

CHANGES FOR THE BUSINESS

• Accept pushing smaller changes.

• Continuous delivery vs continuous deployment.

• Share data.

CONCLUSION

Work on your ability to respond to failure. Trying to prevent failure can slow you down and make you focus on the wrong things.

Keep assumptions clearly separated from facts. Make your decisions based on evidence.

Measure everything, including the impact of changes to the business.

Look for your own compromise: try permanent load testing first and learn from that.

QUESTIONS?

[email protected]

@a32an

www.xebia.com

blog.xebia.com

(we’re hiring)
