WAYS TO MINIMISE PERFORMANCE RISKS IN CONTINUOUS DELIVERY
Adriaan Thomas, 4 June 2013
INTRODUCTION
OBJECTIVE
Put working software into production as quickly as possible, whilst minimising the risk of load-related problems:
• Bad response times
• Lack of capacity
• Availability too low
• Excessive system resource use
Within the context of websites.
TRADITIONAL APPROACH
Load testing through simulation
http://www.flickr.com/photos/danramarch/4423023837
DECIDE WHAT TO TEST
• Focus on busiest instant
• Model most-hit functionality
• Extrapolate to expected load
• Look at production traffic
• Or attempt educated guess
DECIDE ON SCOPE
Component test
Chain test
Full environment test
• Test coverage
• Level of certainty
• Number of systems
• Amount of work
SET UP TEST DATA
• Usually starts as a copy from production
• Or an educated guess at what people will enter
• Render anonymous
• Make tests deterministic
• Synchronise between all systems
http://www.flickr.com/photos/22168167@N00/3889737939/
DECIDE ON STRATEGY
One or more of:
•Scalability test
•Stress test
•Endurance test
•Regression test
•Resilience test
http://www.flickr.com/photos/timjoyfamily/5935279962/
DECIDE ON TEST DURATION
(which is tricky)
http://www.flickr.com/photos/wwarby/3297205226
PROVIDE HARDWARE
http://www.flickr.com/photos/s_w_ellis/2681151694/
Copy of production?
Only one copy?
Virtualisation?
Sharing between teams?
INTEGRATE INTO PIPELINE
Unit test: very fast
Functional integration test: fast
Load test: takes longer
INTEGRATE INTO PIPELINE
Unit test
Functional integration test
Load test
Very fast Takes longer
PERMANENT LOAD TESTING
Daytime: constant load, teams inspect impact of changes
Nighttime: Endurance test
Weekends: refresh test data
http://www.flickr.com/photos/renaissancechambara/5106171956/
RESPONSE TIME
• DNS lookup (www.xebia.com)
• Time to first byte + loading HTML
• Time to render
• Time to document complete
• Browser CPU use
• Bandwidth
• # connections to a single host
• SSL handshake
• Parse times
• Blocking client code
http://www.webpagetest.org/result/130522_FG_10SC/1/details/
CLEAR REQUIREMENTS
Response time
Fail: 10   Now: 3.5   Goal: 1
Intention: Users get a response quickly so that they are happy and spend more money.
Stakeholder: Marketing dept.
Scale: 95th percentile of “document complete” response times, in seconds, measured over one minute.
Metric: Page load times as reported by our RUM tool.
Inspired by Tom Gilb, Competitive Engineering
ADJUST REQUIREMENTS DUE TO LACK OF REAL BROWSERS
• WebPageTest: first view + repeat view (median of 3)
• 95th percentile response times from access logs
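The access-log measurement can be sketched in Python. This is a minimal illustration, not the tooling used in the talk: the function names, the nearest-rank percentile method, and the reading of Fail and Goal as thresholds in seconds are all assumptions.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, which avoids floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def judge(p95_seconds, fail=10.0, goal=1.0):
    """Compare a 95th percentile against assumed Fail and Goal levels."""
    if p95_seconds >= fail:
        return "fail"
    if p95_seconds <= goal:
        return "goal met"
    return "between fail and goal"


# Hypothetical response times (seconds) collected over one minute.
response_times = [0.4, 0.5, 0.7, 0.9, 1.1, 1.3, 2.0, 2.8, 3.1, 3.5]
p95 = percentile(response_times, 95)
print(p95, judge(p95))
```

Nearest rank always returns a value that was actually observed, which keeps the figure easy to trace back to a log line.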
Playground to test changes
No impact on real users
Less pressure
More work
Guesswork and extrapolation
Can take a significant amount of time
More hardware
THINGS WILL BREAK...
... in spite of your best efforts
http://www.flickr.com/photos/jmarty/1239950166/
SO INSTEAD WE SHOULD FOCUS ON FAST RECOVERY
http://www.flickr.com/photos/19107136@N02/8386567228/
“MTTR is more important than MTBF*”
John Allspaw
* for most types of F
[Chart: 99th percentile response time (s), axis from 0 to 2.0, plotted against test duration]
MTBF LEADS TO FUD
[Timeline diagram]
Time →
Phases: TTD, find cause (RCA), write & test fix, build, deploy, validate
Sub-steps shown: compile, deploy & test
TTR
Contributing factors, in the order shown:
• Monitoring, alerts
• Skills, organisation, culture, maintainability, simple architecture
• Fast workstations, good tooling, able to quickly test locally, automation
• Fast build server, efficient tests
• Monitoring, automation, flexible architecture
DEMING FEEDBACK LOOPS
Plan
Do
Study
Act
OODA LOOPS
Observe
Orient
Decide
Act
AVOID TEST-ONLY MEASUREMENTS
SIMPLE ARCHITECTURE
THE ONLY THING THAT MATTERS IS WHAT HAPPENS IN PRODUCTION
Everything else is an assumption.
DEPLOYING CHANGES
http://www.flickr.com/photos/39463459@N08/5083733600
BLUE-GREEN DEPLOYMENTS
[Diagram: Amazon Route 53 in front of two Elastic Load Balancers, each with its own instances; one set runs version n, the other version n+1]
DARK LAUNCHING
[Diagram, step 1: Web page, DB]
[Diagram, step 2: Web page, DB, Weather SP]
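A dark launch of the weather service provider from the diagram could look roughly like the Python sketch below. All names (`render_page`, `load_from_db`, `dark_call`) are hypothetical, and the weather call is made inline for brevity; a real site would more likely make it asynchronously so that it cannot slow the page down.

```python
import time


def render_page(load_from_db, dark_call=None, timings=None):
    """Build the page from the existing backend; exercise the new one in the dark."""
    page = load_from_db()
    if dark_call is not None:
        started = time.perf_counter()
        try:
            dark_call()          # result is deliberately discarded
            outcome = "ok"
        except Exception:
            outcome = "error"    # failures must never reach the user
        if timings is not None:
            timings.append((outcome, time.perf_counter() - started))
    return page


def broken_weather_service():
    raise RuntimeError("weather SP unavailable")


timings = []
print(render_page(lambda: "home page", broken_weather_service, timings))
print(timings[0][0])
```

The page content is the same whether the new dependency works or not, while the recorded timings show how it behaves under production traffic.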
FEATURE TOGGLES
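A feature toggle can be as small as a named switch checked at runtime. The sketch below is an assumption-laden illustration: the `FeatureToggles` class and the `weather_widget` toggle name are invented for this example, and a real system would typically load toggle state from configuration.

```python
class FeatureToggles:
    """In-memory set of enabled feature names (hypothetical helper)."""

    def __init__(self, enabled=()):
        self._enabled = set(enabled)

    def enable(self, name):
        self._enabled.add(name)

    def disable(self, name):
        self._enabled.discard(name)

    def is_on(self, name):
        return name in self._enabled


def home_page_sections(toggles):
    """Deployed code contains the new section; the toggle decides if it is shown."""
    sections = ["products"]
    if toggles.is_on("weather_widget"):
        sections.append("weather")
    return sections


toggles = FeatureToggles()
print(home_page_sections(toggles))
toggles.enable("weather_widget")
print(home_page_sections(toggles))
```

Switching the toggle off again is a fast way to recover without a new deployment.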
CANARY RELEASING
0% 100%
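One way to move gradually from 0% to 100% is to assign each user to a stable bucket and compare it with the current canary percentage. This is a sketch under assumptions: the function names are hypothetical and hashing a user id is only one of several possible routing keys.

```python
import hashlib


def bucket(user_id):
    """Map a user id to a stable bucket from 0 to 99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def serves_canary(user_id, canary_percent):
    """True if this user should get the new version at the given percentage."""
    return bucket(user_id) < canary_percent


users = [f"user{i}" for i in range(1000)]
for percent in (0, 10, 50, 100):
    share = sum(serves_canary(u, percent) for u in users)
    print(percent, share)
```

Because the bucket is derived from the id, a user who sees the new version at 10% keeps seeing it as the percentage grows.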
PRODUCTION-IMMUNE SYSTEMS
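The idea of a production immune system, as commonly described, is that a change is reverted automatically when a monitored metric degrades. The Python sketch below illustrates that idea only; `guarded_deploy`, its callbacks, and the 20% tolerance are assumptions, not something taken from the slides.

```python
def guarded_deploy(deploy, roll_back, measure_p95, baseline_p95, tolerance=0.2):
    """Deploy, measure, and roll back if response times got clearly worse."""
    deploy()
    current = measure_p95()
    if current > baseline_p95 * (1 + tolerance):
        roll_back()
        return "rolled back"
    return "kept"


events = []
result = guarded_deploy(
    deploy=lambda: events.append("deploy"),
    roll_back=lambda: events.append("roll back"),
    measure_p95=lambda: 5.0,   # hypothetical measurement after the deploy
    baseline_p95=3.5,
)
print(result, events)
```

The same check works for business metrics such as revenue, with the comparison reversed.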
CONTROLLED LOAD TESTING
[Diagram components: Amazon Route 53, Elastic Load Balancer, three instances, RDS DB instance, RDS DB instance read replica]
MONITORING
http://www.flickr.com/photos/smieyetracking/5609671098/
MONITORING
Technical metrics
• CPU use
• Memory use
• TPS
• Response times
• etc
Process metrics
• # bugs
• MTTR, MTTD
• Time from idea to live on site
• etc
Business metrics
• Revenue
• # unique visitors
• etc
MEASURE IMPACT OF CHANGES
tail -f access_log | alstat.pl -i10 -n10 -stt
Hits  Hits%   TPS  AvgTmTk  TTmTk%  AvgRSize  RSize%  2013-06-04 19:37:40 (08)
  14   0.1%   1.4    1.652    5.7%      2691    0.2%  POST 200 /login.do
  14   0.1%   1.4    0.918    3.2%      3739    0.3%  GET 200 /home.do
  14   0.1%   1.4    0.879    3.1%      3185    0.2%  POST 200 /order.do
   7   0.1%   0.7    0.807    1.4%      1974    0.1%  POST 200 /account.do
   4   0.0%   0.4    0.735    0.7%      3228    0.1%  GET 200 /products.do
   5   0.0%   0.5    0.697    0.9%       969    0.0%  POST 200 /settings.do
   9   0.1%   0.9    0.687    1.5%      1827    0.1%  POST 200 /changeorder.do
  27   0.2%   2.7    0.649    4.3%      2997    0.4%  POST 200 /newpasswd.do
  15   0.1%   1.5    0.580    2.2%      2488    0.2%  GET 200 /offer.do
  95   0.9%   9.5    0.520   12.2%      4801    2.3%  GET 200 /search.do
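The kind of per-URL summary shown above can be approximated in a few lines of Python. This sketch does not reproduce alstat.pl: the simplified input format (`METHOD STATUS PATH SECONDS BYTES`), the function name, and the 10-second window are assumptions, although a 10-second window is consistent with the Hits and TPS columns above (14 hits, 1.4 TPS).

```python
from collections import defaultdict


def summarise(lines, interval_seconds=10, top=10):
    """Aggregate hits, TPS, average time and average size per request type."""
    totals = defaultdict(lambda: [0, 0.0, 0])  # hits, seconds, bytes
    for line in lines:
        method, status, path, seconds, size = line.split()
        entry = totals[f"{method} {status} {path}"]
        entry[0] += 1
        entry[1] += float(seconds)
        entry[2] += int(size)
    rows = [
        {
            "request": request,
            "hits": hits,
            "tps": hits / interval_seconds,
            "avg_time": total_seconds / hits,
            "avg_size": total_bytes / hits,
        }
        for request, (hits, total_seconds, total_bytes) in totals.items()
    ]
    # Slowest request types first.
    rows.sort(key=lambda row: row["avg_time"], reverse=True)
    return rows[:top]


# Hypothetical log lines for one 10-second window.
window = [
    "POST 200 /login.do 1.5 2000",
    "POST 200 /login.do 2.5 3000",
    "GET 200 /search.do 0.5 4000",
]
for row in summarise(window):
    print(row)
```

Running the same summary before and after a deployment gives a quick view of which request types a change made slower.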
MEASURE LATENCY
Avg. response times front end vs backend
Number of calls
SMALL DEPLOYMENTS
http://www.flickr.com/photos/rbulmahn/4925464931/
GO/NO-GO MEETINGS
• What are the biggest fears?
• How can we measure this?
• What can be done if it does happen?
RETROSPECTIVES
How can we prevent a failure from happening again?
How can we detect it earlier?
Was there only one root cause?
http://www.flickr.com/photos/katerha/8380451137
INTRODUCE OUTAGES
Chaos monkey
Game day exercises
http://www.flickr.com/photos/frostnova/440551442/
CULTURE
• Dev and Ops work together on providing information.
• Assumptions are dangerous; try to eliminate as many as possible.
• Small changes are easier to fix than large ones.
• Deploy during office hours so everyone is available if problems occur.
• All information, including business metrics, should be accessible to everyone.
CLAMS
Culture
Lean
Automation
Measurement
Sharing
SIMPLE, FLEXIBLE ARCHITECTURE
• If the site goes down often, its architecture is probably at fault
• Avoid fragile systems
• Resilience is key
• Scalable (redundancy is not waste)
• Prefer many small systems over a few large ones
• State is a “hot brick”
CHANGES FOR THE BUSINESS
• Accept pushing smaller changes.
• Continuous delivery vs continuous deployment.
• Share data.
CONCLUSION
Work on your ability to respond to failure. Trying to prevent failure can slow you down and make you focus on the wrong things.
Keep assumptions clearly separated from facts. Make your decisions based on evidence.
Measure everything, including the impact of changes to the business.
Look for your own compromise: try permanent load testing first and learn from that.