Join the conversation #devseccon
From Resilient to Antifragile - Chaos Engineering Primer
By @Sergiu_Bodiu @PivotalPlatform Architect Asia Pacific & Japan
Singapore Spring User Group
DevOpsDays Singapore Conference
Imagine having an idea in the morning and shipping in the evening.
@Sergiu_Bodiu
A new way to look at organizations
Fragile: At risk of total failure / financial ruin
Resilient: Takes damage, avoids total failure, recovers
Robust: Absorbs uncertainty, repels blows, avoids damage
Antifragile: Responds to stress by mutating, maintains fitness for purpose. Identity change.
Distributed Systems Complexity
Complexity is like addiction… It comes on slowly, forming weak bonds that you can barely feel. But as it continues, the bonds strengthen quietly until they calcify and become hard to break.
Software is a Single Point of Failure
Instapaper: outage cause and recovery after 31 hours.
gitlab.com was down for about 18 hours and also lost production data: roughly 5,000 projects, 5,000 comments and 700 new user accounts.
Root Cause Analysis: Component failures such as NETWORK, STORAGE, SERVER, HARDWARE, and POWER failures are anticipated and thus guarded with extra redundancies; software is not.
Dejirafication
Alexey Krivitsky: https://www.slideshare.net/krivitsky/dejirafication-clean-your-process
Chaos Engineering
Chaos Engineering: Discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production.
http://principlesofchaos.org
Backups
"Backups always succeed. It's the restores that fail. Test your backups by practicing restores!"
Using Chaos Monkey
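One way to practice restores is a scripted drill: restore the archive into a scratch directory and compare file checksums against the live data. The sketch below is a minimal illustration using only the Python standard library. The function names (`backup`, `restore_drill`, `checksums`) are made up for this example, and it assumes the backup is a gzip-compressed tar archive stored outside the source directory.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def checksums(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def backup(source: Path, archive: Path) -> None:
    """Write a gzip-compressed tar archive of the source directory."""
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname=".")


def restore_drill(source: Path, archive: Path) -> bool:
    """Restore the archive into a scratch directory and compare it with the source."""
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(scratch)
        return checksums(Path(scratch)) == checksums(source)
```

A drill that returns `False` means the restored files differ from the source, either because the restore is broken or because the data changed after the backup was taken. Either way it is worth finding out before a real incident.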
Some outages in the Region
SingTel fined a record $6m for Bukit Panjang exchange fire
Telstra goes down again, people can't drink beer or catch Ubers
Amazon Web Services outage causes Australian website chaos
Netflix Simian Army
The Simian Army is a suite of tools for keeping your cloud operating in top form.
https://github.com/Netflix/SimianArmy
Chaos Monkey
• Active during normal working hours
• Break things in production
• Design better software services
• Embracing failure
http://techblog.netflix.com/2016/10/netflix-chaos-monkey-upgraded.html
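The core idea can be sketched in a few lines: pick one instance at random and terminate it, but only while engineers are at their desks to respond. This is not Chaos Monkey's actual code or API. The `terminate` callback, the instance identifiers, and the Monday to Friday 09:00 to 17:00 window are all assumptions made for illustration.

```python
import random
from datetime import datetime
from typing import Callable, Optional, Sequence


def in_working_hours(now: datetime, start_hour: int = 9, end_hour: int = 17) -> bool:
    """True on Monday to Friday between start_hour (inclusive) and end_hour (exclusive)."""
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def unleash(
    instances: Sequence[str],
    terminate: Callable[[str], None],
    now: datetime,
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Terminate one randomly chosen instance, but only during working hours."""
    if not instances or not in_working_hours(now):
        return None
    victim = (rng or random).choice(instances)
    terminate(victim)  # hypothetical callback; a real one would call the cloud provider
    return victim
```

Passing `now` and `rng` in as arguments keeps the sketch deterministic under test; a scheduled job would pass `datetime.now()`.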
Other Monkeys
• Latency Monkey
• Janitor Monkey
• Conformity Monkey
• Security Monkey
• Doctor Monkey
Principles of Chaos
1. Build a Hypothesis around Steady State Behavior
2. Vary Real-world Events
3. Run Experiments in Production
4. Automate Experiments to Run Continuously
TIP: Intentionally break things, compare measured with expected impact, and correct any problems uncovered this way.
Chaos Engineering Whitepaper 2016
Hypothesize
Build a hypothesis around steady state behavior. Use steady state characterizations that are visible at the boundary of the system and that directly capture an interaction between the users and the system.
TIP: Utilisation is Virtually Useless as a Metric!
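A steady-state hypothesis can be written down as a check on a user-facing metric rather than on CPU or memory utilisation. The sketch below assumes the metric is the success rate of user requests, measured for a control group and an experiment group, and that a drop of more than one percentage point falsifies the hypothesis. Both the metric and the tolerance are illustrative choices, not values from the talk.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """User-facing request counts observed at the system boundary."""
    requests: int
    successes: int

    @property
    def success_rate(self) -> float:
        # An empty window counts as 0.0 so that missing data never looks healthy.
        return self.successes / self.requests if self.requests else 0.0


def hypothesis_holds(control: Window, experiment: Window, tolerance: float = 0.01) -> bool:
    """Steady state holds if the experiment group's success rate is within tolerance of control."""
    return control.success_rate - experiment.success_rate <= tolerance
```

If `hypothesis_holds` returns `False` while a fault is being injected, the experiment has uncovered a weakness worth fixing.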
Vary Events
• terminate virtual machine instances
• inject latency into requests between services
• fail requests between services
• fail an internal microservice
• make an entire region unavailable
TIP: Select only a subset of users
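Two of the events above, injected latency and failed requests, can be combined with the tip about limiting the blast radius. The following is a minimal sketch under stated assumptions: users are assigned to the experiment group by hashing their ID into 100 buckets, and `call_with_chaos`, `in_experiment`, and `InjectedFailure` are hypothetical names for this example rather than part of any existing tool.

```python
import hashlib
import time
from typing import Callable


class InjectedFailure(Exception):
    """Raised in place of a real downstream error."""


def in_experiment(user_id: str, percent: int) -> bool:
    """Deterministically place roughly `percent`% of users in the experiment group."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < percent


def call_with_chaos(
    user_id: str,
    call: Callable[[], str],
    percent: int = 5,
    latency_s: float = 0.0,
    fail: bool = False,
) -> str:
    """Invoke `call`, adding latency or a failure only for users in the experiment group."""
    if in_experiment(user_id, percent):
        if latency_s:
            time.sleep(latency_s)  # simulate a slow dependency
        if fail:
            raise InjectedFailure(f"injected failure for {user_id}")
    return call()
```

Hashing the user ID keeps each user consistently inside or outside the experiment, so the two groups can be compared over the whole run.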
Experiment
92% of catastrophic system failures were the result of incorrect handling of nonfatal errors. It is simply not possible to fully reproduce the entire architecture and run an end-to-end test.
TIP: Customers don't behave as your JMeter script.
https://www.usenix.org/system/files/conference/osdi14/osdi14paperyuan.pdf
Automate
• Distributed systems change continuously over time.
• Engineers modify the behavior of existing services and add new services.
• Engineers change runtime configuration parameters, and upgrade and patch systems.
TIP: Depending on the context, change the rate of each experiment.
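Running experiments continuously at different rates amounts to giving each experiment its own interval. The sketch below is one possible shape for such a scheduler. The experiment names and intervals in the usage note are invented for illustration, and a real setup would more likely hang off an existing job scheduler.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Experiment:
    name: str
    interval_s: float                  # how often this experiment should run
    last_run: float = float("-inf")    # never run yet


def due(experiments: List[Experiment], now: float) -> List[str]:
    """Return the names of experiments whose interval has elapsed, and mark them as run."""
    ran = []
    for exp in experiments:
        if now - exp.last_run >= exp.interval_s:
            exp.last_run = now
            ran.append(exp.name)
    return ran
```

For example, a cheap experiment such as terminating a single instance could use `interval_s=3600`, while a disruptive one such as failing over a region could use `interval_s=86400`.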
Chaos Engineering Whitepaper 2016
Build a hypothesis around steady state behavior. Vary real world events. Run experiments in production. Automate experiments to run continuously.
The importance of reliability
Don't trust claims systems make about themselves & their dependencies. Verify by breaking.
Locust demo
Locust is an open-source Python load testing framework.
• Define user behaviour in code
• Can execute end-to-end user tests with sessions and cookies
• Expands to multiple slaves to increase load capacity
• Allows for distributed user paths based on percentages
Gatling is an open-source Scala load testing framework.
• High performance
• Ready-to-present HTML reports
• Scenario recorder and developer-friendly DSL
Lessons Learned
• Don't wait so long to start load testing.
• The conversations drive new requirements.
• Changing architecture last minute is extremely dangerous.
• This is incredibly hard under pressure.
• Build relations with the Networking Team, Database Team, Third Party Partners, Vendors etc.
• Make everything Asynchronous (Embrace Failure, Background Tasks, Retry, Idempotence)
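Retry and idempotence belong together: a retry is only safe if repeating the request cannot apply its effect twice. The sketch below illustrates the pairing with a toy in-memory service that deduplicates by idempotency key and a retry helper with exponential backoff. `PaymentService`, the receipt format, and the backoff values are assumptions for this example only.

```python
import time
from typing import Callable, Dict


class PaymentService:
    """Toy in-memory service that deduplicates requests by idempotency key."""

    def __init__(self) -> None:
        self.processed: Dict[str, str] = {}
        self.charges = 0

    def charge(self, key: str, amount: int) -> str:
        if key in self.processed:      # replayed request: return the stored result
            return self.processed[key]
        self.charges += 1
        receipt = f"receipt-{self.charges}-{amount}"
        self.processed[key] = receipt
        return receipt


def retry(call: Callable[[], str], attempts: int = 3, base_delay_s: float = 0.01) -> str:
    """Retry `call` on ConnectionError with exponential backoff, re-raising after the last attempt."""
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay_s * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

If the first response is lost after the charge has already been applied, the retried call with the same key returns the stored receipt instead of charging again.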
Chaos Lemur demo
Chaos Lemur is an alternative to Chaos Monkey (which was designed for AWS) that was designed with PCF in mind.
Clean your process
From a lean thinking perspective: managing the inventory is a non-value-adding activity
Culture > Principles > Tools
Testing Pyramid
https://watirmelon.blog/2012/01/31/introducing-the-software-testing-ice-cream-cone/
Principles
https://12factor.net: Any developer building applications which run as a service. Ops engineers who deploy or manage such applications.
http://www.10factor.ci: Anyone working in software that writes tests or maintains continuous integration pipelines.
Agile Manifesto
The Agile movement is not anti-methodology, in fact, many of us want to restore credibility to the word methodology.
Now, a bigger gathering of organizational anarchists would be hard to find, so what emerged from this meeting was symbolic.
Further Reading
https://www.infoq.com/br/presentations/exercising-failure-at-netflix
https://www.infoq.com/podcasts/failure-as-a-service
Peter Alvaro: Orchestrated Chaos: Applying Failure Testing Research at Scale
Mathias Lafeldt: Writing your first postmortem
Adrian Colyer: Simple Testing Can Prevent Most Critical Failures
Blueprint for living in a Black Swan world.
Antifragile, and only the antifragile, will make it.