The Challenges of Operating a Computing Cloud and Charging for its Use
Marvin Theimer
VP/Distinguished Engineer
Amazon Web Services
© 2017, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Customers Want It All
• Lots of features and all the “ilities”
• Pay as little as possible
• Get it as soon as possible
Trade-offs Must Be Made
• Inherent tension between customers’ desires
• MUST work backwards from the customer
• It’s not always obvious what each customer really wants
Scaling Challenges
• A big compute cloud has at least a million physical servers world-wide
• Amazon S3 stores trillions of objects, contains exabytes of data, and fields millions of requests/second
• A service-oriented architecture (SOA) implies there are many services
  • Amazon has tens of thousands of services
  • A big service like Amazon S3 may require tens of thousands of servers
You Need Availability Too
• Example definition of availability:
“The number of 5 minute intervals during which the ratio of error returns (http 500’s) to total system requests is less than 5% over the total number of 5 minute intervals.”
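The definition above can be sketched in a few lines of Python. This is only an illustration: the function name, the input shape (one `(error_count, total_requests)` pair per 5-minute window), and the choice to count empty windows as good are assumptions, not part of the talk.

```python
from typing import Iterable, Tuple


def interval_availability(
    intervals: Iterable[Tuple[int, int]],
    error_threshold: float = 0.05,
) -> float:
    """Fraction of intervals whose error ratio is below the threshold.

    Each element of `intervals` is (error_count, total_requests) for one
    5-minute window. Windows with no requests are counted as good; that is
    an assumption of this sketch, not something the definition states.
    """
    good = 0
    total = 0
    for errors, requests in intervals:
        total += 1
        if requests == 0 or errors / requests < error_threshold:
            good += 1
    if total == 0:
        raise ValueError("need at least one interval")
    return good / total


# Four windows: the third has a 10% error ratio and counts as bad.
windows = [(0, 1000), (10, 1000), (100, 1000), (0, 0)]
print(interval_availability(windows))  # 0.75
```

Note that a window with 4.9% errors still counts as fully "available" under this definition, which is one reason the later slides look at what individual customers see.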
Levels of Availability
• Availability: Amount of downtime per year
• 99.8%: 17.5 hours
• 99.9% (3x9’s): 8.8 hours
• 99.99% (4x9’s): 52.6 minutes
• 99.999% (5x9’s): 5.26 minutes
• 99.9999% (6x9’s): 31.5 seconds
• 99.99999% (7x9’s): 3.15 seconds
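The downtime figures above follow from simple arithmetic: the allowed downtime is the unavailable fraction multiplied by the length of a year. A minimal sketch, assuming a 365-day year (the function name is illustrative):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Yearly downtime budget, in seconds, for a given availability level."""
    return (1 - availability_percent / 100) * SECONDS_PER_YEAR


for pct in (99.8, 99.9, 99.99, 99.999, 99.9999, 99.99999):
    secs = downtime_seconds_per_year(pct)
    print(f"{pct}%: {secs / 3600:.2f} h = {secs / 60:.2f} min = {secs:.2f} s")
```

Each additional nine divides the budget by ten, which is why the operational implications change so sharply from one level to the next.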
Some Implications of Various Levels of Availability
• 99.8%: 17.5 hours
  • You might cripple your business (e.g. Intuit on Apr 15th)
• 3x9’s: 8.8 hours
  • You can afford to do occasional small scheduled downtimes
• 4x9’s: 52.6 minutes
  • Can’t do scheduled downtimes of any significance
  • Paged human has about 30-40 minutes to correct/restart things
Some Implications of Various Levels of Availability
• 5x9’s: 5.26 minutes
  • Paged human won’t be on-line before you’ve exceeded your yearly SLA
  • De-facto need fully automated failure response system
  • Humans can only be involved with longer-term trends management
• 6x9’s: 31.5 seconds
  • Have to redefine what you mean by availability (5 min. intervals too coarse)
  • In the range of throttling delays
• 7x9’s: 3.15 seconds
  • Below the practical threshold for distributed leased locks
Reality Bites
• Developers are fallible
• Cloud services evolve quickly
• Near-perfect automation/fault-tolerance is expensive
• Current state-of-the-art requires humans in the loop to deal with unforeseen circumstances
• You can build 6x9’s available services, but it may not represent the right cost/benefit trade-off for most customers
You Need Logging
• Volume of log traffic is measured in TB/hour
  • For a large service the log volume is still of that order of magnitude
• Can’t just grep it: you need a full-blown search capability
  • Richer queries imply even more technology
• Need for timely answers pushes you towards near-real-time support
• It’s all technologically feasible
  • It just costs a lot
  • How much cost is justifiable?
You Need Metrics as Well as Logging
• Log queries are for debugging
• To determine whether a service/system is behaving properly you need metrics
• Ideally you track “everything”
  • Far too expensive
    • Cost of gathering
    • Attention cost
• You have to figure out the metrics “working set” you need
  • What are your “leading” metrics?
Metrics Challenges
• The working set may change in unobvious ways as your service – or its workloads – evolve
• Important to have “tripwire” metrics
• Important to have automated alarms
• Also need to have alarm deduplication/squelching
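One simple form of alarm deduplication is to suppress repeats of the same alarm for a quiet period after it first pages someone. The sketch below assumes that policy; the class name, the alarm keys, and the 300-second default are hypothetical, and production systems typically add grouping and escalation on top.

```python
from typing import Dict


class AlarmSquelcher:
    """Suppress repeats of the same alarm within a quiet period."""

    def __init__(self, quiet_seconds: float = 300.0) -> None:
        self.quiet_seconds = quiet_seconds
        # Time at which each alarm key last paged someone.
        self._last_paged: Dict[str, float] = {}

    def should_page(self, alarm_key: str, now: float) -> bool:
        last = self._last_paged.get(alarm_key)
        if last is not None and now - last < self.quiet_seconds:
            return False  # duplicate inside the quiet period: squelch it
        self._last_paged[alarm_key] = now
        return True


squelcher = AlarmSquelcher(quiet_seconds=300)
print(squelcher.should_page("storage-5xx", now=0))      # True: first occurrence
print(squelcher.should_page("storage-5xx", now=60))     # False: squelched repeat
print(squelcher.should_page("storage-latency", now=60)) # True: different alarm
print(squelcher.should_page("storage-5xx", now=301))    # True: quiet period over
```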
Availability as Seen by Individual Customers
• A service can be 99.99% available and an individual customer can still have a really bad day
• Ideally, want near-real-time “top-N” metrics
• These are not cheap
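To illustrate why aggregate availability can hide one customer's bad day, here is a small batch sketch of a "top-N" metric: per-customer error rates, worst first. The function name and record format are assumptions, and a real near-real-time version would have to work on streaming data rather than an in-memory list.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def top_n_error_rates(
    records: Iterable[Tuple[str, bool]],
    n: int = 3,
) -> List[Tuple[str, float]]:
    """Return the n customers with the highest error rate.

    Each record is (customer_id, is_error) for one request.
    """
    totals: Dict[str, int] = defaultdict(int)
    errors: Dict[str, int] = defaultdict(int)
    for customer, is_error in records:
        totals[customer] += 1
        if is_error:
            errors[customer] += 1
    rates = [(c, errors[c] / totals[c]) for c in totals]
    rates.sort(key=lambda item: item[1], reverse=True)
    return rates[:n]


# Hypothetical traffic: overall error rate is 0.1%, but every request
# from the customer "small" fails.
records = [("big", False)] * 9990 + [("small", True)] * 10
print(top_n_error_rates(records, n=2))  # [('small', 1.0), ('big', 0.0)]
```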
Latency as Seen by Individual Customers
• One definition of a latency SLA:
“The number of 5 minute intervals during which the ratio of returns with latency higher than the latency SLA to total system requests is less than 5% over the total number of 5 minute intervals.”
• Same “bad day” problem exists for latency as for availability
• Need to monitor p90, p99, p99.99, and even p100
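A short sketch shows why the higher percentiles matter. It uses the nearest-rank percentile method, which is one of several common definitions and is chosen here as an assumption; the latency data is made up.

```python
import math
from typing import Sequence


def percentile(sorted_values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile of an already sorted sequence, p in (0, 100]."""
    if not sorted_values:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    # Smallest value with at least p% of the samples at or below it.
    rank = math.ceil(p * len(sorted_values) / 100)
    return sorted_values[rank - 1]


# Hypothetical data: 995 fast requests and 5 very slow ones.
latencies_ms = sorted([10.0] * 995 + [2000.0] * 5)
for p in (90, 99, 100):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
```

With this data, p90 and p99 both report 10 ms; only p100 exposes the 2000 ms requests that a handful of callers actually experienced.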
Developing, Testing, Deploying, Operating at Scale
• Amazon Web Services (AWS) launched O(1000) features last year. Customers are impatient for more
• The variety of workloads and exception scenarios (failures, distributed denial-of-service attacks, customer load spikes, etc.) is huge
• Increasing emphasis/demand for platform-wide features, such as tagging, policy enforcement, etc.
Things can change out from under you quickly
• Sudden load quantum leaps
  • Capacity challenge
  • Ramp-up challenge
• New features that have unintended scaling side effects
  • New feature in one place may accelerate the rate of load growth in another
• Non-linear effects
• Unintended consequences due to unexpected uses
Testing is Crucial
• Avoiding the death spiral of mean-time-to-failure > mean-time-to-repair
• Testing: the only “truth” you have is what you test regularly
  • Regression tests
  • Scaling/performance tests
  • Fault tolerance tests
• The importance of testing to failure
  • Load testing to the breaking point along all relevant dimensions
  • Chaos monkeys
  • LSE tests (chaos armies and game days)
You Can’t Anticipate Everything
• Need rolling deployments
• Need (automated) rollback capability
• Root-cause analysis is challenging when “everything” is constantly in flux
• Use CI/CD: lots of small, incremental changes are easier to deal with than a few “big bangs”
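A rolling deployment with automated rollback can be sketched as a loop over small batches of hosts, with a health check after each batch. Everything here is hypothetical: the `deploy`, `healthy`, and `rollback` callbacks stand in for whatever a real deployment system provides, and the simulated fleet exists only to exercise the loop.

```python
from typing import Callable, Dict, List, Sequence


def rolling_deploy(
    hosts: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    batch_size: int = 2,
) -> bool:
    """Deploy batch by batch; undo every updated host if a batch is unhealthy."""
    updated: List[str] = []
    for start in range(0, len(hosts), batch_size):
        batch = list(hosts[start:start + batch_size])
        for host in batch:
            deploy(host)
            updated.append(host)
        if not all(healthy(host) for host in batch):
            # Roll back in reverse order of deployment, then stop.
            for host in reversed(updated):
                rollback(host)
            return False
    return True


# Simulated fleet in which host "h3" fails its health check on the new version.
versions: Dict[str, str] = {h: "v1" for h in ("h1", "h2", "h3", "h4", "h5")}
ok = rolling_deploy(
    hosts=list(versions),
    deploy=lambda h: versions.__setitem__(h, "v2"),
    healthy=lambda h: not (h == "h3" and versions[h] == "v2"),
    rollback=lambda h: versions.__setitem__(h, "v1"),
)
print(ok, versions)
```

In the simulated run the second batch fails, the four hosts touched so far are returned to "v1", and "h5" is never updated, which is the blast-radius limit that small batches buy.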
Operational Readiness is Crucial
• Modeling your system
  • Security threat model
  • Failure model, including LSE analysis
• Operational readiness review (ORR) checklist
• On-call rotation
  • Primary personnel
  • Well-defined escalation paths, including to other services
• On-call runbooks
  • Have to be easy to understand and use
  • Must practice using them
Humans in the Loop
• Humans are necessary because systems are
  • Extremely complex
  • Evolve at a ferocious rate
  • Exhibit difficult-to-anticipate emergent behaviors
  • Behave in non-linear ways
• Humans are a huge problem because they are imperfect – especially at repetitive tasks
  • Multi-step standard operating procedures (SOPs) are a good source of errors
  • Ditto for cut-and-paste tasks
  • Ditto for complex, difficult-to-parse, text commands
→ Need canned procedures that have simple invocation semantics
The Tension Between Power/Efficiency and Safety
• Tools and APIs should be safe to use:
  • Projected outcome of an action should be clearly discernible
  • Ideally actions can be undone if necessary
• Safety adds friction
  • Danger of people inventing short-cuts
  • What’s the “right” amount of safety friction to impose?
• Sometimes you need a power tool that will let you do “heart surgery”
  • How often do you use it?
  • How often do you practice with it?
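The two safety properties named above, a discernible projected outcome and the ability to undo, can be illustrated with a toy configuration tool. The class and its `plan`/`apply`/`undo` methods are hypothetical names for this sketch, not an existing API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ConfigTool:
    """Toy tool that can preview a change and undo the last applied one."""

    settings: Dict[str, str]
    _undo_log: List[Tuple[str, Optional[str]]] = field(default_factory=list)

    def plan(self, key: str, value: str) -> str:
        """Describe what apply() would do, without changing anything."""
        return f"{key}: {self.settings.get(key)!r} -> {value!r}"

    def apply(self, key: str, value: str) -> None:
        # Record the previous value (None if the key was absent) for undo.
        self._undo_log.append((key, self.settings.get(key)))
        self.settings[key] = value

    def undo(self) -> None:
        if not self._undo_log:
            raise RuntimeError("nothing to undo")
        key, old = self._undo_log.pop()
        if old is None:
            self.settings.pop(key, None)
        else:
            self.settings[key] = old


tool = ConfigTool({"mtu": "1500"})
print(tool.plan("mtu", "9000"))  # mtu: '1500' -> '9000'
tool.apply("mtu", "9000")
tool.undo()
print(tool.settings)             # {'mtu': '1500'}
```

The extra `plan` step is exactly the kind of friction the slide describes: it costs a call, and the design question is whether operators will keep using it under pressure.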
“Correction of Error” Reports
• The importance of recording and propagating things learned
• Root cause analysis: “The 5 Whys”
• Example: 2011 Amazon EBS outage
  • Network misconfigured during an upgrade
  • Re-mirroring storm
  • What’s the root cause?
  • Service control plane problems → Service data plane problems → network traffic problems → network misconfiguration → difficult-to-use tools for configuring network routers
• The importance of closed-loop action mechanisms vs. good intentions
[Figure: Abstract Representation of EBS in a Region]
[Figure: Initial Failure Event]
[Figure: Follow-on Problems]
Charging for Use
• You have to build in support from the beginning (like with security)
• Have to be able to track customers’ usage along all relevant dimensions and across all backend systems and services
  • Metering volumes (at the edge) are measured in millions of records/sec and TB/hour.
• What’s the right pricing model?
  • Fully cost-following models are very complicated
  • Simpler models may have unintended consequences
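As a rough illustration of tracking usage along several dimensions and then applying a simple pricing model, the sketch below aggregates metering records per customer and dimension and charges a flat price per unit. The record fields, dimension names, and prices are all invented for the example and do not reflect any real price list.

```python
from collections import Counter
from decimal import Decimal
from typing import Dict, Iterable, NamedTuple, Tuple


class UsageRecord(NamedTuple):
    customer: str
    dimension: str  # hypothetical dimension names, e.g. "requests"
    quantity: int


def aggregate_usage(records: Iterable[UsageRecord]) -> Counter:
    """Sum metered quantities per (customer, dimension)."""
    totals: Counter = Counter()
    for record in records:
        totals[(record.customer, record.dimension)] += record.quantity
    return totals


def bill(totals: Dict[Tuple[str, str], int],
         unit_prices: Dict[str, Decimal]) -> Dict[str, Decimal]:
    """Flat per-unit pricing: one of the 'simpler models' mentioned above."""
    charges: Dict[str, Decimal] = {}
    for (customer, dimension), quantity in totals.items():
        cost = quantity * unit_prices[dimension]
        charges[customer] = charges.get(customer, Decimal("0")) + cost
    return charges


records = [
    UsageRecord("alice", "requests", 1000),
    UsageRecord("alice", "requests", 500),
    UsageRecord("alice", "gb_stored", 10),
    UsageRecord("bob", "requests", 200),
]
prices = {"requests": Decimal("0.001"), "gb_stored": Decimal("0.5")}  # made up
print(bill(aggregate_usage(records), prices))
```

`Decimal` is used instead of floats so that monetary sums do not pick up binary rounding error.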
Some Pricing Nuances
• Free tiers
  • Invitation to use
  • Also a simpler pricing model for “glue” resources
• Derivative usage
  • Resource usage enabled by other resource usage
  • Example: cheaper data ingestion leads to more compute
Limiting Mistakes and Fraud
• High elasticity enables the ability to do a lot of damage quickly
• How do you distinguish legitimate requests for more resources from mistakes and fraud?
• Have to put dynamic limits on what can be used
  • Simplest – and least customer friendly – solution is universal soft quotas
  • Can make a quota customer-specific
    • Trusted customers get higher default limits
    • Past history used as predictor of future behavior
  • Differing payment strategies
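A customer-specific quota driven by past history might look like the sketch below. The policy (doubling the limit for every six months in good standing, capped at 8x) and the default of 20 units are invented purely for illustration; the talk does not specify any particular rule.

```python
from dataclasses import dataclass

DEFAULT_QUOTA = 20     # hypothetical universal soft quota, in resource units
MAX_MULTIPLIER = 8     # hypothetical cap on how far trust can raise the quota


@dataclass
class Customer:
    name: str
    months_in_good_standing: int
    current_usage: int


def quota_for(customer: Customer) -> int:
    """Trusted customers get higher default limits (made-up policy)."""
    multiplier = min(2 ** (customer.months_in_good_standing // 6), MAX_MULTIPLIER)
    return DEFAULT_QUOTA * multiplier


def allow_request(customer: Customer, requested: int) -> bool:
    """Allow a request only if it keeps the customer within their quota."""
    return customer.current_usage + requested <= quota_for(customer)


new = Customer("new-account", months_in_good_standing=0, current_usage=15)
old = Customer("long-standing", months_in_good_standing=12, current_usage=15)
print(quota_for(new), allow_request(new, 10))  # 20 False
print(quota_for(old), allow_request(old, 50))  # 80 True
```

A denied request here would normally route to a limit-increase process rather than fail permanently, which is what makes the quota "soft".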
Summary and Conclusions
• A fundamental tension: customers want
  • rich feature set and capabilities
  • relentless cost reduction
  • everything as soon as possible
• At scale it’s all about the tail
• Testing and automation are crucial, but the human (so far) still has to be in the loop – for better and worse
• What to charge has many nuances and requires support from the beginning