SRE at Airbnb
DevOps & SRE | SRE Organization | Future of Ops
How do you combine the culture and spirit of DevOps with an operations team?
How is SRE at Airbnb organized? Cloud Infra and Reliability deep-dive.
Operators should grow, learn, and be recognized for on-call work, while maintaining pager-life balance.
Centralized Ops
Positives
Reliability can be easily prioritized
Specialization of roles
Negatives
Operators unfamiliar with codebase
Tension between operations and development
Centralized Operations Organization
Distributed Ops
Positives
Agility can be easily prioritized
Developers are incentivized to build systems that are easy to operate (since they are the operators!)
Negatives
Lack of specialization: devs are forced to relearn difficult lessons over and over
Teams speak different uptime/reliability languages to each other
Distributed Operations
Hybrid Approach
Able to "tune" a balance between reliability and agility
Developers are still expected to run normal operations for their services, which means building operable services
A centralized operations organization can build reusable tools to make operations and incident response easier
Specialization of roles without tension between operations and development teams
An organization that understands and recognizes the value in automating away its own job
Hybrid Approach: Two-Pizza Teams + SRE Team
"Fundamentally, it's what happens when you ask a software engineer to design an operations function..."
Ben Treynor, VP Engineering, Google
What makes up SRE at Airbnb?
Site Reliability Engineering is made up of three components:
Cloud Infrastructure: Manages our touchpoints with AWS and other cloud partners
Core Reliability: Develops tools and processes to improve operations, reliability, and incident response for all teams
Embedded Reliability: Temporary embedding of SREs in product teams to work on specific reliability- or availability-focused projects
Requirements for Each Integration
Monitoring
Alerting
Security Approval
Auditing
Version Upgrades
Access Control
...
Three Pillars of Reliability
Uptime Measurement: Every team at any time should be able to confidently say whether their service is working properly or not.
Alerting & Detection: Defense in depth. Our users are protected from bugs and regressions by multiple layers of opinionated alerts.
Incident Response: Engineers can coordinate across teams, investigate problems in systems they don't fully understand, and keep stakeholders up-to-date.
1. Uptime
Identify quantifiable metrics related to the health of your services (Service Level Indicators, or SLIs)
Make public and easily discoverable promises about the behavior of your service using your SLIs (Service Level Objectives, or SLOs)
Teams review their services' current SLIs and compare them to their published SLOs to make tradeoffs between reliability improvements and new features. SLOs encode the tradeoff between moving fast and breaking things (error budgets)
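The relationship between SLIs, SLOs, and error budgets can be illustrated with a minimal Python sketch. It assumes a request-based availability SLI; the service name, the 99.9% target, and the request counts are hypothetical values chosen for illustration and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    """A published promise about a service, expressed over an SLI."""
    name: str      # hypothetical service/objective name
    target: float  # e.g. 0.999 means 99.9% of requests must succeed


def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests served successfully in the window."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return good_requests / total_requests


def error_budget_remaining(slo: SLO, good_requests: int, total_requests: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    allowed_failures = (1.0 - slo.target) * total_requests
    actual_failures = total_requests - good_requests
    if allowed_failures == 0:
        return 1.0 if actual_failures == 0 else float("-inf")
    return 1.0 - actual_failures / allowed_failures


def should_prioritize_reliability(remaining: float) -> bool:
    """Once the budget is spent, reliability work takes priority over features."""
    return remaining <= 0


# Hypothetical numbers: 500 failed requests out of 1,000,000 against a 99.9% SLO.
slo = SLO(name="search-availability", target=0.999)
remaining = error_budget_remaining(slo, good_requests=999_500, total_requests=1_000_000)
print(f"{slo.name}: {remaining:.0%} of error budget remaining")
```

With these numbers the SLO allows about 1,000 failures, so roughly half of the budget is left and the team could still trade some reliability for feature velocity.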
2. Alerting
Alerting philosophy should be opinionated: engineers know what kind of alerts to write and when to write them
Alerts (like configuration) should be code
Practice defense in depth: protect your users from bugs and regressions with layers of alerts, just as a security team protects employees from being compromised with layers of defenses
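One way to picture "alerts as code" with layered coverage is the following Python sketch. It is not Airbnb's tooling: the `Alert` type, the layer labels, the metric names, and the thresholds are all hypothetical, chosen only to show alerts living in version-controlled code and several layers watching the same users.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Alert:
    """An alert definition kept in code, so it can be reviewed and versioned."""
    name: str
    layer: str        # hypothetical layer label: "host", "service", "user-facing"
    metric: str       # hypothetical metric name
    threshold: float
    direction: str    # "above" fires when value > threshold, "below" when value < threshold


# Layers of alerts: each layer can catch problems the others miss.
ALERTS = [
    Alert("high_cpu", "host", "cpu_utilization", 0.90, "above"),
    Alert("high_error_rate", "service", "error_rate", 0.01, "above"),
    Alert("low_booking_rate", "user-facing", "bookings_per_minute", 5.0, "below"),
]


def evaluate(alerts: List[Alert], metrics: Dict[str, float]) -> List[str]:
    """Return the names of alerts that fire for the given metric snapshot."""
    firing = []
    for alert in alerts:
        value = metrics.get(alert.metric)
        if value is None:
            continue  # metric not reported in this snapshot
        if alert.direction == "above" and value > alert.threshold:
            firing.append(alert.name)
        elif alert.direction == "below" and value < alert.threshold:
            firing.append(alert.name)
    return firing


# A regression that looks healthy at the host and service layers
# but is still caught by the user-facing layer.
snapshot = {"cpu_utilization": 0.50, "error_rate": 0.002, "bookings_per_minute": 2.0}
firing = evaluate(ALERTS, snapshot)
print(firing)
```

In this snapshot only the user-facing alert fires, which is the point of layering: a bug that returns successful but useless responses would pass the lower layers unnoticed.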
3. Response
Incident Reporter Tool
Mid-Incident
Engineers can effectively coordinate, even across teams
Stakeholders (upstream clients, management, employees) are kept aware of updates
Working on a Slack integration so responders can stay in chat but keep the company up-to-date
Post-Incident
Blameless postmortem process
Consistent impact measurement (management sees that better incident response and corrective actions matter to the bottom line)
Easily search past incidents/postmortems
Future of Ops
Pager-Life Balance: Ensure that more involved, tenured engineers aren't always the ones waking up at 3 AM to put out fires
Learning/Growth Focused: Continuing education and learning opportunities for on-call engineers
Evaluation Metrics: Engineers should know where they can improve and should be recognized for excellent work
Intelligent Scheduling: In DevOps, when every team has at least two on-call rotations, how can we schedule around lives outside of work (and responsibilities inside of work)?
People-First On-call