31
STAGES OF PRACTICE STAGES OF PRACTICE IN IN SITE RELIABILITY SITE RELIABILITY ENGINEERING ENGINEERING

STAGES OF PRACTICE IN SITE RELIABILITY ENGINEERING · Postmortems Novice “Blameful”, only for crisis incidents Looking for a scapegoat Beginner Only performed for major incidents

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: STAGES OF PRACTICE IN SITE RELIABILITY ENGINEERING · Postmortems Novice “Blameful”, only for crisis incidents Looking for a scapegoat Beginner Only performed for major incidents

STAGES OF PRACTICESTAGES OF PRACTICE ININ

SITE RELIABILITYSITE RELIABILITY ENGINEERINGENGINEERING

Page 2: STAGES OF PRACTICE IN SITE RELIABILITY ENGINEERING · Postmortems Novice “Blameful”, only for crisis incidents Looking for a scapegoat Beginner Only performed for major incidents

Stages of PracticeStages of PracticeShu (obey/basics) 守Ha (detach/ready to learn) 破Ri (separate/intuitive) 離

Page 3: STAGES OF PRACTICE IN SITE RELIABILITY ENGINEERING · Postmortems Novice “Blameful”, only for crisis incidents Looking for a scapegoat Beginner Only performed for major incidents

Stages of PracticeStages of PracticeInnocent

Shu (obey) Novice

Beginner

Competent

Ha (detach) Pro�cient

Ri (separate) Master  

Expert/Researcher

Page 4: STAGES OF PRACTICE IN SITE RELIABILITY ENGINEERING · Postmortems Novice “Blameful”, only for crisis incidents Looking for a scapegoat Beginner Only performed for major incidents

Signposts of SRE PracticeSignposts of SRE PracticeIncident ResponsePostmortemsIncident PreventionService Level Handling: Indicators, Objectives, Agreements (IOAs)MonitoringCapacity Planning and ForecastingPerformance Management

Page 5: STAGES OF PRACTICE IN SITE RELIABILITY ENGINEERING · Postmortems Novice “Blameful”, only for crisis incidents Looking for a scapegoat Beginner Only performed for major incidents

Signposts:Signposts: Incident ResponseIncident Response

 

(hat tip to J. Paul Reed)

Page 6: STAGES OF PRACTICE IN SITE RELIABILITY ENGINEERING · Postmortems Novice “Blameful”, only for crisis incidents Looking for a scapegoat Beginner Only performed for major incidents

Shu Signposts:Shu Signposts: Incident ResponseIncident Response

Novice “Alarmed” by incidentsPrimarily external sourced with inconsistentresponse

Beginner “Fears” incidentsE�ective response requires speci�c people

Competent “Aware” that incidents are normalWell de�ned handling process

Page 7: STAGES OF PRACTICE IN SITE RELIABILITY ENGINEERING · Postmortems Novice “Blameful”, only for crisis incidents Looking for a scapegoat Beginner Only performed for major incidents

Ha-Ri Signposts:Ha-Ri Signposts: Incident ResponseIncident Response

Pro�cient “Accept” incidents as a normalSome inter-team coordination planning

Master “Embrace” incidents as learning experiencesWell documented processes and procedures withlearning inputs to the process

Page 8: STAGES OF PRACTICE IN SITE RELIABILITY ENGINEERING · Postmortems Novice “Blameful”, only for crisis incidents Looking for a scapegoat Beginner Only performed for major incidents

Signposts:Signposts: PostmortemsPostmortems

Page 9: STAGES OF PRACTICE IN SITE RELIABILITY ENGINEERING · Postmortems Novice “Blameful”, only for crisis incidents Looking for a scapegoat Beginner Only performed for major incidents

Shu Signposts:Shu Signposts: PostmortemsPostmortems

Novice “Blameful”, only for crisis incidentsLooking for a scapegoat

Beginner Only performed for major incidentsLooking for a cause with a focus on mistakes

Competent More common, starting to look past blamingFocus on improving local processes

Page 10: STAGES OF PRACTICE IN SITE RELIABILITY ENGINEERING · Postmortems Novice “Blameful”, only for crisis incidents Looking for a scapegoat Beginner Only performed for major incidents

Ha-Ri Signposts:Ha-Ri Signposts: PostmortemsPostmortems

Pro�cient “Blameless”, used consistentlyAction items feed back to improve systems &processes

Master Used to derive “meta”-learningsApplying learnings across the system

Page 11: STAGES OF PRACTICE IN SITE RELIABILITY ENGINEERING · Postmortems Novice “Blameful”, only for crisis incidents Looking for a scapegoat Beginner Only performed for major incidents

Signposts:Signposts: Incident PreventionIncident Prevention

 

(hat tip to J. Paul Reed)

Page 12: STAGES OF PRACTICE IN SITE RELIABILITY ENGINEERING · Postmortems Novice “Blameful”, only for crisis incidents Looking for a scapegoat Beginner Only performed for major incidents

Shu Signposts:Shu Signposts: Incident PreventionIncident Prevention

Novice Focus on remediation (docs & metrics) formanually-identi�ed, static, contributory causes

Beginner Documentation done to an “acceptable” levelStatic & action-based causes recognized

Competent Focus on team response to incidents, maintainingdocs

Page 13: STAGES OF PRACTICE IN SITE RELIABILITY ENGINEERING · Postmortems Novice “Blameful”, only for crisis incidents Looking for a scapegoat Beginner Only performed for major incidents

Ha-Ri Signposts:Ha-Ri Signposts: Incident PreventionIncident Prevention

Pro�cient Early phases of chaos engineering - scheduledLimited randomized chaos engineering

Master Chaos engineering as a tool to manage to an SLOFocus on general hygiene of operationalenvironment

Page 14: STAGES OF PRACTICE IN SITE RELIABILITY ENGINEERING · Postmortems Novice “Blameful”, only for crisis incidents Looking for a scapegoat Beginner Only performed for major incidents

Signposts:Signposts: Service Level HandlingService Level Handling

SLIs / SLOs / SLAsSLIs / SLOs / SLAs

Page 15: STAGES OF PRACTICE IN SITE RELIABILITY ENGINEERING · Postmortems Novice “Blameful”, only for crisis incidents Looking for a scapegoat Beginner Only performed for major incidents

Shu Signposts:Shu Signposts: SL[IOA]sSL[IOA]s

Novice Externally imposed (SLA), if anyOn paper, not necessarily measuredMay be manually calculated for contractual needs

Beginner Recognizes the di�erence in these termsMeasures “easy” things

Competent De�ned and measured primary characteristicsMeasures internal SLOs (80+%), not justcontractual performance

Page 16: STAGES OF PRACTICE IN SITE RELIABILITY ENGINEERING · Postmortems Novice “Blameful”, only for crisis incidents Looking for a scapegoat Beginner Only performed for major incidents

Ha-Ri Signposts:Ha-Ri Signposts: SL[IOA]sSL[IOA]s

Pro�cient Well developed cascade of measuresHistorical record and correlation to events

Master Meaningful measures throughout the system

Page 17: STAGES OF PRACTICE IN SITE RELIABILITY ENGINEERING · Postmortems Novice “Blameful”, only for crisis incidents Looking for a scapegoat Beginner Only performed for major incidents

Signposts:Signposts: MonitoringMonitoring

Page 18: STAGES OF PRACTICE IN SITE RELIABILITY ENGINEERING · Postmortems Novice “Blameful”, only for crisis incidents Looking for a scapegoat Beginner Only performed for major incidents

Shu Signposts:Shu Signposts: MonitoringMonitoring

Novice No baseline metrics established

Beginner “OS level” or “out of the box”, inconsistentmonitoringPartial baselines being developed

Competent Consistent baseline monitoring across entiresystemAble to determine statistical anomalies

Page 19: STAGES OF PRACTICE IN SITE RELIABILITY ENGINEERING · Postmortems Novice “Blameful”, only for crisis incidents Looking for a scapegoat Beginner Only performed for major incidents

Ha-Ri Signposts:Ha-Ri Signposts: MonitoringMonitoring

Pro�cient Thorough instrumentation of all service componentsAble to correlate internal and external measures

Master Data observable upon demandAutomated correlation and anomaly detection

Page 20: STAGES OF PRACTICE IN SITE RELIABILITY ENGINEERING · Postmortems Novice “Blameful”, only for crisis incidents Looking for a scapegoat Beginner Only performed for major incidents

Signposts:Signposts: Capacity PlanningCapacity Planning and Forecastingand Forecasting

Page 21: STAGES OF PRACTICE IN SITE RELIABILITY ENGINEERING · Postmortems Novice “Blameful”, only for crisis incidents Looking for a scapegoat Beginner Only performed for major incidents

Shu Signposts:Shu Signposts: Capacity Planning and ForecastingCapacity Planning and Forecasting

Novice Frequently running out of resourcesThrow hardware at the problem

Beginner Reactive, manual and/or time-consumingSome metrics, but incomplete

Competent Able to identify and react to danger situationsbefore crisisCoverage of the most critical 20-50%

Page 22: STAGES OF PRACTICE IN SITE RELIABILITY ENGINEERING · Postmortems Novice “Blameful”, only for crisis incidents Looking for a scapegoat Beginner Only performed for major incidents

Ha-Ri Signposts:Ha-Ri Signposts: Capacity Planning and ForecastingCapacity Planning and Forecasting

Pro�cient Nearly complete coverageAble to reliably predict capacity in the short term (1+purchase cycles)Established methods for new service/featurehandling

Master Able to reliably predict capacity 3-4 purchase cyclesout

Page 23: STAGES OF PRACTICE IN SITE RELIABILITY ENGINEERING · Postmortems Novice “Blameful”, only for crisis incidents Looking for a scapegoat Beginner Only performed for major incidents

Signposts:Signposts: Performance ManagementPerformance Management

Page 24: STAGES OF PRACTICE IN SITE RELIABILITY ENGINEERING · Postmortems Novice “Blameful”, only for crisis incidents Looking for a scapegoat Beginner Only performed for major incidents

Shu Signposts:Shu Signposts: Performance ManagementPerformance Management

Novice “The site is up, isn’t that good enough?”

Beginner Selective monitoring

Competent Comprehensive monitoring, but still reactive onregressions

Page 25: STAGES OF PRACTICE IN SITE RELIABILITY ENGINEERING · Postmortems Novice “Blameful”, only for crisis incidents Looking for a scapegoat Beginner Only performed for major incidents

Ha-Ri Signposts:Ha-Ri Signposts: Capacity Planning and ForecastingCapacity Planning and Forecasting

Pro�cient Able to prevent regressions through pre- releaseanalysis/modeling/validationInitial “large market” segmentation

Master Finely granular performance monitoring andmanagement

Page 26: STAGES OF PRACTICE IN SITE RELIABILITY ENGINEERING · Postmortems Novice “Blameful”, only for crisis incidents Looking for a scapegoat Beginner Only performed for major incidents

Assessing Your Organization’sAssessing Your Organization’s Level of PracticeLevel of Practice

Mock Assessment Search

Monitoring

Novice

Competent

Pro..

BeginnerBeg..

I-Response

Beginner

SLx

Novice

Beginner

I-Prevent PM

Com..

NoviceNovice

Page 27: STAGES OF PRACTICE IN SITE RELIABILITY ENGINEERING · Postmortems Novice “Blameful”, only for crisis incidents Looking for a scapegoat Beginner Only performed for major incidents

Other Potential Areas to EvaluateOther Potential Areas to EvaluateObservability Tooling & CapabilitiesError Budget De�nition and Usage

Continuous Budget Tracking (vs. binary Bang-bang handling)Change Management PracticesFull Service Lifecycle Reliability: Do Your Services “Plan for Retirement”?

Page 28: STAGES OF PRACTICE IN SITE RELIABILITY ENGINEERING · Postmortems Novice “Blameful”, only for crisis incidents Looking for a scapegoat Beginner Only performed for major incidents

Even More Potential Areas toEven More Potential Areas toEvaluateEvaluate

New Services: Intro to Stability ArcToil FractionOncall Sustainability

Ryan Franz (Etsy): Mean Time to Sleep (MTTS)

Page 29: STAGES OF PRACTICE IN SITE RELIABILITY ENGINEERING · Postmortems Novice “Blameful”, only for crisis incidents Looking for a scapegoat Beginner Only performed for major incidents

Each Each ‘9’‘9’ will cost you more will cost you more than the one before it…than the one before it…

 

Org-wide Practice Adoption ?Org-wide Practice Adoption ?Practices Under Pressure?Practices Under Pressure?

Page 30: STAGES OF PRACTICE IN SITE RELIABILITY ENGINEERING · Postmortems Novice “Blameful”, only for crisis incidents Looking for a scapegoat Beginner Only performed for major incidents

Will Will ChangeChange Ever Stop? Ever Stop?The root cause for both the functioning and malfunctions in all

complex systems is impermanence (…all systems are changeable bynature). Knowing the root cause, we no longer seek it, and insteadlook for the many conditions that allowed a particular situation to

manifest. We accept that not all conditions are knowable or �xable.

– Dave Zwieback Beyond Blame

. . . and that’s a . . . and that’s a goodgood thing thing

Page 31: STAGES OF PRACTICE IN SITE RELIABILITY ENGINEERING · Postmortems Novice “Blameful”, only for crisis incidents Looking for a scapegoat Beginner Only performed for major incidents

. . . the real story . . . goes on forever . . . every chapter is better than the one before.

Continuing the conversation. . .Continuing the conversation. . .Twitter: @DrKurtATwitter: @DrKurtA

https://linkedin.com/in/kurta1