
Rewriting DevOps


Page 1: Rewriting DevOps
Page 2: Rewriting DevOps

TITLE SPONSORS

TRACK SPONSORS

Page 3: Rewriting DevOps

HEADLINE SPONSORS

PARTNER SPONSORS MEMBER SPONSORS

Guiceworks, Cooley, Bridgepoint Education, Full Contact, General Assembly, Dripjoy, Lyft, OnDeck, Connect for Health, Wazee Digital, Officescapes, Jake Jabs Center for Entrepreneurship, Denver Office of Economic Development

Alchemy Security, Ayla Networks, Edge link, Swift page, Taxnologi, Spotx, Davis Graham & Stubbs, Documoto, Right point, Name.com, The Denver Foundation, Boomtown, Six Actual, Maker Source, Slider Smith & Frampton, Netsuite, Logistical Meetings & Events

Page 4: Rewriting DevOps

Rewriting DevOps

Matthew Boeckman
VP - Infrastructure
Craftsy
@matthewboeckman

Page 5: Rewriting DevOps

This is not a DevOps definition

● Common Tooling
● Organizational Empathy
● Shared Responsibility

Page 6: Rewriting DevOps

Why Rewrite?

1. Support new business initiatives
2. Scale and resilience
3. Quicker iterations

Page 7: Rewriting DevOps

#30 on Forbes' 2015 list of Most Promising Companies
10+MM registered members, 11+MM enrolled courses
350 course enrollments/hour

Page 8: Rewriting DevOps
Page 9: Rewriting DevOps
Page 10: Rewriting DevOps
Page 11: Rewriting DevOps

DevOps 1.0

● Some Ops dev’d, and a few Devs Ops’d
● Great cross team culture, still separate teams
● Shared Oncall but heavy Ops burden
● Limited common tooling

Page 12: Rewriting DevOps

DevOps 2.0 goals

● Integrated DevOps team and workflows
● Common tools
● Shared Oncall

Page 13: Rewriting DevOps
Page 14: Rewriting DevOps
Page 15: Rewriting DevOps
Page 16: Rewriting DevOps

Common Tooling

Page 17: Rewriting DevOps
Page 18: Rewriting DevOps

Common Tools

● Jenkins (build, deploy, ETL, scheduled tasks)
● Terraform (infrastructure configuration)
● Splunk (data intelligence)
● AWS (all infrastructure)

Page 19: Rewriting DevOps
Page 20: Rewriting DevOps

Backend

Ops

Frontend

Page 21: Rewriting DevOps
Page 22: Rewriting DevOps

Organizational Empathy

Page 23: Rewriting DevOps

Site Reliability Engineering

*not DevOps

Page 24: Rewriting DevOps

"Fundamentally, it's what happens when you ask a software engineer to design an operations function."

Ben Treynor Sloss, Vice President, Google Engineering, founder of Google SRE

Page 25: Rewriting DevOps

SRE Phase 1 (Feb-May)

● Determine tooling
  ○ Nagios, Graphite, Splunk, Confluence
● SWAG at reliability metrics
  ○ Errors; response time
● Runbooks
● Blameless Postmortem every outage
● Iterate
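The two first-guess reliability metrics named above (errors and response time) could be computed along these lines. This is only a sketch: the `RequestSample` shape, the choice to count 5xx statuses as errors, and the 95th-percentile default are assumptions for illustration, not details from the talk.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestSample:
    """One observed request (hypothetical shape, e.g. parsed from access logs)."""
    status: int         # HTTP status code
    duration_ms: float  # time taken to serve the request


def error_rate(samples: List[RequestSample]) -> float:
    """Fraction of requests that ended in a server error (assumed: 5xx)."""
    if not samples:
        return 0.0
    errors = sum(1 for s in samples if s.status >= 500)
    return errors / len(samples)


def percentile_ms(samples: List[RequestSample], percent: int = 95) -> float:
    """Nearest-rank percentile of response time, in milliseconds."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    durations = sorted(s.duration_ms for s in samples)
    # Integer form of ceil(percent / 100 * n), avoiding float rounding.
    rank = (percent * len(durations) + 99) // 100
    return durations[rank - 1]
```

Starting with rough numbers like these fits the "SWAG, then iterate" approach: the thresholds can be tuned once real traffic data is available.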

Page 26: Rewriting DevOps

The primary hurdle to DevOps and SRE adoption is

The Skill Gap

Page 27: Rewriting DevOps

Runbooks:

● System overview
● Escalation path
● Alert descriptions
● Common failure conditions
● Known recovery procedures
● Incident history
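One way to keep runbooks consistent is to treat the sections listed above as a structure that can be checked for gaps. The sketch below assumes a hypothetical `Runbook` class whose field names mirror the slide; the field types and the `missing_sections` helper are illustrative choices, not the speaker's tooling.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, List


@dataclass
class Runbook:
    """Runbook skeleton; section names follow the slide, types are assumed."""
    service: str
    system_overview: str = ""
    escalation_path: List[str] = field(default_factory=list)
    alert_descriptions: Dict[str, str] = field(default_factory=dict)
    common_failure_conditions: List[str] = field(default_factory=list)
    known_recovery_procedures: Dict[str, str] = field(default_factory=dict)
    incident_history: List[str] = field(default_factory=list)


def missing_sections(runbook: Runbook) -> List[str]:
    """Return the names of sections that are still empty, in slide order."""
    return [
        f.name
        for f in fields(runbook)
        if f.name != "service" and not getattr(runbook, f.name)
    ]


# Example: a half-written runbook for a hypothetical "checkout" service.
draft = Runbook(
    service="checkout",
    system_overview="Node.js frontend calling a Java backend.",
    escalation_path=["on-call engineer", "team lead"],
)
print(missing_sections(draft))
```

A check like this could run in a build job so that an incomplete runbook is flagged before the service it describes goes on-call.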

Page 28: Rewriting DevOps

Postmortem - 7 W’s and an H

1. What (happened)
2. What (systems were impacted)
3. When (did it occur)
4. Who (was involved)
5. How (did we discover the issue)
6. Why (did it go explody)
7. What (will we do to remedy it)
8. When (will that remedy be actioned)
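The eight questions above lend themselves to a fill-in template. The following is a minimal sketch, assuming answers are collected in a plain dictionary; the keys, the `unanswered` and `render` helpers, and the "(to be completed)" placeholder are hypothetical names chosen for illustration.

```python
from typing import Dict, List, Tuple

# The eight postmortem questions, in the order given on the slide.
QUESTIONS: List[Tuple[str, str]] = [
    ("what_happened", "What happened?"),
    ("what_impacted", "What systems were impacted?"),
    ("when_occurred", "When did it occur?"),
    ("who_involved", "Who was involved?"),
    ("how_discovered", "How did we discover the issue?"),
    ("why_failed", "Why did it fail?"),
    ("what_remedy", "What will we do to remedy it?"),
    ("when_remedy", "When will that remedy be actioned?"),
]


def unanswered(answers: Dict[str, str]) -> List[str]:
    """Return the keys of questions with no (or only whitespace) answer."""
    return [key for key, _ in QUESTIONS if not answers.get(key, "").strip()]


def render(answers: Dict[str, str]) -> str:
    """Render the postmortem as numbered plain text."""
    lines = []
    for number, (key, question) in enumerate(QUESTIONS, start=1):
        answer = answers.get(key, "").strip() or "(to be completed)"
        lines.append(f"{number}. {question}")
        lines.append(f"   {answer}")
    return "\n".join(lines)
```

Listing the unanswered questions makes it easy to see when a postmortem is ready for review without assigning blame to whoever filled it in.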

Page 29: Rewriting DevOps

Having a “blameless” Post-Mortem process means that engineers whose actions have contributed to an accident can give a detailed account of:

● what actions they took at what time,
● what effects they observed,
● expectations they had,
● assumptions they had made,
● and their understanding of timeline of events as they occurred.

…and that they can give this detailed account without fear of punishment or retribution.

*John Allspaw, CTO - Etsy
https://codeascraft.com/2012/05/22/blameless-postmortems/

Page 30: Rewriting DevOps
Page 31: Rewriting DevOps
Page 32: Rewriting DevOps

Shared Responsibility

Empathy drives action

Common tools and Runbooks bridge the skills gap

Postmortems direct iterations

Page 33: Rewriting DevOps

Incident

Post-Mortem

Tools

Runbook

Reward

Page 34: Rewriting DevOps

SRE Phase 2 (May-> … forever)

● Build a production environment
● Tune reliability metrics
● Load tests
● Resilience tests
● Recovery tests
● Blameless Postmortem every outage
● Runbooks
● Iterate

Page 35: Rewriting DevOps
Page 36: Rewriting DevOps

● Fastly - content delivery
● F5 & ELB - load balancing
● FE - Node.js
● BE - Java
● Packer - AMIs
● Consul - service discovery
● Terraform - infrastructure
● Postgres/RDS - database
● SQS/SNS/Lambda/S3 - everything else

Page 37: Rewriting DevOps

SRE - Two metrics

Mean Time to Identify

Mean Time to Resolve
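Both metrics can be derived from three timestamps per incident. The sketch below assumes each metric is measured from the moment the incident started, which the slide does not specify; the `Incident` record and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    """Timestamps for one outage (hypothetical record shape)."""
    started: datetime     # when the problem began
    identified: datetime  # when the cause was identified
    resolved: datetime    # when service was restored


def _mean(deltas: List[timedelta]) -> timedelta:
    if not deltas:
        raise ValueError("need at least one incident")
    return sum(deltas, timedelta()) / len(deltas)


def mean_time_to_identify(incidents: List[Incident]) -> timedelta:
    """Average time from incident start to identification (assumed definition)."""
    return _mean([i.identified - i.started for i in incidents])


def mean_time_to_resolve(incidents: List[Incident]) -> timedelta:
    """Average time from incident start to resolution (assumed definition)."""
    return _mean([i.resolved - i.started for i in incidents])
```

Feeding these functions from postmortem records would tie the two metrics directly to the "blameless postmortem every outage" practice.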

Page 38: Rewriting DevOps

DevOps + SRE

T-18 days

3 hours

Page 39: Rewriting DevOps

This is not a DevOps definition approach

● Common Tooling
● Organizational Empathy
● Shared Responsibility
● Land and expand
● Start with pre-prod and grow

Page 40: Rewriting DevOps

Thank you!

Questions?

@matthewboeckman