Page 1
(without introducing more risk)
The Two Sides Of Google Infrastructure for Everyone Else
Gareth Rushgrove
Puppet
Page 2
@garethr
Page 3
Gareth Rushgrove
Page 4
Introduction
A strange format for a talk
Page 5
This is a debate
Page 6
I’ll be debating both sides
Page 7
Taking opposing viewpoints on the same issue, as a way of exploring it in depth
Page 8
The talk is split into two parts: a For part and an Against part
Page 9
I’d like to explore:
- Technical practice evolution
- How we adopt software
- The organisational context
Page 10
This house believes…
Page 11
Successful companies will look like Google in the future, so we should adopt Google-like software and practices today
Page 12
Important disclaimer
I’ve never worked for Google
Page 13
For
Page 14
You’re probably:
1 Struggling with distributed systems
2 Missing out on machine learning
3 Wondering how to scale operations
Page 15
Google have a 10+ year head start
Page 16
Google publish research that influences our industry
Page 17
MapReduce
Page 18
Chubby
Page 19
Borg
Page 20
Google releases (and inspires) software we use
Page 22
Go
Page 23
from
Page 24
GFS = HDFS
BigTable = HBase
Protocol Buffers = Thrift or Avro (serialization)
Stubby = Thrift or Avro (RPC)
ColumnIO = Parquet
Dremel = Impala
Omega = Mesos
Blaze = Pants or Buck
FlumeJava = Crunch
Logsaver = Scribe or Flume
Millwheel = Storm or Samza?
Borgmon/Monarch = Graphite
Dapper = Zipkin
2014, from @avibryant, @joshwills, @skamille, @marius, @wickman
Page 25
We have a term for this: #GIFEE
Page 26
Google Infrastructure for Everyone Else
Page 27
Distributed systems are hard
Page 28
Building your own in-house framework is likely a waste of time
Page 29
From Adrian Colyer, Accel, https://speakerdeck.com/acolyer/making-sense-of-it-all
Page 30
Kubernetes is the 3rd generation of Google’s cluster management software
Page 31
The Kubernetes API provides primitives that make doing the right thing easier
Page 32
- Orchestration
- Logging
- Configuration
- Self-healing
- Storage
- Load balancing
- Service discovery
- Scaling
- Batch workloads
- Lots more
Page 33
Exposed via a modern API
Page 34
Machine learning is going to be massive
Page 35
“Soon We Won’t Program Computers. We’ll Train Them Like Dogs”
Page 36
TensorFlow is an open source software library for numerical computation
Page 37
…
Page 38
- Nearest neighbour
- Linear regression
- Recurrent neural networks
- Multilayer perceptron
- Lots more
Page 39
Introductory ML docs
Page 40
“How do I do devops?”
Everyone ever
Page 41
Google explain how they work too
Page 43
“SRE: Have software engineers do operations”
Dan Luu, ex Google
http://danluu.com/google-sre-book/
Page 44
Dev SRE Ops
From http://web.devopstopologies.com/ by Matthew Skelton
Page 45
The familiar:
- Capacity planning
- Performance
- Change management
- Monitoring
Page 46
The unfamiliar:
- Error budget
- Strong software engineering skills
- 50% operations work cap
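The error budget and the operations work cap are both simple arithmetic, so a rough sketch may help. This is an illustration only, not Google's tooling: the function names, the 30-day window and the 99.9% availability target are all assumptions chosen for the example.

```python
# Minimal sketch of two "unfamiliar" SRE ideas. All names and numbers here are
# hypothetical; they are not taken from the talk or from any Google system.

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime allowed, in minutes, by an availability target over a window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Budget left after the downtime seen so far; negative means overspent."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


def over_ops_cap(ops_hours: float, total_hours: float, cap: float = 0.5) -> bool:
    """True when operations work takes more than the capped share of time."""
    return ops_hours / total_hours > cap


if __name__ == "__main__":
    # An assumed 99.9% target over 30 days allows roughly 43 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # 50 minutes of downtime in that window overspends the budget.
    print(budget_remaining(0.999, 50.0) < 0)
    # 24 hours of ops work in a 40 hour week is over a 50% cap.
    print(over_ops_cap(24, 40))
```

A team that has overspent its budget would, under this model, slow down releases until the window recovers; one that is under budget can spend it on faster change.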
Page 47
A growing ecosystem
Page 48
Friendly vendors
Page 49
More friendly vendors
Page 50
Even more nice vendors
Page 51
Summing up
For
Page 52
“Infrastructure” is shifting to a higher level of abstraction
Page 53
It’s fine to just be a consumer
Page 54
You should be standing on the shoulders of giants
Page 55
You should be standing on the shoulders of
Page 56
Against
Page 57
Your organisation doesn’t look like Google
Page 58
YOUR ORGANISATION DOESN’T LOOK LIKE GOOGLE
Page 59
Could your organisation look like Google?
Page 60
How many employees do you have? Google have about 60,000
Page 61
What proportion of your organisation are software engineers or operations?
Page 62
50 percent?
Based on the Google annual report, December 2014
Page 63
How much do you pay software engineers?
Page 64
Data from Glassdoor, June 2016, based on 14k salaries
Page 65
The $3 million engineer?
Page 67
Build your own chips?
Page 68
Could your organisation really look like Google?
Page 69
“So much of the information in the SRE book makes PERFECT sense if you’re Google”
John Vincent, Ops Hero
Page 70
The reality outside Google
Page 71
<1% of US workers are software engineers or programmers
US Bureau of Labor Statistics 2002. 1,069,000 jobs in a working-age population of 185 million
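The "<1%" figure can be checked from the two numbers quoted on the slide. A quick sketch of the arithmetic, using only those figures (the variable names are illustrative):

```python
# Share of the US working-age population in software engineering or
# programming jobs, using the two figures quoted on the slide.
jobs = 1_069_000            # software engineer / programmer jobs
population = 185_000_000    # working-age population

share = jobs / population
print(f"{share:.2%}")  # prints "0.58%"
```

So the share is a little under 0.6%, comfortably below the 1% stated.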
Page 72
Strategic vendor relationships
Page 73
Different application constraints as well as different organisational constraints
Page 74
“Goal of SRE team isn’t zero outages – SRE and product devs are incentive aligned to spend the error budget to get maximum feature velocity”
Dan Luu, ex Google
http://danluu.com/google-sre-book/
Page 75
What if you’re operating an air traffic control system or a nuclear power station? Your goal is probably closer to zero outages
Page 76
John Vincent SRE review
Page 77
“bringing a software engineering perspective to a problem isn’t always the best or right solution”
John Vincent, Ops Hero
Page 78
Many of Google’s conclusions to operations problems are not unique
Page 81
“Innovation happens elsewhere” applies as much to Google as to other organisations
Page 82
Summing up
Against
Page 83
“If a human operator needs to touch your system during normal operations, you have a bug. The definition of normal changes as your systems grow”
Carla Geisser, Google SRE
Page 84
What is normal for Google may not be suitable for your organisation
Page 85
“Your startup with a single-purpose application does not have the luxury of having your operations team say ‘I’m sorry, you’re over your error budget’”
John Vincent, Ops Hero
Page 87
Conclusions
If all you take away is…
Page 88
Who votes…
For
Page 89
Who votes…
Against
Page 90
Who thinks it’s the wrong question?
Page 91
Context is king
Page 93
“The Overwhelming power of context”
Charity Majors, Ops Person Extraordinaire
Page 94
The technology we run, and how we run it, are interlinked
Page 95
The field of Sociotechnical Systems suggests that all human systems include both a technical system and a social system
https://en.wikipedia.org/wiki/Coevolution#Technological_coevolution
Page 96
Better outcomes are usually obtained by a reciprocal process of joint optimization, through which both the technical system and the social system change
https://en.wikipedia.org/wiki/Coevolution#Technological_coevolution
Page 97
“Containers will not fix your broken culture”
Bridget Kromhout, World’s nicest Ops Person
Page 98
“Awesome culture will not fix your broken containers”
Me, paraphrasing Bridget
Page 99
We are all collectively evolving the practice of operations
Page 100
Keep sharing, because it’s a pretty amazing ride
Page 101
Questions
And thanks for listening