Instrumentation as a Living Documentation
TEACHING HUMANS ABOUT COMPLEX SYSTEMS
I do things to/with computers.
I build real-time systems.
I build distributed systems.
I build critical systems.
AdRoll
LESS THIS
MORE THIS
WE’RE AN AD TECH COMPANY.
REAL-TIME BIDDING
The nature of the problem domain:
• Low latency (< 100 ms per transaction)
• Firm real-time system
• Highly concurrent (> 55 billion transactions per day)
• Global, 24/7 operation
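For a sense of scale, the daily volume above can be turned into an average per-second rate. This is a back-of-the-envelope sketch: it assumes traffic is spread evenly over the day, which real bid traffic is not, so peaks will be higher. The function and variable names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate_per_second(transactions_per_day: float) -> float:
    """Average transactions per second, assuming a perfectly even load."""
    return transactions_per_day / SECONDS_PER_DAY


daily_volume = 55e9       # "> 55 billion transactions per day" from the slide
latency_budget_s = 0.100  # "< 100 ms per transaction" from the slide

rate = average_rate_per_second(daily_volume)

# Little's law: average number in flight = arrival rate x time in system.
# Using the whole latency budget as time-in-system gives a rough ceiling.
in_flight = rate * latency_budget_s

print(f"{rate:,.0f} transactions/second on average")
print(f"roughly {in_flight:,.0f} transactions in flight at once, at most")
```

Even under the even-load assumption this works out to more than 600,000 transactions per second, which is why small per-transaction costs matter so much in this domain.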
I build Complex Systems
Complex Systems
• Non-linear feedback
• Tightly coupled to external systems
• Difficult to model, understand
• Usually a solution to some “wicked problem”
C. WEST CHURCHMAN, GUEST EDITORIAL: WICKED PROBLEMS, MANAGEMENT SCIENCE VOL. 4, 1967
[WICKED PROBLEMS ARE] SOCIAL PROBLEMS WHICH ARE ILL FORMULATED, WHERE THE INFORMATION IS CONFUSING, WHERE THERE ARE MANY CLIENTS AND DECISION-MAKERS WITH CONFLICTING VALUES, AND WHERE THE RAMIFICATIONS IN THE WHOLE SYSTEM ARE THOROUGHLY CONFUSING. […] THE ADJECTIVE ‘WICKED’ IS SUPPOSED TO DESCRIBE THE MISCHIEVOUS AND EVEN EVIL QUALITY OF THESE PROBLEMS, WHERE PROPOSED ‘SOLUTIONS’ OFTEN TURN OUT TO BE WORSE THAN THE SYMPTOMS.
Bad things happen when Complex Systems fail.
Complex Systems often create worse problems than those they solve.
HUMANS ARE BAD AT PREDICTING THE PERFORMANCE OF COMPLEX SYSTEMS (…). OUR ABILITY TO CREATE LARGE AND COMPLEX SYSTEMS FOOLS US INTO BELIEVING THAT WE’RE ALSO ENTITLED TO UNDERSTAND THEM.
CARLOS BUENO, “MATURE OPTIMIZATION HANDBOOK”
The key challenge to sustaining a complex system is maintaining
our understanding of it.
We write documentation.
Complex systems are fiendishly difficult to communicate about.
Miscommunications are accidents in the making.
Documentation reduces accidents.
IF YOU DON’T KNOW HOW THE SYSTEM SHOULD BEHAVE YOU CAN’T SAY HOW IT SHOULDN’T OR ISN’T.
Trouble is, documentation goes out of date.
Complex Systems evolve and written words “rot”
as the system moves on.
Engineers fail to update documentation as the
system changes.
DAVID E. HOFFMAN, “THE DEAD HAND: THE UNTOLD STORY OF THE COLD WAR ARMS RACE AND ITS DANGEROUS LEGACY”
ONE OPERATOR (…) WAS CONFUSED BY THE LOGBOOK. HE CALLED SOMEONE ELSE TO INQUIRE.
“WHAT SHALL I DO?” HE ASKED. “IN THE PROGRAM THERE ARE INSTRUCTIONS OF WHAT TO DO, AND THEN A LOT OF THINGS CROSSED OUT.”
THE OTHER PERSON THOUGHT FOR A MINUTE, THEN REPLIED, “FOLLOW THE CROSSED OUT INSTRUCTIONS.”
Engineers can be unaware of the system as it is actually used.
ERIC SCHLOSSER, COMMAND AND CONTROL: NUCLEAR WEAPONS, THE DAMASCUS ACCIDENT, AND THE ILLUSION OF SAFETY
CLEARLY THE TEXTBOOKS (…) DIDN’T TELL YOU WHAT REALLY HAPPENED IN THE FIELD. (…) (T)HERE WAS A WAY YOU WERE SUPPOSED TO DO THINGS – AND THE WAY THINGS GOT DONE. RFHCO SUITS WERE HOT AND CUMBERSOME (…) AND IF A MAINTENANCE TASK COULD BE ACCOMPLISHED QUICKLY WITHOUT AN OFFICER NOTICING, SOMETIMES THE SUITS WEREN’T WORN.
(Normal) Accidents happen.
HENRY S. F. COOPER, JR., XIII: THE APOLLO FLIGHT THAT FAILED
THE FIRST DISASTER IN SPACE HAD OCCURRED, AND NO ONE KNEW WHAT HAD HAPPENED. ON THE GROUND, THE FLIGHT CONTROLLERS WERE NOT EVEN SURE THAT ANYTHING HAD.
Documentation doesn’t necessarily reflect the reality of the system.
What can we do?
INSTRUMENTATION
Instrumentation reflects the reality of the system as it exists.
Instrumentation allows users and engineers to explore the system as
it exists.
Exploration, done honestly, guides us to a new, better understanding
of the system.
THIS “COLLECTIVE ENTITY” WAS ORGANIZED AROUND THE PILOT TO MAKE IT “SAFER AND MORE EFFICIENT IF THERE WAS A FOCAL POINT. AND I WAS THE FOCAL POINT. JIM FED THINGS INTO MY EARS. THE MOON FED THINGS INTO MY EYES AND I COULD FEEL THE MACHINE OPERATING.”
COMMANDER DAVID SCOTT AS QUOTED IN DAVID A. MINDELL'S DIGITAL APOLLO: HUMAN AND MACHINE IN SPACEFLIGHT
Instrumentation democratizes the organization around a complex
system.
Case Studies
Case Study: Exchange Throttling
Healthy pattern of bid requests
The trough of throttling
BAD
GOOD
Problem confirmed with Exchange
• All other metrics (run-queue, CPU, network IO) were fine.
• Confirmed that no changes had been made to the running systems via deployment.
• Amazon data showed no network issues to our machines.
What happened?
We hit an implicit exchange limit. (Arguably, a bug.)
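The kind of check that makes a trough like this visible can be sketched as follows. This is an illustration, not the system described in the talk: the function name, the window size and the 50% threshold are all assumptions, and the sample counts are invented for the example.

```python
from statistics import median


def find_troughs(counts, window=5, drop_ratio=0.5):
    """Return indices where a sample falls below drop_ratio times the median
    of the preceding `window` samples (a crude rolling baseline)."""
    troughs = []
    for i in range(window, len(counts)):
        baseline = median(counts[i - window:i])
        if baseline > 0 and counts[i] < drop_ratio * baseline:
            troughs.append(i)
    return troughs


# Hypothetical bid requests per minute; the dip at indices 6 and 7 is invented.
requests_per_minute = [1000, 1020, 990, 1010, 1005, 995, 300, 280, 1000, 1010]
troughs = find_troughs(requests_per_minute)
print(troughs)
```

The median is used for the baseline so that one or two throttled samples do not immediately drag the baseline down with them. A check like this only says that request volume dropped; as in the case study, ruling out local causes and confirming with the exchange still takes people.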
Case Study: Timeout Jumps
Healthy Pattern of Background Timeouts
Unhealthy timeouts.
Healthy Bid Requests
Unhealthy Bid Requests
Cliff of Throttling
• The timeout jump occurred only in US East; US West was fine.
• All other metrics (as above) checked out.
• A system deployment correlated strongly with the timeout jump.
• Rolling back to the previous release reduced timeouts to acceptable levels.
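The observations above amount to comparing timeout rates before and after a deployment, region by region. A minimal sketch of that comparison follows; the function name, the region labels and every number in it are hypothetical, chosen only to mirror the shape of the case study.

```python
from statistics import mean


def timeout_jump(samples, deploy_time):
    """Ratio of the mean timeout rate after deploy_time to the mean before it.

    `samples` is a list of (timestamp, timeout_rate) pairs.
    Returns None when either side has no data or the earlier mean is zero.
    """
    before = [rate for t, rate in samples if t < deploy_time]
    after = [rate for t, rate in samples if t >= deploy_time]
    if not before or not after or mean(before) == 0:
        return None
    return mean(after) / mean(before)


# Hypothetical timeout rates (fraction of requests); deployment at t=3.
by_region = {
    "us-east": [(0, 0.01), (1, 0.01), (2, 0.01), (3, 0.04), (4, 0.04)],
    "us-west": [(0, 0.01), (1, 0.01), (2, 0.01), (3, 0.01), (4, 0.01)],
}
jumps = {region: timeout_jump(s, deploy_time=3) for region, s in by_region.items()}
for region, ratio in jumps.items():
    print(region, round(ratio, 2))
```

A ratio well above 1 in one region and near 1 in another points at the deployment without explaining it. The comparison shows correlation only.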
What happened?
Who can say? ¯\_(シ)_/¯
Lessons Learned
It is possible to have too little information.
(THE FIREFIGHTERS) TRIED TO BEAT DOWN THE FLAMES (OF CHERNOBYL REACTOR 4). THEY KICKED AT THE BURNING GRAPHITE WITH THEIR FEET. … THE DOCTORS KEPT TELLING THEM THEY’D BEEN POISONED BY GAS.
SVETLANA ALEXIEVICH, VOICES FROM CHERNOBYL: THE ORAL HISTORY OF A NUCLEAR DISASTER
It is possible to collect too much information, or
present it badly.
SAFETY SYSTEMS, SUCH AS WARNING LIGHTS, ARE NECESSARY, BUT THEY HAVE THE POTENTIAL FOR DECEPTION. (…) ONE OF THE LESSONS OF COMPLEX SYSTEMS AND (THREE MILE ISLAND) IS THAT ANY PART OF THE SYSTEM MIGHT BE INTERACTING WITH OTHER PARTS IN UNANTICIPATED WAYS.
CHARLES PERROW, NORMAL ACCIDENTS: LIVING WITH HIGH-RISK TECHNOLOGIES
Instrumentation is not a panacea.
Instruments may be misleading.
Must know some Mathematics.
Too much information hampers interpretation.
Instruments may be inaccurate.
Instruments may be ignored.
Instrumentation may be used for undesirable purposes.
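A small worked example of how an instrument can mislead, and why some mathematics helps: an average can look healthy while the tail is not. The latencies below are invented for illustration, and the choice of mean, median and 99th percentile is an assumption about which summaries a dashboard might show.

```python
from statistics import mean, quantiles

# Hypothetical latencies in milliseconds: 98 fast requests, 2 very slow ones.
latencies_ms = [20] * 98 + [2000] * 2

average = mean(latencies_ms)

# quantiles(..., n=100) returns the 99 cut points for percentiles 1 to 99.
cut_points = quantiles(latencies_ms, n=100)
p50 = cut_points[49]
p99 = cut_points[98]

print(f"mean={average} ms, p50={p50} ms, p99={p99} ms")
```

Against a 100 ms budget, the mean alone suggests everything is fine, while the 99th percentile shows that some requests take seconds.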
What can we do?
Write documentation!
Misleading Instruments
Context reduces misinterpretations.
Must Know Math
Procedure manuals and visualizations reduce the need for math background.
Too Much Information
The more contextual layers you add, the more you reduce “big boards of blinky lights”.
INSTRUMENTATION IS LIKE A SUIT. IT NEEDS TO FIT YOUR OWN MIND.
VALENTINO VOLONGHI
Inaccuracy
Cross-checks and documented error margins mitigate instrument inaccuracy.
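One way to read "cross-checks and documented error margins" is to compare two independent readings of the same quantity against an agreed tolerance. The sketch below assumes a relative margin. The function name, the 2% figure and the two hypothetical instruments are illustrative, not taken from the talk.

```python
def cross_check(reading_a, reading_b, relative_margin):
    """True when two independent readings agree within relative_margin,
    measured against the larger of the two magnitudes."""
    scale = max(abs(reading_a), abs(reading_b))
    if scale == 0:
        return True  # both read zero, so they agree
    return abs(reading_a - reading_b) / scale <= relative_margin


# Hypothetical: requests counted at the load balancer vs. by the application,
# with a documented margin of 2% between the two instruments.
within_margin = cross_check(100_000, 99_100, relative_margin=0.02)
outside_margin = cross_check(100_000, 90_000, relative_margin=0.02)
print(within_margin, outside_margin)
```

Writing the margin down is the documentation half of the idea: without it, nobody can say whether a 1% disagreement is noise or a fault.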
IF YOU DON'T TRUST A COMPUTER BECAUSE SOMETIMES IT DOESN'T TELL YOU THE TRUTH, TELLING IT TO TELL YOU TO TRUST IT IS ASKING IT TO LIE TO YOU SOMETIMES.
MIKE SASSAK, CURBSIDE
May be Ignored
Checklists with references to instrumentation at decision points.
Undesirable Purposes
Collaborative Workplaces, Cooperatives, Unions, Laws etc.
I PROPOSE THAT MEN AND WOMEN BE RETURNED TO WORK AS CONTROLLERS OF MACHINES, AND THAT THE CONTROL OF PEOPLE BY MACHINES BE CURTAILED. I PROPOSE, FURTHER, THAT THE EFFECTS OF CHANGES IN TECHNOLOGY AND ORGANIZATION ON LIFE PATTERNS BE TAKEN INTO CAREFUL CONSIDERATION, AND THAT THE CHANGES BE WITHHELD OR INTRODUCED ON THE BASIS OF THIS CONSIDERATION.
KURT VONNEGUT, PLAYER PIANO
TL;DR
Instrumentation addresses the problems of documentation; documentation addresses the problems of instrumentation.
Complex Systems need them both.
How do I get started?
Exometer
Dropwizard’s Metrics
Scales
DataDog, NewRelic, Librato
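Before adopting any of the tools above, the basic idea can be tried with nothing but the Python standard library. The sketch below is not the API of Exometer, Dropwizard's Metrics or Scales. The `Metrics` class, its method names and the metric names are hypothetical, chosen only to show a counter and a timer wrapped around a unit of work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """A tiny in-process registry of counters and timing samples."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)

    def increment(self, name, amount=1):
        self.counters[name] += amount

    @contextmanager
    def timer(self, name):
        """Record the wall-clock duration of the enclosed block, in seconds."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings[name].append(time.perf_counter() - start)


metrics = Metrics()


def handle_request(payload):
    # Hypothetical request handler, instrumented with a counter and a timer.
    metrics.increment("requests.received")
    with metrics.timer("requests.duration"):
        return payload.upper()


handle_request("bid")
handle_request("no-bid")
print(dict(metrics.counters))
print(len(metrics.timings["requests.duration"]), "timing samples")
```

A real library adds what this sketch leaves out, such as thread safety, bounded memory for samples and reporting to an external service.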
Questions?
Thanks! <3
@bltroutwine