FAILURE The Good Parts
Viktor Klang, Director of Engineering
Build powerful, concurrent, resilient & distributed
software more easily.
FAILURE The Bad Parts
Ariane 5 - 4 June 1996
๏ 10 years of research
๏ $7 billion invested
๏ Exploded within a minute of take-off
๏ Loss estimate $370 million
๏ Why?
๏ Trying to stuff a 64-bit float into a 16-bit int
๏ o_O + wat
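The narrowing problem above can be sketched on the JVM. This is only an illustration of an unchecked versus a checked conversion, not the Ariane flight code; the object and method names are hypothetical.

```scala
object Narrowing {
  // Unchecked: the value is truncated to Int, then the high bits are
  // silently dropped when narrowing to a 16-bit Short.
  def unchecked(d: Double): Short = d.toInt.toShort

  // Checked: anything outside the 16-bit range (or NaN) becomes an
  // explicit None that the caller has to deal with.
  def checked(d: Double): Option[Short] =
    if (d.isNaN || d < Short.MinValue || d > Short.MaxValue) None
    else Some(d.toInt.toShort)
}
```

With this sketch, `Narrowing.unchecked(40000.0)` quietly yields a negative number, while `Narrowing.checked(40000.0)` yields `None`.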
Failure is an option. A Some(failure)
to be exact. – me
Failure Recovery
#define Failure
#undef Failure
Software fails
Runtime
๏VM (OpenJDK Issue Tracker)
๏OS
๏Drivers
๏Firmware
Runtime
๏Overload/Exhaustion
๏Stack
๏Heap
๏FDs
๏…
๏Starvation
Hardware fails
CPUs
"Related instructions that are affected by the bug are
FDIVP, FDIVR, FDIVRP, FIDIV, FIDIVR, FPREM, and FPREM1.
The instructions FPTAN and FPATAN are also susceptible"
http://en.wikipedia.org/wiki/Pentium_FDIV_bug
RAM
DRAM Errors in the Wild: A Large-Scale Field Study
Bianca Schroeder Dept. of Computer Science
University of Toronto Toronto, Canada
Eduardo Pinheiro Google Inc.
Mountain View, CA
Wolf-Dietrich Weber Google Inc.
Mountain View, CA
DRAM Errors in the wild
๏Memory errors were 15 to 120 times (!) more common than had previously been assumed.
๏More than 90% of the problems with a given platform were caused by about 20% of the machines that had errors.
(Figure credit: Bianca Schroeder, Eduardo Pinheiro, and Wolf-Dietrich Weber, http://news.cnet.com/8301-30685_3-10370026-264.html)
๏Temperature didn't seem to make a big difference.
๏Irreparable problems were more common than transient problems.
๏Increased number of errors with age, setting in as early as 10-18 months in the field.
HDDs
Failure Trends in a Large Disk Drive Population
Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz André Barroso
Google Inc. 1600 Amphitheatre Pkwy Mountain View, CA 94043
{edpin,wolf,luiz}@google.com
Failure Trends by age
Failure Trends by utilization and age
The Network is Reliable (LOL)
Kyle Kingsbury's blog:
http://aphyr.com/posts/288-the-network-is-reliable
Wetware fails
An expert is a man who has made all
the mistakes which can be made, in a
narrow field. – Niels Bohr
Assumptions are bad
Quiz
val result = something(x,y)
๏ Failure is unintentional
๏ Validation is intentional
Validation vs Failure
Flows of information
๏ Results & Validation
๏ Failures & Recovery
๏ Don't complect them!
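One way to keep the two flows apart, sketched with a hypothetical vending machine: validation outcomes are ordinary return values (`Either`), while failures travel on a separate channel (`Try`). All names here (`Vending`, `Rejection`, `buy`, the `jammed` flag) are assumptions made for illustration.

```scala
import scala.util.Try

object Vending {
  // Validation: intentional, expected outcomes modelled as values.
  sealed trait Rejection
  case object UnknownProduct extends Rejection
  case object InsufficientFunds extends Rejection

  private val prices = Map("cola" -> 150)

  // Right(change) on success, Left(reason) when the input is invalid.
  def validate(product: String, coins: Int): Either[Rejection, Int] =
    prices.get(product) match {
      case None                     => Left(UnknownProduct)
      case Some(price) if coins < price => Left(InsufficientFunds)
      case Some(price)              => Right(coins - price)
    }

  // Failure: unintentional, so it is kept out of the validation type.
  // `jammed` simulates a broken dispenser.
  def buy(product: String, coins: Int, jammed: Boolean): Try[Either[Rejection, Int]] =
    Try {
      validate(product, coins).map { change =>
        if (jammed) throw new IllegalStateException("dispenser jammed")
        change
      }
    }
}
```

The customer only ever needs to look at the `Either`; whoever is responsible for the machine looks at the `Try`.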
Attribution:
The Little Vending Machine That Could
Failure, Validation, Handled
Outcome awareness
Known-Unknowns Unknown-Unknowns
Known-Knowns Unknown-Knowns
Failure awareness
Known-Unknowns Unknown-Unknowns
Known-Knowns Unknown-Knowns
๏ Result
๏ Invalid input
๏ Illegal value
๏ Illegal value combination
๏ Capability/Dependency violation
๏ Nothing
๏ Uninvoked
๏ Response lost
Possibilities
Program testing can be used to show the presence of bugs, but never to show their absence!
– Edsger Dijkstra
Testing & Checking
๏ Testing is good for
๏ Known-Knowns
๏ Checking is good for
๏ Unknown-Knowns
๏ Known-Unknowns
๏ Unknown-Unknowns
๏ Conclusion
๏Use both!
Quiz
val result = println(x,y)
Death & Delay & Distributed Programs
๏ There is no apparent difference between death and delay in a distributed system
๏ "Distributed programming is all about retries and timeouts"
๏ Without distribution you'll always have a SPOF
๏ … but the more hardware you have, the higher the risk of failures
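A minimal sketch of "retries and timeouts", assuming the remote call is modelled as a function returning a `Future`. The caller cannot tell a dead callee from a slow one, so it waits a bounded time and then tries again. `Retry.callWithRetry` is a hypothetical helper, not a library API.

```scala
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.util.{Failure, Try}

object Retry {
  // Waits at most `timeout` per attempt. A timeout is treated like a
  // death of the callee and triggers another attempt, up to `attempts`.
  @annotation.tailrec
  def callWithRetry[A](attempts: Int, timeout: FiniteDuration)(call: () => Future[A]): Try[A] =
    Try(Await.result(call(), timeout)) match {
      case Failure(_: TimeoutException) if attempts > 1 =>
        callWithRetry(attempts - 1, timeout)(call)
      case other => other
    }
}
```

Blocking with `Await` keeps the sketch short; retrying like this is only safe when the remote operation can be repeated without harm.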
A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.
– Leslie Lamport
Traditional Blocking RPC
๏What if: Request is lost
๏What if: Response is lost
๏Caller is held hostage by the Callee
๏… Stockholm Syndrome anyone?
http://steve.vinoski.net/pdf/IEEE-Convenience_Over_Correctness.pdf
Defensive programming
๏ "Paranoid programming"
๏ Mixes concerns
๏ Unclear responsibilities
๏ At best gives a false sense of security
๏ Yields systems that fail extraordinarily
try {
  val breakfast =
    try {
      prepare(new Breakfast)
    } catch {
      case ex: OutOfJamError => …
    } finally {
      …
    }
  eat(breakfast)
} catch {
  case ex: BreakfastOverflowError => …
} finally {
  …
}
Yes We Can
Make Failure
Management Fun
Distribution
Replication & Failover
CircuitBreakers
๏Benefits
๏Relieves pressure on failing parts
๏Are self-healing
๏Can be operated manually
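The three benefits can be seen in a small, single-threaded sketch. This is a hypothetical `SimpleBreaker`, not Akka's CircuitBreaker API; the injectable `now` clock is an assumption added to make it testable.

```scala
import scala.util.{Failure, Try}

final class SimpleBreaker(maxFailures: Int, resetAfterMillis: Long,
                          now: () => Long = () => System.currentTimeMillis()) {
  private var failures = 0
  private var openedAt: Option[Long] = None

  def call[A](body: => A): Try[A] = openedAt match {
    // Open: fail fast, which relieves pressure on the failing part.
    case Some(t) if now() - t < resetAfterMillis =>
      Failure(new IllegalStateException("circuit open"))
    // Closed, or the reset timeout has passed: let the call through.
    case _ =>
      val result = Try(body)
      if (result.isSuccess) { failures = 0; openedAt = None } // self-healing
      else {
        failures += 1
        if (failures >= maxFailures) openedAt = Some(now())
      }
      result
  }

  // Manual operation.
  def trip(): Unit = openedAt = Some(now())
  def reset(): Unit = { failures = 0; openedAt = None }
}
```

While the breaker is open, `body` is never evaluated, so a struggling dependency gets time to recover.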
Supervisors
๏ Components dealing with the failure of subcomponents
๏ Decouples failure from validation
๏ Makes it obvious who is responsible for what
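A rough sketch of the idea, assuming a child component modelled as a plain function: the supervisor, not the caller, decides what happens when the child fails (restart, or escalate once the restart budget is used up). `Supervision.supervise` and its outcome types are hypothetical and much simpler than an actor-based supervisor.

```scala
import scala.util.{Failure, Success, Try}

object Supervision {
  sealed trait Outcome[+A]
  final case class Completed[A](value: A, restarts: Int) extends Outcome[A]
  final case class Escalated(cause: Throwable) extends Outcome[Nothing]

  // Runs the child; on failure restarts it up to `maxRestarts` times,
  // then escalates the last failure to whoever supervises the supervisor.
  @annotation.tailrec
  def supervise[A](maxRestarts: Int, restarts: Int = 0)(child: () => A): Outcome[A] =
    Try(child()) match {
      case Success(value)                       => Completed(value, restarts)
      case Failure(_) if restarts < maxRestarts => supervise(maxRestarts, restarts + 1)(child)
      case Failure(cause)                       => Escalated(cause)
    }
}
```

The child's own code contains no try/catch; results and validation flow back to the caller, failures flow to the supervisor.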
Service
Supervisor
Input
Result/Validation
Failures / Recovery
Supervisors
Quis custodiet ipsos custodes? – Decimus Iunius Iuvenalis
Supervision
Bulkheading
๏Compartmentalization
๏Prevent failures from cascading
๏Plays well with redundancy & failover
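One common way to compartmentalize on the JVM is to give each dependency its own bounded thread pool, so that a slow dependency can only exhaust its own compartment. A minimal sketch, with hypothetical pool and method names:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

object Bulkheads {
  // One bounded pool per compartment.
  val slowPool = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(2))
  val fastPool = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(2))

  // Simulated slow dependency: can saturate slowPool, but only slowPool.
  def slowCall(): Future[String] = Future { Thread.sleep(200); "slow" }(slowPool)

  // Unaffected by a backlog in the slow compartment.
  def fastCall(): Future[String] = Future("fast")(fastPool)

  def shutdown(): Unit = { slowPool.shutdown(); fastPool.shutdown() }
}
```

Even with many `slowCall()`s queued up, `fastCall()` still completes promptly because it never competes for the same threads.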
An escalator can never break: it can only become stairs. You should never see an Escalator Temporarily Out Of Order sign, just Escalator Temporarily Stairs. Sorry for the convenience.
– Mitch Hedberg
Graceful degradation
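In code, "becoming stairs" often means falling back to a simpler answer when the richer one is unavailable. A small sketch, assuming a hypothetical recommendation lookup that may throw when its backend is down:

```scala
import scala.util.Try

object Degrade {
  // The "stairs": a static answer that needs no backend.
  val defaults: List[String] = List("bestsellers")

  // `personalised` is a hypothetical call that may fail; when it does,
  // the caller still gets a usable (if less convenient) result.
  def recommendations(personalised: () => List[String]): List[String] =
    Try(personalised()).getOrElse(defaults)
}
```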
My crystal ball
Microservices
๏ Does one thing well
๏ Concurrent & Compartmentalized
๏ Location transparent
๏ Typed endpoints producing typed streams of data
๏ Exhibit compositionality
๏ Are async and non-blocking
๏ Support backpressure & flow control
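For the last point, perhaps the simplest form of backpressure is a bounded inbox that tells the producer "not now" instead of buffering without limit. The `BoundedInbox` below is a hypothetical sketch built on `java.util.concurrent.ArrayBlockingQueue`; real streaming libraries use richer demand signalling.

```scala
import java.util.concurrent.ArrayBlockingQueue

object Backpressure {
  final class BoundedInbox[A <: AnyRef](capacity: Int) {
    private val queue = new ArrayBlockingQueue[A](capacity)

    // Non-blocking: returns false when full, so the producer must slow
    // down, retry later, or drop the message.
    def offer(a: A): Boolean = queue.offer(a)

    // Returns None when there is nothing to consume.
    def poll(): Option[A] = Option(queue.poll())
  }
}
```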
Summary
๏Failure management
๏… is not Validation
๏… need not be boring
๏… is not optional
๏There are real consequences
๏… and there are ways to avoid them!
Don't worry—be happy. – Bobby McFerrin
Attribution: Steve Jurvetson
Thank you!
๏ @viktorklang on Twitter
๏ Want to know more?
๏ http://akka.io
๏ http://typesafe.com
๏ http://reactivemanifesto.org
End of transmission…