Failure the-good-parts

FAILURE The Good Parts

Viktor Klang Director of Engineering


“Build powerful, concurrent, resilient & distributed software more easily.”

FAILURE The Bad Parts

Ariane 5 - 4 June 1996

๏ 10 years of research

๏ $7 billion invested

๏ Exploded within a minute of take-off

๏ Loss estimate $370 million

๏ Why?

๏ Trying to stuff a 64-bit float into a 16-bit int

๏ o_O + wat
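The narrowing problem above is easy to reproduce. The following is a minimal Scala sketch, not the flight software: the names `Narrowing`, `toShortUnchecked` and `toShortChecked` are made up for illustration. It contrasts a silent narrowing conversion with one that makes the out-of-range case an explicit value.

```scala
object Narrowing {
  // Unchecked: Double -> Int truncates the fraction, then Int -> Short keeps
  // only the low 16 bits, so large values silently wrap around.
  def toShortUnchecked(d: Double): Short = d.toInt.toShort

  // Checked: an out-of-range (or NaN) input becomes an explicit None
  // instead of a wrong number. In-range values still lose their fraction.
  def toShortChecked(d: Double): Option[Short] =
    if (d.isNaN || d < Short.MinValue || d > Short.MaxValue) None
    else Some(d.toShort)
}
```

With these definitions, `Narrowing.toShortUnchecked(40000.0)` yields `-25536`, while `Narrowing.toShortChecked(40000.0)` yields `None` and leaves the decision to the caller.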

“Failure is an option. A Some(failure) to be exact.” – me

Failure Recovery

#define Failure

#undef Failure

Software fails

Runtime

๏VM (OpenJDK Issue Tracker)

๏OS

๏Drivers

๏Firmware

Runtime

๏Overload/Exhaustion

๏Stack

๏Heap

๏FDs

๏…

๏Starvation

Hardware fails

CPUs

"Related instructions that are affected by the bug are

FDIVP, FDIVR, FDIVRP, FIDIV, FIDIVR, FPREM, and FPREM1.

The instructions FPTAN and FPATAN are also susceptible"

http://en.wikipedia.org/wiki/Pentium_FDIV_bug


RAM

DRAM Errors in the Wild: A Large-Scale Field Study

Bianca Schroeder Dept. of Computer Science

University of Toronto Toronto, Canada


Eduardo Pinheiro Google Inc.

Mountain View, CA

Wolf-Dietrich Weber Google Inc.

Mountain View, CA


DRAM Errors in the wild

๏Memory errors were 15 to 120 times (!) more common than had previously been assumed.

๏More than 90% of the problems on a given platform were caused by about 20% of the machines that had errors.

DRAM Errors in the wild

(Credit: Bianca Schroeder, Eduardo Pinheiro, and Wolf-Dietrich Weber)

http://news.cnet.com/8301-30685_3-10370026-264.html

DRAM Errors in the wild

๏Temperature didn't seem to make a big difference.

๏Irreparable problems were more common than transient problems.

๏Increased number of errors with age, setting in as early as 10-18 months in the field.

HDDs

Failure Trends in a Large Disk Drive Population

Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz André Barroso

Google Inc. 1600 Amphitheatre Pkwy Mountain View, CA 94043

{edpin,wolf,luiz}@google.com

Failure Trends by age

Failure Trends by utilization and age

The Network is Reliable

LOL

Kyle Kingsbury's blog:

http://aphyr.com/posts/288-the-network-is-reliable

Wetware fails

“An expert is a man who has made all the mistakes which can be made, in a narrow field.” – Niels Bohr

Assumptions are bad

Quiz

val result = something(x,y)

Validation vs Failure

๏ Failure is unintentional

๏ Validation is intentional

Flows of information

๏ Results & Validation

๏ Failures & Recovery

๏ Don't complect them!
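One way to keep the two flows apart in Scala is to give each its own channel. This is a sketch under assumptions: `User`, `Invalid`, `validate`, `save` and the `store` parameter are invented for illustration and do not come from the talk. Validation outcomes are ordinary values in the result type, while failures travel separately in a `Try`.

```scala
import scala.util.Try

object Flows {
  // Validation is intentional and expected: model it in the result type.
  sealed trait Invalid
  case object EmptyName extends Invalid
  final case class NegativeAge(age: Int) extends Invalid

  final case class User(name: String, age: Int)

  def validate(name: String, age: Int): Either[Invalid, User] =
    if (name.isEmpty) Left(EmptyName)
    else if (age < 0) Left(NegativeAge(age))
    else Right(User(name, age))

  // Failure is unintentional: it travels on a separate channel (here, Try).
  // `store` is a hypothetical side-effecting dependency supplied by the caller.
  def save(user: User)(store: User => Unit): Try[User] =
    Try { store(user); user }
}
```

A caller can pattern match on the `Either` to answer the user, and hand a `Failure` from `save` to whatever component is responsible for recovery.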

Attribution:

http://en.wikipedia.org/wiki/User:RobChafer


The Little Vending Machine That Could

Failure Validation Handled

Outcome awareness

Known-Unknowns Unknown-Unknowns

Known-Knowns Unknown-Knowns

Failure awareness

Known-Unknowns Unknown-Unknowns

Known-Knowns Unknown-Knowns

Possibilities

๏ Result

๏ Invalid input

๏ Illegal value

๏ Illegal value combination

๏ Capability/Dependency violation

๏ Nothing

๏ Uninvoked

๏ Response lost

“Program testing can be used to show the presence of bugs, but never to show their absence!” – Edsger Dijkstra

Testing & Checking

๏ Testing is good for

๏ Known-Knowns

๏ Checking is good for

๏ Unknown-Knowns

๏ Known-Unknowns

๏ Unknown-Unknowns

๏ Conclusion

๏Use both!

Quiz

val result = println(x,y)

Death & Delay & Distributed Programs

๏ There is no apparent difference between death and delay in a distributed system

๏ "Distributed programming is all about retries and timeouts"

๏ Without distribution you'll always have a SPOF

๏ … but the more hardware you have, the higher the risk of failures
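The "retries and timeouts" idea can be sketched in a few lines of Scala. Everything here is an assumption made for illustration: the name `Retry.withRetries`, the blocking `Await` (used only to keep the example short), and the premise that the remote operation is idempotent, since a timed-out call may still have been executed.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.FiniteDuration
import scala.util.{Failure, Success, Try}

object Retry {
  // Waits at most `timeout` per attempt and makes at most `attempts` attempts.
  // A timeout cannot tell a dead callee from a slow one, so it is treated
  // like any other failed attempt.
  def withRetries[A](attempts: Int, timeout: FiniteDuration)(call: () => Future[A]): Try[A] = {
    require(attempts > 0, "attempts must be positive")
    val result = Try(Await.result(call(), timeout))
    result match {
      case Success(_)                 => result
      case Failure(_) if attempts > 1 => withRetries(attempts - 1, timeout)(call)
      case Failure(_)                 => result // out of attempts: report the last failure
    }
  }
}
```

A production version would more likely stay non-blocking and add a backoff between attempts. The sketch only shows that the caller, not the callee, decides how long to wait and how often to try again.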

“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.” – Leslie Lamport

Traditional Blocking RPC

๏What if: Request is lost

๏What if: Response is lost

๏Caller is held hostage by the Callee

๏… Stockholm Syndrome anyone?

http://steve.vinoski.net/pdf/IEEE-Convenience_Over_Correctness.pdf


Defensive programming

๏ "Paranoid programming"

๏ Mixes concerns

๏ Unclear responsibilities

๏ At best gives a false sense of security

๏ Yields systems that fail extraordinarily

try {
  val breakfast =
    try {
      prepare(new Breakfast)
    } catch {
      case ex: OutOfJamError => …
    } finally {
      …
    }
  eat(breakfast)
} catch {
  case ex: BreakfastOverflowError => …
} finally {
  …
}

Yes We Can Make Failure Management Fun

Distribution

Replication & Failover

CircuitBreakers

CircuitBreakers

๏Benefits

๏Relieves pressure on failing parts

๏Are self-healing

๏Can be operated manually
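The three benefits above can be illustrated with a deliberately small circuit breaker in plain Scala. This is a sketch, not Akka's `CircuitBreaker`: the names `Breaker`, `BreakerOpenException`, `trip` and `reset` are hypothetical, the clock is injected so it can be faked, and the lock is held while the protected call runs, which a real implementation would avoid.

```scala
import scala.util.{Failure, Success, Try}

final class BreakerOpenException extends RuntimeException("circuit breaker is open")

final class Breaker(maxFailures: Int, resetAfterMillis: Long, now: () => Long) {
  private var failures = 0
  private var openedAt: Option[Long] = None // Some(t) while the breaker is open

  def call[A](body: => A): Try[A] = synchronized {
    openedAt match {
      // Open and the reset timeout has not elapsed: fail fast and leave
      // the struggling callee alone (relieves pressure).
      case Some(t) if now() - t < resetAfterMillis =>
        Failure(new BreakerOpenException)
      // Closed, or half-open because the timeout elapsed: let the call through.
      case _ =>
        val result = Try(body)
        result match {
          case Success(_) =>
            failures = 0
            openedAt = None // self-healing: a success closes the breaker
          case Failure(_) =>
            failures += 1
            if (openedAt.isDefined || failures >= maxFailures) openedAt = Some(now())
        }
        result
    }
  }

  // Manual operation, for example from an admin console.
  def trip(): Unit = synchronized { openedAt = Some(now()) }
  def reset(): Unit = synchronized { failures = 0; openedAt = None }
}
```

While the breaker is open, `body` is never evaluated, so callers get an immediate failure instead of queueing up behind a component that is already in trouble.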

Supervisors

๏ Components dealing with the failure of subcomponents

๏ Decouples failure from validation

๏ Makes it obvious who is responsible for what
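A supervisor can be sketched without any actor library. The code below is an illustration under assumptions, not Akka's supervision API: `Service`, `Supervisor`, `Directive`, `Restart`, `Stop` and `maxRestarts` are names invented here. The service deals only with inputs and results, and the supervisor alone decides what happens when the service throws.

```scala
import scala.util.{Failure, Success, Try}

object Supervision {
  sealed trait Directive
  case object Restart extends Directive
  case object Stop extends Directive

  // The service only knows about inputs and results (including validation results).
  trait Service[I, O] { def handle(input: I): O }

  // The supervisor owns the failure channel: it decides what to do when the
  // service throws, and recreates the service from `factory` on Restart.
  final class Supervisor[I, O](factory: () => Service[I, O],
                               decide: Throwable => Directive,
                               maxRestarts: Int) {
    private var child: Option[Service[I, O]] = Some(factory())
    private var restarts = 0

    def send(input: I): Option[O] = child.flatMap { service =>
      Try(service.handle(input)) match {
        case Success(out) => Some(out)
        case Failure(ex) =>
          decide(ex) match {
            case Restart if restarts < maxRestarts =>
              restarts += 1
              child = Some(factory()) // fresh instance, possibly corrupted state is dropped
            case _ =>
              child = None // stopped: later inputs get no reply
          }
          None // the failing input is not retried in this sketch
      }
    }
  }
}
```

The service implementation contains no `try`/`catch` at all, which is the point of the bullets above: recovery logic lives in exactly one place.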

Service

Supervisor

Input

Result/Validation

Failures / Recovery

Supervisors

“Quis custodiet ipsos custodes?” (Who will guard the guards themselves?) – Decimus Iunius Iuvenalis

Supervision

Bulkheading

๏Compartmentalization

๏Prevent failures from cascading

๏Plays well with redundancy & failover
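On the JVM, one common way to compartmentalize is to give each component its own bounded thread pool, so that an exhausted pool in one compartment cannot starve another. The sketch below assumes two made-up compartments, `search` and `payments`, and arbitrary pool sizes.

```scala
import java.util.concurrent.{ExecutorService, Executors}
import scala.concurrent.ExecutionContext

object Bulkheads {
  // One small, dedicated pool per compartment (sizes are arbitrary here).
  val searchPool: ExecutorService = Executors.newFixedThreadPool(2)
  val paymentsPool: ExecutorService = Executors.newFixedThreadPool(2)

  // Futures started on `search` can only ever occupy the search threads.
  val search: ExecutionContext = ExecutionContext.fromExecutor(searchPool)
  val payments: ExecutionContext = ExecutionContext.fromExecutor(paymentsPool)

  // Interrupts running tasks and releases the threads of both compartments.
  def shutdown(): Unit = {
    searchPool.shutdownNow()
    paymentsPool.shutdownNow()
  }
}
```

If every search thread is stuck on a hanging dependency, a `Future { ... }(Bulkheads.payments)` still runs, because the two compartments share no threads.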

“An escalator can never break: it can only become stairs. You should never see an Escalator Temporarily Out Of Order sign, just Escalator Temporarily Stairs. Sorry for the convenience.” – Mitch Hedberg

Graceful degradation

My crystal ball

Microservices

๏ Do one thing well

๏ Concurrent & Compartmentalized

๏ Location transparent

๏ Typed endpoints producing typed streams of data

๏ Exhibit compositionality

๏ Are async and non-blocking

๏ Support backpressure & flow control

Summary

๏Failure management

๏… is not Validation

๏… need not be boring

๏… is not optional

๏There are real consequences

๏… and there are ways to avoid them!

“Don't worry, be happy.” – Bobby McFerrin

Attribution: Steve Jurvetson

Thank you!

๏ @viktorklang on Twitter

๏ Want to know more?

๏ http://akka.io

๏ http://typesafe.com

๏ http://reactivemanifesto.org


End of transmission…
