Monkeys in Lab Coats Automating Failure Testing Research at

QCon

The whole is greater than the sum of its parts.

- Aristotle [Metaphysics]

The Professor vs The Practitioner

Peter Alvaro

Ex-Berkeley, Ex-Industry

Assistant Prof @ Santa Cruz

Misses the calm of PhD life

Likes prototyping stuff

Kolton Andrus

Ex-Netflix, Ex-Amazon

‘Chaos’ Engineer

Misses his actual pager

Likes breaking stuff

Measures of Success

Academic

H-Index

Grant war chest

Department ranking

Industry

Availability (e.g. 99.99% uptime)

Number of Incidents

Reduce Operational Burden

An Unlikely Team?

Works Great!

but ... it’s manual

Surely there is a better way ...

Free lunch?

The End?

(Academia + Industry)

Let’s build it

“Can we, pretty please?”

Freedom and Responsibility

Core Value

Responsibility

Academic Industry

Prove that it works

Show that it scales

Find real bugs

The Big Idea: Lineage Driven Fault Injection

What could possibly go wrong?

Consider computation involving 100 services

Search Space: 2^100 executions

“Depth” of bugs

Single faults: search space 100 executions
Combination of 4 faults: search space 3M executions
Combination of 7 faults: search space 16B executions
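
The search-space figures can be checked with a few lines of Python. This is only a sketch of the arithmetic, assuming each of the 100 services either fails or does not, so a fault combination is a subset of services; the names `SERVICES` and `by_depth` are illustrative. The exact counts show that the slide figures are rounded (the 4-fault count is closer to 3.9M).

```python
from math import comb

SERVICES = 100  # number of services in the computation, as on the slides

# Any subset of services may fail, so there are 2^100 candidate executions.
all_subsets = 2 ** SERVICES

# Restricting the search to exactly k simultaneous faults leaves C(100, k).
by_depth = {k: comb(SERVICES, k) for k in (1, 4, 7)}

print(f"all fault combinations: {all_subsets:.3e}")
for k, count in by_depth.items():
    print(f"exactly {k} fault(s): {count:,}")
```

Even at depth 7 the space is far too large to test exhaustively in production, which is the motivation for guiding the search.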

Random Search

Search Space: 2^100 executions

Engineer-guided Search

Search Space: ???

Fault-tolerance “is just” redundancy

How do we find the redundancy?

Could a bad ‘thing’ ever happen?

Why did a good ‘thing’ happen?

Lineage-driven fault injection

Why did a good thing happen?

Consider its lineage.

[Figure: lineage graph with nodes “The write is stable”, “Stored on RepA”, “Stored on RepB”, Bcast1, Bcast2, and Client]

What could have gone wrong?

Faults are cuts in the lineage graph.

Is there a cut that breaks all supports?

What would have to go wrong?

(RepA OR Bcast1)
AND (RepA OR Bcast2)
AND (RepB OR Bcast2)
AND (RepB OR Bcast1)

Lineage-driven fault injection

Hypothesis: {Bcast1, Bcast2}
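
To see where a hypothesis such as {Bcast1, Bcast2} comes from, the four-clause formula can be solved by brute force. This is a minimal sketch, not the solver the prototype uses: it simply enumerates fault sets from smallest to largest and keeps those that contain at least one member of every clause. The names `CNF` and `minimal_hypotheses` are illustrative.

```python
from itertools import combinations

# One clause per support path in the lineage graph. Injecting at least one
# fault from every clause breaks every known way the write became stable.
CNF = [
    {"RepA", "Bcast1"},
    {"RepA", "Bcast2"},
    {"RepB", "Bcast2"},
    {"RepB", "Bcast1"},
]

def minimal_hypotheses(cnf):
    """Return the smallest fault sets that satisfy every clause (brute force)."""
    variables = sorted(set().union(*cnf))
    for size in range(1, len(variables) + 1):
        hits = [set(combo) for combo in combinations(variables, size)
                if all(set(combo) & clause for clause in cnf)]
        if hits:
            return hits
    return []

hypotheses = minimal_hypotheses(CNF)
print(hypotheses)
```

No single fault satisfies all four clauses; the two smallest solutions are losing both broadcasts or losing both replicas, and the first of these is the hypothesis shown on the slide.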

Search Space Reduction

Each Experiment finds a bug, OR

Reduces the Search space

The prototype system “Molly”

Recipe:

1. Start with a successful outcome. Work backwards.
2. Ask why it happened: Lineage
3. Convert lineage to a boolean formula and solve
4. Lather, rinse, repeat

[Diagram: loop through 1. Success, Why?, 2. Lineage, Encode, 3. CNF, Solve, Fail, 4. REPEAT]
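
The four-step recipe can be sketched as a small loop. Everything below is a toy built on stated assumptions rather than Molly's implementation: `toy_system` is a hypothetical system under test with a hidden retry path, lineage is represented as a list of support paths, and the solver is brute-force enumeration bounded by `max_faults`.

```python
from itertools import combinations

def toy_system(faults):
    """Hypothetical system under test: returns the support paths that
    survived the injected faults (an empty list means the write was lost)."""
    paths = [(b, r) for r in ("RepA", "RepB") for b in ("Bcast1", "Bcast2")]
    alive = [p for p in paths if not faults & set(p)]
    # Hidden redundancy: a retry to RepA kicks in when no broadcast path survived.
    if not alive and not faults & {"Retry", "RepA"}:
        alive = [("Retry", "RepA")]
    return alive

def solve(clauses, tried, max_faults):
    """Step 3: smallest untried fault set that hits every clause, or None."""
    variables = sorted(set().union(*clauses))
    for size in range(1, max_faults + 1):
        for combo in combinations(variables, size):
            candidate = frozenset(combo)
            if candidate not in tried and all(candidate & c for c in clauses):
                return candidate
    return None

def ldfi(run, max_faults=2):
    lineage = run(frozenset())              # 1. start with a successful outcome
    assert lineage, "the fault-free run must succeed"
    clauses, tried, experiments = set(), set(), 1
    while True:
        # 2. ask why it happened: each support path becomes one clause
        clauses |= {frozenset(path) for path in lineage}
        hypothesis = solve(clauses, tried, max_faults)
        if hypothesis is None:
            return None, experiments        # nothing left to try within the budget
        tried.add(hypothesis)
        lineage = run(hypothesis)           # 4. repeat with the faults injected
        experiments += 1
        if not lineage:
            return hypothesis, experiments  # the good outcome did not happen

result = ldfi(toy_system)
print(result)
```

In this toy run the first hypothesis, {Bcast1, Bcast2}, does not break the write because the retry path appears; that new lineage adds a clause and narrows the search, and the next experiment, {RepA, RepB}, makes the outcome fail.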

The Big Idea Meets Production

1. Start with a successful outcome

What is success?

“Start with the customer and work backwards”

Leadership Principle

Lesson 1

Work backwards from what you know

2. Ask why it happened

Request Tracing

Alternate Execution

Evolution over time

Redundancy through History

Lesson 2

Meet in the middle

3. Solve

A “small” matter of code

4. Lather, Rinse, Repeat

Turn the crank, right?

Idempotence

Bins and Balls

[Figure: a Request and the bins Class 1, Class 2, Class 3, [...], Class n, with requests r and r’]

Predicting Request Graphs

Some function f: Requests → Classes

f(Request) = Class n

Solve the Machine Learning problem? Or the Failure Testing one?

Simplest thing that will work?

Falcor Path Mapping

["bookmarks", "recent"]
["playlist", 0, "name"]
["ratings"]

=> "bookmarks,playlist,ratings"
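
The mapping above can be sketched as a tiny function from a request's paths to a class key. The slide shows only this one example, so the rule used here is an assumption: take the first element of each path, drop duplicates, sort, and join with commas. The function name `request_class` is hypothetical.

```python
def request_class(paths):
    """Map a list of Falcor-style paths to a request-class key.

    Assumed rule (not confirmed by the slides): keep only the first element
    of each path, de-duplicate, sort, and join with commas.
    """
    roots = sorted({str(path[0]) for path in paths if path})
    return ",".join(roots)

paths = [["bookmarks", "recent"], ["playlist", 0, "name"], ["ratings"]]
print(request_class(paths))
```

Sorting makes the key independent of the order in which paths appear, so two requests that touch the same top-level data land in the same class.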

Lesson 3

Adapt the theory to the reality

Many moons passed...

Does it work? YES!

Case study: “Netflix AppBoot”

Services: ~100
Search space (executions): 2^100 (~1,000,000,000,000,000,000,000,000,000,000)
Experiments performed: 200
Critical bugs found: 11

Future Work

Richer device metrics

Request class creation

Better experiment selection

Search prioritization

Richer lineage collection

Exploring temporal interleavings

Lessons

Work backwards from what you know

Meet in the middle

Adapt the theory to the reality

Academia + Industry

Thank You!

Peter Alvaro

@palvaro

Kolton Andrus

@KoltonAndrus

References

● Netflix Blog on ‘Automated Failure Testing’: http://techblog.netflix.com/2016/01/automated-failure-testing.html
● Netflix Blog on ‘Failure Injection Testing’: techblog.netflix.com/2014/10/fit-failure-injection-testing.html
● ‘Lineage Driven Fault Injection’: http://people.ucsc.edu/~palvaro/molly.pdf
● ‘Automating Failure Testing Research at Scale’: https://people.ucsc.edu/~palvaro/socc16.pdf

Photo Credits

● http://etc.usf.edu/clipart/4000/4048/children_7_lg.gif
● http://cdn.c.photoshelter.com/img-get2/I0000MIN8fL0q8AA/fit=1000x750/taiwan-hiking-river-tracing-walking.jpg
● http://i.imgur.com/iWKad22.jpg
● https://blogs.endjin.com/2014/05/event-stream-manipulation-using-rx-part-2/
● http://youpivot.com/category/features/
● https://www.cloudave.com/33427/boards-need-evolve-time/
● https://www.linkedin.com/pulse/amelia-packager-missing-data-imputation-ramprakash-veluchamy