
A Cloud Outage Under the Lens of “Profound Knowledge”


DESCRIPTION

This was supposed to be the SOTU at NYC Devopsdays 11/1/2012. I will also be doing a screencast later today.


@botchagalupe

Wednesday, October 31, 12

Welcome to Devopsdays NYC (first one), hell yeah... Normally I do the SOTU, but I have done a few this year and they're all about the same (on video).

This morning I am going to Demingize you all by telling you a cloud outage story. I'm going to use something called the System of Profound Knowledge (sound profound?).

#### No apologies for spelling and grammar in the notes. If that kind of stuff annoys you, please wait for the screencast.

GOALS

• Understanding Complexity
• Overview of SoPK
• Amazon’s Outage on 10/22/12

Goody, we are going to talk about big bad old Amazon’s outage last week...

SoPK - Understanding Complexity

An improvement might be an upgrade, a bug fix, an emergency change, or a new product. In the simplest model, one variable x changes the outcome y: x -> y (y is the dependent variable).

SoPK - Understanding Complexity

In real life you get many variables (the messiness of life). Each can have a direct effect on the dependent variable (y).

SoPK - Understanding Complexity

[Diagram labels: T1, T2]

You also get time-dependent variables.

SoPK - Understanding Complexity

[Diagram labels: T1, T2]

There are also indirect effects on the dependent variable (y). For example, X1 in concert with X4 conjointly affects the dependent variable Y, as does X3 -> X4. This is a different model than X -> Y.
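A toy sketch of the difference between the simple X -> Y model and a model with conjoint and indirect effects. The functions and coefficients are made up for illustration; nothing here comes from real data or from the slides' diagram beyond the variable names.

```python
def simple_model(x):
    """X -> Y: one variable drives the outcome directly (coefficient is arbitrary)."""
    return 2.0 * x


def conjoint_model(x1, x3, x4_base):
    """Y depends on X1 and X4 together; X3 acts on Y only through X4."""
    x4 = x4_base + x3   # indirect path: X3 -> X4
    return x1 * x4      # conjoint effect: X1 matters only when X4 is non-zero


# Changing X1 alone does nothing while X4 is zero...
no_effect = conjoint_model(x1=5, x3=0, x4_base=0)
# ...but the same X1 has an effect once X3 has pushed X4 up.
with_x3 = conjoint_model(x1=5, x3=2, x4_base=0)
```

Someone who only looks at X1 would conclude it has no effect in the first case and a big one in the second, which is the trap of assuming X -> Y.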

System of Profound Knowledge (SoPK)

Do we have any photographers in the audience? Let's use a camera lens as a metaphor for SoPK. They call this the exposure triangle. To take a perfect picture of an event you must have a good lens and understand how it works.
• The ISO must be understood for sensitivity to light.
• The aperture must be understood for DOF (a portrait or an area).
• The shutter speed must be understood for motion.

System of Profound Knowledge (SoPK)

• Appreciation of a system
• Knowledge of variation
• Theory of knowledge
• Knowledge of psychology

Well, Dr. Deming gave us such a lens to break down complexity (the real world), just like a camera does. Let's say it's a lens for the improvement of something (an enhancement, a bug fix, a new product idea): an outcome, X -> Y. Dr. Deming gave us a tool called “The System of Profound Knowledge”.

SoPK is a lens to break down complexity and give ourselves an advantage so that we do not oversimplify what we are trying to do. In other words, it clears up the messiness of real life just like a camera lens does.

(S) Appreciation of a System - Systems thinking. Deming would say it means understanding the AIM of a system. Deming said every system must have an AIM. Is your AIM to keep a server up or to protect a customer SLA? (They might not be the same thing, as we will soon see.) Eli Goldratt (TOC) would say global optimization over local optimization, even if a subsystem ends up suboptimal. It also means understanding subsystems and dependent systems.

(V) Variation - Not understanding variation is the root of all evil. Deming would get mad at people for knee-jerk reactions due to not understanding the kind of variation. How do you understand variation? Statistics (primarily the standard deviation and its relationship to a process, i.e., its distribution). Let me give you an example. A large cloud provider rate-limits API calls at 100 per (x). For most customers that's fine; others, however, get treated as a DDoS. Where did they get 100? It had to be a guess. If they understood SPC (variation), they might derive the number from data and have a CI process in place for when they found special-cause variation.

(K) Theory of knowledge is the simplest but the hardest for most people to understand. Simply put, it is applying the scientific method to everything you do. Deming says you must have theory to have knowledge, you can't have knowledge without prediction, and a prediction without a test is useless. This is PDSA; others call it AMC: AIM, Measure (a. what process are you going to change, b. measure whether the change worked), Change. You have to test any improvement to see if it worked, failed, or did nothing. Imagine someone starting a failover system with automation but not testing to see if it really worked (could never happen).

(P) Knowledge of psychology: another easy one, but the hardest to implement. Understanding behavior. Why people do the things they do. Tribal behavior. Things that are important to one group might not be important to other groups. Understanding human behavior (another lens factor). Worldviews. Imagine a server that has software on it from two totally different dev groups. Further imagine these two groups' worldviews are far apart. One does agile CI, TDD, BDD, CD; the other has never even heard of those things.
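To make the (V) point concrete, here is a rough sketch of deriving a rate limit from observed data instead of guessing "100". It assumes a hypothetical list of per-interval API call counts for one customer and uses a plain mean plus or minus three sample standard deviations. A textbook SPC chart would estimate the spread differently (for example from moving ranges), so treat this as an illustration only.

```python
from statistics import mean, stdev


def control_limits(samples, sigmas=3.0):
    """Return (lower, center, upper) limits for a list of observations."""
    center = mean(samples)
    spread = stdev(samples)
    return max(0.0, center - sigmas * spread), center, center + sigmas * spread


def is_special_cause(value, limits):
    """A point outside the limits is a candidate for special-cause variation."""
    lower, _, upper = limits
    return value < lower or value > upper


# Hypothetical API calls per interval for one customer (made-up numbers).
history = [80, 95, 110, 120, 90, 105, 100, 115, 85, 100]
limits = control_limits(history)
```

With these made-up numbers, a burst of 120 calls sits inside the limits (common-cause variation) even though a flat cap of 100 would have rejected it, while a burst of 200 falls outside and deserves a look.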

Amazon’s EBS Outage 10/22/2012

This is one System

Disclaimers: this is Amazon's true story, my knowledge of SoPK, and literary license for training purposes.

An EBS services outage, correct? The way it's supposed to work...
• Fleet monitor (hardware monitoring)
• Failover for both the fleet monitor and the EBS server
• DNS, of course
• Metrics: performance monitoring to disk from the agent...
• Of course humans.. remember they are part of every system

We will give Amazon the benefit of the doubt that this is in their value stream map. However, a lot of orgs do not have this “human” process in the VSM. Pre-automation .. remediation systems (complex adaptive systems), or just plain old humans.

LENS OF SOPK
This is a system, not just the EBS service. Hint: why did I say it's just one system?

Amazon’s EBS Outage 10/22/2012

X0 -> Server Failure

The monitor server has a failure (system down).

X0 - Fleet Management monitoring server fails

Amazon’s EBS Outage 10/22/2012

X1 -> Failover

X1 - Fleet Management failover. Anyone see this first issue?

Lens #1 (S) Not having a systems view: not seeing this as dependent systems. You might say surely they had automation to update DNS. However, I would say no, because ..
Lens #2 (K) Theory cannot be an unmeasured guess. Whoever did the failover (automation or manual) apparently didn't have a proper measure for success. They should have verified that they were actually using the new server (duh).
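A minimal sketch of what "a proper measure for success" could look like for a failover: after switching, check that the service name actually resolves to the new server and no longer to the old one. The hostname and IP addresses in the usage note are hypothetical, and this says nothing about how Amazon's tooling works. The resolver is passed in so the check can be exercised without touching real DNS.

```python
import socket


def resolve_ipv4(hostname):
    """Return the set of IPv4 addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, None, socket.AF_INET)
    return {info[4][0] for info in infos}


def failover_verified(hostname, new_ip, old_ip, resolver=resolve_ipv4):
    """True only if the name resolves to the new server and not to the old one."""
    addresses = resolver(hostname)
    return new_ip in addresses and old_ip not in addresses
```

For example, `failover_verified("fleet-monitor.internal", "10.0.0.2", "10.0.0.1")` would return False for as long as DNS still hands out the old address, which is exactly the condition that went unnoticed in this story.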

Amazon’s Outage 10/22/2012

X2 -> DNS Failure

DNS does not propagate....

X2 - DNS propagation failure

Lens #1 (P) We could argue that maybe the fleet servers are managed by hardware guys and DNS by systems guys, and maybe they are different cultural tribes and don't understand the importance of each other's work. Maybe they don't go to lunch together.

Amazon’s Outage 10/22/2012

X4 -> Memory Leak

((X0->X1->X2)->X4)
X3->X4

So now we have fixed the first problem of the bad server. Seemed like a flawless operation, right? But DNS still says the offline server is the monitor server. The agent on the EBS server (services) keeps trying to send hw data to the original fleet monitor server. However, it is fault tolerant by design so that it does not screw w/ production if it fails.. The hardware guys probably don't even know about this. They probably don't monitor things like server or process memory.

X4 - Memory leak in the hardware agent on the EBS server

Lens #1 (S) The hardware guys should know that they are part of a bigger system than just the hardware monitor. Was there a systems view for QA and smoke testing of agent code changes?
Lens #2 (P) Are the hardware (agent) developers doing TDD/CD like the EBS devs? Do they have the same theories? Do the EBS guys do CD smoke testing with the hardware monitoring agents?

X3 is the HW agent devs' bug.
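The notes don't say how the agent leaked memory, so the following is only one hypothetical way a "fault tolerant" sender can do it: it swallows delivery errors and keeps every undelivered report in an unbounded buffer. The `Agent` class and its `max_pending` parameter are invented for illustration.

```python
from collections import deque


class Agent:
    """Hypothetical monitoring agent that buffers reports it cannot deliver."""

    def __init__(self, send, max_pending=None):
        self.send = send  # callable that raises OSError when delivery fails
        # maxlen=None means unbounded; with a maxlen, the oldest entries are dropped
        self.pending = deque(maxlen=max_pending)

    def report(self, sample):
        self.pending.append(sample)
        try:
            while self.pending:
                self.send(self.pending[0])
                self.pending.popleft()
        except OSError:
            pass  # "fault tolerant": swallow the error, never disturb production


def unreachable(_sample):
    """Stand-in for a monitor server that DNS still points at but is offline."""
    raise OSError("monitor server is offline")


leaky = Agent(unreachable)
bounded = Agent(unreachable, max_pending=100)
for i in range(1000):
    leaky.report(i)
    bounded.report(i)
```

After the loop the unbounded agent is holding all 1000 samples and would keep growing, while the bounded one holds only the latest 100. Neither raises an error, which is why nobody watching for failures would notice.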

Amazon’s Outage 10/22/2012

X5 -> Out of Memory

(X3, X4)->X5

The fault-tolerant code has a memory leak, and the fault tolerance masks the issue it creates: low memory on the EBS servers... The EBS server starts to run out of memory.

X5 - Out of Memory

Amazon’s Outage 10/22/2012

X6 -> Throttling (X->Y)

X5->X6

The low memory wakes up the humans (yellow), who see low memory in the systems monitoring DB. The fucking humans get involved and all hell breaks loose. The humans see that something is wrong: memory is low on the EBS servers. They start to throttle API calls due to the low memory, without knowing why it is low (local optimization).

X6 - Throttling. The systems guys see this as an X->Y issue (low memory, therefore throttle). However, really it's (X0,X2,X4) in concert with (X3), conjoined to cause X6. Chances are they might not even know about the fleet server failover. The hardware guys have no idea that there is a DNS issue. (X2->X3) (X4,X5)

Lens #1 (S) (X->Y) The humans try to correct the memory issue with throttling, and they don't understand hardware monitoring as a subsystem.
Lens #2 (V) The systems guys don't understand common- vs special-cause variation .. they react to an “S” that should have been a “C”. Turns out ...
Lens #3 (K) Measures without results are not fixes (throttling). They should have looked at the results. There are three potential outcomes: a) it gets better, b) it stays the same, c) it gets worse. What do you think happened?
Lens #4 (P) Tribal understanding of the behavior differences between the hardware guys and the systems guys.
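Lens #3 can be sketched as a tiny before/after check. This assumes you have some metric for the thing you are trying to fix (a hypothetical API error rate, say) and compares it before and after the change. The function name, the 5% tolerance, and the example numbers are all made up.

```python
def classify_change(before, after, tolerance=0.05, lower_is_better=True):
    """Compare a metric before and after a change.

    Returns "better", "same", or "worse". Relative changes smaller than
    `tolerance` count as "same".
    """
    if before == 0:
        raise ValueError("baseline must be non-zero for a relative comparison")
    change = (after - before) / abs(before)
    if abs(change) < tolerance:
        return "same"
    improved = change < 0 if lower_is_better else change > 0
    return "better" if improved else "worse"


# Hypothetical error counts per minute, before and after throttling.
outcome = classify_change(before=100, after=150)
```

Here `outcome` is "worse", which is the signal to undo the change and rethink the theory rather than pile on another fix.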

Amazon’s Outage 10/22/2012

X7 -> API Issues

(X6->X7)
(X5->X7)

Things continue to get worse.... Some customers (yellow) were already having issues, but throttling makes it worse for them.

X7 - API Issues

Lens #3 (K) Measures without results are not fixes (throttling). They should have looked at the results. There are three potential outcomes: a) it gets better, b) it stays the same, c) it gets worse. What do you think happened?

X7 is caused by both X6 and X5 independently (X6 just made things worse).
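The arrows drawn on the slides so far can be written down as a small graph and walked backwards. The edges below are transcribed from the slide overlays (X0->X1->X2->X4, X3->X4, (X3,X4)->X5, X5->X6, X6->X7, X5->X7). The helper functions are just an illustrative sketch of the systems view, not part of the original talk.

```python
# Each event maps to the events the slides show as feeding into it.
CAUSES = {
    "X1": ["X0"],        # failover follows the monitor server failure
    "X2": ["X1"],        # DNS propagation failure follows the failover
    "X4": ["X2", "X3"],  # memory leak: stale DNS plus the agent bug
    "X5": ["X3", "X4"],  # out of memory
    "X6": ["X5"],        # throttling
    "X7": ["X5", "X6"],  # API issues
}


def upstream(event, causes=CAUSES):
    """Return every event that directly or indirectly contributes to `event`."""
    seen = set()
    stack = list(causes.get(event, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(causes.get(node, []))
    return seen


def root_causes(event, causes=CAUSES):
    """Upstream events that have no recorded cause of their own."""
    return {e for e in upstream(event, causes) if e not in causes}
```

Calling `root_causes("X7")` gives X0 and X3: the customer-facing API issues trace back to a monitor server failure and an agent bug, neither of which is visible to someone treating low memory as a simple X -> Y problem.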

Page 49: A Cloud Outage Under the Lens of  “Profound Knowledge”

Amazon’s Outage 10/22/2012

17

X9 -> EBS Failover

Wednesday, October 31, 12

The still can’t pin the problem and they decide to force a failover (red) EBS ServerAgin local optimization

X9 EBS Failover

Lens #1 (K) Measures with out results are not fixes (failover). They should have looked at the results.Three potential outcomes a) get better b) Stays the same c) Gets worse. What do you think happened?Lens #2 (P) Not understanding customer behavior.. Customers increase there actions (API) calls testing services, qa, smoke.

X8 is customer hammering the services...

Page 50: A Cloud Outage Under the Lens of  “Profound Knowledge”

Amazon’s Outage 10/22/2012

17

X9 -> EBS Failover

Wednesday, October 31, 12

The still can’t pin the problem and they decide to force a failover (red) EBS ServerAgin local optimization

X9 EBS Failover

Lens #1 (K) Measures with out results are not fixes (failover). They should have looked at the results.Three potential outcomes a) get better b) Stays the same c) Gets worse. What do you think happened?Lens #2 (P) Not understanding customer behavior.. Customers increase there actions (API) calls testing services, qa, smoke.

X8 is customer hammering the services...

Page 51: A Cloud Outage Under the Lens of  “Profound Knowledge”

Amazon’s Outage 10/22/2012

17

X9 -> EBS Failover

(X7->X9)

Wednesday, October 31, 12

The still can’t pin the problem and they decide to force a failover (red) EBS ServerAgin local optimization

X9 EBS Failover

Lens #1 (K) Measures with out results are not fixes (failover). They should have looked at the results.Three potential outcomes a) get better b) Stays the same c) Gets worse. What do you think happened?Lens #2 (P) Not understanding customer behavior.. Customers increase there actions (API) calls testing services, qa, smoke.

X8 is customer hammering the services...

Page 52: A Cloud Outage Under the Lens of  “Profound Knowledge”

Amazon’s Outage 10/22/2012

17

X9 -> EBS Failover

(X7->X9)(X8->X9)

They still can’t pin down the problem, so they decide to force a failover of the EBS server (red). Again, local optimization.

X9 EBS Failover

Lens #1 (K): Measures without results are not fixes (failover). They should have looked at the results. Three potential outcomes: a) gets better, b) stays the same, c) gets worse. What do you think happened?

Lens #2 (P): Not understanding customer behavior. Customers increase their activity: API calls to test services, QA, smoke tests.

X8 is customers hammering the services...

Page 57: A Cloud Outage Under the Lens of  “Profound Knowledge”

Amazon’s Outage 10/22/2012

18

X10 -> Twitter Effect

(X8->X10)(X9->X10)

Meanwhile, the throttling causes a bigger problem. The Twitter effect kicks in: people start hammering AWS API web services to test availability. People start testing out all sorts of aaS services (smoke tests start). Curiosity tests start. Come on, admit it, how many of you tried to spank the monkey last Monday? (Netflix Chaos Monkey) Load goes up...

X10 Twitter effect

Introducing another external subsystem.

Lens #1 (S): Not including this whole other subsystem (admittedly this is hard).

Lens #2 (P): Not understanding the Twitter effect and other outside ecosystems... They might have handled this, but for the purposes of this presentation it’s fun to assume they didn’t, as a learning exercise.

X9 is an aggregate effect that spreads from one customer to other customers and non-customers.
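
The "load goes up" part can be sketched as a toy feedback loop. This is a deliberately crude model, not a description of what AWS saw: it assumes every rejected request gets retried `retry_factor` times in the next time step, on top of a constant base load. The function name and all numbers are made up for illustration.

```python
def simulate_load(base_load: int, capacity: int, retry_factor: int, steps: int):
    """Toy model of retry amplification.

    Each step, anything above `capacity` is rejected. In the next step the
    offered load is the base load plus `retry_factor` retries per rejection.
    Returns a list of (offered, rejected) tuples, one per step.
    """
    history = []
    offered = base_load
    for _ in range(steps):
        rejected = max(0, offered - capacity)
        history.append((offered, rejected))
        offered = base_load + retry_factor * rejected
    return history


if __name__ == "__main__":
    # Hypothetical: enough capacity, nothing is rejected and load stays flat.
    print(simulate_load(base_load=100, capacity=120, retry_factor=2, steps=4))
    # Hypothetical: throttled below the base load, retries pile up every step.
    print(simulate_load(base_load=100, capacity=80, retry_factor=2, steps=4))
```

In the second run the offered load grows at every step even though the base load never changes, which is the kind of behavior Lens (P) is asking you to anticipate.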

Page 61: A Cloud Outage Under the Lens of  “Profound Knowledge”

Amazon’s Outage 10/22/2012

19

X11 -> FA Server Dies

(X6->X11)(X10->X11)

The problem becomes a systemic breakdown... Now the backup (FA) server fails.

X11 Failover server fails

Could be X6 (throttling). Might be X10 (Twitter effect). Who the f knows...
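
The "who knows" feeling shows up as soon as you write the arrows down. Below is a small sketch that records the cause -> effect pairs drawn on slides 16 to 19 and walks backwards from X11. The edge list is taken from the slides; the function name and the traversal are just one illustrative way to do it.

```python
# Cause -> effect pairs as drawn on slides 16 to 19.
EDGES = [
    ("X6", "X7"), ("X5", "X7"),
    ("X7", "X9"), ("X8", "X9"),
    ("X8", "X10"), ("X9", "X10"),
    ("X6", "X11"), ("X10", "X11"),
]


def upstream_causes(effect, edges=EDGES):
    """Return every variable that can reach `effect` by following the arrows."""
    parents = {}
    for cause, eff in edges:
        parents.setdefault(eff, set()).add(cause)

    seen = set()
    stack = [effect]
    while stack:
        node = stack.pop()
        for cause in parents.get(node, ()):
            if cause not in seen:
                seen.add(cause)
                stack.append(cause)
    return seen


if __name__ == "__main__":
    causes = upstream_causes("X11")
    # Sort numerically so X10 comes after X9.
    print(sorted(causes, key=lambda name: int(name[1:])))
```

Every variable from X5 through X10 ends up upstream of X11, so the failover server dying cannot be pinned on a single X.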

Page 64: A Cloud Outage Under the Lens of  “Profound Knowledge”

Amazon’s Outage 10/22/2012

20

Systemic Outage

X->Y

The whole system is hosed... The complexity was masked. Too bad they had not read Deming...