Gray Failure: The Achilles’ Heel of Cloud-Scale Systems

Ryan Huang, Chuanxiong Guo, Lidong Zhou, Jacob R. Lorch, Yingnong Dang, Murali Chintalapati, Randolph Yao

Outline

Background & the gray failure problem

Real-world gray failure cases in Azure

A model and a definition for gray failure

differential observability

Potential future directions


Rapid Growth of Cloud System Infrastructure

» Software user shift

direct: e.g., Office 365, Google Drive

indirect: e.g., Netflix on AWS

» Workload diversity

websites, workflows, big data, machine learning

» Internal composition

more data centers, larger clusters, specialized hardware

containerization, micro-services

Demanding Requirements on Availability

» Users intolerant of service downtime

» Failures are more costly

SLA violations, reputation hit, customer loss, wasted engineering resources

» New availability bar

from 3 nines to 5 or 6 nines

Key: Embrace Fault-tolerance!

Rich history since the 1950s

Steps: Redundancy, Correctness, Decomposition

Building blocks: process pair, RAID, primary/backup, state machine replication, transaction, checkpoint, chain replication, virtual synchrony, Paxos, Zab, PBFT, Gossip, Zyzzyva, N-version, 2PC, triple modular redundancy, erasure coding

Status Quo

» By and large, the efforts paid off

many faults successfully detected, tolerated, and repaired every day

few global outages

99.9% availability is achievable

» But moving forward…

simplistic assumptions start to break

reasoning about availability becomes hard

frequent bizarre phenomena in production

99.999% availability and beyond is challenging

[figure: as scale & complexity grow over time, does availability keep up?]

A common theme: the overlooked gray failure problem

The Elephant in the Cloud - Gray Failure

Fail-stop: a component either works correctly or stops

(techniques built for this model: process pair, RAID, primary/backup, chain replication, 2PC, TMR, erasure coding, Paxos, virtual synchrony, Zab)

Byzantine: a component may behave arbitrarily

(techniques built for this model: PBFT, Zyzzyva, UpRight, Q/U, BAR Gossip, Aliph)

Gray failure: a component appears to be still working but is in fact experiencing severe issues

symptom

• subtle and ambiguous: e.g., switch random packet loss, non-fatal exceptions, memory thrashing, flaky disk I/O, overload…

occurrence

• across the s/w and h/w stack in the infrastructure, due to various defects

• behind most service incidents we’ve seen in Azure

danger

• fault tolerance becomes ineffective or counterproductive

• faults take engineers & designers huge effort to nail down

• teams play the blame game with each other

Real-world Gray Failure Cases in Azure


Case I: Redundancy in Datacenter Network (1)

[figure: a datacenter network with Core, Aggregation, and ToR switch layers; servers A and B communicate over redundant core routes r1 and r2]

If a core switch crashes, traffic is simply routed around it; increasing the number of core switches helps with availability.

Case I: Redundancy in Datacenter Network (2)

Workload: a single round trip between A and B, with one core switch randomly dropping packets with probability p

• packets will not be re-routed → application glitches or increased latency

• increasing the number of core switches may not affect the chance of being affected

Case I: Redundancy in Datacenter Network (3)

Workload: send multiple requests and wait for all to finish (e.g., search), fanning out across servers A–E and core routes r1–r4

• high chance of involving every core switch for each front-end request

• gray failure at any core switch will cause delay

• more core switches → worse tail latencies: 1 − (1 − p)^n (a quick sanity check of this formula follows below)
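The tail-latency effect can be sanity-checked with a few lines of Python. This is an illustrative sketch, not code from the talk; it assumes that each of the n parallel sub-requests is independently affected with probability p (for example, by traversing the lossy core switch), which matches the slide's 1 − (1 − p)^n expression.

```python
# Sketch (not from the talk): probability that a fan-out request is hurt by a
# gray-failing core switch, assuming each of n parallel sub-requests is
# independently affected with probability p.

def p_request_affected(p: float, n: int) -> float:
    """Probability that at least one of n independent sub-requests is affected."""
    return 1.0 - (1.0 - p) ** n

if __name__ == "__main__":
    p = 0.02  # hypothetical per-sub-request impact probability (e.g., 2% loss)
    for n in (1, 4, 16, 64):
        print(f"n={n:3d} sub-requests -> P(request affected) = {p_request_affected(p, n):.3f}")
```

Even a small per-flow probability compounds quickly as the fan-out grows, which is why adding core switches does not help and can make the tail worse.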

Case II: Failure Detector in Compute Service

[figure: a Fabric Controller (Primary, with Replicas) manages Physical Nodes; each node runs a Host OS with a Host Agent, a Hypervisor and vSwitch, and VMs VM0–VM3, each hosting a Guest Agent and a Role Instance, all connected through the Network]

Hierarchical agents catch failures in different layers:

• When the guest agent in VM3 crashes, the host agent reports "VM3 dead" and the fabric controller reboots the VM.

• When VM3 instead hits a connectivity issue toward the network, the host agent still reports "VM3 good" and no action is taken, yet customers can’t SSH/RDP into the VM.

Case III: Recovery in Storage Service

[figure: a Stream Manager in front of Extent Nodes EN1–EN5, serving requests from several Front Ends]

• EN3 runs low on free blocks; probed by the Stream Manager, EN1, EN2, and EN3 are all deemed healthy.

• A front-end write to EN3 crashes; the front end reports "EN3 is down" while the Stream Manager still sees EN3 as healthy.

• The cycle repeats: EN3 again runs low on free blocks, the Stream Manager deems EN2, EN3, and EN4 healthy, another write crashes, and "EN3 is down" is reported again.

• Eventually EN3 is declared broken and removed, which triggers re-replication and fragmentation across the remaining extent nodes.

Understanding Gray Failure

The Many Faces of Gray Failure

So, what is a gray failure?

• A performance issue.

• A problem that some think is a failure but others think is not, e.g., a 2% packet loss. The ambiguity itself defines gray failure; if everyone agrees it is a problem, it is not a gray failure.

• A Heisenbug: sometimes it occurs and sometimes it does not.

• The system is failing slowly, e.g., a memory leak.

• An increasing number of transient errors in the system, which reduces system capacity even though the system still manages to keep working.

What is a formal way to define and study gray failure, one that potentially sheds light on how to address it?

An Abstract Model

[figure: a System containing a System Core; apps App1…Appn use the system; an Observer probes the system core and reports to a Reactor]

Note: these are logical entities

• the system can be a distributed storage system, an IaaS platform, a data center network, a search engine, …

• the apps can be a web app, an analytics job, another system, or a user/operator

Gray Failure Trait: Differential Observability

[figure: the abstract model again, now comparing the observer’s observation of the system core with the apps’ observations]

Different entities come to different conclusions about whether the system is working or not. The four possible combinations (a small code sketch follows below):

                              all apps deem system good            some app deems system bad
observer deems system good    ❶ healthy, or a minor latent fault   ❷ Gray Failure!
observer deems system bad     ❸ fault tolerance at play,           ❹ crash, fail-stop
                                or a false positive
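A minimal sketch of these quadrants, written as my own illustration rather than code from the talk: the names `observer_ok` and `app_ok` are hypothetical, and the classification simply encodes the table above.

```python
# Sketch: classify a system's state from the observer's verdict and the apps'
# verdicts, following the differential-observability quadrants on this slide.
from typing import Dict

def classify(observer_ok: bool, app_ok: Dict[str, bool]) -> str:
    some_app_unhappy = not all(app_ok.values())
    if observer_ok and not some_app_unhappy:
        return "healthy (or minor latent fault)"          # quadrant 1
    if observer_ok and some_app_unhappy:
        return "GRAY FAILURE"                              # quadrant 2
    if not observer_ok and not some_app_unhappy:
        return "fault tolerance at play / false positive"  # quadrant 3
    return "crash / fail-stop"                             # quadrant 4

if __name__ == "__main__":
    # Hypothetical example: the observer's probe passes, but one app sees
    # failed writes -> differential observability -> gray failure.
    print(classify(observer_ok=True, app_ok={"app1": True, "app2": False}))
```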

Applying the Model

Case I

term          example
system        data center network
observer      switch peers
reactor       routing protocols
app1          simple web server
app2          search engine
gray failure  app2 observed glitches but neighbors of the bad switch (and app1) didn’t

Case III

term          example
system        Azure storage service
observer      storage master
reactor       storage master
app1..n       Azure VMs
gray failure  some VMs hit remote I/O exceptions while the storage master deems the ENs healthy

Case III through the model, over time:

- EN3 degraded → observation difference

- EN3 crashed & rebooted → no observation difference

- EN3 degraded again → observation difference

- EN3 crashed & removed → no observation difference

- more ENs affected → more observation difference

- eventually the observation difference is completely gone, but by then it is too late

Direction 1: Close Observation Gap

Traditional failure detector → multi-dimensional health monitor (a small sketch follows below)

Doctors can’t use heartbeat as the only signal of a patient’s health.

[figure: a heartbeat-based hierarchical failure detector, contrasted with in-VM performance counters before and during the gray failure incident of Case II]
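One way to read "multi-dimensional health monitor" is to combine several signals instead of a single heartbeat. The sketch below is a simplified illustration only; the signal names and thresholds are assumptions of mine, not the Azure implementation.

```python
# Sketch: a health verdict from several signals instead of heartbeat alone.
# Signal names and thresholds are hypothetical illustrations.

THRESHOLDS = {
    "heartbeat_ok":       lambda v: v is True,
    "disk_io_latency_ms": lambda v: v < 200,
    "packet_loss_pct":    lambda v: v < 1.0,
    "free_memory_mb":     lambda v: v > 256,
}

def health_verdict(signals: dict) -> tuple[bool, list[str]]:
    """Return (healthy, list of failing dimensions)."""
    failing = [name for name, ok in THRESHOLDS.items()
               if name in signals and not ok(signals[name])]
    return (len(failing) == 0, failing)

if __name__ == "__main__":
    # Case II flavor: the heartbeat still passes, but in-VM counters look bad.
    healthy, failing = health_verdict({
        "heartbeat_ok": True,
        "disk_io_latency_ms": 950,
        "packet_loss_pct": 0.1,
        "free_memory_mb": 32,
    })
    print("healthy:", healthy, "failing dimensions:", failing)
```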

Direction 2: Approximate Application View

• Infeasible to completely eliminate differential observability due to multi-tenancy and modularity constraints

• The system sends probes that emulate common application usage (a toy probe sketch follows below)

PingMesh [SIGCOMM ’15]

[figure: a Clos datacenter network with Spine, Leaf, and ToR layers grouped into pods and podsets above the servers; a failure in the Spine layer]
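To illustrate the idea of probing along the paths applications actually use, here is a toy sketch in the spirit of (but not the implementation of) PingMesh: plain TCP connects to a set of peers, timed end to end. The host list, port, and timeout are placeholders I made up.

```python
# Sketch (illustrative only, not PingMesh): time server-to-server round trips
# with plain TCP connects, so the probe exercises roughly the same network
# paths that applications use instead of a device-level heartbeat.
import socket
import time
from typing import Optional

HOSTS = ["10.0.0.1", "10.0.0.2", "10.0.1.1"]  # placeholder addresses
PORT, TIMEOUT_S = 80, 1.0                      # placeholder service port

def connect_latency_ms(dst: str) -> Optional[float]:
    """Return round-trip connect latency in ms, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((dst, PORT), timeout=TIMEOUT_S):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        return None

if __name__ == "__main__":
    # In a real mesh every server probes every peer; this toy probes only from
    # the local host and records per-destination latency.
    for dst in HOSTS:
        latency = connect_latency_ms(dst)
        status = f"{latency:.1f} ms" if latency is not None else "FAILED"
        print(f"{dst}: {status}")
```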

Direction 3: Leverage Power of Scale

Break the observation silos so that observations complement each other

Observable vs. Detectable

• although an end-to-end probe may expose differential observability, it may not detect who is responsible for the difference

• example solution: infer from the signals of many network devices to localize the fault

Blame game

• A thinks B is problematic, B thinks A is problematic

• example solution: correlate VM failure events with topology information to judge (a toy sketch follows below)
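The correlation idea can be sketched as a simple vote: count how often each network device lies on the path of a VM that reports problems, and rank the common suspects. All device names, paths, and failure events below are made-up placeholders, not data from the talk.

```python
# Sketch: localize a likely-faulty device by correlating VM failure reports
# with topology. Names and paths are hypothetical placeholders.
from collections import Counter

# Topology: which devices each VM's traffic traverses (ToR -> agg -> core).
VM_PATHS = {
    "vm-a": ["tor-1", "agg-1", "core-2"],
    "vm-b": ["tor-1", "agg-2", "core-2"],
    "vm-c": ["tor-3", "agg-2", "core-2"],
    "vm-d": ["tor-4", "agg-3", "core-1"],
}

def rank_suspects(failing_vms: list[str]) -> list[tuple[str, int]]:
    """Rank devices by how many failing VMs' paths include them."""
    votes = Counter(dev for vm in failing_vms for dev in VM_PATHS.get(vm, []))
    return votes.most_common()

if __name__ == "__main__":
    # Suppose vm-a, vm-b, vm-c report connectivity problems but vm-d is fine:
    print(rank_suspects(["vm-a", "vm-b", "vm-c"]))  # core-2 collects the most votes
```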

Direction 4: Harness Temporal Patterns

Time series and trend analysis of key health signals (a toy sketch follows below)

[figure: observability of a fault plotted over time; in the Case III incident, the storage side eventually observed the availability issue itself]
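A toy illustration of trend analysis: fit a slope to a sliding window of a health counter (say, memory use) and flag a sustained rise long before an outright failure. The window size, threshold, and sample values are arbitrary choices for the sketch, not recommendations from the talk.

```python
# Sketch: flag a slowly-degrading signal (e.g., a memory leak) by looking at
# the trend of a sliding window rather than the instantaneous value.

def slope(samples: list[float]) -> float:
    """Least-squares slope of samples taken at equally spaced times."""
    n = len(samples)
    xs = range(n)
    mean_x, mean_y = (n - 1) / 2.0, sum(samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den if den else 0.0

def trending_up(samples: list[float], window: int = 10, threshold: float = 1.0) -> bool:
    """True if the most recent `window` samples rise faster than `threshold` per tick."""
    return len(samples) >= window and slope(samples[-window:]) > threshold

if __name__ == "__main__":
    # Hypothetical memory-use samples (MB): a healthy plateau, then a slow leak.
    usage = [500.0] * 10 + [500.0 + 5.0 * i for i in range(10)]
    print("leak suspected:", trending_up(usage))  # True
```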

Conclusion

•Cloud systems are adept at handling crash and fail-stop failures

» decades of effort and research advances have paid off

•Gray failure is a major challenge moving forward

» behind many service incidents

» an acute pain for system designers and engineers

» fault tolerance 1.0 → 2.0

•A first attempt to define and explore this problem domain

» differential observability is a fundamental trait

» addressing this trait is key to tackling gray failure

» potential future directions with open challenges

Discussion

•Why does differential observability occur?

•Should (can) academia work on this problem?


• practitioners have been troubled by these issues for quite a while, relying on intuition, workarounds, resources, and process

• they are hungry for a principled approach to understand and tackle the problem

• many such issues exist in the open-source distributed system stack as well

• simplistic models and assumptions about component behavior

• the modularity principle, unexpected dependencies, improper error handling

• a focus on narrow point availability while dismissing application availability
