1
Thanh Do*, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, and Haryadi S. Gunawi
2
q Growing complexity of … § Technology scaling § Manufacturing § Design logic § Usage § Operating environment
q … makes HW fail differently § Complete fail-stop § Fail partial § Corruption § Performance degradation?
Rich literature
3
“… 1Gb NIC card on a machine that suddenly starts transmitting at 1 kbps,
this slow machine caused a chain reaction upstream in such a way that the performance of entire workload for a 100 node cluster was crawling at a snail's pace, effectively making the system unavailable for all practical purposes.” – Borthakur of Facebook
Degraded NIC! (1,000,000x: 1 Gbps down to 1 kbps)
Cascading impact!
More stories in the paper
4
q Does HW degrade? Yes § Limpware: hardware whose performance degrades significantly compared to its specification
q Is this a destructive failure mode? Yes § Cascading failures, no “fail in place”
q No systematic analysis of its impact
5
q 56 experiments that benchmark 5 systems § Hadoop, HDFS, Zookeeper, Cassandra, HBase § 22 protocols § 8 hours under normal scenarios § 207 hours under limpware scenarios
q Unearth many limpware-intolerant designs
Our finding: a single piece of limpware (e.g., one NIC) causes severe impact on an entire cluster
6
q Introduction
q System analysis
q Limplock
q Limpware-Tolerant Systems
q Conclusion
7
q “The performance of a 100 node cluster was crawling at a snail's pace” – Facebook
q But, … why?
8
q Goals § Measure system-level impacts § Find design flaws
q Methodology § Target cloud systems (e.g., HDFS, Hadoop, ZooKeeper) § Inject load + limpware
- E.g., slow a NIC to 1 Mbps, 0.1 Mbps, etc. (one way to do this is sketched below)
§ White-box analysis (internal probes) - Find design flaws
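The slides do not spell out the injection mechanism; a minimal sketch of one common way to emulate a degraded NIC on Linux is to cap its rate with `tc` and a token-bucket filter. The device name `eth0` and the helper names below are assumptions, not the authors' tooling.

```python
import subprocess

def limp_nic(dev="eth0", rate="1mbit"):
    """Emulate a limping NIC by capping its egress rate with Linux tc
    (token bucket filter). Assumes root privileges and that tc is installed."""
    # Drop any existing root qdisc first (ignore the error if none is set).
    subprocess.run(["tc", "qdisc", "del", "dev", dev, "root"],
                   stderr=subprocess.DEVNULL)
    subprocess.run(["tc", "qdisc", "add", "dev", dev, "root", "tbf",
                    "rate", rate, "burst", "32kb", "latency", "400ms"],
                   check=True)

def heal_nic(dev="eth0"):
    """Remove the throttle and restore full NIC speed."""
    subprocess.run(["tc", "qdisc", "del", "dev", dev, "root"], check=True)

# Example limpware scenarios from the experiments:
# limp_nic(rate="1mbit")     # 1 Mbps
# limp_nic(rate="100kbit")   # 0.1 Mbps
```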
9
q Run a distributed protocol § E.g., 3-node write in HDFS
q Measure slowdowns under: § No failure, crash, a degraded NIC (a timing sketch follows the chart below)
[Chart: the workload run against a NIC degraded to 10 Mbps, 1 Mbps, and 0.1 Mbps; y-axis: execution slowdown (10x, 100x, 1000x)]
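As a rough sketch of how each bar could be produced (the HDFS command, file path, and harness below are hypothetical, not the authors' benchmark code): time the same operation once with no failure and once with the limpware injected, and report the ratio.

```python
import subprocess, time

def time_run(cmd):
    """Run one benchmark command and return its wall-clock time in seconds."""
    start = time.time()
    subprocess.run(cmd, shell=True, check=True)
    return time.time() - start

# Hypothetical workload: a 3-way replicated write of a 512 MB file into HDFS.
WRITE_CMD = "hdfs dfs -D dfs.replication=3 -put -f /tmp/512mb.bin /bench/512mb.bin"

baseline = time_run(WRITE_CMD)   # no-failure run
# ... inject limpware here, e.g. limp_nic(rate="1mbit") on one datanode ...
limped = time_run(WRITE_CMD)     # limpware run
print(f"execution slowdown: {limped / baseline:.1f}x")
```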
10
q Introduction
q System analysis § Hadoop case study
q Limplock
q Limpware-Tolerant Cloud Systems
q Conclusion
11
q Is Hadoop tail-tolerant? § Why is speculative execution not triggered?
q Consider a degraded NIC on a map node § Task M2 runs as fast as M1 and M3 § Its input data is local!
q But all reducers are slow § They all fetch M2's output through the degraded NIC § A straggler is a task that is slow vs. others of the same job § No straggler detected! (see the sketch below)
q Flaws § Task-level straggler detection § Single point of failure!
[Diagram: WordCount on Hadoop. Mappers M1, M2, M3 feed the reducers; M2 runs on the node with the degraded NIC. Bar chart: reducer slowdown, 1-10x scale]
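To see why task-level straggler detection misses this case, here is a simplified Python model of a speculative-execution check in the spirit of Hadoop's (the threshold and progress-rate numbers are illustrative assumptions): when the one degraded NIC slows every reducer of the job equally, no reducer falls behind its peers, so no backup task is launched.

```python
def find_stragglers(progress_rates, slack=0.5):
    """Flag tasks whose progress rate is well below the average of the
    other tasks in the *same job* (a simplified straggler heuristic)."""
    stragglers = []
    for i, rate in enumerate(progress_rates):
        peers = progress_rates[:i] + progress_rates[i + 1:]
        if rate < slack * (sum(peers) / len(peers)):
            stragglers.append(i)
    return stragglers

# One genuinely slow reducer among fast peers is caught ...
print(find_stragglers([1.0, 1.0, 0.1]))      # -> [2]
# ... but when the degraded NIC slows *all* reducers of the job equally,
# none looks slow relative to its peers, so nothing is speculated.
print(find_stragglers([0.01, 0.01, 0.01]))   # -> []
```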
12
q A degraded NIC → degraded tasks § Degraded tasks are slower by orders of magnitude
q Slow tasks use up slots → degraded node § Default: 2 map and 2 reduce slots per node § If all slots are used, the node is "unavailable" (a toy simulation follows)
q All nodes in limp mode → degraded cluster
[Diagram: each node has two map (M) and two reduce (R) slots; the node with the slow NIC drags healthy nodes into limp mode]
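A toy Python simulation (the node count, slot count, and 30% involvement probability are assumptions for illustration) of how slow tasks holding slots turn one limping NIC into node, and eventually cluster, unavailability:

```python
import random

NODES, SLOTS_PER_NODE = 30, 4        # default: 2 map + 2 reduce slots per node
free_slots = {n: SLOTS_PER_NODE for n in range(NODES)}

def involves_limpware():
    # Any task that reads, writes, or shuffles through the limping node is
    # slowed by orders of magnitude (the 30% probability is an assumption).
    return random.random() < 0.3

# Fast tasks finish and free their slot almost immediately, so we model them
# as never holding one; limplocked tasks hold their slot indefinitely.
for _ in range(500):
    candidates = [n for n in range(NODES) if free_slots[n] > 0]
    if not candidates:
        break
    if involves_limpware():
        free_slots[random.choice(candidates)] -= 1

stuck = sum(1 for n in range(NODES) if free_slots[n] == 0)
print(f"nodes with every slot held by a limplocked task: {stuck}/{NODES}")
```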
[Figure 1: Limpbench results. Each graph plots the execution slowdown (log scale, up to 1000x) of one experiment described in Table 1 under various limpware scenarios: F1-F5 (HDFS master/secondary logging, write, read, checkpoint), F6-F11 (HDFS data write, read, regeneration, balancing, decommission), H1-H4 (Hadoop speculative execution), Z1-Z5 (ZooKeeper gets/sets on leader and followers), C1-C3 (Cassandra quorum reads/writes), and B1-B4 (HBase put, get, scan, cache get/put). In the first row a limping disk is injected (8, 0.8, and 0.08 MB/s); in the remaining rows a limping NIC is injected (10, 1, and 0.1 Mbps), each compared against no-failure and crash baselines. The graphs show that cloud systems are crash tolerant, but not limpware tolerant.]
[Figure 2: Hadoop limplock. The graphs show (a) the progress scores of limplocked reducers of a job in experiment H1 (a normal reducer is shown for comparison), (b) cascades of node limplock due to a single limpware, and (c) a throughput collapse of a Hadoop cluster due to a limpware. For (b) and (c), we ran a Facebook workload [1] on a 30-node cluster.]
[Figure 3: HDFS limplock probabilities. The figures plot the probabilities of (a) read limplock (Prl), (b) write limplock (Pwl), (c) block regeneration limplock (Pbl), and (d) cluster regeneration limplock (Pcl), as defined in Table 2. The x-axis plots cluster size (5-100 nodes); the curves vary r (1-40) and b (400-6400).]
13
q Macrobenchmark: Facebook workload § 30-node cluster § One node w/ degraded NIC (0.1 Mbps)
Cluster collapse: throughput drops to 1 job/hour! Why?
14
Fail-stop tolerant, but not limpware tolerant (no failover recovery)
15
q Introduction
q System analysis
q Formalizing the problem: Limplock § Definitions and causes
q Limpware-Tolerant Systems
q Conclusion
16
q Definition § The system progresses slowly due to limpware and is not capable of failing over to healthy components § (i.e., the system is "locked" in limping mode)
q 3 levels of limplock § Operation § Node § Cluster
17
q Operation Limplock § An operation involving limpware is "locked" in limping mode; no failover
q Node Limplock § Operations that must be served by this node experience limplock, although the operations themselves do not involve limpware
q Cluster Limplock § The whole cluster is in limplock due to limpware
18
q Operation Limplock § Single point of failure
- Hadoop slow map task - HBase “Gateway”
[Diagram: mappers M1, M2, M3 feeding reducers (the Hadoop slow-map-task example)]
19
q Operation Limplock § Single point of failure § Coarse-grained timeout § (more in the paper)
[Chart: slowdown (1-100x, log scale) of a 512 MB write to HDFS under a degraded NIC]
Reason: no timeout is triggered. HDFS uses a coarse-grained timeout of 60 seconds for every 64 KB transferred, so a transfer could limp at almost 1 KB/s without ever timing out (arithmetic sketched below).
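The arithmetic behind "could limp almost to 1 KB/s", as a small sketch (the 60-second and 64 KB figures come from the slide; the 512 MB write time is simply derived from them):

```python
# HDFS data transfers are checked with a coarse-grained timeout:
# roughly 60 seconds allowed for every 64 KB packet.
TIMEOUT_S = 60
PACKET_KB = 64

# Slowest transfer rate that still never trips the timeout:
floor_rate_kbps = PACKET_KB / TIMEOUT_S          # just over 1 KB/s
print(f"timeout floor: {floor_rate_kbps:.2f} KB/s")

# A 512 MB write at that floor rate takes several days without failover:
hours = (512 * 1024 / floor_rate_kbps) / 3600
print(f"512 MB write could take ~{hours:.0f} hours without triggering a timeout")
```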
20
q Operation Limplock § Single point of failure § Coarse-grained timeout § …
q Node Limplock § Bounded multi-purpose thread pool
[Diagram: HDFS master whose bounded multi-purpose thread pool serves both metadata writes and in-memory metadata reads. Chart: in-memory reads > 100x slower than normal (slowdown axis 1-100x, log scale)]
Reason: resource exhaustion by the limplocked operations. The limping metadata writes hold all handler threads, so in-memory metadata reads that never touch the limpware are blocked (sketch below).
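A minimal Python sketch of the flaw (not HDFS code; worker count and sleep times are placeholders): one bounded pool serves both limplock-prone writes and purely in-memory reads, so the slow operations exhaust the pool and starve the fast ones.

```python
import time
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)   # bounded pool shared by all request types

def meta_write():
    # Involves the limpware: each write limps (5 s here stands in for "very slow").
    time.sleep(5)

def meta_read():
    # Pure in-memory operation: normally takes microseconds.
    return "metadata"

# Limplocked writes occupy every worker in the shared pool ...
for _ in range(8):
    pool.submit(meta_write)

# ... so a read that never touches the limpware is stuck behind them: node limplock.
start = time.time()
print(pool.submit(meta_read).result())
print(f"in-memory read waited {time.time() - start:.1f}s instead of ~0s")
```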
21
q Operation Limplock § Single point of failure § Coarse-grained timeout § …
q Node Limplock § Bounded multi-purpose thread pool § Bounded multi-purpose queue
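The bounded multi-purpose queue has the same shape; a tiny sketch (illustrative only, queue size is an assumption): once the shared queue fills with messages destined for the limping node, even unrelated messages cannot get in.

```python
import queue

# One bounded queue shared by every message type (a "multi-purpose" queue).
inbox = queue.Queue(maxsize=8)

# Messages bound for the limping node are drained orders of magnitude more
# slowly, so they pile up and fill the shared queue ...
for i in range(8):
    inbox.put(("to-limping-node", i))

# ... and now a message that has nothing to do with the limpware cannot
# even be enqueued: node limplock.
try:
    inbox.put_nowait(("healthy-request", 0))
except queue.Full:
    print("healthy request blocked behind limplocked messages")
```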
22
q Operation Limplock § Single point of failure § Coarse-grained timeout § …
q Node Limplock § Bounded multi-purpose thread pool § Bounded multi-purpose queue § Unbounded thread pool/queue
- Ex: backlogged queue at the leader - Node limplock at the leader because garbage collection works hard - Quorum write: 10x slowdown (a toy model follows the figure)
[Chart: quorum-write slowdown (1-10x) under stress loads of 20 and 600 seconds. Diagram: client quorum write via the ZooKeeper leader to its followers]
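A toy model (illustrative, not ZooKeeper code; the load and drain rates are assumptions) of the opposite failure: an unbounded per-follower send queue at the leader keeps absorbing messages for a limping follower, so memory grows until garbage collection slows the whole leader.

```python
# Illustrative numbers (assumptions): a steady client load of 1000 proposals/s;
# a healthy follower drains 1000/s, the limping follower drains only 1/s.
LOAD_PER_S, LIMPING_DRAIN_PER_S, PROPOSAL_BYTES = 1000, 1, 1024

backlog = 0
for second in range(600):                  # ten minutes of load
    backlog += LOAD_PER_S - LIMPING_DRAIN_PER_S

print(f"unbounded send queue for the limping follower: {backlog} proposals "
      f"(~{backlog * PROPOSAL_BYTES / 2**20:.0f} MB pinned at the leader)")
# Nothing bounds this queue, so the leader's heap fills, garbage collection
# dominates, and even quorum writes slow down (about 10x in the experiments):
# node limplock at the leader.
```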
q Operation Limplock § Single point of failure § Coarse-grained timeout § …
q Node Limplock § Bounded multi-purpose thread pool § Bounded multi-purpose queue § Unbounded thread pool/queue
q Cluster Limplock § All nodes in limplock
- Ex: resource exhaustion in Hadoop, HDFS Regeneration § Master limplock in master-slave architecture
- Ex: cases in ZooKeeper, HDFS
24
q Found 15 protocols that exhibit limplock § 8 in HDFS § 1 in Hadoop § 2 in ZooKeeper § 4 in HBase
Limplock happens in almost all systems we have analyzed
25
q Introduction
q System analysis
q Limplock
q Limpware-Tolerant Cloud Systems
q Conclusion
q Anticipation § Limpware-tolerant design patterns § Limpware static analysis § Limpware statistics - Existing work: memory failure, disk failure, etc.
q Detection § Performance degradation is implicit (no hard errors) § Study explicit causes (e.g., block remapping, error correction) - A peer-comparison sketch follows this list
q Recovery § How to "fail in place"? § Better to fail-stop than fail-slow? § Quarantine?
q Utilization § Fail-stop: a component either fails or works § Limpware: degrades anywhere from 1-100%
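One possible shape of the detection idea, as a hedged sketch that is not taken from the paper: since limpware emits no hard errors, compare each node's observed throughput for the same work against its peers. The function name, threshold, and input numbers are all illustrative assumptions.

```python
import statistics

def suspect_limpware(node_throughputs, threshold=0.1):
    """Flag nodes whose throughput is far below the cluster median.
    node_throughputs: dict of node id -> observed MB/s for the same workload.
    threshold: fraction of the median below which a node is suspected (assumption)."""
    median = statistics.median(node_throughputs.values())
    return [node for node, tput in node_throughputs.items()
            if tput < threshold * median]

# Example: one datanode's NIC limps while its peers stay healthy.
print(suspect_limpware({"dn1": 110.0, "dn2": 105.0, "dn3": 0.012, "dn4": 108.0}))
# -> ['dn3']
```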
26
27
q New failure modes → transform systems
q Limpware is a “new”, destructive failure mode § Orders of magnitude slowdown § Cascading failures § No “fail in place” in current systems
A need for Limpware-Tolerant Systems
28