1
Thanh Do*, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, and Haryadi S. Gunawi
2
q Growing complexity of … § Technology scaling § Manufacturing § Design logic § Usage § Operating environment
q … makes HW fail differently § Complete fail-stop § Fail partial § Corruption § Performance degradation?
Rich literature
3
“… 1Gb NIC card on a machine that suddenly starts transmitting at 1 kbps,
this slow machine caused a chain reaction upstream in such a way that the performance of entire workload for a 100 node cluster was crawling at a snail's pace, effectively making the system unavailable for all practical purposes.” – Borthakur of Facebook
Degraded NIC! (1,000,000x: 1 Gbps down to 1 kbps)
Cascading impact!
More stories in the paper
4
q Does HW degrade? Yes § Limpware: hardware whose performance degrades significantly compared to its specification
q Is this a destructive failure mode? Yes § Cascading failures, no “fail in place”
q No systematic analysis of its impact
5
q 56 experiments that benchmark 5 systems § Hadoop, HDFS, Zookeeper, Cassandra, HBase § 22 protocols § 8 hours under normal scenarios § 207 hours under limpware scenarios
q Unearth many limpware-intolerant designs
Our finding: a single piece of limpware (e.g., one NIC) causes severe impact on an entire cluster
6
q Introduction
q System analysis
q Limplock
q Limpware-Tolerant Systems
q Conclusion
7
q “The performance of a 100 node cluster was crawling at a snail's pace” – Facebook
q But, … why?
8
q Goals § Measure system-level impacts § Find design flaws
q Methodology § Target cloud systems (e.g., HDFS, Hadoop, ZooKeeper) § Inject load + limpware
- E.g., slow a NIC to 1 Mbps, 0.1 Mbps, etc. (one way to do this is sketched below)
§ White-box analysis (internal probes) - Find design flaws
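The slides do not spell out the injection mechanism; a minimal sketch of one common way to emulate a degraded NIC on Linux is to cap its rate with `tc` and a token-bucket filter. The device name `eth0` and the helper names below are assumptions, not the authors' tooling.

```python
import subprocess

def limp_nic(dev="eth0", rate="1mbit"):
    """Emulate a limping NIC by capping its egress rate with Linux tc
    (token bucket filter). Assumes root privileges and that tc is installed."""
    # Drop any existing root qdisc first (ignore the error if none is set).
    subprocess.run(["tc", "qdisc", "del", "dev", dev, "root"],
                   stderr=subprocess.DEVNULL)
    subprocess.run(["tc", "qdisc", "add", "dev", dev, "root", "tbf",
                    "rate", rate, "burst", "32kb", "latency", "400ms"],
                   check=True)

def heal_nic(dev="eth0"):
    """Remove the throttle and restore full NIC speed."""
    subprocess.run(["tc", "qdisc", "del", "dev", dev, "root"], check=True)

# Example limpware scenarios from the experiments:
# limp_nic(rate="1mbit")     # 1 Mbps
# limp_nic(rate="100kbit")   # 0.1 Mbps
```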
9
q Run a distributed protocol § E.g., 3-node write in HDFS
q Measure slowdowns under: § No failure, crash, a degraded NIC (a timing sketch follows the chart below)
[Chart: the workload run against a NIC degraded to 10 Mbps, 1 Mbps, and 0.1 Mbps; y-axis: execution slowdown (10x, 100x, 1000x)]
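As a rough sketch of how each bar could be produced (the HDFS command, file path, and harness below are hypothetical, not the authors' benchmark code): time the same operation once with no failure and once with the limpware injected, and report the ratio.

```python
import subprocess, time

def time_run(cmd):
    """Run one benchmark command and return its wall-clock time in seconds."""
    start = time.time()
    subprocess.run(cmd, shell=True, check=True)
    return time.time() - start

# Hypothetical workload: a 3-way replicated write of a 512 MB file into HDFS.
WRITE_CMD = "hdfs dfs -D dfs.replication=3 -put -f /tmp/512mb.bin /bench/512mb.bin"

baseline = time_run(WRITE_CMD)   # no-failure run
# ... inject limpware here, e.g. limp_nic(rate="1mbit") on one datanode ...
limped = time_run(WRITE_CMD)     # limpware run
print(f"execution slowdown: {limped / baseline:.1f}x")
```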
10
q Introduction
q System analysis § Hadoop case study
q Limplock
q Limpware-Tolerant Cloud Systems
q Conclusion
11
q Is Hadoop tail-tolerant? § Why is speculative execution not triggered?
q Consider a degraded NIC on a map node § Task M2 runs as fast as M1 and M3 § Its input data is local!
q But all reducers are slow § They all fetch M2's output through the degraded NIC § A straggler is a task that is slow vs. others of the same job § No straggler detected! (see the sketch below)
q Flaws § Task-level straggler detection § Single point of failure!
[Diagram: WordCount on Hadoop. Mappers M1, M2, M3 feed the reducers; M2 runs on the node with the degraded NIC. Bar chart: reducer slowdown, 1-10x scale]
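To see why task-level straggler detection misses this case, here is a simplified Python model of a speculative-execution check in the spirit of Hadoop's (the threshold and progress-rate numbers are illustrative assumptions): when the one degraded NIC slows every reducer of the job equally, no reducer falls behind its peers, so no backup task is launched.

```python
def find_stragglers(progress_rates, slack=0.5):
    """Flag tasks whose progress rate is well below the average of the
    other tasks in the *same job* (a simplified straggler heuristic)."""
    stragglers = []
    for i, rate in enumerate(progress_rates):
        peers = progress_rates[:i] + progress_rates[i + 1:]
        if rate < slack * (sum(peers) / len(peers)):
            stragglers.append(i)
    return stragglers

# One genuinely slow reducer among fast peers is caught ...
print(find_stragglers([1.0, 1.0, 0.1]))      # -> [2]
# ... but when the degraded NIC slows *all* reducers of the job equally,
# none looks slow relative to its peers, so nothing is speculated.
print(find_stragglers([0.01, 0.01, 0.01]))   # -> []
```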
12
q A degraded NIC → degraded tasks § Degraded tasks are slower by orders of magnitude
q Slow tasks use up slots → degraded node § Default: 2 map and 2 reduce slots per node § If all slots are used, the node is "unavailable" (a toy simulation follows)
q All nodes in limp mode → degraded cluster
[Diagram: each node has two map (M) and two reduce (R) slots; the node with the slow NIC drags healthy nodes into limp mode]
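A toy Python simulation (the node count, slot count, and 30% involvement probability are assumptions for illustration) of how slow tasks holding slots turn one limping NIC into node, and eventually cluster, unavailability:

```python
import random

NODES, SLOTS_PER_NODE = 30, 4        # default: 2 map + 2 reduce slots per node
free_slots = {n: SLOTS_PER_NODE for n in range(NODES)}

def involves_limpware():
    # Any task that reads, writes, or shuffles through the limping node is
    # slowed by orders of magnitude (the 30% probability is an assumption).
    return random.random() < 0.3

# Fast tasks finish and free their slot almost immediately, so we model them
# as never holding one; limplocked tasks hold their slot indefinitely.
for _ in range(500):
    candidates = [n for n in range(NODES) if free_slots[n] > 0]
    if not candidates:
        break
    if involves_limpware():
        free_slots[random.choice(candidates)] -= 1

stuck = sum(1 for n in range(NODES) if free_slots[n] == 0)
print(f"nodes with every slot held by a limplocked task: {stuck}/{NODES}")
```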
[Figure 1: Limpbench results. Each graph plots the execution slowdown (log scale, up to 1000x) of one experiment described in Table 1 under various limpware scenarios: F1-F5 (HDFS master/secondary logging, write, read, checkpoint), F6-F11 (HDFS data write, read, regeneration, balancing, decommission), H1-H4 (Hadoop speculative execution), Z1-Z5 (ZooKeeper gets/sets on leader and followers), C1-C3 (Cassandra quorum reads/writes), and B1-B4 (HBase put, get, scan, cache get/put). In the first row a limping disk is injected (8, 0.8, and 0.08 MB/s); in the remaining rows a limping NIC is injected (10, 1, and 0.1 Mbps), each compared against no-failure and crash baselines. The graphs show that cloud systems are crash tolerant, but not limpware tolerant.]
[Figure 2: Hadoop limplock. The graphs show (a) the progress scores of limplocked reducers of a job in experiment H1 (a normal reducer is shown for comparison), (b) cascades of node limplock due to a single limpware, and (c) a throughput collapse of a Hadoop cluster due to a limpware. For (b) and (c), we ran a Facebook workload [1] on a 30-node cluster.]
[Figure 3: HDFS limplock probabilities. The figures plot the probabilities of (a) read limplock (Prl), (b) write limplock (Pwl), (c) block regeneration limplock (Pbl), and (d) cluster regeneration limplock (Pcl), as defined in Table 2. The x-axis plots cluster size (5-100 nodes); the curves vary r (1-40) and b (400-6400).]
13
q Macrobenchmark: Facebook workload § 30-node cluster § One node w/ degraded NIC (0.1 Mbps)
Cluster collapse: throughput drops to 1 job/hour! Why?
14
Fail-stop tolerant, but not limpware tolerant (no failover recovery)
15
q Introduction
q System analysis
q Formalizing the problem: Limplock § Definitions and causes
q Limpware-Tolerant Systems
q Conclusion
16
q Definition § The system progresses slowly due to limpware and is not capable of failing over to healthy components § (i.e., the system is "locked" in limping mode)
q 3 levels of limplock § Operation § Node § Cluster
17
q Operation Limplock § An operation involving limpware is "locked" in limping mode; no failover
q Node Limplock § Operations that must be served by this node experience limplock, although the operations themselves do not involve limpware
q Cluster Limplock § The whole cluster is in limplock due to limpware
18
q Operation Limplock § Single point of failure
- Hadoop slow map task - HBase “Gateway”
[Diagram: mappers M1, M2, M3 feeding reducers (the Hadoop slow-map-task example)]
19
q Operation Limplock § Single point of failure § Coarse-grained timeout § (more in the paper)
[Chart: slowdown (1-100x, log scale) of a 512 MB write to HDFS under a degraded NIC]
Reason: no timeout is triggered. HDFS uses a coarse-grained timeout of 60 seconds for every 64 KB transferred, so a transfer could limp at almost 1 KB/s without ever timing out (arithmetic sketched below).
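The arithmetic behind "could limp almost to 1 KB/s", as a small sketch (the 60-second and 64 KB figures come from the slide; the 512 MB write time is simply derived from them):

```python
# HDFS data transfers are checked with a coarse-grained timeout:
# roughly 60 seconds allowed for every 64 KB packet.
TIMEOUT_S = 60
PACKET_KB = 64

# Slowest transfer rate that still never trips the timeout:
floor_rate_kbps = PACKET_KB / TIMEOUT_S          # just over 1 KB/s
print(f"timeout floor: {floor_rate_kbps:.2f} KB/s")

# A 512 MB write at that floor rate takes several days without failover:
hours = (512 * 1024 / floor_rate_kbps) / 3600
print(f"512 MB write could take ~{hours:.0f} hours without triggering a timeout")
```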
20
q Operation Limplock § Single point of failure § Coarse-grained timeout § …
q Node Limplock § Bounded multi-purpose thread pool
[Diagram: HDFS master whose bounded multi-purpose thread pool serves both metadata writes and in-memory metadata reads. Chart: in-memory reads > 100x slower than normal (slowdown axis 1-100x, log scale)]
Reason: resource exhaustion by the limplocked operations. The limping metadata writes hold all handler threads, so in-memory metadata reads that never touch the limpware are blocked (sketch below).
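A minimal Python sketch of the flaw (not HDFS code; worker count and sleep times are placeholders): one bounded pool serves both limplock-prone writes and purely in-memory reads, so the slow operations exhaust the pool and starve the fast ones.

```python
import time
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)   # bounded pool shared by all request types

def meta_write():
    # Involves the limpware: each write limps (5 s here stands in for "very slow").
    time.sleep(5)

def meta_read():
    # Pure in-memory operation: normally takes microseconds.
    return "metadata"

# Limplocked writes occupy every worker in the shared pool ...
for _ in range(8):
    pool.submit(meta_write)

# ... so a read that never touches the limpware is stuck behind them: node limplock.
start = time.time()
print(pool.submit(meta_read).result())
print(f"in-memory read waited {time.time() - start:.1f}s instead of ~0s")
```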
21
q Operation Limplock § Single point of failure § Coarse-grained timeout § …
q Node Limplock § Bounded multi-purpose thread pool § Bounded multi-purpose queue
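The bounded multi-purpose queue has the same shape; a tiny sketch (illustrative only, queue size is an assumption): once the shared queue fills with messages destined for the limping node, even unrelated messages cannot get in.

```python
import queue

# One bounded queue shared by every message type (a "multi-purpose" queue).
inbox = queue.Queue(maxsize=8)

# Messages bound for the limping node are drained orders of magnitude more
# slowly, so they pile up and fill the shared queue ...
for i in range(8):
    inbox.put(("to-limping-node", i))

# ... and now a message that has nothing to do with the limpware cannot
# even be enqueued: node limplock.
try:
    inbox.put_nowait(("healthy-request", 0))
except queue.Full:
    print("healthy request blocked behind limplocked messages")
```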
22
q Operation Limplock § Single point of failure § Coarse-grained timeout § …
q Node Limplock § Bounded multi-purpose thread pool § Bounded multi-purpose queue § Unbounded thread pool/queue
- Ex: backlogged queue at the leader - Node limplock at the leader because garbage collection works hard - Quorum write: 10x slowdown (a toy model follows the figure)
[Chart: quorum-write slowdown (1-10x) under stress loads of 20 and 600 seconds. Diagram: client quorum write via the ZooKeeper leader to its followers]
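A toy model (illustrative, not ZooKeeper code; the load and drain rates are assumptions) of the opposite failure: an unbounded per-follower send queue at the leader keeps absorbing messages for a limping follower, so memory grows until garbage collection slows the whole leader.

```python
# Illustrative numbers (assumptions): a steady client load of 1000 proposals/s;
# a healthy follower drains 1000/s, the limping follower drains only 1/s.
LOAD_PER_S, LIMPING_DRAIN_PER_S, PROPOSAL_BYTES = 1000, 1, 1024

backlog = 0
for second in range(600):                  # ten minutes of load
    backlog += LOAD_PER_S - LIMPING_DRAIN_PER_S

print(f"unbounded send queue for the limping follower: {backlog} proposals "
      f"(~{backlog * PROPOSAL_BYTES / 2**20:.0f} MB pinned at the leader)")
# Nothing bounds this queue, so the leader's heap fills, garbage collection
# dominates, and even quorum writes slow down (about 10x in the experiments):
# node limplock at the leader.
```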
q Operation Limplock § Single point of failure § Coarse-grained timeout § …
q Node Limplock § Bounded multi-purpose thread pool § Bounded multi-purpose queue § Unbounded thread pool/queue
q Cluster Limplock § All nodes in limplock
- Ex: resource exhaustion in Hadoop, HDFS Regeneration § Master limplock in master-slave architecture
- Ex: cases in ZooKeeper, HDFS
24
q Found 15 protocols that exhibit limplock § 8 in HDFS § 1 in Hadoop § 2 in ZooKeeper § 4 in HBase
Limplock happens in almost all systems we have analyzed
25
q Introduction
q System analysis
q Limplock
q Limpware-Tolerant Cloud Systems
q Conclusion
q Anticipation § Limpware-tolerant design patterns § Limpware static analysis § Limpware statistics - Existing work: memory failure, disk failure, etc.
q Detection § Performance degradation is implicit (no hard errors) § Study explicit causes (e.g., block remapping, error correction) - A peer-comparison sketch follows this list
q Recovery § How to "fail in place"? § Better to fail-stop than fail-slow? § Quarantine?
q Utilization § Fail-stop: a component either fails or works § Limpware: degrades anywhere from 1-100%
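One possible shape of the detection idea, as a hedged sketch that is not taken from the paper: since limpware emits no hard errors, compare each node's observed throughput for the same work against its peers. The function name, threshold, and input numbers are all illustrative assumptions.

```python
import statistics

def suspect_limpware(node_throughputs, threshold=0.1):
    """Flag nodes whose throughput is far below the cluster median.
    node_throughputs: dict of node id -> observed MB/s for the same workload.
    threshold: fraction of the median below which a node is suspected (assumption)."""
    median = statistics.median(node_throughputs.values())
    return [node for node, tput in node_throughputs.items()
            if tput < threshold * median]

# Example: one datanode's NIC limps while its peers stay healthy.
print(suspect_limpware({"dn1": 110.0, "dn2": 105.0, "dn3": 0.012, "dn4": 108.0}))
# -> ['dn3']
```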
26
27
q New failure modes → transform systems
q Limpware is a “new”, destructive failure mode § Orders of magnitude slowdown § Cascading failures § No “fail in place” in current systems
A need for Limpware-Tolerant Systems
28