DStore: An Easy-to-Manage Persistent State Store
Andy Huang and Armando Fox
Stanford University
ROC Retreat – June 2004 © 2004 Andy Huang
Outline
• Project overview
• Consistency guarantees
• Failure detection
• Benchmarks
• Next steps and bigger picture
Background: Scalable CHTs
[Diagram: Frontends, App Servers, DBs, and cluster hash tables (CHTs), connected by two LANs]
Single-key-lookup data
• Yahoo! user profiles
• Amazon catalog metadata
Underlying storage layer
• Inktomi: wordID → docID list, docID → document metadata
• DDS/Ninja: atomic compare-and-swap
DStore: An easy-to-manage CHT
Challenges
Capacity planning
• High scaling costs necessitate accurate load prediction
Failure detection
• Fast detection is at odds with accurate detection
Cheap recovery
Predictably fast and predictably small impact on availability/performance
Benefits
Capacity planning
• Our online repartitioning algorithm lowers scaling cost
• Reactive scaling adjusts capacity to match current load
Failure detection
• Lowers the cost of acting on false positives
• Effective failure detection not contingent on accuracy
Manage like stateless frontends
Cheap recovery: Principles and costs
Techniques
Single-phase writes
• No locking or transactional logging
Quorums
• No recovery code to freeze writes & copy missed updates
Costs
• Sacrifice some consistency: Well-defined guarantees that provide consistent ordering
• Higher replication factor: 2N+1 bricks to tolerate N failures (vs. N+1 in ROWA)
Trade storage and consistency for cheap recovery
[Diagram: Dlib quorum write and quorum read]
Write: send to all, wait for majority
Read: read from majority
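A minimal sketch of the quorum rule above (write: send to all, wait for majority; read: read from majority), assuming in-memory bricks and caller-supplied timestamps. The names Brick, quorum_write, and quorum_read are hypothetical and are not DStore's API. The sketch only illustrates why 2N+1 bricks tolerate N failures: any two majorities share at least one brick.

```python
from dataclasses import dataclass, field


@dataclass
class Brick:
    """Hypothetical in-memory brick; `up` simulates availability."""
    store: dict = field(default_factory=dict)
    up: bool = True

    def put(self, key, ts, value):
        if not self.up:
            return False                      # no acknowledgement
        current = self.store.get(key)
        if current is None or ts > current[0]:
            self.store[key] = (ts, value)     # keep the newest timestamp
        return True

    def get(self, key):
        if not self.up:
            return None                       # no reply
        return self.store.get(key, (0, None))


def quorum_write(bricks, key, ts, value):
    """Send the write to all bricks; succeed if a majority acknowledged."""
    acks = sum(1 for b in bricks if b.put(key, ts, value))
    return acks >= len(bricks) // 2 + 1


def quorum_read(bricks, key):
    """Collect replies from a majority; return the newest value."""
    majority = len(bricks) // 2 + 1
    replies = []
    for b in bricks:
        reply = b.get(key)
        if reply is not None:
            replies.append(reply)
        if len(replies) == majority:
            break
    if len(replies) < majority:
        raise RuntimeError("no majority of bricks reachable")
    return max(replies, key=lambda reply: reply[0])[1]
```

With three bricks (N = 1), a write still succeeds while one brick is down, and a later read through a different majority still returns the written value because the two majorities overlap in one brick.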
Nothing new under the sun, but…
Technique | Prior work | DStore
CHT | Scalable performance | Ease of management
Quorums | Availability during network partitions and Byzantine faults | Availability during failures and recovery
Relaxed consistency | Availability and performance while nodes are unavailable
Result | High availability and performance (end goal) | Cheap recovery (but that’s just the start…)
Cheap recovery simplifies state management
Challenge | Prior work | DStore
Failure detection | Difficult to make fast and accurate | Effective even if it is not highly accurate
Online repartitioning | Relatively new area [Aqueduct] | Duration and impact is predictably small
Capacity planning | Predict future load | Scale reactively based on current load
Data reconstruction | [RAID] | [Future work]
Result | State management is costly (administration- and availability-wise) | Manage state with techniques used for stateless frontends
Consistency guarantees
Usage model:
1. A client issues a request
2. Request forwarded to a random Dlib
3. Dlib issues quorum r/w on bricks
• Assumption: Clients share data, but otherwise act independently
Guarantee: For a key k, DStore enforces a global order of operations that is consistent with the order seen by individual clients.
C1 issues w1(k, vnew) to replace current hash table entry (k, vold)
• w1 returns SUCCESS: subsequent reads return vnew
• w1 returns FAIL: subsequent reads return vold
• w1 returns UNKNOWN (due to Dlib failure): two cases
Case 1: Another user U2 performs a read
[Diagram: timeline for users U1 and U2 and bricks B1, B2, B3, which initially hold (k1, vold). Labels: w1(k1, vnew), r1(k1), r2(k1), w2(k1, vnew), returned values vold and vnew, "Delayed commit".]
Dlib failure can cause a partial write, violating the quorum property
If timestamps differ, read-repair restores majority invariant
U2 r(k1) returns:
• vold – no user has read vnew
• vnew – no user will later read vold
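Read-repair as described above can be sketched as follows. This is an illustration under stated assumptions, not DStore's code: each brick is modelled as a plain dict mapping a key to a (timestamp, value) pair, and the function name read_with_repair and its `contact` parameter (which bricks answer the read) are hypothetical.

```python
def read_with_repair(replicas, key, contact=None):
    """Read `key` from a majority of replicas. If the timestamps in the
    replies differ, write the newest (timestamp, value) pair back to every
    stale replica, then return the newest value.

    replicas: list of dicts, each mapping key -> (timestamp, value)
    contact:  indices of the replicas that answer (assumption: defaults
              to the first majority)
    """
    majority = len(replicas) // 2 + 1
    contact = list(range(majority)) if contact is None else list(contact)
    if len(contact) < majority:
        raise ValueError("a read must contact a majority of replicas")

    replies = [replicas[i].get(key, (0, None)) for i in contact]
    newest = max(replies, key=lambda reply: reply[0])

    if any(reply[0] != newest[0] for reply in replies):
        # A partial write was observed: restore the majority invariant.
        for replica in replicas:
            if replica.get(key, (0, None))[0] < newest[0]:
                replica[key] = newest
    return newest[1]
```

If the partial write reached only B1, a read that contacts B2 and B3 returns vold and repairs nothing, so no user has read vnew. A read that contacts B1 returns vnew and repairs the other bricks, so no later read returns vold.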
Case 2: U1 performs a read
[Diagram: timeline for users U1 and U2 and bricks B1, B2, B3, which initially hold (k1, vold). Labels: w1(k1, vnew), r1(k1), w2(k1, vnew), returned value vnew.]
A write-in-progress cookie can be used to detect partial writes and commit/abort on the next read
U1 r(k1): write is immediately committed or aborted – all future readers see either vold or vnew
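The slide does not say how the write-in-progress cookie is implemented, so the following is only one possible reading, with every name hypothetical. It assumes the cookie records the (key, timestamp, value) of a write whose outcome is UNKNOWN, that bricks are plain dicts mapping a key to a (timestamp, value) pair, and that all bricks are reachable when the cookie is resolved.

```python
def resolve_cookie(replicas, cookie):
    """Commit or abort the write recorded in `cookie` = (key, ts, value).

    Assumed policy: if any replica already stores the write, propagate it
    to every replica ("committed"); otherwise nothing was written and the
    cookie is simply dropped ("aborted").
    """
    key, ts, value = cookie
    if any(r.get(key, (0, None))[0] == ts for r in replicas):
        for r in replicas:
            if r.get(key, (0, None))[0] < ts:
                r[key] = (ts, value)
        return "committed"
    return "aborted"


def read_after_unknown(replicas, key, cookie=None):
    """Read issued by the writer: resolve a pending write on `key` first,
    then read from a majority and return the newest value."""
    if cookie is not None and cookie[0] == key:
        resolve_cookie(replicas, cookie)
    majority = len(replicas) // 2 + 1
    replies = [r.get(key, (0, None)) for r in replicas[:majority]]
    return max(replies, key=lambda reply: reply[0])[1]
```

Under these assumptions, the writer's next read leaves every brick holding either vold or vnew, which matches the guarantee that all future readers see one or the other.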
Consistency guarantees
C1 issues w1(k, vnew) to replace current hash table entry (k, vold)
• w1 returns SUCCESS: subsequent reads return vnew
• w1 returns FAIL: subsequent reads return vold
• w1 returns UNKNOWN (due to Dlib failure):
  U1 reads – w1 is immediately committed or aborted
  U2 reads – if vold is returned, no user has read vnew; if vnew is returned, no user will later read vold
Two-phase commit vs. single-phase writes
Property | 2-phase commit | Single-phase writes
Consistency | Sequential consistency | Consistent ordering
Availability | Locking may cause request to block during failures | No locking
Performance | 2 synchronous log writes, 2 roundtrips | 1 synchronous update, 1 roundtrip
Other costs | None | Read-repair (spreads out the cost of 2-PC to make common case faster); write-in-progress cookie (spreads out the responsibility of 2-PC)
Recovery | Read log to complete in-progress transactions | No special-case recovery
Recovery behavior
[Figure: GET req/sec, PUT req/sec, and Repairs/sec plotted against time (minutes, 0 to 30), with the recovery period marked. Annotations: "Run at 100% capacity" and "Typically, run at 60-70% max utilization".]
Predictably fast and small impact
Application-generic failure detection
[Diagram: operating statistics (CPU load, requests processed, etc.) → beacon listener → failure detection techniques → anomalies → > threshold → reboot]
Failure detection techniques
• Median absolute deviation
• Tarzan algorithm
Simple detection techniques “work” because resolution mechanism is cheap
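As an illustration of the median-absolute-deviation technique named above, the sketch below flags bricks whose operating statistic lies far from the median of its peers. The function name, the per-brick dictionary, and the threshold of 3.0 are assumptions made for the example; the slide does not give DStore's actual statistics, threshold, or how the threshold is applied. The Tarzan algorithm is not covered.

```python
from statistics import median


def mad_anomalies(stats, threshold=3.0):
    """Return the bricks whose statistic deviates from the median by more
    than `threshold` times the median absolute deviation (MAD).

    stats: dict mapping brick name -> one operating statistic
           (for example, requests processed in the last interval)
    """
    values = list(stats.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All typical values are identical: any differing brick stands out.
        return [name for name, v in stats.items() if v != med]
    return [name for name, v in stats.items()
            if abs(v - med) / mad > threshold]


# Hypothetical beacon data: requests processed per brick.
requests_processed = {"b1": 100, "b2": 104, "b3": 98, "b4": 101, "b5": 20}
suspects = mad_anomalies(requests_processed)   # candidates for a reboot
```

A detector this simple will sometimes flag a healthy brick, which is acceptable only because the response (a reboot) is cheap.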
Failure detection and repartitioning behavior
[Figure, panel "Aggressive failure detection": GET req/sec, PUT req/sec, and Repairs/sec plotted against time (minutes, 0 to 15).]
[Figure, panel "Online repartitioning": GET req/sec, PUT req/sec, Repairs/sec, and # bricks (3 to 6) plotted against time (minutes, 0 to 40).]
[Figure annotation: "Fail-stutter"]
Low scaling cost
Low cost of acting on false positives
Bigger picture: What is “self-managing”?
Indicator: Brick performance (a sign of system health)
Monitoring: tests for potential problems
Treatment: reboot (low-impact resolution mechanism)
Bigger picture: What is “self-managing”?
Brick performance
System load
Disk failures
Bigger picture: What is “self-managing”?
Brick performance
System load
Disk failures
Simple detection mechanisms & policies
reboot
repartition
reconstruction
Key: low-cost mechanisms
Constant “recovery”