StarFish: highly-available block storage

Eran Gabber, Jeff Fellin, Michael Flaster, Fengrui Gu, Bruce Hillyer, Wee Teck Ng, Banu Özden, Elizabeth Shriver

2003 USENIX Annual Technical Conference

Presenter: D00922019 林敬棋

Introduction

Important data need to be protected.
◦Making replicas.

Replication on remote sites
◦Reduces the amount of data lost in a failure.
◦Decreases the time required to recover from a catastrophic site failure.

StarFish

A highly-available, geographically-dispersed block storage system.
◦Does not require expensive dedicated communication lines to all replicas to achieve high availability.
◦Achieves good performance even during recovery from a replica failure.
◦Single-owner access semantics.

Architecture

StarFish consists of
◦One Host Element (HE): provides storage virtualization and a read cache.
◦N Storage Elements (SEs).

Q: the write quorum size. Updates are synchronous to a quorum of Q SEs and asynchronous to the rest.
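The quorum rule can be illustrated with a minimal Python sketch. It assumes the HE sends a write to all N SEs in parallel and completes the write once Q acknowledgements have arrived, so the write latency is the Q-th smallest SE delay. The function name and the delay values are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def quorum_write_latency(se_delays_ms: Sequence[float], q: int) -> float:
    """Time until Q SEs have acknowledged a write sent to all SEs in parallel.

    Simplifying assumption: each SE acknowledges after a fixed delay, so the
    write completes at the Q-th smallest delay. The remaining N - Q SEs are
    updated asynchronously and do not add to the latency.
    """
    if not 1 <= q <= len(se_delays_ms):
        raise ValueError("need 1 <= Q <= N")
    return sorted(se_delays_ms)[q - 1]


# Hypothetical delays to a local, a near, and a far SE (N = 3).
delays_ms = [0.0, 4.0, 8.0]
for q in (1, 2, 3):
    print(f"Q = {q}: write completes after {quorum_write_latency(delays_ms, q)} ms")
```

Under this model, with N = 3 and Q = 2 the far SE never sits on the critical path of a write while the other two SEs are reachable.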

Recommended Setup

N = 3, Q = 2

MAN: Metropolitan Area Network
WAN: Wide Area Network

Another Deployment

SE Recovery

Write log
◦HE keeps a circular buffer of recent writes.
◦Each SE maintains a circular buffer of recent writes on a log disk.

Three types of recovery
◦Quick recovery
◦Replay recovery
◦Full recovery
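The slide does not say when each recovery type applies. The sketch below is one plausible reading, labeled here as an assumption: quick recovery when the missing writes are still in the HE's buffer, replay recovery when they are only in another SE's log, and full recovery otherwise. The class `WriteLog`, the function `choose_recovery`, and the sequence-number scheme are hypothetical names introduced for illustration.

```python
from collections import deque


class WriteLog:
    """Circular buffer of recent writes, ordered by increasing sequence number."""

    def __init__(self, capacity: int) -> None:
        # deque(maxlen=...) silently drops the oldest entry when full.
        self._buf = deque(maxlen=capacity)

    def append(self, seq: int, block: int, data: bytes) -> None:
        self._buf.append((seq, block, data))

    def covers(self, first_missing_seq: int) -> bool:
        """True if every write from first_missing_seq onward is still buffered."""
        return bool(self._buf) and self._buf[0][0] <= first_missing_seq


def choose_recovery(first_missing_seq: int, he_log: WriteLog, peer_logs: list) -> str:
    """Pick a recovery type for an SE that missed writes (assumed rule, see lead-in)."""
    if he_log.covers(first_missing_seq):
        return "quick"   # resend from the HE's in-memory buffer
    if any(log.covers(first_missing_seq) for log in peer_logs):
        return "replay"  # replay from another SE's log disk
    return "full"        # copy the whole volume


he_log, peer_log = WriteLog(capacity=4), WriteLog(capacity=8)
for seq in range(1, 11):  # ten writes; HE keeps 7..10, the peer SE keeps 3..10
    he_log.append(seq, block=seq % 3, data=b"x")
    peer_log.append(seq, block=seq % 3, data=b"x")

for missing_from in (8, 5, 2):
    print(missing_from, choose_recovery(missing_from, he_log, [peer_log]))
```

The longer an SE has been disconnected, the older its first missing write, and the further down this list the recovery falls.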

Availability and Reliability

Assume that the failure and recovery processes of the network links and SEs are i.i.d. Poisson processes with combined mean failure and recovery rates of λ and μ per second.

Similarly, the HE has Poisson-distributed failure and recovery rates λ_he and μ_he.

Availability

The steady-state probability that at least Q SEs are available.

Derived from the standard machine repairman model.

\[
A(Q,N) = \frac{\sum_{i=0}^{N-Q} \binom{N}{i} \rho^{i}}{(1+\rho)^{N}}, \qquad \rho = \frac{\lambda}{\mu}
\]

Machine Repairman Model
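The closed form for the availability can be recovered in two steps. This is a sketch under the i.i.d. assumption stated earlier, with ρ = λ/μ denoting the load: each SE (together with its link) is treated as an independent two-state process, and the system is available when at most N − Q SEs are down.

```latex
\begin{align*}
P(\text{SE up}) &= \frac{\mu}{\lambda+\mu} = \frac{1}{1+\rho},
\qquad
P(\text{SE down}) = \frac{\lambda}{\lambda+\mu} = \frac{\rho}{1+\rho},
\qquad \rho = \frac{\lambda}{\mu} \\
A(Q,N) &= P(\text{at most } N-Q \text{ SEs down})
 = \sum_{i=0}^{N-Q} \binom{N}{i}
   \left(\frac{\rho}{1+\rho}\right)^{i}
   \left(\frac{1}{1+\rho}\right)^{N-i}
 = \frac{\sum_{i=0}^{N-Q} \binom{N}{i}\rho^{i}}{(1+\rho)^{N}}
\end{align*}
```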

Availability (cont.)

X ★ 9: the number of 9s in an availability measure.

Achieves a much higher availability when N = 2Q + 1.

For fixed N, availability decreases with larger quorum size.
◦Increasing the quorum size trades off availability for reliability.
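The trade-off can be checked numerically. The sketch below assumes the availability is the probability that at most N − Q of N independent SEs are down, each with load ρ = λ/μ. The helper `nines` and the 99% per-SE availability are illustrative assumptions, not values taken from the plots.

```python
from math import comb, log10


def availability(q: int, n: int, rho: float) -> float:
    """Probability that at least Q of N i.i.d. SEs are up, with load rho = lambda/mu."""
    if not 1 <= q <= n:
        raise ValueError("need 1 <= Q <= N")
    # At most N - Q SEs may be down; each is down with probability rho / (1 + rho).
    return sum(comb(n, i) * rho**i for i in range(n - q + 1)) / (1 + rho) ** n


def nines(a: float) -> float:
    """Number of 9s in an availability value below 1, e.g. 0.999 -> about 3."""
    return -log10(1.0 - a)


# Hypothetical load: each SE is up 99% of the time, so rho = 0.01 / 0.99.
rho = 0.01 / 0.99
for q in (1, 2, 3):
    a = availability(q, 3, rho)
    print(f"N = 3, Q = {q}: A = {a:.6f} ({nines(a):.2f} nines)")
```

For fixed N the printed availability falls as Q grows, which is the cost paid for the stronger reliability guarantee.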

Reliability

The probability of no data loss. The reliability increases with larger Q.

Two approaches
◦Make Q > floor(N/2) and require that at least Q SEs are available. This reduces availability and performance.
◦Read-only consistency

Read-only Consistency

Available in read-only mode during failure.
◦Read-only mode obviates the need for Q SEs to be available to handle updates.
◦Increases availability

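\[
A_{ReadOnly}(Q,N) = \frac{\sum_{i=0}^{N-1} \binom{N}{i} \rho^{i}}{(1+\rho)^{N}(1+\rho_{he})} + \frac{\rho_{he} \sum_{i=0}^{Q-1} \binom{N}{i} \rho^{i}}{(1+\rho)^{N}(1+\rho_{he})}
\]

\[
A_{ReadOnly}(Q,N) = \frac{A(1,N)}{1+\rho_{he}} + \frac{\rho_{he}\, A(N-Q+1,N)}{1+\rho_{he}}
\]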

Availability with Read-only Consistency

Observations

If ρ_he = 0, availability is independent of Q.
◦Can always recover from the HE.

If ρ_he increases, availability increases with Q.

The largest increase occurs from Q = 1 to Q = 2, and is bounded by 3/16 when ρ = 1.
◦Diminishing gain after Q = 2.
◦Suggests Q = 2 in a practical system.
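These observations can be explored with a short Python sketch. It assumes the read-only availability takes the form A(1,N)/(1+ρ_he) + ρ_he·A(N−Q+1,N)/(1+ρ_he): while the HE is up, one reachable SE suffices, and while the HE is down, N−Q+1 reachable SEs are needed so that at least one of them holds the latest write. This form is a reconstruction, not a quotation, and the function names are hypothetical.

```python
from math import comb


def availability(q: int, n: int, rho: float) -> float:
    """Probability that at least Q of N i.i.d. SEs are up, with load rho = lambda/mu."""
    return sum(comb(n, i) * rho**i for i in range(n - q + 1)) / (1 + rho) ** n


def availability_read_only(q: int, n: int, rho: float, rho_he: float) -> float:
    """Availability with read-only consistency (assumed form, see lead-in)."""
    he_up = 1.0 / (1.0 + rho_he)       # HE up: any single SE can be brought up to date
    he_down = rho_he / (1.0 + rho_he)  # HE down: need N - Q + 1 SEs to be up
    return he_up * availability(1, n, rho) + he_down * availability(n - q + 1, n, rho)


n, rho = 3, 1.0
for rho_he in (0.0, 1.0):
    values = [availability_read_only(q, n, rho, rho_he) for q in (1, 2, 3)]
    print(f"rho_he = {rho_he}: " + ", ".join(f"{v:.4f}" for v in values))

gain = availability_read_only(2, n, rho, 1.0) - availability_read_only(1, n, rho, 1.0)
print(f"gain from Q = 1 to Q = 2: {gain}")
```

With ρ_he = 0 the three values coincide, matching the first observation. With N = 3 and ρ = ρ_he = 1, values chosen here only for illustration, the step from Q = 1 to Q = 2 evaluates to 3/16 under this form.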

Implementation

Performance Measurements

Compares StarFish with a direct-attached RAID unit.

Settings

Different network delays
◦1, 2, 4, 8, 23, 36, 65 ms

Different bandwidth limitations
◦31, 51, 62, 93, 124 Mb/s

Benchmarks
◦Micro-benchmarks: read hit, read miss, write
◦PostMark

Effects of network delays and HE cache size

Near SE delay: 4 ms; far SE delay: 8 ms
No cache miss if HE cache size = 400 MB

Observation

A large HE cache improves performance.
◦The HE can respond to more read requests without communicating with an SE. It does not change write requests.
◦Especially beneficial when the local SE has significant delays.

With Q = 2 and a 400 MB cache, performance is not influenced by the delay to the local SE.
◦It depends on the near SE.

Normal Operation and placement of the far SE

1-8: 1, 2, 4, 8 ms
4-12: 4, 8, 12 ms
23-65: 23, 36, 65 ms
31-124: 31, 51, 62, 93, 124 Mb/s
Local SE delay: 0 ms
N = 3

Normal Operation and placement of the far SE (cont.)

N = 3, 8 threads

Observation

Performance is influenced mostly by two parameters
◦Write quorum size
◦Delay to the SE

StarFish can provide adequate performance when one of the SEs is placed in a remote location.
◦At least 85% of the performance of a direct-attached RAID.

Recovery

Performance degrades more during full recovery.

Conclusion

The StarFish system reveals significant benefits from a third copy of the data at an intermediate distance.

A StarFish system with 3 replicas, a write quorum size of 2, and read-only consistency yields better than 99.9999% availability assuming individual Storage Element availability of 99%.
