NFSv4 Replication for Grid Computing
Peter Honeyman, Center for Information Technology Integration, University of Michigan, Ann Arbor




We develop a consistent mutable replication extension for NFSv4 tuned to meet the rigorous demands of large-scale data sharing in global collaborations. The system uses a hierarchical replication control protocol that dynamically elects a primary server at various granularities. Experimental evaluation indicates a substantial performance advantage over a single server system. With the introduction of the hierarchical replication control, the overhead of replication is negligible even when applications mostly write and replication servers are widely distributed.


Page 1: NFSv4 Replication for Grid Computing

NFSv4 Replication for Grid Computing

Peter Honeyman
Center for Information Technology Integration
University of Michigan, Ann Arbor

Page 2: NFSv4 Replication for Grid Computing

Acknowledgements

Joint work with Jiaying Zhang, UM CSE doctoral candidate (defending later this month)

Partially supported by:
NSF/NMI GridNFS
DOE/SciDAC Petascale Data Storage Institute
Network Appliance, Inc.
IBM ARC

Page 3: NFSv4 Replication for Grid Computing

Outline

Background
Consistent replication
Fine-grained replication control
Hierarchical replication control
Evaluation
Durability revisited (NEW!)
Conclusion



Page 4: NFSv4 Replication for Grid Computing

Grid computing

Emerging global scientific collaborations require access to widely distributed data that is reliable, efficient, and convenient



Page 5: NFSv4 Replication for Grid Computing

GridFTP

Advantages:
Automatic negotiation of TCP options
Parallel data transfer
Integrated Grid security
Easy to install and support across a broad range of platforms
Drawbacks:
Data sharing requires manual synchronization


Page 6: NFSv4 Replication for Grid Computing

NFSv4

Advantages:
Traditional, well-understood file system semantics
Supports multiple security mechanisms
Close-to-open consistency: a reader is guaranteed to see data written by the last writer to close the file
Drawbacks:
Wide-area performance

Page 7: NFSv4 Replication for Grid Computing

NFSv4.r

Research prototype developed at CITI
Replicated file system built on NFSv4
Server-to-server replication control protocol
High performance data access
Conventional file system semantics


Page 8: NFSv4 Replication for Grid Computing

Replication in practice

Read-only replication:
Clumsy manual release model
Lacks complex data sharing (concurrent writes)
Optimistic replication:
Inconsistent consistency


Page 9: NFSv4 Replication for Grid Computing

Consistent replication

Problem: state of the practice in file system replication does not satisfy the requirements of global scientific collaborations

How can we provide Grid applications efficient and reliable data access?



Page 10: NFSv4 Replication for Grid Computing

Design principles

Optimal read-only behavior:
Performance must be identical to an un-replicated local system
Concurrent write behavior:
Ordered writes, i.e., one-copy serializability
Close-to-open semantics
Fine-grained replication control:
The granularity of replication control is a single file or directory

Page 11: NFSv4 Replication for Grid Computing

Replication control

[diagram: a client sends an open-for-write (wopen) request to a replication server]

When a client opens a file for writing, the selected server temporarily becomes the primary for that file. Other replication servers are instructed to forward client requests for that file to the primary if concurrent writes occur.


Page 12: NFSv4 Replication for Grid Computing

Replication control

[diagram: the client writes to the primary server]

The primary server asynchronously distributes updates to other servers during file modification.


Page 13: NFSv4 Replication for Grid Computing

Replication control

[diagram: the client closes the file at the primary server]

When the file is closed and all replication servers are synchronized, the primary server notifies the other replication servers that it is no longer the primary server for the file.
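The three slides above describe a per-file primary lifecycle: claim the primary role on open-for-write, push updates to the other replicas asynchronously while the file is open, and relinquish the role once every replica is synchronized at close. The sketch below illustrates that lifecycle; the class and method names are illustrative, not the NFSv4.r implementation.

```python
class ReplicaServer:
    """Sketch of the per-file primary lifecycle (illustrative names only)."""

    def __init__(self, name):
        self.name = name
        self.peers = []          # other replication servers
        self.primary_of = {}     # path -> current primary server (or None)
        self.data = {}           # path -> file contents

    # --- messages received from a peer ---
    def notify_primary(self, path, primary):
        self.primary_of[path] = primary   # forward concurrent writes here

    def apply_update(self, path, contents):
        self.data[path] = contents        # asynchronous update from the primary

    # --- client-facing operations on the server the client picked ---
    def open_for_write(self, path):
        for p in self.peers:              # claim the primary role for this file
            p.notify_primary(path, self)
        self.primary_of[path] = self

    def write(self, path, contents):
        assert self.primary_of.get(path) is self
        self.data[path] = contents
        for p in self.peers:              # distribute updates asynchronously
            p.apply_update(path, contents)

    def close(self, path):
        # all replicas are synchronized (writes already pushed above),
        # so relinquish the primary role
        for p in self.peers:
            p.notify_primary(path, None)
        self.primary_of[path] = None

# Usage: wire two servers together, then open/write/close through one of them.
a, b = ReplicaServer("A"), ReplicaServer("B")
a.peers, b.peers = [b], [a]
a.open_for_write("/f"); a.write("/f", b"data"); a.close("/f")
```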


Page 14: NFSv4 Replication for Grid Computing

Directory updates

Prohibit concurrent updates:
A replication server waits for the primary to relinquish its role
Atomicity for updates that involve multiple objects (e.g., rename):
A server must become primary for all objects
Updates are grouped and processed together


Page 15: NFSv4 Replication for Grid Computing

Close-to-open semantics

A server becomes primary after it collects votes from a majority of replication servers
Uses a majority consensus algorithm; cost is dominated by the median RTT from the primary server to the other replication servers
The primary server must ensure that every replication server has acknowledged its election when a written file is closed
This guarantees close-to-open semantics
Heuristic: for file creation, allow a new file to inherit the primary server that controls its parent directory
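A minimal sketch of the two consensus points above, assuming a simple vote/acknowledgement count (a worked illustration, not the protocol code):

```python
def majority_reached(votes_received, total_replicas):
    """Primary election: the server may start acting as primary once a majority
    of the replicas (including itself) has voted for it. Election latency is
    therefore dominated by the median RTT to the other replication servers."""
    return votes_received >= total_replicas // 2 + 1

def may_complete_close(election_acks, total_replicas):
    """Close-to-open semantics: the close may only return to the client once
    every replication server has acknowledged the election (total consensus),
    so the next reader on any replica sees the primary's updates."""
    return election_acks == total_replicas

# With 5 replicas: a write-open can proceed after 3 votes,
# but the close must wait for all 5 acknowledgements.
assert majority_reached(3, 5) and not majority_reached(2, 5)
assert may_complete_close(5, 5) and not may_complete_close(4, 5)
```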

Page 16: NFSv4 Replication for Grid Computing

Durability guarantee

“Active view” update policy:
Every server keeps track of the liveness of other servers (its active view)
The primary server removes from its active view any server that fails to respond to its request
The primary server distributes updates synchronously and in parallel
The primary server acknowledges a client write after a majority of replication servers reply
The primary sends other servers its active view with file close
A failed replication server must synchronize with the up-to-date copy before it can rejoin the active group
I suppose this is expensive
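A sketch of the active-view write path described above, under the assumption that `peer.apply(update)` is an RPC that raises on failure or timeout; the names are illustrative and the calls are shown sequentially for clarity, though the slide says they are issued in parallel.

```python
def replicate_write(primary_view, update):
    """Active-view durability sketch: push the update to every server in the
    primary's active view, drop any server that fails to respond, and
    acknowledge the client once a majority of all replicas has applied it."""
    total = len(primary_view) + 1          # peers plus the primary itself
    acks = 1                               # the primary has applied the update
    responsive = []
    for peer in primary_view:
        try:
            peer.apply(update)
            acks += 1
            responsive.append(peer)
        except Exception:
            pass                           # failed server leaves the active view
    primary_view[:] = responsive           # new view is distributed at file close
    return acks > total // 2               # may the client write be acknowledged?
```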

Page 17: NFSv4 Replication for Grid Computing

What I skipped

Not the Right Stuff:
GridFTP: manual synchronization
NFSv4.r: write-mostly WAN performance
AFS, Coda, et al.: sharing semantics
Consistent replication for Grid computing:
Ordered writes too weak
Strict consistency too strong
Close-to-open just right

Page 18: NFSv4 Replication for Grid Computing

NFSv4.r in brief

View-based replication control protocol
Based on (provably correct) El-Abbadi, Skeen, and Cristian
Dynamic election of primary server at the granularity of a single file or directory
Majority consensus on open (for synchronization)
Synchronous updates to a majority (for durability)
Total consensus on close (for close-to-open)

Page 19: NFSv4 Replication for Grid Computing

Write-mostly WAN performance

Durability overhead: synchronous updates
Synchronization overhead: consensus management

Page 20: NFSv4 Replication for Grid Computing

Asynchronous updates

Consensus requirement delays client updates
Median RTT between the primary server and other replication servers is costly
Synchronous write performance is worse
Solution: asynchronous updates
Let the application decide whether to wait for server recovery or regenerate the computation results
OK for Grid computations that checkpoint
Revisit at end with new ideas

Page 21: NFSv4 Replication for Grid Computing

Hierarchical replication control

Synchronization is costly over WAN
Hierarchical replication control:
Amortizes consensus management
A primary server can assert control at different granularities

Page 22: NFSv4 Replication for Grid Computing

Shallow & deep control

[diagram: a /usr directory tree with children bin and local, shown twice to contrast shallow control of a single object with deep control of the whole subtree]

A server with a shallow control on a file or directory is the primary server for that single object
A server with a deep control on a directory is the primary server for everything in the subtree rooted at that directory

Page 23: NFSv4 Replication for Grid Computing

Primary server election

Allow deep control for a directory D only if no descendant of D is controlled by another server

Grant a shallow control request for object L from peer server P if L is not controlled by a server other than P

Grant a deep control request for directory D from peer server P if:
D is not controlled by a server other than P
No descendant of D is controlled by a server other than P
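A sketch of those grant rules, assuming each server can look up the controller of an object and enumerate the controllers in a subtree (in NFSv4.r the subtree check is what the ancestry table on the next slide makes cheap; the helper names here are illustrative):

```python
def grant_shallow(controller_of, obj, requester):
    """Grant shallow control of `obj` unless another server already controls it."""
    current = controller_of.get(obj)
    return current is None or current == requester

def grant_deep(controller_of, descendants_of, directory, requester):
    """Grant deep control of `directory` only if neither the directory nor any
    descendant is controlled by a server other than the requester."""
    if not grant_shallow(controller_of, directory, requester):
        return False
    return all(grant_shallow(controller_of, d, requester)
               for d in descendants_of(directory))

# Example: S1 controls /usr/bin, so S2 cannot take deep control of /usr,
# but it can still take shallow control of /usr itself.
controller_of = {"/usr/bin": "S1"}
descendants = {"/usr": ["/usr/bin", "/usr/local"]}
assert not grant_deep(controller_of, lambda d: descendants[d], "/usr", "S2")
assert grant_shallow(controller_of, "/usr", "S2")
```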

Page 24: NFSv4 Replication for Grid Computing

Ancestry table

[diagram: an example namespace rooted at /root, with subtrees under directories a and b; individual files and directories in the tree are controlled by servers S0, S1, and S2]

Ancestry Table (counter array per directory):

Id    S0  S1  S2
root   2   1   1
a      2   1   0
b      0   0   1
c      2   0   0

The data structure of entries in the ancestry table:
Ancestry Entry: an ancestry entry has the following attributes
id = unique identifier of the directory
array of counters = counters recording which servers control the directory's descendants
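A sketch of how such an ancestry table might be maintained, assuming each entry keeps one counter per replication server counting controlled descendants; a deep-control request can then be checked against the counters of servers other than the requester. The structure below is illustrative, not the NFSv4.r code.

```python
from collections import defaultdict
import posixpath

class AncestryTable:
    """For every directory, a counter array recording how many of its
    descendants each replication server currently controls."""

    def __init__(self):
        # directory path -> {server id -> number of controlled descendants}
        self.counters = defaultdict(lambda: defaultdict(int))

    def _ancestors(self, path):
        while path not in ("/", ""):
            path = posixpath.dirname(path)
            yield path or "/"

    def add_control(self, obj_path, server):
        """Record that `server` took control of `obj_path`."""
        for directory in self._ancestors(obj_path):
            self.counters[directory][server] += 1

    def drop_control(self, obj_path, server):
        for directory in self._ancestors(obj_path):
            self.counters[directory][server] -= 1

    def deep_control_allowed(self, directory, requester):
        """Deep control is allowed only if no other server controls anything
        in the subtree rooted at `directory`."""
        return all(count == 0
                   for server, count in self.counters[directory].items()
                   if server != requester)

# Example (one plausible reading of the slide's figure): S1 controls /root/a/c,
# S0 controls two files beneath it, nothing under /root/b is controlled.
t = AncestryTable()
t.add_control("/root/a/c", "S1")
t.add_control("/root/a/c/f2", "S0")
t.add_control("/root/a/c/d2", "S0")
assert not t.deep_control_allowed("/root/a", "S2")   # conflicts with S0 and S1
assert t.deep_control_allowed("/root/b", "S2")       # nothing controlled under b
```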

Page 25: NFSv4 Replication for Grid Computing

Primary election

S0 and S1 succeed in their primary server elections
S2's election fails due to conflicts
Solution: S2 then retries by asking for shallow control of a

[diagram: servers S0, S1, and S2 request control over a directory tree with root a and children b and c; S0 and S1 obtain control of b and c respectively, so S2's request for deep control of a fails]


Page 26: NFSv4 Replication for Grid Computing

Performance vs. concurrency

Associate a timer with deep control
Reset the timer with subsequent updates
Release deep control when the timer expires
A small timer value captures bursty updates
Issue a separate shallow control for a file written under a deep-controlled directory:
Still process the write request immediately
Subsequent writes on the file do not reset the timer of the deep-controlled directory
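A sketch of that timer policy, assuming a monotonic clock and a per-directory expiry time (illustrative only, not the actual implementation):

```python
import time

class DeepControl:
    """Release deep control of a directory after a quiet period; writes to files
    under the directory get their own shallow control and do not reset the timer."""

    def __init__(self, directory, quiet_period=0.5):
        self.directory = directory
        self.quiet_period = quiet_period          # small value captures bursty updates
        self.expires_at = time.monotonic() + quiet_period

    def on_directory_update(self):
        # namespace updates under the directory keep deep control alive
        self.expires_at = time.monotonic() + self.quiet_period

    def on_file_write(self, path):
        # handled immediately under a separate shallow control;
        # deliberately does NOT reset the deep-control timer
        return ("shallow-control", path)

    def expired(self):
        return time.monotonic() >= self.expires_at
```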

Page 27: NFSv4 Replication for Grid Computing

Performance vs. concurrency

Increase concurrency when the system consists of multiple writers:
Send a revoke request upon concurrent writes
The primary server shortens its release timer
In single-writer cases, optimally issue a deep control request for a directory that contains many updates


Page 28: NFSv4 Replication for Grid Computing

Single remote NFS

[chart: SSH build time in seconds (log scale) for the unpack, configure, and build phases vs. RTT between NFS server and client, 0.2 to 40 ms]

Page 29: NFSv4 Replication for Grid Computing

Deep vs. shallow

[chart: SSH build time in seconds (unpack, configure, build) vs. RTT between two replication servers, 0.2 to 120 ms, comparing shallow controls alone against deep + shallow controls, with a single local server as the baseline]

Page 30: NFSv4 Replication for Grid Computing

Deep control timer

[chart: SSH build time in seconds (unpack, configure, build) vs. RTT between two replication servers, 0.2 to 120 ms, for 0.1 s, 0.5 s, and 1 s deep-control timers, with a single local server as the baseline]

Page 31: NFSv4 Replication for Grid Computing

Durability revisited

Synchronization is expensive, but …
When we abandon the durability guarantee, we risk losing the results of the computation
And may be forced to rerun it
But it might be worth it

Goal: maximize utilization


Page 32: NFSv4 Replication for Grid Computing

Utilization tradeoffs

Adding synchronous replication servers enhances durability
Which reduces the risk that results are lost
And that the computation must be restarted
Which benefits utilization
But increases run time
Which reduces utilization

Page 33: NFSv4 Replication for Grid Computing

Placement tradeoffs

Nearby replication servers reduce the replication penalty
Which benefits utilization
Nearby replication servers are more vulnerable to correlated failure
Which reduces utilization

Page 34: NFSv4 Replication for Grid Computing

Run-time model

[state diagram: a job starts and runs; on success it ends; on failure it enters a recover state, from which it either resumes running or fails again]

Page 35: NFSv4 Replication for Grid Computing

Parameters

F: failure-free, single-server run time
C: replication overhead
R: recovery time
pfail: probability of server failure
precover: probability of successful recovery

Page 36: NFSv4 Replication for Grid Computing

F: run time

Failure-free, single-server run time
Can be estimated or measured
Our focus is on 1 to 10 days

Page 37: NFSv4 Replication for Grid Computing

C: replication overhead

Penalty associated with replication to backup servers
Proportional to RTT
Ratio can be measured by running with a backup server a few msec away

Page 38: NFSv4 Replication for Grid Computing

R: recovery time

Time to detect failure of the primary server and switch to a backup server

We assume R << F
Arbitrary but realistic value: 10 minutes

Page 39: NFSv4 Replication for Grid Computing

Failure distributions

Estimated by analyzing PlanetLab ping data:
716 nodes, 349 sites, 25 countries
All-pairs, 15-minute interval
From January 2004 to June 2005
692 nodes were alive throughout
We ascribe missing pings to node failure and network partition

Page 40: NFSv4 Replication for Grid Computing

PlanetLab failure CDF

Page 41: NFSv4 Replication for Grid Computing

Same-site correlated failures

Average same-site failure correlation (columns: sites, rows: nodes):

sites:    259    65     21     11
2 nodes  0.526  0.593  0.552  0.561
3 nodes         0.546  0.440  0.538
4 nodes                0.378  0.488
5 nodes                       0.488

Page 42: NFSv4 Replication for Grid Computing

Different-site correlated failures

[chart: average failure correlation vs. maximum RTT (0 to 200 ms) for groups of 2, 3, 4, and 5 nodes at different sites; correlation falls slowly with RTT, with linear fits y = -0.0002x + 0.1955, y = -0.0002x + 0.155, y = -0.0002x + 0.1335, and y = -0.0002x + 0.1186 for the four curves]

Page 43: NFSv4 Replication for Grid Computing

Run-time model

Discrete event simulation yields expected run time E and utilization (F ÷ E)

[state diagram, repeated from the earlier run-time model slide: start → run → end on success; run → recover on failure, then back to run or fail]
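A minimal sketch of such a simulation, under simplifying assumptions: each attempt costs F + C, a failure interrupts the attempt partway through and costs recovery time R (repeated until recovery succeeds), and the reported utilization is F ÷ E. The parameter names follow the earlier slide (F, C, R, pfail, precover); the failure handling is a simplified reading of the model, not the authors' simulator.

```python
import random

def simulate_utilization(F, C, R, p_fail, p_recover, trials=100_000, seed=0):
    """Toy discrete-event simulation of the run-time model.
    Returns (expected run time E, utilization F / E)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        elapsed = 0.0
        while True:
            if rng.random() >= p_fail:            # failure-free attempt
                elapsed += F + C
                break
            elapsed += rng.uniform(0, F + C)      # work lost at the failure point
            elapsed += R                          # detect failure, switch to backup
            while rng.random() >= p_recover:      # failed recovery: try again
                elapsed += R
        total += elapsed
    E = total / trials
    return E, F / E

# Example: a one-day job, C = 0.01F replication overhead, 10-minute recovery.
E, util = simulate_utilization(F=24 * 3600, C=0.01 * 24 * 3600, R=600,
                               p_fail=0.05, p_recover=0.9)
```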

Page 44: NFSv4 Replication for Grid Computing

Simulated utilization, F = one hour

[charts: simulated utilization vs. RTT to the backup server(s), with one backup server and with four backup servers, for C = 0.1F, 0.01F, 0.001F, 0.0001F, and no backup]

Page 45: NFSv4 Replication for Grid Computing

Simulation results, F = one day

[charts: simulated utilization vs. RTT, with one backup server and with four backup servers, for C = 0.1F, 0.01F, 0.001F, 0.0001F, and no backup]

Page 46: NFSv4 Replication for Grid Computing

Simulation results, F = ten days

[charts: simulated utilization vs. RTT, with one backup server and with four backup servers, for the same range of replication overheads]

Page 47: NFSv4 Replication for Grid Computing

Simulation results discussion

For long-running jobs:
Replication improves utilization
Distant servers improve utilization
For short jobs:
Replication does not improve utilization

In general, multiple backup servers don’t help much

Implications for checkpoint interval …

Page 48: NFSv4 Replication for Grid Computing

Checkpoint interval

[charts: simulated utilization with checkpointing; F = one day, one backup server, 20% checkpoint overhead; and F = ten days, 2% checkpoint overhead, with one backup server and with four backup servers]

Page 49: NFSv4 Replication for Grid Computing

Next steps

Checkpoint overhead?
Replication overhead?
Depends on amount of computation
We measure < 10% for NAS Grid Benchmarks, which do no computation
Refine model:
Account for other failures, because they are common
Other model improvements

Page 50: NFSv4 Replication for Grid Computing

Conclusions

Conventional wisdom holds that consistent mutable replication in large-scale distributed systems is too expensive to consider

Our study proves otherwise

Page 51: NFSv4 Replication for Grid Computing

Conclusions

Consistent replication in large-scale distributed storage systems is feasible and practical
Superior performance
Rigorous adherence to conventional file system semantics
Improves cluster utilization

Page 52: NFSv4 Replication for Grid Computing

Thank you for your attention!
www.citi.umich.edu

Questions?