Wide-Area Cooperative Storage with CFS
Presented by Hakim Weatherspoon, CS294-4: Peer-to-Peer Systems
Slides liberally borrowed from the SOSP 2001 CFS presentation, and from High Availability, Scalable Storage, Dynamic Peer Networks: Pick Two by Rodrigo Rodrigues, Charles Blake and Barbara Liskov, presented at HotOS 2003
By Frank Dabek, M. Frans Kaashoek, David Karger, Robert Morris, *Ion Stoica
MIT and *Berkeley
[Diagram: CFS nodes connected to one another over the Internet]
Design Goals
• Spread storage burden evenly (avoid hot spots)
• Tolerate unreliable participants
• Fetch speed comparable to whole-file TCP
• Avoid O(#participants) algorithms
  – Centralized mechanisms [Napster], broadcasts [Gnutella]
• Simplicity
  – Does simplicity imply provable correctness?
    • More precisely, could you build CFS correctly?
  – What about performance?
• CFS attempts to solve these challenges
  – Does it?
CFS Summary
• CFS provides peer-to-peer read-only storage
• Structure: DHash and Chord
• Claims to be efficient, robust, and load-balanced
  – Does CFS achieve any of these qualities?
• It uses block-level distribution
• The prototype is as fast as whole-file TCP
• Storage promise ⇒ redundancy promise ⇒ data must move as members leave! ⇒ lower bound on bandwidth usage
Client-server interface
• Files have unique names
• Files are read-only (single writer, many readers)
• Publishers split files into blocks and place blocks into a hash table
• Clients check files for authenticity [SFSRO]
[Diagram: an FS client inserts/looks up file f through a CFS client, which inserts/looks up individual blocks on CFS server nodes]
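A minimal sketch of this block-level split, assuming hypothetical put_block/get_block helpers on the distributed hash table: the publisher stores each block under the SHA-1 hash of its contents, and the client re-hashes what it fetches to check authenticity.

```python
import hashlib

BLOCK_SIZE = 8192  # CFS uses 8 KByte blocks

def publish(data: bytes, dht) -> list[str]:
    """Split a file into blocks and insert each under its content hash."""
    block_ids = []
    for off in range(0, len(data), BLOCK_SIZE):
        block = data[off:off + BLOCK_SIZE]
        block_id = hashlib.sha1(block).hexdigest()
        dht.put_block(block_id, block)        # hypothetical DHash insert
        block_ids.append(block_id)
    return block_ids

def fetch(block_ids, dht) -> bytes:
    """Fetch blocks and verify each against its ID before reassembly."""
    out = []
    for block_id in block_ids:
        block = dht.get_block(block_id)       # hypothetical DHash lookup
        if hashlib.sha1(block).hexdigest() != block_id:
            raise ValueError("block failed integrity check")
        out.append(block)
    return b"".join(out)
```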
Server Structure
• DHash stores, balances, replicates, and caches blocks
• DHash uses Chord [SIGCOMM 2001] to locate blocks
• Why blocks instead of files?
  – Easier load balance (remember the complexity of PAST)
[Diagram: each node layers DHash over Chord; DHash on Node 1 talks to DHash on Node 2, and Chord to Chord]
CFS file system structure
• The root-block is identified by a public key
  – Signed by the corresponding private key
• Other blocks are identified by the hash of their contents
• What is wrong with this organization?
  – The path of blocks from a data-block up to the root-block must be modified for every update.
  – This is okay because the system is read-only.
[Diagram: the root-block (public key + signature) points via H(D) to directory block D, which points via H(F) to inode block F, which points via H(B1) and H(B2) to data blocks B1 and B2]
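A sketch of the naming scheme: content blocks are named by their hash, while the root block is named by the publisher's public key and carries a signature that clients check. The verify_sig argument stands in for whatever asymmetric signature primitive a deployment provides; it is an assumed hook, not part of CFS's actual API.

```python
import hashlib

def content_block_id(block: bytes) -> str:
    """Ordinary blocks are named by the SHA-1 of their contents (self-certifying)."""
    return hashlib.sha1(block).hexdigest()

def root_block_id(public_key: bytes) -> str:
    """The root block is named by (a hash of) the publisher's public key."""
    return hashlib.sha1(public_key).hexdigest()

def check_root_block(root_block: bytes, signature: bytes, public_key: bytes,
                     verify_sig) -> bool:
    """Accept the root block only if its signature verifies under the public key
    that names it; verify_sig is passed in to keep the sketch library-agnostic."""
    return verify_sig(public_key, root_block, signature)
```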
DHash/Chord Interface
• lookup() returns a list of node IDs closer in ID space to the block ID
  – Sorted, closest first
[Diagram: on a server, DHash calls Chord's Lookup(blockID), which returns a list of <node-ID, IP address> pairs; Chord maintains a finger table of <node-ID, IP address> entries]
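A rough sketch of that interface, assuming a 160-bit circular ID space (the names here are illustrative): candidates are ordered by clockwise distance from the block ID, closest first.

```python
ID_BITS = 160
ID_SPACE = 1 << ID_BITS

def clockwise_distance(block_id: int, node_id: int) -> int:
    """Distance from the block ID, clockwise around the ring, to a node ID."""
    return (node_id - block_id) % ID_SPACE

def lookup(block_id: int, nodes: dict[int, str]) -> list[tuple[int, str]]:
    """Return <node-ID, IP address> pairs sorted closest-first in ID space,
    i.e. the block's immediate successor comes first (illustrative only; the
    real Chord lookup walks finger tables rather than scanning all nodes)."""
    return sorted(nodes.items(), key=lambda kv: clockwise_distance(block_id, kv[0]))
```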
DHash Uses Other Nodes to Locate Blocks
[Diagram: Lookup(BlockID=45) takes three hops across the ring of nodes N5, N10, N20, N40, N50, N60, N68, N80, N99, N110 to reach the block's successor]
Storing Blocks
• Long-term blocks are stored for a fixed time
  – Publishers need to refresh them periodically
• Cache uses LRU
[Diagram: a node's disk is divided into an LRU cache area and long-term block storage]
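A toy sketch of that split, with illustrative names and an assumed storage period: long-term blocks carry an expiry that the publisher must refresh, while cached copies are evicted LRU when the cache area fills.

```python
import time
from collections import OrderedDict

LONG_TERM_TTL = 24 * 3600          # assumed fixed storage period (illustrative)

class BlockStore:
    def __init__(self, cache_capacity: int):
        self.long_term = {}                      # block_id -> (block, expiry time)
        self.cache = OrderedDict()               # block_id -> block, in LRU order
        self.cache_capacity = cache_capacity

    def store_long_term(self, block_id, block):
        """Insert or refresh a published block; publishers call this periodically."""
        self.long_term[block_id] = (block, time.time() + LONG_TERM_TTL)

    def store_cached(self, block_id, block):
        """Cache a copy seen during a lookup, evicting the least recently used."""
        self.cache[block_id] = block
        self.cache.move_to_end(block_id)
        while len(self.cache) > self.cache_capacity:
            self.cache.popitem(last=False)

    def get(self, block_id):
        entry = self.long_term.get(block_id)
        if entry and entry[1] > time.time():     # unexpired long-term copy wins
            return entry[0]
        if block_id in self.cache:
            self.cache.move_to_end(block_id)     # touching keeps it recent
            return self.cache[block_id]
        return None
```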
Replicate blocks at r successors
[Diagram: Block 17 is stored at its successor and replicated at the next r nodes on the ring (N5, N10, N20, N40, N50, N60, N68, N80, N99, N110)]
• r = 2 log N
• Node IDs are SHA-1 of the node's IP address
• Ensures independent replica failure
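A sketch of the placement rule, assuming a sorted ring of node IDs (helper names are illustrative): a node's ID is the SHA-1 of its IP address, and a block's replicas go to the r nodes that immediately succeed the block's ID.

```python
import hashlib
import math
from bisect import bisect_right

def node_id(ip_address: str) -> int:
    """Node IDs are the SHA-1 hash of the node's IP address."""
    return int.from_bytes(hashlib.sha1(ip_address.encode()).digest(), "big")

def replica_set(block_id: int, ring: list[int], n_estimate: int) -> list[int]:
    """Return the r = 2 log N successors of block_id on the sorted ring."""
    r = max(1, int(2 * math.log2(n_estimate)))
    start = bisect_right(ring, block_id)          # index of the first successor
    return [ring[(start + i) % len(ring)] for i in range(min(r, len(ring)))]
```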
Lookups find replicas
[Diagram: Lookup(BlockID=17) reaches the neighborhood of Block 17, and the client falls back to another replica when the first block fetch fails]
RPCs:
1. Lookup step
2. Get successor list
3. Failed block fetch
4. Block fetch
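A sketch of that failover, with hypothetical get_successor_list/fetch_block RPC stubs: after the lookup, the client asks for the successor list and tries each replica in order until a block fetch succeeds.

```python
def fetch_with_failover(block_id, chord, dhash_rpc):
    """Try the block's successors in order; a dead or missing replica simply
    moves us on to the next one (the RPC names here are illustrative stubs)."""
    first_successor = chord.lookup(block_id)                     # RPC 1: lookup step
    candidates = dhash_rpc.get_successor_list(first_successor)   # RPC 2: successor list
    for node in candidates:
        try:
            block = dhash_rpc.fetch_block(node, block_id)        # RPCs 3/4: fetch attempts
            if block is not None:
                return block
        except ConnectionError:
            continue                                             # failed fetch: next replica
    raise KeyError(f"block {block_id} unavailable on all replicas")
```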
First Live Successor Manages Replicas
[Diagram: when Block 17's immediate successor fails, the next node on the ring already holds a copy of 17 and takes over as its manager]
• Node can locally determine that it is the first live successor
DHash Copies to Caches Along Lookup Path
[Diagram: Lookup(BlockID=45) travels through several nodes; after the block is fetched, a copy is sent to caches at nodes along the lookup path]
RPCs:
1. Chord lookup
2. Chord lookup
3. Block fetch
4. Send to cache
Virtual Nodes Allow Heterogeneity
• Hosts may differ in disk/net capacity
• Hosts may advertise multiple IDs
  – Chosen as SHA-1(IP address, index)
  – Each ID represents a “virtual node”
• Host load is proportional to the number of virtual nodes
• Manually controlled
• Sybil attack!
[Diagram: Node A runs virtual nodes N10, N60, and N101; Node B runs a single virtual node N5]
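A sketch of how a host could derive its virtual node IDs, assuming the count is set by hand in proportion to capacity (as the slide notes, letting hosts claim IDs freely is exactly what opens the door to a Sybil attack). The addresses below are just examples.

```python
import hashlib

def virtual_node_ids(ip_address: str, count: int) -> list[int]:
    """Each virtual node ID is SHA-1(IP address, index); a bigger host simply
    advertises more indices and so receives a proportionally larger load."""
    return [
        int.from_bytes(hashlib.sha1(f"{ip_address},{i}".encode()).digest(), "big")
        for i in range(count)
    ]

# Example: a well-provisioned host claims 3 IDs, a small one claims 1.
big_host = virtual_node_ids("18.26.4.9", 3)
small_host = virtual_node_ids("128.84.154.10", 1)
```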
Experiment (12 nodes, pre-PlanetLab)
• One virtual node per host
• 8 KByte blocks
• RPCs use UDP
[Map: the test hosts (CA-T1, CCI, Aros, Utah, CMU, MIT, MA-Cable, Cisco, Cornell, NYU, OR-DSL), with wide-area links to vu.nl, lulea.se, ucl.uk, kaist.kr, and .ve]
• Caching turned off
• Proximity routing turned off
CFS Fetch Time for 1MB File
• Average over the 12 hosts
• No replication, no caching; 8 KByte blocks
[Graph: fetch time (seconds) vs. prefetch window (KBytes)]
Distribution of Fetch Times for 1MB
[Graph: fraction of fetches vs. time (seconds), for 8 KByte, 24 KByte, and 40 KByte prefetch windows]
CFS Fetch Time vs. Whole File TCP
[Graph: fraction of fetches vs. time (seconds), comparing a 40 KByte prefetch window against whole-file TCP]
Robustness vs. Failures
[Graph: failed lookups (fraction) vs. failed nodes (fraction)]
• Six replicas per block; (1/2)^6 ≈ 0.016
Revisit Assumptions
• P2P purist ideals
  – Cooperation, symmetry, decentralization
• How realistic are these assumptions?
• In what domains are they valid?
[Diagram: a spectrum of hosts from fast and stable to slow and flaky, each with 10s to 100s of GB of idle cheap disk, combined into a distributed data store with all the *ilities: high availability, good scalability, high reliability, maintainability, flexibility; a fault-tolerant DHT]
BW for Redundancy Maintenance
• Assume the average system size, N, is stable
  – P(leave)/time = leaves/time/N = 1/lifetime
  – Join rate = leave-forever rate = 1/lifetime
  – Leaves induce redundancy replacement
    • replacement size × replacement rate
  – Joins cost the same
• Maintenance BW > 2 × space/lifetime
  – Space/node < ½ × BW/node × lifetime
• Quality WAN storage scales with WAN BW and member quality
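Restating the slide's back-of-the-envelope bound as a single derivation (no assumptions beyond those above):

```latex
\text{per-node leave rate} = \frac{1}{\text{lifetime}}, \qquad
\text{each leave (and each join) moves} \approx \text{space/node bytes}
\;\;\Longrightarrow\;\;
\mathrm{BW}_{\text{maint}} \;\gtrsim\;
\underbrace{\frac{\text{space/node}}{\text{lifetime}}}_{\text{leaves}}
+ \underbrace{\frac{\text{space/node}}{\text{lifetime}}}_{\text{joins}}
= \frac{2 \cdot \text{space/node}}{\text{lifetime}},
\quad\text{i.e.}\quad
\text{space/node} \;\lesssim\; \tfrac{1}{2}\,\mathrm{BW/node}\cdot\text{lifetime}.
```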
BW for Redundancy Maintenance II
• Maintenance BW: 200 Kbps
• Lifetime = median 2001 Gnutella session = 1 hour
• Served space = 90 MB/node << donatable storage!
Peer Dynamics
• The peer-to-peer “dream”
  – Reliable storage from many unreliable components
• Robust lookup is perceived as critical
• Bandwidth to maintain redundancy is the hard problem [Blake and Rodrigues 03]
Need Too Much BW to Maintain Redundancy
• 10M users; 25% avail.; 1-week membership; 100 GB donation => 50 kbps
• Wait! It gets worse… HW trends
[Diagram: triangle of High Availability, Scalable Storage, Dynamic Membership: must pick two]
Proposal #1
• Server-to-server DHTs
• Reduce to a solved problem? Not really…
  – Self-configuration
  – Symmetry
  – Scalability
  – Dynamic load balance
Proposal #2
• Complete routing information
• Possible complications:
  – Memory requirements
  – Bandwidth [Gupta, Liskov, Rodrigues 03]
  – Load balance
• Multi-hop optimization makes sense only when many very dynamic members serve a little data
  – (Multi-hop is not required if N < per-host memory / 40 bytes)
Proposal #3
• Decouple the networking layer from the data layer
• Add a layer of indirection
  – a.k.a. a distributed directory, location pointers, etc.
• Combines a little of proposals #1 and #2
  – The DHT no longer decides who, what, when, where, why, and how for storage maintenance.
  – Separates policy from mechanism.
Appendix: Chord Hashes a Block ID to its Successor
• Nodes and blocks have randomly distributed IDs
• Successor: the node with the next-highest ID
[Diagram: circular ID space with nodes N10, N32, N60, N80, N100; N10 holds B112, B120, …, B10; N32 holds B11, B30; N60 holds B33, B40, B52; N80 holds B65, B70; N100 holds B100]
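A sketch of the successor rule on a sorted ring (an illustrative helper that matches the example above):

```python
from bisect import bisect_left

def successor(block_id: int, ring: list[int]) -> int:
    """Return the node responsible for block_id: the first node ID on the
    sorted ring that is >= block_id, wrapping past the highest ID."""
    i = bisect_left(ring, block_id)
    return ring[i % len(ring)]

ring = [10, 32, 60, 80, 100]
assert successor(45, ring) == 60      # B45 is stored at N60
assert successor(112, ring) == 10     # wraps around the ring to N10
```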
Appendix: Basic Lookup
• Lookups find the ID’s predecessor
• Correct if successors are correct
[Diagram: a query “Where is block 70?” is forwarded around the ring of nodes N5, N10, N20, N32, N40, N60, N80, N99, N110 and answered “N80”]
Appendix: Successor Lists Ensure Robust Lookup
• Each node stores r successors, r = 2 log N
• Lookup can skip over dead nodes to find blocks
[Diagram: ring of nodes N5, N10, N20, N32, N40, N60, N80, N99, N110, each storing its next three successors, e.g. N5 stores (10, 20, 32), N32 stores (40, 60, 80), N110 stores (5, 10, 20)]
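A sketch of how the successor list provides robustness, assuming an is_alive probe (illustrative): if the immediate successor is dead, the lookup just advances to the next entry in the list.

```python
def first_live_successor(successor_list: list[int], is_alive) -> int:
    """Skip over dead nodes; with r = 2 log N entries, the chance that every
    node in the list is down at once is vanishingly small."""
    for node in successor_list:
        if is_alive(node):
            return node
    raise RuntimeError("all known successors are unreachable")

# Example with the ring above: N40 stores successors (60, 80, 99);
# if N60 is down, a lookup that would have landed on N60 proceeds to N80.
alive = {80, 99}
print(first_live_successor([60, 80, 99], lambda n: n in alive))   # -> 80
```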
Appendix: Chord Finger Table Allows O(log N) Lookups
[Diagram: N80's fingers point ½, ¼, 1/8, 1/16, 1/32, 1/64, and 1/128 of the way around the ring]
• See Chord [SIGCOMM 2001] for finger table maintenance
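A sketch of how the fingers are laid out, assuming an m-bit ID space (illustrative; real Chord also maintains these entries as nodes join and leave): finger i of node n points to the successor of n + 2^i, which is what lets each hop roughly halve the remaining distance and yields O(log N) lookups.

```python
from bisect import bisect_left

M = 7                                  # small ID space (2^7 = 128) for illustration
SPACE = 1 << M

def successor(ident: int, ring: list[int]) -> int:
    i = bisect_left(ring, ident % SPACE)
    return ring[i % len(ring)]

def finger_table(n: int, ring: list[int]) -> list[int]:
    """finger[i] = successor((n + 2^i) mod 2^m) for i = 0 .. m-1."""
    return [successor((n + (1 << i)) % SPACE, ring) for i in range(M)]

ring = [5, 10, 20, 32, 40, 60, 80, 99, 110]
print(finger_table(80, ring))   # fingers reach 1/128, ..., 1/4, 1/2 of the way round
```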