Wide-Area Cooperative Storage with CFS
Presented by Hakim Weatherspoon, CS294-4: Peer-to-Peer Systems
Slides liberally borrowed from the SOSP 2001 CFS presentation, and from High Availability, Scalable Storage, Dynamic Peer Networks: Pick Two by Rodrigo Rodrigues, Charles Blake and Barbara Liskov, presented at HotOS 2003
By Frank Dabek, M. Frans Kaashoek, David Karger, Robert Morris, *Ion Stoica
MIT and *Berkeley
[Diagram: CFS nodes connected to one another over the Internet]
Design Goals
• Spread storage burden evenly (avoid hot spots)
• Tolerate unreliable participants
• Fetch speed comparable to whole-file TCP
• Avoid O(#participants) algorithms
  – Centralized mechanisms [Napster], broadcasts [Gnutella]
• Simplicity
  – Does simplicity imply provable correctness?
    • More precisely, could you build CFS correctly?
  – What about performance?
• CFS attempts to solve these challenges
  – Does it?
CFS Summary
• CFS provides peer-to-peer read-only storage
• Structure: DHash and Chord
• Claims to be efficient, robust, and load-balanced
  – Does CFS achieve any of these qualities?
• It uses block-level distribution
• The prototype is as fast as whole-file TCP
• Storage promise ⇒ redundancy promise ⇒ data must move as members leave! ⇒ lower bound on bandwidth usage
Client-server interface
• Files have unique names
• Files are read-only (single writer, many readers)
• Publishers split files into blocks and place blocks into a hash table
• Clients check files for authenticity [SFSRO]
[Diagram: an FS client inserts/looks up file f through a CFS client, which inserts/looks up individual blocks on CFS server nodes]
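A minimal sketch of this block-level split, assuming hypothetical put_block/get_block helpers on the distributed hash table: the publisher stores each block under the SHA-1 hash of its contents, and the client re-hashes what it fetches to check authenticity.

```python
import hashlib

BLOCK_SIZE = 8192  # CFS uses 8 KByte blocks

def publish(data: bytes, dht) -> list[str]:
    """Split a file into blocks and insert each under its content hash."""
    block_ids = []
    for off in range(0, len(data), BLOCK_SIZE):
        block = data[off:off + BLOCK_SIZE]
        block_id = hashlib.sha1(block).hexdigest()
        dht.put_block(block_id, block)        # hypothetical DHash insert
        block_ids.append(block_id)
    return block_ids

def fetch(block_ids, dht) -> bytes:
    """Fetch blocks and verify each against its ID before reassembly."""
    out = []
    for block_id in block_ids:
        block = dht.get_block(block_id)       # hypothetical DHash lookup
        if hashlib.sha1(block).hexdigest() != block_id:
            raise ValueError("block failed integrity check")
        out.append(block)
    return b"".join(out)
```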
Server Structure
• DHash stores, balances, replicates, and caches blocks
• DHash uses Chord [SIGCOMM 2001] to locate blocks
• Why blocks instead of files?
  – Easier load balance (remember the complexity of PAST)
[Diagram: each node layers DHash over Chord; DHash on Node 1 talks to DHash on Node 2, and Chord to Chord]
CFS file system structure
• The root-block is identified by a public key
  – Signed by the corresponding private key
• Other blocks are identified by the hash of their contents
• What is wrong with this organization?
  – The path of blocks from a data-block up to the root-block must be modified for every update.
  – This is okay because the system is read-only.
[Diagram: the root-block (public key + signature) points via H(D) to directory block D, which points via H(F) to inode block F, which points via H(B1) and H(B2) to data blocks B1 and B2]
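A sketch of the naming scheme: content blocks are named by their hash, while the root block is named by the publisher's public key and carries a signature that clients check. The verify_sig argument stands in for whatever asymmetric signature primitive a deployment provides; it is an assumed hook, not part of CFS's actual API.

```python
import hashlib

def content_block_id(block: bytes) -> str:
    """Ordinary blocks are named by the SHA-1 of their contents (self-certifying)."""
    return hashlib.sha1(block).hexdigest()

def root_block_id(public_key: bytes) -> str:
    """The root block is named by (a hash of) the publisher's public key."""
    return hashlib.sha1(public_key).hexdigest()

def check_root_block(root_block: bytes, signature: bytes, public_key: bytes,
                     verify_sig) -> bool:
    """Accept the root block only if its signature verifies under the public key
    that names it; verify_sig is passed in to keep the sketch library-agnostic."""
    return verify_sig(public_key, root_block, signature)
```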
DHash/Chord Interface
• lookup() returns a list of node IDs closer in ID space to the block ID
  – Sorted, closest first
[Diagram: on a server, DHash calls Chord's Lookup(blockID), which returns a list of <node-ID, IP address> pairs; Chord maintains a finger table of <node-ID, IP address> entries]
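A rough sketch of that interface, assuming a 160-bit circular ID space (the names here are illustrative): candidates are ordered by clockwise distance from the block ID, closest first.

```python
ID_BITS = 160
ID_SPACE = 1 << ID_BITS

def clockwise_distance(block_id: int, node_id: int) -> int:
    """Distance from the block ID, clockwise around the ring, to a node ID."""
    return (node_id - block_id) % ID_SPACE

def lookup(block_id: int, nodes: dict[int, str]) -> list[tuple[int, str]]:
    """Return <node-ID, IP address> pairs sorted closest-first in ID space,
    i.e. the block's immediate successor comes first (illustrative only; the
    real Chord lookup walks finger tables rather than scanning all nodes)."""
    return sorted(nodes.items(), key=lambda kv: clockwise_distance(block_id, kv[0]))
```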
DHash Uses Other Nodes to Locate Blocks
[Diagram: Lookup(BlockID=45) takes three hops across the ring of nodes N5, N10, N20, N40, N50, N60, N68, N80, N99, N110 to reach the block's successor]
Storing Blocks
• Long-term blocks are stored for a fixed time
  – Publishers need to refresh them periodically
• Cache uses LRU
[Diagram: a node's disk is divided into an LRU cache area and long-term block storage]
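A toy sketch of that split, with illustrative names and an assumed storage period: long-term blocks carry an expiry that the publisher must refresh, while cached copies are evicted LRU when the cache area fills.

```python
import time
from collections import OrderedDict

LONG_TERM_TTL = 24 * 3600          # assumed fixed storage period (illustrative)

class BlockStore:
    def __init__(self, cache_capacity: int):
        self.long_term = {}                      # block_id -> (block, expiry time)
        self.cache = OrderedDict()               # block_id -> block, in LRU order
        self.cache_capacity = cache_capacity

    def store_long_term(self, block_id, block):
        """Insert or refresh a published block; publishers call this periodically."""
        self.long_term[block_id] = (block, time.time() + LONG_TERM_TTL)

    def store_cached(self, block_id, block):
        """Cache a copy seen during a lookup, evicting the least recently used."""
        self.cache[block_id] = block
        self.cache.move_to_end(block_id)
        while len(self.cache) > self.cache_capacity:
            self.cache.popitem(last=False)

    def get(self, block_id):
        entry = self.long_term.get(block_id)
        if entry and entry[1] > time.time():     # unexpired long-term copy wins
            return entry[0]
        if block_id in self.cache:
            self.cache.move_to_end(block_id)     # touching keeps it recent
            return self.cache[block_id]
        return None
```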
Replicate blocks at r successors
[Diagram: Block 17 is stored at its successor and replicated at the next r nodes on the ring (N5, N10, N20, N40, N50, N60, N68, N80, N99, N110)]
• r = 2 log N
• Node IDs are SHA-1 of the node's IP address
• Ensures independent replica failure
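A sketch of the placement rule, assuming a sorted ring of node IDs (helper names are illustrative): a node's ID is the SHA-1 of its IP address, and a block's replicas go to the r nodes that immediately succeed the block's ID.

```python
import hashlib
import math
from bisect import bisect_right

def node_id(ip_address: str) -> int:
    """Node IDs are the SHA-1 hash of the node's IP address."""
    return int.from_bytes(hashlib.sha1(ip_address.encode()).digest(), "big")

def replica_set(block_id: int, ring: list[int], n_estimate: int) -> list[int]:
    """Return the r = 2 log N successors of block_id on the sorted ring."""
    r = max(1, int(2 * math.log2(n_estimate)))
    start = bisect_right(ring, block_id)          # index of the first successor
    return [ring[(start + i) % len(ring)] for i in range(min(r, len(ring)))]
```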
Lookups find replicas
[Diagram: Lookup(BlockID=17) reaches the neighborhood of Block 17, and the client falls back to another replica when the first block fetch fails]
RPCs:
1. Lookup step
2. Get successor list
3. Failed block fetch
4. Block fetch
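A sketch of that failover, with hypothetical get_successor_list/fetch_block RPC stubs: after the lookup, the client asks for the successor list and tries each replica in order until a block fetch succeeds.

```python
def fetch_with_failover(block_id, chord, dhash_rpc):
    """Try the block's successors in order; a dead or missing replica simply
    moves us on to the next one (the RPC names here are illustrative stubs)."""
    first_successor = chord.lookup(block_id)                     # RPC 1: lookup step
    candidates = dhash_rpc.get_successor_list(first_successor)   # RPC 2: successor list
    for node in candidates:
        try:
            block = dhash_rpc.fetch_block(node, block_id)        # RPCs 3/4: fetch attempts
            if block is not None:
                return block
        except ConnectionError:
            continue                                             # failed fetch: next replica
    raise KeyError(f"block {block_id} unavailable on all replicas")
```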
First Live Successor Manages Replicas
[Diagram: when Block 17's immediate successor fails, the next node on the ring already holds a copy of 17 and takes over as its manager]
• Node can locally determine that it is the first live successor
DHash Copies to Caches Along Lookup Path
[Diagram: Lookup(BlockID=45) travels through several nodes; after the block is fetched, a copy is sent to caches at nodes along the lookup path]
RPCs:
1. Chord lookup
2. Chord lookup
3. Block fetch
4. Send to cache
Virtual Nodes Allow Heterogeneity
• Hosts may differ in disk/net capacity
• Hosts may advertise multiple IDs
  – Chosen as SHA-1(IP address, index)
  – Each ID represents a “virtual node”
• Host load is proportional to the number of virtual nodes
• Manually controlled
• Sybil attack!
[Diagram: Node A runs virtual nodes N10, N60, and N101; Node B runs a single virtual node N5]
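A sketch of how a host could derive its virtual node IDs, assuming the count is set by hand in proportion to capacity (as the slide notes, letting hosts claim IDs freely is exactly what opens the door to a Sybil attack). The addresses below are just examples.

```python
import hashlib

def virtual_node_ids(ip_address: str, count: int) -> list[int]:
    """Each virtual node ID is SHA-1(IP address, index); a bigger host simply
    advertises more indices and so receives a proportionally larger load."""
    return [
        int.from_bytes(hashlib.sha1(f"{ip_address},{i}".encode()).digest(), "big")
        for i in range(count)
    ]

# Example: a well-provisioned host claims 3 IDs, a small one claims 1.
big_host = virtual_node_ids("18.26.4.9", 3)
small_host = virtual_node_ids("128.84.154.10", 1)
```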
Experiment (12 nodes, pre-PlanetLab)
• One virtual node per host
• 8 KByte blocks
• RPCs use UDP
[Map: the test hosts (CA-T1, CCI, Aros, Utah, CMU, MIT, MA-Cable, Cisco, Cornell, NYU, OR-DSL), with wide-area links to vu.nl, lulea.se, ucl.uk, kaist.kr, and .ve]
• Caching turned off
• Proximity routing turned off
CFS Fetch Time for 1MB File
• Average over the 12 hosts
• No replication, no caching; 8 KByte blocks
[Graph: fetch time (seconds) vs. prefetch window (KBytes)]
Distribution of Fetch Times for 1MB
[Graph: fraction of fetches vs. time (seconds), for 8 KByte, 24 KByte, and 40 KByte prefetch windows]
CFS Fetch Time vs. Whole File TCP
[Graph: fraction of fetches vs. time (seconds), comparing a 40 KByte prefetch window against whole-file TCP]
Robustness vs. Failures
[Graph: failed lookups (fraction) vs. failed nodes (fraction)]
• Six replicas per block; (1/2)^6 ≈ 0.016
Revisit Assumptions
• P2P purist ideals
  – Cooperation, symmetry, decentralization
• How realistic are these assumptions?
• In what domains are they valid?
[Diagram: a spectrum of hosts from fast and stable to slow and flaky, each with 10s to 100s of GB of idle cheap disk, combined into a distributed data store with all the *ilities: high availability, good scalability, high reliability, maintainability, flexibility; a fault-tolerant DHT]
BW for Redundancy Maintenance
• Assume the average system size, N, is stable
  – P(leave)/time = leaves/time/N = 1/lifetime
  – Join rate = leave-forever rate = 1/lifetime
  – Leaves induce redundancy replacement
    • replacement size × replacement rate
  – Joins cost the same
• Maintenance BW > 2 × space/lifetime
  – Space/node < ½ × BW/node × lifetime
• Quality WAN storage scales with WAN BW and member quality
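Restating the slide's back-of-the-envelope bound as a single derivation (no assumptions beyond those above):

```latex
\text{per-node leave rate} = \frac{1}{\text{lifetime}}, \qquad
\text{each leave (and each join) moves} \approx \text{space/node bytes}
\;\;\Longrightarrow\;\;
\mathrm{BW}_{\text{maint}} \;\gtrsim\;
\underbrace{\frac{\text{space/node}}{\text{lifetime}}}_{\text{leaves}}
+ \underbrace{\frac{\text{space/node}}{\text{lifetime}}}_{\text{joins}}
= \frac{2 \cdot \text{space/node}}{\text{lifetime}},
\quad\text{i.e.}\quad
\text{space/node} \;\lesssim\; \tfrac{1}{2}\,\mathrm{BW/node}\cdot\text{lifetime}.
```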
BW for Redundancy Maintenance II
• Maintenance BW: 200 Kbps
• Lifetime = median 2001 Gnutella session = 1 hour
• Served space = 90 MB/node << donatable storage!
Peer Dynamics
• The peer-to-peer “dream”
  – Reliable storage from many unreliable components
• Robust lookup is perceived as critical
• Bandwidth to maintain redundancy is the hard problem [Blake and Rodrigues 03]
Need Too Much BW to Maintain Redundancy
• 10M users; 25% avail.; 1-week membership; 100 GB donation => 50 kbps
• Wait! It gets worse… HW trends
[Diagram: triangle of High Availability, Scalable Storage, Dynamic Membership: must pick two]
Proposal #1
• Server-to-server DHTs
• Reduce to a solved problem? Not really…
  – Self-configuration
  – Symmetry
  – Scalability
  – Dynamic load balance
Proposal #2
• Complete routing information
• Possible complications:
  – Memory requirements
  – Bandwidth [Gupta, Liskov, Rodrigues 03]
  – Load balance
• Multi-hop optimization makes sense only when many very dynamic members serve a little data
  – (Multi-hop is not required if N < per-host memory / 40 bytes)
Proposal #3
• Decouple the networking layer from the data layer
• Add a layer of indirection
  – a.k.a. a distributed directory, location pointers, etc.
• Combines a little of proposals #1 and #2
  – The DHT no longer decides who, what, when, where, why, and how for storage maintenance.
  – Separates policy from mechanism.
Appendix: Chord Hashes a Block ID to its Successor
• Nodes and blocks have randomly distributed IDs
• Successor: the node with the next-highest ID
[Diagram: circular ID space with nodes N10, N32, N60, N80, N100; N10 holds B112, B120, …, B10; N32 holds B11, B30; N60 holds B33, B40, B52; N80 holds B65, B70; N100 holds B100]
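A sketch of the successor rule on a sorted ring (an illustrative helper that matches the example above):

```python
from bisect import bisect_left

def successor(block_id: int, ring: list[int]) -> int:
    """Return the node responsible for block_id: the first node ID on the
    sorted ring that is >= block_id, wrapping past the highest ID."""
    i = bisect_left(ring, block_id)
    return ring[i % len(ring)]

ring = [10, 32, 60, 80, 100]
assert successor(45, ring) == 60      # B45 is stored at N60
assert successor(112, ring) == 10     # wraps around the ring to N10
```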
Appendix: Basic Lookup
• Lookups find the ID’s predecessor
• Correct if successors are correct
[Diagram: a query “Where is block 70?” is forwarded around the ring of nodes N5, N10, N20, N32, N40, N60, N80, N99, N110 and answered “N80”]
Appendix: Successor Lists Ensure Robust Lookup
• Each node stores r successors, r = 2 log N
• Lookup can skip over dead nodes to find blocks
[Diagram: ring of nodes N5, N10, N20, N32, N40, N60, N80, N99, N110, each storing its next three successors, e.g. N5 stores (10, 20, 32), N32 stores (40, 60, 80), N110 stores (5, 10, 20)]
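A sketch of how the successor list provides robustness, assuming an is_alive probe (illustrative): if the immediate successor is dead, the lookup just advances to the next entry in the list.

```python
def first_live_successor(successor_list: list[int], is_alive) -> int:
    """Skip over dead nodes; with r = 2 log N entries, the chance that every
    node in the list is down at once is vanishingly small."""
    for node in successor_list:
        if is_alive(node):
            return node
    raise RuntimeError("all known successors are unreachable")

# Example with the ring above: N40 stores successors (60, 80, 99);
# if N60 is down, a lookup that would have landed on N60 proceeds to N80.
alive = {80, 99}
print(first_live_successor([60, 80, 99], lambda n: n in alive))   # -> 80
```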
Appendix: Chord Finger Table Allows O(log N) Lookups
[Diagram: N80's fingers point ½, ¼, 1/8, 1/16, 1/32, 1/64, and 1/128 of the way around the ring]
• See Chord [SIGCOMM 2001] for finger table maintenance
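A sketch of how the fingers are laid out, assuming an m-bit ID space (illustrative; real Chord also maintains these entries as nodes join and leave): finger i of node n points to the successor of n + 2^i, which is what lets each hop roughly halve the remaining distance and yields O(log N) lookups.

```python
from bisect import bisect_left

M = 7                                  # small ID space (2^7 = 128) for illustration
SPACE = 1 << M

def successor(ident: int, ring: list[int]) -> int:
    i = bisect_left(ring, ident % SPACE)
    return ring[i % len(ring)]

def finger_table(n: int, ring: list[int]) -> list[int]:
    """finger[i] = successor((n + 2^i) mod 2^m) for i = 0 .. m-1."""
    return [successor((n + (1 << i)) % SPACE, ring) for i in range(M)]

ring = [5, 10, 20, 32, 40, 60, 80, 99, 110]
print(finger_table(80, ring))   # fingers reach 1/128, ..., 1/4, 1/2 of the way round
```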