Distributed hash tables: Protocols and applications
Jinyang Li
Peer to peer systems
• Symmetric in node functionalities
  – Anybody can join and leave
  – Everybody gives and takes
• Great potential
  – Huge aggregate network capacity, storage, etc.
  – Resilient to single points of failure and attack
• E.g. Napster, Gnutella, BitTorrent, Skype, Joost
Motivations: the lookup problem
• How to locate something given its name?
• Examples:
  – File sharing
  – VoIP
  – Content distribution networks (CDN)
Strawman #1: a central search service
• Central service poses a single point of failure or attack
• Not scalable(?): the central service needs $$$
Improved #1: Hierarchical lookup
• Performs hierarchical lookup (like DNS)
• More scalable than a central site
• Root of the hierarchy is still a single point of vulnerability
Strawman #2: Flooding
• Symmetric design: no special servers needed
• Too much overhead traffic
• Limited in scale
• No guarantee of results
[Figure: query "Where is item K?" flooded across all peers]
Improved #2: super peers
• E.g. Kazaa, Skype
• Lookups only flood the super peers
• Still no guarantees for lookup results
DHT approach: routing
• Structured:
  – put(key, value); get(key) → value
  – maintain routing state for fast lookup
• Symmetric:
  – all nodes have identical roles
• Many protocols:
  – Chord, Kademlia, Pastry, Tapestry, CAN, Viceroy
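As a rough illustration of the structured interface (not any particular protocol's API), a DHT node exposes put/get over whatever keys it is responsible for; the class and names below are hypothetical placeholders for the routing a real protocol would do.

# Minimal sketch of the put/get interface a DHT exposes (hypothetical class).
class LocalDHTNode:
    def __init__(self):
        self.store = {}              # key -> value pairs this node is responsible for

    def put(self, key, value):
        # A real DHT would first route to the node responsible for `key`.
        self.store[key] = value

    def get(self, key):
        # A real DHT would route the lookup; here we only check local state.
        return self.store.get(key)

node = LocalDHTNode()
node.put("movie.mp4", "198.51.100.7")
print(node.get("movie.mp4"))         # -> "198.51.100.7"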
DHT ideas
• Scalable routing needs a notion of "closeness"
• Ring structure [Chord]
Different notions of "closeness"
• Tree-like structure [Pastry, Kademlia, Tapestry]
[Figure: a lookup key 01100101 and node IDs grouped by shared prefix, e.g. 011000**, 011001**, 011010**, 011011**]
Distance measured as the length of the longest matching prefix with the lookup key
Different notions of "closeness"
• Cartesian space [CAN]
[Figure: the unit square from (0,0) to (1,1) partitioned into rectangular zones of varying sizes, e.g. (0,0)-(0.5,0.5), (0,0.5)-(0.5,1), (0.5,0.5)-(1,1), (0.5,0.25)-(0.75,0.5), (0.5,0)-(0.75,0.25), (0.75,0)-(1,0.5); one zone per node]
Distance measured as geometric distance
DHT: complete protocol design
1. Name nodes and data items
2. Define data item to node assignment
3. Define per-node routing table contents
4. Handle churn
   • How do new nodes join?
   • How do nodes handle others' failures?
Chord: Naming
• Each node has a unique flat ID
  – E.g. SHA-1(IP address)
• Each data item has a unique flat ID (key)
  – E.g. SHA-1(data content)
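For concreteness, a small sketch of deriving flat IDs with Python's hashlib: the 160-bit SHA-1 digest is interpreted as an integer on the ring. The specific inputs are illustrative.

import hashlib

def flat_id(data: bytes) -> int:
    # 160-bit SHA-1 digest interpreted as an integer ID on the Chord ring.
    return int.from_bytes(hashlib.sha1(data).digest(), "big")

node_id = flat_id(b"192.0.2.15")             # node ID from its IP address
key_id  = flat_id(b"...file contents...")    # key ID from the data content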
Data to node assignment
• Consistent hashing
[Figure: keys and node IDs share one ring; a key's predecessor is the node "closest" to it from below, but its successor is responsible for it]
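A minimal sketch of consistent hashing on the ring, assuming node IDs are integers modulo 2^160: the key is assigned to its successor, the first node ID at or after the key (wrapping around). The toy ring below is only for illustration.

import bisect

def successor(node_ids, key_id, id_space=2**160):
    # node_ids: sorted list of node IDs on the ring.
    # The responsible node is the first node at or after key_id, wrapping around.
    i = bisect.bisect_left(node_ids, key_id % id_space)
    return node_ids[i % len(node_ids)]

nodes = sorted([5, 20, 37, 60])                   # toy IDs on a 6-bit ring
assert successor(nodes, 25, id_space=64) == 37
assert successor(nodes, 61, id_space=64) == 5     # wraps around to the smallest ID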
Key-based lookup (routing)
• Correctness
  – Each node must know its correct successor
Key-based lookup
• Fast lookup
  – A node has O(log n) fingers, exponentially "far" away
  – Node x's i-th finger is at distance 2^i from x
Why is lookup fast?
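Lookup is fast because each hop forwards the query to the finger that most closely precedes the key without passing it; with fingers spaced at powers of two, every hop at least halves the remaining ring distance, so O(log n) hops suffice. A rough sketch of one routing step, assuming integer IDs and a plain list of finger IDs (names illustrative):

def next_hop(node_id, fingers, key_id, id_space):
    # Greedy Chord-style step: among this node's fingers, forward to the one
    # with the smallest remaining clockwise distance to key_id, without overshooting.
    candidates = [f for f in fingers
                  if (key_id - f) % id_space < (key_id - node_id) % id_space]
    if not candidates:
        return node_id                        # no finger is closer; this node's successor owns the key
    return min(candidates, key=lambda f: (key_id - f) % id_space)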
Handling churn: join
• New node w joins an existing network
  – Issue a lookup to find its successor x
  – Set its successor: w.succ = x
  – Set its predecessor: w.pred = x.pred
  – Obtain the subset of responsible items from x
  – Notify the predecessor: x.pred = w
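The same steps in a Python-like sketch, assuming node objects with succ/pred pointers, a lookup(id) that returns the node currently responsible for an ID, and a hypothetical transfer_keys_to helper:

def join(w, existing_node):
    # w is the new node; existing_node is any node already in the ring.
    x = existing_node.lookup(w.id)      # 1. find w's successor
    w.succ = x                          # 2. set w's successor
    w.pred = x.pred                     # 3. set w's predecessor
    w.store = x.transfer_keys_to(w)     # 4. take over the items w is now responsible for
    x.pred = w                          # 5. successor now points back to w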
Handling churn: stabilization
• Each node periodically fixes its state
  – Node w asks its successor x for x.pred
  – If x.pred is "closer" to w than x is, set w.succ = x.pred
• Starting from any connected graph, stabilization eventually makes all nodes find their correct successors
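A sketch of the periodic stabilization step, assuming node objects with id/succ/pred fields; the notify counterpart (the successor adopting w as its predecessor) follows the standard Chord protocol and is included for completeness.

def between(a, b, c, id_space=2**160):
    # True if b lies on the clockwise arc strictly after a and before c.
    return (b - a) % id_space != 0 and (b - a) % id_space < (c - a) % id_space

def stabilize(w):
    # Run periodically by every node w.
    candidate = w.succ.pred
    if candidate is not None and between(w.id, candidate.id, w.succ.id):
        w.succ = candidate                # a closer successor exists; adopt it
    notify(w.succ, w)                     # tell the successor that w may be its predecessor

def notify(x, w):
    # x updates its predecessor if w is closer than the current one.
    if x.pred is None or between(x.pred.id, w.id, x.id):
        x.pred = w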
Handling churn: failures
• Nodes leave without notice
  – Each node keeps s (e.g. 16) successors
  – Replicate keys on the s successors
Handling churn: fixing fingers
• Much easier to maintain fingers
  – Any node at distance [2^i, 2^(i+1)) away will do
  – Geographically closer fingers → lower latency
  – Periodically flush dead fingers
Using DHTs in real systems
• Amazon’s Dynamo key-value storage [SOSP’07]
• Serves as persistent state for applications
  – shopping carts, user preferences
• How does it apply DHT ideas?
  – Assign key-value pairs to responsible nodes
  – Automatic recovery from churn
• New challenges?
  – Manage read-write data consistency
Using DHTs in real systems
• Keyword-based searches for file sharing
  – eMule, Azureus
• How to apply a DHT?
  – A user has file 1f3d… named "jingle bell Britney"
  – Insert mappings: SHA-1(jingle) → 1f3d, SHA-1(bell) → 1f3d, SHA-1(britney) → 1f3d
  – How to answer the queries "jingle bell" and "Britney"?
• Challenges?
  – Some keywords are much more popular than others
  – What if the RIAA inserts a billion junk "Britney spear xyz" entries?
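A sketch of the keyword-indexing idea, assuming a generic dht handle whose put appends values under a key rather than replacing them (a hypothetical interface, not eMule's or Azureus's actual API):

import hashlib

def keyword_key(word: str) -> bytes:
    return hashlib.sha1(word.lower().encode()).digest()

def publish(dht, file_hash: str, name: str):
    # Map every keyword of the file name to the file's content hash.
    for word in name.split():
        dht.put(keyword_key(word), file_hash)     # append semantics assumed

def search(dht, query: str):
    # Answer "jingle bell" by intersecting the result sets of the keywords.
    results = [set(dht.get(keyword_key(w)) or []) for w in query.split()]
    return set.intersection(*results) if results else set()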
Using DHTs in real systems: CoralCDN
Motivation: alleviating flash crowds
[Figure: many browsers request the same page from one origin server; with CoralCDN, each browser goes through an HTTP proxy instead]
• Proxies handle most client requests
• Proxies cooperate to fetch content from each other
Getting content with CoralCDN
• Clients use CoralCDN via a modified domain name: nytimes.com/file → nytimes.com.nyud.net/file
[Figure: three steps between the browser, CoralCDN nodes, and the origin server:
1. Server selection: what CDN node should I use?
2. Lookup(URL): what nodes are caching the URL?
3. Content transmission: from which caching nodes should I download the file?]
Coral design goals
• Don't control data placement
  – Nodes cache based on access patterns
• Reduce load at the origin server
  – Serve files from cached copies whenever possible
• Low end-to-end latency
  – Lookup and cache download optimize for locality
• Scalable and robust lookups
• Given a URL:
  – Where is the data cached?
  – Map name to location: URL → {IP1, IP2, IP3, IP4}
  – lookup(URL): get IPs of nearby caching nodes
  – insert(URL, myIP): add me as a node caching the URL
Lookup for cache locations
• Isn't this what a distributed hash table is for?
[Figure: the node whose ID is closest to SHA-1(URL1) stores URL1 = {IP1, IP2, IP3, IP4}; each caching proxy calls insert(URL1, myIP, TTL) to register itself for TTL seconds]
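A sketch of how a caching proxy would use this index, with coral standing in for a hypothetical client handle exposing lookup() and insert(); MY_IP, ORIGIN, and fetch_from are illustrative.

MY_IP  = "203.0.113.9"            # this proxy's address (illustrative)
ORIGIN = "origin.example.com"     # the origin server (illustrative)

def serve(coral, url, fetch_from, ttl=300):
    # fetch_from(host, url) is a caller-supplied download function.
    peers = coral.lookup(url)                 # IPs of nearby proxies caching this URL
    source = peers[0] if peers else ORIGIN    # no cached copy: fall back to the origin
    body = fetch_from(source, url)
    coral.insert(url, MY_IP, ttl)             # advertise this proxy as a cache for ttl seconds
    return body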
A straightforward use of a DHT
• Problems
  – No load balance for a single URL
  – All inserts and lookups go to the same node (which cannot be close to everyone)
#1 Solve the load balance problem: relax hash-table semantics
• DHTs are designed for hash-table semantics
  – Insert and replace: URL → IPlast
  – Insert and append: URL → {IP1, IP2, IP3, IP4}
• Each Coral lookup needs only a few values
  – lookup(URL) → {IP2, IP4}
  – Preferably ones close in the network
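A sketch of the relaxed semantics: values accumulate per key, and a lookup returns only a few of them, preferably nearby ones. The proximity ordering here is a stand-in for whatever network distance measure is used.

from collections import defaultdict

class AppendIndex:
    def __init__(self):
        self.table = defaultdict(list)    # URL -> list of caching IPs

    def insert(self, url, ip):
        self.table[url].append(ip)        # append, never replace

    def lookup(self, url, want=2, closeness=None):
        ips = self.table[url]
        if closeness:                     # optional: prefer IPs close in the network
            ips = sorted(ips, key=closeness)
        return ips[:want]                 # return only a few values, not the whole set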
Prevent hotspots in the index
• Route convergence
  – O(b) nodes are 1 hop from the root
  – O(b^2) nodes are 2 hops from the root, …
[Figure: lookup paths converge in a tree from leaf nodes (distant IDs) toward the root node (ID closest to the key); # hops: 1, 2, 3]
Prevent hotspots in the index
• Request load increases exponentially towards the root
[Figure: the root node (closest ID) stores URL = {IP1, IP2, IP3, IP4}; leaf nodes (distant IDs) route requests toward it over 1, 2, 3 hops]
Rate-limiting requests
• Refuse to cache if the node already holds the max # of "fresh" IPs per URL
  – Locations of popular items get pushed down the tree
[Figure: the root node (closest ID) stores URL = {IP1, IP2, IP3, IP4}; nodes further down store URL = {IP3, IP4} and URL = {IP5}; leaf nodes (distant IDs); # hops: 1, 2, 3]
Rate-limiting requests
• Refuse to cache if the node already holds the max # of "fresh" IPs per URL
  – Locations of popular items get pushed down the tree
• Except: nodes leak through at most β inserts / min / URL
  – Bounds the rate of inserts towards the root, yet the root stays fresh
[Figure: after rate-limiting, the root node (closest ID) stores URL = {IP1, IP2, IP6, IP4}; lower nodes store URL = {IP3, IP4} and URL = {IP5}; leaf nodes (distant IDs)]
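A simplified sketch of the two rules an insert meets on its way toward the root: store at the first node that is not yet full, otherwise leak toward the root at no more than β inserts per minute per URL. Constants and bookkeeping are illustrative, and the real Coral protocol is more involved.

import time
from collections import defaultdict

MAX_FRESH = 4       # max fresh IPs stored per URL at one node (illustrative)
BETA      = 12      # inserts per minute per URL allowed to continue toward the root

class CoralIndexNode:
    def __init__(self):
        self.values = defaultdict(list)    # URL -> [(ip, expiry), ...]
        self.leaked = defaultdict(list)    # URL -> timestamps of forwarded inserts

    def fresh(self, url):
        now = time.time()
        self.values[url] = [(ip, t) for ip, t in self.values[url] if t > now]
        return self.values[url]

    def insert(self, url, ip, ttl, forward_up):
        if len(self.fresh(url)) < MAX_FRESH:
            self.values[url].append((ip, time.time() + ttl))   # store locally; the insert stops here
            return
        # Node is full: leak at most BETA inserts/min/URL one hop closer to the root.
        now = time.time()
        self.leaked[url] = [t for t in self.leaked[url] if t > now - 60]
        if len(self.leaked[url]) < BETA:
            self.leaked[url].append(now)
            forward_up(url, ip, ttl)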
Load balance results
• 494 nodes on PlanetLab
• Aggregate request rate: ~12 million / min
• Rate limit per node (β): 12 / min
• Root has fan-in from 7 others
[Figure: observed insert rates near the root stay within small multiples of β (1β, 2β, 3β), with the root receiving 7β]
#2 Solve the lookup locality problem
• Cluster nodes hierarchically based on RTT
• Lookups traverse up the hierarchy
  – Route to the "closest" node at each level
Preserve locality through hierarchy
[Figure: the ID space 000… to 111… (distance to key) at three cluster levels with RTT thresholds: < 20 ms, < 60 ms, and none (global)]
• Minimizes lookup latency
• Prefer values stored by nodes within faster clusters
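A sketch of locality-aware lookup across the cluster levels, assuming each level exposes its own index handle and the levels are ordered from fastest (< 20 ms) to the global level; names are illustrative.

def coral_lookup(url, levels):
    # levels: index handles ordered fastest-first, e.g. [cluster_20ms, cluster_60ms, global_level]
    for level in levels:
        ips = level.lookup(url)
        if ips:
            return ips        # prefer values stored by nodes in the faster cluster
    return []                 # not cached anywhere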
Clustering reduces lookup latency
• Reduces median latency by a factor of 4
Putting it all together: Coral reduces server load
• Local disk caches begin to handle most requests
• Most hits are served within the 20-ms Coral cluster
• Few hits go to the origin
• 400+ nodes provide 32 Mbps, about 100x the capacity of the origin