Scaling Data Servers via Cooperative Caching
Siddhartha Annapureddy
New York University
Committee: David Mazières, Lakshminarayanan Subramanian, Jinyang Li,
Dennis Shasha, Zvi Kedem
Siddhartha Annapureddy Thesis Defense
Motivation
New functionality over the Internet
- Large numbers of users consume many (large) files
  - YouTube, Azureus, etc. for movies – DVDs no more
- Data dissemination to large distributed farms
Current approaches
[Diagram: central server with three clients]
- Scalability ↓ – the server's uplink saturates
- Operations take a long time to finish
Current approaches (contd.)
- Server farms replace the central server
  - Scalability problem persists
  - Notoriously expensive to maintain
- Content Distribution Networks (CDNs)
  - Still quite expensive
- Peer-to-peer (P2P) systems
  - A new development
Peer-to-peer (P2P) systems
- Low-end central server
- Large client population
- File usually divided into chunks/blocks
- Use client resources to disseminate the file
  - Bandwidth, disk space, etc.
- Clients organized into a connected overlay
  - Each client knows only a few other clients

How to design P2P systems to scale data servers?
P2P scalability – Design principles
To achieve scalability, we need to avoid hotspots:
- Keep the bandwidth load on the server to a minimum
  - The server gives data only to a few users
  - Leverage participants to distribute to other users
- Keep the maintenance overhead for users low
- Keep the bandwidth load on users to a minimum
  - Each node interacts with only a small subset of users
P2P systems – Themes
- Usability – Exploiting P2P for a range of systems
  - File systems, user-level apps, etc.
- Security issues in such wide-area P2P systems
  - Providing data integrity, privacy, authentication
- Cross file(system) sharing – Use other groups
  - Make use of all available sources
- Topology management – Arrange nodes efficiently
  - Underlying network (locality properties)
  - Node capacities (bandwidth, disk space, etc.)
  - Application-specific (certain nodes work well together)
- Scheduling dissemination of chunks efficiently
  - Nodes only have partial information
Shark – Motivation
Scenario – Want to test a new application on a hundred nodes
[Diagram: ten workstations]

Problem – Need to push software to all nodes and collect results
Current approaches
Distribute from a central server
[Diagram: server and clients across Network A and Network B]

rsync, unison, scp
– Server's network uplink saturates
– Wastes bandwidth along bottleneck links
Current approaches
File distribution mechanisms
BitTorrent, Bullet
+ Scales by offloading the burden from the server
– Clients may download from half-way across the world
Inherent problems with copying files
Users have to decide a priori what to ship
- Ship too much – wastes bandwidth, takes a long time
- Ship too little – hassle to work in a poor environment

Idle environments consume disk space
- Users are loath to clean up ⇒ redistribution
- Need a low-cost solution for refetching files

Manual management of software versions
- Point /usr/local to a central server
- Illusion of having the full development environment
- Programs fetched transparently on demand
Networked file systems
+ Know how to deploy these systems
+ Know how to administer such systems
+ Simple accountability mechanisms
E.g., NFS, AFS, SFS
Problem: Scalability
[Graph: time to finish a 40 MB read (sec) vs. number of unique nodes, SFS vs. Shark]
- SFS: very slow, ≈ 775 s
- Shark: much better, ≈ 88 s (9x faster)
Let’s design this new filesystem
[Diagram: client and central server]
Vanilla SFS
- Central server model
Scalability – Client-side caching
[Diagram: client with local cache, central server]
Large client cache à la AFS
- Whole-file caching
- Scales by avoiding redundant data transfers
- Leases to ensure NFS-style cache consistency
Scalability – Cooperative caching
Clients fetch data from each other and offload burden from server
[Diagram: clients fetching from each other and from the server]
- Shark clients maintain a distributed index
- Fetch a file from multiple other clients in parallel
- Read-heavy workloads – writes still go to the server
P2P systems – Themes
- Usability – Exploiting P2P for a range of systems
- Cross file(system) sharing – Use other groups
- Security issues in such wide-area P2P systems
- Topology management – Arrange nodes efficiently
- Scheduling dissemination of chunks efficiently
Cross file(system) sharing – Application
Linux distribution with LiveCDs
[Diagram: Knoppix and Slax clients running Sharkcd, with a Knoppix mirror]
- LiveCD – Run an entire OS without using the hard disk
  - But all your programs must fit on a CD-ROM
- Download dynamically from a server, but scalability problems
- Knoppix and Slax users form a global cache – relieves the servers
Cross file(system) sharing
Global cooperative cache regardless of origin servers
[Diagram: two client groups accessing the Slax and Knoppix servers]
- Two groups of clients accessing the Slax/Knoppix servers
- The client groups share a large amount of software
- Such clients automatically form a global cache
- BitTorrent, by contrast, supports only per-file groups
Content-based chunking
[Diagram: clients and server sharing a single overlay]
- A single overlay for all filesystems, with content-based chunks
  - Chunks – variable-sized blocks whose boundaries are based on content
- Chunks are better than fixed blocks – they exploit commonalities across files
  - LBFS-style chunks are preserved across versions and concatenations

[Muthitacharoen et al. A Low-Bandwidth Network File System. SOSP '01]
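The LBFS-style chunking referred to above can be sketched as follows. This is a simplified stand-in: LBFS itself uses Rabin fingerprints with a true rolling hash, and the window size, mask, and chunk-size limits here are illustrative.

```python
import hashlib

WINDOW = 48               # bytes examined at each position
MASK = (1 << 13) - 1      # expected chunk size about 8 KiB
MIN_CHUNK, MAX_CHUNK = 2 * 1024, 64 * 1024

def chunk_boundaries(data: bytes):
    """Yield (start, end) offsets of content-defined chunks.

    A boundary is declared when the hash of the last WINDOW bytes matches
    MASK, so boundaries depend only on local content and survive
    insertions or deletions elsewhere in the file."""
    start = 0
    for i in range(len(data)):
        length = i - start + 1
        window = data[max(0, i - WINDOW + 1):i + 1]
        h = int.from_bytes(hashlib.sha1(window).digest()[:4], "big")
        if length >= MAX_CHUNK or (length >= MIN_CHUNK and (h & MASK) == MASK):
            yield (start, i + 1)
            start = i + 1
    if start < len(data):
        yield (start, len(data))
```

Because a cut point depends only on the surrounding bytes, two files that share a run of data (different versions, or a concatenation) produce identical chunks for that run, which is what lets Shark index them under the same tokens.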
P2P systems – Themes
- Usability – Exploiting P2P for a range of systems
- Cross file(system) sharing – Use other groups
- Security issues in such wide-area P2P systems
- Topology management – Arrange nodes efficiently
- Scheduling dissemination of chunks efficiently
Security issues – Client authentication
[Diagram: traditional server authentication vs. an authenticator derived from the token]
- Traditionally, the server authenticated read requests using uids
- Challenge – interaction between mutually distrustful clients
Security issues – Client communication
- A client should be able to check the integrity of a downloaded chunk
- A client should not send chunks to unauthorized clients
- An eavesdropper shouldn't be able to obtain chunk contents
- Accomplish this without a complex PKI
- Keep the number of protocol rounds to a minimum
File metadata
Client → Server: GET_TOKEN(fh, offset, count); Server → Client: file metadata with chunk tokens

Chunk tokens
- Possession of a token implies the server's permission to read
- Tokens are a shared secret between authorized clients
- Tokens can be used to check the integrity of fetched data

Given a chunk B...
- Chunk token T_B = H(B)
- H is a collision-resistant hash function
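The token scheme above can be sketched directly; SHA-256 is used here as a stand-in for the collision-resistant hash H, which these slides leave unspecified.

```python
import hashlib, hmac

def chunk_token(chunk: bytes) -> bytes:
    """T_B = H(B): the token is both a read capability and an integrity check."""
    return hashlib.sha256(chunk).digest()

def verify_chunk(chunk: bytes, token: bytes) -> bool:
    """Integrity check: H(downloaded chunk) == T_B (constant-time compare)."""
    return hmac.compare_digest(hashlib.sha256(chunk).digest(), token)
```

Because T_B is derived from the chunk contents, a client holding the token can both prove it was authorized to read the chunk and verify whatever it later downloads from peers.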
Security protocol
[Message flow: Receiver C → Source P: R_C; P → C: R_P; C → P: I_B, Auth_C; P → C: E_K(B)]
- R_C, R_P – random nonces to ensure freshness
- Auth_C – authenticator proving the receiver has the token
  - Auth_C = MAC(T_B, "Auth C", C, P, R_C, R_P)
- K – key to encrypt the chunk contents
  - K = MAC(T_B, "Encryption", C, P, R_C, R_P)
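The two derivations can be sketched as below. HMAC-SHA256 and the field encoding are assumptions for illustration (the slides specify only a MAC keyed on T_B), and the identities and nonces are hypothetical.

```python
import hashlib, hmac, os

def mac(token: bytes, *fields: bytes) -> bytes:
    """MAC keyed on the chunk token T_B over labelled fields."""
    return hmac.new(token, b"|".join(fields), hashlib.sha256).digest()

# Hypothetical principals and per-session nonces.
T_B = hashlib.sha256(b"chunk B contents").digest()
C, P = b"receiver-addr", b"source-addr"
R_C, R_P = os.urandom(16), os.urandom(16)

# Auth_C proves the receiver holds T_B; K encrypts the chunk transfer.
auth_C = mac(T_B, b"Auth C", C, P, R_C, R_P)
K = mac(T_B, b"Encryption", C, P, R_C, R_P)
```

Both values are bound to the nonces, so neither can be replayed in a later session, and both are computable only by parties that already hold T_B.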
Security properties
[Message flow: Receiver C and Source P exchange R_C, R_P; then I_B, Auth_C; then E_K(B)]

The client can check the integrity of a downloaded chunk
- The client checks H(downloaded chunk) =? T_B

A source does not send chunks to unauthorized clients
- Malicious clients cannot produce a correct Auth_C

An eavesdropper doesn't get the chunk contents
- All communication is encrypted with K
P2P systems – Themes
- Usability – Exploiting P2P for a range of systems
- Cross file(system) sharing – Use other groups
- Security issues in such wide-area P2P systems
- Topology management – Arrange nodes efficiently
- Scheduling dissemination of chunks efficiently
Topology management
[Diagram: client fetches metadata from the server, then looks up other clients]
- Fetch metadata from the server
- Look up clients caching the needed chunks in the overlay
- The overlay is common to all file systems
- The overlay is organized like a DHT – Get and Put
Overlay – Get operation
[Diagram: client discovers other clients caching chunk B]
- For every chunk B, there is an indexing key I_B
  - I_B is used to index clients caching B
  - Cannot set I_B = T_B, as T_B is a secret
- I_B = MAC(T_B, "Indexing Key")
Overlay – Put operation
Register as a source
- The client now becomes a source for the downloaded chunk
- The client registers in the distributed index – PUT(I_B, Addr)

Chunk reconciliation
- Reuse the TCP connection to download more chunks
- Exchange mutually needed chunks without indexing overhead
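The indexing key and the Get/Put interface can be sketched together. The dictionary-backed index is a toy stand-in for the distributed overlay, and HMAC-SHA256 is an assumed MAC.

```python
import hashlib, hmac

def indexing_key(token: bytes) -> bytes:
    """I_B = MAC(T_B, "Indexing Key"): safe to publish for locating
    sources of a chunk without revealing the secret token T_B."""
    return hmac.new(token, b"Indexing Key", hashlib.sha256).digest()

class DistributedIndex:
    """Toy stand-in for the DHT-like overlay: Put registers a client as a
    source for a chunk; Get returns the known sources."""
    def __init__(self):
        self._sources = {}

    def put(self, key: bytes, addr: str) -> None:
        self._sources.setdefault(key, set()).add(addr)

    def get(self, key: bytes) -> set:
        return self._sources.get(key, set())
```

After downloading chunk B, a client calls put(indexing_key(T_B), its_address), which is what makes it discoverable by later readers of the same chunk.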
Overlay – Locality awareness
[Diagram: clients clustered within Network A and Network B]
- The overlay is organized as clusters based on latency
- Lookups preferentially return sources in the same cluster as the client
- Hence, chunks are usually transferred from nearby clients

[Freedman et al. Democratizing content publication with Coral. NSDI '04]
Topology management – Shark vs. BitTorrent
BitTorrent
- Neighbours "fixed"
- Makes cross-file sharing difficult

Shark
- Chunk-based overlay
- Neighbours based on data
- Enables cross-file sharing

[Diagram: server and clients in Shark's chunk-based overlay]
Evaluation
- How does Shark compare with SFS? With NFS?
- How scalable is the server?
- How fair is Shark across clients?
- Which fetch order is better – random or sequential?
Emulab – 100 Nodes on LAN
[Graph: percentage of clients that completed a 40 MB read within a given time (sec) – Shark (rand, negotiation), NFS, SFS]
- Shark – 88 s
- SFS – 775 s (Shark ≈ 9x better); NFS – 350 s (Shark ≈ 4x better)
- SFS is less fair because of TCP backoffs
PlanetLab – 185 Nodes
[Graph: percentage of clients that completed a 40 MB read within a given time (sec) – Shark vs. SFS]
- Shark ≈ 7 min at the 95th percentile
- SFS ≈ 39 min at the 95th percentile (Shark 5x better)
- NFS – triggered kernel panics in the server
Data pushed by server
[Bar chart: normalized bandwidth pushed by the server – SFS 7400 vs. Shark 954]
- Shark vs. SFS – the server pushed 23 file copies vs. 185 copies (8x better)
- The overlay was not responsive enough
- Torn connections and timeouts force clients back to the server
Data served by Clients
[Graph: bandwidth transmitted (MB) by each unique Shark proxy on PlanetLab, 40 MB file]
- Maximum contribution ≈ 3.5 copies
- Median contribution ≈ 0.75 copies
P2P systems – Themes
- Usability – Exploiting P2P for a range of systems
- Cross file(system) sharing – Use other groups
- Security issues in such wide-area P2P systems
- Topology management – Arrange nodes efficiently
- Scheduling dissemination of chunks efficiently
Fetching Chunks – Order Matters
- Scheduling problem – In what order should the chunks of a file be fetched?
- Natural choices – random or sequential

Intuitively, when many clients start simultaneously:
- Random
  - All clients fetch independent chunks
  - More chunks become available in the cooperative cache
- Sequential
  - Better disk I/O scheduling on the server
  - Only the client that has downloaded the most chunks adds new chunks to the cache
Emulab – 100 Nodes on LAN
[Graph: percentage of clients completing a 40 MB read within a given time – Shark rand vs. Shark seq]
- Random – 133 s
- Sequential – 203 s
- Random wins – 35% better
Performance characteristics of workloads
- For Shark, what matters is throughput
  - The operation should finish as soon as possible
[Plot: utility vs. throughput]
- Another interesting class of applications
  - Video streaming – RedCarpet
[Plot: utility vs. throughput]
- Study the scheduling problem for Video-on-Demand
  - Press a button, wait a little while, and watch playback
Challenges for VoD
- Near VoD
  - Small setup time before the video plays
- High sustainable goodput
  - The largest slope of a line osculating the block-arrival curve
  - The highest video encoding rate the system can support

Goal: do block scheduling to achieve low setup time and high goodput
System design – Outline
- Components based on BitTorrent
  - Central server (tracker + seed)
  - Users interested in the data
- Central server
  - Maintains a list of nodes
- Each node finds neighbours
  - Contacts the server for this
  - Another option – use gossip protocols
- Nodes exchange data with neighbours
  - Connections are bi-directional
Simulator – Outline
- All nodes have unit bandwidth capacity
- 500 nodes in most of our experiments
- Each node has 4-6 neighbours
- The simulator is round-based
  - Block exchanges are done at each round
  - The system moves as a whole to the next round
- The file is divided into 250 blocks
  - A segment is a set of consecutive blocks; segment size = 10
- Maximum possible goodput – 1 block/round
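The round-based setup above can be sketched as a toy simulator. This is a reconstruction, not the thesis simulator: it ignores per-node upload caps, uses a pluggable block-selection policy, and shrinks the defaults so it runs quickly.

```python
import random

NUM_BLOCKS = 250  # file divided into 250 blocks, as in the slides

def simulate(num_nodes=50, degree=4, rounds=400, policy=None, seed=1):
    """Round-based exchange: each round the server seeds one node with one
    block, then every node pulls one useful block from a random neighbour."""
    rng = random.Random(seed)
    # default policy: pick a random needed block (the "Random" policy)
    policy = policy or (lambda have, offered: rng.choice(sorted(offered)))
    have = [set() for _ in range(num_nodes)]
    neighbours = [rng.sample([j for j in range(num_nodes) if j != i], degree)
                  for i in range(num_nodes)]
    all_blocks = set(range(NUM_BLOCKS))
    for r in range(rounds):
        target = r % num_nodes           # low-end server serves one node/round
        missing = all_blocks - have[target]
        if missing:
            have[target].add(policy(have[target], missing))
        for i in rng.sample(range(num_nodes), num_nodes):
            n = rng.choice(neighbours[i])
            useful = have[n] - have[i]   # blocks the neighbour can offer
            if useful:
                have[i].add(policy(have[i], useful))
    return have
```

Swapping in a different `policy` function is enough to compare the random, sequential, and segment-based schedules discussed next.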
Why 95th percentile?
[Graph: number of nodes (out of 500) achieving at least a given goodput (blocks/round)]
- A point (x, y) means y nodes have goodput at least x
- The red line at the top marks the 95th percentile
- The 95th percentile is a good indicator of goodput
  - Most nodes have at least that goodput
Feasibility of VoD – What does block/round mean?
- 0.6 blocks/round with 35 rounds of setup time
- Two independent parameters
  - 1 round ≈ 10 seconds
  - 1 block/round ≈ 512 kbps
- Consider 35 rounds a good setup time
  - 35 rounds ≈ 6 min
- Goodput = 0.6 blocks/round
  - Max encoding rate = 0.6 × 512 > 300 kbps
- Length of video = 250/0.6 rounds ≈ 70 min
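The feasibility arithmetic above, worked out explicitly:

```python
# Feasibility arithmetic: what 0.6 blocks/round means in real units.
ROUND_SECONDS = 10      # 1 round ~ 10 s
BLOCK_KBPS = 512        # 1 block/round ~ 512 kbps
NUM_BLOCKS = 250
goodput = 0.6           # blocks/round
setup_rounds = 35

setup_minutes = setup_rounds * ROUND_SECONDS / 60            # ~5.8 min
max_encoding_kbps = goodput * BLOCK_KBPS                     # ~307 kbps > 300
video_minutes = NUM_BLOCKS / goodput * ROUND_SECONDS / 60    # ~69 min
```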
Naïve scheduling policies
- Segment-Random policy
  - Divide the file into segments
  - Fetch blocks in random order within the same segment
  - Download segments in order
Naïve approaches – Random

[Graph: goodput (blocks/round) vs. setup time (rounds)]
- Each node fetches a block at random
- Throughput – high, as nodes fetch disjoint blocks
  - More opportunities for block exchanges
- Goodput – low, as nodes do not receive blocks in playback order
  - 0 blocks/round
Naïve approaches – Sequential

[Graph: goodput vs. setup time – Sequential (≈ 0.2 blocks/round) vs. Random]
- Each node fetches blocks in order of playback
- Throughput – low, as there are fewer opportunities for exchange
  - Need to increase this
Naïve approaches – Segment-Random policy

[Graph: goodput vs. setup time – Segment-Random (≈ 0.4 blocks/round) vs. Sequential vs. Random]
- Segment-Random
  - Fetch a random block from the segment containing the earliest needed block
- Increases the number of blocks propagating in the system
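The Segment-Random choice can be written as a block-selection policy; the function name and data shapes are illustrative.

```python
import random

NUM_BLOCKS, SEGMENT_SIZE = 250, 10

def segment_random_policy(have, offered, rng=random):
    """Pick a random offered block from the segment that contains the
    earliest block this node still needs; None if that segment is not on offer."""
    missing = [b for b in range(NUM_BLOCKS) if b not in have]
    if not missing:
        return None
    seg = missing[0] // SEGMENT_SIZE     # segment of the earliest needed block
    in_seg = [b for b in offered if b // SEGMENT_SIZE == seg and b not in have]
    return rng.choice(in_seg) if in_seg else None
```

Restricting randomness to one segment keeps downloads roughly in playback order while still diversifying which blocks each node holds.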
Network coding – How it helps

[Diagram: A holds blocks 1 and 2; B downloads from A; C downloads block 1 in parallel; B and C are linked]
- A has blocks 1 and 2
- B gets 1 or 2 with equal probability from A
- C gets 1 in parallel
- If B downloaded 1, the link B–C becomes useless
- Network coding instead routinely sends 1 ⊕ 2
Network coding – Mechanics
- Coding over blocks B_1, B_2, ..., B_n
- Choose coefficients c_1, c_2, ..., c_n from a finite field
- Generate an encoded block: E_1 = Σ_{i=1}^{n} c_i × B_i
- Without n coded blocks, decoding cannot be done
  - Setup time is at least the time to fetch a segment

[Gkantsidis et al. Network Coding for Large Scale Content Distribution. InfoCom '05]
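A minimal sketch of random linear coding and decoding over GF(2), where the field multiplication and addition both reduce to XOR. Real deployments, including the Gkantsidis et al. work, use a larger field such as GF(2^8), so treat this as illustrative.

```python
import random

def encode(blocks, rng=random):
    """Return (coefficients, coded) where coded = XOR of the chosen blocks."""
    n, size = len(blocks), len(blocks[0])
    coeffs = [rng.getrandbits(1) for _ in range(n)]
    if not any(coeffs):
        coeffs[rng.randrange(n)] = 1          # skip the useless zero combination
    coded = bytes(size)
    for c, b in zip(coeffs, blocks):
        if c:
            coded = bytes(x ^ y for x, y in zip(coded, b))
    return coeffs, coded

def decode(packets, n):
    """Gauss-Jordan elimination over GF(2); packets = [(coeffs, coded), ...].
    Returns the n original blocks, or None while the rank is below n."""
    rows = [(list(c), bytes(d)) for c, d in packets]
    for col in range(n):
        piv = next((r for r in range(col, len(rows)) if rows[r][0][col]), None)
        if piv is None:
            return None                       # not enough independent packets
        rows[col], rows[piv] = rows[piv], rows[col]
        for r in range(len(rows)):
            if r != col and rows[r][0][col]:
                rows[r] = ([a ^ b for a, b in zip(rows[r][0], rows[col][0])],
                           bytes(x ^ y for x, y in zip(rows[r][1], rows[col][1])))
    return [rows[i][1] for i in range(n)]
```

The key property from the slides shows up directly: any n linearly independent coded packets suffice, but fewer than n yield nothing, which is why setup time is at least one segment's fetch time.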
Benefits of network coding
[Graph: goodput vs. setup time – Segment-Netcoding, Segment-Random, Sequential, Random]
- Goodput
  - ≈ 0.6 blocks/round with netcoding
  - 0.4 blocks/round with seg-rand
Segment scheduling
- Netcoding is good for fetching blocks within a segment, but cannot be used across segments
  - Because decoding requires all the encoded blocks of a segment
- Problem: how to schedule the fetching of segments?
  - Until now, sequential segment scheduling
  - Results in rare segments when nodes arrive at different times
- A segment scheduling algorithm to avoid rare segments
  - Evaluation with a C# prototype
Segment scheduling
- A has 75% of the file
- A flash crowd of 20 Bs joins with empty caches
- The server is shared between A and the Bs
- Throughput to A decreases, as the server is busy serving the Bs
- The initial segments are served repeatedly
- Throughput of the Bs is low because the same segments are fetched from the server
Segment scheduling – Problem
- 8 segments in the video; A has 75% = 6 segments, the Bs have none
- Green is popularity 2, red is popularity 1
  - Popularity = number of full copies in the system
- Instead of serving A, the server gives segments to the Bs
Segment scheduling – Problem
[Graph: number of data blocks downloaded over time (sec), naive scheduling – A vs. B1..n]
- A's goodput plummets
- Get the best of both worlds – improve both A and the Bs
- Idea: each node serves only the rarest segments in the system
Segment scheduling – Algorithm
- When a dst node connects to the src node...
  - Here: dst is a B, src is the server
- The src node sorts segments in order of popularity
  - Segments 7 and 8 are least popular, at 1
  - Segments 1-6 are equally popular, at 2
- src considers segments in sorted order, one by one, and serves dst the first segment that is
  - either completely available at src, or
  - the first segment required by dst
- The server thus injects blocks from the segments needed by A into the system
  - Avoids wasting bandwidth serving the initial segments multiple times
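The serving rule on these slides can be sketched as a function; the data shapes (popularity map, sets of segments) are illustrative, not the C# prototype's API.

```python
def choose_segment(popularity, src_complete, dst_needed):
    """Pick the segment src should serve dst, considering rarest first.

    popularity:   {segment: number of full copies in the system}
    src_complete: segments fully available at src
    dst_needed:   segments dst still needs, in playback order
    """
    if not dst_needed:
        return None
    first_needed, needed = dst_needed[0], set(dst_needed)
    for seg in sorted(popularity, key=popularity.get):  # rarest first
        if seg in needed and (seg in src_complete or seg == first_needed):
            return seg
    return first_needed
```

With the slides' example (segments 1-6 at popularity 2, segments 7-8 at popularity 1, and the server holding everything), the server serves a B the rare segment 7 rather than yet another copy of segment 1.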
Segment scheduling – Popularities
- How does src figure out popularities?
- Centrally available at the server
  - Our implementation uses this technique
- Alternatively, each node maintains popularities for segments
  - Could use a gossiping protocol for aggregation
Segment scheduling
[Graph: number of data blocks downloaded over time (sec) – naive vs. new algorithm, for A and B1..n]
- Note that the goodput of both A and the Bs improves
Conclusions
- Problem: designing scalable data-intensive applications
  - Enabled low-end servers to scale to large numbers of users
- Shark: scales a file server for read-heavy workloads
- RedCarpet: scales a video server for near-Video-on-Demand
- Exploited P2P for file systems; solved the security issues
- Techniques to improve scalability
  - Cross file(system) sharing
  - Topology management – data-based, locality awareness
  - Scheduling of chunks – network coding