Bandwidth and latency optimizations
Jinyang Li
w/ speculator slides from Ed Nightingale
What we’ve learnt so far
• Programming tools
• Consistency
• Fault tolerance
• Security
• Today: performance boosting techniques
  – Caching
  – Leases
  – Group commit
  – Compression
  – Speculative execution
Performance metrics
• Throughput
  – Measures the achievable rate (ops/sec)
  – Limited by the bottleneck resource
    • 10Mbps link: max ~150 ops/sec for writing 8KB blocks
  – Increase throughput by using less of the bottleneck resource
• Latency
  – Measures the latency of a single client response
  – Reduce latency by pipelining multiple operations
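The 10Mbps figure above is simple bottleneck arithmetic, and can be checked directly (a back-of-envelope sketch; decimal megabits assumed):

```python
# Back-of-envelope throughput bound for the slide's example.
link_bps = 10_000_000            # 10 Mbps link (decimal megabits assumed)
block_bytes = 8 * 1024           # 8KB write payload per op
bits_per_op = block_bytes * 8
ops_per_sec = link_bps / bits_per_op
print(round(ops_per_sec))        # ~153, matching the slide's "max ~150 ops/sec"
```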
Caching (in NFS)
• NFS clients cache file content and directory name mappings
• Caching saves network bandwidth, improves latency
[Diagram: the client's first READ fetches data from the server; later READs are served from the cache after a GETATTR freshness check]
Leases (not in NFS)
• Leases eliminate latency in freshness checks, at the cost of keeping extra state at the server
[Diagram: C1's READ fh1 returns LEASE fh1 + data, and the server records "fh1: C1"; while the lease holds, C1's later READs hit its cache with no freshness check; when another client WRITEs fh1, the server sends INVAL fh1 to C1 before replying OK]
Group commit (in NFS)
• Group commit reduces the latency of a sequence of writes
[Diagram: the client pipelines several WRITEs and then issues a single COMMIT]
Two cool tricks
• Further optimization for b/w and latency is necessary for wide area
  – Wide area network challenges
    • Low bandwidth (10~100Mbps)
    • High latency (10~100ms)
• Promising solutions:
  – Compression (LBFS)
  – Speculative execution (Speculator)
Low Bandwidth File System
• Goal: avoid redundant data transfer between clients and the server
• Why isn’t caching enough?
  – A file with duplicate content → duplicate cache blocks
  – Two files that share content → duplicate cache blocks
  – A file that’s modified → previous cache is useless
LBFS insight: name by content hash
• Traditional cache naming: (fh#, offset)
• LBFS naming: SHA-1(cached block)
• Same contents have the same name
  – Two identical files share cached blocks
  – Cached blocks keep the same names despite file changes
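The naming scheme can be sketched with a dict-backed cache keyed by SHA-1 of the block content (an illustrative sketch, not LBFS's code; the example blocks are made up):

```python
import hashlib

def block_name(block: bytes) -> str:
    # LBFS-style name: the SHA-1 digest of the content itself,
    # independent of which file or offset it came from.
    return hashlib.sha1(block).hexdigest()

cache = {}
blocks = [b"#include <stdio.h>", b"int main() {}", b"#include <stdio.h>"]
for blk in blocks:
    cache.setdefault(block_name(blk), blk)  # duplicate content stored once

assert len(cache) == 2  # three blocks, only two distinct contents
```

Unlike (fh#, offset) naming, the name survives when the same block appears in a second file or moves within a file, which is what makes cross-file and cross-version reuse possible.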
Naming granularity
• Name each file by its SHA-1 hash
  – It’s rare for two files to be exactly identical
  – No cache reuse across file modifications
• Cut a file into 8KB blocks, name each [x*8K, (x+1)*8K) range by its hash
  – If block boundaries misalign, two almost identical files could share no common block
  – If block boundaries misalign, a new file could share no common block with its old version
Align boundaries across different files
• Idea: determine boundaries based on the actual content
  – If two boundaries have the same 48-byte content, they probably correspond to the same position in a contiguous region of identical content
[Diagram: two files containing the same region get the same chunk names (e.g. ab9f..0a, 87e6b..f5) even though the region sits at different offsets]
LBFS content-based chunking
• Examine every sliding window of 48 bytes
• Compute a 2-byte Rabin fingerprint f of the 48-byte window
• If the lower 13 bits of f equal a fixed value v, that position is a breakpoint
• Two consecutive breakpoints define a “chunk”
• Average chunk size? Each position is a breakpoint with probability 2^-13, so chunks average 2^13 bytes = 8KB
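The breakpoint rule can be sketched in a few lines of Python. This is illustrative, not LBFS's code: a toy polynomial hash stands in for real Rabin fingerprints, and the window and mask are shrunk (16 bytes, 6 bits) so that short demo inputs produce several chunks:

```python
WINDOW = 16          # LBFS uses 48-byte windows; shrunk here for short inputs
MASK = (1 << 6) - 1  # LBFS tests the lower 13 bits; 6 keeps demo chunks small
TARGET = 0           # the fixed value v

def toy_fingerprint(window: bytes) -> int:
    # Stand-in for a Rabin fingerprint: treat the window as a radix-257
    # number modulo a prime. Real code updates this incrementally.
    f = 0
    for b in window:
        f = (f * 257 + b) % 1000003
    return f

def chunk(data: bytes):
    chunks, start = [], 0
    for i in range(WINDOW, len(data) + 1):
        if toy_fingerprint(data[i - WINDOW:i]) & MASK == TARGET:
            chunks.append(data[start:i])   # a breakpoint ends the current chunk
            start = i
    if start < len(data):
        chunks.append(data[start:])        # tail chunk after the last breakpoint
    return chunks

data = bytes(range(256)) * 4
assert b"".join(chunk(data)) == data       # chunking is lossless
# Boundaries depend only on content, so a shifted copy re-synchronizes:
shared = set(chunk(data)) & set(chunk(b"some inserted prefix" + data))
```

Because each boundary is a function of the surrounding bytes alone, most chunks of the shifted copy typically coincide with chunks of the original, which is exactly the alignment property the slides are after.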
LBFS chunking
• Two files with the same but misaligned content of x bytes
• How many fingerprints for each x-byte content? How many breakpoints? Are the breakpoints aligned?
[Diagram: fingerprints f1..f4 computed over the shared content of both files; the breakpoints fall at the same content positions despite the offset]
Why Rabin fingerprints?
• Why not use the lower 13 bits of every 2-byte sliding window for breakpoints?
  – Data is not random, resulting in extremely variable chunk sizes
• A Rabin fingerprint computes a (pseudo-)random 2-byte value out of 48 bytes of data
Rabin fingerprint is fast
• Treat 48-byte data D as a 48-digit radix-256 number
• f47 = fingerprint of D[0…47]
      = ( D[47] + 256*D[46] + … + 256^46*D[1] + 256^47*D[0] ) % q
• f48 = fingerprint of D[1…48]
      = ( (f47 − D[0]*256^47) * 256 + D[48] ) % q
• A new fingerprint is computed from the old fingerprint and the newly shifted-in byte
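The rolling-update identity can be verified numerically (q and the test bytes here are arbitrary choices, not LBFS's parameters):

```python
q = 1000003        # an arbitrary prime modulus for the demo
WINDOW = 48

def fingerprint(window: bytes) -> int:
    # Direct evaluation: treat the window as a radix-256 number mod q.
    f = 0
    for byte in window:
        f = (f * 256 + byte) % q
    return f

D = bytes((i * 37 + 11) % 256 for i in range(64))  # 64 arbitrary test bytes

f47 = fingerprint(D[0:48])
# Rolling update: drop D[0], shift left one digit, append D[48].
# One multiply and one add instead of re-hashing all 48 bytes.
f48 = ((f47 - D[0] * pow(256, 47, q)) * 256 + D[48]) % q
assert f48 == fingerprint(D[1:49])
```

This constant work per shifted-in byte is what makes fingerprinting every sliding window affordable.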
LBFS reads
[Diagram: the file is not in the cache, so the client sends GETHASH; the server replies with the chunk list (h1, size1, h2, size2, h3, size3); the client asks for the missing chunks with READ(h1, size1) and READ(h2, size2), then reconstructs the file as h1, h2, h3]
• Fetching only the missing chunks saves b/w by reusing common cached blocks across different files or different versions of the same file
LBFS writes
[Diagram: the client sends MKTMPFILE(fd) → the server creates tmp file fd; the client sends CONDWRITE(fd, h1, size1, h2, size2, h3, size3) → the server replies HASHNOTFOUND(h1, h2) for the missing chunks; the client sends TMPWRITE(fd, h1) and TMPWRITE(fd, h2) → the server constructs the tmp file from h1, h2, h3; the client sends COMMITTMP(fd, target_fhandle) → the server copies the tmp file content to the target file]
• Transferring only the missing chunks saves b/w if different files or different versions of the same file have pieces of identical content
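The write protocol above can be sketched with a toy dict-backed server (illustrative only: the function names follow the slide's RPCs, but the real CONDWRITE also carries offsets and the store is a real file system):

```python
import hashlib

chunk_store = {}   # toy server-side chunk store: SHA-1 name -> content

def condwrite(hashes):
    # Server replies with the hashes it doesn't have -- only those
    # chunks need to cross the (slow) wide-area link.
    return [h for h in hashes if h not in chunk_store]

def tmpwrite(chunk: bytes):
    chunk_store[hashlib.sha1(chunk).hexdigest()] = chunk

def committmp(hashes) -> bytes:
    # Reconstruct the target file from stored chunks, in order.
    return b"".join(chunk_store[h] for h in hashes)

# Client side: file = c1 + c2 + c3; the server already has c3 from
# an earlier file, so only two chunks are transferred.
c1, c2, c3 = b"chunk one", b"chunk two", b"chunk three"
tmpwrite(c3)
names = [hashlib.sha1(c).hexdigest() for c in (c1, c2, c3)]
missing = condwrite(names)
assert len(missing) == 2         # HASHNOTFOUND(h1, h2)
for c in (c1, c2):
    tmpwrite(c)
assert committmp(names) == c1 + c2 + c3
```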
LBFS evaluations
• In practice, there is a lot of content overlap among different files and different versions of the same file
  – Saving a Word document
  – Recompiling after a header change
  – Different versions of a software package
• LBFS uses ~1/10 of the b/w
Speculative Execution in a Distributed File System
Nightingale et al.
SOSP’05
How to reduce latency in FS?
• What are potentially “wasteful” latencies?
• Freshness check
  – Client issues GETATTR before reading from cache
  – Incurs an extra RTT for reads
  – Why wasteful? Most GETATTRs confirm freshness is OK
• Commit ordering
  – Client waits for commit on modification X to finish before starting modification Y
  – No pipelining of modifications on X & Y
  – Why wasteful? Most commits succeed!
Key Idea: Speculate on RPC responses
• Guarantees without blocking on I/O!
[Diagram: without Speculator, the client blocks on every RPC request/response pair; with Speculator, the client 1) checkpoints, 2) speculates past the RPC, and 3) when the response arrives, discards the checkpoint if the guess was correct, or restores the process and re-executes if not]
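The checkpoint/speculate/verify loop can be sketched in miniature (a sketch only: process state is modeled as a plain dict and the "RPC" is a local function, whereas Speculator checkpoints real processes in the kernel):

```python
import copy

def speculative_call(state, predict, actual_rpc, continue_work):
    checkpoint = copy.deepcopy(state)  # 1) checkpoint the process state
    guess = predict()
    continue_work(state, guess)        # 2) speculate: run ahead on the guess
    real = actual_rpc()                #    ...the real response arrives later
    if real != guess:                  # 3) wrong guess: restore & re-execute
        state = checkpoint
        continue_work(state, real)
    return state                       #    right guess: checkpoint discarded

state = {"log": []}
record = lambda s, resp: s["log"].append(resp)

state = speculative_call(state, lambda: "fresh", lambda: "fresh", record)
assert state["log"] == ["fresh"]            # prediction held: no rollback
state = speculative_call(state, lambda: "fresh", lambda: "stale", record)
assert state["log"] == ["fresh", "stale"]   # mispredicted: rolled back, redone
```

The payoff is that in the common case (the prediction holds) the client never waits for the RPC round trip at all.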
Conditions of useful speculation
• Operations are highly predictable
• Checkpoints are cheaper than network I/O
  – 52 µs for a small process
• Computers have resources to spare
  – Need memory and CPU cycles for speculation
Implementing Speculation
[Diagram: 1) the process makes a system call; 2) Speculator creates a speculation: a checkpoint of the process plus a Spec entry in the undo log; time runs left to right]
Speculation Success
[Diagram: 1) system call; 2) create speculation (checkpoint + Spec entry in the undo log); 3) commit speculation: the checkpoint and Spec entry are discarded and the process continues]
Speculation Failure
[Diagram: 1) system call; 2) create speculation; 3) fail speculation: the process is restored from the checkpoint and re-executed]
Ensuring Correctness
• Speculative processes hit barriers when they need to affect external state
  – Cannot roll back an external output
• Three ways to ensure correct execution
  – Block
  – Buffer
  – Propagate speculations (dependencies)
• Need to examine the syscall interface to decide how to handle each syscall
Handle system calls
• Block calls that externalize state
  – Allow read-only calls (e.g. getpid)
  – Allow calls that modify only task state (e.g. dup2)
• File system calls -- need to dig deeper
  – Mark file systems that support Speculator
[Examples: getpid → call sys_getpid(); reboot → block until specs are resolved; mkdir → allow only if the fs supports Speculator]
Output Commits
[Diagram: the process runs 1) sys_stat and 2) sys_mkdir, creating Spec(stat) and Spec(mkdir) with their checkpoints in the undo log; the outputs “stat worked” and “mkdir worked” are held back until 3) the speculations commit]
Multi-Process Speculation
• Processes often cooperate
  – Example: “make” forks children to compile, link, etc.
  – Would block if speculation were limited to one task
• Allow kernel objects to have speculative state
  – Examples: inodes, signals, pipes, Unix sockets, etc.
  – Propagate dependencies among objects
  – Objects are rolled back to prior states when specs fail
Multi-Process Speculation
[Diagram: pid 8000 and pid 8001 each hold checkpoints for Spec 1 and Spec 2; inode 3456 carries the speculative operations Chown-1 and Write-1, propagating the dependencies between the two processes]
Multi-Process Speculation
• What’s handled:
  – DFS objects, RAMFS, Ext3, Pipes & FIFOs
  – Unix Sockets, Signals, Fork & Exit
• What’s not handled (i.e. blocks):
  – System V IPC
  – Multi-process write-shared memory
Example: NFSv3 Linux
[Diagram: Client 1 modifies B and issues a synchronous Write then Commit; Client 2’s Open B triggers a Getattr; each client blocks on the server at every step]
Example: SpecNFS
[Diagram: Client 1 speculates past Modify B and sends Write+Commit asynchronously; Client 2 speculates past Open B and its Getattr instead of blocking on the server]
Problem: Mutating Operations
• bar depends on the speculative execution of “cat foo”
• If bar’s state could be speculative, what does Client 2 view in bar?
[Diagram: Client 1 runs “cat foo > bar”; Client 2 then runs “cat bar”]
Solution: Mutating Operations
• The server determines speculation success/failure
  – State at the server is never speculative
• Clients tag each operation with the hypotheses it speculates on
  – The list of speculations the operation depends on
• The server reports failed speculations
• The server processes messages in order
Server checks the speculation’s status
[Diagram: Client 1 runs “cat foo > bar” and sends Write+Commit tagged “foo v=1”; the server checks whether foo indeed has version 1 and fails the speculation if not]
Group Commit
• Previously sequential ops are now concurrent
• Sync ops are usually committed to disk
• Speculator makes group commit possible
[Diagram: without Speculator, each write waits for its own commit; with Speculator, writes are issued back-to-back and committed as a group]
Putting it Together: SpecNFS
• Apply Speculator to an existing file system
• Modified NFSv3 in the Linux 2.4 kernel
  – Same RPCs issued (but many now asynchronous)
  – SpecNFS has the same consistency and safety as NFS
  – Getattr, lookup, access speculate if data is in cache
  – Create, mkdir, commit, etc. always speculate
Putting it Together: BlueFS
• Design a new file system for Speculator
  – Single-copy semantics
  – Synchronous I/O
• Each file, directory, etc. has a version number
  – Incremented on each mutating op (e.g. on write)
  – Checked prior to all operations
  – Many ops speculate and check the version asynchronously
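The version-number check can be sketched as follows (an illustrative model: the class and function names are made up, and the real check happens server-side in the file system):

```python
class Inode:
    # Toy stand-in for a BlueFS object with a version number.
    def __init__(self):
        self.version = 0
        self.data = b""

    def write(self, data: bytes):
        self.data = data
        self.version += 1      # every mutating op increments the version

def check_speculation(inode: Inode, guessed_version: int) -> bool:
    # The asynchronous check: did the client speculate against
    # the object's current version?
    return inode.version == guessed_version

f = Inode()
f.write(b"v1")
guess = f.version                       # client speculates against version 1
assert check_speculation(f, guess)      # no intervening write: spec succeeds
f.write(b"v2")                          # a concurrent write bumps the version
assert not check_speculation(f, guess)  # stale guess: spec must fail
```

The client proceeds immediately on its cached copy and only rolls back in the rare case the version check fails.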
Apache Benchmark
• SpecNFS up to 14 times faster
[Charts: time in seconds for NFS, SpecNFS, BlueFS, and ext3 -- left: no delay (y-axis 0–300 s); right: 30 ms delay (y-axis 0–4500 s)]
Rollback cost is small
• With all files out of date, SpecNFS is still up to 11x faster
[Charts: time in seconds for NFS, SpecNFS, and ext3 with 0%, 10%, 50%, and 100% of files invalid -- left: no delay (y-axis 0–140 s); right: 30 ms delay (y-axis 0–2000 s)]
What we’ve learnt today
• Traditional performance-boosting techniques
  – Caching
  – Group commit
  – Leases
• Two new techniques
  – Content-based hashing and chunking
  – Speculative execution