Everest: scaling down peak loads through I/O off-loading
D. Narayanan, A. Donnelly, E. Thereska, S. Elnikety, A. Rowstron
Microsoft Research Cambridge, UK
Problem: I/O peaks on servers
• Short, unexpected peaks in I/O load
  – This is not about predictable trends
• Uncorrelated across servers in a data center
  – And across volumes on a single server
• Bad I/O response times during peaks
Example: Exchange server
• Production mail server
  – 5000 users, 7.2 TB across 8 volumes
• Well provisioned
  – Hardware RAID, NVRAM, over 100 spindles
• 24-hour block-level I/O trace
  – At peak load, response time is 20x the mean
  – Peaks are uncorrelated across volumes
Exchange server load
[Figure: per-volume load (reqs/s/volume) vs. time of day, log scale]
Write off-loading

[Diagram: an Everest client sits above the volume. With no off-loading, reads and writes go to the volume; while off-loading, writes are redirected to one or more Everest stores, reads go wherever the latest version lives, and off-loaded data is later reclaimed back to the volume.]
Exploits workload properties

• Peaks are uncorrelated across volumes
  – A loaded volume can find less-loaded stores
• Peaks contain some writes
  – Off-loading writes means reads see less contention
• Few foreground reads on off-loaded data
  – Recently written, hence in the buffer cache
  – Stores can be optimized for writes
Challenges

• Any write can go anywhere
  – Maximizes the potential for load balancing
• Reads must always return the latest version
  – Split across stores/base volume if required
• State must be consistent and recoverable
  – Track both current and stale versions
• No meta-data writes to the base volume
Design features

• Recoverable soft state
• Write-optimized stores
• Reclaiming off-loaded data
• N-way off-loading
• Load-balancing policies
Recoverable soft state

• Need meta-data to track off-loads
  – block ID → <location, version>
  – Latest version as well as old (stale) versions
• Meta-data cached in memory
  – On both clients and stores
• Off-loaded writes carry a meta-data header
  – 64-bit version, client ID, block range
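The meta-data above can be pictured as a per-client version map. A minimal sketch, not the paper's implementation; the names `OffloadMap` and `BASE_VOLUME` are invented for the example:

```python
# Sketch of a client's off-load meta-data: block ID -> <location, version>,
# keeping the latest version plus old (stale) versions until deletion.
# All names here are illustrative, not from the Everest system.

BASE_VOLUME = "base"

class OffloadMap:
    def __init__(self):
        self.entries = {}  # block ID -> list of (version, location), newest first

    def record_write(self, block, version, store):
        # A new off-loaded write becomes the latest version; stale
        # entries remain tracked until they are safely deleted.
        self.entries.setdefault(block, []).insert(0, (version, store))

    def latest_location(self, block):
        # A read must see the latest version: a store if the block is
        # off-loaded, otherwise the base volume.
        versions = self.entries.get(block)
        return versions[0][1] if versions else BASE_VOLUME

    def route_read(self, blocks):
        # A multi-block read may be split across stores and the base
        # volume, depending on where each block's latest version lives.
        return {b: self.latest_location(b) for b in blocks}
```

For example, after two off-loaded writes of block 10, a read of blocks 9 and 10 is split between the base volume and the store holding the newest version.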
Recoverable soft state (2)

• Meta-data is also persisted on stores
  – No synchronous writes to the base volume
  – Stores write data + meta-data as one record
• The "store set" is persisted on the base volume
  – Small, infrequently changing
• Client recovery: contact the store set
• Store recovery: read the log from disk
Everest stores

• Short-term, write-optimized storage
  – Simple circular log
  – Small file or partition on an existing volume
  – Not LFS: data is reclaimed, so there is no cleaner
• Monitors load on the underlying volume
  – Only used by clients when it is lightly loaded
• One store can support many clients
Reclaiming in the background

[Diagram: the Everest client issues "read any" to the Everest stores, receives a <block range, version, data> record, writes the data back to the volume, then sends delete(block range, version) to the store.]

• Multiple concurrent reclaim "threads"
  – Efficient utilization of disk/network resources
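The reclaim protocol above can be sketched as a loop; Everest runs several such reclaim "threads" concurrently. Everything here (`read_any`, `is_latest`, the object interfaces) is a hypothetical interface for illustration, not the system's actual API:

```python
def reclaim_all(client, store, base_volume):
    """Illustrative reclaim loop: drain one store back to the base volume."""
    while True:
        record = store.read_any()        # any off-loaded <range, version, data>
        if record is None:
            break                        # nothing left to reclaim
        block_range, version, data = record
        # Only the latest version is written back to the volume; stale
        # versions are simply deleted.
        if client.is_latest(block_range, version):
            base_volume.write(block_range, data)
        store.delete(block_range, version)
```

Deleting only after the write-back completes preserves the invariant that the latest version is always recoverable from somewhere.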
Correctness invariants

• I/O on an off-loaded range is always off-loaded
  – Reads: sent to the correct location
  – Writes: ensure the latest version is recoverable
  – Foreground I/Os are never blocked by reclaim
• Deletion of a version is allowed only if
  – A newer version has been written to some store, or
  – The data has been reclaimed and all older versions deleted
• All off-loaded data is eventually reclaimed
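The deletion invariant can be written as a small predicate. The data model (per-block version lists on stores, plus a set of reclaimed versions) is an assumption made for illustration:

```python
def can_delete(block, version, store_versions, reclaimed_versions):
    # store_versions: versions of this block still durable on stores.
    # reclaimed_versions: versions already written back to the volume.
    newer_on_store = any(v > version
                         for v in store_versions.get(block, ()))
    reclaimed = version in reclaimed_versions.get(block, ())
    older_gone = all(v >= version
                     for v in store_versions.get(block, ()))
    # Delete only if a newer version is durable on some store, or the
    # data was reclaimed and every older version is already deleted.
    return newer_on_store or (reclaimed and older_gone)
```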
Evaluation

• Exchange server traces
• OLTP benchmark
• Scaling
• Micro-benchmarks
• Effect of NVRAM
• Sensitivity to parameters
• N-way off-loading
Exchange server workload

• Replay the Exchange server trace
  – 5000 users, 8 volumes, 7.2 TB, 24 hours
• Choose time segments with peaks
  – Segments extended to cover all reclaim activity
• Our server: 14 disks, 2 TB
  – Can fit 3 Exchange volumes
• A subset of volumes is replayed for each segment
Trace segment selection
[Figure: total I/O rate (reqs/s) vs. time of day, log scale]
Trace segment selection
[Figure: the same trace with three selected segments marked as Peak 1, Peak 2, and Peak 3]
Three volumes/segment

[Diagram: for each segment, the minimum-, maximum-, and median-load volumes from the trace are replayed; each replayed volume runs an Everest client and hosts a store sized at 3% of the volume.]
Mean response time
[Bar chart: mean response time (ms, 0–200), no off-load vs. off-load, for reads and writes in Peaks 1–3]
99th percentile response time
[Bar chart: 99th-percentile response time (ms, 0–2000), no off-load vs. off-load, for reads and writes in Peaks 1–3]
Exchange server summary

• Substantial improvement in I/O latency
  – On a real enterprise server workload
  – For both reads and writes, at the mean and the 99th percentile
• What about application performance?
  – An I/O trace cannot show end-to-end effects
• Where is the benefit coming from?
  – Extra resources, log structure, ...?
OLTP benchmark

[Diagram: an OLTP client drives an unmodified SQL Server binary over the LAN; Detours DLL redirection interposes the Everest client between SQL Server and its data volume, with the log on a separate volume and an Everest store available for off-load.]

• 10 min warmup
• 10 min measurement
OLTP throughput
[Bar chart: throughput (tpm, 0–3000) for five configurations: no off-load, off-load, log-structured, 2-disk striped, and striped + log-structured. Off-load benefits from both the extra disk and the log layout; annotation: "2x disks, 3x speedup?"]
Off-loading not a panacea

• Works for short-term peaks
• Cannot be used to improve performance 24/7
• Data is usually reclaimed while the store is still idle
  – Long-term off-load → eventual contention
• Data is reclaimed before the store fills up
  – Long-term → log-cleaner issue
Conclusion

• Peak I/O is a problem
• Everest solves it through off-loading
• By modifying the workload at the block level
  – Removes writes from the overloaded volume
  – Off-loading is short term: data is reclaimed
• Consistency and persistence are maintained
  – State is always correctly recoverable
Questions?
Why not always off-load?

[Diagram: two OLTP clients drive SQL Server 1 (running an Everest client) and SQL Server 2; server 1 off-loads writes onto server 2's data volume, so its off-loaded traffic contends with server 2's own reads and writes.]
10 min off-load, 10 min contention

[Bar chart: speedup (0–4) during off-load, during contention on server 1, and during contention on server 2]
Mean and 99th pc (log scale)
[Bar chart, log scale: response time (ms, 1–10,000), no off-load vs. off-load, for reads and writes in Peaks 1–3]
Read/write ratio of peaks
[CDF: cumulative fraction of peaks vs. percentage of writes (0–100%)]
Exchange server response time
[Figure: response time vs. time of day, log scale]
Exchange server load (volumes)
[Figure: max, mean, and min per-volume load (reqs/s) vs. time of day, log scale]
Effect of volume selection
[Figure: Peak 1 load (reqs/s/volume) vs. time of day, all volumes vs. selected volumes]
Effect of volume selection
[Figure: Peak 2 load (reqs/s/volume) vs. time of day, all volumes vs. selected volumes]
Effect of volume selection
[Figure: Peak 3 load (reqs/s/volume) vs. time of day, all volumes vs. selected volumes]
Scaling with #stores
[Diagram: the OLTP setup as before: an OLTP client drives the SQL Server binary through Detours DLL redirection and the Everest client, now off-loading to multiple stores over the LAN.]
Scaling: linear until CPU-bound
[Figure: speedup vs. number of stores (0–3); speedup grows linearly until the client becomes CPU-bound]
Everest store: circular log layout
[Diagram: a header block followed by a circular log; the active log lies between head and tail, reclaim proceeds from the oldest records, and stale records are deleted to free space]
Exchange server load: CDF
[CDF: cumulative fraction vs. request rate per volume (reqs/s), log scale]
Unbalanced across volumes
[CDF: min, mean, and max request rate per volume (reqs/s), log scale, showing imbalance across volumes]