
Everest: scaling down peak loads through I/O off-loading

D. Narayanan, A. Donnelly, E. Thereska, S. Elnikety, A. Rowstron

Microsoft Research Cambridge, UK

hanzler666 @ UKClimbing.com

Problem: I/O peaks on servers

• Short, unexpected peaks in I/O load
  – This is not about predictable trends

• Uncorrelated across servers in the data center
  – And across volumes on a single server

• Bad I/O response times during peaks

Example: Exchange server

• Production mail server
  – 5000 users, 7.2 TB across 8 volumes

• Well provisioned
  – Hardware RAID, NVRAM, over 100 spindles

• 24-hour block-level I/O trace
  – At peak load, response time is 20x the mean
  – Peaks are uncorrelated across volumes


Exchange server load

[Chart: per-volume load (reqs/s, log scale) vs. time of day over the 24-hour trace]

Write off-loading

[Diagram: an Everest client on the loaded volume intercepts I/O. With no off-loading, reads and writes go to the volume; with off-loading, writes are diverted to one or more Everest stores while reads are served from wherever the latest data lives. Off-loaded data is later reclaimed back to the volume.]

Exploits workload properties

• Peaks uncorrelated across volumes
  – A loaded volume can find less-loaded stores

• Peaks have some writes
  – Off-loading writes means reads see less contention

• Few foreground reads on off-loaded data
  – Recently written, hence in the buffer cache
  – Can optimize stores for writes


Challenges

• Any write anywhere
  – Maximize potential for load balancing

• Reads must always return the latest version
  – Split across stores/base volume if required (see the sketch below)

• State must be consistent and recoverable
  – Track both current and stale versions

• No meta-data writes to the base volume
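The challenges above describe behaviour rather than code. A minimal sketch of client-side routing that satisfies them is shown here; the class and method names (EverestClientSketch, load, put, get) are hypothetical illustrations, not the system's API. A write goes to the least-loaded store when the volume is overloaded (or when the range is already off-loaded), and a read is served block by block from wherever its latest version lives, splitting across stores and the base volume if required.

```python
class EverestClientSketch:
    def __init__(self, base_volume, stores):
        self.base = base_volume      # underlying (possibly overloaded) volume
        self.stores = stores         # available Everest stores
        self.latest = {}             # block -> (store, version) for off-loaded blocks
        self.version = 0             # monotonically increasing version counter

    def write(self, blocks, data, overloaded):
        self.version += 1
        already_offloaded = any(b in self.latest for b in blocks)
        if (overloaded or already_offloaded) and self.stores:
            # Any write can go anywhere: pick the least-loaded store.
            store = min(self.stores, key=lambda s: s.load())
            store.put(blocks, self.version, data)
            for b in blocks:
                self.latest[b] = (store, self.version)
        else:
            self.base.write(blocks, data)

    def read(self, blocks):
        # Reads must return the latest version: split the request so each
        # block is fetched from wherever its newest data lives.
        result = {}
        for b in blocks:
            if b in self.latest:
                store, version = self.latest[b]
                result[b] = store.get(b, version)
            else:
                result[b] = self.base.read(b)
        return result
```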

Design features

• Recoverable soft state
• Write-optimized stores
• Reclaiming off-loaded data
• N-way off-loading
• Load-balancing policies


Recoverable soft state

• Need meta-data to track off-loads
  – block ID → <location, version>
  – Latest version as well as old (stale) versions

• Meta-data cached in memory
  – On both clients and stores

• Off-loaded writes carry a meta-data header
  – 64-bit version, client ID, block range (one possible layout is sketched below)
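As a concrete illustration, here is one possible layout for that record header and the in-memory soft state, sketched in Python. The field order, sizes of the non-version fields, and the struct format are assumptions of mine; only the header contents named on the slide (64-bit version, client ID, block range) come from the source.

```python
import struct

HEADER_FMT = "<QQQI"   # version (64-bit), client_id, first_block, block_count (assumed widths)
HEADER_SIZE = struct.calcsize(HEADER_FMT)

def pack_record(version, client_id, first_block, block_count, data):
    # Header and data are written together as one record on the store.
    header = struct.pack(HEADER_FMT, version, client_id, first_block, block_count)
    return header + data

def unpack_header(record):
    version, client_id, first_block, block_count = struct.unpack_from(HEADER_FMT, record)
    return version, client_id, first_block, block_count, record[HEADER_SIZE:]

# In-memory soft state: for each block, where the latest version lives,
# plus any stale versions still awaiting deletion.
soft_state = {}  # block -> {"latest": (store_id, version), "stale": [(store_id, version), ...]}
```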

Recoverable soft state (2)

• Meta-data also persisted on stores
  – No synchronous writes to the base volume
  – Stores write data + meta-data as one record

• “Store set” persisted on the base volume
  – Small, infrequently changing

• Client recovery → contact the store set (sketched below)
• Store recovery → read from disk

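A rough sketch of that recovery path, with hypothetical method names (read_store_set, enumerate_records are illustrative, not the actual interfaces): the client reads only the small store set from the base volume, then rebuilds its block → <location, version> map from what each store reports; stores in turn recover by scanning their own logs. Stale-version bookkeeping is omitted here for brevity.

```python
def recover_client_state(base_volume, connect_to_store):
    soft_state = {}
    for store_id in base_volume.read_store_set():        # small, persistent list
        store = connect_to_store(store_id)
        for (first_block, count, version) in store.enumerate_records():
            for block in range(first_block, first_block + count):
                cur = soft_state.get(block)
                if cur is None or version > cur[1]:       # keep the highest version seen
                    soft_state[block] = (store_id, version)
    return soft_state
```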

Everest stores

• Short-term, write-optimized storage
  – Simple circular log
  – Small file or partition on an existing volume
  – Not LFS: data is reclaimed, no cleaner

• Monitors load on the underlying volume
  – Only used by clients when it is lightly loaded

• One store can support many clients


Reclaiming in the background

[Diagram: the Everest client reclaims off-loaded data with a “read any” request to any store holding the latest version; the store returns <block range, version, data>, the client writes it back to the volume, then issues delete(block range, version) to the stores.]

• Multiple concurrent reclaim “threads”
  – Efficient utilization of disk/network resources (a single reclaim step is sketched below)
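The diagram can be read as the following loop. This is a simplified sketch with made-up interfaces (pick_offloaded_range, get_range, mark_reclaimed), not the real implementation; it leaves out the concurrency, since the client runs several such reclaim threads and reclaims only while the base volume is lightly loaded.

```python
def reclaim_one(client):
    item = client.pick_offloaded_range()            # any off-loaded <range, version>
    if item is None:
        return False                                # nothing left to reclaim
    block_range, version, holders = item
    data = holders[0].get_range(block_range, version)   # "read any" replica
    client.base.write(block_range, data)            # base volume now holds the latest data
    for store in holders:
        store.delete(block_range, version)          # safe: data has been reclaimed
    client.mark_reclaimed(block_range, version)
    return True

def reclaim_loop(client):
    # Run in the background, only while the base volume is lightly loaded.
    while client.base.is_idle() and reclaim_one(client):
        pass
```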

Correctness invariants

• I/O on an off-loaded range is always off-loaded
  – Reads: sent to the correct location
  – Writes: ensure the latest version is recoverable
  – Foreground I/Os are never blocked by reclaim

• Deletion of a version is only allowed if
  – A newer version has been written to some store, or
  – The data has been reclaimed and older versions deleted (see the sketch below)

• All off-loaded data is eventually reclaimed

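The deletion rule can be restated as a small predicate. This is a paraphrase of the invariant with an assumed representation of per-range state (the versions of the range still held on stores, plus a reclaimed flag); it is not code from the system.

```python
def may_delete(version, versions_on_stores, reclaimed):
    """Return True if this version of a block range may be deleted.

    versions_on_stores: versions of this range currently held on any store.
    reclaimed: True if the range's data has been reclaimed to the base volume.
    """
    newer_exists = any(v > version for v in versions_on_stores)
    no_older_left = all(v >= version for v in versions_on_stores)
    return newer_exists or (reclaimed and no_older_left)
```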

Evaluation


• Exchange server traces
• OLTP benchmark
• Scaling
• Micro-benchmarks
• Effect of NVRAM
• Sensitivity to parameters
• N-way off-loading

Exchange server workload

• Replay the Exchange server trace
  – 5000 users, 8 volumes, 7.2 TB, 24 hours

• Choose time segments with peaks
  – Extend segments to cover all reclaim activity

• Our server: 14 disks, 2 TB
  – Can fit 3 Exchange volumes

• Subset of volumes for each segment


Trace segment selection

[Chart: total I/O rate (reqs/s, log scale) vs. time of day over the 24-hour trace]

Trace segment selection

[Chart: the same total I/O rate vs. time of day, with the three selected peaks (Peak 1, Peak 2, Peak 3) highlighted]

Three volumes/segment

[Diagram: from the 8 per-volume traces, each segment replays the min-, median-, and max-load volumes; each selected volume is replayed by a client and also hosts an Everest store sized at 3% of the volume]

Mean response time

[Bar chart: mean response time (ms, y-axis 0–200), with and without off-load, for reads and writes in Peaks 1–3]

99th percentile response time

[Bar chart: 99th-percentile response time (ms, y-axis 0–2000), with and without off-load, for reads and writes in Peaks 1–3]

Exchange server summary

• Substantial improvement in I/O latency
  – On a real enterprise server workload
  – Both reads and writes, mean and 99th percentile

• What about application performance?
  – An I/O trace cannot show end-to-end effects

• Where is the benefit coming from?
  – Extra resources, log structure, ...?


OLTP benchmark

[Diagram: an OLTP client drives a SQL Server binary over the LAN; the Everest client is interposed on the SQL Server log and data volumes via Detours DLL redirection, off-loading to an Everest store]

• 10 min warmup
• 10 min measurement

OLTP throughput

[Bar chart: OLTP throughput (tpm, y-axis 0–3000) for five configurations: no off-load, off-load, log-structured, 2-disk striped, and striped + log-structured]

Annotation: off-load benefit ≈ extra disk + log layout. 2x disks, 3x speedup?

Off-loading not a panacea

• Works for short-term peaks
• Cannot be used to improve performance 24/7

• Data is usually reclaimed while the store is still idle
  – Long-term off-load would mean eventual contention

• Data is reclaimed before the store fills up
  – Long-term use would raise the log-cleaner issue


Conclusion

• Peak I/O is a problem
• Everest solves this through off-loading
• By modifying the workload at the block level
  – Removes writes from the overloaded volume
  – Off-loading is short term: data is reclaimed

• Consistency and persistence are maintained
  – State is always correctly recoverable


Questions?


Why not always off-load?

[Diagram: two SQL Servers, each with its own OLTP client and data volume; SQL Server 1 runs an Everest client that off-loads writes to the store on SQL Server 2's volume, whose own reads and writes then contend with the off-loaded traffic]

10 min off-load, 10 min contention

[Bar chart: speedup (y-axis 0–4) during the off-load period and during the subsequent contention period on server 1 and server 2]

Mean and 99th pc (log scale)

[Bar chart: mean and 99th-percentile response times (ms, log scale 1–10000), with and without off-load, for reads and writes in Peaks 1–3]

Read/write ratio of peaks

[CDF: cumulative fraction of peaks vs. percentage of writes (0–100%)]

Exchange server response time

[Chart: per-request response time (log scale) vs. time of day over the 24-hour trace]

Exchange server load (volumes)

[Chart: max, mean, and min per-volume load (reqs/s, log scale) vs. time of day]

Effect of volume selection

[Chart, Peak 1: load (reqs/s/volume) vs. time of day for all volumes and for the selected volumes]

Effect of volume selection

[Chart, Peak 2: load (reqs/s/volume) vs. time of day for all volumes and for the selected volumes]

Effect of volume selection

[Chart, Peak 3: load (reqs/s/volume) vs. time of day for all volumes and for the selected volumes]

Scaling with #stores

[Diagram: the OLTP benchmark setup, with the SQL Server binary intercepted via Detours DLL redirection and the Everest client off-loading to up to three stores over the LAN]

Scaling: linear until CPU-bound

[Chart: speedup vs. number of stores (0–3); scaling is roughly linear until the benchmark becomes CPU-bound]

Everest store: circular log layout

[Diagram: circular log with a header block; new records are appended at the head, reclaim works from the tail of the active log, and stale records are deleted to free space]
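A toy model of this layout, under my own assumptions about record bookkeeping (the actual on-disk format is not shown in the slides): records are appended at the head, delete(block range, version) marks records stale, and the tail advances past stale records to free space. There is no LFS-style cleaner because all data is eventually reclaimed and deleted.

```python
from collections import deque

class CircularLogSketch:
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.records = deque()           # oldest (tail) ... newest (head)

    def append(self, block_range, version, data):
        size = len(data)
        if self.used + size > self.capacity:
            raise IOError("log full: reclaim has not caught up")
        self.records.append(
            {"range": block_range, "version": version, "size": size, "stale": False}
        )
        self.used += size

    def delete(self, block_range, version):
        # delete(block range, version) from the reclaim protocol: mark the
        # matching record stale; its space is freed when the tail passes it.
        for rec in self.records:
            if rec["range"] == block_range and rec["version"] == version:
                rec["stale"] = True
        self.advance_tail()

    def advance_tail(self):
        # Free space by dropping stale records from the tail of the log.
        while self.records and self.records[0]["stale"]:
            self.used -= self.records[0]["size"]
            self.records.popleft()
```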

Exchange server load: CDF

[CDF: cumulative fraction vs. request rate per volume (reqs/s, log scale)]

Unbalanced across volumes

[CDF: cumulative fraction vs. request rate per volume (reqs/s, log scale), showing the min, mean, and max across volumes]