B-trees, Shadowing, and Clones

Ohad Rodeh

Talk outline

Preface
Basics of getting b-trees to work with shadowing
Performance results
Algorithms for cloning (writable-snapshots)

Introduction

The talk is about a free technique useful for file-systems like ZFS and WAFL
It is appropriate for this forum due to the much-discussed port of ZFS to Linux
The ideas described here were used in a research prototype of an object-disk
A b-tree was used for the OSD catalog (directory), an extent-tree was used for objects (files)

Shadowing

Some file-systems use shadowing: WAFL, ZFS, ...
Basic idea:
  1. File-system is a tree of fixed-sized pages
  2. Never overwrite a page
For every command:
  1. Write a short logical log entry
  2. Perform the operation on pages written off to the side
  3. Perform a checkpoint once in a while

Shadowing II

In case of crash: revert to previous stable checkpoint and replay the log
Shadowing is used for: snapshots, crash-recovery, write-batching, RAID
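
The recovery rule (revert to the last stable checkpoint, then replay the log) can be sketched in a few lines. This is a toy model and not the prototype's code: the checkpoint is a plain dictionary, and the log-entry shape `(op, key, value)` is an assumption made for illustration.

```python
def recover(checkpoint, log):
    """Rebuild the current state from the last stable checkpoint plus the logical log."""
    state = dict(checkpoint)            # revert: start from a copy of the stable checkpoint
    for op, key, value in log:          # replay: re-apply each logged command in order
        if op == "insert":
            state[key] = value
        elif op == "remove":
            state.pop(key, None)
        else:
            raise ValueError(f"unknown log entry: {op}")
    return state


# Hypothetical data: a checkpoint with two keys and two commands logged after it.
stable = {3: "a", 6: "b"}
log = [("insert", 8, "c"), ("remove", 3, None)]
recovered = recover(stable, log)
```

The checkpoint itself is never modified, which mirrors the "never overwrite a page" rule.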

Shadowing III

Important optimizations
  1. Once a page is shadowed, it does not need to be shadowed again until the next checkpoint
  2. Batch dirty-pages and write them sequentially to disk
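
A minimal sketch of the first optimization, assuming a toy in-memory page store. The class name `ShadowStore` and its methods are hypothetical, not part of any real file-system API.

```python
class ShadowStore:
    """Toy page store: a page that survived a checkpoint is never overwritten in place."""

    def __init__(self):
        self.pages = {}       # address -> page contents (a dict)
        self.next_addr = 0
        self.fresh = set()    # addresses allocated since the last checkpoint

    def alloc(self, content):
        addr = self.next_addr
        self.next_addr += 1
        self.pages[addr] = content
        self.fresh.add(addr)
        return addr

    def shadow(self, addr):
        """Return an address that is safe to modify."""
        if addr in self.fresh:                       # already shadowed since the checkpoint
            return addr
        return self.alloc(dict(self.pages[addr]))    # copy the page off to the side

    def checkpoint(self):
        """Everything written so far becomes stable."""
        self.fresh.clear()


store = ShadowStore()
a = store.alloc({"k": 1})
store.checkpoint()
b = store.shadow(a)        # first modification after the checkpoint: a copy is made
c = store.shadow(b)        # second modification: the same shadow page is reused
store.pages[c]["k"] = 2    # the stable page at address a is untouched
```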

Snapshots

Taking snapshots is easy with shadowing
In order to create a snapshot:
  1. The file-system allows more than a single root
  2. A checkpoint is taken but not erased

B-trees

B-trees are used by many file-systems to represent files and directories: XFS, JFS, ReiserFS, SAN.FS
They guarantee logarithmic-time key-search, insert, remove
The main questions:
  1. Can we use B-trees to represent files and directories in conjunction with shadowing?
  2. Can we get good concurrency?
  3. Can we support lots of clones efficiently?

Challenges

Challenge to multi-threading: changes propagate up to the root. The root is a contention point.
In a regular b-tree the leaves can be linked to their neighbors.
  Not possible in conjunction with shadowing
  Throws out a lot of the existing b-tree literature

[Figure: the same example b-tree drawn twice, one row per level]
  3 15
  3 6 10 15 20
  3 4 6 7 8 10 11 15 16 20 25

Write-in-place b-tree

Used by DB2 and SAN.FS
Updates b-trees in place; no shadowing
Uses a bottom-up update procedure

Write-in-place example

Insert-element
  1. Walk down the tree until reaching the leaf L
  2. If there is room: insert in L
  3. If there isn’t, split and propagate upward
Note: tree nodes contain between 2 and 5 elements

[Figure: the tree before and after inserting element 8, one row per level]

Before:
  3 25
  3 6 9 15 20 25 30
  3 4 6 7 10 11 15 16 20 23 26 27 30 35 40

After:
  3 25
  3 6 9 15 20 25 30
  3 4 6 7 8 10 11 15 16 20 23 26 27 30 35 40

Alternate shadowing approach

Used in many databases, for example, Microsoft SQL Server.
Pages have virtual addresses
There is a table that maps virtual addresses to physical addresses
In order to modify page P at address L1:
  1. Copy P to another physical address L2
  2. Modify the mapping table, P → L2
  3. Modify the page at L2
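
The three steps can be sketched with a toy mapping table. The names `MappedStore`, `create`, and `modify` are hypothetical and do not reflect any database's actual interface.

```python
class MappedStore:
    """Toy model of shadowing through a virtual-to-physical mapping table."""

    def __init__(self):
        self.disk = {}        # physical address -> page contents
        self.table = {}       # virtual page id -> physical address
        self.next_phys = 0

    def create(self, vid, content):
        self.disk[self.next_phys] = content
        self.table[vid] = self.next_phys
        self.next_phys += 1

    def modify(self, vid, key, value):
        l1 = self.table[vid]
        l2 = self.next_phys
        self.next_phys += 1
        self.disk[l2] = dict(self.disk[l1])   # 1. copy P to another physical address L2
        self.table[vid] = l2                  # 2. update the mapping table, P -> L2
        self.disk[l2][key] = value            # 3. modify the page at L2
        return l1, l2


store = MappedStore()
store.create("P", {"x": 1})
old, new = store.modify("P", "x", 2)
```

Pages that refer to `"P"` by its virtual id need no change, which is why this approach avoids the ripple up to the root.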

Alternate shadowing approach II

Pros
  Avoids the ripple effect of shadowing
  Uses b-link trees, very good concurrency
Cons
  Requires an additional persistent data structure
  Performance of accessing the map is crucial to good behavior

Requirements from shadowed b-tree

The b-tree has to:
  1. Have good concurrency
  2. Work well with shadowing
  3. Use deadlock avoidance
  4. Have guaranteed bounds on space and memory usage per operation
Tree has to be balanced

The solution: insert-key

Top-down
Lock-coupling for concurrency
Proactive split
Shadowing on the way down
Insert element 8
  1. Causes a split to node [3,6,9,15,20]
  2. Inserts into [6,7]

[Figure: the tree before and after inserting element 8, one row per level]

Before:
  3 25
  3 6 9 15 20 25 30
  3 4 6 7 10 11 15 16 20 23 26 27 30 35 40

After:
  3 15 25
  3 6 9 15 20 25 30
  3 4 6 7 8 10 11 15 16 20 23 26 27 30 35 40
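
A minimal sketch of top-down insertion with proactive splitting, replayed on the slide's example. It leaves out lock-coupling and shadowing entirely, and the node layout (index nodes store one separator key per child) is an assumption read off the figure, not the prototype's format.

```python
import bisect

MAX = 5  # the example's nodes hold between 2 and 5 entries


class Node:
    def __init__(self, keys, children=None):
        self.keys = keys              # leaf: the keys; index node: one separator per child
        self.children = children      # None for a leaf


def split(parent, i):
    """Split the full child parent.children[i] in two; the parent gains one entry."""
    child = parent.children[i]
    mid = (len(child.keys) + 1) // 2
    right = Node(child.keys[mid:], child.children[mid:] if child.children else None)
    child.keys = child.keys[:mid]
    if child.children:
        child.children = child.children[:mid]
    parent.keys.insert(i + 1, right.keys[0])
    parent.children.insert(i + 1, right)


def insert(root, key):
    """Insert key in one downward pass; returns the (possibly new) root."""
    if len(root.keys) == MAX:                      # a full root is split before descending
        root = Node([root.keys[0]], [root])
        split(root, 0)
    node = root
    while node.children is not None:
        i = max(bisect.bisect_right(node.keys, key) - 1, 0)
        if len(node.children[i].keys) == MAX:      # proactive split of a full child
            split(node, i)
            if key >= node.keys[i + 1]:
                i += 1
        node.keys[i] = min(node.keys[i], key)      # separator never exceeds the keys below it
        node = node.children[i]
    bisect.insort(node.keys, key)                  # the leaf is guaranteed to have room
    return root


# The tree from the slide, then "insert element 8".
leaves = [Node(k) for k in ([3, 4], [6, 7], [10, 11], [15, 16], [20, 23], [26, 27], [30, 35, 40])]
root = Node([3, 25], [Node([3, 6, 9, 15, 20], leaves[:5]), Node([25, 30], leaves[5:])])
root = insert(root, 8)
```

Because a full node is split before the descent continues, no split ever has to propagate back up, so only the current node and its child are touched at any step.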

Remove-key

Top-down
Lock-coupling for concurrency
Proactive merge/shuffle
Shadowing on the way down
Example: remove element 10

[Figure: three stages of removing element 10, one row per level]

Stage 1:
  3 15
  3 6 15 20
  3 4 6 7 8 10 11 15 16 20 25 30

Stage 2:
  3 6 15 20
  3 4 6 7 8 10 11 15 16 20 25 30

Stage 3:
  3 6 15 20
  3 4 6 7 8 11 15 16 20 25 30

Analysis for Insert/Remove-key

Always hold two/three locks
At most three pages held at any time
Modify a single path in the tree

Pros/Cons

Cons:
  Effectively lose two keys per node due to proactive split/merge policy
  Need loose bounds on number of entries per node (b ... 3b)
Pros:
  No deadlocks, no need for deadlock detection/avoidance
  Relatively simple algorithms, adaptable for controllers

Cloning

To clone a b-tree means to create a writable copy of it that allows all operations: lookup, insert, remove, and delete.
A cloning algorithm has several desirable properties

Cloning properties

Assume p is a b-tree and q is a clone of p, then:
  1. Space efficiency: p and q should, as much as possible, share common pages
  2. Speed: creating q from p should take little time and overhead
  3. Number of clones: it should be possible to clone p many times
  4. Clones as first class citizens: it should be possible to clone q

Cloning, a naive solution

A trivial algorithm for cloning a tree is copying it wholesale.
This does not have the above four properties.

WAFL free-space

There are deficiencies in the classic WAFL free space
A bit is used to represent each clone
With a map of 32-bits per data block we get 32 clones
To support 256 clones, 32 bytes are needed per data-block.
In order to clone a volume or delete a clone we need to make a pass on the entire free-space

Challenges

How do we support a million clones without a huge free-space map?
How do we avoid making a pass on the entire free-space?

Main idea

Modify the free space so it will keep a reference count (ref-count) per block
Ref-count records how many times a page is pointed to
A zero ref-count means that a block is free
Essentially, instead of copying a tree, the ref-counts of all its nodes are incremented by one
This means that all nodes belong to two trees instead of one; they are all shared
Instead of making a pass on the entire tree and incrementing the counters during the clone operation, this is done in a lazy fashion.

Cloning a tree

  1. Copy the root-node of p into a new root
  2. Increment the free-space counters for each of the children of the root by one

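[Figure: tree p before and after cloning; each node is shown as name,ref-count]

Before cloning:
  P,1
  B,1 C,1
  D,1 E,1 G,1 H,1

After cloning p into q:
  P,1 Q,1
  B,2 C,2
  D,1 E,1 G,1 H,1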
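
The two cloning steps can be replayed on the figure's tree with a pair of dictionaries. This is a sketch under assumptions: the free-space map is modelled as a plain `refcount` dictionary, and the helper `add_node` is hypothetical.

```python
refcount = {}   # free-space map: block name -> number of pointers to the block
children = {}   # block name -> child block names (empty for leaves)


def add_node(name, kids=()):
    refcount[name] = 1
    children[name] = list(kids)


def clone(root, new_root):
    add_node(new_root, children[root])    # 1. copy the root node into a new root
    for child in children[root]:          # 2. each child of the root gains one parent
        refcount[child] += 1


for leaf in "DEGH":
    add_node(leaf)
add_node("B", ["D", "E"])
add_node("C", ["G", "H"])
add_node("P", ["B", "C"])
clone("P", "Q")
```

Only the root's children are touched; the counts of D, E, G, and H stay at 1 until a later modification breaks the sharing above them.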

Mark-dirty, without clones

Before modifying page N, it is “marked-dirty”
  1. Informs the run-time system that N is about to be modified
  2. Gives it a chance to shadow the page if necessary
If ref-count == 1: page can be modified in place
If ref-count > 1, N is relocated from address a1 to address a2, and:
  1. The ref-count for a1 is decremented
  2. The ref-count for a2 is made 1
  3. The ref-count of N’s children is incremented by 1

Example, insert-key into leaf H, tree q

(I) Initial trees, Tp and Tq
  P,1 Q,1
  B,2 C,2
  D,1 E,1 G,1 H,1

(II) Shadow Q
  P,1 Q,1
  B,2 C,2
  D,1 E,1 G,1 H,1

(III) Shadow C
  P,1 Q,1
  B,2 C,1 C’,1
  D,1 E,1 G,2 H,2

(IV) Shadow H
  P,1 Q,1
  B,2 C,1 C’,1
  D,1 E,1 G,2 H,1 H’,1
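
The four panels can be reproduced with a small sketch of the mark-dirty rules. The dictionaries stand in for the free-space map and the tree pointers, and naming the relocated copy by appending `'` is an illustration device, not the prototype's scheme.

```python
# Panel (I): trees Tp and Tq right after cloning.
refcount = {"P": 1, "Q": 1, "B": 2, "C": 2, "D": 1, "E": 1, "G": 1, "H": 1}
children = {"P": ["B", "C"], "Q": ["B", "C"], "B": ["D", "E"], "C": ["G", "H"],
            "D": [], "E": [], "G": [], "H": []}


def mark_dirty(node, parent=None):
    """Return the block that may be modified, shadowing node first if it is shared."""
    if refcount[node] == 1:
        return node                          # private page: modify in place
    copy = node + "'"                        # stands for the new address a2
    refcount[node] -= 1                      # 1. one pointer fewer to the old address a1
    refcount[copy] = 1                       # 2. the new address starts with ref-count 1
    children[copy] = list(children[node])
    for child in children[copy]:             # 3. the children gain one more parent
        refcount[child] += 1
    slots = children[parent]                 # roots are never shared in this sketch
    slots[slots.index(node)] = copy          # the already-private parent points at the copy
    return copy


# Insert into leaf H of tree q: mark the path Q, C, H dirty from the top down.
q = mark_dirty("Q")
c = mark_dirty("C", parent=q)
h = mark_dirty("H", parent=c)
```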

Deleting a tree

ref-count(N) > 1: Decrement the ref-count and stop downward traversal. The node is shared with other trees.
ref-count(N) == 1: It belongs only to q. Continue downward traversal and on the way back up de-allocate N.

Delete example

(I) Initial trees Tp and Tq
  P,1 Q,1
  B,1 C,2 X,1
  D,1 E,1 G,1 Y,2 Z,1

(II) After deleting Tq
  P,1
  B,1 C,1
  D,1 E,1 G,1 Y,1
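
A sketch of the delete traversal on this example. The parent/child layout below is an assumption inferred from the ref-counts in the figure (C shared by P and Q, Y shared by C and X); the flattened slide does not show the edges.

```python
refcount = {"P": 1, "Q": 1, "B": 1, "C": 2, "X": 1,
            "D": 1, "E": 1, "G": 1, "Y": 2, "Z": 1}
children = {"P": ["B", "C"], "Q": ["C", "X"], "B": ["D", "E"],
            "C": ["G", "Y"], "X": ["Y", "Z"],
            "D": [], "E": [], "G": [], "Y": [], "Z": []}


def delete_tree(node):
    if refcount[node] > 1:
        refcount[node] -= 1       # shared with another tree: decrement and stop descending
        return
    for child in children[node]:  # private node: continue the downward traversal
        delete_tree(child)
    refcount[node] = 0            # on the way back up, de-allocate (zero means free)


delete_tree("Q")
live = {name: count for name, count in refcount.items() if count > 0}
```

Only Q, X, and Z are visited to completion; the traversal stops at C and Y as soon as it sees a ref-count above one.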

Comparison to WAFL free-space

Clone:
  Ref-counts: increase ref-counts for children of root
  WAFL: make a pass on the entire free-space and set bits
Delete clone p:
  Ref-counts: traverse the nodes that belong only to p and decrement ref-counters
  WAFL: make a pass on the entire free-space and set snapshot bit to zero

Comparison to WAFL free-space

During normal runtime
  Ref-counts: additional cost of incrementing ref-counts while performing modifications
  WAFL: none
Space taken by free-space map
  Ref-counts: two bytes per block allow 64K clones
  WAFL: two bytes allow 16 clones
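
The space figures follow from simple arithmetic: a bitmap needs one bit per clone, while a counter only needs enough bits to hold the largest count. A small sketch (the function names are made up for illustration):

```python
def bitmap_bytes(clones):
    """Bytes per block when each clone owns one bit of the free-space map."""
    return (clones + 7) // 8


def refcount_bytes(clones):
    """Bytes per block for a counter that can reach `clones`."""
    bits = max(clones, 1).bit_length()
    return (bits + 7) // 8


sizes = {n: (bitmap_bytes(n), refcount_bytes(n)) for n in (16, 256, 65_535, 1_000_000)}
```

For a million clones this gives 125,000 bytes per block for the bitmap against 3 bytes for the counter.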

Resource and performance analysis

The addition of ref-counts does not add b-tree node accesses. Worst-case estimate on memory-pages and disk-blocks used per operation is unchanged
Concurrency remains unaffected by ref-counts
Sharing on any node that requires modification is quickly broken and each clone gets its own version

FS counters

The number of free-space accesses increases. Potential of significantly impacting performance.
Several observations make this unlikely:
  1. Once sharing is broken for a page and it belongs to a single tree, there are no additional ref-count costs associated with it.
  2. The vast majority of b-tree pages are leaves. Leaves have no children and therefore do not incur additional overhead.

FS counters II

The experimental test-bed uses in-memory free-space maps
  1. Precludes serious investigation of this issue
  2. Remains for future work

Summary

The b-trees described here:
  Are recoverable
  Have good concurrency
  Are efficient
  Have good bounds on resource usage
  Have a good cloning strategy

Backup slides

Performance

In theory, top-down b-trees have a bottle-neck at the top
In practice, this does not happen because the top nodes are cached
In the experiments
  1. Entries are 16 bytes: key=8 bytes, data=8 bytes
  2. A 4KB node contains 70-235 entries

Test-bed

Single machine connected to a DS4500 through Fiber-Channel.
Machine: dual-CPU Xeon 2.4GHz with 2GB of memory.
Operating System: Linux-2.6.9
The b-tree is laid out on a virtual LUN
The LUN is a RAID1 in a 2+2 configuration
Strip width is 64K, full stripe=512KB
Read and write caching is turned off

Basic disk performance

IO-size=4KB
Disk-area=1GB

#threads  op.    time per op. (ms)  ops per second
10        read   N/A                1217
10        write  N/A                640
10        R+W    N/A                408
1         read   3.9                256
1         write  16.8               59
1         R+W    16.9               59
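
For the single-threaded rows, the throughput column is the reciprocal of the latency column. A quick check using only the numbers in the table:

```python
# Single-thread latencies from the table, in milliseconds per operation.
single_thread_latency_ms = {"read": 3.9, "write": 16.8, "R+W": 16.9}

# One thread issues one I/O at a time, so ops/second = 1000 ms / latency.
ops_per_second = {op: 1000.0 / ms for op, ms in single_thread_latency_ms.items()}

# Ten threads overlap their I/Os, which is why the measured read rate is higher.
speedup_reads = 1217 / ops_per_second["read"]
```

The computed rates agree with the table once the fractional part is dropped.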

Test methodology

The ratio of cache-size to number of b-tree pages
  1. Is fixed at initialization time
  2. This ratio is called the in-memory percentage
Various trees were used, with the same results. The experiments reported here are for tree T235.

Tree T235

Maximal fanout: 235
Legal #entries: 78 .. 235
Contains 9.5 million keys and 65500 nodes
  1. 65000 leaves
  2. 500 index-nodes
Tree depth is: 4
Average node capacity 150 keys
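
The figures on this slide can be sanity-checked with back-of-the-envelope arithmetic. The interpretation is an assumption: that the gap between 256 raw slots and the fanout of 235 is per-node overhead, and that depth can be estimated from the average fill.

```python
import math

NODE_BYTES = 4096
ENTRY_BYTES = 16                                  # 8-byte key plus 8-byte data
raw_slots = NODE_BYTES // ENTRY_BYTES             # upper bound on entries in a 4KB node

keys, leaves = 9_500_000, 65_000
avg_keys_per_leaf = keys / leaves                 # close to the quoted average of about 150

# Levels needed to reach every key if every node fans out about 150 ways.
estimated_depth = math.ceil(math.log(keys) / math.log(150))
```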

Test methodology II

Create a large tree using random operations
For each test
  1. Clone the tree
  2. Age the clone by doing 1000 random insert-key/remove-key operations
  3. Perform 10^4 to 10^8 measurements against the clone with random keys
  4. Delete the clone
Perform each measurement 5 times, and average.
The standard deviation was less than 1% in all tests.

Latency measurements

Four operations whose latency was measured: lookup-key, insert-key, remove-key, append-key.
Latency measured in milliseconds

Lookup  Insert  Remove  Append
3.43    16.76   16.46   0.007

Different in-memory ratios

Workload: 100% lookup workload

% in-memory  1 thread  10 threads  ideal
100          14237     19805       ∞
50           391       1842        2434
25           321       1508        1622
10           268       1290        1352
5            254       1210        1281
2            250       1145        1241
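
The ideal column is numerically consistent with a simple model: every cache miss costs one disk read, and the disk sustains 1217 random 4KB reads per second with 10 threads. The slides do not state this formula, so treat it as an inference from the numbers.

```python
import math

DISK_READS_PER_SEC = 1217   # measured 10-thread read rate of the disk


def ideal_lookups_per_sec(in_memory_pct):
    """Upper bound on lookups/second if each miss costs exactly one disk read."""
    miss_ratio = 1 - in_memory_pct / 100
    if miss_ratio == 0:
        return math.inf                  # everything is cached: the disk is never touched
    return DISK_READS_PER_SEC / miss_ratio


ideal = {pct: ideal_lookups_per_sec(pct) for pct in (100, 50, 25, 10, 5, 2)}
```

Dropping the fractional part of each result reproduces the column in the table.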

Throughput

Four workloads were used:
  1. Search-100: 100% lookup
  2. Search-80: 80% lookup, 10% insert, 10% remove
  3. Modify: 20% lookup, 40% insert, 40% remove
  4. Insert: 100% insert
Metric: operations per second
Since there isn’t much difference between 2% in memory and 50%, the rest of the experiments were done using 5%.
Allows putting all index nodes in memory.

Throughput II

Tree   #threads  Src-100  Src-80  Modify  Insert
T235   10        1227     748     455     400
T235   1         272      144     75      62
Ideal            1281                     429

Workload with some locality

Workload: randomly choose a key
  With 80% probability, read the next 100 keys after it
  With 10% probability, insert/overwrite the next 100 keys
  With 10% probability, remove the next 100 keys

#threads  semi-local
10        16634
1         3848
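
A sketch of a generator for this semi-local workload. The function name, the key-space size, and the `(kind, key)` tuple shape are assumptions; the slides give only the probabilities and the run length of 100.

```python
import random


def semi_local_batch(rng, max_key, run=100):
    """Pick a random key, then emit one kind of operation on the next `run` keys."""
    start = rng.randrange(max_key)
    r = rng.random()
    if r < 0.8:
        kind = "lookup"        # 80%: read the next keys
    elif r < 0.9:
        kind = "insert"        # 10%: insert/overwrite the next keys
    else:
        kind = "remove"        # 10%: remove the next keys
    return [(kind, key) for key in range(start + 1, start + 1 + run)]


rng = random.Random(0)         # seeded so that a run can be repeated
batch = semi_local_batch(rng, max_key=10_000_000)
```

Consecutive keys tend to land in the same leaf, so one batch touches far fewer pages than 100 independent random operations would.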

Clone performance

Two clones are made of base tree T235
Aging is performed
  1. 12000 operations are performed
  2. 6000 against each clone

          Src-100  Src-80  Modify  Insert
2 clones  1221     663     393     343
base      1227     748     455     400

Clone performance at 100% in-memory

           Src-100  Src-80  Modify  Insert
2 threads  20395    18524   16907   16505
1 thread   13910    12670   11452   11112

Performance of checkpointing

A checkpoint is taken during the throughput test
Performance degrades
  1. A dirty page has to be destaged prior to being modified
  2. Caching of dirty-pages is affected

Performance of checkpointing II

Tree T235   Src-100  Src-80  Modify  Insert
checkpoint  1205     689     403     353
base        1227     748     455     400
