Upload
others
View
12
Download
0
Embed Size (px)
Citation preview
Dostoevsky: Better Space-Time Trade-Offs For Based Key-Value Stores via Adaptive Removal of Superfluous Merging
Niv Dayan & Stratos Idreos
LSM-Tree
LSM-Tree
Key-Value Stores
LSM-Tree
Key-Value Stores
Time-Series Databases
LSM-Tree
Key-Value Stores
Time-Series DatabasesRelational Databases
LSM-Tree
Key-Value Stores
Time-Series Databases
Algorithm Design
Relational Databases
LSM-Tree
LSM-Tree
LSM-Tree
LSM-Tree
LSM-Tree
workload
MAX
MAX
stuck here
MAX
LSM-Tree
LSM-Tree
MAX
stuck here
suboptimal design
LSM-Tree
MAX
stuck here
suboptimal designwhy? how?
Background
LSM-TreeLSM-Tree
updates
buffer
storage
0
level
memory
memoryupdates
buffer
storage
0
1
level
sort & flush runs
memoryupdates
buffer
storage
0
1
2
sort merge
memory
buffer
storage
0
1
2
3
level
exponentially increasing capacities
level 1
level 2
level 3
memory
buffer
storage
0
1
2
3
level
fence pointers
lookup key X
one I/O per run
X
memory
buffer
storage
0
1
2
3
level
fence pointers
Bloomfilters
lookup key X
X
memory
buffer
storage
0
1
2
3
fence pointers
Bloom filters merging frequency
merging frequency
lookup costmerge cost
merging frequency
lookup costmerge cost
LevelingTieringlookup-optimizedmerge-optimized
merging frequency
LevelingTieringmerge-optimized lookup-optimized
LevelingTiering
accumulate
merge-optimized lookup-optimized
LevelingTiering
merge & flush
accumulate
lookup-optimizedmerge-optimized
LevelingTiering
accumulate
lookup-optimizedmerge-optimized
LevelingTiering
mergeaccumulate
lookup-optimizedmerge-optimized
LevelingTiering
flush
accumulate merge
lookup-optimizedmerge-optimized
LevelingTiering
accumulate merge
lookup-optimizedmerge-optimized
LevelingTiering
logR(N)
lookup-optimizedmerge-optimized
LevelingTiering
size ratio
logR(N)
lookup-optimizedmerge-optimized
LevelingTiering
R runs per level
1 run per level
lookup-optimizedmerge-optimized
size ratio
logR(N)
LevelingTiering
R runs per level
1 run per level
size ratio R
lookup-optimizedmerge-optimized
1 run per level
LevelingTiering
1 run per level
size ratio R
lookup-optimizedmerge-optimized
LevelingTiering
T runs per level
1 run per level
size ratio R
lookup-optimizedmerge-optimized
1 run per level
LevelingTiering
log sorted array
O(lNl) runs per level
size ratio R
lookup-optimizedmerge-optimized
merge cost
Tiering
log
Leveling
sorted array
lookup cost
Tiering
log
Leveling
sorted array
size ratio Rlookup cost
merge cost
Tiering
log
Leveling
sorted array
R
R
size ratio Rlookup cost
merge cost
lookup cost
merge cost
lookup cost
merge cost
Monkeylookup cost
merge cost
Dostoevsky
lookup cost
merge cost
Monkey
Bloomfilters
Monkey: Optimal Navigable Key-Value Store SIGMOD17
Bloom filters
SIGMOD17
O(p)
O(p/R)
O(p/R2)
false positive rate
Monkey: Optimal Navigable Key-Value Store
Bloom filters
SIGMOD17
false positive rate
Monkey: Optimal Navigable Key-Value Store
O(e-x)
O(e-x/R)
O(e-x/R2)
Bloom filters
SIGMOD17
false positive rate
Monkey: Optimal Navigable Key-Value Store
O(e-x) I/O =
geometric progression
O(e-x)
O(e-x/R)
O(e-x/R2)
Bloom filters
SIGMOD17
Monkey: Optimal Navigable Key-Value Store
O(e-x · logR(N)) > O(e-x) I/O
Existing
Monkeylookup cost
merge cost
Existing
Monkey
Dostoevsky
lookup cost
merge cost
tiering
lookup cost
merge cost
Monkey
leveling
point merging
I/O overheads with leveling
long range short range
exponentially decreasing
O(e-x)
O(e-x/R)
O(e-x/R2)
point
false positive rates
false positive rates
point largest levelO(e-x)
O(e-x/R)
O(e-x/R2)
largest level
O(e-x)
merginglong range short range point
target key range
target range
O(s)
O(s/R)
O(s/R2)
long range
target key range
target range
largest levellong range O(s)
O(s/R)
O(s/R2)
largest level
largest level
O(e-x) O(s)
point merginglong range short range
short range
target range
1
1
1
all levels
target range
1
1
1
O(logR(N))
short range
largest levellargest level
point
O(e-x) O(s)
all levels
O(logR(N))
merginglong range short range
exponentially more work
exponentially less frequent
merging
exponentially more work
exponentially less frequent
merging
=
=
merging
all levels
mor
e w
ork
less frequent
=
=
merging
all levels
merging
merge 1
merging
merge 2
merging
Rmerge
O(R)
O(R)
O(R)
merging
write-amplification
O(R · logR(N))O(R)
O(R)
O(R)
merging
O(e-x) O(s) O(logR(N)) O(R · logR(N))
O(s/R2)
O(s/R)
1
1
1
O(s)O(e-x)
O(e-x/R)
O(e-x/R2)
===
+
+
+
+
+
+
largest levellargest level all levels all levels
=
+
+
long range point short range merging
O(R)
O(R)
O(R)
O(s)O(e-x)
largest levellargest level all levels
merginglong range point
O(R)
O(R)
O(R)
O(s)O(e-x)
largest levellargest level all levels
merginglong range point
superfluous
O(R)
O(R)
O(R)
merging at smaller levels is superfluousfor point lookups and long range lookups
worse as data grows!
poor performance
poor performancelower device lifetime (on SSD)
Dostoevsky
Dostoevsky: Space-Time Optimized Evolvable Scalable Key-Value Store
Dostoevsky: Space-Time Optimized Evolvable Scalable Key-Value Store
very write-optimized
LevelingTieringlookup-optimizedmerge-optimized
LevelingTieringlookup-optimizedmerge-optimized
Lazy Levelingmixed-workloads
Lazy Leveling
Leveling
Tiering
Lazy Leveling
merge to have at most 1 run
merge when level fills
long range short range point merging
O(e-x)
O(e-x/R2)
O(e-x/R3)
false positive rates
point
false positive rates
exponentially decreasing
point
O(e-x)
O(e-x/R2)
O(e-x/R3)
false positive rates
largest level
point O(e-x)
O(e-x/R2)
O(e-x/R3)
O(e-x)
point merginglong range short range
target key range
target range
O(s)
O(s/R)
O(s/R2)
long range
target key range
target range
largest level
long range O(s)
O(s/R)
O(s/R2)
O(e-x) O(s)
point merging
long range short range
1
O(R)
O(1+R·(logR(N)-1))
target key range
O(R)
short range
O(1+R·(logR(N)-1))O(e-x) O(s)
long range point mergingshort range
write-amplification
O(1)
O(1)
O(R)
merging
write-amplification
O( R + logR(N) )
merging
O(1)
O(1)
O(R)
O( R + logR(N) )O(e-x) O(s)
long range point short range merging
O(1+R · (logR(N)-1))
O( R + logR(N) )O(1+R · (logR(N)-1))O(e-x) O(s)Lazy Leveling
Leveling O(e-x) O(s) O(logR(N)) O( R · logR(N) )
== V V
long range point short range merging
O( R + logR(N) )O(1+R · (logR(N)-1))O(e-x) O(s)Lazy Leveling
Leveling O(e-x) O(s) O(logR(N)) O( R · logR(N) )
Tiering O( logR(N) )O(R · logR(N))O(R · e-x) O(R · s)
V VV V
long range point short range merging
Leveling
merging
Lazy Leveling
Tiering
point
Leveling
short range Lazy Leveling
merging
Tiering
Leveling
long range
Lazy Leveling
merging
Tiering
LevelingTiering Lazy Leveling
LevelingTiering Lazy Levelingupdates
LevelingTiering Lazy Levelingupdates short range
LevelingTieringupdates short rangeupdates & point
Lazy Leveling
Tiering
Lazy Leveling
Fluid
Leveling
K runs
Z runs
LSM-TreeFluid
R runs
1 runs
Lazy Leveling
LSM-TreeFluid
R runs
1 runs
long range short range point merging
Lazy Leveling
R runs
1 runs
long range short range point merging
Lazy Leveling
optimize
2 runs
1 runs
long range short range point merging
Lazy Leveling
optimize
Leveling
short range merging
1 runs
1 runs
optimizelong range point
Leveling
short range merging
1 runs
1 runs
long range point long range pointoptimize
R runs
1 runs
long range short range point merging
Lazy Leveling
optimize
R runs
2 runs
long range short range point
Lazy Leveling
optimizemerging
R runs
R runs
long range short range point
Tiering
optimizemerging
R runs
R runs
long range short range point
Tiering
mergingoptimize
R runs
1 runs
long range short range point merging
Lazy Leveling
optimize
R runs
1 runs
long range short range point merging
Lazy Leveling
optimize
R size ratio
R runs
1 runs
long range short range point merging
Lazy Leveling
optimize
R size ratio
Fluid LSM-TreeR size ratioK runs at smaller levelsZ runs at largest level
Fluid LSM-Tree
Tiering
Lazy Leveling
Leveling
0
0.5
1
norm
alize
dth
roughput(ops/s)
10�2 10�1
point lookups / updates
leveling
point lookups / updates
0
0.5
1
normalized throughput
1/100 1/10
0
0.5
1
norm
alize
dth
roughput(ops/s)
10�2 10�1
point lookups / updates
leveling
point lookups / updates
0
0.5
1
normalized throughput
1/100 1/10
0
0.5
1
norm
alize
dth
roughput(ops/s)
10�2 10�1
point lookups / updates
levelingtiering
point lookups / updates
0
0.5
1
normalized throughput
1/100 1/10
0
0.5
1
norm
alize
dth
roughput(ops/s)
10�2 10�1
point lookups / updates
levelingtiering
lazy leveling
point lookups / updates
1/100 1/100
0.5
1
normalized throughput
0
0.5
1
norm
alize
dth
roughput(ops/s)
10�2 10�1
point lookups / updates
levelingtiering
lazy leveling
Dostoevsky
point lookups / updates
0
0.5
1
normalized throughput
1/100 1/10
0
0.5
1
norm
alize
dth
roughput(ops/s)
10�2 10�1
point lookups / updates
0
0.5
1
normalized throughput
point lookups / updates
Dostoevsky
1/100 1/10
0
0.5
1
norm
alize
dth
roughput(ops/s)
10�2 10�1
point lookups / updatespoint lookups / updates
0
1
normalized throughput
Dostoevsky
0.5
Monkey
1/100 1/10
0
0.5
1
norm
alize
dth
roughput(ops/s)
10�2 10�1
point lookups / updatespoint lookups / updates
0
0.5
1
normalized throughput
Dostoevsky
Tuned RocksDB
Monkey
1/100 1/10
0
0.5
1
norm
alize
dth
roughput(ops/s)
10�2 10�1
point lookups / updatespoint lookups / updates
0
0.5
1
normalized throughput
Dostoevsky
Tuned RocksDB
Monkey
Untuned RocksDB
1/100 1/10
Conclusion
Lazy Leveling no merging
greedy merging
Conclusion
Conclusion
Lazy Leveling
point
merging
Conclusion
Lazy Leveling
point
merging
Fluid LSM-Tree
Conclusion
Lazy Leveling
Fluid LSM-Tree Dostoevskymerging
point
Dostoevsky: Space-Time Optimized Evolvable Scalable Key-Value Store
VERY WRITE-OPTIMIZED
A self-designing key-value storewww.crimsondb.org
Thanks!