The Multi-streamed Solid-State Drive
Jeong-Uk Kang, Jeeseok Hyun, Hyunjoo Maeng*, and Sangyeun Cho
Memory Solutions Lab.
Memory Division, Samsung Electronics Co., Ltd.
SSD as a Drop-in Replacement of HDD
SSDs share a common interface with HDDs
• The block device abstraction paved the way for wide adoption of SSDs
[Diagram: identical host stacks (Application → OS → File System → Generic Block Layer) address either an HDD or an SSD over SATA using logical block addresses]
Great, BUT…
Rotating media and NAND flash memory are very different!
[Diagram: the same host stack drives two very different media — a sector-based disk exposing Read_Sector()/Write_Sector(), and NAND flash memory exposing Read_Page(), Write_Page(), Erase_Block(), and Copy_Page(). How do logical block addresses map onto NAND?]
The Trick is FTL!
SSD = NAND flash memory + flash translation layer (FTL)
• Logical block mapping
• Bad block management
• Garbage collection (GC)
[Diagram: the (sector-based) file system and generic block layer sit on top of the FTL, which translates sector reads and writes into page Read/Write and block Erase operations on NAND blocks made up of pages]
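To make the FTL's out-of-place updates concrete, here is a minimal page-mapping sketch — a sketch only, assuming a flat logical-to-physical table and a trivial free-page allocator; names like `l2p` and `ftl_write` are illustrative, not the FTL described in this talk:

```c
#include <stdint.h>

#define PAGES_PER_BLOCK 4
#define NUM_BLOCKS      64
#define NUM_PAGES       (PAGES_PER_BLOCK * NUM_BLOCKS)
#define INVALID         UINT32_MAX

static uint32_t l2p[NUM_PAGES];    /* logical page -> physical page   */
static uint8_t  valid[NUM_PAGES];  /* liveness of each physical page  */
static uint32_t next_free;         /* next erased physical page       */

static void ftl_init(void)
{
    for (uint32_t i = 0; i < NUM_PAGES; i++)
        l2p[i] = INVALID;          /* nothing mapped yet */
}

/* NAND pages cannot be overwritten in place, so every logical write
 * is redirected to a fresh physical page; the stale copy is merely
 * marked invalid and waits for garbage collection. */
static void ftl_write(uint32_t lpn)
{
    if (l2p[lpn] != INVALID)
        valid[l2p[lpn]] = 0;       /* invalidate the old version */
    l2p[lpn] = next_free;
    valid[next_free++] = 1;        /* a real FTL would pick a page in
                                      an open, already-erased block */
}

static uint32_t ftl_read(uint32_t lpn)
{
    return l2p[lpn];               /* physical page to read, or INVALID */
}
```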
Garbage Collection (GC)
GC reclaims space to prepare new empty blocks
• NAND's "erase-before-update" requirement → valid page copying followed by an erase operation
• Has a large impact on SSD lifetime and performance
[Diagram: Blocks A and B each hold a mix of valid and invalid pages; GC copies the four valid pages into a free block, then erases Blocks A and B]
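The cost of a GC pass is the number of valid pages that must be copied out of the victim block before it can be erased. A minimal greedy victim-selection sketch, assuming a simple per-block valid bitmap — the structures and policy are illustrative, not the drive's actual algorithm:

```c
#include <stdint.h>

#define PAGES_PER_BLOCK 4
#define NUM_BLOCKS      64

struct block {
    uint8_t valid[PAGES_PER_BLOCK];   /* 1 = page still holds live data */
    uint8_t erased;                   /* 1 = block is free              */
};

static struct block blocks[NUM_BLOCKS];

static int count_valid(const struct block *b)
{
    int n = 0;
    for (int p = 0; p < PAGES_PER_BLOCK; p++)
        n += b->valid[p];
    return n;
}

/* Reclaim cost = valid pages that must be copied before the erase.
 * Mixing hot and cold data in one block keeps this count high;
 * grouping data by lifetime drives it toward zero. */
static int gc_pick_victim(void)
{
    int victim = -1, best = PAGES_PER_BLOCK + 1;
    for (int b = 0; b < NUM_BLOCKS; b++) {
        int v = count_valid(&blocks[b]);
        if (!blocks[b].erased && v < best) {
            best = v;
            victim = b;
        }
    }
    return victim;   /* caller copies `best` valid pages, then erases */
}
```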
GC is Expensive!
Performance of SSD gradually decreases as time goes on
• Example: Cassandra update throughput
[Chart: normalized Cassandra update throughput (ops/sec) over 40 minutes — throughput falls steadily]
GC highly affects the SSD performance!
[Chart: update throughput overlaid with GC overhead (valid pages copied per second) over the same 40 minutes — throughput drops track rises in GC overhead]
Our Idea: Multi-streamed SSD
New interface for SSD
• Co-exists with the existing block layer
• General and concrete interface
Host-provided stream information guides desirable data placement within the SSD!
[Diagram: the host stack (Application → OS → File System → Generic Block Layer) reaches the SSD's FTL and NAND flash memory through a new multi-streaming interface alongside the block interface]
End Result
The multi-streamed SSD can sustain Cassandra update throughput
[Chart: normalized update throughput (ops/sec) over 40 minutes — the proposed multi-streamed SSD holds steady while the traditional SSD degrades]
Contents
Background
• Write optimization in SSD
The Multi-streamed SSD
• Our approach
• Case study
Evaluation
• Experimental setup
• Results
Conclusion
Effects of Write Patterns
Previous write patterns (= current state) matter
[Diagram: Block 0 holds LBAs 7, 0, 1, 4 and Block 1 holds LBAs 2, 3, 6, 5. Sequential LBA updates (0–3) into Block 2 leave valid pages scattered across Blocks 0 and 1 → need valid page copying from Block 0 & Block 1. Random LBA updates (0, 1, 4, 7) into Block 2 invalidate all of Block 0 → just erase Block 0]
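A tiny worked version of the example above — the block layouts come from the slide; everything else is illustrative:

```c
#include <stdio.h>

#define PAGES 4

/* How many of a block's pages stay valid after the given updates? */
static int still_valid(const int block[PAGES], const int updated[PAGES])
{
    int n = 0;
    for (int i = 0; i < PAGES; i++) {
        int hit = 0;
        for (int j = 0; j < PAGES; j++)
            if (block[i] == updated[j])
                hit = 1;                 /* page invalidated by rewrite */
        n += !hit;
    }
    return n;
}

int main(void)
{
    const int block0[PAGES] = {7, 0, 1, 4};
    const int block1[PAGES] = {2, 3, 6, 5};
    const int seq[PAGES]    = {0, 1, 2, 3};  /* sequential LBA updates */
    const int rnd[PAGES]    = {0, 1, 4, 7};  /* "random" LBA updates   */

    /* GC reclaims the block with the fewest remaining valid pages.   */
    /* Sequential: both blocks keep 2 valid pages -> copy 2, then erase. */
    printf("seq: block0=%d block1=%d valid\n",
           still_valid(block0, seq), still_valid(block1, seq));
    /* Random: Block 0 is fully invalid -> erase it with 0 copies.    */
    printf("rnd: block0=%d block1=%d valid\n",
           still_valid(block0, rnd), still_valid(block1, rnd));
    return 0;
}
```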
Stream
What is the lifetime of the data?
[Diagram: writes tagged "stream 1", "stream 2", or "stream 3" — each carrying data of a different lifetime (Lifetime 1, 2, 3) — pass through the generic block layer, and the SSD fills a separate group of NAND blocks per stream]
The Multi-streamed SSD
Multi-streamed SSD
• Mapping data with different lifetimes to different streams
[Diagram: over the multi-stream interface, the application tags writes with a stream ID (1, 2, or 3) to provide information about data lifetime; the FTL places data with similar lifetimes into the same erase unit — e.g., Data1/Data3 in one block, Data2/Data7/Data9 in another, Data10/Data12/Data13 in a third]
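Inside the SSD, multi-streaming can be as simple as keeping one open block per stream, so that data written together under one stream ID is erased together. A sketch under that assumption — the names, constants, and free-block stub are illustrative, not the prototype's internals:

```c
#include <stdint.h>

#define MAX_STREAMS     4
#define PAGES_PER_BLOCK 128

struct open_block {
    uint32_t block_id;    /* currently open (erased) NAND block */
    uint32_t next_page;   /* next free page inside that block   */
};

static struct open_block open_blocks[MAX_STREAMS];
static uint32_t next_block = 1;

/* Stub for the free-block pool a real FTL would maintain. */
static uint32_t alloc_erased_block(void) { return next_block++; }

/* Return the physical page to program for a write tagged with
 * `stream_id`; writes on different streams never share an erase unit. */
static uint64_t alloc_page(uint8_t stream_id)
{
    struct open_block *ob = &open_blocks[stream_id % MAX_STREAMS];
    /* Open a fresh erased block if this stream has none or it is full. */
    if (ob->block_id == 0 || ob->next_page == PAGES_PER_BLOCK) {
        ob->block_id  = alloc_erased_block();
        ob->next_page = 0;
    }
    return (uint64_t)ob->block_id * PAGES_PER_BLOCK + ob->next_page++;
}
```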
Working Example
Multi-streamed SSD
• Higher GC efficiency (reduced GC overheads) → effects on performance!
[Diagram: the same request sequence written without streams mixes short- and long-lived data in every block, forcing valid page copies at GC time; with multi-stream, data of each lifetime fills its own block, so whole blocks become invalid and can simply be erased]
Reduces valid pages to copy
For effective multi-streaming, proper mapping of data to streams is essential!
Case Study: Cassandra
Cassandra employs a size-tiered compaction strategy
[Diagram: a write request goes to the Commit Log and the in-memory Memtable; Memtables are flushed to on-disk SSTables (SSTables 1–4, with overlapping keys K1–K3), and compaction merges same-tier SSTables into larger ones (e.g., SSTables 1–4 → SSTable 5; SSTables 5–7 → SSTable 21)]
Summary of Cassandra's Write Patterns
Write operations when Cassandra runs:
• System data write (metadata, journal, …)
• Commit-log write
• Compaction data write
• Flushing data write
[Diagram: the Cassandra write path from the previous slide, annotated with these four write types]
Mapping #1: "Conventional"
Just one stream ID (= conventional SSD)
• System data write → stream 0
• Commit-log write → stream 0
• Compaction data write → stream 0
• Flushing data write → stream 0
Mapping #2: "Multi-App"
Add a new stream to separate application writes (stream ID 1) from system traffic (stream ID 0)
• System data write → stream 0
• Commit-log write → stream 1
• Compaction data write → stream 1
• Flushing data write → stream 1
Mapping #3: "Multi-Log"
Use three streams; further separate the Commit Log
• System data write → stream 0
• Commit-log write → stream 1
• Compaction data write → stream 2
• Flushing data write → stream 2
Mapping #4: "Multi-Data"
Give distinct streams to different tiers of SSTables
• System data write → stream 0
• Commit-log write → stream 1
• Compaction data write → streams 2 and 4 (split by SSTable tier)
• Flushing data write → stream 3
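The four mappings, summarized as stream-ID tables. The IDs are taken from the slides; splitting compaction writes across streams 2 and 4 by SSTable tier in Multi-Data is our reading of the slide, and the enum and table names are illustrative:

```c
/* stream_of[mapping][write class]; write classes: system data,
 * commit log, flushed SSTables, low-tier and high-tier compaction. */
enum { SYS, LOG, FLUSH, CMP_LO, CMP_HI };

static const int stream_of[4][5] = {
    /*                  SYS LOG FLUSH CMP_LO CMP_HI */
    /* Conventional */ {  0,  0,  0,    0,     0 },
    /* Multi-App    */ {  0,  1,  1,    1,     1 },
    /* Multi-Log    */ {  0,  1,  2,    2,     2 },
    /* Multi-Data   */ {  0,  1,  3,    2,     4 },
};
```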
Experimental Setup
Multi-stream SSD prototype
• Samsung 840 Pro SSD
– 60 GB device capacity
YCSB benchmark on Cassandra
• Write-intensive workload
– 1 KB data × 1,000,000 records
– 100,000,000 operations
Host
• Intel i7-3770 3.4 GHz processor
• 2 GB memory
– Accelerates SSD aging by increasing Cassandra's flush frequency
Linux kernel 3.13 (modified)
• Passes the stream ID through the fadvise() system call
• Stores it in the VFS inode
[Diagram: the application issues fadvise(fd, Stream ID); the VFS records the stream ID in the inode; EXT4 carries it in the buffer head down to the SSD]
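On the host side, the modified kernel reuses fadvise() to carry the stream ID down to the inode. The exact ABI is not shown in the talk, so the sketch below is a guess at its shape: POSIX_FADV_STREAMID is a hypothetical advice value (stock posix_fadvise() has no such advice), and packing the stream ID into the offset argument is an assumption.

```c
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <stdio.h>

#define POSIX_FADV_STREAMID 8   /* hypothetical advice value, not in POSIX */

static int set_stream_id(int fd, int stream_id)
{
    /* Hypothetically reuse the offset argument to carry the stream ID;
     * the kernel would stash it in the file's inode so that writeback
     * tags every block-layer request for this file with the stream. */
    return posix_fadvise(fd, stream_id, 0, POSIX_FADV_STREAMID);
}

int main(void)
{
    int fd = open("CommitLog.db", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (set_stream_id(fd, 1))            /* commit-log writes -> stream 1 */
        fprintf(stderr, "stream tagging unsupported on this kernel\n");
    /* ... normal writes follow; the SSD groups them by stream ... */
    return 0;
}
```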
Results
Cassandra's normalized update throughput
• Conventional "TRIM off"
[Chart: normalized update throughput (ops/sec) over 40 minutes for Conventional (TRIM off) — throughput steadily degrades]
Results
Cassandra's normalized update throughput
• Conventional "TRIM on"
TRIM gives a non-trivial improvement, but still far from ideal
[Chart: normalized update throughput over 40 minutes — Conventional (TRIM on) vs. Conventional (TRIM off)]
Results
Cassandra's normalized update throughput
• "Multi-App" (System data vs. Cassandra data)
[Chart: normalized update throughput over 40 minutes — Multi-App vs. Conventional (TRIM off/on)]
Results
Cassandra's normalized update throughput
• "Multi-Log" (System data vs. Commit-Log vs. Flushed data)
[Chart: normalized update throughput over 40 minutes — Multi-Log vs. Multi-App and Conventional (TRIM off/on)]
Results
Cassandra's normalized update throughput
• "Multi-Data" (System data vs. Commit-Log vs. Flushed data vs. Compaction data)
[Chart: normalized update throughput over 40 minutes — Multi-Data vs. Multi-Log, Multi-App, and Conventional (TRIM off/on)]
Result #2
Cassandra's GC overheads
• Conventional "TRIM off"
[Chart: valid pages copied (ops/sec) over 40 minutes for Conventional (TRIM off)]
Results
Cassandra's GC overheads
• Conventional "TRIM on"
[Chart: valid pages copied over 40 minutes — Conventional (TRIM on) vs. Conventional (TRIM off)]
Results
Cassandra's GC overheads
• "Multi-App" (System data vs. Cassandra data)
[Chart: valid pages copied over 40 minutes — Multi-App vs. Conventional (TRIM off/on)]
Results
Cassandra's GC overheads
• "Multi-Log" (System data vs. Commit-Log vs. Flushed data)
[Chart: valid pages copied over 40 minutes — Multi-Log vs. Multi-App and Conventional (TRIM off/on)]
Results
Cassandra's GC overheads
• "Multi-Data" (System data vs. Commit-Log vs. Flushed data vs. Compaction data)
The throughput is very well correlated with GC overheads
[Chart: valid pages copied over 40 minutes — Multi-Data vs. Multi-Log, Multi-App, and Conventional (TRIM off/on)]
Result #3
Cassandra's cumulative latency distribution
• Multi-streaming improves write latency
• At the 99.9th percentile, Multi-Data lowers latency by 54% compared to Conventional
[Chart: cumulative distribution (%) from 99 to 100 vs. latency (µs, 0–175) for Conventional (TRIM on), Multi-App, Multi-Log, and Multi-Data]
Conclusion
Multi-streamed SSD
• Maps application and system data with different lifetimes to SSD streams
– Higher GC efficiency, lower latency
• Multi-streaming can be supported on a state-of-the-art SSD and co-exist with the traditional block interface
• The multi-stream interface could become a standard way to use SSDs more efficiently
[Diagram: host with a multi-stream enhanced block layer above the SSD's FTL and NAND flash memory]