Fully Hardware Automated Open Research Framework for Future Fast NVMe Device

Myoungsoo Jung
Computer Architecture and Memory Systems Laboratory (CAMELab)
USENIX ATC 2020
Emerging Non-Volatile Memory for SSDs

Memory type   Read latency
TLC           450 us
MLC           150 us
SLC           25 us
New flash     3 us
PRAM          120 ns
MRAM          50~80 ns
DRAM          60~80 ns

Flash technologies: TLC, MLC, SLC, new flash. Storage-class memory (SCM): PRAM, MRAM.
NVMe Internals and Interfaces

[Figure: an NVMe SSD — embedded CPU, controller (CTRL), and flash packages]
NVMe Storage Stack

[Figure: host software stack — Applications (processes), VFS/FS, page cache, block layer, block device driver — in front of an NVMe SSD (CPU, CTRL, flash) delivering 1~3 GB/s]
NVMe Storage Stack Redesign

[Figure: the same stack, with research efforts redesigning layers from kernel to firmware]

• FlashShare: Punching Through Server Storage Stack from Kernel to Firmware for Ultra-Low Latency SSDs (OSDI ’18)
• De-indirection for Flash-Based SSDs with Nameless Writes (FAST ’12)
• Towards SLO Complying SSDs Through OPS Isolation (FAST ’15)
• The CASE of FEMU: Cheap, Accurate, Scalable and Extensible Flash Emulator (FAST ’18)
• ...and many more!

Challenge #1: most storage research relies on simulation or kernel-level emulation.
SCM-based NVMe Storage Card

[Figure: the same host stack in front of an SCM-based card (CPU, CTRL, SCM chips) delivering 7 GB/s]

Challenge #2: the SSD’s internal CPU can become the performance bottleneck for SCMs.
What Does the SSD’s CPU Do?

[Figure: host address space with an NVMe submission queue (SQ), completion queue (CQ), and data buffers described by PRPs; device registers expose the SQ and CQ doorbells]

❶ I/O submission: the host driver writes a command into the SQ (data located via PRPs)
❷ Ring SQ doorbell: the host writes the new SQ tail to the device register
❸ I/O fetch: the device CPU fetches the SQ entry from host memory
❹ Data transfer: the device moves the data described by the PRPs between host memory and the device
❺ I/O process: the device CPU services the request
❻ I/O completion: the device writes a CQ entry to host memory
❼ Interrupt (notification): the device raises an interrupt to the host
❽ Process completion: the host driver consumes the CQ entry
❾ Ring CQ doorbell: the host writes the new CQ head to the device register

All these NVMe activities place a burden on the storage device’s CPU!
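The host-side half of this flow (steps ❶, ❷, ❽, ❾) reduces to a small amount of driver code. Below is a minimal, hedged sketch in C of an NVMe queue pair with doorbell writes and phase-tag completion polling; the struct and function names are illustrative rather than a real driver API, and the doorbell pointers are assumed to come from a BAR0 mapping.

```c
#include <stdint.h>

struct nvme_sqe { uint32_t dw[16]; };            /* 64 B submission entry  */

struct nvme_cqe {                                /* 16 B completion entry  */
    uint32_t dw0, dw1;
    uint16_t sq_head, sq_id;
    uint16_t cid;
    uint16_t status;                             /* bit 0 = phase tag      */
};

struct queue_pair {
    struct nvme_sqe *sq;                         /* rings in host memory   */
    volatile struct nvme_cqe *cq;
    uint16_t depth, sq_tail, cq_head;
    uint16_t phase;                              /* expected phase, init 1 */
    volatile uint32_t *sq_doorbell, *cq_doorbell;  /* mapped BAR0 regs     */
};

/* Steps 1-2: place the command in the SQ, then ring the SQ tail doorbell. */
static void nvme_submit(struct queue_pair *qp, const struct nvme_sqe *cmd)
{
    qp->sq[qp->sq_tail] = *cmd;
    qp->sq_tail = (uint16_t)((qp->sq_tail + 1) % qp->depth);
    *qp->sq_doorbell = qp->sq_tail;              /* MMIO write wakes device */
}

/* Steps 8-9: consume the CQ entry the device posted, then ring the CQ
 * head doorbell.  The phase tag flips on each queue wrap, so a slot whose
 * phase matches the expected value holds a fresh completion. */
static struct nvme_cqe nvme_wait(struct queue_pair *qp)
{
    while ((qp->cq[qp->cq_head].status & 1) != qp->phase)
        ;                                        /* spin, or sleep on IRQ   */
    struct nvme_cqe cqe = qp->cq[qp->cq_head];
    if (++qp->cq_head == qp->depth) { qp->cq_head = 0; qp->phase ^= 1; }
    *qp->cq_doorbell = qp->cq_head;              /* step 9: release the slot */
    return cqe;
}
```

Every one of these MMIO writes and host-memory reads is mirrored by work on the device side, which is exactly the burden the slide calls out.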
Multi-core IP for a High-Performance SSD

[Figure: SSD controller SoC — PCIe client logic with inbound/outbound paths, a multi-core CPU (Core0 runs the NVMe driver with SQ/CQ), per-core I-RAM, interconnection networks, a memory controller with SRAM, and a backend channel complex]
Component Latency Decomposition

[Figure: normalized latency breakdown (0.0–1.0) across TLC, MLC, SLC, Z-NAND, PRAM, and MRAM; components: Completion, Translation, PRP, Queue/Doorbells, Fetching, and NVM access]

CPU bursts can be overlapped with I/O bursts.
Firmware control becomes the critical performance bottleneck.
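Reading the breakdown as a sum makes the trend explicit. The term names below follow the chart legend; the decomposition itself is my framing, not an equation from the paper:

```latex
\[
T_{\text{device}} \approx T_{\text{fetch}} + T_{\text{queue/doorbell}}
  + T_{\text{PRP}} + T_{\text{translation}} + T_{\text{NVM}} + T_{\text{completion}}
\]
```

For TLC flash, T_NVM ≈ 450 us dwarfs every firmware-controlled term; for MRAM, T_NVM ≈ 50~80 ns, so the firmware-controlled terms are nearly all that remain.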
Overview

A framework that fully automates NVMe control logic in hardware, making it possible to build customizable devices.
• OpenExpress is an NVMe host accelerator IP designed for integration into an easy-to-access FPGA design

OpenExpress requires no software intervention to process concurrent NVMe read and write requests.
• It supports scalable data submission, many outstanding NVMe commands, and submission/completion queue management

We prototype OpenExpress on a commercially available Xilinx FPGA board and optimize all logic modules to operate at a high frequency.
Full Hardware Automation for the Critical I/O Path

[Figure: the baseline controller (PCIe client logic, a CPU core running the NVMe driver with SQ/CQ, I-RAMs, interconnection networks, memory controller/SRAM, backend channel complex), then the same design with the CPU-driven NVMe path replaced by hardware automation (OpenExpress) attached to the SoC memory bus and datapath]
Queue Dispatching (OpenExpress)

[Figure: the SQ tail doorbell region (SQ 0 DB, SQ 1 DB, ...) feeds the SQ Entry Fetch Manager (FET) over address, data, and write-response channels]
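The core arithmetic FET performs on every doorbell write is a wraparound distance between the head it has already consumed and the tail the host just wrote. A minimal sketch follows (an illustrative C helper; the real logic is an RTL state machine):

```c
#include <stdint.h>

/* Number of submission-queue entries FET must fetch after the host
 * writes a new tail into the SQ tail doorbell region.  The ring wraps,
 * so the distance is computed modulo the queue depth. */
static inline uint16_t sq_pending(uint16_t head, uint16_t tail,
                                  uint16_t depth)
{
    return (uint16_t)((tail + depth - head) % depth);
}
/* e.g., sq_pending(14, 2, 16) == 4: entries 14, 15, 0, 1 */
```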
Data Transferring (OpenExpress)

[Figure: FET hands parsed commands to the PRP engine (HTRW), which drives the backend DMA (DMA) between the PCIe EP complex and four DIMMs]
Completion Handling (OpenExpress)

[Figure: the completion handler (CMT) posts CQ entries and tracks the CQ head doorbell region (CQ 0 DB, CQ 1 DB, ...)]
NVMe Context Management (OpenExpress)

[Figure: the NVMe context box (CTX) keeps per-queue state shared by FET, HTRW, DMA, and CMT]
Operation Flow of OpenExpress

[Figure: end-to-end path across the host (Submission Queue 1, Completion Queue 1, and data/PRPs in DRAM), the PCIe EP complex, FET, HTRW, DMA, CMT, and the DIMMs]

1. The host issues a command into Submission Queue 1
2. The host writes the SQ 1 tail pointer (doorbell)
3. FET monitors the doorbell (DB) region and signals an event
4. FET fetches the 16-DW SQ entry
5. FET sends the parsed command downstream
6. HTRW processes the PRPs while the DMA copies the data associated with each PRP
7. CMT posts a CQ entry
8. CMT creates an MSI and raises an interrupt
9. The host writes the CQ 1 head pointer (doorbell)
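To make the nine steps concrete, here is a self-contained toy model of the flow in C. It is sequential for readability; in OpenExpress the FET, HTRW, DMA, and CMT stages are independent RTL modules that overlap in a pipeline, and every name below is illustrative.

```c
#include <stdint.h>
#include <stdio.h>

#define QDEPTH 8

struct sqe { uint16_t cid; uint8_t opcode; };     /* trimmed 64 B entry   */
struct cqe { uint16_t cid; uint16_t sq_head; uint8_t phase; };

static struct sqe sq[QDEPTH];                     /* host-memory rings    */
static struct cqe cq[QDEPTH];
static uint16_t sq_head, sq_tail_db;              /* device head, DB reg  */
static uint16_t cq_tail;
static uint8_t  cq_phase = 1;

/* Host: steps 1-2 (issue a command, write the SQ tail doorbell). */
static void host_submit(uint16_t cid, uint8_t opcode)
{
    sq[sq_tail_db] = (struct sqe){ cid, opcode };
    sq_tail_db = (uint16_t)((sq_tail_db + 1) % QDEPTH);
}

/* Device: steps 3-8 (FET fetch/parse, HTRW+DMA, CMT post + MSI). */
static void device_service(void)
{
    while (sq_head != sq_tail_db) {               /* 3: doorbell event    */
        struct sqe e = sq[sq_head];               /* 4: fetch SQ entry    */
        printf("fetched cid=%u opcode=%u\n",      /* 5: parse the command */
               (unsigned)e.cid, (unsigned)e.opcode);
        /* 6: PRP walk and data copy between host DRAM and DIMMs go here */
        sq_head = (uint16_t)((sq_head + 1) % QDEPTH);
        cq[cq_tail] = (struct cqe){ e.cid, sq_head, cq_phase };  /* 7     */
        if (++cq_tail == QDEPTH) { cq_tail = 0; cq_phase ^= 1; }
        /* 8: an MSI would be raised here */
    }
}

int main(void)
{
    host_submit(1, 0x02);                         /* read command         */
    host_submit(2, 0x01);                         /* write command        */
    device_service();                             /* host would then      */
    return 0;                                     /* consume CQ, step 9   */
}
```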
Major IP Cores

[Figure: the major IP cores — FET, HTRW, CMT, and CTX — in the OpenExpress frontend]
Synthesis and Implementation

                Descriptions
FPGA            Xilinx Virtex UltraScale 190
Memory          Four 64 GB DDR4 x72 DIMMs (256 GB)
Host interface  PCIe Gen3
CPU             i5-9400, 2.9 GHz
Design suite    Vivado 2017.3.1
OS              Ubuntu 18.04 LTS
Build time      More than 7 hours
Frequency Tuning (50 MHz → 250 MHz)

Detect long routing delays and place register-slice IPs between hardware modules.
• Register slices increase the achievable frequency while minimizing negative slack
• Gradual trial and error reduces the routing delay

[Figure: register slices inserted between IP cores — memory controller, DDR interconnect, AXI DMA, PCIe EP, CMT, FET, HTRW, CTX]
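Why register slices raise the clock rate is standard static-timing arithmetic (my framing, not an equation from the paper): the clock period must cover the worst register-to-register path, and on a large FPGA the routing term dominates.

```latex
\[
f_{max} = \frac{1}{t_{clk \to q} + t_{logic} + t_{route} + t_{setup}}
\]
```

A register slice cuts one long route into two shorter register-to-register hops, shrinking the worst-case t_route per cycle at the cost of one cycle of added latency.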
Frequency Tuning (50 MHz → 250 MHz), continued

Group hardware modules during floorplanning.
• After synthesis, create an FPGA pblock (a region of grid cells), allocate hardware IPs to the pblock, and run place-and-route (PNR)
• Check the implementation report and iterate with gradual trial and error

[Figure: route distances reduced by grouping IP logic into pblocks]
Performance Improvement from Frequency Tuning

The performance improvement is as high as 20% for reads and 60% for writes.

Reads benefit more due to the nature of data movement in PCIe packets (explained shortly).

Large block sizes benefit more because PRP data processing is automated.

[Figure: read and write bandwidth before and after frequency tuning]
Performance Improvement from Frequency Tuning, continued

[Figure: test setup — the host (PCIe BAR with control register set, I/O SQ/CQ regions, I/O commands, and a PRP list with pointers in DRAM) connected over PCIe inbound/outbound paths to OpenExpress with DRAM-backed emulation]

PRP parsing/traversing and data transfers are fully automated.
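For reference, the traversal that HTRW automates looks roughly like the C sketch below. The PRP rules follow the NVMe specification; read_host_qword() and dma_page() are assumed helpers standing in for PCIe reads and the backend DMA.

```c
#include <stdint.h>

#define PAGE_SIZE   4096u
#define PRPS_PER_PG (PAGE_SIZE / 8)     /* 512 8-byte entries per list page */

extern uint64_t read_host_qword(uint64_t host_addr);   /* assumed helper */
extern void     dma_page(uint64_t host_page_addr);     /* assumed helper */

/* For an N-page transfer: PRP1 points at the first data page (possibly
 * with an offset), PRP2 is either the second data page (N <= 2) or the
 * address of a PRP list; the last slot of a full list page chains to the
 * next list page when more entries are needed. */
static void walk_prps(uint64_t prp1, uint64_t prp2, uint32_t npages)
{
    dma_page(prp1);                              /* first data page       */
    if (npages <= 1) return;
    if (npages == 2) { dma_page(prp2); return; } /* PRP2 is a data page   */

    uint64_t list = prp2;                        /* PRP2 is a list page   */
    for (uint32_t i = 1, slot = 0; i < npages; i++, slot++) {
        if (slot == PRPS_PER_PG - 1 && i < npages - 1) {
            list = read_host_qword(list + 8 * slot);   /* chain pointer   */
            slot = 0;
        }
        dma_page(read_host_qword(list + 8 * slot));    /* next data page  */
    }
}
```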
Evaluation

For all executions, a single I/O worker cannot extract the full bandwidth (CPU bottleneck).
• We execute 10 threads for both benchmark execution and real workloads

               Descriptions
CPU            8-core 3.3 GHz Intel Skylake-X server microarchitecture
DRAM           32 GB DDR4-2666
Benchmark      FIO
Real workload  CAMEL’s Open Storage Trace (SNIA), https://trace.camelab.org/
I/O workers    10 threads
Block device   Optane SSD P4800X (baseline)

All evaluations in the paper use a queue depth of 8.
• The queue depth exhibiting the best bandwidth

Note that we do not claim OpenExpress is faster than other fast NVMe devices.
• Instead, the evaluation shows that an FPGA-based design and implementation of NVMe IP cores can offer good enough performance to be a viable candidate for use in storage research
Performance with Microbenchmarks

[Figure: sequential and random bandwidth and latency]
Performance with Microbenchmarks

4 KB requests:
• There is little bandwidth/latency difference between reads and writes (3 GB/s vs. 2.8 GB/s)
• OpenExpress latency is 72~77 us at queue depth 8 (the P4800X offers 120~150 us)

Bandwidth peaks with 16 KB requests:
• 4.7 GB/s for random writes (258 us)
• 7 GB/s for random reads (175 us)
• (the P4800X offers 532~600 us)
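These bandwidth and latency numbers are mutually consistent under Little's law, assuming the full 10 workers at queue depth 8 stay outstanding (the 16 KB request size and latencies are from the slide; the calculation is mine):

```latex
\[
\text{BW} \approx \frac{\text{outstanding bytes}}{\text{latency}}
= \frac{10 \times 8 \times 16\,\text{KB}}{175\,\mu\text{s}} \approx 7.3\,\text{GB/s}
\ \text{(reads)};\qquad
\frac{10 \times 8 \times 16\,\text{KB}}{258\,\mu\text{s}} \approx 5.0\,\text{GB/s}
\ \text{(writes)}
\]
```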
Why are writes slower than reads?
Performance with Microbenchmarks

[Figure: PCIe TX/RX timelines for writes vs. reads — doorbell, NVMe command, PRP, and data payloads]

For writes, all payloads (doorbell, NVMe command, PRPs, and data) travel in the same PCIe direction and are serialized; for reads, the data and NVMe payloads move in opposite directions and are served in parallel.
Real Workload Evaluation

[Figure: OpenExpress latency (us) and bandwidth across 24HR, FIU, DevDiv, Server, and TPCE]

Real-workload performance differs from the microbenchmark results because of unaligned request offsets, sector-length variation, etc.
• Most storage cards cannot reach their best performance under real workload execution

Read-intensive workloads (DevDiv and Server):
• OpenExpress offers 4~4.5 GB/s (100~200 us)

Write-intensive workloads (24HR, FIU, and TPCE):
• 1.25~2.1 GB/s (all under 100 us)
Real Workload Evaluation

[Figure: latency (us) and bandwidth of the Optane SSD vs. OpenExpress across 24HR, FIU, DevDiv, Server, and TPCE]

Bandwidth:
• OpenExpress shows 76.3% higher bandwidth than the Optane SSD, on average

Latency:
• It exhibits 68.6% shorter latency than the Optane SSD

DevDiv:
• 111.5% better performance compared to the Optane SSD (2.1 GB/s)

FPGAs are still much slower, but the FPGA design and implementation of the NVMe IP cores are not on the critical path, so OpenExpress can serve as a research vehicle for system-level studies.
Conclusion: Related Work and Download

Per-month unit prices for 3rd-party NVMe IP cores:

  Ip-m****    $35K
  Inte******  $45K
  Ep*****     $40K

The price of a single-use source-code license is around $100K.

For academic/non-commercial purposes, OpenExpress is freely downloadable:
• Hardware automation IP cores (HTRW, FET, CMT, CTX, ...)
• Firmware for MicroBlaze (administrator command management, device initialization, etc.)

Download:
• https://openexpress.camelab.org
• https://trace.camelab.org/