
  • Computer Architecture and Memory systems Laboratory

    CAMELab

    Myoungsoo Jung

    ATC 2020

    Sponsored by

    Fully Hardware Automated Open Research Framework for Future Fast NVMe Device

  • CAMELab

    Emerging Non-Volatile Memory for SSDs

    [Chart: read latency per memory type]

    Memory type    Read latency
    TLC            450 us
    MLC            150 us
    SLC            25 us
    New Flash      3 us
    PRAM           120 ns
    MRAM           50~80 ns
    DRAM           60~80 ns

    TLC, MLC, SLC, and new flash are flash technologies; the faster end of the spectrum (e.g., new flash, PRAM, MRAM) is referred to as storage class memory (SCM).

  • CAMELab

    NVMe Internals and Interfaces

    [Figure: NVMe SSD internals — a controller (CTRL) with an embedded CPU connected to multiple Flash chips]

  • CAMELab

    NVMe Storage Stack

    [Figure: host storage stack on top of an NVMe SSD (CTRL + Flash) delivering 1~3 GB/sec]

    Applications (Processes) — VFS/FS — Block layer — Page cache — Block device driver — NVMe SSD (CTRL + Flash)

  • CAMELab

    NVMe Storage Stack Redesign

    [Figure: the same host stack and NVMe SSD (1~3 GB/sec), targeted by prior redesign work]

    • FlashShare: Punching Through Server Storage Stack from Kernel to Firmware for Ultra-Low Latency SSDs (OSDI’18)
    • De-indirection for Flash-Based SSDs with Nameless Writes (FAST’12)
    • Towards SLO Complying SSDs Through OPS Isolation (FAST’15)
    • The Case of FEMU: Cheap, Accurate, Scalable and Extensible Flash Emulator (FAST’18)
    • There are more and more!

    Challenge #1: Most storage research relies on simulation or kernel-level emulation.

  • CAMELab

    SCM-based NVMe Storage Card

    [Figure: host storage stack on top of an SCM-based NVMe card (CPU + CTRL + SCM modules) delivering 7 GB/sec]

    Challenge #2: The SSD’s CPU can be a performance bottleneck for SCMs.

  • CAMELab

    What Does SSD’s CPU Do?

    [Figure: host stack (Applications, VFS/FS, Block layer, Page cache, Block device driver) with a
    submission queue (SQ), completion queue (CQ), and data (PRP) buffers in host memory, SQ/CQ doorbell
    device registers, and the SCM-based NVMe device (CPU + CTRL + SCM)]

    ❶ I/O submission
    ❷ Ring SQ doorbell
    ❸ I/O fetch
    ❹ Data transfer
    ❺ I/O process
    ❻ I/O completion
    ❼ Interrupt (notification)
    ❽ Process completion
    ❾ Ring CQ doorbell

    All these NVMe activities place a burden on the storage device! A minimal host-side sketch of steps
    ❶–❷ and ❽–❾ follows.
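    To make the handshake concrete, here is a minimal C sketch of the host-side half of this flow (steps ❶–❷
    and ❽–❾); the device-side steps (❸–❼) are exactly what the SSD’s CPU, and later OpenExpress, must handle.
    This is an illustrative sketch under simplifying assumptions: the nvme_cmd/nvme_cqe layouts are trimmed
    stand-ins (a real NVMe SQ entry is 64 bytes and a CQ entry 16 bytes), and the doorbells are assumed to be
    memory-mapped BAR registers.

```c
/* Simplified host-side view of one NVMe command lifecycle (steps 1-2 and 8-9).
 * Illustrative only: structure layouts and the MMIO mapping are assumptions. */
#include <stdint.h>
#include <string.h>

#define QUEUE_DEPTH 8

struct nvme_cmd { uint16_t cid; uint8_t opcode; uint64_t prp1, prp2; uint64_t slba; uint16_t nlb; };
struct nvme_cqe { uint16_t cid; uint16_t status; uint16_t sq_head; uint8_t phase; };

struct queue_pair {
    struct nvme_cmd sq[QUEUE_DEPTH];      /* submission queue in host memory */
    struct nvme_cqe cq[QUEUE_DEPTH];      /* completion queue in host memory */
    volatile uint32_t *sq_tail_doorbell;  /* MMIO register on the device (BAR) */
    volatile uint32_t *cq_head_doorbell;  /* MMIO register on the device (BAR) */
    uint16_t sq_tail, cq_head;
    uint8_t  expected_phase;              /* must start at 1: the device writes phase=1 on the first pass */
};

/* Steps 1+2: place a read command in the SQ and ring the SQ tail doorbell. */
static void submit_read(struct queue_pair *qp, uint16_t cid, uint64_t prp, uint64_t slba, uint16_t nlb)
{
    struct nvme_cmd *cmd = &qp->sq[qp->sq_tail];
    memset(cmd, 0, sizeof(*cmd));
    cmd->cid = cid; cmd->opcode = 0x02 /* NVM read */;
    cmd->prp1 = prp; cmd->slba = slba; cmd->nlb = nlb;
    qp->sq_tail = (qp->sq_tail + 1) % QUEUE_DEPTH;
    *qp->sq_tail_doorbell = qp->sq_tail;  /* the device now fetches the entry (step 3) */
}

/* Steps 8+9: poll the CQ phase bit, consume the entry, ring the CQ head doorbell. */
static int poll_completion(struct queue_pair *qp, uint16_t *cid_out)
{
    struct nvme_cqe *cqe = &qp->cq[qp->cq_head];
    if (cqe->phase != qp->expected_phase)
        return 0;                         /* nothing completed yet */
    *cid_out = cqe->cid;
    qp->cq_head = (qp->cq_head + 1) % QUEUE_DEPTH;
    if (qp->cq_head == 0)
        qp->expected_phase ^= 1;          /* phase flips on every CQ wrap */
    *qp->cq_head_doorbell = qp->cq_head;
    return 1;
}
```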

  • CAMELab

    Multi-core IP for High-Performance SSD

    [Figure: host (NVMe driver, SQ/CQ on Core0) connected over PCIe to the SSD controller: PCIe client logic,
    inbound/outbound paths, multiple embedded CPUs with I-RAM, interconnection networks, a memory
    controller/SRAM, and a backend channel complex]

  • CAMELab

    Component Latency Decomposition

    [Figure: normalized latency breakdown (0.0~1.0) per memory type (TLC, MLC, SLC, Z-NAND, PRAM, MRAM),
    decomposed into Completion, Translation, PRP, Queue/Doorbells, Fetching, and NVM components]

    CPU bursts can be overlapped with I/O bursts

    Firmware control becomes the critical performance bottleneck

  • CAMELab

    Overview

    A framework that fully automates NVMe control logic in hardware, making it possible to build customizable devices
    • OpenExpress is an NVMe host accelerator IP meant for integration with an easy-to-access FPGA design

    OpenExpress requires no software intervention to process concurrent NVMe read and write requests
    • It supports scalable data submission, many outstanding NVMe commands, and submission/completion queue management

    We prototype OpenExpress on a commercially available Xilinx FPGA board and optimize all the logic modules to operate at a high frequency

  • CAMELab

    Full Hardware Automation for Critical I/O Path

    [Figure: the multi-core SSD controller from before, with the embedded CPU cores on the critical I/O
    datapath replaced by hardware automation (OpenExpress) attached to the SoC memory bus; the host-side
    NVMe driver and SQ/CQ remain unchanged over PCIe]

  • CAMELab

    Queue Dispatching

    [Figure: OpenExpress exposes an SQ tail doorbell region (SQ 0 DB, SQ 1 DB, ...) over address, data, and
    write-response channels; the SQ Entry Fetch Manager (FET) watches this region — a sketch of the doorbell
    bookkeeping follows]
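    As a rough model of what FET does when a tail doorbell is written, the sketch below keeps a shadow head
    and tail per SQ and computes how many new 16-DW entries must be fetched; the function name and the exact
    bookkeeping are assumptions for illustration, not the module's actual interface.

```c
/* Illustrative model of the SQ Entry Fetch Manager (FET): when the host writes a
 * new tail value into the SQ tail doorbell region, compute how many submission
 * queue entries became visible and need to be fetched over PCIe. */
#include <stdint.h>

#define NUM_SQ   2
#define SQ_SIZE  64           /* entries per submission queue (assumed) */

struct sq_state {
    uint16_t shadow_head;     /* entries already fetched by FET */
    uint16_t tail;            /* last tail value the host rang in */
};

static struct sq_state sqs[NUM_SQ];

/* Called when a write to the SQ tail doorbell region targets queue `qid`. */
static uint16_t fet_on_doorbell_write(uint16_t qid, uint16_t new_tail)
{
    struct sq_state *sq = &sqs[qid];
    sq->tail = new_tail;
    /* Number of new entries, accounting for wrap-around of the circular queue. */
    uint16_t pending = (uint16_t)((sq->tail + SQ_SIZE - sq->shadow_head) % SQ_SIZE);
    /* FET would now issue `pending` 16-DW (64-byte) reads to the host SQ and
     * advance shadow_head as each entry arrives; here we just report the count. */
    sq->shadow_head = sq->tail;
    return pending;
}
```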

  • CAMELab

    Data Transferring

    [Figure: after FET fetches an SQ entry, the PRP Engine (HTRW) walks the command's PRP entries and the
    Backend DMA (DMA) moves data between the PCIe EP complex and the memory backend (DIMM0~DIMM3); a PRP-walk
    sketch follows]
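    For context on what the PRP Engine automates, here is a simplified walk of an NVMe PRP1/PRP2 pair into a
    list of 4 KB host page addresses. The helper name and the flat prp_list parameter are simplifications
    (a real PRP list lives in host memory and may chain across pages); this is a sketch of the rule, not the
    HTRW implementation.

```c
/* Simplified PRP (Physical Region Page) traversal: turn a command's PRP1/PRP2
 * fields into the list of host page addresses to DMA. Assumes 4 KB pages and a
 * PRP list that fits in one page. */
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096u

static size_t prp_to_pages(uint64_t prp1, uint64_t prp2, const uint64_t *prp_list,
                           size_t xfer_len, uint64_t *pages, size_t max_pages)
{
    size_t n = 0;
    size_t first = PAGE_SIZE - (size_t)(prp1 % PAGE_SIZE);   /* bytes covered by PRP1 */
    if (first > xfer_len) first = xfer_len;
    pages[n++] = prp1;
    size_t remaining = xfer_len - first;

    if (remaining == 0)
        return n;                      /* whole transfer fits in the first page */
    if (remaining <= PAGE_SIZE) {
        pages[n++] = prp2;             /* PRP2 is a direct second page pointer */
        return n;
    }
    /* Otherwise PRP2 points to a PRP list; each entry is one page address. */
    for (size_t i = 0; remaining > 0 && n < max_pages; i++) {
        pages[n++] = prp_list[i];
        remaining = remaining > PAGE_SIZE ? remaining - PAGE_SIZE : 0;
    }
    return n;
}
```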

  • CAMELab

    Completion Handling

    [Figure: the Completion Handler (CMT) posts CQ entries back to the host and exposes a CQ head doorbell
    region (CQ 0 DB, CQ 1 DB, ...) alongside the SQ tail doorbell region, FET, HTRW, DMA, the PCIe EP
    complex, and the memory backend (DIMM0~DIMM3); a sketch of phase-bit completion posting follows]
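    The sketch below shows the kind of bookkeeping a completion path needs: writing a CQ entry carrying the
    current phase bit into the host CQ, flipping the phase on wrap, and raising an MSI. The function names
    and the raise_msi stub are illustrative assumptions, not the CMT module's real interface.

```c
/* Illustrative completion posting, roughly what a completion handler must do. */
#include <stdint.h>

#define CQ_SIZE 64

struct cq_entry { uint32_t result; uint16_t sq_head; uint16_t sq_id;
                  uint16_t cid; uint16_t status_phase; };

struct cq_state {
    struct cq_entry *host_cq;   /* host-memory CQ, written via PCIe */
    uint16_t tail;
    uint8_t  phase;             /* starts at 1; host polls for this value */
};

static void raise_msi(uint16_t vector) { (void)vector; /* stub: MSI write to host */ }

static void cmt_post(struct cq_state *cq, uint16_t cid, uint16_t sq_id,
                     uint16_t sq_head, uint16_t status)
{
    struct cq_entry *e = &cq->host_cq[cq->tail];
    e->cid = cid; e->sq_id = sq_id; e->sq_head = sq_head;
    /* Phase bit in bit 0 of this halfword, status above it (mirrors the upper
     * half of NVMe CQ entry DW3). */
    e->status_phase = (uint16_t)((status << 1) | cq->phase);
    cq->tail = (cq->tail + 1) % CQ_SIZE;
    if (cq->tail == 0)
        cq->phase ^= 1;         /* phase flips on every CQ wrap */
    raise_msi(0);
}
```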

  • CAMELab

    NVMe Context Management

    [Figure: an NVMe Context Box (CTX) is added alongside FET, HTRW, DMA, and CMT to manage NVMe command
    contexts across the doorbell regions, the PCIe EP complex, and the memory backend (DIMM0~DIMM3); a sketch
    of a per-command context table follows]
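    One plausible shape for such context state is a small table indexed by a device-internal tag, recording
    where each in-flight command came from and what it still needs. The fields and names below are
    assumptions chosen for illustration, not the CTX module's real layout.

```c
/* Illustrative per-command context table: a hardware context manager needs some
 * record of each in-flight command so DMA completions can be matched back to the
 * right SQ/CQ and PRPs. */
#include <stdint.h>

#define MAX_OUTSTANDING 256

enum cmd_state { CTX_FREE, CTX_FETCHED, CTX_DMA_IN_FLIGHT, CTX_DONE };

struct cmd_ctx {
    enum cmd_state state;
    uint16_t cid;          /* host-assigned command identifier */
    uint16_t sq_id;        /* which submission queue it came from */
    uint8_t  opcode;       /* read/write */
    uint64_t prp1, prp2;   /* data pointers to walk */
    uint32_t bytes_left;   /* remaining DMA bytes before a completion can be posted */
};

static struct cmd_ctx ctx_table[MAX_OUTSTANDING];

/* Allocate a free context slot for a newly fetched SQ entry; returns -1 if full. */
static int ctx_alloc(uint16_t cid, uint16_t sq_id, uint8_t opcode,
                     uint64_t prp1, uint64_t prp2, uint32_t len)
{
    for (int tag = 0; tag < MAX_OUTSTANDING; tag++) {
        if (ctx_table[tag].state == CTX_FREE) {
            ctx_table[tag] = (struct cmd_ctx){ CTX_FETCHED, cid, sq_id, opcode,
                                               prp1, prp2, len };
            return tag;    /* the tag travels with the DMA so completions find the ctx */
        }
    }
    return -1;
}
```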

  • CAMELab

    Operation Flow of OpenExpress

    [Figure: host-side Submission Queue 1 and Completion Queue 1 (head/tail, CQ 1 BAR), host DRAM holding the
    DATA and PRP lists, and the OpenExpress pipeline (doorbell regions, FET, HTRW, DMA, CMT, CTX, PCIe EP
    complex, DIMM0~DIMM3)]

    1. The host issues a command into Submission Queue 1.
    2. The host writes the SQ 1 tail pointer (doorbell).
    3. OpenExpress monitors the doorbell region and signals an event.
    4. FET fetches the 16-DW SQ entry.
    5. The parsed command is sent downstream.
    6. The PRP engine processes the PRPs, and the data associated with each PRP is copied.
    7. A CQ entry is posted.
    8. An MSI is created and the host is interrupted.
    9. The host writes the CQ 1 head pointer (doorbell).

  • CAMELab

    Major IP Cores

    [Figure: the frontend comprises the FET, HTRW, CMT, and CTX IP cores]

  • CAMELab

    Synthesis and Implementation

    Item          Description
    FPGA          Xilinx Virtex UltraScale 190
    Memory        Four 64GB DDR4 x72 DIMMs (256GB)
    Interface     PCIe Gen3
    CPU           i5-9400 2.9GHz
    Design Suite  Vivado 2017.3.1
    OS            Ubuntu 18.04 LTS
    Build Time    More than 7 hours

  • CAMELab

    Frequency Tuning (50MHz ~ 250MHz)

    Detect long routing delays and place register slice IPs between hardware modules
    • This can increase the frequency while minimizing negative slack
    • Gradual trial-and-error to reduce the amount of routing delay

    [Figure: putting register slices between the different IP cores — Memory Controller, DDR interconnect,
    AXI, DMA, PCIe EP, CMT, FET, HTRW, CTX]

  • CAMELab

    Frequency Tuning (50MHz ~ 250MHz)

    Group hardware modules during floorplanning
    • After synthesis, create an FPGA pblock (a set of grid cells), allocate hardware IPs to the pblock, and
      run PNR (place and route)
    • Check the implementation report and do gradual trial-and-error again

    [Figure: reducing route distances by grouping IP logic — Memory Controller, DDR interconnect, AXI, DMA,
    PCIe EP, CMT, FET, HTRW, CTX]

  • CAMELab

    Perf. Improvement of Frequency Tuning

    The performance improvement for reads and writes is as high as 20% and 60%, respectively

    Reads exhibit more benefit due to the nature of data movement over PCIe packets (explained shortly)

    Larger block sizes benefit more because of the automated PRP data processing

    [Figure: read and write improvements across block sizes]

  • CAMELab

    Perf. Improvement of Frequency Tuning

    [Figure: host side (PCIe BAR with control register set, IO SQ/CQ regions, IO commands, PRP list and
    pointers in host DRAM) and OpenExpress side (inbound/outbound paths over PCIe, emulated backend DRAM),
    shown for reads and writes]

    PRP parsing/traversing and data transfers are fully automated

  • CAMELab

    Evaluation

    For all executions, a single I/O worker cannot extract the full bandwidth (CPU bottleneck)
    • We execute 10 threads for both the benchmark runs and the real workloads

    Item           Description
    CPU            8-core 3.3GHz Intel Skylake-X server microarchitecture
    DRAM           32GB DDR4 (2666)
    Benchmark      FIO
    Real workload  CAMEL's Open Storage Trace (SNIA), https://trace.camelab.org/
    I/O workers    10 threads
    Block device   Optane SSD P4800X

    All evaluations in the paper are performed with a queue depth of 8
    • The queue depth exhibiting the best bandwidth

    Note that we don't claim that OpenExpress is faster than other fast NVMe devices
    • Instead, the evaluation shows that an FPGA-based design and implementation of NVMe IP cores can offer
      good performance, making it a viable candidate for use in storage research

  • CAMELab

    Performance w/ Microbenchmarks

    [Figures: bandwidth (sequential and random) and latency (sequential and random)]

  • CAMELab

    Performance w/ Microbenchmarks

    4KB-sized requests
    • There is not much bandwidth/latency difference between reads and writes (3 GB/sec vs. 2.8 GB/sec)
    • OpenExpress latency is 72~77 us at queue depth eight (vs. 120~150 us for the P4800X)

    OpenExpress reaches its maximum bandwidth with 16KB-sized requests
    • 4.7GB/sec for random writes (258 us)
    • 7GB/sec for random reads (175 us)
      (vs. 532 us~600 us for the P4800X)

    [Figures: bandwidth (seq) and latency (seq)]

    Why are writes slower than reads?

  • CAMELab

    Performance w/ Microbenchmarks

    [Figure: PCIe RX/TX timelines of the doorbell, NVMe command, PRP, and data payloads for writes vs. reads]

    Writes: all payloads are serialized on the same link direction

    Reads: data and NVMe payloads are served in parallel

    [Figures: bandwidth (seq) and latency (seq)]

  • CAMELab

    Real Workload Evaluation

    [Figures: latency (microseconds) and bandwidth for 24HR, FIU, DevDiv, Server, and TPCE on OpenExpress]

    Real-workload performance differs from the microbenchmark results because of unaligned request offsets,
    sector length variations, etc.
    • Most storage cards cannot reach their best performance under real workload executions

    Read-intensive workloads (DevDiv and Server)
    • OpenExpress offers 4GB/sec ~ 4.5GB/sec (100~200 us)

    Write-intensive workloads (24HR, FIU, and TPCE)
    • 1.25GB/sec~2.1GB/sec (all under 100 us)

  • CAMELab

    Real Workload Evaluation

    [Figures: latency (microseconds) and bandwidth for 24HR, FIU, DevDiv, Server, and TPCE, comparing the
    Optane SSD and OpenExpress]

    Bandwidth
    • OpenExpress shows 76.3% better bandwidth than the Optane SSD, on average

    Latency
    • It exhibits 68.6% shorter latency than the Optane SSD

    DevDiv
    • 111.5% better performance compared to the Optane SSD (2.1GB/sec)

    The FPGA is still much slower, but the FPGA design and implementation of the NVMe IP cores are not on the
    critical path and can be used for system-level studies as a research vehicle

  • CAMELab

    Conclusion: Related Work and Download

    [Chart: per-month unit prices for 3rd-party NVMe IP cores (Ip-m****, Inte******, Ep*****): $35K, $45K, $40K]
    • The price of a single-use source-code license is around $100K

    For academic/non-commercial purposes, OpenExpress will be freely downloadable:
    • Hardware automation IP cores (HTRW, FET, CMT, CTX, ...)
    • Firmware for MicroBlaze (to control admin command management, device initialization, etc.)

    Download information
    • https://openexpress.camelab.org

  • Computer Architecture and Memory systems Laboratory

    CAMELab

    Myoungsoo Jung

    ATC 2020

    Sponsored by

    Fully Hardware Automated Open Research Framework for Future Fast NVMe Device
