
  • Computer Architecture and Memory systems Laboratory

    CAMELab

    Myoungsoo Jung

    ATC 2020

    Sponsored by

    Fully Hardware Automated Open Research Framework for Future Fast NVMe Device

  • CAMELab

    Emerging Non-Volatile Memory for SSDs

    [Chart: read latency per memory type]

    Memory type    Read latency
    TLC            450 us
    MLC            150 us
    SLC            25 us
    New Flash      3 us
    PRAM           120 ns
    MRAM           50~80 ns
    DRAM           60~80 ns

    TLC, MLC, SLC, and new flash are flash technologies; the faster end of the spectrum (e.g., new flash, PRAM, MRAM) is referred to as storage class memory (SCM).

  • CAMELab

    NVMe Internals and Interfaces

    [Figure: NVMe SSD internals — a controller (CTRL) with an embedded CPU connected to multiple Flash chips]

  • CAMELab

    NVMe Storage Stack

    [Figure: host storage stack on top of an NVMe SSD (CTRL + Flash) delivering 1~3 GB/sec]

    Applications (Processes) — VFS/FS — Block layer — Page cache — Block device driver — NVMe SSD (CTRL + Flash)

  • CAMELab

    NVMe Storage Stack Redesign

    [Figure: the same host stack and NVMe SSD (1~3 GB/sec), targeted by prior redesign work]

    • FlashShare: Punching Through Server Storage Stack from Kernel to Firmware for Ultra-Low Latency SSDs (OSDI’18)
    • De-indirection for Flash-Based SSDs with Nameless Writes (FAST’12)
    • Towards SLO Complying SSDs Through OPS Isolation (FAST’15)
    • The Case of FEMU: Cheap, Accurate, Scalable and Extensible Flash Emulator (FAST’18)
    • There are more and more!

    Challenge #1: Most storage research relies on simulation or kernel-level emulation.

  • CAMELab

    SCM-based NVMe Storage Card

    [Figure: host storage stack on top of an SCM-based NVMe card (CPU + CTRL + SCM modules) delivering 7 GB/sec]

    Challenge #2: The SSD’s CPU can be a performance bottleneck for SCMs.

  • CAMELab

    What Does SSD’s CPU Do?

    [Figure: host stack (Applications, VFS/FS, Block layer, Page cache, Block device driver) with a
    submission queue (SQ), completion queue (CQ), and data (PRP) buffers in host memory, SQ/CQ doorbell
    device registers, and the SCM-based NVMe device (CPU + CTRL + SCM)]

    ❶ I/O submission
    ❷ Ring SQ doorbell
    ❸ I/O fetch
    ❹ Data transfer
    ❺ I/O process
    ❻ I/O completion
    ❼ Interrupt (notification)
    ❽ Process completion
    ❾ Ring CQ doorbell

    All these NVMe activities place a burden on the storage device! A minimal host-side sketch of steps
    ❶–❷ and ❽–❾ follows.
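    To make the handshake concrete, here is a minimal C sketch of the host-side half of this flow (steps ❶–❷
    and ❽–❾); the device-side steps (❸–❼) are exactly what the SSD’s CPU, and later OpenExpress, must handle.
    This is an illustrative sketch under simplifying assumptions: the nvme_cmd/nvme_cqe layouts are trimmed
    stand-ins (a real NVMe SQ entry is 64 bytes and a CQ entry 16 bytes), and the doorbells are assumed to be
    memory-mapped BAR registers.

```c
/* Simplified host-side view of one NVMe command lifecycle (steps 1-2 and 8-9).
 * Illustrative only: structure layouts and the MMIO mapping are assumptions. */
#include <stdint.h>
#include <string.h>

#define QUEUE_DEPTH 8

struct nvme_cmd { uint16_t cid; uint8_t opcode; uint64_t prp1, prp2; uint64_t slba; uint16_t nlb; };
struct nvme_cqe { uint16_t cid; uint16_t status; uint16_t sq_head; uint8_t phase; };

struct queue_pair {
    struct nvme_cmd sq[QUEUE_DEPTH];      /* submission queue in host memory */
    struct nvme_cqe cq[QUEUE_DEPTH];      /* completion queue in host memory */
    volatile uint32_t *sq_tail_doorbell;  /* MMIO register on the device (BAR) */
    volatile uint32_t *cq_head_doorbell;  /* MMIO register on the device (BAR) */
    uint16_t sq_tail, cq_head;
    uint8_t  expected_phase;              /* must start at 1: the device writes phase=1 on the first pass */
};

/* Steps 1+2: place a read command in the SQ and ring the SQ tail doorbell. */
static void submit_read(struct queue_pair *qp, uint16_t cid, uint64_t prp, uint64_t slba, uint16_t nlb)
{
    struct nvme_cmd *cmd = &qp->sq[qp->sq_tail];
    memset(cmd, 0, sizeof(*cmd));
    cmd->cid = cid; cmd->opcode = 0x02 /* NVM read */;
    cmd->prp1 = prp; cmd->slba = slba; cmd->nlb = nlb;
    qp->sq_tail = (qp->sq_tail + 1) % QUEUE_DEPTH;
    *qp->sq_tail_doorbell = qp->sq_tail;  /* the device now fetches the entry (step 3) */
}

/* Steps 8+9: poll the CQ phase bit, consume the entry, ring the CQ head doorbell. */
static int poll_completion(struct queue_pair *qp, uint16_t *cid_out)
{
    struct nvme_cqe *cqe = &qp->cq[qp->cq_head];
    if (cqe->phase != qp->expected_phase)
        return 0;                         /* nothing completed yet */
    *cid_out = cqe->cid;
    qp->cq_head = (qp->cq_head + 1) % QUEUE_DEPTH;
    if (qp->cq_head == 0)
        qp->expected_phase ^= 1;          /* phase flips on every CQ wrap */
    *qp->cq_head_doorbell = qp->cq_head;
    return 1;
}
```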

  • CAMELab

    Multi-core IP for High-Performance SSD

    [Figure: host (NVMe driver, SQ/CQ on Core0) connected over PCIe to the SSD controller: PCIe client logic,
    inbound/outbound paths, multiple embedded CPUs with I-RAM, interconnection networks, a memory
    controller/SRAM, and a backend channel complex]

  • CAMELab

    Component Latency Decomposition

    [Figure: normalized latency breakdown (0.0~1.0) per memory type (TLC, MLC, SLC, Z-NAND, PRAM, MRAM),
    decomposed into Completion, Translation, PRP, Queue/Doorbells, Fetching, and NVM components]

    CPU bursts can be overlapped with I/O bursts

    Firmware control becomes the critical performance bottleneck

  • CAMELab

    Overview

    A framework that fully automates NVMe control logic in hardware, making it possible to build customizable devices
    • OpenExpress is an NVMe host accelerator IP meant for integration with an easy-to-access FPGA design

    OpenExpress requires no software intervention to process concurrent NVMe read and write requests
    • It supports scalable data submission, many outstanding NVMe commands, and submission/completion queue management

    We prototype OpenExpress on a commercially available Xilinx FPGA board and optimize all the logic modules to operate at a high frequency

  • CAMELab

    Full Hardware Automation for Critical I/O Path

    [Figure: the multi-core SSD controller from before, with the embedded CPU cores on the critical I/O
    datapath replaced by hardware automation (OpenExpress) attached to the SoC memory bus; the host-side
    NVMe driver and SQ/CQ remain unchanged over PCIe]

  • CAMELab

    Queue Dispatching

    [Figure: OpenExpress exposes an SQ tail doorbell region (SQ 0 DB, SQ 1 DB, ...) over address, data, and
    write-response channels; the SQ Entry Fetch Manager (FET) watches this region — a sketch of the doorbell
    bookkeeping follows]
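    As a rough model of what FET does when a tail doorbell is written, the sketch below keeps a shadow head
    and tail per SQ and computes how many new 16-DW entries must be fetched; the function name and the exact
    bookkeeping are assumptions for illustration, not the module's actual interface.

```c
/* Illustrative model of the SQ Entry Fetch Manager (FET): when the host writes a
 * new tail value into the SQ tail doorbell region, compute how many submission
 * queue entries became visible and need to be fetched over PCIe. */
#include <stdint.h>

#define NUM_SQ   2
#define SQ_SIZE  64           /* entries per submission queue (assumed) */

struct sq_state {
    uint16_t shadow_head;     /* entries already fetched by FET */
    uint16_t tail;            /* last tail value the host rang in */
};

static struct sq_state sqs[NUM_SQ];

/* Called when a write to the SQ tail doorbell region targets queue `qid`. */
static uint16_t fet_on_doorbell_write(uint16_t qid, uint16_t new_tail)
{
    struct sq_state *sq = &sqs[qid];
    sq->tail = new_tail;
    /* Number of new entries, accounting for wrap-around of the circular queue. */
    uint16_t pending = (uint16_t)((sq->tail + SQ_SIZE - sq->shadow_head) % SQ_SIZE);
    /* FET would now issue `pending` 16-DW (64-byte) reads to the host SQ and
     * advance shadow_head as each entry arrives; here we just report the count. */
    sq->shadow_head = sq->tail;
    return pending;
}
```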

  • CAMELab

    Data Transferring

    [Figure: after FET fetches an SQ entry, the PRP Engine (HTRW) walks the command's PRP entries and the
    Backend DMA (DMA) moves data between the PCIe EP complex and the memory backend (DIMM0~DIMM3); a PRP-walk
    sketch follows]
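    For context on what the PRP Engine automates, here is a simplified walk of an NVMe PRP1/PRP2 pair into a
    list of 4 KB host page addresses. The helper name and the flat prp_list parameter are simplifications
    (a real PRP list lives in host memory and may chain across pages); this is a sketch of the rule, not the
    HTRW implementation.

```c
/* Simplified PRP (Physical Region Page) traversal: turn a command's PRP1/PRP2
 * fields into the list of host page addresses to DMA. Assumes 4 KB pages and a
 * PRP list that fits in one page. */
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096u

static size_t prp_to_pages(uint64_t prp1, uint64_t prp2, const uint64_t *prp_list,
                           size_t xfer_len, uint64_t *pages, size_t max_pages)
{
    size_t n = 0;
    size_t first = PAGE_SIZE - (size_t)(prp1 % PAGE_SIZE);   /* bytes covered by PRP1 */
    if (first > xfer_len) first = xfer_len;
    pages[n++] = prp1;
    size_t remaining = xfer_len - first;

    if (remaining == 0)
        return n;                      /* whole transfer fits in the first page */
    if (remaining <= PAGE_SIZE) {
        pages[n++] = prp2;             /* PRP2 is a direct second page pointer */
        return n;
    }
    /* Otherwise PRP2 points to a PRP list; each entry is one page address. */
    for (size_t i = 0; remaining > 0 && n < max_pages; i++) {
        pages[n++] = prp_list[i];
        remaining = remaining > PAGE_SIZE ? remaining - PAGE_SIZE : 0;
    }
    return n;
}
```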

  • CAMELab

    Completion Handling

    [Figure: the Completion Handler (CMT) posts CQ entries back to the host and exposes a CQ head doorbell
    region (CQ 0 DB, CQ 1 DB, ...) alongside the SQ tail doorbell region, FET, HTRW, DMA, the PCIe EP
    complex, and the memory backend (DIMM0~DIMM3); a sketch of phase-bit completion posting follows]
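    The sketch below shows the kind of bookkeeping a completion path needs: writing a CQ entry carrying the
    current phase bit into the host CQ, flipping the phase on wrap, and raising an MSI. The function names
    and the raise_msi stub are illustrative assumptions, not the CMT module's real interface.

```c
/* Illustrative completion posting, roughly what a completion handler must do. */
#include <stdint.h>

#define CQ_SIZE 64

struct cq_entry { uint32_t result; uint16_t sq_head; uint16_t sq_id;
                  uint16_t cid; uint16_t status_phase; };

struct cq_state {
    struct cq_entry *host_cq;   /* host-memory CQ, written via PCIe */
    uint16_t tail;
    uint8_t  phase;             /* starts at 1; host polls for this value */
};

static void raise_msi(uint16_t vector) { (void)vector; /* stub: MSI write to host */ }

static void cmt_post(struct cq_state *cq, uint16_t cid, uint16_t sq_id,
                     uint16_t sq_head, uint16_t status)
{
    struct cq_entry *e = &cq->host_cq[cq->tail];
    e->cid = cid; e->sq_id = sq_id; e->sq_head = sq_head;
    /* Phase bit in bit 0 of this halfword, status above it (mirrors the upper
     * half of NVMe CQ entry DW3). */
    e->status_phase = (uint16_t)((status << 1) | cq->phase);
    cq->tail = (cq->tail + 1) % CQ_SIZE;
    if (cq->tail == 0)
        cq->phase ^= 1;         /* phase flips on every CQ wrap */
    raise_msi(0);
}
```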

  • CAMELab

    NVMe Context Management

    [Figure: an NVMe Context Box (CTX) is added alongside FET, HTRW, DMA, and CMT to manage NVMe command
    contexts across the doorbell regions, the PCIe EP complex, and the memory backend (DIMM0~DIMM3); a sketch
    of a per-command context table follows]
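    One plausible shape for such context state is a small table indexed by a device-internal tag, recording
    where each in-flight command came from and what it still needs. The fields and names below are
    assumptions chosen for illustration, not the CTX module's real layout.

```c
/* Illustrative per-command context table: a hardware context manager needs some
 * record of each in-flight command so DMA completions can be matched back to the
 * right SQ/CQ and PRPs. */
#include <stdint.h>

#define MAX_OUTSTANDING 256

enum cmd_state { CTX_FREE, CTX_FETCHED, CTX_DMA_IN_FLIGHT, CTX_DONE };

struct cmd_ctx {
    enum cmd_state state;
    uint16_t cid;          /* host-assigned command identifier */
    uint16_t sq_id;        /* which submission queue it came from */
    uint8_t  opcode;       /* read/write */
    uint64_t prp1, prp2;   /* data pointers to walk */
    uint32_t bytes_left;   /* remaining DMA bytes before a completion can be posted */
};

static struct cmd_ctx ctx_table[MAX_OUTSTANDING];

/* Allocate a free context slot for a newly fetched SQ entry; returns -1 if full. */
static int ctx_alloc(uint16_t cid, uint16_t sq_id, uint8_t opcode,
                     uint64_t prp1, uint64_t prp2, uint32_t len)
{
    for (int tag = 0; tag < MAX_OUTSTANDING; tag++) {
        if (ctx_table[tag].state == CTX_FREE) {
            ctx_table[tag] = (struct cmd_ctx){ CTX_FETCHED, cid, sq_id, opcode,
                                               prp1, prp2, len };
            return tag;    /* the tag travels with the DMA so completions find the ctx */
        }
    }
    return -1;
}
```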

  • CAMELab

    Operation Flow of OpenExpress

    [Figure: host-side Submission Queue 1 and Completion Queue 1 (head/tail, CQ 1 BAR), host DRAM holding the
    DATA and PRP lists, and the OpenExpress pipeline (doorbell regions, FET, HTRW, DMA, CMT, CTX, PCIe EP
    complex, DIMM0~DIMM3)]

    1. The host issues a command into Submission Queue 1.
    2. The host writes the SQ 1 tail pointer (doorbell).
    3. OpenExpress monitors the doorbell region and signals an event.
    4. FET fetches the 16-DW SQ entry.
    5. The parsed command is sent downstream.
    6. The PRP engine processes the PRPs, and the data associated with each PRP is copied.
    7. A CQ entry is posted.
    8. An MSI is created and the host is interrupted.
    9. The host writes the CQ 1 head pointer (doorbell).

  • CAMELab

    Major IP Cores

    [Figure: the frontend comprises the FET, HTRW, CMT, and CTX IP cores]

  • CAMELab

    Synthesis and Implementation

    Item          Description
    FPGA          Xilinx Virtex UltraScale 190
    Memory        Four 64GB DDR4 x72 DIMMs (256GB)
    Interface     PCIe Gen3
    CPU           i5-9400 2.9GHz
    Design Suite  Vivado 2017.3.1
    OS            Ubuntu 18.04 LTS
    Build Time    More than 7 hours

  • CAMELab

    Frequency Tuning (50MHz ~ 250MHz)

    Detect long routing delays and place register slice IPs between hardware modules
    • This can increase the frequency while minimizing negative slack
    • Gradual trial-and-error to reduce the amount of routing delay

    [Figure: putting register slices between the different IP cores — Memory Controller, DDR interconnect,
    AXI, DMA, PCIe EP, CMT, FET, HTRW, CTX]

  • CAMELab

    Frequency Tuning (50MHz ~ 250MHz)

    Group hardware modules during floorplanning
    • After synthesis, create an FPGA pblock (a set of grid cells), allocate hardware IPs to the pblock, and
      run PNR (place and route)
    • Check the implementation report and do gradual trial-and-error again

    [Figure: reducing route distances by grouping IP logic — Memory Controller, DDR interconnect, AXI, DMA,
    PCIe EP, CMT, FET, HTRW, CTX]

  • CAMELab

    Perf. Improvement of Frequency Tuning

    The performance improvement for reads and writes is as high as 20% and 60%, respectively

    Reads exhibit more benefit due to the nature of data movement over PCIe packets (explained shortly)

    Larger block sizes benefit more because of the automated PRP data processing

    [Figure: read and write improvements across block sizes]

  • CAMELab

    Perf. Improvement of Frequency Tuning

    [Figure: host side (PCIe BAR with control register set, IO SQ/CQ regions, IO commands, PRP list and
    pointers in host DRAM) and OpenExpress side (inbound/outbound paths over PCIe, emulated backend DRAM),
    shown for reads and writes]

    PRP parsing/traversing and data transfers are fully automated

  • CAMELab

    Evaluation

    For all executions, a single I/O worker cannot extract the full bandwidth (CPU bottleneck)
    • We execute 10 threads for both the benchmark runs and the real workloads

    Item           Description
    CPU            8-core 3.3GHz Intel Skylake-X server microarchitecture
    DRAM           32GB DDR4 (2666)
    Benchmark      FIO
    Real workload  CAMEL's Open Storage Trace (SNIA), https://trace.camelab.org/
    I/O workers    10 threads
    Block device   Optane SSD P4800X

    All evaluations in the paper are performed with a queue depth of 8
    • The queue depth exhibiting the best bandwidth

    Note that we don't claim that OpenExpress is faster than other fast NVMe devices
    • Instead, the evaluation shows that an FPGA-based design and implementation of NVMe IP cores can offer
      good performance, making it a viable candidate for use in storage research

  • CAMELab

    Performance w/ Microbenchmarks

    [Figures: bandwidth (sequential and random) and latency (sequential and random)]

  • CAMELab

    Performance w/ Microbenchmarks

    4KB-sized requests
    • There is not much bandwidth/latency difference between reads and writes (3 GB/sec vs. 2.8 GB/sec)
    • OpenExpress latency is 72~77 us at queue depth eight (vs. 120~150 us for the P4800X)

    OpenExpress reaches its maximum bandwidth with 16KB-sized requests
    • 4.7GB/sec for random writes (258 us)
    • 7GB/sec for random reads (175 us)
      (vs. 532 us~600 us for the P4800X)

    [Figures: bandwidth (seq) and latency (seq)]

    Why are writes slower than reads?

  • CAMELab

    Performance w/ Microbenchmarks

    [Figure: PCIe RX/TX timelines of the doorbell, NVMe command, PRP, and data payloads for writes vs. reads]

    Writes: all payloads are serialized on the same link direction

    Reads: data and NVMe payloads are served in parallel

    [Figures: bandwidth (seq) and latency (seq)]

  • CAMELab

    Real Workload Evaluation

    [Figures: latency (microseconds) and bandwidth for 24HR, FIU, DevDiv, Server, and TPCE on OpenExpress]

    Real-workload performance differs from the microbenchmark results because of unaligned request offsets,
    sector length variations, etc.
    • Most storage cards cannot reach their best performance under real workload executions

    Read-intensive workloads (DevDiv and Server)
    • OpenExpress offers 4GB/sec ~ 4.5GB/sec (100~200 us)

    Write-intensive workloads (24HR, FIU, and TPCE)
    • 1.25GB/sec~2.1GB/sec (all under 100 us)

  • CAMELab

    Real Workload Evaluation

    [Figures: latency (microseconds) and bandwidth for 24HR, FIU, DevDiv, Server, and TPCE, comparing the
    Optane SSD and OpenExpress]

    Bandwidth
    • OpenExpress shows 76.3% better bandwidth than the Optane SSD, on average

    Latency
    • It exhibits 68.6% shorter latency than the Optane SSD

    DevDiv
    • 111.5% better performance compared to the Optane SSD (2.1GB/sec)

    The FPGA is still much slower, but the FPGA design and implementation of the NVMe IP cores are not on the
    critical path and can be used for system-level studies as a research vehicle

  • CAMELab

    Conclusion: Related Work and Download

    [Chart: per-month unit prices for 3rd-party NVMe IP cores (Ip-m****, Inte******, Ep*****): $35K, $45K, $40K]
    • The price of a single-use source-code license is around $100K

    For academic/non-commercial purposes, OpenExpress will be freely downloadable:
    • Hardware automation IP cores (HTRW, FET, CMT, CTX, ...)
    • Firmware for MicroBlaze (to control admin command management, device initialization, etc.)

    Download information
    • https://openexpress.camelab.org

  • Computer Architecture and Memory systems Laboratory

    CAMELab

    Myoungsoo Jung

    ATC 2020

    Sponsored by

    Fully Hardware Automated Open Research Framework for Future Fast NVMe Device
