Zoned Namespaces Use Cases, Standard and Linux Ecosystem
SDC SNIA EMEA 20 | Javier González – Principal Software Engineer / R&D Center Lead
Why do we need a new interface?
[Figure: Traditional SSD stack. The device FTL implements a log-structured mapping and owns address mapping (L2P), data placement, metadata management, I/O scheduling, media management, and garbage collection. The host stack (log-structured applications such as RocksDB, the VFS, log-structured file systems such as F2FS, XFS, ZFS, and the block layer) duplicates address mapping (L2L), data placement, metadata management, and garbage collection, and talks to the device via Read / Write / Trim / Hints.]
https://blocksandfiles.com/2019/08/28/nearline-disk-drives-ssd-attack/
SSDs are already mainstream
• Great performance (bandwidth / latency) – combination of NAND + NVMe
• Easy to deploy – direct replacement for HDDs
• Acceptable price ($/GB)

But we have three recurrent problems:
1. Log-on-log problem (WAF + OP): remove redundancies; leverage log data structures
2. Cost gap with HDDs: higher bit count (QLC); reduce DRAM; reduce OP and WAF
3. Multi-tenancy everywhere: address the noisy-neighbor problem; provide QoS
J. Yang, N. Plasson, G. Gillis, N. Talagala, and S. Sundararaman. Don't stack your log on my log. In 2nd Workshop on Interactions of NVM/Flash with Operating Systems and Workloads (INFLOW), 2014.
ZNS: Basic Concepts
[Figure: Zoned address space. The LBA space (LBA 0 to LBA m-1) is divided into zones (Zone 0 to Zone n-1). Within a zone, LBAs run from ZSLBA to ZSLBA + ZCAP - 1 and are written strictly sequentially at the write pointer; the zone can only be written again after a reset. Zones map onto the dies (Die 0 to Die k-1) in device-chosen layouts (Config. A, Config. B, ..., Config. Z).]
Divide the LBA address space into fixed-size zones
• Write constraints (see the sketch below)
  - Write strictly sequentially within a zone (write pointer)
  - Explicitly reset the write pointer before rewriting a zone
• No zone mapping constraints
  - The physical location of a zone is transparent to the host
  - Open design space for different configurations
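As an illustration of the two write constraints, here is a minimal, hypothetical host-side zone model; only the terms ZSLBA, ZCAP, and write pointer come from the spec, the struct and helper names are just a sketch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical host-side view of one zone (illustration only). */
struct zone {
    uint64_t zslba;   /* first LBA of the zone        */
    uint64_t zcap;    /* writable capacity, in LBAs   */
    uint64_t wp;      /* next LBA that may be written */
};

/* Constraint 1: a write is valid only if it starts exactly at the write
 * pointer and does not run past ZSLBA + ZCAP. */
static bool zone_write_ok(const struct zone *z, uint64_t slba, uint32_t nlb)
{
    return slba == z->wp && slba + nlb <= z->zslba + z->zcap;
}

static void zone_advance_wp(struct zone *z, uint32_t nlb)
{
    z->wp += nlb;     /* strictly sequential within the zone */
}

/* Constraint 2: only an explicit reset makes the zone writable again
 * from its start. */
static void zone_reset(struct zone *z)
{
    z->wp = z->zslba;
}
```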
Zone state machine (states summarized in the figure and enum sketch below)
• Zone transitions managed by the host
  - Through direct commands or by crossing zone boundaries
• Zone transitions triggered by the device
  - Notified to the host through an exception
[Figure: Zone state machine. States: ZSE (Empty), ZSIO (Implicitly Open), ZSEO (Explicitly Open), ZSC (Closed), ZSF (Full), ZSRO (Read Only), ZSO (Offline). The open states count against open resources; open and closed states count against active resources. The figure also notes LBA mapping within a zone and zone mapping within the device.]
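For reference, the same states as a C enum plus a deliberately simplified host-side transition helper; the enum values mirror the figure, while the transition function is only a sketch of the host-managed path, not the full spec state machine:

```c
#include <stdbool.h>

/* Zone states as named in the figure. */
enum zone_state {
    ZSE,   /* Empty                                   */
    ZSIO,  /* Implicitly Open (opened by a write)     */
    ZSEO,  /* Explicitly Open (opened by a command)   */
    ZSC,   /* Closed (still counts as an active zone) */
    ZSF,   /* Full                                    */
    ZSRO,  /* Read Only (device-triggered)            */
    ZSO,   /* Offline (device-triggered)              */
};

/* Simplified host-managed transition: a write implicitly opens an Empty
 * or Closed zone, and reaching ZCAP moves any open zone to Full. */
static enum zone_state on_write(enum zone_state s, bool reaches_zcap)
{
    if (s == ZSE || s == ZSC)
        s = ZSIO;
    if (reaches_zcap)
        s = ZSF;
    return s;
}
```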
ZNS: Advanced Concepts

Append Command
• Implementation of nameless writes [1]
• Benefits
  - Increases the queue depth within a single zone to improve parallelism
• Host changes
  - Changes needed to the block allocator in the application / file system
  - Full reimplementation of LBA-based features in the submission path (e.g., snapshotting)
[1] Y. Zhang, L. P. Arulraj, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. De-indirection for flash-based SSDs with nameless writes. FAST, 2012.
Write path
• Requires QD = 1
  - Reminder: no ordering guarantees in NVMe

Append path
• QD <= X, where X = number of LBAs in the zone (see the sketch below)
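A minimal sketch of the difference, using hypothetical stand-ins for the NVMe Write and Zone Append commands; the helpers below only simulate the device-side write pointer and are not a real driver API:

```c
#include <stdint.h>

static uint64_t wp;   /* simulated device-side write pointer of one zone */

/* Write path: the host names the exact LBA, so anything that is not the
 * current write pointer fails. Since NVMe gives no ordering guarantees
 * between queued commands, the host must stay at QD = 1 per zone. */
static int zns_write(uint64_t slba, uint32_t nlb)
{
    if (slba != wp)
        return -1;    /* out-of-order arrival: write error */
    wp += nlb;
    return 0;
}

/* Append path: the host names only the zone start LBA (ZSLBA); the device
 * picks the placement and returns the assigned LBA in the completion, so
 * many appends to the same zone can be in flight (QD up to ZCAP LBAs). */
static int64_t zns_append(uint64_t zslba, uint32_t nlb)
{
    (void)zslba;                     /* placement happens at the write pointer */
    int64_t assigned = (int64_t)wp;
    wp += nlb;
    return assigned;                 /* host records this LBA in its allocator */
}
```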
[Figure: Write path vs. append path. With regular writes the host must stay at QD 1 so that each command lands exactly on the write pointer. With zone append the host keeps multiple commands in flight on the submission queue (SQ); the device places each one at the write pointer and returns the assigned LBA (ZSLBA+3, ZSLBA+4, ..., up to ZSLBA+ZCAP-1) on the completion queue (CQ).]
ZNS: Host Responsibilities
ZNS requires changes to the host
• First step: changes to file systems
  - Advantages in log-structured file systems (e.g., F2FS, XFS, ZFS)
  - Still best-effort from the application perspective
  - An important first step for adoption
• Second step: changes to applications
  - Only makes sense for log-structured applications (e.g., LSM databases such as RocksDB)
  - Real benefits in terms of WAF, OP, and data movement (GC)
  - Need a way to minimize application changes! (Spoiler alert: xNVMe)
[Figure: Responsibility split. Traditional SSD: the device FTL owns sector mapping (L2P), I/O scheduling, metadata management, exception management, media management, garbage collection, and over-provisioning; the host just issues Read / Write / Trim / Hints. ZNS SSD: the Zone FTL keeps zone mapping (L2P), I/O scheduling, zone metadata management, exception management, and media management, while the host takes over zone address mapping (L2L), zone data placement, zone garbage collection, and metadata management, using Read / Write / Append / Reset / Commit for I/O plus zone management admin commands and exception handling.]
ZNS Extensions: Zone Random Write Area (ZRWA)

Concept
• Expose the write buffer in front of a zone to the host

Benefits
• Enables in-place updates within the ZRWA
  - Reduces WAF due to metadata updates
  - Aligns with the metadata layout of many file systems
  - Simplifies recovery (no changes to the metadata layout)
• Allows writes at QD > 1
  - No changes needed to the host stack (as opposed to append)
  - ZRWA size is balanced against the expected performance

Operation (see the sketch below)
• Writes are placed into the ZRWA
  - No write pointer constraint
• In-place updates are allowed within the ZRWA window
• The ZRWA can be committed explicitly
  - Through a dedicated command
• The ZRWA can be committed implicitly
  - When writing past sizeof(ZRWA)
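A minimal sketch of the ZRWA flow described above, with hypothetical helper names and window size; none of these are spec or kernel APIs:

```c
#include <stdint.h>

#define ZRWA_SIZE 64u   /* example window size in LBAs; device-reported in practice */

static uint64_t wp;     /* committed write pointer of one zone */

/* Inside [wp, wp + ZRWA_SIZE) writes may arrive in any order, at QD > 1,
 * and may overwrite LBAs already written (in-place updates). */
static int zrwa_write(uint64_t slba, uint32_t nlb)
{
    if (slba < wp || slba + nlb > wp + ZRWA_SIZE)
        return -1;      /* outside the ZRWA window */
    return 0;           /* buffered in the ZRWA, not yet durable */
}

/* Explicit commit: a dedicated command advances the write pointer to
 * 'commit_lba', making everything below it durable and sliding the window. */
static void zrwa_commit_explicit(uint64_t commit_lba)
{
    wp = commit_lba;
}

/* Implicit commit: a write starting beyond wp + ZRWA_SIZE forces the device
 * to commit the head of the window first so the window can move forward. */
```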
[Figure: Each zone exposes a ZRWA window ahead of its committed write pointer. Writes and in-place updates land anywhere inside the window; the window advances through an explicit commit command or implicitly once writes move past the ZRWA size.]
ZNS Extensions: Simple Copy
Concept
• Offload the copy operation to the SSD
• "Simple" copy, because SCSI XCOPY became too complex

Benefits
• Data is moved by the SSD controller
  - No data movement through the interconnect (e.g., PCIe)
  - No CPU cycles spent on data movement
• Especially relevant for ZNS host GC

Operation (see the sketch below)
• The host forms and submits an SC command
  - Selects a list of source LBA ranges
  - Selects a destination with a single LBA + length
• The SSD moves the data from the sources to the destination
  - Error model complexity depends on the scope

Scope
• Under discussion in NVMe
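Since the command format was still under discussion in NVMe at the time, the structs below are only an illustration of the described semantics (a list of source ranges plus one destination), not a ratified layout:

```c
#include <stdint.h>

struct sc_source_range {
    uint64_t slba;   /* start of one valid-data extent */
    uint32_t nlb;    /* length of the extent, in LBAs  */
};

struct sc_command {
    struct sc_source_range *ranges;   /* list of source LBA ranges         */
    uint32_t nranges;
    uint64_t dest_slba;               /* single destination start LBA; the */
                                      /* total length is the sum of ranges */
};

/* Host GC then becomes: gather the valid extents of the victim zones into
 * 'ranges', point 'dest_slba' at a fresh zone, and let the controller move
 * the data without crossing PCIe or burning host CPU cycles. */
```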
[Figure: Simple Copy gathering data from several zones (Zone 1, Zone 2, Zone 3) inside the SSD.]
ZNS: Archival Use Case
Goal: Reduce TCO
• Reduce WAF
• Reduce OP
• Reduce DRAM usage
• Facilitate consumption of QLC

Adoption
• Tiering of cold storage
  - Denmark cold: ZNS SSDs
  - Finland cold: high-capacity SMR HDDs
  - Mars cold: tape
• Leverage the SMR ecosystem
  - Build on top of the same abstractions
  - Expand use cases
[Figure: Archival layout. Large zones span all dies (Die 0 to Die k-1); a metadata zone holds the zone mapping. Each application owns whole zones (1 zone : 1 application), e.g., Application 1 holding zones 0, 7, 14, ..., 245 and Application 2 holding zones 10, 5, 12, ..., 351, targeting low GC. Large zones imply large erase blocks, large I/O sizes, and few "streams"; data placement and scheduling stay inside the device.]
Configuration optimized for cold data
• Large zones
• Semi-immutable per-zone data (all or nothing)
• Minimize metadata and mapping tables
• Need for large QDs (Append or ZRWA)

Architecture properties
• Host GC → less WAF and OP
• Zone mapping table → less DRAM (breaks the 1 GB of DRAM per 1 TB of flash rule of thumb; see the calculation below)
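To make the 1 GB per 1 TB point concrete, a back-of-the-envelope comparison; the 4 KiB mapping granularity, 4-byte entries, and 1 GiB zone size are illustrative assumptions, not numbers from the slides:

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const uint64_t capacity = 1ULL << 40;  /* 1 TiB of media (assumption)         */
    const uint64_t page_sz  = 4096;        /* 4 KiB mapping granularity (assumed) */
    const uint64_t entry_sz = 4;           /* 4-byte mapping entry (assumed)      */
    const uint64_t zone_sz  = 1ULL << 30;  /* 1 GiB zones (assumed)               */

    /* Page-level FTL: one entry per 4 KiB page -> about 1 GiB of DRAM per TiB. */
    printf("page-level L2P table: %llu MiB\n",
           (unsigned long long)((capacity / page_sz * entry_sz) >> 20));

    /* Zone-level mapping: one entry per zone -> a few KiB per TiB. */
    printf("zone-level mapping table: %llu KiB\n",
           (unsigned long long)((capacity / zone_sz * entry_sz) >> 10));
    return 0;
}
```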
ZNS: I/O Predictability Use Case

Goal: Support multi-tenant workloads
• Inherit the archival benefits:
  - ↓WAF, ↓OP, and ↓DRAM
• Support QoS policies
  - Provide isolation properties across zones
  - Enable host I/O scheduling
  - Allow more control over I/O submission / completion
[Figure: I/O predictability layout. Small zones map to individual dies; a metadata zone holds the zone mapping. Zones are grouped into isolation domains (1 isolation domain : 1 application), e.g., Application 1 owning zones 0, k, 2k, ..., n-k and Application 2 owning zones 1, k+1, 2k+1, ..., n-1. Small zones imply small erase blocks, variable I/O sizes, and tens to hundreds of "streams"; the host manages data placement, striping, and I/O scheduling, targeting low GC.]
Configuration optimized for multi-tenancy
• Zones grouped into "isolation domains"
• Host data placement / striping (see the sketch below)
• Host I/O scheduling

Inherit the archival architecture properties
• Host GC → less WAF and OP
• Zone mapping table → less DRAM (breaks the 1 GB per 1 TB rule of thumb)
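A hypothetical sketch of host data placement: logical stripe units are spread round-robin over the zones of one isolation domain (matching the figure, where Application 1 owns zones 0, k, 2k, ...); the struct and helper are illustrative only:

```c
#include <stdint.h>

struct isolation_domain {
    uint64_t *zslba;   /* start LBA of each zone in the domain */
    uint64_t *wp;      /* host-tracked write pointer per zone  */
    uint32_t  nzones;
};

/* Pick the zone for the next stripe unit and return the LBA to write;
 * each zone is still written strictly sequentially at its write pointer. */
static uint64_t place_stripe(struct isolation_domain *d,
                             uint64_t stripe_no, uint32_t nlb)
{
    uint32_t z   = stripe_no % d->nzones;   /* round-robin across the domain */
    uint64_t lba = d->wp[z];
    d->wp[z] += nlb;
    return lba;
}
```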
Linux Stack: Zoned Block Framework

Purpose
• Built for SMR drives
• Adds support for write constraints
• Exposes zones to applications through the block layer (see the ioctl sketch below)

Kernel status
• Merged in 4.9
• Block / drivers
  - Support in the block layer
  - Support in SCSI / ATA
  - Support in null_blk
• File systems
  - F2FS
  - Btrfs (patches posted)
  - ZFS / XFS (ongoing)
• Libraries
  - libzbc
• Tools
  - util-linux, fio
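A minimal sketch of that block-layer interface, assuming a zoned device at a placeholder path: report the first few zones and reset the first one through the kernel's zoned block ioctls from <linux/blkzoned.h> (on a ZNS drive, the reset corresponds to a Zone Management Send / Reset Zone command):

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/blkzoned.h>

int main(void)
{
    int fd = open("/dev/nvme0n1", O_RDWR);   /* placeholder device path */
    if (fd < 0) { perror("open"); return 1; }

    /* Report the first 4 zones starting at sector 0. */
    unsigned int nr = 4;
    struct blk_zone_report *rep =
        calloc(1, sizeof(*rep) + nr * sizeof(struct blk_zone));
    rep->sector = 0;
    rep->nr_zones = nr;
    if (ioctl(fd, BLKREPORTZONE, rep) < 0) { perror("BLKREPORTZONE"); return 1; }

    for (unsigned int i = 0; i < rep->nr_zones; i++)
        printf("zone %u: start=%llu len=%llu wp=%llu cond=%u\n", i,
               (unsigned long long)rep->zones[i].start,
               (unsigned long long)rep->zones[i].len,
               (unsigned long long)rep->zones[i].wp,
               rep->zones[i].cond);

    /* Reset the write pointer of the first reported zone. */
    struct blk_zone_range range = {
        .sector     = rep->zones[0].start,
        .nr_sectors = rep->zones[0].len,
    };
    if (ioctl(fd, BLKRESETZONE, &range) < 0)
        perror("BLKRESETZONE");

    free(rep);
    close(fd);
    return 0;
}
```

The blkzone tool in util-linux and fio's zoned support drive this same interface.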
[Figure: Linux zoned block stack. In the kernel, the SCSI / ATA drivers and the block layer expose SAS ZBC / SATA ZAC HDDs as zoned block devices, with dm-zoned and zone-aware file systems (e.g., F2FS, Btrfs, ZFS) on top. In user space, applications go through those file systems, access the zoned block device directly, or use libzbc.]
Linux Stack: Adding ZNS support
[Figure: Adding ZNS to the Linux stack. Kernel space: the NVMe driver and block layer extend the zoned block framework with ZNS support, mq-deadline provides write ordering, F2FS gains ZNS support, and XFS targets zoned + ZNS; user space reaches the ZNS SSD through io_uring, libaio / sync I/O, IOCTLs, xNVMe, SPDK, and nvme-cli, for FS I/O, raw block I/O, and admin. Components receiving ZNS extensions include blkzoned, the file systems, fio, blktests, QEMU, and a RocksDB ZNS environment on xNVMe. I/O paths: legacy applications keep using a zoned FS (e.g., F2FS) for plain reads and writes; zoned applications (e.g., RocksDB) issue Read / Write / Append on the zoned block device plus zone management (e.g., Reset) via IOCTL, or go through xNVMe / POSIX backends with io_uring / AIO submission and completion queues.]
ZNS support in the zoned block framework
• Add support for ZNS-specific features (kernel & tools)
  - Append, ZRWA, and Simple Copy
  - Async path for the new admin commands
• Relax assumptions inherited from SMR
  - Conventional zones
  - Zone sizes, state transitions, etc.

I/O paths
• Using a zoned FS (e.g., F2FS)
  - No changes to the application
• Using a zoned application
  - Support in the application backend
  - Explicit zone management: data placement, scheduling, zone management
(A minimal fio job for exercising a zoned device follows below.)
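On the tooling side, a minimal fio job that respects the zone write constraints through zonemode=zbd might look like the following; this is an illustrative sketch, the device path is a placeholder, and it needs a fio build with zoned support:

```
; Illustrative fio job (not from the slides): sequential writes to a zoned
; device, letting fio honor the write pointer and open-zone limits.
[zns-seq-write]
filename=/dev/nvme0n1
ioengine=io_uring
direct=1
zonemode=zbd
rw=write
bs=128k
max_open_zones=8
```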
xNVMe: Supporting non-block Applications

xNVMe: a library to enable non-block applications across I/O backends
• Common API and helpers (see the sketch below)
• Linux kernel path: psync, libaio, io_uring
• SPDK: user space on Linux and FreeBSD
• Windows: future work – new backend required

Benefits
• Portable API across I/O backends and OSs (minimal benefit for block devices)
• Common API for new I/O interfaces (rewrite the application backend only once)
• File-system semantics on raw devices (block & non-block)
• No performance impact
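As a rough illustration of the "write the backend once" idea, a read sketch against the xNVMe C API; the calls follow the public API only approximately (names and signatures vary across xNVMe versions), so treat it as pseudocode rather than a verified example. The device URI decides whether I/O goes through psync, libaio, io_uring, or SPDK:

```c
#include <libxnvme.h>

int read_one_block(const char *uri, uint64_t slba)
{
    /* Open the same code path against different backends just by changing
     * the URI, e.g. a kernel block device vs. a PCIe address for SPDK.
     * Signatures below are approximate. */
    struct xnvme_dev *dev = xnvme_dev_open(uri, NULL);
    if (!dev)
        return -1;

    void *buf = xnvme_buf_alloc(dev, 4096);     /* backend-appropriate buffer */
    struct xnvme_cmd_ctx ctx = xnvme_cmd_ctx_from_dev(dev);

    /* Synchronous one-block read; the same call is routed to whichever
     * backend the device was opened with. */
    int err = xnvme_nvm_read(&ctx, xnvme_dev_get_nsid(dev), slba, 0, buf, NULL);

    xnvme_buf_free(dev, buf);
    xnvme_dev_close(dev);
    return err;
}
```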
Takeaways

ZNS is a new interface in NVMe
• Builds on top of Namespace Types (e.g., KV)
• Driven by vendors and early customers

ZNS targets 2 main use cases
• Archival (the original use case)
• I/O predictability (ZNS extensions)

The host needs changes
• Basic functionality builds on top of the existing zoned block framework
  - Respect the write constraints
  - Follow the zone state machine: Reset -> Open -> Closed -> Full / -> Offline / -> Read Only
• Append needs major changes to block allocators in the I/O path
  - Might work for some file systems / applications, but not for all
• ZRWA gives an alternative to append
  - Same functionality, with no changes to the block allocator in the I/O path

We are building an open ecosystem
• Linux stack to be released on TP ratification
  - xNVMe, Linux kernel, tools (e.g., nvme-cli), fio
  - RocksDB backend on xNVMe
• Sponsoring TPs for ZNS extensions in NVMe