
Page 1: PERSISTENT I/O CHALLENGES & APPROACHES

17-June-2011, TERENA TF on Storage

Angelos Bilas, FORTH (bilas@ics.forth)

Page 2: Outline

• Modern application stacks
  • Stream processing (STREAM)
  • Transaction processing (CumuloNimbo)
• Storage technologies
  • Storage virtualization and scaling with multicores (IOLanes)
  • Abstractions for modern applications (CumuloNimbo)
  • Parallel I/O (SCALUS)

• Remarks

Page 3: Application Stacks

• STREAM
• CumuloNimbo

Page 4: STREAM Global Architecture Picture

[Architecture diagram: example applications (credit-card fraud detection, SLA compliance, telephony fraud detection, COIs, aggregation queries, fraud profiles and monitoring, fraud-detection queries, SLA-violation detection) running over StreamCloud (parallel stream operators, parallel DB operators, fault tolerance, self-provisioning) and StreamMine (stream MapReduce operators, state machine, dynamic graphs), on top of a communication and storage layer with compressed SSD queues, mem-to-mem communication, persistent streaming, and silent error detection.]

Page 5: CumuloNimbo Global Architecture

[Architecture diagram: a layered stack (JEE application server: JBoss+Hibernate; object cache: CumuloCache; query engine: Derby; column-oriented data store & block cache: HBASE; distributed file system: HDFS; storage; communication), with transaction management components (transactions, concurrency controllers, commit sequencers, loggers) and elasticity management components (self-provisioner, monitors, load balancers) alongside.]

Page 6: Application Stacks

• They tend to be complex
• Each layer adds substantial protocol "machinery"
  • E.g. transactions, global name space
• Today I/O is a significant bottleneck
• Hard to know what all layers do
• Questionable what can be modified realistically

• How can modern storage systems best support these?

Page 7: Outline

• Modern application stacks
  • Stream processing (STREAM)
  • Transaction processing (CumuloNimbo)
• Storage technologies
  • Storage virtualization and scaling with multicores (IOLanes)
  • Abstractions for modern applications (CumuloNimbo)
  • Parallel I/O (SCALUS)

• Remarks

Page 8: Dimension Infrastructure Properly

[Infrastructure diagram: 1000s of application servers (different flavors of PCs/blades; multicores + memory + I/O xput) connected over a 10-40 Gbit/s high-speed interconnect to 100s of file servers; each file server has disk controllers (~2 GB/s), 12-36 SATA disks per node (100 MBy/s, ~2 TBytes each), a +10% SSD cache, and 10-100 Gbit/s links.]

• Dimensioning issues not straightforward today (rough arithmetic below)
  - I/O application overheads not understood
  - Do you balance thin or fat?
  - Other factors besides performance: power
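
As a rough illustration of the thin-vs-fat question, using only the numbers in the diagram above (not measurements): 36 SATA disks × 100 MBy/s is roughly 3.6 GB/s of raw disk bandwidth per file server, against ~2 GB/s of disk-controller throughput and 10-100 Gbit/s (about 1.25-12.5 GB/s) of network bandwidth, so depending on how the links are provisioned either the controller or the network saturates well before the disks do.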

Page 9: Scaling I/O on multicore CPUs

• Observation
  • As the number of cores increases in modern systems, we are not able to perform more I/O
  • Target: 1M IOPS, 10 GBytes/s
• Goal
  • Provide a scalable I/O stack (virtualized) over direct and networked storage devices
• Go over
  1. Performance and scaling analysis
  2. Hybrid hierarchies to take advantage of potential
  3. Design for memory and synchronization issues
  4. Parallelism in lower part of networked I/O stack

Page 10: (1) Performance and Scaling Analysis

• Bottom-up
• Controller
  • Actual controller
  • PCI
  • Host drivers
• Block layer
  • SCSI
  • Block
• Filesystem
  • xfs (a well accepted fs)
  • vfs (integral linux part)

[Stack diagram: applications and middleware in guest user space issue system calls into the guest OS kernel (VFS+FS, virtual drivers), which forwards I/O to the host OS (VFS+FS, block devices, SCSI layers, HW device drivers, PCI driver) and over the PCI Express interconnect to the storage, disk, and network controllers.]

Page 11: I/O Controller [Systor'10]

• (1) A queue protocol over PCI (sketched below)
  • Many parameters and quite complex
  • Requires decisions: tune for high throughput
• (2) Request translation on the controller
  • Memory management: balance between speed and waste
• (3) Request issue/completion towards devices
  • Use existing mechanisms but do careful scheduling

• Prototype comparable to commercial products
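
A minimal sketch of the kind of head/tail queue protocol outlined in (1) above, purely for illustration: the names, the queue depth, and the single-producer/single-consumer simplification are assumptions, not the Systor'10 design. Each side writes only its own index, which is what keeps the protocol cheap over PCI.

```c
#include <stdint.h>
#include <stdbool.h>

#define QDEPTH 128                       /* number of queue slots (power of two) */
#define QMASK  (QDEPTH - 1)

/* One request descriptor as it would sit in host memory, DMA-visible. */
struct io_req {
    uint64_t lba;                        /* starting block address               */
    uint32_t nblocks;                    /* request size in blocks               */
    uint32_t flags;                      /* e.g. read or write                   */
    uint64_t buf_phys;                   /* physical address of the data buffer  */
};

/* Shared ring: the host advances tail when submitting, the controller
 * advances head when it has consumed a descriptor. */
struct io_ring {
    volatile uint32_t head;              /* written by controller, read by host  */
    volatile uint32_t tail;              /* written by host, read by controller  */
    struct io_req slots[QDEPTH];
};

/* Host side: enqueue one request; returns false if the ring is full. */
static bool ring_submit(struct io_ring *r, const struct io_req *req)
{
    uint32_t tail = r->tail;

    if (tail - r->head >= QDEPTH)        /* no free slot                         */
        return false;

    r->slots[tail & QMASK] = *req;       /* copy descriptor into the slot        */
    __sync_synchronize();                /* descriptor visible before tail moves */
    r->tail = tail + 1;                  /* publish; the controller polls tail   */
    return true;
}

/* Controller side: dequeue one request; returns false if the ring is empty. */
static bool ring_consume(struct io_ring *r, struct io_req *out)
{
    uint32_t head = r->head;

    if (head == r->tail)                 /* nothing pending                      */
        return false;

    *out = r->slots[head & QMASK];
    __sync_synchronize();
    r->head = head + 1;                  /* tells the host the slot is free      */
    return true;
}
```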

Page 12: Results and Outlook

[Figures: DMA throughput (host-to-HBA and HBA-to-host, MB/sec vs. transfer size in KB), impact of host-issued PIO on DMA throughput (on/off; 2-way, to-host, from-host), and the host/controller queue over the PCIe interconnect: the controller initiates DMA, the controller needs to know the tail at the host, and the host needs to know the head at the controller.]

• xput: Each controller can achieve 2 GBytes/s bi-dir
• IOPs: Each controller can achieve ~80K IOPs
  • 50K for commercial controllers with full I/O processing
• Controller CPU is an important limitation
• Outlook
  • (1) Scale throughput and IOPs by using multiple controllers
  • (2) I/O controllers should be fused with the host CPU

Page 13: Block Layer

• I/O request protocol translation, e.g. SCSI
• Buffer management and placement (see the sketch below)
• Other layers involved, essentially a block-type operation
• Modern architecture trends create significant problems
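
To make the placement point concrete, a small user-space sketch of allocating an I/O buffer on the memory node of the CPU that will issue the request. This is an illustration only: it assumes libnuma is available, and the real block layer does the equivalent inside the kernel with per-node allocators.

```c
#define _GNU_SOURCE
#include <numa.h>          /* libnuma: link with -lnuma */
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Allocate an I/O buffer on the NUMA node of the CPU that will issue the
 * request, so copies in the block layer and the device DMA stay node-local. */
static void *alloc_local_io_buffer(size_t size)
{
    int cpu  = sched_getcpu();             /* CPU issuing the I/O            */
    int node = numa_node_of_cpu(cpu);      /* its local memory node          */
    return numa_alloc_onnode(size, node);  /* NULL if the node has no memory */
}

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not supported on this system\n");
        return 1;
    }
    size_t len = 1 << 20;                  /* a 1 MB I/O buffer              */
    void *buf = alloc_local_io_buffer(len);
    if (!buf)
        return 1;
    printf("I/O buffer allocated on node %d\n",
           numa_node_of_cpu(sched_getcpu()));
    numa_free(buf, len);
    return 0;
}
```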

Page 14: Results and Outlook

[Figures: sequential I/O throughput (MB/s, seq. reads and seq. writes) and random I/O operations (read and write IOPS) vs. number of controllers (1-4), and throughput vs. number of benchmark instances (1-8) for different thread/buffer placement policies (TLOR0, TROR0, TLORPRIL, TLORPLIL).]

• Translation processing scales with number of cores
  • Both throughput and IOPs
• I/O translation incurs overhead
• Affinity an important problem
  • Wrong placement can reduce throughput almost to half

Page 15: Filesystem

• Complex layer
  • Many complain about FS performance on multicores
  • Translates from a (request, file, offset, size) API to a (request, block#) API
  • Responsible for recovery (first layer to include extensive metadata in traditional systems)
  • We include VFS in our analysis – additional complexity
• Detailed analysis with extensive modifications to kernel
  • Required non-trivial instrumentation to measure lock and wait times (sketched below)
  • Extensive tuning to ensure that we measure "meaningful" cases
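
The lock/wait-time instrumentation mentioned above can be approximated in user space by timing the acquire path of a lock; the following is only a sketch of the idea (the actual analysis instruments kernel locks directly), and all names are invented for illustration.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* A mutex wrapper that accumulates how long callers waited to acquire it. */
struct timed_mutex {
    pthread_mutex_t lock;
    uint64_t wait_ns;                /* total time spent blocked in acquire */
    uint64_t acquisitions;
};

#define TIMED_MUTEX_INITIALIZER { PTHREAD_MUTEX_INITIALIZER, 0, 0 }

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

static void timed_lock(struct timed_mutex *m)
{
    uint64_t t0 = now_ns();
    pthread_mutex_lock(&m->lock);
    /* Counters are updated under the lock, so they need no extra
     * synchronization of their own. */
    m->wait_ns += now_ns() - t0;
    m->acquisitions++;
}

static void timed_unlock(struct timed_mutex *m)
{
    pthread_mutex_unlock(&m->lock);
}

static void timed_report(const struct timed_mutex *m, const char *name)
{
    double avg_us = m->acquisitions
        ? (double)m->wait_ns / (double)m->acquisitions / 1e3 : 0.0;
    printf("%s: %llu acquisitions, %.1f us average wait\n",
           name, (unsigned long long)m->acquisitions, avg_us);
}
```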

Page 16: Results and Outlook

[Figures: kfsmark CPU breakdown (1 MB files, 64 app. threads; IO-wait, user, system, interrupt, idle) and CREAT/READ throughput in thousands of ops/sec vs. #CPUs (1-16), one log per process.]

• Most FS operations do not scale with # of cores
• Two main scaling problems
  • (1) vfs locking
    • vfs uses a structure for maintaining directory entry and inode information (dentry and inode caches)
    • Synchronization over the dentry cache is problematic due to vfs design
  • (2) FS journaling
    • All modern FSs need to worry about recovery
    • Most use a journaling scheme that is integrated with the lookup/update path
    • Synchronization over this journal is hindering scaling
• Outlook
  • There is significant potential from both (1) and (2)
  • (1) is being discussed and (a) people are working on it, (b) there is potential to bypass
  • (2) is more fundamental – our goal is to target this

Page 17: Summary of Analysis

• (1) Fundamentally, I/O performance should scale
• (2) Controller: use spatial parallelism and go with technology trends
• (3) Block: worry about placement and affinity problems
• (4) FS: worry about synchronization at specific points
• Both (3) and (4) are due to current trends in multicores
  • Not broadly known problems yet

Page 18: (2) Hybrid Device Hierarchies

• To take advantage of this potential
  • Need hybrid device hierarchies using disks and SSDs (read-path sketch below)
  • Otherwise, raw performance will not be adequate
  • [FlashCache'06, BPLRU'08, …]

                           HDD (WD5001AALS-00L3B2)   SSD (Intel X25-E)
  Price/capacity ($/GB)    $0.3                      $3
  Response time (ms)       12.6                      0.17
  Throughput R/W (MB/s)    100/90                    277/202
  IOPS R/W                 150/150                   30,000/3,500

• Designed and evaluated such a base hierarchy
• Significant improvement
  • Over disks only
  • Over disks + SSDs, due to our policies
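
A highly simplified sketch of the read path of such a hierarchy, with the SSD used as a block cache in front of the disks. The names, the direct-mapped organization, and the synchronous helpers are assumptions for illustration, not the evaluated design.

```c
#include <stdint.h>
#include <stdbool.h>

#define CACHE_BLOCKS  (1u << 16)          /* 64K cached blocks (256 MB of 4 KB blocks) */
#define BLOCK_SIZE    4096                /* 4 KB blocks throughout                    */

/* One cache slot: which disk block (if any) currently lives in the
 * corresponding SSD block.  Direct-mapped for simplicity. */
struct cache_slot {
    uint64_t disk_lba;
    bool     valid;
    bool     dirty;                       /* needed for write-back policies            */
};

static struct cache_slot cache_map[CACHE_BLOCKS];

/* Provided by the lower layers (assumed): synchronous block reads/writes. */
extern void ssd_read(uint64_t ssd_lba, void *buf);
extern void ssd_write(uint64_t ssd_lba, const void *buf);
extern void hdd_read(uint64_t disk_lba, void *buf);

/* Read one block through the SSD cache. */
void cached_read(uint64_t disk_lba, void *buf)
{
    uint64_t slot = disk_lba % CACHE_BLOCKS;
    struct cache_slot *s = &cache_map[slot];

    if (s->valid && s->disk_lba == disk_lba) {
        ssd_read(slot, buf);              /* hit: serve from the SSD                   */
        return;
    }

    hdd_read(disk_lba, buf);              /* miss: go to the disk                      */
    ssd_write(slot, buf);                 /* ...and populate the cache                 */
    s->disk_lba = disk_lba;
    s->valid = true;
    s->dirty = false;
}
```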

Page 19: Summary [EuroSys'10, NAS'11]

• Transparent SSD caching promising for improving performance
• Improve SSD caching efficiency using online compression (sketch below)
  • Trade (cheap) CPU cycles for (expensive) I/O performance
  • Address challenges in online block-level compression for SSDs
• Our techniques mitigate CPU and additional I/O overheads
• Results in increased performance with realistic workloads
  • TPC-H up to 99%, PostMark up to 20%, SPECsfs2008 up to 11%
  • Cache hit ratio improves between 22%-145%
  • Increased CPU utilization by up to 4.5x
  • Low concurrency, small I/O workloads problematic
• Overall our approach worthwhile, but adds complexity…
• Future work
  • Power-performance implications interesting, hardware off-loading
  • Improving compression efficiency by grouping similar blocks
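
The compress-before-caching idea can be sketched at block granularity with zlib (assumed here purely for illustration; the actual system performs compression online in the block layer with its own metadata for locating compressed extents). The early-out for incompressible blocks is the kind of decision that keeps the cheap CPU cost from turning into extra I/O.

```c
#include <zlib.h>        /* link with -lz */
#include <stdint.h>
#include <stdbool.h>

#define BLOCK_SIZE 4096

/* Try to compress one 4 KB block before writing it to the SSD cache.
 * 'out' must hold at least compressBound(BLOCK_SIZE) bytes.  Returns true
 * and the compressed length if compression saves space, false if the block
 * should be cached uncompressed. */
bool compress_block(const uint8_t *block, uint8_t *out, unsigned long *out_len)
{
    unsigned long len = compressBound(BLOCK_SIZE);

    if (compress2(out, &len, block, BLOCK_SIZE, Z_BEST_SPEED) != Z_OK)
        return false;

    if (len >= BLOCK_SIZE)          /* incompressible: not worth it        */
        return false;

    *out_len = len;                 /* caller packs 'len' bytes on the SSD */
    return true;
}

/* Decompress on a cache hit; dst must be BLOCK_SIZE bytes. */
bool decompress_block(const uint8_t *src, unsigned long src_len, uint8_t *dst)
{
    unsigned long len = BLOCK_SIZE;
    return uncompress(dst, &len, src, src_len) == Z_OK && len == BLOCK_SIZE;
}
```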

Page 20: (3) Buffer Mgmt and Recovery Issues

• Revisit
  • Buffer mgmt in DRAM required to stage/cache I/Os
  • Recovery required due to volatility of DRAM
  • Both fundamental and related to system I/O architecture
• We design a new DRAM buffer+cache mechanism
  • (1) Allow isolation and partitioning (sketch below)
  • (2) Allow control over placement
  • (3) Deal with both fixed and variable size items
  • Similar techniques recently used for other structures in kernel [OSDI'10]

• Use it with a kernel-level FS that is stateless
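
A toy sketch of the partitioning idea behind (1): each workload (or VM) gets its own slice of the DRAM cache with its own LRU list, so one tenant cannot evict another's buffers. All names, the fixed-size pages, and the omitted hash lookup are simplifications for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE 4096

struct buf_page {
    uint64_t         block;        /* device block cached in this page     */
    struct buf_page *next, *prev;  /* position in the partition's LRU list */
    uint8_t          data[PAGE_SIZE];
};

/* One cache partition: a private page budget and a private LRU list, so
 * one workload/VM cannot evict another's buffers (isolation). */
struct partition {
    struct buf_page *mru;          /* most recently used                   */
    struct buf_page *lru;          /* next eviction candidate              */
    size_t used, budget;           /* budget must be > 0                   */
};

static void lru_remove(struct partition *p, struct buf_page *pg)
{
    if (pg->prev) pg->prev->next = pg->next; else p->mru = pg->next;
    if (pg->next) pg->next->prev = pg->prev; else p->lru = pg->prev;
    pg->next = pg->prev = NULL;
}

static void lru_insert_mru(struct partition *p, struct buf_page *pg)
{
    pg->prev = NULL;
    pg->next = p->mru;
    if (p->mru) p->mru->prev = pg; else p->lru = pg;
    p->mru = pg;
}

/* Get a page for 'block' in partition 'p', evicting only within 'p'.
 * (Lookup of an already-cached block is omitted: a real cache keeps a
 * per-partition hash table alongside the LRU list.) */
struct buf_page *partition_get_page(struct partition *p, uint64_t block)
{
    struct buf_page *pg;

    if (p->used < p->budget) {
        pg = calloc(1, sizeof(*pg));   /* grow within our own budget       */
        if (!pg) return NULL;
        p->used++;
    } else {
        pg = p->lru;                   /* recycle our own coldest page     */
        lru_remove(p, pg);
    }
    pg->block = block;
    lru_insert_mru(p, pg);
    return pg;
}
```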

Page 21: (4) Networked I/O Stack

• Host overhead for network processing significant
  • We would like to push limits for networked I/O
  • Related: TCP/IP overhead at 10 GigE, xATA over Ethernet
• Use spatial parallelism in the network
  • Multiple 10 GBit/s controllers
  • Total 80 GBit/s bi-dir over Ethernet
  • Treat as a transparent link between target and initiator
• Storage protocols not arbitrary (sketch below)
  • Request/response
  • Fixed size buffers
• How well can we do?
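
A sketch of what "request/response with fixed-size buffers" buys the protocol: every message fits a preallocated buffer of known size, so neither side negotiates lengths on the data path and the pool of buffers doubles as the flow-control window. Field names and sizes are illustrative assumptions, not the implemented wire format.

```c
#include <stdint.h>

#define NET_BLOCK_SIZE  4096   /* all data payloads are whole blocks       */
#define MAX_BLOCKS      32     /* ...and at most this many per request     */

enum msg_type { MSG_READ_REQ, MSG_WRITE_REQ, MSG_READ_RSP, MSG_WRITE_RSP };

/* Fixed-size header: initiator -> target and target -> initiator use the
 * same layout, so buffers can be preallocated and reused round-robin. */
struct msg_hdr {
    uint32_t type;             /* enum msg_type                            */
    uint32_t tag;              /* matches a response to its request        */
    uint64_t lba;              /* starting block on the target device      */
    uint32_t nblocks;          /* payload length in NET_BLOCK_SIZE units   */
    uint32_t status;           /* 0 on success (responses only)            */
};

/* One slot of the preallocated buffer pool: header followed by the largest
 * possible payload.  The pool size bounds memory on both ends and serves
 * as the flow-control window between initiator and target. */
struct msg_buf {
    struct msg_hdr hdr;
    uint8_t        payload[MAX_BLOCKS * NET_BLOCK_SIZE];
};
```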

Page 22: Results and Outlook

• Base net protocol design and implementation
• Preliminary numbers (latest)
  • Over 4.5 GBytes/s writes, 4x10GigE NICs
  • Read is about 2 GBytes/s
  • Over 160K IOPs
• Insight: Using a traditional, generic comm protocol induces overheads
  • Able to design a comm protocol that benefits from storage-specific semantics
• Target vs. initiator
  • I/O semantics not simple
  • Buffer management happens high up in the stack
  • Initiator less important (?)
• Results very encouraging

Page 23: IOLanes

[Stack diagram: benchmark applications (TPC-W, SPECjAppServer, RUBiS, LinearRoad, Tariff Advisor, TPC-H, TPC-C, PostgreSQL; replication, streaming) running in guest user space over middleware and system calls, the guest OS kernel (VFS+FS), virtualized block devices (virtio, Split-X, QEMU/KVM), the host OS (VFS+FS, on/off-load module, block devices, SCSI layers, HW device drivers, PCI driver), and storage/network controllers.]

• Overall, data intensive applications are increasing
  • Distributed, data-center type applications
  • I/O subsystem an important building block
• Main challenges
  • (1) Performance and scalability
  • (2) Extensibility and effort
• Today
  • Few disks per cpu/core (e.g. two)
  • Any new feature or adaptation in the stack remarkably complex
• IOLanes
  • (1) Identify bottlenecks
  • (2) Build better stack
  • (3) Allow for easier extensibility

Page 24: Specific Challenges

• Scaling the I/O stack across all system layers on multicore CPUs
• Interaction of the I/O paths of multiple isolated virtual machines
• Use cycles offered by multicores to offer more "machinery" and optimize online
• Evaluation with realistic workloads
• Full stack monitoring and analysis

Page 25: Outline

• Modern application stacks
  • Stream processing (STREAM)
  • Transaction processing (CumuloNimbo)
• Storage technologies
  • Storage virtualization and scaling with multicores (IOLanes)
  • Parallel I/O (SCALUS)
  • Abstractions for modern applications (CumuloNimbo)

• Remarks

Page 26: Dimension Infrastructure Properly

[Infrastructure diagram: 1000s of application servers (different flavors of PCs/blades; multicores + memory + I/O xput) connected over a 10-40 Gbit/s high-speed interconnect to 100s of file servers; each file server has disk controllers (~2 GB/s), 12-36 SATA disks per node (100 MBy/s, ~2 TBytes each), a +10% SSD cache, and 10-100 Gbit/s links.]

• Dimensioning issues not straightforward today
  - I/O application overheads not understood
  - Do you balance thin or fat?
  - Other factors besides performance: power

Page 27: Scaling Beyond Single Node Requires

• Namespace management
• Distributed recovery, mostly for metadata
• Distributed DRAM caching, at the client side
• Understanding scaling overheads (efficiency)

Page 28: Namespace Management

• Need to go from (filename, offset) to (node, device, object block) — see the sketch below
  • This requires translation metadata
• Metadata cannot be co-located with file/object data, if we need to scale single-file performance
  • This requires distributed lookup
  • Also, updates can be complicated
• Would be interesting to separate from the rest of data storage
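
A toy sketch of the translation path above: hash the name to pick the metadata server that owns the file's translation metadata, then map (file, offset) to (node, device, object block). The hash, extent size, and striping rule are illustrative assumptions, not a proposed design; a real system would fetch the layout from the chosen metadata server rather than compute it.

```c
#include <stdint.h>

#define EXTENT_SIZE   (1u << 20)     /* 1 MB extents                        */
#define STRIPE_WIDTH  8              /* extents round-robin over 8 nodes    */

struct location {                    /* where one extent physically lives   */
    uint32_t node;
    uint32_t device;
    uint64_t object_block;
};

/* Simple FNV-1a hash: used to spread names over metadata servers. */
static uint64_t name_hash(const char *name)
{
    uint64_t h = 1469598103934665603ull;
    while (*name) { h ^= (uint8_t)*name++; h *= 1099511628211ull; }
    return h;
}

/* Which metadata server owns the translation metadata for this file. */
uint32_t metadata_server_for(const char *path, uint32_t num_mds)
{
    return (uint32_t)(name_hash(path) % num_mds);
}

/* Translate (file, offset) to a physical location.  Here the layout is
 * computed from a deterministic striping rule purely to keep the example
 * self-contained. */
struct location translate(const char *path, uint64_t offset)
{
    uint64_t extent = offset / EXTENT_SIZE;
    struct location loc = {
        .node         = (uint32_t)((name_hash(path) + extent) % STRIPE_WIDTH),
        .device       = (uint32_t)(extent % 2),     /* 2 devices per node   */
        .object_block = extent / STRIPE_WIDTH,
    };
    return loc;
}
```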

Page 29: Distributed Recovery

• Single-node recovery not enough when data is spread out
  • Some layer will need to do it
  • Part of the storage system or the application middleware
• It probably means that storage nodes and application nodes will need to be separate tiers
  • Fewer storage nodes and more application nodes
  • Recovery protocol will only involve (hopefully) storage nodes
• Some form of transactional API to storage seems right (sketch below)
  • Not simply read/write any more
  • Versioning vs. logging approaches
• Will involve some agreement protocol for all nodes involved in an operation due to striping, replication, metadata/data, etc.
  • New mechanism for the common path
  • Much more complicated than traditional systems, with either centralized controllers or centralized metadata servers
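
A sketch of what "some form of transactional API to storage" could look like from the client's side: begin, buffer writes, commit as one atomic, versioned unit. The signatures below are illustrative assumptions only, not a defined interface.

```c
#include <stdint.h>
#include <stddef.h>

/* Opaque handle for one storage transaction. */
typedef struct storage_txn storage_txn_t;

/* Begin a transaction against a (possibly striped/replicated) volume.
 * All writes issued through the handle become visible atomically at
 * commit, tagged with a new version number. */
storage_txn_t *txn_begin(const char *volume);

/* Buffer a write of 'len' bytes at 'offset' into object 'oid'.  Nothing
 * reaches the devices (or the replicas) before commit. */
int txn_write(storage_txn_t *t, uint64_t oid, uint64_t offset,
              const void *buf, size_t len);

/* Commit: runs the agreement protocol among all storage nodes touched by
 * the transaction (striping, replication, metadata+data) and returns the
 * version that recovery can roll back to or forward from. */
int txn_commit(storage_txn_t *t, uint64_t *committed_version);

/* Abort: discard all buffered writes. */
void txn_abort(storage_txn_t *t);

/* Read the latest committed version (reads never see uncommitted data). */
int obj_read(const char *volume, uint64_t oid, uint64_t offset,
             void *buf, size_t len);
```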

Page 30: Distributed DRAM Caching

• Traditionally, a cache exists as close to the application node as possible
  • In the file client
• This is problematic
  • For recovery
  • For scaling to many application nodes
• Two possibilities
  • (1) Do client-side caching but avoid write-back (sketch below)
  • (2) Do not do client-side caching and use a single-object-owner approach at a next (storage) tier
• Both seem good approaches
  • (1) relies on a "smarter" I/O path
  • (2) relies on "smarter/faster" networks between application/file client and storage node
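
Option (1) in miniature: the client keeps a read cache but writes go straight through to the storage tier, so a client crash loses no dirty state. The structure and helper names are assumptions for illustration; cross-client invalidation of clean copies is deliberately left out.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define CLIENT_CACHE_SLOTS (1u << 16)
#define BLOCK_SIZE 4096

struct cache_entry {
    uint64_t block;
    bool     valid;
    uint8_t  data[BLOCK_SIZE];
};

static struct cache_entry client_cache[CLIENT_CACHE_SLOTS];

/* Provided by the networked storage tier (assumed), which owns each block. */
extern int storage_read(uint64_t block, void *buf);
extern int storage_write(uint64_t block, const void *buf);

int client_read(uint64_t block, void *buf)
{
    struct cache_entry *e = &client_cache[block % CLIENT_CACHE_SLOTS];

    if (e->valid && e->block == block) {          /* read hit              */
        memcpy(buf, e->data, BLOCK_SIZE);
        return 0;
    }
    if (storage_read(block, buf) != 0)            /* miss: fetch and cache */
        return -1;
    e->block = block;
    e->valid = true;
    memcpy(e->data, buf, BLOCK_SIZE);
    return 0;
}

int client_write(uint64_t block, const void *buf)
{
    struct cache_entry *e = &client_cache[block % CLIENT_CACHE_SLOTS];

    if (storage_write(block, buf) != 0)           /* write-through first   */
        return -1;
    e->block = block;                             /* then keep a clean copy */
    e->valid = true;
    memcpy(e->data, buf, BLOCK_SIZE);
    return 0;
}
```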

Page 31: Efficiency: Ultimately it is all about power

• Today, people do not pay much attention to the cost of scaling
  • Goal is to scale performance
  • Experimental setups with 1-2 disks per node and many nodes for scaling I/O are common
  • This is very poor efficiency (CPU-to-disk ratio; consider power)
• How much are you willing to pay for scaling?
  • Start from a base, optimized I/O stack like the one I have described
  • If we can scale and each I/O subsystem operates at its best rate, we are fine
  • Essentially, the cost of scaling should not be too high (or, ideally, visible at all) going from one to many nodes
  • This is not true today, by far…
• Ultimately power will force everyone to look into this
  • Or, only a few applications will be able to pay for it
  • Analogy: SANs today work, but they cost

Page 32: "Machinery" for distribution

• All previous mechanisms require "machinery" that is expensive
• We need to come up with distributed I/O approaches that do all processing more efficiently
  • We have or can assume a lot of concurrency, so there is always work
  • This is more about being asynchronous all the time and using DRAM as a buffer to not starve any other resource
• Design systems that wait only when I/O xput is exhausted
  • No application should be I/O bound!
  • …with high-throughput devices and system interconnects in modern and future systems
• Efficiency will matter at some point
  • Even for apps that are able to scale and achieve their perf goals
• We need to understand
  • Mechanisms required for scaling and their overheads
  • Who should do what in the distributed I/O path
  • Different appl domains will resolve tradeoffs in overheads, semantics

Page 33: Where Should Each Op Go in the I/O Path?

• (1) Everything in the file system (most prevalent today)
  • Has to be provided by every filesystem
  • The world will have many filesystems
  • Some problems, e.g. consistent client caching, inherently difficult (not scalable)
  • Try using GPFS (not to mention extending it…)
• (2) Why not be closer to traditional SAN/NAS?
  • Let's do reliability and availability as SAN
  • File operations and scaling as NAS
  • Requires distributed block-level consistency and atomicity… at the infrastructure level (kernel, firmware, …)
  • Not clear this is the way to go…
• (3) Other alternatives? Who knows…

[Diagram: file servers running NAS (NFS/CIFS), an FS layer, and a block I/O stack, connected over the network to storage nodes running a block-level stack; the I/O path design and implementation spans both tiers.]

Page 34: Outline

• Modern application stacks
  • Stream processing (STREAM)
  • Transaction processing (CumuloNimbo)
• Storage technologies
  • Storage virtualization and scaling with multicores (IOLanes)
  • Parallel I/O (SCALUS)
  • Abstractions for modern applications (CumuloNimbo)

• Summary

Page 35: CumuloNimbo Global Architecture

[Architecture diagram, repeated from Page 5: a layered stack (JEE application server: JBoss+Hibernate; object cache: CumuloCache; query engine: Derby; column-oriented data store & block cache: HBASE; distributed file system: HDFS; storage; communication), with transaction management components (transactions, concurrency controllers, commit sequencers, loggers) and elasticity management components (self-provisioner, monitors, load balancers) alongside.]

Page 36: State of the Art

• Key-value data stores gaining significance
  • Supporting arbitrary variable-size keys and values
• Distributed key-value stores used increasingly
  • HBase is a component of the CumuloNimbo architecture
  • Also, other s/w stacks are built on top of key-value stores
• To access persistent storage, such systems are built today on top of traditional file systems
  • However, semantics of the underlying system differ in fundamental ways

Page 37: Key-value Store vs. FS Mismatch

• Hard to map mutable variable-size keys/values to files
  • Key-based indexing vs. offset-based indexing in the presence of variable-size values
• Data placement on local/networked storage devices cannot take advantage of the semantics of key/value stores
  • Information that has been provided by the application is thrown away during mapping to flat files
• Local file systems offer limited recovery/availability guarantees
  • Last-write recovery expensive, no data consistency guarantees
• Significant performance overheads and scalability limitations
  • When scaling to large amounts of storage and high rates

Page 38: Our Goal

• Raise the abstraction of traditional locally managed persistent storage using a native key-value API (sketch below)
  • Support mutable variable-length items – important for workloads that incur frequent updates
  • Perform all operations required (packing, cleanup) for dealing with variable-size items over fixed block-size persistent devices
  • Optimize device use based on the importance of data items
  • Ensure consistency of the data store after a failure based on configurable workload requirements
  • Use tunable data replication for availability purposes
• Separate distributed aspects from efficiency at the local level
  • Synergies can be important for performance, e.g. recovery mechanism
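
To make the abstraction concrete, here is a sketch of what such a native key-value API over raw devices might expose: mutable variable-length values, per-operation durability, and tunable replication. All names, flags, and signatures are illustrative assumptions, not the project's actual interface.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct kv_store kv_store_t;    /* one store over a set of devices   */

enum kv_durability {
    KV_DUR_NONE,          /* buffered in DRAM only                          */
    KV_DUR_ORDERED,       /* persisted in order, eventually                 */
    KV_DUR_SYNC           /* on stable storage before the call returns      */
};

struct kv_options {
    enum kv_durability durability;   /* configurable per workload           */
    unsigned replicas;               /* tunable replication for availability */
    unsigned importance;             /* hint: hot items may live on SSD     */
};

/* Open a store directly on block devices (no file system underneath). */
kv_store_t *kv_open(const char *const *devices, unsigned ndevices);

/* Insert or update a mutable, variable-length value.  The store does the
 * packing of variable-size items onto fixed-size device blocks, and the
 * background cleanup/compaction that updates make necessary. */
int kv_put(kv_store_t *s, const void *key, size_t klen,
           const void *val, size_t vlen, const struct kv_options *opt);

/* Look up a value; *vlen carries the buffer size in and the value size out. */
int kv_get(kv_store_t *s, const void *key, size_t klen,
           void *val, size_t *vlen);

int kv_delete(kv_store_t *s, const void *key, size_t klen);

/* After a crash, bring the store back to its last consistent state. */
int kv_recover(kv_store_t *s);

void kv_close(kv_store_t *s);
```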

Page 39: Storage Layer Architecture

Page 40: Outline

• Modern application stacks
  • Stream processing (STREAM)
  • Transaction processing (CumuloNimbo)
• Storage technologies
  • Storage virtualization and scaling with multicores (IOLanes)
  • Abstractions for modern applications (CumuloNimbo)
  • Parallel I/O (SCALUS)

• Remarks

Page 41: The role of persistent I/O

• Required to keep user data
  • Data generated and used at different times (and over long periods)
  • Tolerate failures
• Persistence of control information (metadata)

• Both emerge as problems

Page 42: Data

• Many applications today in data centers require large amounts of data
• "Waste" in today's architectures
  • Getting data from persistent devices to memory requires complex namespace operations, which lead to significant resource utilization
  • Contrast this to memory accesses, which are simpler in nature
  • Systems have been built to tolerate high response times
    • Results in more work per I/O
  • Virtualization introduces significant overheads for I/O
    • But important for isolation among workloads and environments

Page 43: Metadata

• Examples
  • In a filesystem: inodes and dentries
  • In a tuple store: hash tables and B-trees for indexing
  • At block level (e.g. FTL): logical-to-physical (re)mapping tables
• Equally important to data
  • In some cases even more so
  • Many systems can afford to be sloppy about data, but not metadata
• Footprint
  • Metadata needs to be kept in memory for performance purposes
  • Sophisticated (and application-specific) caching techniques
  • Otherwise the number of I/Os per user I/O increases dramatically
• Persistence
  • Remaining consistent at failures is of paramount importance
  • But DRAM is not persistent => complex write management techniques
• Many system, middleware, and application layers need to handle metadata, resulting in multiple times these inefficiencies

Page 44: Today

• Persistence is "heavy" due to device/controller technology
• Persistence not designed with multicores in mind
• Persistence inefficient when scaling across nodes

• Persistence incurs overheads in multiple layers

Page 45: What can we do?

• Persistent I/O should "get closer" to the CPU
  • Namespace issues should be simpler
  • Transfers between persistent and non-persistent stages of memory should be more efficient
  • Role of access granularity
• Architectures should better support persistence for metadata
  • Treating data and metadata the same is a very inefficient simplification
• Understand overheads and scaling characteristics on modern systems
  • How many cycles of processing per I/O does a data-centric application need?

Page 46: Summary

• (1) Memory hierarchy work to bring persistence closer to the CPU
  • Profound changes – impact all layers
  • Achieving efficiency with device technology
• (2) I/O path evolution to scale with # cores
  • Current systems not designed with this in mind
  • As cores increase, base I/O performance does not scale
  • Virtualization overheads/contention exacerbate this
  • Energy proportionality
• (3) Persistent I/O needs to scale efficiently with # nodes
  • Extensive additional "machinery" today at system and middleware level to achieve scaling => incurs high overhead and impacts efficiency
  • E.g. heartbeats and replication not compatible with energy efficiency

Page 47: Acknowledgements

• People
  • Shoaib Akram
  • Konstantinos Chassapis
  • Michail Flouris
  • Markos Foundoulakis
  • Dhiraj Gulati
  • Yiannis Klonatos
  • Kostas Magoutis
  • Thanos Makatos
  • Manolis Marazakis
  • Stelios Mavridis
  • Zoe Sebepou
• Funding agencies
  • EC: SIVSS, SCALUS, IOLANES, CumuloNimbo, STREAM, HiPEAC
  • GSRT: National research office
• Many partners and colleagues