
Designing High-Performance, Resilient and Heterogeneity-Aware Key-Value Storage for Modern HPC Clusters

Dipti Shankar

Dr. Dhabaleswar K. Panda (Advisor), Dr. Xiaoyi Lu (Co-Advisor)

Department of Computer Science & Engineering, The Ohio State University

SC 2018 Doctoral Showcase, Network-Based Computing Laboratory

Outline

• Introduction and Problem Statement

• Research Highlights and Results
• Broader Impact

• Conclusion & Future Avenues


Key-Value Storage in HPC and Data Centers
• General-purpose distributed memory-centric storage
– Allows aggregating spare memory from multiple nodes (e.g., Memcached)

• Accelerating Online and Offline Analytics in High-Performance Computing (HPC) environments

• Our basis: current high-performance, hybrid key-value stores for modern HPC clusters

• High-performance network interconnects (e.g., InfiniBand)
– Low end-to-end latencies with IP-over-InfiniBand (IPoIB) and Remote Direct Memory Access (RDMA)
• 'DRAM+SSD' hybrid memory designs
– Extend storage capacity beyond DRAM using high-speed SSDs

[Figure: Typical Memcached deployment: web frontend servers (Memcached clients) connect over high-performance networks to Memcached servers and to database servers; the key-value tier serves online analytical workloads as an OLTP/NoSQL query cache and offline analytical workloads as a software-assisted burst-buffer.]


Drivers of Modern HPC Cluster Architectures

• Multi-core/many-core technologies

• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)

• Solid State Drives (e.g., PCIe/NVMe-SSDs), NVRAM (e.g., PCM, 3DXpoint), Parallel Filesystems (e.g., Lustre)

• Accelerators (e.g., NVIDIA GPGPUs)

• Production-scale HPC Clusters: SDSC Comet, TACC Stampede, OSC Owens, etc.

[Figure: Building blocks of modern HPC clusters: high-performance interconnects; multi-core processors with vectorization support plus accelerators (GPUs); large-memory nodes (DRAM); SSD, NVMe-SSD, NVRAM.]


Holistic Approach for HPC-Centric KVS
• Our focus: a holistic approach to designing high-performance and resilient key-value storage for current and emerging HPC systems
– RDMA-enabled networking
– Hybrid NVM storage awareness: high-speed NVMe-/PCIe-SSDs and upcoming NVRAM technologies
– Heterogeneous compute devices (e.g., SIMD capabilities of CPUs and GPUs)
• Goals: maximize end-to-end performance to
– Optimally exploit available compute/storage/network capabilities
– Enable data-intensive Big Data applications to leverage memory-centric key-value storage to improve their overall performance


Research Framework

[Figure: Research framework, top to bottom:
• Application workloads: Online (e.g., SQL/NoSQL query cache, LinkBench and TAO graph KV workloads) and Offline (e.g., burst-buffer over PFS for Hadoop MapReduce/Spark)
• High-performance, hybrid and resilient key-value storage: non-blocking RDMA-aware API extensions; fast online erasure coding (reliability); NVRAM-aware communication protocols (persistence); accelerations on SIMD-based (e.g., GPU) architectures (scalability)
• Heterogeneity-aware (DRAM/NVRAM/NVMe/PFS) key-value storage engine
• Modern HPC system architecture: volatile/non-volatile memory technologies (DRAM, NVRAM); multi-core nodes (w/ SIMD units); GPUs; RDMA-capable networks (InfiniBand, 40GbE RoCE); parallel file system (Lustre); local storage (PCIe SSD/NVMe-SSD)]


High-Performance Non-Blocking API Semantics

• Heterogeneous storage-aware key-value stores (e.g., 'DRAM + PCIe/NVMe-SSD')
– Higher data retention at the cost of SSD I/O; suitable for out-of-memory scenarios
– Performance limited by blocking API semantics
• Goal: achieve near in-memory speeds while still exploiting hybrid memory
• Approach: novel non-blocking API semantics extending the RDMA-Libmemcached library (a usage sketch follows the reference below)
– memcached_(iset/iget/bset/bget) APIs for SET/GET
– memcached_(test/wait) APIs for progressing communication

D. Shankar, X. Lu, N. Islam, M. W. Rahman, and D. K. Panda, “High-Performance Hybrid Key-Value Store on Modern Clusters with RDMA Interconnects and SSDs: Non-blocking Extensions, Designs, and Benefits”, IPDPS 2016
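To make the non-blocking flow concrete, here is a minimal usage sketch in C of the memcached_iset and memcached_test/wait extensions named above. The request-handle type and exact signatures are assumptions for illustration, loosely modeled on standard libmemcached conventions; consult the RDMA-Libmemcached headers for the actual API.

/* Sketch only: memcached_request_st, memcached_iset, memcached_test and
 * memcached_wait are assumed extension names/signatures, not verified API. */
#include <libmemcached/memcached.h>

void nonblocking_set_sketch(memcached_st *memc,
                            const char *key, size_t key_len,
                            const char *val, size_t val_len)
{
    memcached_request_st req;   /* assumed handle for an in-flight request */

    /* Issue the SET and return immediately; a blocking memcached_set would
     * stall here while the server completes a possible SSD write. */
    memcached_return_t rc = memcached_iset(memc, key, key_len, val, val_len,
                                           /*expiration=*/0, /*flags=*/0,
                                           &req);
    if (rc != MEMCACHED_SUCCESS)
        return;

    /* ... overlap useful work, e.g., issue further iset/iget requests ... */

    /* Progress the request: poll with memcached_test, or block in
     * memcached_wait until the reply arrives. */
    while (memcached_test(memc, &req) != MEMCACHED_SUCCESS) {
        /* do other work while server-side SSD I/O latency is hidden */
    }
}

The iget and the bset/bget variants named on the slide would follow the same issue-then-test/wait pattern.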

[Figure: Blocking vs. non-blocking API flows between a Libmemcached client and a Memcached server with a hybrid slab manager (RAM+SSD), both sides using the RDMA-enhanced communication library; the non-blocking flows (NonB-i, NonB-b) decouple request and reply and progress via test/wait.]


High-Performance Non-Blocking API Semantics

• Set/Get latency with non-blocking APIs: up to 8x improvement in overall latency vs. blocking API semantics over the RDMA+SSD hybrid design
• Up to 2.5x throughput gain observed at the client; the ability to overlap request and response phases hides SSD I/O overheads

[Figure: Average Set/Get latency (us) for IPoIB-Mem, RDMA-Mem, H-RDMA-Def, H-RDMA-Opt-Block, H-RDMA-Opt-NonB-i, and H-RDMA-Opt-NonB-b, broken down into miss penalty (backend DB access overhead), client wait, server response, cache update, cache check+load (memory and/or SSD read), and slab allocation (w/ SSD write on out-of-memory); the non-blocking variants overlap SSD I/O access and approach in-memory latency.]


Fast Online Erasure Coding with RDMA
• Erasure Coding (EC): a low storage-overhead alternative to replication
• Bottlenecks: (1) encode/decode compute overheads, and (2) communication overhead of scattering/gathering distributed data/parity chunks
• Goal: make online EC viable for key-value stores
• Approach: non-blocking RDMA-aware semantics to enable compute/communication overlap
• Encode/decode offload capabilities integrated into the Memcached client (CE/CD) and server (SE/SD)

[Figure: EC performance vs. memory overhead trade-off: no fault tolerance and replication sit at the high-performance but memory-costly end, erasure coding is memory-efficient, and the ideal scenario combines high performance with memory efficiency.]

RS(3,2) incurs 66% storage overhead, vs. 200% for replication with factor 3
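As a quick check of the storage-overhead claim above, the numbers follow directly from the coding parameters; this is the generic Reed-Solomon RS(k,m) calculation (k data chunks plus m parity chunks), not anything specific to this design:

\[
  \text{overhead}_{\mathrm{RS}(k,m)} = \frac{m}{k}
  \;\Rightarrow\;
  \text{overhead}_{\mathrm{RS}(3,2)} = \frac{2}{3} \approx 66\%,
  \qquad
  \text{overhead}_{\mathrm{Rep}(r)} = r - 1
  \;\Rightarrow\;
  \text{overhead}_{\mathrm{Rep}(3)} = 2 = 200\%.
\]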


Fast Online Erasure Coding with RDMA

D. Shankar, X. Lu, and D. K. Panda, “High-Performance and Resilient Key-Value Store with Online Erasure Coding for Big Data Workloads”, 37th International Conference on Distributed Computing Systems (ICDCS 2017)

[Figure: Left: total Set latency (us) vs. value size (512B to 1M) comparing Sync-Rep=3, Async-Rep=3, Era(3,2)-CE-CD, Era(3,2)-SE-SD, and Era(3,2)-SE-CD, with the Era(3,2) designs up to ~1.6x to ~2.8x faster. Right: aggregate YCSB throughput (Kops/sec) for YCSB-A and YCSB-B at 4K/16K/32K value sizes comparing Memc-RDMA-NoRep, Async-Rep=3, Era(3,2)-CE-CD, and Era(3,2)-SE-CD, with up to ~1.34x to ~1.5x gains.]

• Experiments with YCSB comparing online EC vs. asynchronous replication:
– 150 clients on 10 nodes of the SDSC Comet cluster (IB FDR + 24-core Intel Haswell) over a 5-node RDMA-Memcached cluster
– (1) CE-CD gains ~1.34x for update-heavy workloads, with SE-CD on par; (2) CE-CD and SE-CD are on par for read-heavy workloads
• Set latency measured on an Intel Westmere + QDR-IB cluster: Rep=3 vs. RS(3,2)


Co-Designing Key-Value Store-based Burst Buffer over PFS
• Offline data analytics use case: a hybrid and resilient key-value store-based burst-buffer system over Lustre (Boldio)
– Overcomes local storage limitations on HPC nodes while retaining the performance of 'data locality'
– Lightweight, transparent interface to Hadoop/Spark applications (a simplified write-path sketch follows the reference below)
• Accelerating I/O-intensive Big Data workloads
– Non-blocking RDMA-Libmemcached APIs to maximize overlap
– Client-based replication or online erasure coding with RDMA for resilience
– Asynchronous persistence to the Lustre parallel file system at the RDMA-Memcached servers
D. Shankar, X. Lu, D. Panda, “Boldio: A Hybrid and Resilient Burst-Buffer over Lustre for Accelerating Big Data I/O”, IEEE International Conference on Big Data 2016 (Short Paper)
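To illustrate the idea of a file-style interface layered on the key-value store, here is a hypothetical write-path sketch in C: file data is chunked and stored as key-value pairs whose keys encode the path and block index. The key layout, chunk size, and helper name are illustrative assumptions, not Boldio's implementation (which additionally uses the non-blocking APIs, replication/EC, and asynchronous Lustre persistence described above); only standard libmemcached calls are used here.

/* Hypothetical burst-buffer write path: stage a file write as key-value
 * SETs. A blocking memcached_set is shown for simplicity. */
#include <libmemcached/memcached.h>
#include <stdio.h>
#include <string.h>

#define CHUNK_SIZE (512 * 1024)   /* assumed staging granularity */

int bb_write(memcached_st *memc, const char *path,
             const char *buf, size_t len)
{
    char key[256];
    size_t off = 0;

    for (int blk = 0; off < len; blk++, off += CHUNK_SIZE) {
        size_t n = (len - off < CHUNK_SIZE) ? (len - off) : CHUNK_SIZE;
        snprintf(key, sizeof(key), "%s:blk%d", path, blk);  /* assumed key scheme */
        memcached_return_t rc = memcached_set(memc, key, strlen(key),
                                              buf + off, n,
                                              /*expiration=*/0, /*flags=*/0);
        if (rc != MEMCACHED_SUCCESS)
            return -1;   /* caller may fall back to writing the PFS directly */
    }
    return 0;
}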


Co-Designing Key-Value Store-based Burst Buffer over PFS

• TestDFSIO on the SDSC Gordon cluster (16-core Intel Sandy Bridge, IB QDR) with a 16-node MapReduce cluster + 4-node Boldio cluster
• Boldio sustains 3x and 6.7x gains in read and write throughput, respectively, over stand-alone Lustre

[Figure: Left: aggregate TestDFSIO throughput (MBps) for 60 GB and 100 GB writes/reads, Lustre-Direct vs. Boldio (6.7x write, 3x read). Right: aggregate DFSIO throughput (MBps) for write/read comparing Lustre-Direct, Alluxio-Remote, Boldio_Async-Rep=3, and Boldio_Online-EC; online EC adds no overhead relative to Rep=3 (RS(3,2)).]

• TestDFSIO on an Intel Westmere cluster (8-core Intel Sandy Bridge and IB QDR); 8-node MapReduce cluster + 5-node Boldio cluster over Lustre
• Performance gains over designs like Alluxio (formerly Tachyon) in HPC environments with no local storage


Exploring Opportunities with NVRAM and RDMA

[Figure: Two nodes (initiator/target), each with cores (L1/L2), a shared L3, a memory controller fronting DRAM and NVRAM, an I/O controller, and an HCA on the PCIe bus; the durability path is shown for RDMA into and from NVRAM, along with the local durability point.]
• Emerging non-volatile memory technologies (NVRAM)

• Potential: byte-addressable and persistent; accessible via RDMA
• Observation: RDMA writes into NVRAM need to guarantee remote durability
• Appliance method: hardware-assisted remote direct persistence
• General-purpose server method: server-assisted, software-based persistence (a minimal sketch follows the reference below)

• Opportunities: designing high-performance RDMA-based persistence protocols (e.g., a persistent in-memory KVS over NVRAM)
D. Shankar, X. Lu, D. Panda, “RDMP-KVS: Remote Direct Memory Persistence Aware Key-Value Store for NVRAM Systems” (Under Submission)
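A minimal sketch, assuming x86 cache-flush instructions and an NVRAM-mapped region, of the general-purpose server method mentioned above: once the server learns that an RDMA write has landed in the NVRAM-backed buffer, it flushes the affected cache lines and fences before acknowledging durability to the client. The function names and notification path are illustrative assumptions, not the RDMP-KVS design (compile with CLFLUSHOPT support, e.g., gcc -mclflushopt).

/* Server-assisted, software-based persistence sketch for RDMA writes into
 * an NVRAM-backed region. */
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

#define CACHELINE 64

/* Flush [addr, addr+len) from the CPU caches, then order the flushes so the
 * data reaches the persistence domain before we acknowledge. */
static void persist_range(const void *addr, size_t len)
{
    uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
    uintptr_t end = (uintptr_t)addr + len;

    for (; p < end; p += CACHELINE)
        _mm_clflushopt((void *)p);
    _mm_sfence();
}

/* Assumed hook: called when the server is notified (e.g., via a completion
 * or immediate data) that 'len' bytes were RDMA-written at 'dst' inside the
 * NVRAM-mapped key-value region. */
void on_rdma_write_complete(void *dst, size_t len)
{
    persist_range(dst, len);
    /* ... now send the durability acknowledgement back to the client ... */
}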


Broader Impact: The High-Performance Big Data (HiBD) Project
• RDMA for Apache Spark (RDMA-Spark), RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x), RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
– RDMA-aware 'DRAM+SSD' hybrid Memcached server design
– Non-blocking RDMA-based client API designs (RDMA-Libmemcached)
– Based on Memcached 1.5.3 and Libmemcached client 1.0.18
– Available for InfiniBand and RoCE
• OSU HiBD-Benchmarks (OHB)
– Memcached Set/Get micro-benchmarks for blocking and non-blocking APIs, and for hybrid Memcached designs
– YCSB plugin for RDMA-Memcached
– Also includes HDFS, HBase, and Spark micro-benchmarks
• http://hibd.cse.ohio-state.edu
• User base: 290 organizations, 34 countries, 27,950 downloads


Conclusion & Future Avenues
• Holistic approach to designing key-value storage by exploiting the capabilities of HPC clusters for (1) performance, (2) scalability, and (3) data resilience/availability
• RDMA-capable networks: (1) proposed non-blocking RDMA-based Libmemcached APIs, (2) fast online EC-based RDMA-Memcached designs
• Heterogeneous storage awareness: (1) leverage 'RDMA+SSD' hybrid designs, (2) 'RDMA+NVRAM' persistent key-value storage
• Application co-design: memory-centric data-intensive applications on HPC clusters
– Online (e.g., SQL query cache, YCSB) and offline data analytics (e.g., Boldio burst-buffer for Hadoop I/O)
• Future work: ongoing work in this thesis direction
– Heterogeneous compute capabilities of CPUs/GPUs: end-to-end SIMD-aware KVS designs
– Exploring co-design of (1) read-intensive graph-based workloads (e.g., LinkBench, RedisGraph) and (2) a key-value storage engine for ML parameter server frameworks


shankar.50@osu.edu

http://www.cse.ohio-state.edu/~shankard

Thank You!

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
