
Designing High-Performance, Resilient and Heterogeneity-Aware Key-Value Storage for Modern HPC Clusters

Dipti Shankar

Dr. Dhabaleswar K. Panda (Advisor), Dr. Xiaoyi Lu (Co-Advisor)

Department of Computer Science & Engineering, The Ohio State University

SC 2018 Doctoral Showcase, Network-Based Computing Laboratory

Outline

• Introduction and Problem Statement

• Research Highlights and Results
• Broader Impact

• Conclusion & Future Avenues


Key-Value Storage in HPC and Data Centers
• General-purpose distributed memory-centric storage
– Allows aggregating spare memory from multiple nodes (e.g., Memcached)

• Accelerating Online and Offline Analytics in High-Performance Computing (HPC) environments

• Our basis: current high-performance, hybrid key-value stores for modern HPC clusters

• High-performance network interconnects (e.g., InfiniBand)
– Low end-to-end latencies with IP-over-InfiniBand (IPoIB) and Remote Direct Memory Access (RDMA)
• 'DRAM+SSD' hybrid memory designs
– Extend storage capacity beyond DRAM using high-speed SSDs

[Figure: Typical Memcached deployment: web frontend servers (Memcached clients) connect over high-performance networks to Memcached servers and to database servers; the key-value tier serves online analytical workloads as an OLTP/NoSQL query cache and offline analytical workloads as a software-assisted burst-buffer.]


Drivers of Modern HPC Cluster Architectures

• Multi-core/many-core technologies

• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)

• Solid State Drives (e.g., PCIe/NVMe-SSDs), NVRAM (e.g., PCM, 3DXpoint), Parallel Filesystems (e.g., Lustre)

• Accelerators (e.g., NVIDIA GPGPUs)

• Production-scale HPC Clusters: SDSC Comet, TACC Stampede, OSC Owens, etc.

[Figure: Building blocks of modern HPC clusters: high-performance interconnects; multi-core processors with vectorization support plus accelerators (GPUs); large-memory nodes (DRAM); SSD, NVMe-SSD, NVRAM.]


Holistic Approach for HPC-Centric KVS
• Our focus: a holistic approach to designing high-performance and resilient key-value storage for current and emerging HPC systems
– RDMA-enabled networking
– Hybrid NVM storage awareness: high-speed NVMe-/PCIe-SSDs and upcoming NVRAM technologies
– Heterogeneous compute devices (e.g., SIMD capabilities of CPUs and GPUs)
• Goals: maximize end-to-end performance to
– Optimally exploit available compute/storage/network capabilities
– Enable data-intensive Big Data applications to leverage memory-centric key-value storage to improve their overall performance


Research Framework

[Figure: Research framework, top to bottom:
• Application workloads: Online (e.g., SQL/NoSQL query cache, LinkBench and TAO graph KV workloads) and Offline (e.g., burst-buffer over PFS for Hadoop MapReduce/Spark)
• High-performance, hybrid and resilient key-value storage: non-blocking RDMA-aware API extensions; fast online erasure coding (reliability); NVRAM-aware communication protocols (persistence); accelerations on SIMD-based (e.g., GPU) architectures (scalability)
• Heterogeneity-aware (DRAM/NVRAM/NVMe/PFS) key-value storage engine
• Modern HPC system architecture: volatile/non-volatile memory technologies (DRAM, NVRAM); multi-core nodes (w/ SIMD units); GPUs; RDMA-capable networks (InfiniBand, 40GbE RoCE); parallel file system (Lustre); local storage (PCIe SSD/NVMe-SSD)]


High-Performance Non-Blocking API Semantics

• Heterogeneous storage-aware key-value stores (e.g., 'DRAM + PCIe/NVMe-SSD')
– Higher data retention at the cost of SSD I/O; suitable for out-of-memory scenarios
– Performance limited by blocking API semantics
• Goal: achieve near in-memory speeds while still exploiting hybrid memory
• Approach: novel non-blocking API semantics extending the RDMA-Libmemcached library (a usage sketch follows the reference below)
– memcached_(iset/iget/bset/bget) APIs for SET/GET
– memcached_(test/wait) APIs for progressing communication

D. Shankar, X. Lu, N. Islam, M. W. Rahman, and D. K. Panda, “High-Performance Hybrid Key-Value Store on Modern Clusters with RDMA Interconnects and SSDs: Non-blocking Extensions, Designs, and Benefits”, IPDPS 2016
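To make the non-blocking flow concrete, here is a minimal usage sketch in C of the memcached_iset and memcached_test/wait extensions named above. The request-handle type and exact signatures are assumptions for illustration, loosely modeled on standard libmemcached conventions; consult the RDMA-Libmemcached headers for the actual API.

/* Sketch only: memcached_request_st, memcached_iset, memcached_test and
 * memcached_wait are assumed extension names/signatures, not verified API. */
#include <libmemcached/memcached.h>

void nonblocking_set_sketch(memcached_st *memc,
                            const char *key, size_t key_len,
                            const char *val, size_t val_len)
{
    memcached_request_st req;   /* assumed handle for an in-flight request */

    /* Issue the SET and return immediately; a blocking memcached_set would
     * stall here while the server completes a possible SSD write. */
    memcached_return_t rc = memcached_iset(memc, key, key_len, val, val_len,
                                           /*expiration=*/0, /*flags=*/0,
                                           &req);
    if (rc != MEMCACHED_SUCCESS)
        return;

    /* ... overlap useful work, e.g., issue further iset/iget requests ... */

    /* Progress the request: poll with memcached_test, or block in
     * memcached_wait until the reply arrives. */
    while (memcached_test(memc, &req) != MEMCACHED_SUCCESS) {
        /* do other work while server-side SSD I/O latency is hidden */
    }
}

The iget and the bset/bget variants named on the slide would follow the same issue-then-test/wait pattern.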

[Figure: Blocking vs. non-blocking API flows between a Libmemcached client and a Memcached server with a hybrid slab manager (RAM+SSD), both sides using the RDMA-enhanced communication library; the non-blocking flows (NonB-i, NonB-b) decouple request and reply and progress via test/wait.]


High-Performance Non-Blocking API Semantics

• Set/Get latency with non-blocking APIs: up to 8x improvement in overall latency vs. blocking API semantics over the RDMA+SSD hybrid design
• Up to 2.5x throughput gain observed at the client; the ability to overlap request and response phases hides SSD I/O overheads

[Figure: Average Set/Get latency (us) for IPoIB-Mem, RDMA-Mem, H-RDMA-Def, H-RDMA-Opt-Block, H-RDMA-Opt-NonB-i, and H-RDMA-Opt-NonB-b, broken down into miss penalty (backend DB access overhead), client wait, server response, cache update, cache check+load (memory and/or SSD read), and slab allocation (w/ SSD write on out-of-memory); the non-blocking variants overlap SSD I/O access and approach in-memory latency.]


Fast Online Erasure Coding with RDMA
• Erasure Coding (EC): a low storage-overhead alternative to replication
• Bottlenecks: (1) encode/decode compute overheads, and (2) communication overhead of scattering/gathering distributed data/parity chunks
• Goal: make online EC viable for key-value stores
• Approach: non-blocking RDMA-aware semantics to enable compute/communication overlap
• Encode/decode offload capabilities integrated into the Memcached client (CE/CD) and server (SE/SD)

[Figure: EC performance vs. memory overhead trade-off: no fault tolerance and replication sit at the high-performance but memory-costly end, erasure coding is memory-efficient, and the ideal scenario combines high performance with memory efficiency.]

RS(3,2) incurs 66% storage overhead, vs. 200% for replication with factor 3
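As a quick check of the storage-overhead claim above, the numbers follow directly from the coding parameters; this is the generic Reed-Solomon RS(k,m) calculation (k data chunks plus m parity chunks), not anything specific to this design:

\[
  \text{overhead}_{\mathrm{RS}(k,m)} = \frac{m}{k}
  \;\Rightarrow\;
  \text{overhead}_{\mathrm{RS}(3,2)} = \frac{2}{3} \approx 66\%,
  \qquad
  \text{overhead}_{\mathrm{Rep}(r)} = r - 1
  \;\Rightarrow\;
  \text{overhead}_{\mathrm{Rep}(3)} = 2 = 200\%.
\]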


Fast Online Erasure Coding with RDMA

D. Shankar, X. Lu, and D. K. Panda, “High-Performance and Resilient Key-Value Store with Online Erasure Coding for Big Data Workloads”, 37th International Conference on Distributed Computing Systems (ICDCS 2017)

[Figure: Left: total Set latency (us) vs. value size (512B to 1M) comparing Sync-Rep=3, Async-Rep=3, Era(3,2)-CE-CD, Era(3,2)-SE-SD, and Era(3,2)-SE-CD, with the Era(3,2) designs up to ~1.6x to ~2.8x faster. Right: aggregate YCSB throughput (Kops/sec) for YCSB-A and YCSB-B at 4K/16K/32K value sizes comparing Memc-RDMA-NoRep, Async-Rep=3, Era(3,2)-CE-CD, and Era(3,2)-SE-CD, with up to ~1.34x to ~1.5x gains.]

• Experiments with YCSB comparing online EC vs. asynchronous replication:
– 150 clients on 10 nodes of the SDSC Comet cluster (IB FDR + 24-core Intel Haswell) over a 5-node RDMA-Memcached cluster
– (1) CE-CD gains ~1.34x for update-heavy workloads, with SE-CD on par; (2) CE-CD and SE-CD are on par for read-heavy workloads
• Set latency measured on an Intel Westmere + QDR-IB cluster: Rep=3 vs. RS(3,2)


Co-Designing Key-Value Store-based Burst Buffer over PFS
• Offline data analytics use case: a hybrid and resilient key-value store-based burst-buffer system over Lustre (Boldio)
– Overcomes local storage limitations on HPC nodes while retaining the performance of 'data locality'
– Lightweight, transparent interface to Hadoop/Spark applications (a simplified write-path sketch follows the reference below)
• Accelerating I/O-intensive Big Data workloads
– Non-blocking RDMA-Libmemcached APIs to maximize overlap
– Client-based replication or online erasure coding with RDMA for resilience
– Asynchronous persistence to the Lustre parallel file system at the RDMA-Memcached servers
D. Shankar, X. Lu, D. Panda, “Boldio: A Hybrid and Resilient Burst-Buffer over Lustre for Accelerating Big Data I/O”, IEEE International Conference on Big Data 2016 (Short Paper)
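To illustrate the idea of a file-style interface layered on the key-value store, here is a hypothetical write-path sketch in C: file data is chunked and stored as key-value pairs whose keys encode the path and block index. The key layout, chunk size, and helper name are illustrative assumptions, not Boldio's implementation (which additionally uses the non-blocking APIs, replication/EC, and asynchronous Lustre persistence described above); only standard libmemcached calls are used here.

/* Hypothetical burst-buffer write path: stage a file write as key-value
 * SETs. A blocking memcached_set is shown for simplicity. */
#include <libmemcached/memcached.h>
#include <stdio.h>
#include <string.h>

#define CHUNK_SIZE (512 * 1024)   /* assumed staging granularity */

int bb_write(memcached_st *memc, const char *path,
             const char *buf, size_t len)
{
    char key[256];
    size_t off = 0;

    for (int blk = 0; off < len; blk++, off += CHUNK_SIZE) {
        size_t n = (len - off < CHUNK_SIZE) ? (len - off) : CHUNK_SIZE;
        snprintf(key, sizeof(key), "%s:blk%d", path, blk);  /* assumed key scheme */
        memcached_return_t rc = memcached_set(memc, key, strlen(key),
                                              buf + off, n,
                                              /*expiration=*/0, /*flags=*/0);
        if (rc != MEMCACHED_SUCCESS)
            return -1;   /* caller may fall back to writing the PFS directly */
    }
    return 0;
}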


Co-Designing Key-Value Store-based Burst Buffer over PFS

• TestDFSIO on the SDSC Gordon cluster (16-core Intel Sandy Bridge, IB QDR) with a 16-node MapReduce cluster + 4-node Boldio cluster
• Boldio sustains 3x and 6.7x gains in read and write throughput, respectively, over stand-alone Lustre

[Figure: Left: aggregate TestDFSIO throughput (MBps) for 60 GB and 100 GB writes/reads, Lustre-Direct vs. Boldio (6.7x write, 3x read). Right: aggregate DFSIO throughput (MBps) for write/read comparing Lustre-Direct, Alluxio-Remote, Boldio_Async-Rep=3, and Boldio_Online-EC; online EC adds no overhead relative to Rep=3 (RS(3,2)).]

• TestDFSIO on an Intel Westmere cluster (8-core Intel Sandy Bridge and IB QDR); 8-node MapReduce cluster + 5-node Boldio cluster over Lustre
• Performance gains over designs like Alluxio (formerly Tachyon) in HPC environments with no local storage


Exploring Opportunities with NVRAM and RDMA

[Figure: Two nodes (initiator/target), each with cores (L1/L2), a shared L3, a memory controller fronting DRAM and NVRAM, an I/O controller, and an HCA on the PCIe bus; the durability path is shown for RDMA into and from NVRAM, along with the local durability point.]
• Emerging non-volatile memory technologies (NVRAM)

• Potential: byte-addressable and persistent; accessible via RDMA
• Observation: RDMA writes into NVRAM need to guarantee remote durability
• Appliance method: hardware-assisted remote direct persistence
• General-purpose server method: server-assisted, software-based persistence (a minimal sketch follows the reference below)

• Opportunities: designing high-performance RDMA-based persistence protocols (e.g., a persistent in-memory KVS over NVRAM)
D. Shankar, X. Lu, D. Panda, “RDMP-KVS: Remote Direct Memory Persistence Aware Key-Value Store for NVRAM Systems” (Under Submission)
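A minimal sketch, assuming x86 cache-flush instructions and an NVRAM-mapped region, of the general-purpose server method mentioned above: once the server learns that an RDMA write has landed in the NVRAM-backed buffer, it flushes the affected cache lines and fences before acknowledging durability to the client. The function names and notification path are illustrative assumptions, not the RDMP-KVS design (compile with CLFLUSHOPT support, e.g., gcc -mclflushopt).

/* Server-assisted, software-based persistence sketch for RDMA writes into
 * an NVRAM-backed region. */
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

#define CACHELINE 64

/* Flush [addr, addr+len) from the CPU caches, then order the flushes so the
 * data reaches the persistence domain before we acknowledge. */
static void persist_range(const void *addr, size_t len)
{
    uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
    uintptr_t end = (uintptr_t)addr + len;

    for (; p < end; p += CACHELINE)
        _mm_clflushopt((void *)p);
    _mm_sfence();
}

/* Assumed hook: called when the server is notified (e.g., via a completion
 * or immediate data) that 'len' bytes were RDMA-written at 'dst' inside the
 * NVRAM-mapped key-value region. */
void on_rdma_write_complete(void *dst, size_t len)
{
    persist_range(dst, len);
    /* ... now send the durability acknowledgement back to the client ... */
}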


Broader Impact: The High-Performance Big Data (HiBD) Project
• RDMA for Apache Spark (RDMA-Spark), RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x), RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
– RDMA-aware 'DRAM+SSD' hybrid Memcached server design
– Non-blocking RDMA-based client API designs (RDMA-Libmemcached)
– Based on Memcached 1.5.3 and Libmemcached client 1.0.18
– Available for InfiniBand and RoCE
• OSU HiBD-Benchmarks (OHB)
– Memcached Set/Get micro-benchmarks for blocking and non-blocking APIs, and for hybrid Memcached designs
– YCSB plugin for RDMA-Memcached
– Also includes HDFS, HBase, and Spark micro-benchmarks
• http://hibd.cse.ohio-state.edu
• User base: 290 organizations, 34 countries, 27,950 downloads


Conclusion & Future Avenues
• Holistic approach to designing key-value storage by exploiting the capabilities of HPC clusters for (1) performance, (2) scalability, and (3) data resilience/availability
• RDMA-capable networks: (1) proposed non-blocking RDMA-based Libmemcached APIs, (2) fast online EC-based RDMA-Memcached designs
• Heterogeneous storage awareness: (1) leverage 'RDMA+SSD' hybrid designs, (2) 'RDMA+NVRAM' persistent key-value storage
• Application co-design: memory-centric data-intensive applications on HPC clusters
– Online (e.g., SQL query cache, YCSB) and offline data analytics (e.g., Boldio burst-buffer for Hadoop I/O)
• Future work: ongoing work in this thesis direction
– Heterogeneous compute capabilities of CPUs/GPUs: end-to-end SIMD-aware KVS designs
– Exploring co-design of (1) read-intensive graph-based workloads (e.g., LinkBench, RedisGraph) and (2) a key-value storage engine for ML parameter server frameworks


shankar.50@osu.edu

http://www.cse.ohio-state.edu/~shankard

Thank You!

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
