Exploiting HPC Technologies for Accelerating Big Data Processing and Associated Deep Learning
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Keynote Talk at the Swiss Conference (April ’18)
• Big Data has changed the way people understand and harness the power of data, in both the business and research domains
• Big Data has become one of the most important elements in business analytics
• Big Data and High Performance Computing (HPC) are converging to meet large-scale data processing challenges
• Running High Performance Data Analysis (HPDA) workloads in the cloud is gaining popularity
• According to the latest OpenStack survey, 27% of cloud deployments are running HPDA workloads
Introduction to Big Data Analytics and Trends
http://www.coolinfographics.com/blog/tag/data?currentPage=3
http://www.climatecentral.org/news/white-house-brings-together-big-data-and-climate-change-17194
• From 2005 to 2020, the digital universe will grow by a factor of 300, from 130 exabytes to 40,000 exabytes.
• By 2020, a third of the data in the digital universe (more than 13,000 exabytes) will have Big Data value, but only if it is tagged and analyzed.
Big Volume of Data by the End of this Decade
Courtesy: John Gantz and David Reinsel, "The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East”, IDC's Digital Universe Study, sponsored by EMC, December 2012.
Big Velocity – How Much Data Is Generated Every Minute on the Internet?
The global Internet population grew 7.5% from 2016 and now represents 3.7 billion people.
Courtesy: https://www.domo.com/blog/data-never-sleeps-5/
• Scientific Data Management, Analysis, and Visualization
• Applications examples
– Climate modeling
– Combustion
– Fusion
– Astrophysics
– Bioinformatics
• Data Intensive Tasks
– Run large-scale simulations on supercomputers
– Dump data on parallel storage systems
– Collect experimental / observational data
– Move experimental / observational data to analysis sites
– Visual analytics – help understand data visually
Not Only in Internet Services - Big Data in Scientific Domains
Big Data (Hadoop, Spark, HBase, Memcached, etc.)
Deep Learning (Caffe, TensorFlow, BigDL, etc.)
HPC (MPI, RDMA, Lustre, etc.)
Increasing Usage of HPC, Big Data and Deep Learning
Convergence of HPC, Big Data, and Deep Learning!
Increasing Need to Run these Applications on the Cloud!!
• Substantial impact on designing and utilizing data management and processing systems in multiple tiers
– Front-end data accessing and serving (Online)
• Memcached + DB (e.g., MySQL), HBase (a minimal cache-aside sketch follows below)
– Back-end data analytics (Offline)
• HDFS, MapReduce, Spark
Big Data Management and Processing on Modern Clusters
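To make the online tier concrete, here is a minimal cache-aside sketch in Python. It assumes a local Memcached server on the default port and a hypothetical query_mysql() helper standing in for the backing database; it only illustrates the access pattern and is not part of the RDMA-Memcached package.

```python
from pymemcache.client.base import Client

# Cache-aside sketch for the online tier: Memcached in front of a database.
# The Memcached address and query_mysql() helper are illustrative assumptions.
cache = Client(("127.0.0.1", 11211))

def query_mysql(key):
    # Hypothetical stand-in for a lookup in the backing MySQL store
    return ("value-for-" + key).encode()

def cached_get(key):
    value = cache.get(key)                  # 1. try the cache first
    if value is None:                       # 2. miss: fall back to the database
        value = query_mysql(key)
        cache.set(key, value, expire=300)   # 3. populate the cache for later reads
    return value

print(cached_get("user:42"))
```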
DLoBD pipeline stages: (1) Prepare datasets @ scale, (2) Deep learning @ scale, (3) Non-deep-learning analytics @ scale, (4) Apply ML model @ scale
• Deep Learning over Big Data (DLoBD) is one of the most efficient paradigms for large-scale data analytics
• More and more deep learning tools and libraries (e.g., Caffe, TensorFlow) are starting to run over big data stacks, such as Apache Hadoop and Spark
• Benefits of the DLoBD approach
– Easily build a powerful data analytics pipeline (a minimal PySpark sketch of such a pipeline follows below)
• E.g., Flickr DL/ML Pipeline, “How Deep Learning Powers Flickr”, http://bit.ly/1KIDfof
– Better data locality
– Efficient resource sharing and cost effectiveness
Deep Learning over Big Data (DLoBD)
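As a rough sketch of the four-stage DLoBD pipeline above, the PySpark snippet below prepares data, hands it to a deep learning step, runs a non-DL analytics step, and applies the model. The input path and the train_model() helper are hypothetical placeholders for a DL library running on the same cluster (e.g., BigDL or TensorFlowOnSpark); this is not the API of any specific stack.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dlobd-pipeline").getOrCreate()

# (1) Prepare datasets at scale: read and clean raw records (assumed HDFS path)
raw = spark.read.text("hdfs:///data/raw")
samples = raw.rdd.map(lambda row: row.value.strip()).filter(lambda s: s)

# (2) Deep learning at scale: hand the distributed dataset to a DL tool
def train_model(rdd):
    # Hypothetical: a DLoBD library would train directly on the RDD/DataFrame
    return "model"

model = train_model(samples)

# (3) Non-deep-learning analytics at scale (e.g., simple statistics)
num_samples = samples.count()

# (4) Apply the trained model at scale (placeholder scoring step)
predictions = samples.map(lambda s: (s, model))
print(num_samples, predictions.take(1))
spark.stop()
```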
• CaffeOnSpark
• SparkNet
• TensorFlowOnSpark
• TensorFrame
• DeepLearning4J
• BigDL
• mmlspark
– CNTKOnSpark
• Many others…
Examples of DLoBD Stacks
Drivers of Modern HPC Cluster and Data Center Architecture
• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
– Single Root I/O Virtualization (SR-IOV)
• Solid State Drives (SSDs), NVM, Parallel Filesystems, Object Storage Clusters
• Accelerators (NVIDIA GPGPUs and FPGAs)
[Figure: building blocks of modern HPC clusters and data centers – high-performance interconnects such as InfiniBand with SR-IOV (<1 usec latency, 200 Gbps bandwidth); multi-/many-core processors; accelerators/coprocessors with high compute density and high performance/watt (>1 TFlop DP on a chip); SSD, NVMe-SSD, and NVRAM storage; cloud systems such as SDSC Comet and TACC Stampede]
Interconnects and Protocols in the OpenFabrics Stack for HPC (http://openfabrics.org)
[Diagram: application/middleware interface, protocol, adapter, and switch for each option]
• 1/10/40/100 GigE – Sockets interface; kernel-space TCP/IP over the Ethernet driver; Ethernet adapter and Ethernet switch
• 10/40 GigE-TOE – Sockets interface; hardware-offloaded TCP/IP; Ethernet adapter and Ethernet switch
• IPoIB – Sockets interface; kernel-space IPoIB; InfiniBand adapter and InfiniBand switch
• RSockets – Sockets interface; user-space RSockets; InfiniBand adapter and InfiniBand switch
• SDP – Sockets interface; RDMA-based SDP; InfiniBand adapter and InfiniBand switch
• iWARP – Verbs interface; user-space TCP/IP with iWARP; iWARP adapter and Ethernet switch
• RoCE – Verbs interface; user-space RDMA; RoCE adapter and Ethernet switch
• IB Native – Verbs interface; user-space RDMA; InfiniBand adapter and InfiniBand switch
How Can HPC Clusters with High-Performance Interconnect and Storage Architectures Benefit Big Data Applications?
Bring HPC and Big Data processing into a “convergent trajectory”!
• What are the major bottlenecks in current Big Data processing middleware (e.g., Hadoop, Spark, and Memcached)?
• Can the bottlenecks be alleviated with new designs that take advantage of HPC technologies?
• Can RDMA-enabled high-performance interconnects benefit Big Data processing?
• Can HPC clusters with high-performance storage systems (e.g., SSDs, parallel file systems) benefit Big Data applications?
• How much performance benefit can be achieved through enhanced designs?
• How do we design benchmarks for evaluating the performance of Big Data middleware on HPC clusters?
Can We Run Big Data and Deep Learning Jobs on Existing HPC Infrastructure?
[Figure: Spark, Hadoop, and Deep Learning jobs mapped onto an existing HPC cluster]
Designing Communication and I/O Libraries for Big Data Systems: Challenges
[Proposed stack, top to bottom:]
• Applications
• Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached) – upper-level changes?
• Programming Models (Sockets) – other protocols?
• Communication and I/O Library: point-to-point communication, threaded models and synchronization, QoS & fault tolerance, performance tuning, I/O and file systems, virtualization (SR-IOV), benchmarks
• Networking Technologies (InfiniBand, 1/10/40/100 GigE and Intelligent NICs)
• Storage Technologies (HDD, SSD, NVM, and NVMe-SSD)
• Commodity Computing System Architectures (multi- and many-core architectures and accelerators)
• RDMA for Apache Spark
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
– Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions
• RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• OSU HiBD-Benchmarks (OHB)
– HDFS, Memcached, HBase, and Spark Micro-benchmarks
• http://hibd.cse.ohio-state.edu
• User base: 280 organizations from 34 countries
• More than 25,750 downloads from the project site
The High-Performance Big Data (HiBD) Project
Available for InfiniBand and RoCE; also runs on Ethernet
Available for x86 and OpenPOWER
Upcoming release will have support for Singularity and Docker
HiBD Release Timeline and Downloads
[Chart: cumulative number of downloads over time, annotated with release milestones – RDMA-Hadoop 1.x 0.9.0/0.9.8/0.9.9, RDMA-Memcached 0.9.1 & OHB 0.7.1, RDMA-Hadoop 2.x 0.9.1/0.9.6/0.9.7/1.0.0/1.3.0, RDMA-Spark 0.9.1/0.9.4/0.9.5, RDMA-HBase 0.9.1, RDMA-Memcached 0.9.4/0.9.6 & OHB 0.9.3]
• High-Performance Design of Hadoop over RDMA-enabled Interconnects
– High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for HDFS, MapReduce, and
RPC components
– Enhanced HDFS with in-memory and heterogeneous storage
– High performance design of MapReduce over Lustre
– Memcached-based burst buffer for MapReduce over Lustre-integrated HDFS (HHH-L-BB mode)
– Plugin-based architecture supporting RDMA-based designs for Apache Hadoop, CDH and HDP
– Support for OpenPOWER
• Current release: 1.3.0
– Based on Apache Hadoop 2.8.0
– Compliant with Apache Hadoop 2.8.0, HDP 2.5.0.3 and CDH 5.8.2 APIs and applications
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR)
• RoCE support with Mellanox adapters
• Various multi-core platforms (x86, POWER)
• Different file systems with disks and SSDs and Lustre
RDMA for Apache Hadoop 2.x Distribution
http://hibd.cse.ohio-state.edu
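As a quick sanity check after installing the package, one can launch a standard TeraGen/TeraSort run exactly as with stock Hadoop; the sketch below drives the stock MapReduce examples jar from Python. The jar path, HDFS paths, and data size are illustrative assumptions, not values from the HiBD documentation.

```python
import subprocess

# Minimal sketch: run the standard Hadoop examples (TeraGen + TeraSort)
# against an RDMA-Hadoop installation. Paths and sizes are assumptions.
examples_jar = "/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar"

def run(*args):
    print("running:", " ".join(args))
    subprocess.check_call(args)

rows = str(10_000_000)  # 10M 100-byte rows, roughly 1 GB
run("hadoop", "jar", examples_jar, "teragen", rows, "/user/test/tera-in")
run("hadoop", "jar", examples_jar, "terasort", "/user/test/tera-in", "/user/test/tera-out")
```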
• HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation to provide better fault-tolerance as well as performance. This mode is enabled by default in the package.
• HHH-M: A high-performance in-memory based setup has been introduced in this package that can be utilized to perform all I/O operations in memory and obtain as much performance benefit as possible.
• HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre available in the cluster.
• HHH-L-BB: This mode deploys a Memcached-based burst buffer system to reduce the bandwidth bottleneck of shared file system access. The burst buffer design is hosted by Memcached servers, each of which has a local SSD.
• MapReduce over Lustre, with/without local disks: Besides the HDFS-based solutions, this package also provides support to run MapReduce jobs on top of Lustre alone. Here, two different modes are introduced: with local disks and without local disks.
• Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in the different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre).
Different Modes of RDMA for Apache Hadoop 2.x
• High-Performance Design of Spark over RDMA-enabled Interconnects
– High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for Spark
– RDMA-based data shuffle and SEDA-based shuffle architecture
– Non-blocking and chunk-based data transfer
– Off-JVM-heap buffer management
– Support for OpenPOWER
– Easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB)
• Current release: 0.9.5
– Based on Apache Spark 2.1.0
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR)
• RoCE support with Mellanox adapters
• Various multi-core platforms (x86, POWER)
• RAM disks, SSDs, and HDD
– http://hibd.cse.ohio-state.edu
RDMA for Apache Spark Distribution
• High-Performance Design of HBase over RDMA-enabled Interconnects
– High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level
for HBase
– Compliant with Apache HBase 1.1.2 APIs and applications
– On-demand connection setup
– Easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB)
• Current release: 0.9.1
– Based on Apache HBase 1.1.2
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR)
• RoCE support with Mellanox adapters
• Various multi-core platforms
– http://hibd.cse.ohio-state.edu
RDMA for Apache HBase Distribution
• High-Performance Design of Memcached over RDMA-enabled Interconnects
– High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for Memcached and
libMemcached components
– High performance design of SSD-Assisted Hybrid Memory
– Non-Blocking Libmemcached Set/Get API extensions
– Support for burst-buffer mode in Lustre-integrated design of HDFS in RDMA for Apache Hadoop-2.x
– Easily configurable for native InfiniBand, RoCE and the traditional sockets-based support (Ethernet and InfiniBand with
IPoIB)
• Current release: 0.9.6
– Based on Memcached 1.5.3 and libMemcached 1.0.18
– Compliant with libMemcached APIs and applications
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR)
• RoCE support with Mellanox adapters
• Various multi-core platforms
• SSD
– http://hibd.cse.ohio-state.edu
RDMA for Memcached Distribution
• Micro-benchmarks for Hadoop Distributed File System (HDFS)
– Sequential Write Latency (SWL) Benchmark, Sequential Read Latency (SRL) Benchmark, Random Read Latency
(RRL) Benchmark, Sequential Write Throughput (SWT) Benchmark, Sequential Read Throughput (SRT)
Benchmark
– Support benchmarking of
• Apache Hadoop 1.x and 2.x HDFS, Hortonworks Data Platform (HDP) HDFS, Cloudera Distribution of Hadoop (CDH)
HDFS
• Micro-benchmarks for Memcached
– Get Benchmark, Set Benchmark, and Mixed Get/Set Benchmark, Non-Blocking API Latency Benchmark, Hybrid
Memory Latency Benchmark
– Yahoo! Cloud Serving Benchmark (YCSB) Extension for RDMA-Memcached
• Micro-benchmarks for HBase
– Get Latency Benchmark, Put Latency Benchmark
• Micro-benchmarks for Spark
– GroupBy, SortBy
• Current release: 0.9.3
• http://hibd.cse.ohio-state.edu
OSU HiBD Micro-Benchmark (OHB) Suite – HDFS, Memcached, HBase, and Spark
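To illustrate what the Memcached Get/Set micro-benchmarks measure, here is a minimal latency loop in Python using pymemcache; it is not the OHB code itself, and the server address, value size, and iteration count are assumptions for illustration.

```python
import time
from pymemcache.client.base import Client

# Minimal sketch of a Set/Get latency micro-benchmark (not the OHB code):
# time repeated set and get operations for a fixed value size.
client = Client(("127.0.0.1", 11211))  # assumed Memcached server
value = b"x" * 4096                    # assumed 4 KB value size
iters = 10000

def time_op(op):
    start = time.perf_counter()
    for i in range(iters):
        op("bench:%d" % i)
    return (time.perf_counter() - start) / iters * 1e6  # average usec per op

set_lat = time_op(lambda k: client.set(k, value))
get_lat = time_op(lambda k: client.get(k))
print("avg set latency: %.2f us, avg get latency: %.2f us" % (set_lat, get_lat))
```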
Using HiBD Packages on Existing HPC Infrastructure
• RDMA for Apache Hadoop 2.x and RDMA for Apache Spark are installed and
available on SDSC Comet.
– Examples for various modes of usage are available in:
• RDMA for Apache Hadoop 2.x: /share/apps/examples/HADOOP
• RDMA for Apache Spark: /share/apps/examples/SPARK/
– Please email [email protected] (reference Comet as the machine, and SDSC as the
site) if you have any further questions about usage and configuration.
• RDMA for Apache Hadoop is also available on Chameleon Cloud as an
appliance
– https://www.chameleoncloud.org/appliances/17/
HiBD Packages on SDSC Comet and Chameleon Cloud
M. Tatineni, X. Lu, D. J. Choi, A. Majumdar, and D. K. Panda, Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC
Comet, XSEDE’16, July 2016
• Basic Designs for HiBD Packages
– HDFS and MapReduce
– Spark
– Memcached
– Kafka
• Deep Learning
– TensorFlow with gRPC
– Deep Learning over Big Data (Caffe on Spark, Tensorflow on Spark, ..)
• Cloud Support for Big Data Stacks
– Virtualization with SR-IOV and Containers
– Swift Object Store
Acceleration Case Studies and Performance Evaluation
• Enables high performance RDMA communication, while supporting traditional socket interface
• JNI Layer bridges Java based HDFS with communication library written in native code
Design Overview of HDFS with RDMA
[Design diagram: Applications → HDFS → Java Socket Interface (other operations) or Java Native Interface (JNI) for write (OSU design) → Verbs → 1/10/40/100 GigE / IPoIB network or RDMA-capable networks (IB, iWARP, RoCE, ..)]
• Design Features
– RDMA-based HDFS write
– RDMA-based HDFS replication
– Parallel replication support
– On-demand connection setup
– InfiniBand/RoCE support
N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda , High Performance RDMA-Based Design of HDFS
over InfiniBand , Supercomputing (SC), Nov 2012
N. Islam, X. Lu, W. Rahman, and D. K. Panda, SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC '14, June 2014
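From the application's point of view the HDFS API is unchanged; the acceleration sits below the JNI layer. As a sketch, a client can keep writing and reading files through any standard HDFS interface, e.g. the Python hdfs (WebHDFS) client below; the namenode URL, user, and paths are assumptions.

```python
from hdfs import InsecureClient

# Applications keep using standard HDFS interfaces; the RDMA-enhanced data
# path lives below the JNI layer. Namenode URL and paths are illustrative.
client = InsecureClient("http://namenode:50070", user="hadoop")

with client.write("/user/hadoop/demo.txt", overwrite=True) as writer:
    writer.write(b"hello from an unchanged HDFS client\n")

with client.read("/user/hadoop/demo.txt") as reader:
    print(reader.read())
```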
Triple-H
Heterogeneous Storage
• Design Features
– Three modes
• Default (HHH)
• In-Memory (HHH-M)
• Lustre-Integrated (HHH-L)
– Policies to efficiently utilize the heterogeneous
storage devices
• RAM, SSD, HDD, Lustre
– Eviction/Promotion based on data usage
pattern
– Hybrid Replication
– Lustre-Integrated mode:
• Lustre-based fault-tolerance
Enhanced HDFS with In-Memory and Heterogeneous Storage
[Triple-H architecture: Applications → data placement policies, hybrid replication, and eviction/promotion across RAM disk, SSD, HDD, and Lustre]
N. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters
with Heterogeneous Storage Architecture, CCGrid ’15, May 2015
Design Overview of MapReduce with RDMA
[Design diagram: Applications → MapReduce (Job Tracker, Task Tracker, Map, Reduce) → Java Socket Interface or Java Native Interface (JNI, OSU design) → Verbs → 1/10/40/100 GigE / IPoIB network or RDMA-capable networks (IB, iWARP, RoCE, ..)]
• Enables high performance RDMA communication, while supporting traditional socket interface
• JNI Layer bridges Java based MapReduce with communication library written in native code
• Design Features
– RDMA-based shuffle
– Prefetching and caching map output
– Efficient Shuffle Algorithms
– In-memory merge
– On-demand Shuffle Adjustment
– Advanced overlapping
• map, shuffle, and merge
• shuffle, merge, and reduce
– On-demand connection setup
– InfiniBand/RoCE support
M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in
MapReduce over High Performance Interconnects, ICS, June 2014
Performance Numbers of RDMA for Apache Hadoop 2.x – RandomWriter & TeraGen on OSU-RI2 (EDR)
Cluster with 8 nodes and a total of 64 maps
• RandomWriter: execution time reduced by 3x over IPoIB (EDR) for 80-160 GB data sizes
• TeraGen: execution time reduced by 4x over IPoIB (EDR) for 80-240 GB data sizes
[Charts: execution time (s) vs. data size (GB), IPoIB (EDR) vs. OSU-IB (EDR), for RandomWriter and TeraGen]
Performance Numbers of RDMA for Apache Hadoop 2.x – Sort & TeraSort on OSU-RI2 (EDR)
Cluster with 8 nodes and a total of 64 maps, with 32 reduces (Sort) or 14 reduces (TeraSort)
• Sort: 61% improvement over IPoIB (EDR) for 80-160 GB data
• TeraSort: 18% improvement over IPoIB (EDR) for 80-240 GB data
[Charts: execution time (s) vs. data size (GB), IPoIB (EDR) vs. OSU-IB (EDR); Sort reduced by 61%, TeraSort reduced by 18%]
Evaluation with Spark on SDSC Gordon (HHH vs. Tachyon/Alluxio)
• For 200GB TeraGen on 32 nodes
– Spark-TeraGen: HHH has 2.4x improvement over Tachyon; 2.3x over HDFS-IPoIB (QDR)
– Spark-TeraSort: HHH has 25.2% improvement over Tachyon; 17% over HDFS-IPoIB (QDR)
[Charts: execution time (s) vs. cluster size : data size (GB) (8:50, 16:100, 32:200) for IPoIB (QDR), Tachyon, and OSU-IB (QDR); TeraGen reduced by 2.4x, TeraSort reduced by 25.2%]
N. Islam, M. W. Rahman, X. Lu, D. Shankar, and D. K. Panda, Performance Characterization and Acceleration of In-Memory File
Systems for Hadoop and Spark Applications on HPC Clusters, IEEE BigData ’15, October 2015
• Basic Designs for HiBD Packages
– HDFS and MapReduce
– Spark
– Memcached
– Kafka
• Deep Learning
– TensorFlow with gRPC
– Deep Learning over Big Data (Caffe on Spark, Tensorflow on Spark, ..)
• Cloud Support for Big Data Stacks
– Virtualization with SR-IOV and Containers
– Swift Object Store
Acceleration Case Studies and Performance Evaluation
• Design Features
– RDMA based shuffle plugin
– SEDA-based architecture
– Dynamic connection management and sharing
– Non-blocking data transfer
– Off-JVM-heap buffer management
– InfiniBand/RoCE support
Design Overview of Spark with RDMA
• Enables high performance RDMA communication, while supporting traditional socket interface
• JNI Layer bridges Scala based Spark with communication library written in native code
X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High
Performance Interconnects (HotI'14), August 2014
X. Lu, D. Shankar, S. Gugnani, and D. K. Panda, High-Performance Design of Apache Spark with RDMA and Its Benefits on Various Workloads, IEEE BigData ‘16, Dec. 2016.
[Design diagram: Apache Spark benchmarks/applications/libraries/frameworks → Spark Core → Shuffle Manager (Sort, Hash, Tungsten-Sort) → Block Transfer Service (Netty, NIO, RDMA-Plugin) with Netty/NIO/RDMA servers and clients → Java Socket Interface or Java Native Interface (JNI) with a native RDMA-based communication engine → 1/10/40/100 GigE / IPoIB network or RDMA-capable networks (IB, iWARP, RoCE, ..)]
• InfiniBand FDR, SSD, 64 Worker Nodes, 1536 Cores, (1536M 1536R)
• RDMA vs. IPoIB with 1536 concurrent tasks, single SSD per node.
– SortBy: Total time reduced by up to 80% over IPoIB (56Gbps)
– GroupBy: Total time reduced by up to 74% over IPoIB (56Gbps)
Performance Evaluation on SDSC Comet – SortBy/GroupBy
[Charts: SortByTest and GroupByTest total time (sec) vs. data size (GB) (64, 128, 256) on 64 worker nodes / 1536 cores, IPoIB vs. RDMA; SortBy reduced by up to 80%, GroupBy reduced by up to 74%]
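For reference, the operations behind these numbers are ordinary shuffle-heavy Spark transformations; the minimal PySpark sketch below generates random key/value pairs and sorts and groups them. The data volume and key range are tiny, illustrative assumptions, not the benchmark configuration used above.

```python
import random
from pyspark.sql import SparkSession

# Shuffle-heavy PySpark sketch in the spirit of the SortBy/GroupBy tests.
spark = SparkSession.builder.appName("sortby-groupby-sketch").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize(range(1_000_000)) \
          .map(lambda i: (random.randint(0, 9999), i))

sorted_rdd = pairs.sortByKey()     # SortBy-style shuffle
grouped_rdd = pairs.groupByKey()   # GroupBy-style shuffle

print(sorted_rdd.take(3))
print([(k, len(list(v))) for k, v in grouped_rdd.take(3)])
spark.stop()
```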
• InfiniBand FDR, SSD, 32/64 Worker Nodes, 768/1536 Cores, (768/1536M 768/1536R)
• RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node.
– 32 nodes/768 cores: Total time reduced by 37% over IPoIB (56Gbps)
– 64 nodes/1536 cores: Total time reduced by 43% over IPoIB (56Gbps)
Performance Evaluation on SDSC Comet – HiBench PageRank
[Charts: PageRank total time (sec) for Huge, BigData, and Gigantic data sizes on 32 worker nodes / 768 cores and on 64 worker nodes / 1536 cores, IPoIB vs. RDMA; reduced by 37% and 43%, respectively]
Application Evaluation on SDSC Comet
• Kira Toolkit: Distributed astronomy image processing
toolkit implemented using Apache Spark
– https://github.com/BIDS/Kira
• Source extractor application, using a 65GB dataset from
the SDSS DR2 survey that comprises 11,150 image files.
[Chart: execution times (sec) for the Kira SE benchmark using a 65 GB dataset on 48 cores; RDMA Spark is 21% faster than Apache Spark (IPoIB)]
M. Tatineni, X. Lu, D. J. Choi, A. Majumdar, and D. K. Panda, Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC
Comet, XSEDE’16, July 2016
• BigDL: Distributed Deep Learning Tool using Apache Spark
– https://github.com/intel-analytics/BigDL
• VGG training model on the CIFAR-10 dataset
[Chart: one-epoch time (sec) vs. number of cores (24-384), IPoIB vs. RDMA; up to 4.58x speedup with RDMA]
• Challenges
– Performance characteristics of RDMA-based Hadoop and Spark over OpenPOWER systems with IB networks
– Can RDMA-based Big Data middleware demonstrate good performance benefits on OpenPOWER systems?
– Any new accelerations for Big Data middleware on OpenPOWER systems to attain better benefits?
• Results
– For the Spark SortBy benchmark, RDMA-Opt outperforms IPoIB and RDMA-Def by 23.5% and 10.5%
Accelerating Hadoop and Spark with RDMA over OpenPOWER
[Chart: Spark SortBy latency (sec) for 32 GB, 64 GB, and 96 GB data sizes with IPoIB, RDMA-Def, and RDMA-Opt; reduced by 23.5% with RDMA-Opt]
[Figure: IBM POWER8 architecture – two sockets of SMT cores connected by a chip interconnect, each socket with memory controller(s) to DDR4 memory and a PCIe interface connecting to the IB HCA]
X. Lu, H. Shi, D. Shankar and D. K. Panda, Performance Characterization and Acceleration of Big Data Workloads on OpenPOWER System, IEEE BigData 2017.
• Basic Designs for HiBD Packages
– HDFS and MapReduce
– Spark
– Memcached
– Kafka
• Deep Learning
– TensorFlow with gRPC
– Deep Learning over Big Data (Caffe on Spark, Tensorflow on Spark, ..)
• Cloud Support for Big Data Stacks
– Virtualization with SR-IOV and Containers
– Swift Object Store
Acceleration Case Studies and Performance Evaluation
[Charts: Memcached GET latency (us) vs. message size (1 B-4 KB) and throughput (thousands of transactions per second, TPS) vs. number of clients (16-4080), IPoIB vs. OSU-IB (FDR)]
• Memcached Get latency
– 4 bytes OSU-IB: 2.84 us; IPoIB: 75.53 us, 2K bytes OSU-IB: 4.49 us; IPoIB: 123.42 us
• Memcached Throughput (4bytes)
– 4080 clients OSU-IB: 556 Kops/sec, IPoIB: 233 Kops/s, Nearly 2X improvement in throughput
Memcached Performance (FDR Interconnect)
Experiments on TACC Stampede (Intel SandyBridge cluster, IB: FDR); latency reduced by nearly 20x and throughput improved by ~2x over IPoIB
J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, M. W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda, Memcached Design on High
Performance RDMA Capable Interconnects, ICPP’11
J. Jose, H. Subramoni, K. Kandalla, M. W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Scalable Memcached design for InfiniBand Clusters using Hybrid
Transport, CCGrid’12
• Illustration with a Read-Cache-Read access pattern using a modified mysqlslap load-testing tool
• Memcached-RDMA can
– improve query latency by up to 66% over IPoIB (32Gbps)
– improve throughput by up to 69% over IPoIB (32Gbps)
Micro-benchmark Evaluation for OLDP Workloads
[Charts: latency (sec) and throughput (Kq/s) vs. number of clients (64-400), Memcached-IPoIB (32Gbps) vs. Memcached-RDMA (32Gbps)]
D. Shankar, X. Lu, J. Jose, M. W. Rahman, N. Islam, and D. K. Panda, Can RDMA Benefit On-Line Data Processing Workloads
with Memcached and MySQL, ISPASS’15
• Data does not fit in memory: non-blocking Memcached Set/Get API extensions can achieve
– >16x latency improvement vs. blocking API over RDMA-Hybrid/RDMA-Mem w/ penalty
– >2.5x throughput improvement vs. blocking API over default/optimized RDMA-Hybrid
• Data fits in memory: non-blocking extensions perform similar to RDMA-Mem/RDMA-Hybrid and give >3.6x improvement over IPoIB-Mem
Performance Evaluation with Non-Blocking Memcached API
[Chart: average latency (us) for Set and Get operations with IPoIB-Mem, RDMA-Mem, H-RDMA-Def, H-RDMA-Opt-Block, H-RDMA-Opt-NonB-i, and H-RDMA-Opt-NonB-b; latency is broken down into miss penalty (backend DB access overhead), client wait, server response, cache update, cache check + load (memory and/or SSD read), and slab allocation (w/ SSD write on out-of-memory)]
H = Hybrid Memcached over SATA SSD; Opt = adaptive slab manager; Block = default blocking API; NonB-i = non-blocking iset/iget API; NonB-b = non-blocking bset/bget API w/ buffer re-use guarantee
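The iset/iget extensions belong to the RDMA-Memcached libMemcached API; as a language-neutral illustration of the non-blocking idea, the Python sketch below issues sets/gets on a thread pool and overlaps them with other work. The function names are only loosely inspired by the extensions and do not mirror their real signatures.

```python
from concurrent.futures import ThreadPoolExecutor
from pymemcache.client.base import PooledClient

# Conceptual sketch of non-blocking set/get: issue the operation, keep
# computing, and collect the result later. Server address is assumed.
pool = ThreadPoolExecutor(max_workers=8)
client = PooledClient(("127.0.0.1", 11211), max_pool_size=8)

def iset(key, value):
    return pool.submit(client.set, key, value)   # returns a future

def iget(key):
    return pool.submit(client.get, key)          # returns a future

pending = [iset("k%d" % i, b"v") for i in range(100)]
# ... overlap other computation here while the sets are in flight ...
for f in pending:
    f.result()                                   # wait for completion
print(iget("k0").result())
```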
[Charts: Kafka producer latency (ms) and throughput (MB/s) vs. record size (100-100,000 bytes), RDMA vs. IPoIB]
RDMA-Kafka: High-Performance Message Broker for Streaming Workloads
Kafka Producer Benchmark (1 Broker 4 Producers)
• Experiments run on the OSU-RI2 cluster (2.4 GHz, 28 cores, InfiniBand EDR, 512 GB RAM, 400 GB SSD)
– Up to 98% improvement in latency compared to IPoIB
– Up to 7x increase in throughput over IPoIB
H. Javed, X. Lu , and D. K. Panda, Characterization of Big Data Stream Processing Pipeline: A Case Study using Flink and Kafka,
BDCAT’17
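A producer benchmark of this shape can be reproduced with any Kafka client; the sketch below uses the kafka-python package against a placeholder broker address and topic, timing synchronous sends of fixed-size records. Record size and count are illustrative, not the paper's harness.

```python
import time
from kafka import KafkaProducer

# Minimal producer benchmark sketch: time synchronous sends of fixed-size
# records to one broker. Broker, topic, sizes, and counts are assumptions.
producer = KafkaProducer(bootstrap_servers="broker-host:9092")
record = b"x" * 1000          # 1000-byte record
count = 10000

start = time.perf_counter()
for _ in range(count):
    producer.send("bench-topic", record).get(timeout=10)  # wait for the ack
elapsed = time.perf_counter() - start

print("avg latency: %.3f ms" % (elapsed / count * 1e3))
print("throughput: %.2f MB/s" % (count * len(record) / elapsed / 1e6))
```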
• Basic Designs for HiBD Packages
– HDFS and MapReduce
– Spark
– Memcached
– Kafka
• Deep Learning
– TensorFlow with gRPC
– Deep Learning over Big Data (Caffe on Spark, Tensorflow on Spark, ..)
• Cloud Support for Big Data Stacks
– Virtualization with SR-IOV and Containers
– Swift Object Store
Acceleration Case Studies and Performance Evaluation
Overview of RDMA-gRPC with TensorFlow
Worker services communicate among each other using RDMA-gRPC
[Diagram: a client and master coordinate worker services (/job:PS/task:0 and /job:Worker/task:0), each with CPU/GPU resources and an RDMA-gRPC server/client endpoint]
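For context, the processes in this picture are ordinary TensorFlow (1.x) distributed-runtime jobs; the sketch below shows how such a parameter-server/worker cluster is declared, using placeholder host names. Whether the gRPC transport underneath is the default sockets path or an RDMA-enhanced design is transparent at this level.

```python
import tensorflow as tf  # TensorFlow 1.x distributed runtime

# Declare a parameter-server/worker cluster; host names are placeholders.
cluster = tf.train.ClusterSpec({
    "ps":     ["ps-host:2222"],
    "worker": ["worker-host:2222"],
})

# Each process starts a gRPC server for its own task; a matching process
# must be launched with job_name="ps" for the session below to proceed.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

with tf.device("/job:ps/task:0"):
    w = tf.Variable(tf.zeros([10]))          # parameters live on the PS

with tf.device("/job:worker/task:0"):
    update = w.assign_add(tf.ones([10]))     # workers push updates over gRPC

with tf.Session(server.target) as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(update))
```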
Performance Benefits for RDMA-gRPC with Micro-Benchmark
RDMA-gRPC RPC Latency
• AR-gRPC (OSU design) latency on SDSC-Comet-FDR
– Up to 2.7x speedup over IPoIB (default gRPC) for small messages
– Up to 2.8x speedup over IPoIB for medium messages
– Up to 2.5x speedup over IPoIB for large messages
[Charts: RPC latency (us) vs. payload size for Default gRPC and AR-gRPC, over small (2 B-8 KB), medium (16 KB-512 KB), and large (1 MB-8 MB) payloads]
R. Biswas, X. Lu, and D. K. Panda, Characterizing and Accelerating TensorFlow: Can Adaptive and RDMA-based RPC Benefit It? Under Review
Performance Benefits for RDMA-gRPC with TensorFlow Communication Mimic Benchmark
TensorFlow communication Pattern mimic benchmark
• TensorFlow communication pattern mimic on SDSC-Comet-FDR
– A single process spawns both the gRPC server and the gRPC client
– Up to 3.6x latency speedup over IPoIB
– Up to 3.7x throughput improvement over IPoIB
[Charts: latency (us) and calls/second vs. payload size (2 MB-8 MB), Default gRPC vs. OSU RDMA gRPC]
• Basic Designs for HiBD Packages
– HDFS and MapReduce
– Spark
– Memcached
– Kafka
• Deep Learning
– TensorFlow with gRPC
– Deep Learning over Big Data (Caffe on Spark, Tensorflow on Spark, ..)
• Cloud Support for Big Data Stacks
– Virtualization with SR-IOV and Containers
– Swift Object Store
Acceleration Case Studies and Performance Evaluation
• Layers of DLoBD Stacks
– Deep learning application layer
– Deep learning library layer
– Big data analytics framework layer
– Resource scheduler layer
– Distributed file system layer
– Hardware resource layer
• How much performance benefit can we achieve for end deep learning applications?
Overview of DLoBD Stacks
[Figure: layered DLoBD stack, with the question of where sub-optimal performance arises]
X. Lu, H. Shi, M. H. Javed, R. Biswas, and D. K. Panda, Characterizing Deep Learning over Big Data (DLoBD) Stacks on RDMA-capable Networks, HotI 2017.
High-Performance Deep Learning over Big Data (DLoBD) Stacks
• Benefits of Deep Learning over Big Data (DLoBD)
– Easily integrate deep learning components into the Big Data processing workflow
– Easily access the data stored in Big Data systems
– No need to set up new dedicated deep learning clusters; reuse existing big data analytics clusters
• Challenges
– Can RDMA-based designs in DLoBD stacks improve performance, scalability, and resource utilization on high-performance interconnects, GPUs, and multi-core CPUs?
– What are the performance characteristics of representative DLoBD stacks on RDMA networks?
• Characterization of DLoBD stacks
– CaffeOnSpark, TensorFlowOnSpark, and BigDL
– IPoIB vs. RDMA; in-band vs. out-of-band communication; CPU vs. GPU; etc.
– Performance, accuracy, scalability, and resource utilization
– RDMA-based DLoBD stacks (e.g., BigDL over RDMA-Spark) can achieve 2.6x speedup compared to the IPoIB-based scheme, while maintaining similar accuracy
[Chart: epoch time (secs) and accuracy (%) vs. epoch number (1-18), IPoIB vs. RDMA; 2.6x speedup with RDMA at similar accuracy]
Performance Evaluation for IPoIB and RDMA with CaffeOnSpark and TensorFlowOnSpark (IB EDR)
• CaffeOnSpark benefits from the high performance of RDMA compared to IPoIB once communication overhead becomes significant.
• Our experiments show that the default RDMA design in TensorFlowOnSpark is not fully optimized yet. For MNIST tests, RDMA is not showing obvious benefits.
[Charts: end-to-end time (secs), IPoIB vs. RDMA, for CaffeOnSpark (CIFAR-10 Quick and LeNet on MNIST, GPU and GPU-cuDNN, 2/4/8/16 nodes) and TensorFlowOnSpark (CIFAR-10 with 2 GPUs/node, MNIST with 1 GPU/node, 2/4/8 GPUs); annotated improvements range from 4.9% to 51.2%]
• Basic Designs for HiBD Packages
– HDFS and MapReduce
– Spark
– Memcached
– Kafka
• Deep Learning
– TensorFlow with gRPC
– Deep Learning over Big Data (Caffe on Spark, Tensorflow on Spark, ..)
• Cloud Support for Big Data Stacks
– Virtualization with SR-IOV and Containers
– Swift Object Store
Acceleration Case Studies and Performance Evaluation
Overview of RDMA-Hadoop-Virt Architecture
• Virtualization-aware modules in all four main Hadoop components:
– HDFS: Virtualization-aware Block Management
to improve fault-tolerance
– YARN: Extensions to Container Allocation Policy
to reduce network traffic
– MapReduce: Extensions to Map Task Scheduling
Policy to reduce network traffic
– Hadoop Common: Topology Detection Module
for automatic topology detection
• Communications in HDFS, MapReduce, and RPC
go through RDMA-based designs over SR-IOV
enabled InfiniBand
[Architecture diagram: Big Data applications (CloudBurst, MR-MS Polygraph, others) on HDFS, YARN, MapReduce, HBase, and Hadoop Common, with virtualization-aware block management, a container allocation policy extension, a map task scheduling policy extension, and a topology detection module, running on virtual machines, containers, or bare-metal nodes]
S. Gugnani, X. Lu, D. K. Panda. Designing Virtualization-aware and Automatic Topology Detection Schemes for Accelerating Hadoop on
SR-IOV-enabled Clouds. CloudCom, 2016.
Evaluation with Applications
– 14% and 24% improvement with Default Mode for CloudBurst and Self-Join
– 30% and 55% improvement with Distributed Mode for CloudBurst and Self-Join
[Charts: execution time for CloudBurst and Self-Join in Default and Distributed modes, RDMA-Hadoop vs. RDMA-Hadoop-Virt; 30% and 55% reductions in Distributed mode]
Will be released in the future
• Basic Designs for HiBD Packages
– HDFS and MapReduce
– Spark
– Memcached
– Kafka
• Deep Learning
– TensorFlow with gRPC
– Deep Learning over Big Data (Caffe on Spark, Tensorflow on Spark, ..)
• Cloud Support for Big Data Stacks
– Virtualization with SR-IOV and Containers
– Swift Object Store
Acceleration Case Studies and Performance Evaluation
• Distributed Cloud-based Object Storage Service
• Deployed as part of OpenStack installation
• Can be deployed as standalone storage solution as well
• Worldwide data access via Internet
– HTTP-based
• Architecture
– Multiple Object Servers: To store data
– Few Proxy Servers: Act as a proxy for all requests
– Ring: Handles metadata
• Usage
– Input/output source for Big Data applications (most common use
case)
– Software/Data backup
– Storage of VM/Docker images
• Based on traditional TCP sockets communication
OpenStack Swift Overview
[Swift architecture diagram: clients send PUT/GET requests (PUT/GET /v1/<account>/<container>/<object>) to a proxy server, which uses the ring to route them to object servers, each managing its local disks]
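Since the interface is plain HTTP, any Swift client can exercise these PUT/GET paths; the sketch below uses python-swiftclient with placeholder endpoint and credentials. In the client-oblivious design discussed next, the same client code works unchanged on the accelerated deployment.

```python
from swiftclient.client import Connection

# Minimal PUT/GET against a Swift proxy; auth URL, credentials, container,
# and object name are placeholders for illustration only.
conn = Connection(
    authurl="http://proxy-host:8080/auth/v1.0",
    user="test:tester",
    key="testing",
)

conn.put_container("demo")
conn.put_object("demo", "hello.txt", contents=b"hello swift")

headers, body = conn.get_object("demo", "hello.txt")
print(body)
```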
• Challenges
– Proxy server is a bottleneck for large scale deployments
– Object upload/download operations are network intensive
– Can an RDMA-based approach benefit?
• Design
– Re-designed Swift architecture for improved scalability and
performance; Two proposed designs:
• Client-Oblivious Design: No changes required on the client side
• Metadata Server-based Design: Direct communication between
client and object servers; bypass proxy server
– RDMA-based communication framework for accelerating
networking performance
– High-performance I/O framework to provide maximum
overlap between communication and I/O
Swift-X: Accelerating OpenStack Swift with RDMA for Building Efficient HPC Clouds
S. Gugnani, X. Lu, and D. K. Panda, Swift-X: Accelerating OpenStack Swift with RDMA for Building an Efficient HPC Cloud,
accepted at CCGrid’17, May 2017
Client-Oblivious Design
(D1)
Metadata Server-based
Design (D2)
Swift-X: Accelerating OpenStack Swift with RDMA for Building Efficient HPC Clouds
[Charts: time breakup of GET and PUT (communication, I/O, hashsum, other) for Swift, Swift-X (D1), and Swift-X (D2); GET latency (s) vs. object size (1 MB-4 GB) for Swift, Swift-X (D2), and Swift-X (D1), reduced by up to 66%]
• Up to 66% reduction in GET latency
• Communication time reduced by up to 3.8x for PUT and up to 2.8x for GET
• Discussed challenges in accelerating Big Data and Deep Learning stacks with HPC technologies
• Presented basic and advanced designs to take advantage of InfiniBand/RDMA for
HDFS, MapReduce, HBase, Spark, Memcached, gRPC, and TensorFlow on HPC clusters
and clouds
• Results are promising for Big Data processing and the associated Deep Learning tools
• Many other open issues need to be solved
• Will enable Big Data and Deep Learning communities to take advantage of modern HPC
technologies to carry out their analytics in a fast and scalable manner
Concluding Remarks
One more Presentation
• Today at 11:45 am
Designing Scalable HPC, Deep Learning and Cloud Middleware for Exascale Systems
Funding Acknowledgments
Funding Support by
Equipment Support by
Personnel Acknowledgments
Current Students (Graduate)
– A. Awan (Ph.D.)
– R. Biswas (M.S.)
– M. Bayatpour (Ph.D.)
– S. Chakraborthy (Ph.D.)
– C.-H. Chu (Ph.D.)
– S. Guganani (Ph.D.)
Past Students
– A. Augustine (M.S.)
– P. Balaji (Ph.D.)
– S. Bhagvat (M.S.)
– A. Bhat (M.S.)
– D. Buntinas (Ph.D.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– R. Rajachandrasekar (Ph.D.)
– G. Santhanaraman (Ph.D.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
Past Research Scientist
– K. Hamidouche
– S. Sur
Past Post-Docs
– D. Banerjee
– X. Besseron
– H.-W. Jin
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– J. Jose (Ph.D.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– K. Kulkarni (M.S.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– M. Li (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– A. Moody (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– J. Hashmi (Ph.D.)
– H. Javed (Ph.D.)
– P. Kousha (Ph.D.)
– D. Shankar (Ph.D.)
– H. Shi (Ph.D.)
– J. Zhang (Ph.D.)
– J. Lin
– M. Luo
– E. Mancini
Current Research Scientists
– X. Lu
– H. Subramoni
Past Programmers
– D. Bureddy
– J. Perkins
Current Research Specialist
– J. Smith
– M. Arnold
– S. Marcarelli
– J. Vienne
– H. Wang
Current Post-doc
– A. Ruhela
– K. Manian
Current Students (Undergraduate)
– N. Sarkauskas (B.S.)
The 4th International Workshop on High-Performance Big Data Computing (HPBDC)
HPBDC 2018 will be held with the IEEE International Parallel and Distributed Processing Symposium (IPDPS 2018), Vancouver, British Columbia, Canada, May 2018
Workshop Date: May 21st, 2018
Keynote Talk: Prof. Geoffrey Fox, Twister2: A High-Performance Big Data Programming Environment
Six Regular Research Papers and Two Short Research Papers
Panel Topic: Which Framework is the Best for High-Performance Deep Learning:
Big Data Framework or HPC Framework?
http://web.cse.ohio-state.edu/~luxi/hpbdc2018
HPBDC 2017 was held in conjunction with IPDPS’17
http://web.cse.ohio-state.edu/~luxi/hpbdc2017
HPBDC 2016 was held in conjunction with IPDPS’16
http://web.cse.ohio-state.edu/~luxi/hpbdc2016
http://www.cse.ohio-state.edu/~panda
Thank You!
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/