Exploiting HPC Technologies for Accelerating Big Data Processing and Associated Deep Learning
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Keynote Talk at the Swiss Conference (April ’18)
• Big Data has changed the way people understand and harness the power of data, in both the business and research domains
• Big Data has become one of the most important elements in business analytics
• Big Data and High Performance Computing (HPC) are converging to meet large-scale data processing challenges
• Running High Performance Data Analysis (HPDA) workloads in the cloud is gaining popularity
• According to the latest OpenStack survey, 27% of cloud deployments are running HPDA workloads
Introduction to Big Data Analytics and Trends
http://www.coolinfographics.com/blog/tag/data?currentPage=3
http://www.climatecentral.org/news/white-house-brings-together-big-data-and-climate-change-17194
• From 2005 to 2020, the digital universe will grow by a factor of 300, from 130 exabytes to 40,000 exabytes.
• By 2020, a third of the data in the digital universe (more than 13,000 exabytes) will have Big Data value, but only if it is tagged and analyzed.
Big Volume of Data by the End of this Decade
Courtesy: John Gantz and David Reinsel, "The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East”, IDC's Digital Universe Study, sponsored by EMC, December 2012.
Big Velocity – How Much Data Is Generated Every Minute on the Internet?
The global Internet population grew 7.5% from 2016 and now represents 3.7 billion people.
Courtesy: https://www.domo.com/blog/data-never-sleeps-5/
• Scientific Data Management, Analysis, and Visualization
• Applications examples
– Climate modeling
– Combustion
– Fusion
– Astrophysics
– Bioinformatics
• Data Intensive Tasks
– Run large-scale simulations on supercomputers
– Dump data on parallel storage systems
– Collect experimental / observational data
– Move experimental / observational data to analysis sites
– Visual analytics – help understand data visually
Not Only in Internet Services - Big Data in Scientific Domains
Big Data (Hadoop, Spark, HBase, Memcached, etc.)
Deep Learning (Caffe, TensorFlow, BigDL, etc.)
HPC (MPI, RDMA, Lustre, etc.)
Increasing Usage of HPC, Big Data and Deep Learning
Convergence of HPC, Big Data, and Deep Learning!
Increasing Need to Run these Applications on the Cloud!!
• Substantial impact on designing and utilizing data management and processing systems in multiple tiers
– Front-end data accessing and serving (Online)
• Memcached + DB (e.g., MySQL), HBase (a minimal cache-aside sketch follows below)
– Back-end data analytics (Offline)
• HDFS, MapReduce, Spark
Big Data Management and Processing on Modern Clusters
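To make the online tier concrete, here is a minimal cache-aside sketch in Python. It assumes a local Memcached server on the default port and a hypothetical query_mysql() helper standing in for the backing database; it only illustrates the access pattern and is not part of the RDMA-Memcached package.

```python
from pymemcache.client.base import Client

# Cache-aside sketch for the online tier: Memcached in front of a database.
# The Memcached address and query_mysql() helper are illustrative assumptions.
cache = Client(("127.0.0.1", 11211))

def query_mysql(key):
    # Hypothetical stand-in for a lookup in the backing MySQL store
    return ("value-for-" + key).encode()

def cached_get(key):
    value = cache.get(key)                  # 1. try the cache first
    if value is None:                       # 2. miss: fall back to the database
        value = query_mysql(key)
        cache.set(key, value, expire=300)   # 3. populate the cache for later reads
    return value

print(cached_get("user:42"))
```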
DLoBD pipeline stages: (1) Prepare datasets @ scale, (2) Deep learning @ scale, (3) Non-deep-learning analytics @ scale, (4) Apply ML model @ scale
• Deep Learning over Big Data (DLoBD) is one of the most efficient paradigms for large-scale data analytics
• More and more deep learning tools and libraries (e.g., Caffe, TensorFlow) are starting to run over big data stacks, such as Apache Hadoop and Spark
• Benefits of the DLoBD approach
– Easily build a powerful data analytics pipeline (a minimal PySpark sketch of such a pipeline follows below)
• E.g., Flickr DL/ML Pipeline, “How Deep Learning Powers Flickr”, http://bit.ly/1KIDfof
– Better data locality
– Efficient resource sharing and cost effectiveness
Deep Learning over Big Data (DLoBD)
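As a rough sketch of the four-stage DLoBD pipeline above, the PySpark snippet below prepares data, hands it to a deep learning step, runs a non-DL analytics step, and applies the model. The input path and the train_model() helper are hypothetical placeholders for a DL library running on the same cluster (e.g., BigDL or TensorFlowOnSpark); this is not the API of any specific stack.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dlobd-pipeline").getOrCreate()

# (1) Prepare datasets at scale: read and clean raw records (assumed HDFS path)
raw = spark.read.text("hdfs:///data/raw")
samples = raw.rdd.map(lambda row: row.value.strip()).filter(lambda s: s)

# (2) Deep learning at scale: hand the distributed dataset to a DL tool
def train_model(rdd):
    # Hypothetical: a DLoBD library would train directly on the RDD/DataFrame
    return "model"

model = train_model(samples)

# (3) Non-deep-learning analytics at scale (e.g., simple statistics)
num_samples = samples.count()

# (4) Apply the trained model at scale (placeholder scoring step)
predictions = samples.map(lambda s: (s, model))
print(num_samples, predictions.take(1))
spark.stop()
```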
• CaffeOnSpark
• SparkNet
• TensorFlowOnSpark
• TensorFrame
• DeepLearning4J
• BigDL
• mmlspark
– CNTKOnSpark
• Many others…
Examples of DLoBD Stacks
Drivers of Modern HPC Cluster and Data Center Architecture
• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
– Single Root I/O Virtualization (SR-IOV)
• Solid State Drives (SSDs), NVM, Parallel Filesystems, Object Storage Clusters
• Accelerators (NVIDIA GPGPUs and FPGAs)
[Figure: building blocks of modern HPC clusters and data centers – high-performance interconnects such as InfiniBand with SR-IOV (<1 usec latency, 200 Gbps bandwidth); multi-/many-core processors; accelerators/coprocessors with high compute density and high performance/watt (>1 TFlop DP on a chip); SSD, NVMe-SSD, and NVRAM storage; cloud systems such as SDSC Comet and TACC Stampede]
Interconnects and Protocols in the OpenFabrics Stack for HPC (http://openfabrics.org)
[Diagram: application/middleware interface, protocol, adapter, and switch for each option]
• 1/10/40/100 GigE – Sockets interface; kernel-space TCP/IP over the Ethernet driver; Ethernet adapter and Ethernet switch
• 10/40 GigE-TOE – Sockets interface; hardware-offloaded TCP/IP; Ethernet adapter and Ethernet switch
• IPoIB – Sockets interface; kernel-space IPoIB; InfiniBand adapter and InfiniBand switch
• RSockets – Sockets interface; user-space RSockets; InfiniBand adapter and InfiniBand switch
• SDP – Sockets interface; RDMA-based SDP; InfiniBand adapter and InfiniBand switch
• iWARP – Verbs interface; user-space TCP/IP with iWARP; iWARP adapter and Ethernet switch
• RoCE – Verbs interface; user-space RDMA; RoCE adapter and Ethernet switch
• IB Native – Verbs interface; user-space RDMA; InfiniBand adapter and InfiniBand switch
How Can HPC Clusters with High-Performance Interconnect and Storage Architectures Benefit Big Data Applications?
Bring HPC and Big Data processing into a “convergent trajectory”!
• What are the major bottlenecks in current Big Data processing middleware (e.g., Hadoop, Spark, and Memcached)?
• Can the bottlenecks be alleviated with new designs that take advantage of HPC technologies?
• Can RDMA-enabled high-performance interconnects benefit Big Data processing?
• Can HPC clusters with high-performance storage systems (e.g., SSDs, parallel file systems) benefit Big Data applications?
• How much performance benefit can be achieved through enhanced designs?
• How do we design benchmarks for evaluating the performance of Big Data middleware on HPC clusters?
Can We Run Big Data and Deep Learning Jobs on Existing HPC Infrastructure?
[Figure: Spark, Hadoop, and Deep Learning jobs mapped onto an existing HPC cluster]
Designing Communication and I/O Libraries for Big Data Systems: Challenges
[Proposed stack, top to bottom:]
• Applications
• Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached) – upper-level changes?
• Programming Models (Sockets) – other protocols?
• Communication and I/O Library: point-to-point communication, threaded models and synchronization, QoS & fault tolerance, performance tuning, I/O and file systems, virtualization (SR-IOV), benchmarks
• Networking Technologies (InfiniBand, 1/10/40/100 GigE and Intelligent NICs)
• Storage Technologies (HDD, SSD, NVM, and NVMe-SSD)
• Commodity Computing System Architectures (multi- and many-core architectures and accelerators)
• RDMA for Apache Spark
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
– Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions
• RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• OSU HiBD-Benchmarks (OHB)
– HDFS, Memcached, HBase, and Spark Micro-benchmarks
• http://hibd.cse.ohio-state.edu
• User base: 280 organizations from 34 countries
• More than 25,750 downloads from the project site
The High-Performance Big Data (HiBD) Project
Available for InfiniBand and RoCE; also runs on Ethernet
Available for x86 and OpenPOWER
Upcoming release will have support for Singularity and Docker
HiBD Release Timeline and Downloads
[Chart: cumulative number of downloads over time, annotated with release milestones – RDMA-Hadoop 1.x 0.9.0/0.9.8/0.9.9, RDMA-Memcached 0.9.1 & OHB 0.7.1, RDMA-Hadoop 2.x 0.9.1/0.9.6/0.9.7/1.0.0/1.3.0, RDMA-Spark 0.9.1/0.9.4/0.9.5, RDMA-HBase 0.9.1, RDMA-Memcached 0.9.4/0.9.6 & OHB 0.9.3]
• High-Performance Design of Hadoop over RDMA-enabled Interconnects
– High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for HDFS, MapReduce, and
RPC components
– Enhanced HDFS with in-memory and heterogeneous storage
– High performance design of MapReduce over Lustre
– Memcached-based burst buffer for MapReduce over Lustre-integrated HDFS (HHH-L-BB mode)
– Plugin-based architecture supporting RDMA-based designs for Apache Hadoop, CDH and HDP
– Support for OpenPOWER
• Current release: 1.3.0
– Based on Apache Hadoop 2.8.0
– Compliant with Apache Hadoop 2.8.0, HDP 2.5.0.3 and CDH 5.8.2 APIs and applications
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR)
• RoCE support with Mellanox adapters
• Various multi-core platforms (x86, POWER)
• Different file systems with disks and SSDs and Lustre
RDMA for Apache Hadoop 2.x Distribution
http://hibd.cse.ohio-state.edu
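As a quick sanity check after installing the package, one can launch a standard TeraGen/TeraSort run exactly as with stock Hadoop; the sketch below drives the stock MapReduce examples jar from Python. The jar path, HDFS paths, and data size are illustrative assumptions, not values from the HiBD documentation.

```python
import subprocess

# Minimal sketch: run the standard Hadoop examples (TeraGen + TeraSort)
# against an RDMA-Hadoop installation. Paths and sizes are assumptions.
examples_jar = "/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar"

def run(*args):
    print("running:", " ".join(args))
    subprocess.check_call(args)

rows = str(10_000_000)  # 10M 100-byte rows, roughly 1 GB
run("hadoop", "jar", examples_jar, "teragen", rows, "/user/test/tera-in")
run("hadoop", "jar", examples_jar, "terasort", "/user/test/tera-in", "/user/test/tera-out")
```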
• HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation to provide better fault-tolerance as well as performance. This mode is enabled by default in the package.
• HHH-M: A high-performance in-memory based setup has been introduced in this package that can be utilized to perform all I/O operations in memory and obtain as much performance benefit as possible.
• HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre available in the cluster.
• HHH-L-BB: This mode deploys a Memcached-based burst buffer system to reduce the bandwidth bottleneck of shared file system access. The burst buffer design is hosted by Memcached servers, each of which has a local SSD.
• MapReduce over Lustre, with/without local disks: Besides the HDFS-based solutions, this package also provides support to run MapReduce jobs on top of Lustre alone. Here, two different modes are introduced: with local disks and without local disks.
• Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in the different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre).
Different Modes of RDMA for Apache Hadoop 2.x
• High-Performance Design of Spark over RDMA-enabled Interconnects
– High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for Spark
– RDMA-based data shuffle and SEDA-based shuffle architecture
– Non-blocking and chunk-based data transfer
– Off-JVM-heap buffer management
– Support for OpenPOWER
– Easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB)
• Current release: 0.9.5
– Based on Apache Spark 2.1.0
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR)
• RoCE support with Mellanox adapters
• Various multi-core platforms (x86, POWER)
• RAM disks, SSDs, and HDD
– http://hibd.cse.ohio-state.edu
RDMA for Apache Spark Distribution
• High-Performance Design of HBase over RDMA-enabled Interconnects
– High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level
for HBase
– Compliant with Apache HBase 1.1.2 APIs and applications
– On-demand connection setup
– Easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB)
• Current release: 0.9.1
– Based on Apache HBase 1.1.2
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR)
• RoCE support with Mellanox adapters
• Various multi-core platforms
– http://hibd.cse.ohio-state.edu
RDMA for Apache HBase Distribution
• High-Performance Design of Memcached over RDMA-enabled Interconnects
– High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for Memcached and
libMemcached components
– High performance design of SSD-Assisted Hybrid Memory
– Non-Blocking Libmemcached Set/Get API extensions
– Support for burst-buffer mode in Lustre-integrated design of HDFS in RDMA for Apache Hadoop-2.x
– Easily configurable for native InfiniBand, RoCE and the traditional sockets-based support (Ethernet and InfiniBand with
IPoIB)
• Current release: 0.9.6
– Based on Memcached 1.5.3 and libMemcached 1.0.18
– Compliant with libMemcached APIs and applications
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR)
• RoCE support with Mellanox adapters
• Various multi-core platforms
• SSD
– http://hibd.cse.ohio-state.edu
RDMA for Memcached Distribution
• Micro-benchmarks for Hadoop Distributed File System (HDFS)
– Sequential Write Latency (SWL) Benchmark, Sequential Read Latency (SRL) Benchmark, Random Read Latency
(RRL) Benchmark, Sequential Write Throughput (SWT) Benchmark, Sequential Read Throughput (SRT)
Benchmark
– Support benchmarking of
• Apache Hadoop 1.x and 2.x HDFS, Hortonworks Data Platform (HDP) HDFS, Cloudera Distribution of Hadoop (CDH)
HDFS
• Micro-benchmarks for Memcached
– Get Benchmark, Set Benchmark, and Mixed Get/Set Benchmark, Non-Blocking API Latency Benchmark, Hybrid
Memory Latency Benchmark
– Yahoo! Cloud Serving Benchmark (YCSB) Extension for RDMA-Memcached
• Micro-benchmarks for HBase
– Get Latency Benchmark, Put Latency Benchmark
• Micro-benchmarks for Spark
– GroupBy, SortBy
• Current release: 0.9.3
• http://hibd.cse.ohio-state.edu
OSU HiBD Micro-Benchmark (OHB) Suite – HDFS, Memcached, HBase, and Spark
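To illustrate what the Memcached Get/Set micro-benchmarks measure, here is a minimal latency loop in Python using pymemcache; it is not the OHB code itself, and the server address, value size, and iteration count are assumptions for illustration.

```python
import time
from pymemcache.client.base import Client

# Minimal sketch of a Set/Get latency micro-benchmark (not the OHB code):
# time repeated set and get operations for a fixed value size.
client = Client(("127.0.0.1", 11211))  # assumed Memcached server
value = b"x" * 4096                    # assumed 4 KB value size
iters = 10000

def time_op(op):
    start = time.perf_counter()
    for i in range(iters):
        op("bench:%d" % i)
    return (time.perf_counter() - start) / iters * 1e6  # average usec per op

set_lat = time_op(lambda k: client.set(k, value))
get_lat = time_op(lambda k: client.get(k))
print("avg set latency: %.2f us, avg get latency: %.2f us" % (set_lat, get_lat))
```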
Using HiBD Packages on Existing HPC Infrastructure
• RDMA for Apache Hadoop 2.x and RDMA for Apache Spark are installed and
available on SDSC Comet.
– Examples for various modes of usage are available in:
• RDMA for Apache Hadoop 2.x: /share/apps/examples/HADOOP
• RDMA for Apache Spark: /share/apps/examples/SPARK/
– Please email [email protected] (reference Comet as the machine, and SDSC as the
site) if you have any further questions about usage and configuration.
• RDMA for Apache Hadoop is also available on Chameleon Cloud as an
appliance
– https://www.chameleoncloud.org/appliances/17/
HiBD Packages on SDSC Comet and Chameleon Cloud
M. Tatineni, X. Lu, D. J. Choi, A. Majumdar, and D. K. Panda, Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC
Comet, XSEDE’16, July 2016
• Basic Designs for HiBD Packages
– HDFS and MapReduce
– Spark
– Memcached
– Kafka
• Deep Learning
– TensorFlow with gRPC
– Deep Learning over Big Data (Caffe on Spark, Tensorflow on Spark, ..)
• Cloud Support for Big Data Stacks
– Virtualization with SR-IOV and Containers
– Swift Object Store
Acceleration Case Studies and Performance Evaluation
• Enables high performance RDMA communication, while supporting traditional socket interface
• JNI Layer bridges Java based HDFS with communication library written in native code
Design Overview of HDFS with RDMA
[Design diagram: Applications → HDFS → Java Socket Interface (other operations) or Java Native Interface (JNI) for write (OSU design) → Verbs → 1/10/40/100 GigE / IPoIB network or RDMA-capable networks (IB, iWARP, RoCE, ..)]
• Design Features
– RDMA-based HDFS write
– RDMA-based HDFS replication
– Parallel replication support
– On-demand connection setup
– InfiniBand/RoCE support
N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda , High Performance RDMA-Based Design of HDFS
over InfiniBand , Supercomputing (SC), Nov 2012
N. Islam, X. Lu, W. Rahman, and D. K. Panda, SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC '14, June 2014
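From the application's point of view the HDFS API is unchanged; the acceleration sits below the JNI layer. As a sketch, a client can keep writing and reading files through any standard HDFS interface, e.g. the Python hdfs (WebHDFS) client below; the namenode URL, user, and paths are assumptions.

```python
from hdfs import InsecureClient

# Applications keep using standard HDFS interfaces; the RDMA-enhanced data
# path lives below the JNI layer. Namenode URL and paths are illustrative.
client = InsecureClient("http://namenode:50070", user="hadoop")

with client.write("/user/hadoop/demo.txt", overwrite=True) as writer:
    writer.write(b"hello from an unchanged HDFS client\n")

with client.read("/user/hadoop/demo.txt") as reader:
    print(reader.read())
```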
Triple-H
Heterogeneous Storage
• Design Features
– Three modes
• Default (HHH)
• In-Memory (HHH-M)
• Lustre-Integrated (HHH-L)
– Policies to efficiently utilize the heterogeneous
storage devices
• RAM, SSD, HDD, Lustre
– Eviction/Promotion based on data usage
pattern
– Hybrid Replication
– Lustre-Integrated mode:
• Lustre-based fault-tolerance
Enhanced HDFS with In-Memory and Heterogeneous Storage
[Triple-H architecture: Applications → data placement policies, hybrid replication, and eviction/promotion across RAM disk, SSD, HDD, and Lustre]
N. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters
with Heterogeneous Storage Architecture, CCGrid ’15, May 2015
Design Overview of MapReduce with RDMA
[Design diagram: Applications → MapReduce (Job Tracker, Task Tracker, Map, Reduce) → Java Socket Interface or Java Native Interface (JNI, OSU design) → Verbs → 1/10/40/100 GigE / IPoIB network or RDMA-capable networks (IB, iWARP, RoCE, ..)]
• Enables high performance RDMA communication, while supporting traditional socket interface
• JNI Layer bridges Java based MapReduce with communication library written in native code
• Design Features
– RDMA-based shuffle
– Prefetching and caching map output
– Efficient Shuffle Algorithms
– In-memory merge
– On-demand Shuffle Adjustment
– Advanced overlapping
• map, shuffle, and merge
• shuffle, merge, and reduce
– On-demand connection setup
– InfiniBand/RoCE support
M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in
MapReduce over High Performance Interconnects, ICS, June 2014
Performance Numbers of RDMA for Apache Hadoop 2.x – RandomWriter & TeraGen on OSU-RI2 (EDR)
Cluster with 8 nodes and a total of 64 maps
• RandomWriter: execution time reduced by 3x over IPoIB (EDR) for 80-160 GB data sizes
• TeraGen: execution time reduced by 4x over IPoIB (EDR) for 80-240 GB data sizes
[Charts: execution time (s) vs. data size (GB), IPoIB (EDR) vs. OSU-IB (EDR), for RandomWriter and TeraGen]
Performance Numbers of RDMA for Apache Hadoop 2.x – Sort & TeraSort on OSU-RI2 (EDR)
Cluster with 8 nodes and a total of 64 maps, with 32 reduces (Sort) or 14 reduces (TeraSort)
• Sort: 61% improvement over IPoIB (EDR) for 80-160 GB data
• TeraSort: 18% improvement over IPoIB (EDR) for 80-240 GB data
[Charts: execution time (s) vs. data size (GB), IPoIB (EDR) vs. OSU-IB (EDR); Sort reduced by 61%, TeraSort reduced by 18%]
Evaluation with Spark on SDSC Gordon (HHH vs. Tachyon/Alluxio)
• For 200GB TeraGen on 32 nodes
– Spark-TeraGen: HHH has 2.4x improvement over Tachyon; 2.3x over HDFS-IPoIB (QDR)
– Spark-TeraSort: HHH has 25.2% improvement over Tachyon; 17% over HDFS-IPoIB (QDR)
[Charts: execution time (s) vs. cluster size : data size (GB) (8:50, 16:100, 32:200) for IPoIB (QDR), Tachyon, and OSU-IB (QDR); TeraGen reduced by 2.4x, TeraSort reduced by 25.2%]
N. Islam, M. W. Rahman, X. Lu, D. Shankar, and D. K. Panda, Performance Characterization and Acceleration of In-Memory File
Systems for Hadoop and Spark Applications on HPC Clusters, IEEE BigData ’15, October 2015
• Basic Designs for HiBD Packages
– HDFS and MapReduce
– Spark
– Memcached
– Kafka
• Deep Learning
– TensorFlow with gRPC
– Deep Learning over Big Data (Caffe on Spark, Tensorflow on Spark, ..)
• Cloud Support for Big Data Stacks
– Virtualization with SR-IOV and Containers
– Swift Object Store
Acceleration Case Studies and Performance Evaluation
• Design Features
– RDMA based shuffle plugin
– SEDA-based architecture
– Dynamic connection management and sharing
– Non-blocking data transfer
– Off-JVM-heap buffer management
– InfiniBand/RoCE support
Design Overview of Spark with RDMA
• Enables high performance RDMA communication, while supporting traditional socket interface
• JNI Layer bridges Scala based Spark with communication library written in native code
X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High
Performance Interconnects (HotI'14), August 2014
X. Lu, D. Shankar, S. Gugnani, and D. K. Panda, High-Performance Design of Apache Spark with RDMA and Its Benefits on Various Workloads, IEEE BigData ‘16, Dec. 2016.
[Design diagram: Apache Spark benchmarks/applications/libraries/frameworks → Spark Core → Shuffle Manager (Sort, Hash, Tungsten-Sort) → Block Transfer Service (Netty, NIO, RDMA-Plugin) with Netty/NIO/RDMA servers and clients → Java Socket Interface or Java Native Interface (JNI) with a native RDMA-based communication engine → 1/10/40/100 GigE / IPoIB network or RDMA-capable networks (IB, iWARP, RoCE, ..)]
• InfiniBand FDR, SSD, 64 Worker Nodes, 1536 Cores, (1536M 1536R)
• RDMA vs. IPoIB with 1536 concurrent tasks, single SSD per node.
– SortBy: Total time reduced by up to 80% over IPoIB (56Gbps)
– GroupBy: Total time reduced by up to 74% over IPoIB (56Gbps)
Performance Evaluation on SDSC Comet – SortBy/GroupBy
[Charts: SortByTest and GroupByTest total time (sec) vs. data size (GB) (64, 128, 256) on 64 worker nodes / 1536 cores, IPoIB vs. RDMA; SortBy reduced by up to 80%, GroupBy reduced by up to 74%]
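For reference, the operations behind these numbers are ordinary shuffle-heavy Spark transformations; the minimal PySpark sketch below generates random key/value pairs and sorts and groups them. The data volume and key range are tiny, illustrative assumptions, not the benchmark configuration used above.

```python
import random
from pyspark.sql import SparkSession

# Shuffle-heavy PySpark sketch in the spirit of the SortBy/GroupBy tests.
spark = SparkSession.builder.appName("sortby-groupby-sketch").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize(range(1_000_000)) \
          .map(lambda i: (random.randint(0, 9999), i))

sorted_rdd = pairs.sortByKey()     # SortBy-style shuffle
grouped_rdd = pairs.groupByKey()   # GroupBy-style shuffle

print(sorted_rdd.take(3))
print([(k, len(list(v))) for k, v in grouped_rdd.take(3)])
spark.stop()
```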
• InfiniBand FDR, SSD, 32/64 Worker Nodes, 768/1536 Cores, (768/1536M 768/1536R)
• RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node.
– 32 nodes/768 cores: Total time reduced by 37% over IPoIB (56Gbps)
– 64 nodes/1536 cores: Total time reduced by 43% over IPoIB (56Gbps)
Performance Evaluation on SDSC Comet – HiBench PageRank
[Charts: PageRank total time (sec) for Huge, BigData, and Gigantic data sizes on 32 worker nodes / 768 cores and on 64 worker nodes / 1536 cores, IPoIB vs. RDMA; reduced by 37% and 43%, respectively]
Application Evaluation on SDSC Comet
• Kira Toolkit: Distributed astronomy image processing
toolkit implemented using Apache Spark
– https://github.com/BIDS/Kira
• Source extractor application, using a 65GB dataset from
the SDSS DR2 survey that comprises 11,150 image files.
[Chart: execution times (sec) for the Kira SE benchmark using a 65 GB dataset on 48 cores; RDMA Spark is 21% faster than Apache Spark (IPoIB)]
M. Tatineni, X. Lu, D. J. Choi, A. Majumdar, and D. K. Panda, Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC
Comet, XSEDE’16, July 2016
• BigDL: Distributed Deep Learning Tool using Apache Spark
– https://github.com/intel-analytics/BigDL
• VGG training model on the CIFAR-10 dataset
[Chart: one-epoch time (sec) vs. number of cores (24-384), IPoIB vs. RDMA; up to 4.58x speedup with RDMA]
• Challenges
– Performance characteristics of RDMA-based Hadoop and Spark over OpenPOWER systems with IB networks
– Can RDMA-based Big Data middleware demonstrate good performance benefits on OpenPOWER systems?
– Any new accelerations for Big Data middleware on OpenPOWER systems to attain better benefits?
• Results
– For the Spark SortBy benchmark, RDMA-Opt outperforms IPoIB and RDMA-Def by 23.5% and 10.5%
Accelerating Hadoop and Spark with RDMA over OpenPOWER
[Chart: Spark SortBy latency (sec) for 32 GB, 64 GB, and 96 GB data sizes with IPoIB, RDMA-Def, and RDMA-Opt; reduced by 23.5% with RDMA-Opt]
[Figure: IBM POWER8 architecture – two sockets of SMT cores connected by a chip interconnect, each socket with memory controller(s) to DDR4 memory and a PCIe interface connecting to the IB HCA]
X. Lu, H. Shi, D. Shankar and D. K. Panda, Performance Characterization and Acceleration of Big Data Workloads on OpenPOWER System, IEEE BigData 2017.
• Basic Designs for HiBD Packages
– HDFS and MapReduce
– Spark
– Memcached
– Kafka
• Deep Learning
– TensorFlow with gRPC
– Deep Learning over Big Data (Caffe on Spark, Tensorflow on Spark, ..)
• Cloud Support for Big Data Stacks
– Virtualization with SR-IOV and Containers
– Swift Object Store
Acceleration Case Studies and Performance Evaluation
[Charts: Memcached GET latency (us) vs. message size (1 B-4 KB) and throughput (thousands of transactions per second, TPS) vs. number of clients (16-4080), IPoIB vs. OSU-IB (FDR)]
• Memcached Get latency
– 4 bytes OSU-IB: 2.84 us; IPoIB: 75.53 us, 2K bytes OSU-IB: 4.49 us; IPoIB: 123.42 us
• Memcached Throughput (4bytes)
– 4080 clients OSU-IB: 556 Kops/sec, IPoIB: 233 Kops/s, Nearly 2X improvement in throughput
Memcached Performance (FDR Interconnect)
Experiments on TACC Stampede (Intel SandyBridge cluster, IB: FDR); latency reduced by nearly 20x and throughput improved by ~2x over IPoIB
J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, M. W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda, Memcached Design on High
Performance RDMA Capable Interconnects, ICPP’11
J. Jose, H. Subramoni, K. Kandalla, M. W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Scalable Memcached design for InfiniBand Clusters using Hybrid
Transport, CCGrid’12
• Illustration with a Read-Cache-Read access pattern using a modified mysqlslap load-testing tool
• Memcached-RDMA can
– improve query latency by up to 66% over IPoIB (32Gbps)
– improve throughput by up to 69% over IPoIB (32Gbps)
Micro-benchmark Evaluation for OLDP Workloads
[Charts: latency (sec) and throughput (Kq/s) vs. number of clients (64-400), Memcached-IPoIB (32Gbps) vs. Memcached-RDMA (32Gbps)]
D. Shankar, X. Lu, J. Jose, M. W. Rahman, N. Islam, and D. K. Panda, Can RDMA Benefit On-Line Data Processing Workloads
with Memcached and MySQL, ISPASS’15
• Data does not fit in memory: non-blocking Memcached Set/Get API extensions can achieve
– >16x latency improvement vs. blocking API over RDMA-Hybrid/RDMA-Mem w/ penalty
– >2.5x throughput improvement vs. blocking API over default/optimized RDMA-Hybrid
• Data fits in memory: non-blocking extensions perform similar to RDMA-Mem/RDMA-Hybrid and give >3.6x improvement over IPoIB-Mem
Performance Evaluation with Non-Blocking Memcached API
[Chart: average latency (us) for Set and Get operations with IPoIB-Mem, RDMA-Mem, H-RDMA-Def, H-RDMA-Opt-Block, H-RDMA-Opt-NonB-i, and H-RDMA-Opt-NonB-b; latency is broken down into miss penalty (backend DB access overhead), client wait, server response, cache update, cache check + load (memory and/or SSD read), and slab allocation (w/ SSD write on out-of-memory)]
H = Hybrid Memcached over SATA SSD; Opt = adaptive slab manager; Block = default blocking API; NonB-i = non-blocking iset/iget API; NonB-b = non-blocking bset/bget API w/ buffer re-use guarantee
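The iset/iget extensions belong to the RDMA-Memcached libMemcached API; as a language-neutral illustration of the non-blocking idea, the Python sketch below issues sets/gets on a thread pool and overlaps them with other work. The function names are only loosely inspired by the extensions and do not mirror their real signatures.

```python
from concurrent.futures import ThreadPoolExecutor
from pymemcache.client.base import PooledClient

# Conceptual sketch of non-blocking set/get: issue the operation, keep
# computing, and collect the result later. Server address is assumed.
pool = ThreadPoolExecutor(max_workers=8)
client = PooledClient(("127.0.0.1", 11211), max_pool_size=8)

def iset(key, value):
    return pool.submit(client.set, key, value)   # returns a future

def iget(key):
    return pool.submit(client.get, key)          # returns a future

pending = [iset("k%d" % i, b"v") for i in range(100)]
# ... overlap other computation here while the sets are in flight ...
for f in pending:
    f.result()                                   # wait for completion
print(iget("k0").result())
```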
[Charts: Kafka producer latency (ms) and throughput (MB/s) vs. record size (100-100,000 bytes), RDMA vs. IPoIB]
RDMA-Kafka: High-Performance Message Broker for Streaming Workloads
Kafka Producer Benchmark (1 Broker 4 Producers)
• Experiments run on the OSU-RI2 cluster (2.4 GHz, 28 cores, InfiniBand EDR, 512 GB RAM, 400 GB SSD)
– Up to 98% improvement in latency compared to IPoIB
– Up to 7x increase in throughput over IPoIB
H. Javed, X. Lu , and D. K. Panda, Characterization of Big Data Stream Processing Pipeline: A Case Study using Flink and Kafka,
BDCAT’17
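A producer benchmark of this shape can be reproduced with any Kafka client; the sketch below uses the kafka-python package against a placeholder broker address and topic, timing synchronous sends of fixed-size records. Record size and count are illustrative, not the paper's harness.

```python
import time
from kafka import KafkaProducer

# Minimal producer benchmark sketch: time synchronous sends of fixed-size
# records to one broker. Broker, topic, sizes, and counts are assumptions.
producer = KafkaProducer(bootstrap_servers="broker-host:9092")
record = b"x" * 1000          # 1000-byte record
count = 10000

start = time.perf_counter()
for _ in range(count):
    producer.send("bench-topic", record).get(timeout=10)  # wait for the ack
elapsed = time.perf_counter() - start

print("avg latency: %.3f ms" % (elapsed / count * 1e3))
print("throughput: %.2f MB/s" % (count * len(record) / elapsed / 1e6))
```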
• Basic Designs for HiBD Packages
– HDFS and MapReduce
– Spark
– Memcached
– Kafka
• Deep Learning
– TensorFlow with gRPC
– Deep Learning over Big Data (Caffe on Spark, Tensorflow on Spark, ..)
• Cloud Support for Big Data Stacks
– Virtualization with SR-IOV and Containers
– Swift Object Store
Acceleration Case Studies and Performance Evaluation
Overview of RDMA-gRPC with TensorFlow
Worker services communicate among each other using RDMA-gRPC
[Diagram: a client and master coordinate worker services (/job:PS/task:0 and /job:Worker/task:0), each with CPU/GPU resources and an RDMA-gRPC server/client endpoint]
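For context, the processes in this picture are ordinary TensorFlow (1.x) distributed-runtime jobs; the sketch below shows how such a parameter-server/worker cluster is declared, using placeholder host names. Whether the gRPC transport underneath is the default sockets path or an RDMA-enhanced design is transparent at this level.

```python
import tensorflow as tf  # TensorFlow 1.x distributed runtime

# Declare a parameter-server/worker cluster; host names are placeholders.
cluster = tf.train.ClusterSpec({
    "ps":     ["ps-host:2222"],
    "worker": ["worker-host:2222"],
})

# Each process starts a gRPC server for its own task; a matching process
# must be launched with job_name="ps" for the session below to proceed.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

with tf.device("/job:ps/task:0"):
    w = tf.Variable(tf.zeros([10]))          # parameters live on the PS

with tf.device("/job:worker/task:0"):
    update = w.assign_add(tf.ones([10]))     # workers push updates over gRPC

with tf.Session(server.target) as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(update))
```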
Performance Benefits for RDMA-gRPC with Micro-Benchmark
RDMA-gRPC RPC Latency
• AR-gRPC (OSU design) latency on SDSC-Comet-FDR
– Up to 2.7x speedup over IPoIB (default gRPC) for small messages
– Up to 2.8x speedup over IPoIB for medium messages
– Up to 2.5x speedup over IPoIB for large messages
[Charts: RPC latency (us) vs. payload size for Default gRPC and AR-gRPC, over small (2 B-8 KB), medium (16 KB-512 KB), and large (1 MB-8 MB) payloads]
R. Biswas, X. Lu, and D. K. Panda, Characterizing and Accelerating TensorFlow: Can Adaptive and RDMA-based RPC Benefit It? Under Review
Performance Benefits for RDMA-gRPC with TensorFlow Communication Mimic Benchmark
TensorFlow communication Pattern mimic benchmark
• TensorFlow communication pattern mimic on SDSC-Comet-FDR
– A single process spawns both the gRPC server and the gRPC client
– Up to 3.6x latency speedup over IPoIB
– Up to 3.7x throughput improvement over IPoIB
[Charts: latency (us) and calls/second vs. payload size (2 MB-8 MB), Default gRPC vs. OSU RDMA gRPC]
• Basic Designs for HiBD Packages
– HDFS and MapReduce
– Spark
– Memcached
– Kafka
• Deep Learning
– TensorFlow with gRPC
– Deep Learning over Big Data (Caffe on Spark, Tensorflow on Spark, ..)
• Cloud Support for Big Data Stacks
– Virtualization with SR-IOV and Containers
– Swift Object Store
Acceleration Case Studies and Performance Evaluation
• Layers of DLoBD Stacks
– Deep learning application layer
– Deep learning library layer
– Big data analytics framework layer
– Resource scheduler layer
– Distributed file system layer
– Hardware resource layer
• How much performance benefit can we achieve for end deep learning applications?
Overview of DLoBD Stacks
[Figure: layered DLoBD stack, with the question of where sub-optimal performance arises]
X. Lu, H. Shi, M. H. Javed, R. Biswas, and D. K. Panda, Characterizing Deep Learning over Big Data (DLoBD) Stacks on RDMA-capable Networks, HotI 2017.
High-Performance Deep Learning over Big Data (DLoBD) Stacks
• Benefits of Deep Learning over Big Data (DLoBD)
– Easily integrate deep learning components into the Big Data processing workflow
– Easily access the data stored in Big Data systems
– No need to set up new dedicated deep learning clusters; reuse existing big data analytics clusters
• Challenges
– Can RDMA-based designs in DLoBD stacks improve performance, scalability, and resource utilization on high-performance interconnects, GPUs, and multi-core CPUs?
– What are the performance characteristics of representative DLoBD stacks on RDMA networks?
• Characterization of DLoBD stacks
– CaffeOnSpark, TensorFlowOnSpark, and BigDL
– IPoIB vs. RDMA; in-band vs. out-of-band communication; CPU vs. GPU; etc.
– Performance, accuracy, scalability, and resource utilization
– RDMA-based DLoBD stacks (e.g., BigDL over RDMA-Spark) can achieve 2.6x speedup compared to the IPoIB-based scheme, while maintaining similar accuracy
[Chart: epoch time (secs) and accuracy (%) vs. epoch number (1-18), IPoIB vs. RDMA; 2.6x speedup with RDMA at similar accuracy]
Performance Evaluation for IPoIB and RDMA with CaffeOnSpark and TensorFlowOnSpark (IB EDR)
• CaffeOnSpark benefits from the high performance of RDMA compared to IPoIB once communication overhead becomes significant.
• Our experiments show that the default RDMA design in TensorFlowOnSpark is not fully optimized yet. For MNIST tests, RDMA is not showing obvious benefits.
[Charts: end-to-end time (secs), IPoIB vs. RDMA, for CaffeOnSpark (CIFAR-10 Quick and LeNet on MNIST, GPU and GPU-cuDNN, 2/4/8/16 nodes) and TensorFlowOnSpark (CIFAR-10 with 2 GPUs/node, MNIST with 1 GPU/node, 2/4/8 GPUs); annotated improvements range from 4.9% to 51.2%]
• Basic Designs for HiBD Packages
– HDFS and MapReduce
– Spark
– Memcached
– Kafka
• Deep Learning
– TensorFlow with gRPC
– Deep Learning over Big Data (Caffe on Spark, Tensorflow on Spark, ..)
• Cloud Support for Big Data Stacks
– Virtualization with SR-IOV and Containers
– Swift Object Store
Acceleration Case Studies and Performance Evaluation
Overview of RDMA-Hadoop-Virt Architecture
• Virtualization-aware modules in all four main Hadoop components:
– HDFS: Virtualization-aware Block Management
to improve fault-tolerance
– YARN: Extensions to Container Allocation Policy
to reduce network traffic
– MapReduce: Extensions to Map Task Scheduling
Policy to reduce network traffic
– Hadoop Common: Topology Detection Module
for automatic topology detection
• Communications in HDFS, MapReduce, and RPC
go through RDMA-based designs over SR-IOV
enabled InfiniBand
[Architecture diagram: Big Data applications (CloudBurst, MR-MS Polygraph, others) on HDFS, YARN, MapReduce, HBase, and Hadoop Common, with virtualization-aware block management, a container allocation policy extension, a map task scheduling policy extension, and a topology detection module, running on virtual machines, containers, or bare-metal nodes]
S. Gugnani, X. Lu, D. K. Panda. Designing Virtualization-aware and Automatic Topology Detection Schemes for Accelerating Hadoop on
SR-IOV-enabled Clouds. CloudCom, 2016.
Evaluation with Applications
– 14% and 24% improvement with Default Mode for CloudBurst and Self-Join
– 30% and 55% improvement with Distributed Mode for CloudBurst and Self-Join
[Charts: execution time for CloudBurst and Self-Join in Default and Distributed modes, RDMA-Hadoop vs. RDMA-Hadoop-Virt; 30% and 55% reductions in Distributed mode]
Will be released in the future
• Basic Designs for HiBD Packages
– HDFS and MapReduce
– Spark
– Memcached
– Kafka
• Deep Learning
– TensorFlow with gRPC
– Deep Learning over Big Data (Caffe on Spark, Tensorflow on Spark, ..)
• Cloud Support for Big Data Stacks
– Virtualization with SR-IOV and Containers
– Swift Object Store
Acceleration Case Studies and Performance Evaluation
• Distributed Cloud-based Object Storage Service
• Deployed as part of OpenStack installation
• Can be deployed as standalone storage solution as well
• Worldwide data access via Internet
– HTTP-based
• Architecture
– Multiple Object Servers: To store data
– Few Proxy Servers: Act as a proxy for all requests
– Ring: Handles metadata
• Usage
– Input/output source for Big Data applications (most common use
case)
– Software/Data backup
– Storage of VM/Docker images
• Based on traditional TCP sockets communication
OpenStack Swift Overview
[Swift architecture diagram: clients send PUT/GET requests (PUT/GET /v1/<account>/<container>/<object>) to a proxy server, which uses the ring to route them to object servers, each managing its local disks]
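Since the interface is plain HTTP, any Swift client can exercise these PUT/GET paths; the sketch below uses python-swiftclient with placeholder endpoint and credentials. In the client-oblivious design discussed next, the same client code works unchanged on the accelerated deployment.

```python
from swiftclient.client import Connection

# Minimal PUT/GET against a Swift proxy; auth URL, credentials, container,
# and object name are placeholders for illustration only.
conn = Connection(
    authurl="http://proxy-host:8080/auth/v1.0",
    user="test:tester",
    key="testing",
)

conn.put_container("demo")
conn.put_object("demo", "hello.txt", contents=b"hello swift")

headers, body = conn.get_object("demo", "hello.txt")
print(body)
```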
• Challenges
– Proxy server is a bottleneck for large scale deployments
– Object upload/download operations are network intensive
– Can an RDMA-based approach benefit?
• Design
– Re-designed Swift architecture for improved scalability and
performance; Two proposed designs:
• Client-Oblivious Design: No changes required on the client side
• Metadata Server-based Design: Direct communication between
client and object servers; bypass proxy server
– RDMA-based communication framework for accelerating
networking performance
– High-performance I/O framework to provide maximum
overlap between communication and I/O
Swift-X: Accelerating OpenStack Swift with RDMA for Building Efficient HPC Clouds
S. Gugnani, X. Lu, and D. K. Panda, Swift-X: Accelerating OpenStack Swift with RDMA for Building an Efficient HPC Cloud,
accepted at CCGrid’17, May 2017
Client-Oblivious Design
(D1)
Metadata Server-based
Design (D2)
Swift-X: Accelerating OpenStack Swift with RDMA for Building Efficient HPC Clouds
[Charts: time breakup of GET and PUT (communication, I/O, hashsum, other) for Swift, Swift-X (D1), and Swift-X (D2); GET latency (s) vs. object size (1 MB-4 GB) for Swift, Swift-X (D2), and Swift-X (D1), reduced by up to 66%]
• Up to 66% reduction in GET latency
• Communication time reduced by up to 3.8x for PUT and up to 2.8x for GET
• Discussed challenges in accelerating Big Data and Deep Learning stacks with HPC technologies
• Presented basic and advanced designs to take advantage of InfiniBand/RDMA for
HDFS, MapReduce, HBase, Spark, Memcached, gRPC, and TensorFlow on HPC clusters
and clouds
• Results are promising for Big Data processing and the associated Deep Learning tools
• Many other open issues need to be solved
• Will enable Big Data and Deep Learning communities to take advantage of modern HPC
technologies to carry out their analytics in a fast and scalable manner
Concluding Remarks
One more Presentation
• Today at 11:45 am
Designing Scalable HPC, Deep Learning and Cloud Middleware for Exascale Systems
Funding Acknowledgments
Funding Support by
Equipment Support by
Personnel Acknowledgments
Current Students (Graduate)
– A. Awan (Ph.D.)
– R. Biswas (M.S.)
– M. Bayatpour (Ph.D.)
– S. Chakraborthy (Ph.D.)
– C.-H. Chu (Ph.D.)
– S. Guganani (Ph.D.)
Past Students
– A. Augustine (M.S.)
– P. Balaji (Ph.D.)
– S. Bhagvat (M.S.)
– A. Bhat (M.S.)
– D. Buntinas (Ph.D.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– R. Rajachandrasekar (Ph.D.)
– G. Santhanaraman (Ph.D.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
Past Research Scientist
– K. Hamidouche
– S. Sur
Past Post-Docs
– D. Banerjee
– X. Besseron
– H.-W. Jin
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– J. Jose (Ph.D.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– K. Kulkarni (M.S.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– M. Li (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– A. Moody (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– J. Hashmi (Ph.D.)
– H. Javed (Ph.D.)
– P. Kousha (Ph.D.)
– D. Shankar (Ph.D.)
– H. Shi (Ph.D.)
– J. Zhang (Ph.D.)
– J. Lin
– M. Luo
– E. Mancini
Current Research Scientists
– X. Lu
– H. Subramoni
Past Programmers
– D. Bureddy
– J. Perkins
Current Research Specialist
– J. Smith
– M. Arnold
– S. Marcarelli
– J. Vienne
– H. Wang
Current Post-doc
– A. Ruhela
– K. Manian
Current Students (Undergraduate)
– N. Sarkauskas (B.S.)
The 4th International Workshop on High-Performance Big Data Computing (HPBDC)
HPBDC 2018 will be held with the IEEE International Parallel and Distributed Processing Symposium (IPDPS 2018), Vancouver, British Columbia, Canada, May 2018
Workshop Date: May 21st, 2018
Keynote Talk: Prof. Geoffrey Fox, Twister2: A High-Performance Big Data Programming Environment
Six Regular Research Papers and Two Short Research Papers
Panel Topic: Which Framework is the Best for High-Performance Deep Learning:
Big Data Framework or HPC Framework?
http://web.cse.ohio-state.edu/~luxi/hpbdc2018
HPBDC 2017 was held in conjunction with IPDPS’17
http://web.cse.ohio-state.edu/~luxi/hpbdc2017
HPBDC 2016 was held in conjunction with IPDPS’16
http://web.cse.ohio-state.edu/~luxi/hpbdc2016
http://www.cse.ohio-state.edu/~panda
Thank You!
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/