DHABALESWAR K PANDA
DK Panda is a Professor and University Distinguished Scholar of Computer Science and Engineering at the Ohio State University. The MVAPICH2 (High Performance MPI and PGAS over InfiniBand, Omni-Path, iWARP and RoCE) libraries, designed and developed by his research group (http://mvapich.cse.ohio-state.edu), are currently being used by more than 2,675 organizations worldwide (in 81 countries). More than 400,000 downloads of this software have taken place from the project's site. The RDMA packages for Apache Spark, Apache Hadoop and Memcached together with OSU HiBD benchmarks from his group (http://hibd.cse.ohio-state.edu) are also publicly available. These libraries are currently being used by more than 190 organizations in 26 countries. More than 18,000 downloads of these libraries have taken place. He is an IEEE Fellow.
Designing HPC, Big Data and Deep Learning Middleware for Exascale Systems: Challenges and Opportunities

Talk at HPC Connection Workshop (SC '16)

by

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Drivers of Modern HPC Cluster Architectures
Example systems: Tianhe-2, Titan, Stampede, Tianhe-1A
• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
• Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)
Accelerators / Coprocessors: high compute density, high performance/watt, >1 TFlop DP on a chip
High Performance Interconnects – InfiniBand: <1 usec latency, 100 Gbps bandwidth
Multi-core Processors; SSD, NVMe-SSD, NVRAM
Three Major Computing Categories
• Scientific Computing
 – Message Passing Interface (MPI), including MPI + OpenMP, is the dominant programming model
 – Many discussions towards Partitioned Global Address Space (PGAS): UPC, OpenSHMEM, CAF, UPC++, etc.
 – Hybrid Programming: MPI + PGAS (OpenSHMEM, UPC)
• Deep Learning
 – Caffe, CNTK, TensorFlow, and many more
• Big Data/Enterprise/Commercial Computing
 – Focuses on large data and data analysis
 – Spark and Hadoop (HDFS, HBase, MapReduce)
 – Memcached is also used for Web 2.0
Designing Communication Middleware for Multi-Petaflop and Exaflop Systems: Challenges
• Programming models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.
• Application kernels/applications run over a communication library or runtime for these programming models, providing point-to-point communication, collective communication, energy-awareness, synchronization and locks, I/O and file systems, and fault tolerance.
• Underlying hardware: networking technologies (InfiniBand, 40/100GigE, Aries, and Omni-Path), multi/many-core architectures, and accelerators (NVIDIA and MIC).
• Middleware co-design opportunities and challenges exist across all of these layers: performance, scalability, and fault-resilience.
Broad Challenges in Designing Communication Middleware for (MPI+X) at Exascale
• Scalability for million to billion processors
 – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
 – Scalable job start-up
• Scalable collective communication
 – Offload
 – Non-blocking (see the sketch after this list)
 – Topology-aware
• Balancing intra-node and inter-node communication for next-generation nodes (128-1024 cores)
 – Multiple end-points per node
• Support for efficient multi-threading
• Integrated support for GPGPUs and accelerators
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• Support for hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, MPI + UPC++, CAF, …)
• Virtualization
• Energy-awareness
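To make the non-blocking-collective point concrete, here is a minimal, hedged sketch (standard MPI-3 only, not MVAPICH2-specific) of overlapping an MPI_Iallreduce with independent computation so that a capable runtime or network offload can progress the collective in the background:

```c
/* Sketch: overlap a non-blocking allreduce with local computation. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    double local = 1.0, global = 0.0;
    MPI_Request req;

    /* Start the reduction; the library (or network offload) can
     * progress it while the application computes. */
    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    /* ... independent computation overlapped with the collective ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) printf("sum = %f\n", global);

    MPI_Finalize();
    return 0;
}
```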
Additional Challenges for Designing Exascale Software Libraries
• Extreme low memory footprint
 – Memory per core continues to decrease
• D-L-A Framework
 – Discover
  • Overall network topology (fat-tree, 3D, …), network topology for the processes of a given job
  • Node architecture, health of network and node
 – Learn
  • Impact on performance and scalability
  • Potential for failure
 – Adapt
  • Internal protocols and algorithms
  • Process mapping
  • Fault-tolerance solutions
• Low-overhead techniques while delivering performance, scalability, and fault-tolerance
Overview of the MVAPICH2 Project
• High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), Started in 2001, First version available in 2002
– MVAPICH2-X (MPI + PGAS), Available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
– Support for Virtualization (MVAPICH2-Virt), Available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
– Used by more than 2,690 organizations in 83 countries
– More than 402,000 (> 0.4 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Nov ‘16 ranking)
• 1st ranked 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China
• 13th ranked 241,108-core cluster (Pleiades) at NASA
• 17th ranked 519,640-core cluster (Stampede) at TACC
• 40th ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology, and many others
– Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
 – System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) -> Sunway TaihuLight at NSC, Wuxi, China (1st in Nov '16, 10,649,640 cores, 93 PFlops)
MVAPICH/MVAPICH2 Release Timeline and Downloads
[Figure: Cumulative number of downloads from Sep 2004 through Oct 2016, growing to over 400,000, annotated with release milestones from MV 0.9.4 and MV2 0.9.0 through MV2 2.1, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-Virt 2.1rc2, MV2-X 2.2, MV2 2.2, and MV2-GDR 2.2.]
Architecture of MVAPICH2 Software Family
• High Performance Parallel Programming Models: Message Passing Interface (MPI), PGAS (UPC, OpenSHMEM, CAF, UPC++), and Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk)
• High Performance and Scalable Communication Runtime with diverse APIs and mechanisms: point-to-point primitives, collective algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, and introspection & analysis
• Support for modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path) and modern multi-/many-core architectures (Intel Xeon, OpenPower, Xeon Phi (MIC, KNL), NVIDIA GPGPU)
• Transport protocols: RC, XRC, UD, DC; modern features: UMR, ODP, SR-IOV, multi-rail
• Transport mechanisms: shared memory, CMA, IVSHMEM; upcoming (*) features: MCDRAM*, NVLink*, CAPI*
MVAPICH2 Software Family
High-Performance Parallel Programming Libraries
• MVAPICH2: Support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
• MVAPICH2-X: Advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with unified communication runtime
• MVAPICH2-GDR: Optimized MPI for clusters with NVIDIA GPUs
• MVAPICH2-Virt: High-performance and scalable MPI for hypervisor- and container-based HPC cloud
• MVAPICH2-EA: Energy-aware and high-performance MPI
• MVAPICH2-MIC: Optimized MPI for clusters with Intel KNC
Microbenchmarks
• OMB: Microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs
Tools
• OSU INAM: Network monitoring, profiling, and analysis for clusters with MPI and scheduler integration
• OEMT: Utility to measure the energy consumption of MPI applications
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
• Scalability for million to billion processors
 – Support for highly-efficient inter-node and intra-node communication
 – Dynamic and Adaptive Tag Matching
• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, UPC++, …)
• Integrated Support for GPGPUs
• Accelerating Graph Processing (Mizan) with MPI-3 RMA
• Optimized MVAPICH2 for KNL and Omni-Path
One-way Latency: MPI over IB with MVAPICH2
[Figures: Small-message and large-message one-way latency (us) vs. message size for TrueScale-QDR, ConnectX-3-FDR, ConnectIB-Dual FDR, ConnectX-4-EDR, and Omni-Path; small-message latencies range from 1.01 to 1.19 us across the five configurations.]
Platforms: TrueScale-QDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch; ConnectX-3-FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch; ConnectIB-Dual FDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch; ConnectX-4-EDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch; Omni-Path - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with Omni-Path switch
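Latencies of this kind are typically measured with a ping-pong loop in the style of osu_latency from the OSU Micro-Benchmarks; the following is a simplified, hedged sketch of that pattern (the real benchmark adds warm-up iterations and a full message-size sweep), to be run with exactly two processes:

```c
/* Simplified ping-pong latency sketch (run with 2 MPI processes). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 10000, size = 8;   /* 8-byte messages */
    char *buf = malloc(size);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)   /* round-trip time / 2 = one-way latency */
        printf("one-way latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * iters));

    free(buf);
    MPI_Finalize();
    return 0;
}
```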
Bandwidth: MPI over IB with MVAPICH2
[Figures: Unidirectional and bidirectional bandwidth (MBytes/sec) vs. message size for TrueScale-QDR, ConnectX-3-FDR, ConnectIB-Dual FDR, ConnectX-4-EDR, and Omni-Path. Peak unidirectional bandwidths range from 3,373 to 12,590 MB/s and peak bidirectional bandwidths from 6,228 to 24,136 MB/s across the five configurations.]
Platforms: TrueScale-QDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch; ConnectX-3-FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch; ConnectIB-Dual FDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch; ConnectX-4-EDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch; Omni-Path - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with Omni-Path switch
MVAPICH2 Two-Sided Intra-Node Performance (Shared Memory and Kernel-based Zero-copy Support (LiMIC and CMA))
[Figures: Intra-node latency and bandwidth with the latest MVAPICH2 2.2 on an Intel Ivy Bridge node. Small-message latency: 0.18 us intra-socket and 0.45 us inter-socket. Intra-socket and inter-socket bandwidth (MB/s) vs. message size for the CMA, shared-memory, and LiMIC channels, peaking at 14,250 MB/s and 13,749 MB/s.]
On-Demand Paging (ODP)
• Applications no longer need to pin down the underlying physical pages
• Memory Regions (MRs) are never pinned by the OS
 – Paged in by the HCA when needed
 – Paged out by the OS when reclaimed
• ODP can be divided into two classes
 – Explicit ODP: applications still register memory buffers for communication, but this operation defines access control for I/O rather than pinning down the pages
 – Implicit ODP: applications are provided with a special memory key that represents their complete address space, so they do not need to register any virtual address range
• Advantages
 – Simplifies programming
 – Unlimited MR sizes
 – Physical memory optimization
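As a hedged illustration of how explicit ODP is requested at the verbs level, a buffer can be registered with the IBV_ACCESS_ON_DEMAND flag so that the MR defines access rights without pinning the pages. This is a minimal sketch assuming a protection domain has already been created; error handling is omitted:

```c
/* Explicit ODP registration sketch with libibverbs. */
#include <infiniband/verbs.h>
#include <stdlib.h>

struct ibv_mr *register_odp_buffer(struct ibv_pd *pd, size_t len)
{
    void *buf = malloc(len);   /* not pinned by the application */

    /* IBV_ACCESS_ON_DEMAND asks the HCA to fault pages in on access
     * instead of requiring an up-front pin-down. */
    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE |
                      IBV_ACCESS_ON_DEMAND);
}
```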
[Figure: Application execution time (s, log scale) with 64 processes, comparing pin-down-based registration and ODP.]
M. Li, K. Hamidouche, X. Lu, H. Subramoni, J. Zhang, and D. K. Panda, "Designing MPI Library with On-Demand Paging (ODP) of InfiniBand: Challenges and Benefits", SC 2016.
Wednesday 11/16/2016 @ 11:00 – 11:30 AM in Room 355-D
Dynamic and Adaptive Tag Matching
[Figures: Normalized total tag-matching time at 512 processes, normalized to the default scheme (lower is better), and normalized memory overhead per process at 512 processes compared to the default (lower is better).]
Challenge: Tag matching is a significant overhead for receivers. Existing solutions are static, do not adapt dynamically to the communication pattern, and do not consider memory overhead.
Solution: A new tag-matching design that dynamically adapts to communication patterns, uses different strategies for different ranks, and bases its decisions on the number of request objects that must be traversed before hitting the required one.
Results: Better performance than other state-of-the-art tag-matching schemes, with minimum memory consumption. Will be available in future MVAPICH2 releases.
Adaptive and Dynamic Design for MPI Tag Matching; M. Bayatpour, H. Subramoni, S. Chakraborty, and D. K. Panda; IEEE Cluster 2016. [Best Paper Nominee]
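To illustrate why the number of traversed request objects matters, here is a conceptual sketch (not the MVAPICH2 design) of binning posted receives by tag so that a match walks a short per-bin list instead of one long queue; real MPI matching must additionally handle source matching, wildcards, and ordering:

```c
/* Conceptual per-tag binning of posted receives. */
#include <stddef.h>

#define NUM_BINS 64

struct posted_recv {
    int tag;
    void *buf;
    struct posted_recv *next;
};

static struct posted_recv *bins[NUM_BINS];

static unsigned bin_of(int tag) { return (unsigned)tag % NUM_BINS; }

/* Find and unlink the first posted receive matching `tag`. */
struct posted_recv *match_recv(int tag)
{
    struct posted_recv **pp = &bins[bin_of(tag)];
    for (; *pp; pp = &(*pp)->next) {
        if ((*pp)->tag == tag) {
            struct posted_recv *r = *pp;
            *pp = r->next;   /* unlink the matched request */
            return r;
        }
    }
    return NULL;             /* no match: handle as an unexpected message */
}
```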
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
• Scalability for million to billion processors
• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, UPC++, …)
• Integrated Support for GPGPUs
• Accelerating Graph Processing (Mizan) with MPI-3 RMA
• Optimized MVAPICH2 for KNL and Omni-Path
MVAPICH2-X for Hybrid MPI + PGAS Applications
• Current model: separate runtimes for OpenSHMEM/UPC/UPC++/CAF and MPI
 – Possible deadlock if both runtimes are not progressed
 – Consumes more network resources
• Unified communication runtime for MPI, UPC, UPC++, OpenSHMEM, and CAF
 – Available since 2012 (starting with MVAPICH2-X 1.9)
 – http://mvapich.cse.ohio-state.edu
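A minimal sketch of the hybrid style this unified runtime targets: the same processes act as MPI ranks and OpenSHMEM PEs, mixing a one-sided put with an MPI collective. Only standard MPI and OpenSHMEM calls are used; nothing here is MVAPICH2-X-specific, and initialization-ordering requirements may vary by implementation:

```c
/* Hybrid MPI + OpenSHMEM sketch. */
#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    shmem_init();                 /* same processes are MPI ranks and PEs */

    static long flag = 0;         /* symmetric variable */
    long one = 1;
    int me = shmem_my_pe();

    if (me == 1)
        shmem_long_put(&flag, &one, 1, 0);   /* one-sided put to PE 0 */
    shmem_barrier_all();

    MPI_Barrier(MPI_COMM_WORLD);             /* MPI side of the program */
    if (me == 0) printf("flag = %ld\n", flag);

    shmem_finalize();
    MPI_Finalize();
    return 0;
}
```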
UPC++ Support in MVAPICH2-X
[Diagram: An MPI + UPC++ application can run either over the UPC++ runtime with GASNet interfaces and the MPI conduit, or over MPI interfaces and the MVAPICH2-X Unified Communication Runtime (UCR).]
• Full and native support for hybrid MPI + UPC++ applications
• Better performance compared to IBV and MPI conduits
• OSU Micro-benchmarks (OMB) support for UPC++
• Available since MVAPICH2-X 2.2rc1
[Figure: Inter-node broadcast time (ms) vs. message size (1K-1M bytes) on 64 nodes (1 process per node), comparing GASNet_MPI, GASNET_IBV, and MV2-X; MV2-X is up to 14x faster.]
Application-Level Performance with Graph500 and Sort
Graph500 Execution Time
[Figure: Graph500 execution time (s) at 4K, 8K, and 16K processes for MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM); the hybrid design is up to 13X faster than MPI-Simple.]
• Performance of the Hybrid (MPI+OpenSHMEM) Graph500 design
 – 8,192 processes: 2.4X improvement over MPI-CSR, 7.6X improvement over MPI-Simple
 – 16,384 processes: 1.5X improvement over MPI-CSR, 13X improvement over MPI-Simple
J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC '13), June 2013
J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012
Sort Execution Time
[Figure: Sort execution time (seconds) for input data / process counts of 500GB-512, 1TB-1K, 2TB-2K, and 4TB-4K, comparing MPI and Hybrid; the hybrid design is up to 51% faster.]
• Performance of the Hybrid (MPI+OpenSHMEM) Sort application
 – 4,096 processes, 4 TB input size: MPI – 2408 sec (0.16 TB/min); Hybrid – 1172 sec (0.36 TB/min); 51% improvement over the MPI design
J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013.
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
• Scalability for million to billion processors
• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, UPC++, …)
• Integrated Support for GPGPUs
 – CUDA-aware MPI
 – GPUDirect RDMA (GDR) Support
 – Control Flow Decoupling through GPUDirect Async
• Accelerating Graph Processing (Mizan) with MPI-3 RMA
• Optimized MVAPICH2 for KNL and Omni-Path
GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
• Standard MPI interfaces used for unified data movement
 – At sender: MPI_Send(s_devbuf, size, …);
 – At receiver: MPI_Recv(r_devbuf, size, …);
 – All data-movement details are handled inside MVAPICH2
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from the GPU with RDMA transfers
• High Performance and High Productivity
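Expanding the snippet above into a hedged, self-contained sketch: with a CUDA-aware MPI library the device pointers returned by cudaMalloc are passed directly to MPI_Send/MPI_Recv, and the library handles the staging or GPUDirect transfers internally (run with two processes, one GPU each):

```c
/* CUDA-aware MPI point-to-point sketch with device buffers. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank, n = 1 << 20;
    float *devbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaMalloc((void **)&devbuf, n * sizeof(float));

    if (rank == 0)
        MPI_Send(devbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);      /* s_devbuf */
    else if (rank == 1)
        MPI_Recv(devbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                                /* r_devbuf */

    cudaFree(devbuf);
    MPI_Finalize();
    return 0;
}
```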
CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.2 Releases
• Support for MPI communication from NVIDIA GPU device memory
• High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
• High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
• Unified memory
Performance of MVAPICH2-GPU with GPU-Direct RDMA (GDR)
[Figures: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth vs. message size for MV2-GDR 2.2, MV2-GDR 2.0b, and MV2 without GDR. MV2-GDR 2.2 achieves 2.18 us small-message latency, with annotated improvements of roughly 2x-3x over MV2-GDR 2.0b and up to 10x-11x over MV2 without GDR.]
Platform: MVAPICH2-GDR 2.2, Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox ConnectX-4 EDR HCA, CUDA 8.0, Mellanox OFED 3.0 with GPU-Direct RDMA
Application-Level Evaluation (HOOMD-blue)
[Figures: Average time steps per second (TPS) vs. number of processes (4, 8, 16, 32) for 64K-particle and 256K-particle runs, comparing MV2 and MV2+GDR; MV2+GDR achieves about 2X higher TPS in both cases.]
• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• HOOMD-blue version 1.0.5
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384
Application-Level Evaluation (COSMO) and Weather Forecasting in Switzerland
[Figures: Normalized execution time vs. number of GPUs on the CSCS GPU cluster (16, 32, 64, 96 GPUs) and the Wilkes GPU cluster (4, 8, 16, 32 GPUs), comparing Default, Callback-based, and Event-based designs.]
• 2X improvement on 32 GPU nodes
• 30% improvement on 96 GPU nodes (8 GPUs/node)
C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS '16
On-going collaboration with CSCS and MeteoSwiss (Switzerland) on co-designing MV2-GDR and the COSMO application
COSMO model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/
MVAPICH2-GDS: Preliminary Results
[Figures: Latency (us) vs. message size for Default and Enhanced-GDS designs, for a latency-oriented test (Send+kernel and Recv+kernel, 8-512 bytes) and a throughput-oriented back-to-back test (16-1024 bytes).]
• Latency oriented: able to hide the kernel launch overhead
 – 25% improvement at 256 bytes compared to the default behavior
• Throughput oriented: asynchronously offloads communication and computation tasks to the queue
 – 14% improvement at 1KB message size
Platform: Intel Sandy Bridge, NVIDIA K20, and Mellanox FDR HCA
Will be available in a public release soon
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
• Scalability for million to billion processors
• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, UPC++, …)
• Integrated Support for GPGPUs
• Accelerating Graph Processing (Mizan) with MPI-3 RMA
• Optimized MVAPICH2 for KNL and Omni-Path
Mizan-RMA: Accelerating Graph Processing Framework
• The Mizan framework is available from: https://thegraphsblog.wordpress.com/the-graph-blog/mizan/
• Accelerate communication with the MPI one-sided programming model (RMA)
• Overlap communication with computation
• 2.5X improvement on 40 nodes
• Will be released soon
M. Li, X. Lu, K. Hamidouche, J. Zhang, and D. K. Panda, Mizan-RMA: Accelerating Mizan Graph Processing Framework with MPI RMA, Int'l Conference on High Performance Computing (HiPC '16), to be presented
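A conceptual sketch of the MPI-3 RMA pattern being exploited here (standard MPI one-sided calls, not the Mizan-RMA implementation itself): each rank exposes a window and updates a neighbor with MPI_Put, so local computation can proceed while the one-sided transfer completes:

```c
/* MPI-3 RMA sketch: one-sided put to a neighbor's window. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *base;
    MPI_Win win;
    MPI_Win_allocate(1024 * sizeof(double), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);

    int target = (rank + 1) % size;
    double update = (double)rank;

    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    MPI_Put(&update, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win);
    /* ... local graph computation overlapped with the pending put ... */
    MPI_Win_unlock(target, win);    /* completes the one-sided update */

    MPI_Barrier(MPI_COMM_WORLD);    /* ensure all puts are done before reading */
    if (rank == 0) printf("received %f\n", base[0]);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```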
[Figure: PageRank execution time (s) with the Arabic dataset at 256, 384, 512, and 640 cores, comparing Default and Optimized designs; up to 2.5x improvement.]
MVAPICH2 for Omni-Path and KNL
• MVAPICH2 has been supporting QLogic/PSM for many years with all different compute platforms
• Latest version (MVAPICH2 2.2 GA) supports
• Omni-Path (derivative of QLogic IB)
• Xeon family with Omni-Path
• KNL with Omni-Path
Enhanced Designs for KNL: MVAPICH2 Approach
• New designs
 – Take advantage of the idle cores
 – Dynamically configurable
 – Take advantage of highly multithreaded cores
 – Take advantage of the MCDRAM of KNL processors
• Applicable to other programming models such as PGAS, task-based, etc.
• Provides portability, performance, and applicability to the runtime as well as applications in a transparent manner
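As an application-level illustration of the MCDRAM point (the MVAPICH2-internal use described above is transparent to users), a buffer can be placed in KNL high-bandwidth memory with the memkind library's hbwmalloc interface; this is a hedged sketch assuming memkind is installed:

```c
/* Allocate a staging buffer in KNL MCDRAM when available. */
#include <hbwmalloc.h>
#include <string.h>
#include <stdlib.h>

int main(void)
{
    size_t len = 64UL << 20;                      /* 64 MB buffer */
    int have_hbw = (hbw_check_available() == 0);  /* 0 => MCDRAM present */
    void *buf = have_hbw ? hbw_malloc(len) : malloc(len);  /* fall back to DDR */

    memset(buf, 0, len);
    /* ... use as a communication / staging buffer ... */

    if (have_hbw) hbw_free(buf); else free(buf);
    return 0;
}
```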
Performance Benefits of the Enhanced Designs
• New designs to exploit the high concurrency and MCDRAM of KNL
• Significant improvements for large message sizes
• Benefits seen for varying message sizes as well as varying numbers of MPI processes
[Figures: Very large message bi-directional bandwidth (2M-64M bytes), 16-process intra-node all-to-all latency (4, 8, 16 processes), and intra-node broadcast with 64MB messages, comparing MVAPICH2 and MVAPICH2-Optimized; annotated improvements of 27%, 17.2%, and 52%.]
Performance Benefits of the Enhanced Designs
[Figures: Multi-pair bandwidth using 32 MPI processes (1M-64M bytes) comparing MV2_Def_DRAM, MV2_Def_MCDRAM, MV2_Opt_DRAM, and MV2_Opt_MCDRAM (up to 30% improvement), and CNTK MLP training time on MNIST (batch size 64) for MPI-process:OpenMP-thread configurations 4:268, 4:204, and 4:64, comparing MV2_Def_DRAM and MV2_Opt_DRAM (15% improvement).]
• Benefits observed in the training time of a Multi-Level Perceptron (MLP) model on the MNIST dataset using the CNTK Deep Learning framework
Enhanced designs will be available in upcoming MVAPICH2 releases
MVAPICH2 – Plans for Exascale
• Performance and memory scalability toward 1M cores
• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + UPC++, MPI + CAF, …)
 – MPI + Task*
• Enhanced optimization for GPU support and accelerators
• Taking advantage of advanced features of Mellanox InfiniBand
 – Switch-IB2 SHArP*
 – GID-based support*
• Enhanced communication schemes for upcoming architectures
 – Knights Landing with MCDRAM*
 – NVLINK*
 – CAPI*
• Extended topology-aware collectives
• Extended energy-aware designs and virtualization support
• Extended support for the MPI Tools Interface (as in MPI 3.1)
• Extended checkpoint-restart and migration support with SCR
• Support for * features will be available in future MVAPICH2 releases
Three Major Computing Categories
• Scientific Computing
 – Message Passing Interface (MPI), including MPI + OpenMP, is the dominant programming model
 – Many discussions towards Partitioned Global Address Space (PGAS): UPC, OpenSHMEM, CAF, UPC++, etc.
 – Hybrid Programming: MPI + PGAS (OpenSHMEM, UPC, UPC++)
• Deep Learning
 – Caffe, CNTK, TensorFlow, and many more
• Big Data/Enterprise/Commercial Computing
 – Focuses on large data and data analysis
 – Spark and Hadoop (HDFS, HBase, MapReduce)
 – Memcached is also used for Web 2.0
Deep Learning: New Challenges for MPI Runtimes
• Deep Learning frameworks are a different game altogether
 – Unusually large message sizes (order of megabytes)
 – Most communication based on GPU buffers
• How to address these newer requirements?
 – GPU-specific communication libraries: NVIDIA's NCCL library provides inter-GPU communication
 – CUDA-Aware MPI (MVAPICH2-GDR) provides support for GPU-based communication
• Can we exploit CUDA-Aware MPI and NCCL to support Deep Learning applications?
[Diagram: Hierarchical communication combining a knomial inter-node stage with an NCCL ring for intra-node communication across GPUs attached via PLX switches.]
Efficient Broadcast: MVAPICH2-GDR and NCCL
• NCCL has some limitations
 – Only works within a single node, thus no scale-out across multiple nodes
 – Degradation across the IOH (socket) for scale-up within a node
• We propose an optimized MPI_Bcast
 – Communication of very large GPU buffers (order of megabytes)
 – Scale-out on large numbers of dense multi-GPU nodes
• Hierarchical communication that efficiently exploits:
 – CUDA-Aware MPI_Bcast in MV2-GDR
 – The NCCL broadcast primitive
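A conceptual sketch of the hierarchical idea (not the MVAPICH2-GDR implementation): node leaders broadcast across nodes with a CUDA-aware MPI_Bcast, then each node fans the data out to its local GPUs with NCCL's ring broadcast. One process per GPU is assumed, and the leader communicator `leader_comm` and NCCL communicator `nccl_comm` are assumed to have been created elsewhere:

```c
/* Hierarchical broadcast sketch: MPI across node leaders + NCCL within a node. */
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

void hierarchical_bcast(float *d_buf, int count, int node_rank,
                        MPI_Comm leader_comm, ncclComm_t nccl_comm,
                        cudaStream_t stream)
{
    /* Step 1: inter-node stage among node leaders (CUDA-aware MPI). */
    if (node_rank == 0)
        MPI_Bcast(d_buf, count, MPI_FLOAT, 0, leader_comm);

    /* Step 2: intra-node stage over NVLink/PCIe using the NCCL ring. */
    ncclBcast(d_buf, count, ncclFloat, 0, nccl_comm, stream);
    cudaStreamSynchronize(stream);
}
```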
[Figures: OSU micro-benchmark broadcast latency (us, log scale) vs. message size (1 byte to 128 MB) comparing MV2-GDR and MV2-GDR-Opt (up to 100x improvement), and Microsoft CNTK training time (seconds) on 2-64 GPUs (about 25% average improvement).]
Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning, A. Awan, K. Hamidouche, A. Venkatesh, and D. K. Panda, The 23rd European MPI Users' Group Meeting (EuroMPI 16), Sep 2016 [Best Paper Runner-Up]
Efficient Reduce: MVAPICH2-GDR
• Can we optimize MVAPICH2-GDR to efficiently support DL frameworks?
 – We need to design large-scale reductions using CUDA-awareness
 – The GPU performs the reduction using kernels
 – Overlap of computation and communication
 – Hierarchical designs
• Proposed designs achieve 2.5x speedup over MVAPICH2-GDR and 133x over OpenMPI
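The deep-learning use case boils down to reducing GPU-resident gradients; with a CUDA-aware MPI this is a single call on the device pointer, as in the hedged sketch below (a standard-API illustration, not the optimized hierarchical design described above):

```c
/* Sum GPU-resident gradients across ranks with a CUDA-aware MPI. */
#include <mpi.h>
#include <cuda_runtime.h>

void allreduce_gradients(float *d_grad, int count)
{
    /* The device pointer is passed directly; the library handles the
     * data movement and (ideally) reduces on or near the GPU. */
    MPI_Allreduce(MPI_IN_PLACE, d_grad, count, MPI_FLOAT, MPI_SUM,
                  MPI_COMM_WORLD);
}
```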
[Figure: Optimized large-size reduction latency (milliseconds, log scale) vs. message size (4M-128M bytes), comparing MV2-GDR, OpenMPI, and MV2-GDR-Opt.]
Large Message Optimized Collectives for Deep Learning
[Figures: Latency (ms) for Reduce on 192 GPUs (2-128 MB) and with a 64 MB message (128-192 GPUs), Allreduce on 64 GPUs (2-128 MB) and with a 128 MB message (16-64 GPUs), and Bcast on 64 GPUs (2-128 MB) and with a 128 MB message (16-64 GPUs).]
• MV2-GDR provides optimized collectives for large message sizes
• Optimized Reduce, Allreduce, and Bcast
• Good scaling with large numbers of GPUs
• Available in MVAPICH2-GDR 2.2 GA
OSU-Caffe: Scalable Deep Learning
• Caffe: a flexible and layered Deep Learning framework
• Benefits and weaknesses
 – Multi-GPU training within a single node
 – Performance degradation for GPUs across different sockets
 – Limited scale-out
• OSU-Caffe: MPI-based parallel training
 – Enables scale-up (within a node) and scale-out (across multi-GPU nodes)
 – Scale-out on 64 GPUs for training the CIFAR-10 network on the CIFAR-10 dataset
 – Scale-out on 128 GPUs for training the GoogLeNet network on the ImageNet dataset
[Figure: GoogLeNet (ImageNet) training time (seconds) on 8-128 GPUs for Caffe, OSU-Caffe (1024), and OSU-Caffe (2048); the "invalid use case" annotation marks configurations the default Caffe cannot run.]
OSU-Caffe is publicly available from http://hidl.cse.ohio-state.edu/
Three Major Computing Categories
• Scientific Computing
 – Message Passing Interface (MPI), including MPI + OpenMP, is the dominant programming model
 – Many discussions towards Partitioned Global Address Space (PGAS): UPC, OpenSHMEM, CAF, etc.
 – Hybrid Programming: MPI + PGAS (OpenSHMEM, UPC, UPC++)
• Deep Learning
 – Caffe, CNTK, TensorFlow, and many more
• Big Data/Enterprise/Commercial Computing
 – Focuses on large data and data analysis
 – Spark and Hadoop (HDFS, HBase, MapReduce)
 – Memcached is also used for Web 2.0
How Can HPC Clusters with High-Performance Interconnect and Storage Architectures Benefit Big Data Applications?
Bring HPC and Big Data processing into a "convergent trajectory"!
• What are the major bottlenecks in current Big Data processing middleware (e.g. Hadoop, Spark, and Memcached)?
• Can the bottlenecks be alleviated with new designs that take advantage of HPC technologies?
• Can RDMA-enabled high-performance interconnects benefit Big Data processing?
• Can HPC clusters with high-performance storage systems (e.g. SSD, parallel file systems) benefit Big Data applications?
• How much performance benefit can be achieved through enhanced designs?
• How do we design benchmarks for evaluating the performance of Big Data middleware on HPC clusters?
Designing Communication and I/O Libraries for Big Data Systems: Challenges
[Diagram: Applications run over Big Data middleware (HDFS, MapReduce, HBase, Spark and Memcached) and programming models (sockets; other protocols?), which sit on a communication and I/O library providing point-to-point communication, threaded models and synchronization, virtualization, I/O and file systems, QoS, fault-tolerance, and benchmarks, with upper-level changes an open question. Underlying hardware: commodity computing system architectures (multi- and many-core architectures and accelerators), networking technologies (InfiniBand, 1/10/40/100 GigE and intelligent NICs), and storage technologies (HDD, SSD, and NVMe-SSD).]
The High-Performance Big Data (HiBD) Project
• RDMA for Apache Spark
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
 – Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions
• RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• OSU HiBD-Benchmarks (OHB)
 – HDFS, Memcached, and HBase micro-benchmarks
• RDMA for Impala (upcoming)
• http://hibd.cse.ohio-state.edu
• Users base: 195 organizations from 27 countries
• More than 18,600 downloads from the project site
Available for InfiniBand and RoCE
Different Modes of RDMA for Apache Hadoop 2.x
• HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation to provide better fault-tolerance as well as performance. This mode is enabled by default in the package.
• HHH-M: A high-performance in-memory based setup has been introduced in this package that can be utilized to perform all I/O operations in-memory and obtain as much performance benefit as possible.
• HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre available in the cluster.
• HHH-L-BB: This mode deploys a Memcached-based burst-buffer system to reduce the bandwidth bottleneck of shared file system access. The burst buffer design is hosted by Memcached servers, each of which has a local SSD.
• MapReduce over Lustre, with/without local disks: Besides the HDFS-based solutions, this package also provides support to run MapReduce jobs on top of Lustre alone. Two different modes are introduced: with local disks and without local disks.
• Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre).
Acceleration Case Studies and Performance Evaluation
• RDMA-based designs and performance evaluation
 – HDFS
 – MapReduce
 – Spark
Enhanced HDFS with In-Memory and Heterogeneous Storage (Triple-H)
• Design features
 – Three modes: Default (HHH), In-Memory (HHH-M), Lustre-Integrated (HHH-L)
 – Policies to efficiently utilize the heterogeneous storage devices (RAM, SSD, HDD, Lustre)
 – Eviction/promotion based on data usage pattern
 – Hybrid replication
 – Lustre-Integrated mode: Lustre-based fault-tolerance
[Diagram: Applications use Triple-H's hybrid replication, data placement policies, and eviction/promotion across RAM disk, SSD, HDD, and Lustre.]
N. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture, CCGrid '15, May 2015
Design Overview of MapReduce with RDMA
[Diagram: Applications and MapReduce (JobTracker, TaskTracker, Map, Reduce) run either over the Java socket interface on 1/10/40/100 GigE or IPoIB networks, or over the Java Native Interface (JNI) to the OSU verbs-based design on RDMA-capable networks (IB, iWARP, RoCE, ...).]
• Enables high-performance RDMA communication while supporting the traditional socket interface
• The JNI layer bridges Java-based MapReduce with the communication library written in native code
• Design features
 – RDMA-based shuffle
 – Prefetching and caching of map output
 – Efficient shuffle algorithms
 – In-memory merge
 – On-demand shuffle adjustment
 – Advanced overlapping of map, shuffle, and merge, and of shuffle, merge, and reduce
 – On-demand connection setup
 – InfiniBand/RoCE support
M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS, June 2014
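To illustrate the JNI bridging described above, here is a hypothetical native stub in C: the Java side hands a direct ByteBuffer to native code, which obtains its address for a zero-copy RDMA transfer. The class, method, and helper names are illustrative only, not the actual RDMA-Hadoop or RDMA-Spark API:

```c
/* Hypothetical JNI stub bridging a Java shuffle layer to native RDMA code. */
#include <jni.h>

/* Suppose the Java side declares:
 *   package edu.osu.example;
 *   class RdmaShuffle { static native int send(java.nio.ByteBuffer b, int len, int dest); }
 */
JNIEXPORT jint JNICALL
Java_edu_osu_example_RdmaShuffle_send(JNIEnv *env, jclass cls,
                                      jobject byte_buffer, jint len, jint dest)
{
    /* Zero-copy handoff: take the address of the direct buffer and pass it
     * to the native verbs-based engine (call omitted in this sketch). */
    void *addr = (*env)->GetDirectBufferAddress(env, byte_buffer);
    if (addr == NULL)
        return -1;
    /* rdma_post_send(addr, len, dest);  -- hypothetical native engine call */
    return 0;
}
```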
Performance Benefits – RandomWriter & TeraGen on TACC Stampede
[Figures: RandomWriter and TeraGen execution time (s) vs. data size (80, 100, 120 GB), comparing IPoIB (FDR) with the RDMA design; execution time is reduced by about 3x for RandomWriter and 4x for TeraGen.]
Cluster with 32 nodes and a total of 128 maps
• RandomWriter: 3-4x improvement over IPoIB for 80-120 GB file size
• TeraGen: 4-5x improvement over IPoIB for 80-120 GB file size
Performance Benefits – Sort & TeraSort on TACC Stampede
[Figures: Sort and TeraSort execution time (s) vs. data size (80, 100, 120 GB), comparing IPoIB (FDR) and OSU-IB (FDR); execution time is reduced by up to 52% for Sort and 44% for TeraSort.]
Cluster with 32 nodes with a total of 128 maps and 64 reduces
Cluster with 32 nodes with a total of 128 maps and 57 reduces
• Sort with a single HDD per node: 40-52% improvement over IPoIB for 80-120 GB data
• TeraSort with a single HDD per node: 42-44% improvement over IPoIB for 80-120 GB data
Acceleration Case Studies and Performance Evaluation
• RDMA-based designs and performance evaluation
 – HDFS
 – MapReduce
 – Spark
Design Overview of Spark with RDMA
• Design features
 – RDMA-based shuffle plugin
 – SEDA-based architecture
 – Dynamic connection management and sharing
 – Non-blocking data transfer
 – Off-JVM-heap buffer management
 – InfiniBand/RoCE support
• Enables high-performance RDMA communication while supporting the traditional socket interface
• The JNI layer bridges Scala-based Spark with the communication library written in native code
[Diagram: Apache Spark benchmarks/applications/libraries/frameworks run over Spark Core and its shuffle manager (Sort, Hash, Tungsten-Sort); the block transfer service (Netty, NIO, RDMA-Plugin) uses Netty/NIO servers and clients over the Java socket interface on 1/10/40/100 GigE or IPoIB networks, or the RDMA server/client over JNI and the native RDMA-based communication engine on RDMA-capable networks (IB, iWARP, RoCE, ...).]
X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI '14), August 2014
X. Lu, D. Shankar, S. Gugnani, and D. K. Panda, High-Performance Design of Apache Spark with RDMA and Its Benefits on Various Workloads, IEEE BigData '16, Dec. 2016
Performance Evaluation on SDSC Comet – SortBy/GroupBy
• InfiniBand FDR, SSD, 64 worker nodes, 1536 cores (1536M 1536R)
• RDMA-based design for Spark 1.5.1
• RDMA vs. IPoIB with 1536 concurrent tasks, single SSD per node
 – SortBy: total time reduced by up to 80% over IPoIB (56Gbps)
 – GroupBy: total time reduced by up to 74% over IPoIB (56Gbps)
[Figures: SortByTest and GroupByTest total time (sec) for 64, 128, and 256 GB data sizes on 64 worker nodes / 1536 cores, comparing IPoIB and RDMA.]
Performance Evaluation on SDSC Comet – HiBench PageRank
• InfiniBand FDR, SSD, 32/64 worker nodes, 768/1536 cores (768/1536M 768/1536R)
• RDMA-based design for Spark 1.5.1
• RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node
 – 32 nodes/768 cores: total time reduced by 37% over IPoIB (56Gbps)
 – 64 nodes/1536 cores: total time reduced by 43% over IPoIB (56Gbps)
[Figures: PageRank total time (sec) for Huge, BigData, and Gigantic data sizes on 32 worker nodes / 768 cores and 64 worker nodes / 1536 cores, comparing IPoIB and RDMA.]
Performance Evaluation on SDSC Comet: Astronomy Application
• Kira Toolkit [1]: a distributed astronomy image processing toolkit implemented using Apache Spark
• Source extractor application, using a 65 GB dataset from the SDSS DR2 survey that comprises 11,150 image files
• Compares RDMA Spark performance with the standard Apache implementation using IPoIB
[Figure: Execution time (sec) for the Kira SE benchmark using the 65 GB dataset on 48 cores; RDMA Spark is 21% faster than Apache Spark (IPoIB).]
[1] Z. Zhang, K. Barbary, F. A. Nothaft, E. R. Sparks, M. J. Franklin, D. A. Patterson, S. Perlmutter. Scientific Computing Meets Big Data Technology: An Astronomy Use Case. CoRR, vol: abs/1507.03325, Aug 2015.
M. Tatineni, X. Lu, D. J. Choi, A. Majumdar, and D. K. Panda, Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC Comet, XSEDE '16, July 2016
Looking into the Future …
• Exascale systems will be constrained by
 – Power
 – Memory per core
 – Data movement cost
 – Faults
• Programming models, runtimes, and middleware need to be designed for
 – Scalability
 – Performance
 – Fault-resilience
 – Energy-awareness
 – Programmability
 – Productivity
• Highlighted some of the issues and challenges
• Need continuous innovation on all these fronts
Funding Acknowledgments
[Logos of funding and equipment supporters.]
Personnel Acknowledgments Current Students
– A. Awan (Ph.D.)
– M. Bayatpour (Ph.D.)
– S. Chakraborthy (Ph.D.)
– C.-H. Chu (Ph.D.)
Past Students
– A. Augustine (M.S.)
– P. Balaji (Ph.D.)
– S. Bhagvat (M.S.)
– A. Bhat (M.S.)
– D. Buntinas (Ph.D.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– R. Rajachandrasekar (Ph.D.)
– G. Santhanaraman (Ph.D.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
Past Research Scientist
– S. Sur
Past Post-Docs
– D. Banerjee
– X. Besseron
– H.-W. Jin
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– J. Jose (Ph.D.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– K. Kulkarni (M.S.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– A. Moody (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– S. Guganani (Ph.D.)
– J. Hashmi (Ph.D.)
– N. Islam (Ph.D.)
– M. Li (Ph.D.)
– J. Lin
– M. Luo
– E. Mancini
Current Research Scientists
– K. Hamidouche
– X. Lu
– H. Subramoni
Past Programmers
– D. Bureddy
– M. Arnold
– J. Perkins
Current Research Specialist
– J. Smith
– M. Rahman (Ph.D.)
– D. Shankar (Ph.D.)
– A. Venkatesh (Ph.D.)
– J. Zhang (Ph.D.)
– S. Marcarelli
– J. Vienne
– H. Wang
Thank You!
The High-Performance Big Data Project http://hibd.cse.ohio-state.edu/
Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project http://mvapich.cse.ohio-state.edu/