
MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans


DESCRIPTION

In this video from the 2014 HPC Advisory Council Europe Conference, DK Panda from Ohio State University presents: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans. This talk focuses on the latest developments and future plans for the MVAPICH2 and MVAPICH2-X projects. For the MVAPICH2 project, we will focus on scalable and highly-optimized designs for point-to-point communication (two-sided and one-sided MPI-3 RMA), collective communication (blocking and MPI-3 non-blocking), support for GPGPUs and Intel MIC, support for the MPI-T interface, and schemes for fault-tolerance/fault-resilience. For the MVAPICH2-X project, we will focus on efficient support for the hybrid MPI and PGAS (UPC and OpenSHMEM) programming model with a unified runtime. Watch the video presentation: http://wp.me/p3RLHQ-coF


Page 1: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Dhabaleswar K. (DK) Panda, The Ohio State University

E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Talk at HPC Advisory Council European Conference (2014)

Page 2: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

High-End Computing (HEC): PetaFlop to ExaFlop

Expected to have an ExaFlop system in 2020-2022!

100 PFlops in 2015

1 EFlops in 2018?

Page 3: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)

[Chart: number of clusters and percentage of clusters in the Top500 over time; x-axis: timeline, y-axes: number of clusters and percentage of clusters]

Page 4: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Drivers of Modern HPC Cluster Architectures

• Multi-core processors are ubiquitous
• InfiniBand is very popular in HPC clusters
• Accelerators/Coprocessors are becoming common in high-end systems
• Pushing the envelope for Exascale computing

Multi-core Processors

High-Performance Interconnects - InfiniBand: <1 usec latency, >100 Gbps bandwidth

Accelerators/Coprocessors: high compute density, high performance/watt; >1 TFlop DP on a chip

Example systems (Top500 rank in parentheses): Tianhe-2 (1), Titan (2), Stampede (6), Tianhe-1A (10)

Page 5: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Large-scale InfiniBand Installations

• 207 IB clusters (41%) in the November 2013 Top500 list (http://www.top500.org)
• Installations in the Top 40 (19 systems):
  – 519,640 cores (Stampede) at TACC (7th)
  – 147,456 cores (SuperMUC) in Germany (10th)
  – 74,358 cores (Tsubame 2.5) at Japan/GSIC (11th)
  – 194,616 cores (Cascade) at PNNL (13th)
  – 110,400 cores (Pangea) at France/Total (14th)
  – 96,192 cores (Pleiades) at NASA/Ames (16th)
  – 73,584 cores (Spirit) at USA/Air Force (18th)
  – 77,184 cores (Curie thin nodes) at France/CEA (20th)
  – 120,640 cores (Nebulae) at China/NSCS (21st)
  – 72,288 cores (Yellowstone) at NCAR (22nd)
  – 70,560 cores (Helios) at Japan/IFERC (24th)
  – 138,368 cores (Tera-100) at France/CEA (29th)
  – 60,000 cores (iDataPlex DX360M4) at Germany/Max-Planck (31st)
  – 53,504 cores (PRIMERGY) at Australia/NCI (32nd)
  – 77,520 cores (Conte) at Purdue University (33rd)
  – 48,896 cores (MareNostrum) at Spain/BSC (34th)
  – 222,072 cores (PRIMERGY) at Japan/Kyushu (36th)
  – 78,660 cores (Lomonosov) in Russia (37th)
  – 137,200 cores (Sunway Blue Light) in China (40th)
  – and many more!

Page 6: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Parallel Programming Models Overview

• Shared Memory Model: SHMEM, DSM
• Distributed Memory Model: MPI (Message Passing Interface)
• Partitioned Global Address Space (PGAS): Global Arrays, UPC, Chapel, X10, CAF, ...

[Figure: three process groups (P1, P2, P3) sharing a common memory, each holding private memories, and each holding private memories plus a logical shared memory, illustrating the three models]

• Programming models provide abstract machine models
• Models can be mapped on different types of systems
  – Distributed Shared Memory (DSM), MPI within a node, etc.
  – Logical shared memory across nodes (PGAS)

Page 7: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Partitioned Global Address Space (PGAS) Models

• Key features
  – Simple shared memory abstractions
  – Lightweight one-sided communication
  – Easier to express irregular communication
• Different approaches to PGAS
  – Languages
    • Unified Parallel C (UPC)
    • Co-Array Fortran (CAF)
    • X10
    • Chapel
  – Libraries
    • OpenSHMEM
    • Global Arrays

Page 8: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

MPI+PGAS for Exascale Architectures and Applications

• Hierarchical architectures with multiple address spaces
• (MPI + PGAS) Model
  – MPI across address spaces
  – PGAS within an address space
• MPI is good at moving data between address spaces
• Within an address space, MPI can interoperate with other shared memory programming models
• Can co-exist with OpenMP for offloading computation
• Applications can have kernels with different communication patterns
• Can benefit from different models
• Re-writing complete applications can be a huge effort
• Port critical kernels to the desired model instead

Page 9: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Hybrid (MPI+PGAS) Programming

• Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics
• Benefits:
  – Best of Distributed Computing Model
  – Best of Shared Memory Computing Model
• Exascale Roadmap*:
  – "Hybrid Programming is a practical way to program exascale systems"

[Figure: an HPC application composed of kernels (Kernel 1 ... Kernel N), where selected kernels (e.g., Kernel 2, Kernel N) are re-implemented in PGAS while the remaining kernels stay in MPI]

* The International Exascale Software Roadmap, Dongarra, J., Beckman, P. et al., Volume 25, Number 1, 2011, International Journal of High Performance Computer Applications, ISSN 1094-3420

Page 10: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Designing Software Libraries for Multi-Petaflop and Exaflop Systems: Challenges

[Layered view, from applications down to hardware:]
• Application Kernels/Applications
• Middleware
• Programming Models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenACC, Cilk, Hadoop, MapReduce, etc.
• Communication Library or Runtime for Programming Models
  – Point-to-point Communication (two-sided & one-sided)
  – Collective Communication
  – Synchronization & Locks
  – I/O & File Systems
  – Fault Tolerance
• Networking Technologies (InfiniBand, 40/100GigE, Aries, BlueGene)
• Multi/Many-core Architectures
• Accelerators (NVIDIA and MIC)

Co-Design Opportunities and Challenges across Various Layers: Performance, Scalability, Fault-Resilience

Page 11: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Challenges in Designing (MPI+X) at Exascale

• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely low memory footprint
• Balancing intra-node and inter-node communication for next-generation multi-core (128-1024 cores/node)
  – Multiple end-points per node
• Support for efficient multi-threading
• Support for GPGPUs and Accelerators
• Scalable collective communication
  – Offload
  – Non-blocking
  – Topology-aware
  – Power-aware
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• Support for Hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, ...)

Page 12: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

MVAPICH2/MVAPICH2-X Software

• High-performance open-source MPI library for InfiniBand, 10GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  – MVAPICH2-X (MPI + PGAS), available since 2012
  – Support for GPGPUs and MIC
  – Used by more than 2,150 organizations in 72 countries
  – More than 214,000 downloads directly from the OSU site
  – Empowering many TOP500 clusters
    • 7th-ranked 519,640-core cluster (Stampede) at TACC
    • 11th-ranked 74,358-core cluster (Tsubame 2.5) at Tokyo Institute of Technology
    • 16th-ranked 96,192-core cluster (Pleiades) at NASA
    • 75th-ranked 16,896-core cluster (Keeneland) at GaTech, and many others ...
  – Available with software stacks of many IB, HSE, and server vendors, including Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Partner in the U.S. NSF-TACC Stampede system

Page 13: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

MVAPICH2 2.0 GA and MVAPICH2-X 2.0 GA

• Released on 06/20/14
• Major features and enhancements
  – Based on MPICH-3.1
  – MPI-3 RMA support
  – Support for non-blocking collectives
  – MPI-T support (see the sketch below)
  – CMA support is default for intra-node communication
  – Optimization of collectives with CMA support
  – Large message transfer support
  – Reduced memory footprint
  – Improved job start-up time
  – Optimization of collectives and tuning for multiple platforms
  – Updated hwloc to version 1.9
• MVAPICH2-X 2.0 GA supports hybrid MPI + PGAS (UPC and OpenSHMEM) programming models
  – Based on MVAPICH2 2.0 GA, including MPI-3 features
  – Compliant with UPC 2.18.0 and OpenSHMEM v1.0f
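The MPI-T (MPI tools information) interface mentioned above is part of the MPI-3 standard; the following is a minimal sketch (not taken from the talk) of how an application or tool might query it against any MPI-3 library such as MVAPICH2. Variable names are illustrative.

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal MPI-T sketch: count the control and performance variables
 * exposed by the MPI library through the MPI-3 tools interface. */
int main(int argc, char **argv)
{
    int provided, num_cvars, num_pvars;

    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided); /* initialize the tools interface */
    MPI_Init(&argc, &argv);

    MPI_T_cvar_get_num(&num_cvars);   /* control variables (tuning knobs) */
    MPI_T_pvar_get_num(&num_pvars);   /* performance variables (counters) */
    printf("MPI-T exposes %d control and %d performance variables\n",
           num_cvars, num_pvars);

    MPI_Finalize();
    MPI_T_finalize();                 /* close the tools interface */
    return 0;
}
```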

Page 14: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Overview of a Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale

• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely low memory footprint
  – Support with Dynamically Connected Transport (DCT)
• Collective communication
  – Hardware-multicast-based
  – Offload and non-blocking
  – Topology-aware
  – Power-aware
• Support for GPGPUs
• Support for Intel MICs
• Fault-tolerance
• Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, ...) with a unified runtime

Page 15: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

One-way Latency: MPI over IB with MVAPICH2

[Charts: small-message and large-message one-way latency (us) vs. message size (bytes) for Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR, and Ivy-ConnectIB-DualFDR; small-message latencies range from about 0.99 us to 1.82 us across the configurations]

Platforms: DDR, QDR - 2.4 GHz quad-core (Westmere) Intel, PCIe Gen2 with IB switch; FDR - 2.6 GHz octa-core (SandyBridge) Intel, PCIe Gen3 with IB switch; ConnectIB-Dual FDR - 2.6 GHz octa-core (SandyBridge) and 2.8 GHz deca-core (IvyBridge) Intel, PCIe Gen3 with IB switch
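For context, these numbers come from standard point-to-point micro-benchmarks; the sketch below is a minimal, hypothetical ping-pong loop in the same spirit (it is not the actual OSU benchmark code), where half the round-trip time is reported as the one-way latency.

```c
#include <mpi.h>
#include <stdio.h>

#define ITERS 1000
#define SIZE  8   /* message size in bytes; vary to reproduce the curves */

/* Minimal ping-pong latency sketch: run with exactly 2 ranks. */
int main(int argc, char **argv)
{
    char buf[SIZE];
    int rank;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("one-way latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * ITERS));

    MPI_Finalize();
    return 0;
}
```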

Page 16: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Bandwidth: MPI over IB with MVAPICH2

[Charts: unidirectional and bidirectional bandwidth (MBytes/sec) vs. message size (bytes) for Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR, and Ivy-ConnectIB-DualFDR; peak unidirectional bandwidth reaches about 12,810 MBytes/sec and peak bidirectional bandwidth about 24,727 MBytes/sec with dual-rail ConnectIB FDR]

Platforms: DDR, QDR - 2.4 GHz quad-core (Westmere) Intel, PCIe Gen2 with IB switch; FDR - 2.6 GHz octa-core (SandyBridge) Intel, PCIe Gen3 with IB switch; ConnectIB-Dual FDR - 2.6 GHz octa-core (SandyBridge) and 2.8 GHz deca-core (IvyBridge) Intel, PCIe Gen3 with IB switch

Page 17: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

MVAPICH2 Two-Sided Intra-Node Performance (Shared Memory and Kernel-based Zero-copy Support (LiMIC and CMA))

Latest MVAPICH2 2.0, Intel Ivy-bridge

[Charts: intra-socket and inter-socket latency vs. message size (0.18 us intra-socket and 0.45 us inter-socket for small messages); intra-socket and inter-socket bandwidth vs. message size for CMA, shared memory (Shmem), and LiMIC, peaking at about 14,250 MB/s and 13,749 MB/s]

Page 18: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

MPI-3 RMA Get/Put with Flush Performance

Latest MVAPICH2 2.0, Intel Sandy-bridge with Connect-IB (single-port)

[Charts: inter-node Get/Put latency (about 1.56-2.04 us for small messages), intra-socket Get/Put latency (down to 0.08 us), inter-node Get/Put bandwidth (up to about 6,881 MB/s), and intra-socket Get/Put bandwidth (up to about 15,364 MB/s) vs. message size]
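The "Put with flush" numbers above correspond to MPI-3 passive-target RMA; a minimal sketch of that usage pattern (illustrative buffers, not the benchmark code) looks like this:

```c
#include <mpi.h>

/* MPI-3 RMA sketch: rank 0 Puts into rank 1's window and uses
 * MPI_Win_flush to complete the operation at the target.
 * Run with at least 2 ranks. */
int main(int argc, char **argv)
{
    int rank;
    double local = 42.0, winbuf = 0.0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Expose one double on every rank */
    MPI_Win_create(&winbuf, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* Passive-target epoch on all ranks */
    MPI_Win_lock_all(0, win);
    if (rank == 0) {
        MPI_Put(&local, 1, MPI_DOUBLE, 1 /* target rank */,
                0 /* target displacement */, 1, MPI_DOUBLE, win);
        MPI_Win_flush(1, win);   /* complete the Put at the target */
    }
    MPI_Win_unlock_all(win);

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```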

Page 19: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Minimizing Memory Footprint with XRC and Hybrid Mode

• Memory usage for 32K processes with 8 cores per node can be 54 MB/process (for connections)
• NAMD performance improves when there is frequent communication to many peers
• Both UD and RC/XRC have benefits; the Hybrid mode gives the best of both
• Available since MVAPICH2 1.7 as an integrated interface
• Runtime parameters: RC is the default; UD - MV2_USE_ONLY_UD=1; Hybrid - MV2_HYBRID_ENABLE_THRESHOLD=1

[Charts: memory usage (MB/process) vs. number of connections for MVAPICH2-RC and MVAPICH2-XRC; normalized NAMD time (1,024 cores) on the apoa1, er-gre, f1atpase, and jac datasets with MVAPICH2-RC vs. MVAPICH2-XRC; and time (us) vs. number of processes (128-1024) for UD, Hybrid, and RC, showing 26-40% improvements]

M. Koop, J. Sridhar and D. K. Panda, "Scalable MPI Design over InfiniBand using eXtended Reliable Connection," Cluster '08

Page 20: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Dynamic Connected (DC) Transport in MVAPICH2

• Constant connection cost (one QP for any peer)
• Full feature set (RDMA, atomics, etc.)
• Separate objects for send (DC Initiator) and receive (DC Target)
  – DC Target identified by "DCT Number"
  – Messages routed with (DCT Number, LID)
  – Requires the same "DC Key" to enable communication
• Initial study done in MVAPICH2
  – More details in Research Papers 06, Wednesday June 25, 12:00 PM to 12:30 PM
• DCT support available publicly in Mellanox OFED 2.2.01

[Figure: processes P0-P7 on Nodes 0-3 connected over the IB network]

[Charts: connection memory (KB) for Alltoall vs. number of processes (80-640) for RC, DC-Pool, UD, and XRC (RC memory grows with process count while DC-Pool stays nearly constant); normalized execution time for NAMD (apoa1, large data set) at 160-620 processes for RC, DC-Pool, UD, and XRC]

Page 21: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Overview of a Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale

• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely low memory footprint
  – Support with Dynamically Connected Transport (DCT)
• Collective communication
  – Hardware-multicast-based
  – Offload and non-blocking
  – Topology-aware
  – Power-aware
• Support for GPGPUs
• Support for Intel MICs
• Fault-tolerance
• Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, ...) with a unified runtime

Page 22: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Hardware Multicast-aware MPI_Bcast on Stampede

[Charts: MPI_Bcast latency (us) with the default algorithm vs. hardware multicast, for small messages (2-512 bytes) and large messages (2 KB-128 KB) at 102,400 cores, and for 16-byte and 32-KByte messages as the number of nodes is scaled]

ConnectX-3-FDR (54 Gbps): 2.7 GHz dual octa-core (SandyBridge) Intel, PCIe Gen3 with Mellanox IB FDR switch

Page 23: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Application Benefits with Non-Blocking Collectives based on ConnectX-3 Collective Offload

• Modified P3DFFT with Offload-Alltoall does up to 17% better than the default version (128 processes)
• Modified HPL with Offload-Bcast does up to 4.5% better than the default version (512 processes)
• Modified Preconditioned Conjugate Gradient (PCG) solver with Offload-Allreduce does up to 21.8% better than the default version

[Charts: P3DFFT application run-time (s) vs. data size (512-800); HPL normalized performance vs. problem size (N) as % of total memory for HPL-Offload, HPL-1ring, and HPL-Host; PCG run-time (s) vs. number of processes (64-512) for PCG-Default and Modified-PCG-Offload]

K. Kandalla, et al., High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, ISC 2011
K. Kandalla, et al., Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011
K. Kandalla, et al., Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS '12
Can Network-Offload based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms? K. Kandalla, A. Buluc, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and D. K. Panda, IWPAPS '12
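The offloaded non-blocking collectives above are exposed through the standard MPI-3 interface; a minimal sketch of the overlap pattern an application such as the PCG solver relies on (illustrative names, not code from the talk) is:

```c
#include <mpi.h>

/* Sketch of computation/communication overlap with an MPI-3
 * non-blocking collective: start an Iallreduce, do independent work,
 * then wait for the reduction to complete. */
void dot_product_with_overlap(const double *local_partial, double *global_sum,
                              double *indep_work, int n, MPI_Comm comm)
{
    MPI_Request req;

    /* Start the reduction; with collective offload the HCA can progress
     * it while the host computes. */
    MPI_Iallreduce(local_partial, global_sum, 1, MPI_DOUBLE, MPI_SUM, comm, &req);

    /* Independent computation that does not need global_sum yet */
    for (int i = 0; i < n; i++)
        indep_work[i] *= 0.5;

    /* Complete the collective before using the result */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```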

Page 24: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Network-Topology-Aware Placement of Processes

• Can we design a highly scalable network topology detection service for IB?
• How do we design the MPI communication library in a network-topology-aware manner to efficiently leverage the topology information generated by our service?
• What are the potential benefits of using a network-topology-aware MPI library on the performance of parallel scientific applications?

• Reduces network topology discovery time from O(N_hosts^2) to O(N_hosts)
• 15% improvement in MILC execution time @ 2048 cores
• 15% improvement in Hypre execution time @ 1024 cores

[Charts: overall performance and split-up of physical communication for MILC on Ranger - performance for varying system sizes, plus default vs. topology-aware placement for a 2048-core run]

H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and D. K. Panda, Design of a Scalable InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes, SC '12. Best Paper and Best Student Paper Finalist

Page 25: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Power and Energy Savings with Power-Aware Collectives

• Performance and power comparison: MPI_Alltoall with 64 processes on 8 nodes
• CPMD application performance and power savings (5% and 32%, respectively)

K. Kandalla, E. P. Mancini, S. Sur and D. K. Panda, "Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters", ICPP '10

Page 26: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Overview of a Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale

• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely low memory footprint
  – Support with Dynamically Connected Transport (DCT)
• Collective communication
  – Hardware-multicast-based
  – Offload and non-blocking
  – Topology-aware
  – Power-aware
• Support for GPGPUs
• Support for Intel MICs
• Fault-tolerance
• Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, ...) with a unified runtime

Page 27: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

MVAPICH2-GPU: CUDA-Aware MPI

• Before CUDA 4: additional copies
  – Low performance and low productivity
• After CUDA 4: host-based pipeline
  – Unified Virtual Addressing (UVA)
  – Pipeline CUDA copies with IB transfers
  – High performance and high productivity
• After CUDA 5.5: GPUDirect-RDMA support
  – GPU-to-GPU direct transfer
  – Bypass the host memory
  – Hybrid design to avoid PCIe bottlenecks

[Figure: node with CPU, chipset, GPU, GPU memory, and InfiniBand adapter]

Page 28: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

MVAPICH2 1.8, 1.9, 2.0 Releases

• Support for MPI communication from NVIDIA GPU device memory (see the sketch below)
• High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
• High-performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
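To illustrate what CUDA-aware MPI means for application code, here is a minimal, hypothetical sketch: the device pointer returned by cudaMalloc is passed directly to MPI, and the library handles staging or GPUDirect transfers internally. Buffer sizes are illustrative.

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* CUDA-aware MPI sketch: pass a GPU device pointer straight to MPI_Send /
 * MPI_Recv; no explicit cudaMemcpy to a host staging buffer is needed.
 * Run with at least 2 ranks. */
int main(int argc, char **argv)
{
    const int n = 1 << 20;           /* 1M doubles */
    int rank;
    double *d_buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMalloc((void **)&d_buf, n * sizeof(double));   /* GPU device memory */

    if (rank == 0)
        MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```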

Page 29: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

GPUDirect RDMA (GDR) with CUDA

• Hybrid design using GPUDirect RDMA
  – GPUDirect RDMA and host-based pipelining
  – Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge
• Support for communication using multi-rail
• Support for Mellanox Connect-IB and ConnectX VPI adapters
• Support for RoCE with Mellanox ConnectX VPI adapters

[Figure: IB adapter, system memory, GPU memory, GPU, CPU, and chipset; PCIe P2P limits: SNB E5-2670 - P2P write 5.2 GB/s, P2P read < 1.0 GB/s; IVB E5-2680V2 - P2P write 6.4 GB/s, P2P read 3.5 GB/s]

S. Potluri, K. Hamidouche, A. Venkatesh, D. Bureddy and D. K. Panda, Efficient Inter-node MPI Communication using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs, Int'l Conference on Parallel Processing (ICPP '13)

Page 30: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Performance of MVAPICH2 with GPUDirect-RDMA: Latency

GPU-GPU Inter-node MPI Latency

[Charts: small-message (1 byte-4 KB) and large-message (8 KB-2 MB) latency for 1-Rail, 2-Rail, 1-Rail-GDR, and 2-Rail-GDR; GDR reduces small-message latency by up to 67%, to about 5.49 usec, and improves large-message latency by about 10%]

Based on MVAPICH2-2.0b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA, CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA plug-in

Page 31: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Performance of MVAPICH2 with GPUDirect-RDMA: Bandwidth

GPU-GPU Inter-node MPI Uni-Directional Bandwidth

[Charts: small-message (1 byte-4 KB) and large-message (8 KB-2 MB) bandwidth for 1-Rail, 2-Rail, 1-Rail-GDR, and 2-Rail-GDR; GDR delivers about 5x higher small-message bandwidth and up to 9.8 GB/s for large messages]

Based on MVAPICH2-2.0b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA, CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA plug-in

Page 32: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Performance of MVAPICH2 with GPUDirect-RDMA: Bi-Bandwidth

GPU-GPU Inter-node MPI Bi-directional Bandwidth

[Charts: small-message (1 byte-4 KB) and large-message (8 KB-2 MB) bi-directional bandwidth for 1-Rail, 2-Rail, 1-Rail-GDR, and 2-Rail-GDR; GDR delivers about 4.3x higher small-message bi-bandwidth and up to 19 GB/s (a 19% improvement) for large messages]

Based on MVAPICH2-2.0b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA, CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA plug-in

Page 33: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Using the MVAPICH2-GPUDirect Version

• MVAPICH2-2.0b with GDR support can be downloaded from https://mvapich.cse.ohio-state.edu/download/mvapich2gdr/
• System software requirements
  – Mellanox OFED 2.1 or later
  – NVIDIA Driver 331.20 or later
  – NVIDIA CUDA Toolkit 5.5 or later
  – Plugin for GPUDirect RDMA: http://www.mellanox.com/page/products_dyn?product_family=116
• Has optimized designs for point-to-point communication using GDR
• Contact the MVAPICH help list with any questions related to the package: mvapich-[email protected]

Page 34: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

New Design of MVAPICH2 with GPUDirect-RDMA

GPU-GPU Inter-node MPI Latency

[Chart: small-message latency (1 byte-4 KB) for MV2-GDR 2.0b (7.3 usec), MV2-GDR-New (Loopback) (5.0 usec), and MV2-GDR-New (Fast Copy) (3.5 usec) - up to a 52% improvement]

Intel Ivy Bridge (E5-2630 v2) node with 12 cores, NVIDIA Tesla K20c GPU, Mellanox Connect-IB Dual-FDR HCA, CUDA 6.0, Mellanox OFED 2.1 with GPUDirect-RDMA plug-in

Page 35: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Application-Level Evaluation (HOOMD-blue)

• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• Strong scaling (fixed 64K particles): Loopback and Fastcopy get up to 45% and 48% improvement for 32 GPUs
• Weak scaling (fixed 2K particles/GPU): Loopback and Fastcopy get up to 54% and 56% improvement for 16 GPUs

[Charts: average TPS vs. number of GPU nodes (4-64) for MV2-2.0b-GDR, MV2-NewGDR-Loopback, and MV2-NewGDR-Fastcopy, for HOOMD-blue strong scaling and weak scaling]

Page 36: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

MPI-3 RMA Support with GPUDirect RDMA

• MPI-3 RMA provides flexible synchronization and completion primitives

[Charts: small-message latency and message rate (1 byte-1 KB) for MV2-GDR 2.0b, MV2-GDR-New (Send/Recv), and MV2-GDR-New (Put+Flush); Put+Flush reduces small-message latency by about 39%, down to 2.8 usec]

Intel Ivy Bridge (E5-2630 v2) node with 12 cores, NVIDIA Tesla K20c GPU, Mellanox Connect-IB HCA, CUDA 6.0, Mellanox OFED 2.1 with GPUDirect-RDMA plug-in

New release with the advanced designs coming soon

Page 37: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Overview of a Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale

• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely low memory footprint
  – Support with Dynamically Connected Transport (DCT)
• Collective communication
  – Hardware-multicast-based
  – Offload and non-blocking
  – Topology-aware
  – Power-aware
• Support for GPGPUs
• Support for Intel MICs
• Fault-tolerance
• Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, ...) with a unified runtime

Page 38: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

MPI Applications on MIC Clusters

• Flexibility in launching MPI jobs on clusters with Xeon Phi

[Figure: spectrum from multi-core centric (Xeon) to many-core centric (Xeon Phi) execution modes - Host-only (MPI program on the Xeon only), Offload / reverse offload (MPI program on the host with offloaded computation on the Phi), Symmetric (MPI programs on both host and Phi), and Coprocessor-only (MPI program on the Phi only)]

Page 39: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Data Movement on Intel Xeon Phi Clusters

• Connected as PCIe devices - flexibility but complexity
• Critical for runtimes to optimize data movement, hiding the complexity

[Figure: two nodes, each with two CPUs (connected by QPI) and MICs on PCIe, connected by IB; communication paths include: 1. intra-socket, 2. inter-socket, 3. inter-node, 4. intra-MIC, 5. intra-socket MIC-MIC, 6. inter-socket MIC-MIC, 7. inter-node MIC-MIC, 8. intra-socket MIC-Host, 9. inter-socket MIC-Host, 10. inter-node MIC-Host, 11. inter-node MIC-MIC with the IB adapter on a remote socket, and more]

Page 40: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

MVAPICH2-MIC Design for Clusters with IB and MIC

• Offload mode
• Intra-node communication
  – Coprocessor-only mode
  – Symmetric mode
• Inter-node communication
  – Coprocessor-only mode
  – Symmetric mode
• Multi-MIC node configurations

Page 41: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Direct-IB Communication

• IB-verbs-based communication from the Xeon Phi
• No porting effort for MPI libraries with IB support
• Data movement between the IB HCA and the Xeon Phi is implemented as peer-to-peer (P2P) transfers
• Peak IB FDR bandwidth: 6397 MB/s

Achieved bandwidth with an IB FDR HCA (MT4099), by platform and path:
  – E5-2670 (SandyBridge), intra-socket: IB write to Xeon Phi (P2P write) 5280 MB/s (83%); IB read from Xeon Phi (P2P read) 962 MB/s (15%)
  – E5-2670 (SandyBridge), inter-socket: IB write 1075 MB/s (17%); IB read 370 MB/s (6%)
  – E5-2680 v2 (IvyBridge), intra-socket: IB write 6396 MB/s (100%); IB read 3421 MB/s (54%)
  – E5-2680 v2 (IvyBridge), inter-socket: IB write 1179 MB/s (19%); IB read 247 MB/s (4%)

Page 42: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

MIC-Remote-MIC P2P Communication

[Charts: inter-node MIC-to-MIC large-message latency (8 KB-2 MB) and bandwidth (1 byte-1 MB) for the intra-socket and inter-socket P2P paths, with peak bandwidths of 5236 MB/s and 5594 MB/s]

Page 43: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Inter-node Symmetric Mode - P3DFFT Performance

• 3DStencil communication kernel: up to 50.2% improvement with MV2-MIC over MV2-INTRA-OPT
• P3DFFT (512x512x4K): up to 16.2% improvement
• Note: application performance in symmetric mode is closely tied to the load balancing of computation between the host and the Xeon Phi; try different decompositions / threading levels to identify the best settings for the target application

[Charts: time per step (sec) vs. processes on host+MIC (4+4, 8+8; 32 nodes) and communication time per step (sec) vs. host-MIC process combinations (2+2, 2+8; 8 threads/process; 8 nodes), comparing MV2-INTRA-OPT and MV2-MIC]

S. Potluri, D. Bureddy, K. Hamidouche, A. Venkatesh, K. Kandalla, H. Subramoni and D. K. Panda, MVAPICH-PRISM: A Proxy-based Communication Framework using InfiniBand and SCIF for Intel MIC Clusters, Int'l Conference on Supercomputing (SC '13), November 2013

Page 44: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)

[Charts: 32-node Allgather (16H + 16M) small-message latency (1 byte-1 KB), 32-node Allgather (8H + 8M) large-message latency (8 KB-1 MB), and 32-node Alltoall (8H + 8M) large-message latency (4 KB-512 KB) for MV2-MIC vs. MV2-MIC-Opt, with improvements of up to 55-76%; P3DFFT execution time on 32 nodes (8H + 8M, size 2K*2K*1K), split into communication and computation, showing the benefit of MV2-MIC-Opt]

A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda, High Performance Alltoall and Allgather Designs for InfiniBand MIC Clusters; IPDPS '14, May 2014

Page 45: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Overview of a Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale

• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely low memory footprint
  – Support with Dynamically Connected Transport (DCT)
• Collective communication
  – Hardware-multicast-based
  – Offload and non-blocking
  – Topology-aware
  – Power-aware
• Support for GPGPUs
• Support for Intel MICs
• Fault-tolerance
• Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, ...) with a unified runtime

Page 46: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Fault Tolerance

• Component failures are common in large-scale clusters
• This imposes a need for reliability and fault tolerance
• Multiple challenges:
  – Checkpoint-Restart vs. Process Migration
  – Benefits of Scalable Checkpoint-Restart (SCR) support
    • Application-guided
    • Application-transparent

Page 47: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Checkpoint-Restart vs. Process Migration: Low-Overhead Failure Prediction with IPMI

• Job-wide Checkpoint/Restart is not scalable
• A job-pause and process-migration framework can deliver pro-active fault-tolerance
• It also allows cluster-wide load-balancing by means of job compaction

[Charts: time to migrate 8 MPI processes, with relative times of 1x, 2.3x, 4.3x, and 10.7x across the compared approaches; LU Class C benchmark (64 processes) execution time broken into job stall, checkpoint (migration), resume, and restart phases for migration, migration without RDMA, CR (ext3), and CR (PVFS), showing 2.03x and 4.49x improvements]

X. Ouyang, R. Rajachandrasekar, X. Besseron, D. K. Panda, High Performance Pipelined Process Migration with RDMA, CCGrid 2011
X. Ouyang, S. Marcarelli, R. Rajachandrasekar and D. K. Panda, RDMA-Based Job Migration Framework for MPI over InfiniBand, Cluster 2010

Page 48: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Multi-Level Checkpointing with ScalableCR (SCR)

• LLNL's Scalable Checkpoint/Restart library
• Can be used for application-guided and application-transparent checkpointing (a sketch of the application-guided API follows this list)
• Effective utilization of the storage hierarchy (ordered from low to high checkpoint cost and resiliency):
  – Local: store checkpoint data on the node's local storage, e.g. local disk, ramdisk
  – Partner: write to local storage and to a partner node
  – XOR: write the file to local storage while small sets of nodes collectively compute and store parity redundancy data (RAID-5 style)
  – Stable storage: write to the parallel file system
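For the application-guided mode, applications mark their checkpoint phases through SCR's C API. The following is a minimal sketch based on the publicly documented SCR calls; the checkpoint writer and file names are hypothetical.

```c
#include <mpi.h>
#include <scr.h>
#include <stdio.h>

/* Hypothetical application-provided checkpoint writer */
extern void write_my_checkpoint(const char *path);

/* Sketch of application-guided multi-level checkpointing with SCR:
 * SCR decides whether a checkpoint is due and where (local / partner /
 * XOR / parallel file system) each process's file should be written. */
static void maybe_checkpoint(int step)
{
    int need = 0;
    SCR_Need_checkpoint(&need);          /* ask SCR if it is time to checkpoint */
    if (!need)
        return;

    SCR_Start_checkpoint();              /* begin a checkpoint phase */

    char name[256], path[SCR_MAX_FILENAME];
    snprintf(name, sizeof(name), "ckpt_step%d.dat", step);
    SCR_Route_file(name, path);          /* SCR returns the actual path to write */
    write_my_checkpoint(path);

    SCR_Complete_checkpoint(1 /* this process's checkpoint is valid */);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    SCR_Init();                           /* after MPI_Init */
    for (int step = 0; step < 100; step++)
        maybe_checkpoint(step);
    SCR_Finalize();                       /* before MPI_Finalize */
    MPI_Finalize();
    return 0;
}
```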

Page 49: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Application-guided Multi-Level Checkpointing

• Checkpoint-writing phase times of a representative SCR-enabled MPI application
• 512 MPI processes (8 processes/node)
• Approx. 51 GB checkpoints

[Chart: checkpoint writing time (s) for PFS, MVAPICH2+SCR (Local), MVAPICH2+SCR (Partner), and MVAPICH2+SCR (XOR)]

Page 50: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Transparent Multi-Level Checkpointing

• ENZO cosmology application - radiation transport workload
• Using MVAPICH2's CR protocol instead of the application's built-in CR mechanism
• 512 MPI processes running on the Stampede supercomputer at TACC
• Approx. 12.8 GB checkpoints: ~2.5x reduction in checkpoint I/O costs

Page 51: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Overview of a Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale

• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely low memory footprint
  – Support with Dynamically Connected Transport (DCT)
• Collective communication
  – Hardware-multicast-based
  – Offload and non-blocking
  – Topology-aware
  – Power-aware
• Support for GPGPUs
• Support for Intel MICs
• Fault-tolerance
• Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, ...) with a unified runtime

Page 52: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

MVAPICH2-X for Hybrid MPI + PGAS Applications

[Architecture: MPI applications, OpenSHMEM applications, UPC applications, and hybrid (MPI + PGAS) applications issue MPI, OpenSHMEM, and UPC calls into the unified MVAPICH2-X runtime, which runs over InfiniBand, RoCE, and iWARP]

• Unified communication runtime for MPI, UPC, and OpenSHMEM, available with MVAPICH2-X 1.9 onwards (http://mvapich.cse.ohio-state.edu); a minimal hybrid sketch follows below
• Feature highlights
  – Supports MPI(+OpenMP), OpenSHMEM, UPC, MPI(+OpenMP) + OpenSHMEM, and MPI(+OpenMP) + UPC
  – MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard compliant
  – Scalable inter-node and intra-node communication - point-to-point and collectives
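To make the hybrid model concrete, here is a minimal, hypothetical MPI + OpenSHMEM sketch of the kind MVAPICH2-X's unified runtime is designed to support. It uses only standard MPI and OpenSHMEM 1.0-style calls, with illustrative variable names, and assumes MPI ranks and SHMEM PEs refer to the same processes (which a unified runtime provides).

```c
#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

/* Hybrid MPI + OpenSHMEM sketch: a one-sided OpenSHMEM put updates a
 * neighbor's symmetric buffer, then an MPI collective reduces the result.
 * With a unified runtime both models share one communication library. */
int main(int argc, char **argv)
{
    int rank, global_sum = 0;

    MPI_Init(&argc, &argv);
    start_pes(0);                         /* initialize OpenSHMEM (v1.0 style) */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int me = _my_pe();
    int npes = _num_pes();

    /* Symmetric heap allocation, visible to one-sided puts/gets */
    int *neighbor_flag = (int *)shmalloc(sizeof(int));
    *neighbor_flag = 0;
    shmem_barrier_all();

    /* One-sided PGAS-style communication: write our PE number into the
     * right neighbor's symmetric buffer */
    int right = (me + 1) % npes;
    shmem_int_put(neighbor_flag, &me, 1, right);
    shmem_barrier_all();

    /* Two-sided/collective MPI communication in the same program */
    MPI_Allreduce(neighbor_flag, &global_sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum of received PE numbers = %d\n", global_sum);

    shfree(neighbor_flag);
    MPI_Finalize();
    return 0;
}
```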

Page 53: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

OpenSHMEM Collective Communication Performance

[Charts: Reduce (1,024 processes), Broadcast (1,024 processes), Collect (1,024 processes), and Barrier latency vs. number of processes (128-2048), comparing MV2X-SHMEM, Scalable-SHMEM, and OMPI-SHMEM; MV2X-SHMEM shows improvements of up to 180x, 18x, 12x, and 3x, respectively]

Page 54: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

OpenSHMEM Application Evaluation

[Charts: execution time (s) at 256 and 512 processes for UH-SHMEM, OMPI-SHMEM (FCA), Scalable-SHMEM (FCA), and MV2X-SHMEM on the Heat Image and Daxpy applications]

• Improved performance for OMPI-SHMEM and Scalable-SHMEM with FCA
• Execution time for 2D Heat Image at 512 processes (sec): UH-SHMEM - 523, OMPI-SHMEM - 214, Scalable-SHMEM - 193, MV2X-SHMEM - 169
• Execution time for DAXPY at 512 processes (sec): UH-SHMEM - 57, OMPI-SHMEM - 56, Scalable-SHMEM - 9.2, MV2X-SHMEM - 5.2

J. Jose, J. Zhang, A. Venkatesh, S. Potluri, and D. K. Panda, A Comprehensive Performance Evaluation of OpenSHMEM Libraries on InfiniBand Clusters, OpenSHMEM Workshop (OpenSHMEM '14), March 2014

Page 55: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Hybrid MPI+OpenSHMEM Graph500 Design

• Performance of the Hybrid (MPI+OpenSHMEM) Graph500 design
• 8,192 processes: 2.4x improvement over MPI-CSR, 7.6x improvement over MPI-Simple
• 16,384 processes: 1.5x improvement over MPI-CSR, 13x improvement over MPI-Simple

[Charts: execution time (s) vs. number of processes (4K-16K); strong scalability - billions of traversed edges per second (TEPS) vs. number of processes (1024-8192); and weak scalability - TEPS vs. scale (26-29), for MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM)]

J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC '13), June 2013
J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012

Page 56: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Hybrid MPI+OpenSHMEM Sort Application

• Performance of the Hybrid (MPI+OpenSHMEM) Sort application
• Execution time: 4 TB input at 4,096 cores - MPI: 2408 seconds, Hybrid: 1172 seconds (51% improvement over the MPI-based design)
• Strong scalability (constant input size of 500 GB): at 4,096 cores - MPI: 0.16 TB/min, Hybrid: 0.36 TB/min (55% improvement over the MPI-based design)

[Charts: execution time (seconds) vs. input data and number of processes (500GB-512 through 4TB-4K), and sort rate (TB/min) vs. number of processes (512-4,096), for MPI and Hybrid]

Page 57: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

MVAPICH2/MVAPICH2-X - Plans for Exascale

• Performance and memory scalability toward 900K-1M cores
  – Enhanced design with the Dynamically Connected Transport (DCT) service and Connect-IB
• Enhanced optimization for GPGPU and coprocessor support
  – Extending the GPGPU support (GPUDirect RDMA) with CUDA 6.5 and beyond
  – Support for Intel MIC (Knights Landing)
• Taking advantage of the collective offload framework
  – Including support for non-blocking collectives (MPI 3.0)
• RMA support (as in MPI 3.0)
• Extended topology-aware collectives
• Power-aware collectives
• Support for the MPI Tools Interface (as in MPI 3.0)
• Checkpoint-Restart and migration support with in-memory checkpointing
• Hybrid MPI+PGAS programming support with GPGPUs and Accelerators

Page 58: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Concluding Remarks

• InfiniBand with the RDMA feature is gaining momentum in HPC systems, with leading performance and growing usage
• As the HPC community moves to Exascale, new solutions are needed in the MPI and hybrid MPI+PGAS stacks for supporting GPUs and accelerators
• Demonstrated how such solutions can be designed with MVAPICH2/MVAPICH2-X, and their performance benefits
• Such designs will allow application scientists and engineers to take advantage of upcoming exascale systems

Page 59: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Funding Acknowledgments

Funding support by [sponsor logos]

Equipment support by [vendor logos]

Page 60: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Personnel Acknowledgments

Current Students
– S. Chakraborthy (Ph.D.)
– N. Islam (Ph.D.)
– J. Jose (Ph.D.)
– M. Li (Ph.D.)
– R. Rajachandrasekhar (Ph.D.)
– M. Rahman (Ph.D.)
– D. Shankar (Ph.D.)
– R. Shir (Ph.D.)
– A. Venkatesh (Ph.D.)
– J. Zhang (Ph.D.)

Past Students
– P. Balaji (Ph.D.)
– D. Buntinas (Ph.D.)
– S. Bhagvat (M.S.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– G. Santhanaraman (Ph.D.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)

Past Research Scientist
– S. Sur

Current Post-Docs
– X. Lu
– K. Hamidouche

Past Post-Docs
– H. Wang
– X. Besseron
– H.-W. Jin
– M. Luo
– E. Mancini
– S. Marcarelli
– J. Vienne

Current Senior Research Associate
– H. Subramoni

Current Programmers
– M. Arnold
– J. Perkins

Past Programmers
– D. Bureddy

Page 61: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Upcoming 2nd Annual MVAPICH User Group (MUG) Meeting

• August 25-27, 2014; Columbus, Ohio, USA
• Keynote talks, invited talks, contributed presentations
• Half-day tutorial on MVAPICH2, MVAPICH2-X and MVAPICH2-GDR optimization and tuning
• Speakers (confirmed so far):
  – Keynote: Dan Stanzione (TACC)
  – Keynote: Darren Kerbyson (PNNL)
  – Sadaf Alam (CSCS, Switzerland)
  – Onur Celebioglu (Dell)
  – Jens Glaser (Univ. of Michigan)
  – Jeff Hammond (Intel)
  – Adam Moody (LLNL)
  – David Race (Cray)
  – Davide Rossetti (NVIDIA)
  – Gilad Shainer (Mellanox)
  – Sayantan Sur (Intel)
  – Mahidhar Tatineni (SDSC)
  – Jerome Vienne (TACC)
• Call for Contributed Presentations (deadline: July 1, 2014)
• More details at: http://mug.mvapich.cse.ohio-state.edu

Page 62: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Multiple Positions Available in My Group

• Looking for bright and enthusiastic personnel to join as
  – Visiting Scientists
  – Post-Doctoral Researchers
  – PhD Students
  – MPI Programmer/Software Engineer
  – Hadoop/Big Data Programmer/Software Engineer
• If interested, please contact me at this conference and/or send an e-mail to [email protected]

Page 63: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans

Pointers

http://mvapich.cse.ohio-state.edu
http://hadoop-rdma.cse.ohio-state.edu
http://nowlab.cse.ohio-state.edu

[email protected]
http://www.cse.ohio-state.edu/~panda