
© 2018 Arm Limited

Arm and NVIDIA: Accelerating Supercomputing

David Lecomber

Senior Director, Infrastructure and HPC Tools


Arm is everywhere

We design & license IP, we do not manufacture chips

Partners build products for their target markets

One size is not always the best fit for all

HPC is a great fit for co-design and collaboration

Partnership is key

Choice is good

21 billion chips in the past year

Mobile/Embedded/IoT/Automotive/GPUs/Servers

Arm Technology Connects the World

The Cloud to Edge Infrastructure Foundation for a World of 1T Intelligent Devices


Arm #1 market share in Infrastructure!

Top-of-Rack Switches

Cellular Base Stations

Gateways

WAN Routers

Servers

[Chart: Infrastructure Processor Unit Market Share, 2011 to 2018; vertical axis from 5% to 30%. Source: IDC and Arm]

Confidential © 2019 Arm Limited

Arm in HPC

Software ecosystem

• Architecture (ISA) – v8.x and SVE

• Neoverse IP roadmap


Vanguard Astra by HPE

• 2,592 HPE Apollo 70 compute nodes

• 5,184 CPUs, 145,152 cores, 2.3 PFLOPs (peak)

• Marvell ThunderX2 Arm SoC, 28 cores, 2.0 GHz

• Memory per node: 128 GB (16 x 8 GB DR DIMMs)

• Aggregate memory capacity: 332 TB; aggregate memory bandwidth: 885 TB/s (peak)

• Mellanox IB EDR, ConnectX-5

• 112 36-port edge switches, 3 648-port spine switches

• Red Hat RHEL for Arm

• HPE Apollo 4520 All-flash Lustre storage

• Storage Capacity: 403 TB (usable)

• Storage Bandwidth: 244 GB/s
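
The headline figures in this list follow from the node count. As a rough consistency check, the sketch below recomputes them. It assumes two sockets per node (implied by 5,184 CPUs across 2,592 nodes), 8 double-precision FLOPs per core per cycle, and eight DDR4-2666 memory channels per socket. The last two values are assumptions and are not stated on the slide.

```python
# Figures taken from the slide
NODES = 2592
SOCKETS_PER_NODE = 2        # 5,184 CPUs / 2,592 nodes
CORES_PER_SOCKET = 28
CLOCK_HZ = 2.0e9
MEM_PER_NODE_GB = 128       # 16 x 8 GB DIMMs

# Assumptions (not on the slide)
FLOPS_PER_CYCLE = 8                  # double-precision FLOPs per core per cycle
CHANNELS_PER_SOCKET = 8              # DDR4 memory channels per socket
CHANNEL_BW_GBS = 2666e6 * 8 / 1e9    # DDR4-2666, 8 bytes per transfer

sockets = NODES * SOCKETS_PER_NODE
cores = sockets * CORES_PER_SOCKET
peak_pflops = cores * CLOCK_HZ * FLOPS_PER_CYCLE / 1e15
mem_tb = NODES * MEM_PER_NODE_GB / 1000
mem_bw_tbs = sockets * CHANNELS_PER_SOCKET * CHANNEL_BW_GBS / 1000

print(f"CPUs: {sockets}, cores: {cores}")
print(f"Peak compute: {peak_pflops:.1f} PFLOPs")
print(f"Aggregate memory: {mem_tb:.0f} TB, {mem_bw_tbs:.0f} TB/s (peak)")
```

Under these assumptions the computed values agree with the slide's figures to the precision quoted there.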


Development and deployment of NVIDIA CUDA on Arm

Consistency and compatibility provided by the standard, familiar tools

Tools from NVIDIA

• NVIDIA CUDA and OpenACC compilers and PGI compilers

• NVIDIA CUDA-X performance libraries

• GPU profiling and debugging of single-server applications

Tools from Arm

• High performance C, C++, Fortran compilers for Arm architecture cores

• Performance libraries for Arm architecture cores

• Combined CPU and GPU profiling and debugging for single and multi-server applications

Ready for Deployment

• Containerized applications ready to run for deep learning, machine learning and HPC


NVIDIA+ARM HPC software stack evaluation using Wombat at NCCS

Applications:

• CoMet – Comparative Genomics

• LAMMPS – Molecular Dynamics

• NAMD – Molecular Dynamics

• VMD – MD Visualization

• DCA++ – Material Science

• Gromacs – Molecular Dynamics

• Gamera – Earthquake Simulator

• LSMS – Material Science

Parallel Prog Models & Sci. Libraries:

• Kokkos – C++ Prog. Model

• Magma, SLATE – Sci. Libraries

• Open MPI – Distributed Prog. Model

• CUDA – GPU Prog. Model

Benchmarks & Mini-apps:

• BabelStream – Memory Transfer

• TeaLeaf – Heat Cond.

• MiniSweep – Radiation Transport

• CloverLeaf – Lagrangian-Eulerian hydrodynamic

• SNAP – Radiation Transport

• Patatrack – Pixel Reconstruction

[Evaluation done by: logos not recoverable from the extracted text]


Arm Allinea Studio

Scalable tools for developing, debugging and optimizing CPU and GPU applications

Arm Allinea Studio provides Arm Forge for debugging and profiling

• Multi-process and multi-server MPI and OpenMP support

• Unrivalled scalability – deployed at large scale by developers on Summit and Titan systems

Arm Compiler for Linux and Arm Performance Libraries provide high performance for the CPU portions of accelerated HPC applications on Arm cores


End-to-end CPU and GPU tools for end-to-end software

Address the whole application – CPU and GPU simultaneously

Debugging control flow on CPU and GPU – and examining data on device and host

Performance analysis from GPU and CPU in one coherent view


Optimization of CPU and GPU performance

Where should I focus my effort?

Performance comes from understanding an application within the whole system

IO, MPI, CPU and GPU matter

Is the speed of this ResNet18 image classifier limited by GPU or CPU performance?
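
One simple way to approach this question without a profiler is to time each stage of the pipeline separately. The sketch below is hypothetical: `load_batch` and `classify_batch` are stand-ins (simulated here with `time.sleep`) for CPU-side image loading and GPU-side inference, and they are not part of any tool named in these slides.

```python
import time
from collections import defaultdict

def load_batch():
    """Hypothetical stand-in for reading and decoding images on the CPU."""
    time.sleep(0.02)
    return [0] * 8

def classify_batch(batch):
    """Hypothetical stand-in for running the classifier on the GPU."""
    time.sleep(0.005)
    return [0 for _ in batch]

def profile_stages(num_batches):
    """Accumulate wall-clock time spent in each pipeline stage."""
    totals = defaultdict(float)
    for _ in range(num_batches):
        t0 = time.perf_counter()
        batch = load_batch()
        t1 = time.perf_counter()
        # With a real GPU framework, synchronize the device before taking
        # t2; otherwise asynchronous kernel launches make this stage look
        # faster than it is.
        classify_batch(batch)
        t2 = time.perf_counter()
        totals["load"] += t1 - t0
        totals["classify"] += t2 - t1
    return dict(totals)

totals = profile_stages(5)
bottleneck = max(totals, key=totals.get)
print(f"Time per stage: {totals}; slowest stage: {bottleneck}")
```

Manual timers like these only show where wall-clock time accumulates. A combined CPU and GPU profiler gives the same breakdown without modifying the code and also shows which side is idle.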


Identifying why a component is bottlenecked

Who is waiting for whom?

The CPU is rarely waiting for the GPU here – loading images is our main bottleneck

• Tuned CPU compiler flags – a 3x speed-up

• Using 4 threads to load images simultaneously – a 2x speed-up
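
The second improvement above, loading images with several threads at once, can be sketched with Python's standard `concurrent.futures` module. This is a minimal sketch under assumptions: `load_image` is a hypothetical placeholder that only reads the file bytes, and the slides do not say how the original loader was implemented.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def load_image(path):
    """Hypothetical loader: reads raw bytes; a real one would also decode."""
    return Path(path).read_bytes()

def load_images(paths, num_threads=4):
    """Load files concurrently; results keep the order of `paths`."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(load_image, paths))
```

Threads help when the loader spends its time in file I/O or in native decoding code that releases Python's global interpreter lock. For pure-Python decoding, a process pool may be the better fit.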


Summary

NVIDIA GPUs on Arm architecture servers provide a new choice and access to a widening range of silicon providers

Application porting has been quick and straightforward – and applications are able to achieve high performance

Arm Allinea Studio helps to maximize application performance and is part of a rich and familiar developer ecosystem for NVIDIA CUDA on Arm

