© 2018 Arm Limited
Arm and NVIDIA: Accelerating Supercomputing
David Lecomber
Senior Director, Infrastructure and HPC Tools
Arm is everywhere
We design & license IP, we do not manufacture chips
Partners build products for their target markets
One size is not always the best fit for all
HPC is a great fit for co-design and collaboration
Partnership is key
Choice is good
21 billion chips in the past year
Mobile/Embedded/IoT/Automotive/GPUs/Servers
Arm Technology Connects the World
The Cloud to Edge Infrastructure Foundation for a World of 1T Intelligent Devices
Arm #1 market share in Infrastructure!
Top-of-Rack Switches
Cellular Base Stations
Gateways
WAN Routers
Servers
[Chart: Infrastructure Processor Unit Market Share, 2011 to 2018; vertical axis 5% to 30%. Source: IDC and Arm]
Arm in HPC
Software ecosystem
• Architecture (ISA) – v8.x and SVE
• Neoverse IP roadmap
Vanguard Astra by HPE
• 2,592 HPE Apollo 70 compute nodes
• 5,184 CPUs, 145,152 cores, 2.3 PFLOPs (peak)
• Marvell ThunderX2 ARM SoC, 28 core, 2.0 GHz
• Memory per node: 128 GB (16 x 8 GB DR DIMMs)
• Aggregate capacity: 332 TB, 885 TB/s (peak)
• Mellanox IB EDR, ConnectX-5
• 112 36-port edge switches and three 648-port spine switches
• Red Hat RHEL for Arm
• HPE Apollo 4520 All-flash Lustre storage
• Storage Capacity: 403 TB (usable)
• Storage Bandwidth: 244 GB/s
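The aggregate figures above follow from the per-node numbers. A minimal sketch of that arithmetic is given here; the per-core floating-point rate (8 double-precision FLOPs per cycle) and the memory speed (DDR4-2666, one 64-bit channel per DIMM) are assumptions that do not appear on the slide, chosen because they reproduce the quoted peaks.

```python
# Figures taken from the slide.
NODES = 2592
SOCKETS_PER_NODE = 2            # 5,184 CPUs across 2,592 nodes
CORES_PER_SOCKET = 28
CLOCK_HZ = 2.0e9
MEM_PER_NODE_GB = 128
DIMMS_PER_NODE = 16

# Assumptions, NOT stated on the slide.
FLOPS_PER_CYCLE = 8             # assumed double-precision FLOPs per core per cycle
DDR_TRANSFERS_PER_S = 2666e6    # assumed DDR4-2666, one channel per DIMM
BYTES_PER_TRANSFER = 8          # 64-bit memory channel


def astra_totals():
    """Derive system-wide totals from the per-node figures."""
    cpus = NODES * SOCKETS_PER_NODE
    cores = cpus * CORES_PER_SOCKET
    peak_pflops = cores * CLOCK_HZ * FLOPS_PER_CYCLE / 1e15
    mem_tb = NODES * MEM_PER_NODE_GB / 1000
    bw_tbs = (NODES * DIMMS_PER_NODE * DDR_TRANSFERS_PER_S
              * BYTES_PER_TRANSFER / 1e12)
    return {
        "cpus": cpus,
        "cores": cores,
        "peak_pflops": peak_pflops,
        "mem_tb": mem_tb,
        "bw_tbs": bw_tbs,
    }


totals = astra_totals()
for name, value in totals.items():
    print(f"{name}: {value:,.1f}")
```

If the assumed per-core rate or memory speed is wrong, only the last two derived values change; the CPU, core and capacity totals depend solely on numbers from the slide.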
Development and deployment of NVIDIA CUDA on Arm
Consistency and compatibility provided by the standard, familiar tools
Tools from NVIDIA
• NVIDIA CUDA and OpenACC compilers and PGI compilers
• NVIDIA CUDA-X performance libraries
• GPU profiling and debugging of single-server applications
Tools from Arm
• High performance C, C++, Fortran compilers for Arm architecture cores
• Performance libraries for Arm architecture cores
• Combined CPU and GPU profiling and debugging for single and multi-server applications
Ready for Deployment
• Containerized applications ready to run for deep learning, machine learning and HPC
NVIDIA+ARM HPC software stack evaluation using Wombat at NCCS
Applications:
• CoMet: Comparative Genomics
• LAMMPS: Molecular Dynamics
• NAMD: Molecular Dynamics
• VMD: MD Visualization
• DCA++: Material Science
• Gromacs: Molecular Dynamics
• Gamera: Earthquake Simulator
• LSMS: Material Science
Evaluation done by: [logos not recoverable]
Parallel Prog Models & Sci. Libraries:
• Kokkos: C++ Prog. Model
• Magma, SLATE: Sci. Libraries
• Open MPI: Distributed Prog. Model
• CUDA: GPU Prog. Model
Benchmarks & Mini-apps:
• BabelStream: Memory Transfer
• TeaLeaf: Heat Cond.
• MiniSweep: Radiation Transport
• CloverLeaf: Lagrangian Eulerian hydrodynamic
• SNAP: Radiation Transport
• Patatrack: Pixel Reconstruction
Arm Allinea Studio
Scalable tools for developing, debugging and optimizing CPU and GPU applications
Arm Allinea Studio provides Arm Forge for debugging and profiling
• Multi-process and multi-server MPI and OpenMP support
• Unrivalled scalability – deployed at large scale by developers on Summit and Titan systems
Arm Compiler for Linux and Arm Performance Libraries provide high performance for the CPU portions of accelerated HPC applications on Arm cores
End-to-end CPU and GPU tools for end-to-end software
Address the whole application – CPU and GPU simultaneously
Debugging control flow on CPU and GPU – and examining data on device and host
Performance analysis from GPU and CPU in one coherent view
Optimization of CPU and GPU performance
Where should I focus my effort?
Performance comes from understanding an application within the whole system
IO, MPI, CPU and GPU matter
Is the speed of this ResNet18 image classifier limited by GPU or CPU performance?
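One rough way to approach that question without a profiler is to accumulate wall-clock time per pipeline stage and compare the shares. The sketch below is only an illustration, not how Arm Forge works: `StageTimer`, `fake_load_image` and `fake_inference` are hypothetical names, and the two stand-in stages simply sleep. In a real CUDA application the GPU stage would need to be synchronized before the timer stops, because kernel launches return asynchronously.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock seconds per named stage (hypothetical helper)."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def shares(self):
        """Return each stage's fraction of the total measured time."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: t / total for name, t in self.totals.items()}


def fake_load_image():
    # Stand-in for CPU-side image loading and decoding (assumed slow).
    time.sleep(0.02)


def fake_inference():
    # Stand-in for a synchronized GPU inference step (assumed fast).
    time.sleep(0.002)


timer = StageTimer()
for _ in range(5):
    with timer.stage("load"):
        fake_load_image()
    with timer.stage("gpu"):
        fake_inference()

shares = timer.shares()
bottleneck = max(shares, key=shares.get)
print(f"largest share of time: {bottleneck} ({shares[bottleneck]:.0%})")
```

With these stand-in durations the loading stage dominates, which is the kind of result that would point tuning effort at the CPU side rather than the GPU.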
Identifying why a component is bottlenecked
Who is waiting for whom?
The CPU is rarely waiting for the GPU here – loading images is our main bottleneck
• Tuned CPU compiler flags – a 3x speed up
• Using 4 threads to load images simultaneously – a 2x speed up
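The second fix, loading images on several threads, can be sketched with a standard thread pool. This is a minimal illustration under stated assumptions, not the code used in the talk: `load_image` is a hypothetical loader that sleeps to imitate a disk read, and the file names are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_image(path):
    """Hypothetical loader: sleeps to imitate a disk read, returns a placeholder."""
    time.sleep(0.02)
    return f"pixels:{path}"


def load_serial(paths):
    """Load images one after another on the calling thread."""
    return [load_image(p) for p in paths]


def load_threaded(paths, workers=4):
    """Load images on a pool of worker threads; map() keeps the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(load_image, paths))


paths = [f"img_{i}.jpg" for i in range(16)]  # placeholder file names

t0 = time.perf_counter()
serial = load_serial(paths)
serial_s = time.perf_counter() - t0

t0 = time.perf_counter()
threaded = load_threaded(paths, workers=4)
threaded_s = time.perf_counter() - t0

print(f"serial: {serial_s:.2f} s, 4 threads: {threaded_s:.2f} s")
```

Threads help in this sketch because the simulated read releases the interpreter while it waits; a loader dominated by pure-Python decoding would likely need processes or a native library to see a similar gain.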
Summary
NVIDIA GPUs on Arm architecture servers provide a new choice and access to a widening range of silicon providers
Application porting has been quick and straightforward – and applications are able to achieve high performance
Arm Allinea Studio helps to maximize application performance and is part of a rich and familiar developer ecosystem for NVIDIA CUDA on Arm
The Cloud to Edge Infrastructure Foundation for a World of 1T Intelligent Devices