Arm and NVIDIA: Accelerating Supercomputing


  • © 2018 Arm Limited 1

    Arm and NVIDIA: Accelerating Supercomputing

    David Lecomber

    Senior Director, Infrastructure and HPC Tools

  • © 2018 Arm Limited 2

    Arm is everywhere

    We design & license IP, we do not manufacture chips

    Partners build products for their target markets

    One size is not always the best fit for all

    HPC is a great fit for co-design and collaboration

    Partnership is key

    Choice is good

    21 billion chips in the past year

    Mobile/Embedded/IoT/Automotive/GPUs/Servers

    Arm Technology Connects the World

  • © 2018 Arm Limited 3

    The Cloud to Edge Infrastructure Foundation for a World of 1T Intelligent Devices

  • © 2019 Arm Limited 4

    Arm #1 market share in Infrastructure!

    Top-of-Rack Switches

    Cellular Base Stations

    Gateways

    WAN Routers

    Servers

    [Chart: Infrastructure Processor Unit Market Share, 2011 to 2018; vertical axis from 5% to 30%. Source: IDC and Arm]

  • Confidential © 2019 Arm Limited

    Arm in HPC

    Software ecosystem

    • Architecture (ISA) – v8.x and SVE

    • Neoverse IP roadmap

  • Confidential © 2019 Arm Limited

    Vanguard Astra by HPE

    • 2,592 HPE Apollo 70 compute nodes

    • 5,184 CPUs, 145,152 cores, 2.3 PFLOPs (peak)

    • Marvell ThunderX2 Arm SoC, 28 cores, 2.0 GHz

    • Memory per node: 128 GB (16 x 8 GB DR DIMMs)

    • Aggregate memory capacity: 332 TB; aggregate memory bandwidth: 885 TB/s (peak)

    • Mellanox IB EDR, ConnectX-5

    • 112 36-port edge switches, 3 648-port spine switches

    • Red Hat RHEL for Arm

    • HPE Apollo 4520 all-flash Lustre storage

    • Storage Capacity: 403 TB (usable)

    • Storage Bandwidth: 244 GB/s
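The aggregate figures on this slide can be cross-checked from the per-node numbers. The sketch below does that arithmetic. The value of 8 double-precision FLOPs per core per cycle is an assumption that is not stated on the slide; it is chosen because it reproduces the quoted 2.3 PFLOPs peak. The function name `astra_totals` is hypothetical.

```python
# Figures quoted on the slide.
NODES = 2592
CPUS_PER_NODE = 2            # 5,184 CPUs across 2,592 nodes
CORES_PER_CPU = 28
CLOCK_HZ = 2.0e9             # 2.0 GHz
MEM_PER_NODE_GB = 128
AGG_MEM_BW_TBS = 885         # aggregate peak memory bandwidth, TB/s

# Assumption, not on the slide: double-precision FLOPs per core per cycle.
FLOPS_PER_CYCLE = 8


def astra_totals():
    """Recompute the system-wide totals from the per-node figures."""
    cpus = NODES * CPUS_PER_NODE
    cores = cpus * CORES_PER_CPU
    peak_pflops = cores * CLOCK_HZ * FLOPS_PER_CYCLE / 1e15
    mem_tb = NODES * MEM_PER_NODE_GB / 1000      # decimal TB
    bw_per_node_gbs = AGG_MEM_BW_TBS * 1000 / NODES
    return {
        "cpus": cpus,
        "cores": cores,
        "peak_pflops": peak_pflops,
        "mem_tb": mem_tb,
        "bw_per_node_gbs": bw_per_node_gbs,
    }


if __name__ == "__main__":
    for name, value in astra_totals().items():
        print(f"{name}: {value:,.1f}")
```

Under these assumptions the CPU count, core count, and memory capacity match the slide, and the quoted aggregate bandwidth works out to roughly 341 GB/s per node.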

  • © 2018 Arm Limited 7

    Development and deployment of NVIDIA CUDA on Arm

    Consistency and compatibility provided by the standard, familiar tools

    Tools from NVIDIA

    • NVIDIA CUDA and OpenACC compilers and PGI compilers

    • NVIDIA CUDA-X performance libraries

    • GPU profiling and debugging of single-server applications

    Tools from Arm

    • High performance C, C++, Fortran compilers for Arm architecture cores

    • Performance libraries for Arm architecture cores

    • Combined CPU and GPU profiling and debugging for single and multi-server applications

    Ready for Deployment

    • Containerized applications ready to run for deep learning, machine learning and HPC

  • Confidential © 2019 Arm Limited

    NVIDIA + Arm HPC software stack evaluation using Wombat at NCCS

    Applications:

    • CoMet: Comparative Genomics

    • LAMMPS: Molecular Dynamics

    • NAMD: Molecular Dynamics

    • VMD: MD Visualization

    • DCA++: Material Science

    • Gromacs: Molecular Dynamics

    • Gamera: Earthquake Simulator

    • LSMS: Material Science

    Parallel Prog Models & Sci. Libraries:

    • Kokkos: C++ Prog. Model

    • Magma, SLATE: Sci. Libraries

    • Open MPI: Distributed Prog. Model

    • CUDA: GPU Prog. Model

    Benchmarks & Mini-apps:

    • BabelStream: Memory Transfer

    • TeaLeaf: Heat Conduction

    • MiniSweep: Radiation Transport

    • CloverLeaf: Lagrangian Eulerian hydrodynamic

    • SNAP: Radiation Transport

    • Patatrack: Pixel Reconstruction

    Evaluation done by: [organization logos not preserved]

  • © 2018 Arm Limited 9

    Arm Allinea Studio

    Scalable tools for developing, debugging and optimizing CPU and GPU applications

    Arm Allinea Studio provides Arm Forge for debugging and profiling

    • Multi-process and multi-server MPI and OpenMP support

    • Unrivalled scalability – deployed at large scale by developers on Summit and Titan systems

    Arm Compiler for Linux and Arm Performance Libraries provide high performance for the CPU portions of accelerated HPC applications on Arm cores

  • © 2018 Arm Limited 10

    End-to-end CPU and GPU tools for end-to-end software

    Address the whole application – CPU and GPU simultaneously

    Debugging control flow on CPU and GPU – and examining data on device and host

    Performance analysis from GPU and CPU in one coherent view

  • © 2018 Arm Limited 11

    Optimization of CPU and GPU performance

    Where should I focus my effort?

    Performance comes from understanding an application within the whole system

    IO, MPI, CPU and GPU matter

    Is the speed of this ResNet18 image classifier limited by GPU or CPU performance?
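One way to approach the question above without a full profiler is to time each stage of the loop separately and compare their shares of wall-clock time. The following is a minimal sketch of that idea, not the method used by Arm Forge. `StageTimer`, `load_batch`, `run_model`, and the stage names are hypothetical, and the two stage functions only simulate work with `time.sleep`.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named stage."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def shares(self):
        """Fraction of the total measured time spent in each stage."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: t / total for name, t in self.totals.items()}

    def bottleneck(self):
        """Name of the stage with the largest accumulated time."""
        return max(self.totals, key=self.totals.get)


def load_batch():
    # Hypothetical stand-in for reading and decoding images on the CPU.
    time.sleep(0.02)


def run_model():
    # Hypothetical stand-in for GPU inference. With a real GPU framework,
    # kernel launches are asynchronous, so the device must be synchronized
    # before the timer stops or the measured time will be too short.
    time.sleep(0.005)


def profile_steps(steps=5):
    timer = StageTimer()
    for _ in range(steps):
        with timer.stage("load_images"):
            load_batch()
        with timer.stage("gpu_compute"):
            run_model()
    return timer


if __name__ == "__main__":
    timer = profile_steps()
    for name, share in timer.shares().items():
        print(f"{name}: {share:.0%}")
    print("bottleneck:", timer.bottleneck())
```

With the simulated delays chosen here, image loading dominates. A real application would need measurements from the actual workload, which is where a combined CPU and GPU profiler gives a more complete view.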

  • © 2018 Arm Limited 12

    Identifying why a component is bottlenecked

    Who is waiting for whom?

    The CPU is rarely waiting for the GPU here: loading images is our main bottleneck

    • Tuned CPU compiler flags – a 3x speedup

    • Using 4 threads to load images simultaneously – a 2x speedup
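The second optimization, loading images with several threads at once, can be sketched with a thread pool from the Python standard library. This is an illustration only, and the talk does not say how the threads were implemented. `load_image`, `load_serial`, and `load_parallel` are hypothetical names, and the loader simulates I/O with a short sleep instead of decoding a real file.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_image(path):
    # Hypothetical loader: simulates the wait for disk I/O and decoding.
    time.sleep(0.01)
    return f"decoded:{path}"


def load_serial(paths):
    """Load images one after another on the calling thread."""
    return [load_image(p) for p in paths]


def load_parallel(paths, workers=4):
    """Load images using a pool of worker threads.

    Executor.map returns results in the same order as the input paths.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(load_image, paths))


if __name__ == "__main__":
    paths = [f"img{i}.jpg" for i in range(16)]

    start = time.perf_counter()
    load_serial(paths)
    serial_s = time.perf_counter() - start

    start = time.perf_counter()
    load_parallel(paths, workers=4)
    parallel_s = time.perf_counter() - start

    print(f"serial: {serial_s:.3f}s, 4 threads: {parallel_s:.3f}s")
```

In Python, threads help mainly when the loader spends its time in I/O or in native code that releases the interpreter lock. The speedup measured on a real pipeline depends on the storage and the decoder, so the 2x figure on the slide should not be expected from this sketch.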

  • © 2019 Arm Limited 13

    Summary

    NVIDIA GPUs on Arm architecture servers provide a new choice and access to a widening range of silicon providers

    Application porting has been quick and straightforward – and applications are able to achieve high performance

    Arm Allinea Studio helps to maximize application performance and is part of a rich and familiar developer ecosystem for NVIDIA CUDA on Arm

  • © 2018 Arm Limited 14

    The Cloud to Edge Infrastructure Foundation for a World of 1T Intelligent Devices