Dror Goldenberg, March 2016, HPCAC Swiss
Co-Design Architecture: Emergence of New Co-Processors
Co-Design Architecture to Enable Exascale Performance
CPU-Centric: limited to main CPU usage, which results in performance limitations
Co-Design: creating synergies between software and hardware enables higher performance and scale
In-CPU Computing · In-Network Computing · In-Storage Computing
The Intelligence is Moving to the Interconnect
Past: intelligence resides in the CPU → Future: intelligence moves to the interconnect
Intelligent Interconnect Delivers Higher Datacenter ROI
[Diagram: today network functions run on the CPU, taking computing away from users' applications; a smart network moves that intelligence into the network, offloading computing for applications and increasing datacenter value]
Breaking the Application Latency Wall
§ Today: Network device latencies are on the order of 100 nanoseconds
§ Challenge: Enabling the next order of magnitude improvement in application performance
§ Solution: Creating synergies between software and hardware – intelligent interconnect
Intelligent Interconnect Paves the Road to Exascale Performance
10 years ago: Network ~10 microseconds, Communication framework ~100 microseconds
Today: Network ~0.1 microsecond, Communication framework ~10 microseconds
Future: Co-Design network ~0.05 microsecond, Communication framework ~1 microsecond
Introducing Switch-IB 2 World’s First Smart Switch
§ The world's fastest switch, with <90 nanosecond latency
§ 36 ports, 100Gb/s per port, 7.2Tb/s throughput, 7.02 billion messages/sec
§ Adaptive routing, congestion control, and support for multiple topologies
Built for Scalable Compute and Storage Infrastructures
10X Higher Performance with the New Switch SHArP Technology
SHArP (Scalable Hierarchical Aggregation Protocol) Technology
Delivering 10X Performance Improvement
for MPI and SHMEM/PGAS Communications
Switch-IB 2 Enables the Switch Network to
Operate as a Co-Processor
SHArP Enables Switch-IB 2 to Manage and
Execute MPI Operations in the Network
SHArP Performance Advantage
§ MiniFE is a finite element mini-application
• Implements kernels that represent implicit finite-element applications
10X to 25X Performance Improvement
AllReduce MPI Collective
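As an illustration of the collective being offloaded, here is a minimal MPI sketch (not from the original deck): the application issues a standard MPI_Allreduce, and with a SHArP-enabled MPI library (e.g. HPC-X) the reduction is aggregated inside the switch fabric rather than on the host CPUs; the vector length below is arbitrary.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Each rank contributes a small vector; the size is illustrative. */
        double local[8], global[8];
        for (int i = 0; i < 8; i++)
            local[i] = (double)rank + i;

        /* Standard MPI collective: with a SHArP-enabled MPI the sum is
         * aggregated in the switch network instead of on the host CPUs;
         * the application code is unchanged. */
        MPI_Allreduce(local, global, 8, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global[0] = %f\n", global[0]);

        MPI_Finalize();
        return 0;
    }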
The Intelligence is Moving to the Interconnect
Communication Frameworks (MPI, SHMEM/PGAS)
The Only Approach to Deliver 10X Performance Improvements
[Roadmap: interconnect offloads across the stack – transport, RDMA, SR-IOV, collectives, Peer-Direct, GPUDirect and more; MPI / SHMEM offloads arriving in Q1'16 and Q3'16]
Multi-Host Socket Direct™ – Low Latency Socket Communication
§ Each CPU has direct network access
§ QPI avoidance for I/O – improves performance
§ Enables GPU / peer direct on both sockets
§ Solution is transparent to software
Multi-Host Socket Direct Performance
50% Lower CPU Utilization
20% Lower Latency
Multi Host Evaluation Kit
Lower Application Latency, Free-up CPU
Introducing ConnectX-4 Lx Programmable Adapter
Scalable, Efficient, High-Performance and Flexible Solution
Security
Cloud/Virtualization
Storage
High Performance Computing
Precision Time Synchronization
Networking + FPGA
Mellanox Acceleration Engines and FPGA Programmability on One Adapter
Mellanox InfiniBand – Proven and Most Scalable HPC Interconnect
“Summit” System “Sierra” System
Paving the Road to Exascale
NCAR-Wyoming Supercomputing Center (NWSC) – “Cheyenne”
§ Cheyenne supercomputer system
§ 5.34-petaflop SGI ICE XA Cluster
§ Intel “Broadwell” processors
§ More than 4K compute nodes
§ Mellanox EDR InfiniBand interconnect
§ Mellanox Unified Fabric Manager
§ Partial 9D Enhanced Hypercube interconnect topology
§ DDN SFA14KX systems
§ 20 petabytes of usable file system space
§ IBM GPFS (General Parallel File System)
High-Performance Designed 100Gb/s Interconnect Solutions
§ Transceivers and active optical / copper cables (10 / 25 / 40 / 50 / 56 / 100Gb/s): VCSELs, silicon photonics and copper
§ InfiniBand switch: 36 EDR (100Gb/s) ports, <90ns latency, 7.2Tb/s throughput, 7.02 billion msg/sec (195M msg/sec/port)
§ Adapter: 100Gb/s, 0.7us latency, 150 million messages per second (10 / 25 / 40 / 50 / 56 / 100Gb/s)
§ Ethernet switch: 32 100GbE ports, 64 25/50GbE ports (10 / 25 / 40 / 50 / 100GbE), 6.4Tb/s throughput
Leading Supplier of End-to-End Interconnect Solutions
Comprehensive End-to-End InfiniBand and Ethernet Portfolio (VPI): software, ICs, switches/gateways, adapter cards, cables/modules, Metro/WAN, NPU & multicore (NPS, TILE)
Enabling the Use of Data – Store, Analyze
The Performance Advantage of EDR 100G InfiniBand (28-80%)
End-to-End Interconnect Solutions for All Platforms
Highest Performance and Scalability for
X86, Power, GPU, ARM and FPGA-based Compute and Storage Platforms
10, 20, 25, 40, 50, 56 and 100Gb/s Speeds
Smart Interconnect to Unleash The Power of All Compute Architectures
Technology Roadmap – One-Generation Lead over the Competition
[Roadmap: 20G → 40G → 56G → 100G (2015) → 200G → Mellanox 400G, spanning terascale through petascale to exascale; milestones include the Virginia Tech (Apple) cluster, 3rd on the TOP500 in 2003, and the Mellanox-connected "Roadrunner", 1st on the TOP500]
§ Transparent InfiniBand integration into OpenStack • Since Havana
§ RDMA directly from the VM – SR-IOV
§ MAC to GUID mapping
§ VLAN to pkey mapping
§ InfiniBand SDN network
§ Ideal fit for High Performance Computing Clouds
OpenStack Over InfiniBand – Extreme Performance in the Cloud
InfiniBand Enables The Highest Performance and Efficiency
§ Mellanox end-to-end: ConnectX-4 NIC family, Switch-IB/Spectrum switches and 25/100Gb/s cables
§ Brings 100Gb/s speeds to the cloud with minimal CPU utilization
• Both VMs and hypervisors
• Accelerations are critical to reach line rate – SR-IOV, RDMA, etc.
25, 50 And 100Gb/s Clouds Are Here!
[Benchmark: 92.412 Gb/s throughput, 0.71% CPU utilization]
The Next Generation HPC Software Framework To Meet the Needs of Future Systems / Applications
Unified Communication – X Framework (UCX)
Exascale Co-Design Collaboration
Collaborative Effort Industry, National Laboratories and Academia
The Next Generation
HPC Software Framework
A Collaboration Effort
§ Mellanox co-designs the network interface and contributes MXM technology
• Infrastructure, transport, shared memory, protocols, integration with OpenMPI/SHMEM, MPICH
§ ORNL co-designs the network interface and contributes the UCCS project
• InfiniBand optimizations, Cray devices, shared memory
§ NVIDIA co-designs high-quality support for GPU devices
• GPUDirect, GDR copy, etc.
§ IBM co-designs the network interface and contributes ideas and concepts from PAMI
§ UH/UTK focus on integration with their research platforms
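For context, here is a minimal UCX (UCP API) initialization sketch of the framework these contributions feed into; it is not from the deck, and the feature flags and thread mode chosen are illustrative.

    #include <ucp/api/ucp.h>
    #include <stdio.h>

    int main(void)
    {
        ucp_config_t *config;
        ucp_context_h context;
        ucp_worker_h worker;

        /* Read UCX configuration from the environment (UCX_* variables). */
        if (ucp_config_read(NULL, NULL, &config) != UCS_OK)
            return 1;

        /* Request tag-matching (MPI-style) and RMA (PGAS-style) features;
         * this feature set is illustrative. */
        ucp_params_t params = {
            .field_mask = UCP_PARAM_FIELD_FEATURES,
            .features   = UCP_FEATURE_TAG | UCP_FEATURE_RMA
        };
        if (ucp_init(&params, config, &context) != UCS_OK)
            return 1;
        ucp_config_release(config);

        /* A worker is a progress/communication context (e.g. one per thread). */
        ucp_worker_params_t wparams = {
            .field_mask  = UCP_WORKER_PARAM_FIELD_THREAD_MODE,
            .thread_mode = UCS_THREAD_MODE_SINGLE
        };
        if (ucp_worker_create(context, &wparams, &worker) != UCS_OK)
            return 1;

        printf("UCX context and worker initialized\n");

        ucp_worker_destroy(worker);
        ucp_cleanup(context);
        return 0;
    }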
Mellanox HPC-X™ Scalable HPC Software Toolkit
§ Complete MPI, PGAS/OpenSHMEM and UPC package
§ Maximize application performance
§ For commercial and open source applications
§ Based on UCX (Unified Communication – X Framework)
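A minimal OpenSHMEM sketch (not from the deck) of the kind of one-sided PGAS communication HPC-X supports; the symmetric buffer and ring-style put are illustrative.

    #include <shmem.h>
    #include <stdio.h>

    int main(void)
    {
        shmem_init();

        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        /* Symmetric (remotely accessible) buffer allocated on every PE. */
        long *dest = shmem_malloc(sizeof(long));
        *dest = -1;
        shmem_barrier_all();

        /* One-sided put: each PE writes its rank into the next PE's buffer.
         * Over InfiniBand this maps onto RDMA writes. */
        long src = me;
        shmem_long_put(dest, &src, 1, (me + 1) % npes);

        shmem_barrier_all();
        printf("PE %d received %ld\n", me, *dest);

        shmem_free(dest);
        shmem_finalize();
        return 0;
    }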
Mellanox Delivers Highest MPI (HPC-X) Performance
Enabling Highest Applications Scalability and Performance
Mellanox ConnectX-4 Collectives Offload
Mellanox Delivers Highest Applications Performance (HPC-X)
§ Quantum Espresso application
Test Case   #Nodes   Intel MPI time (s)   Bull MPI (HPC-X) time (s)   Gain
A           43       584                  368                         37%
B           196      2592                 998                         61%
Enabling Highest Applications Scalability and Performance
Maximize Performance via Accelerator and GPU Offloads
GPUDirect RDMA Technology
GPUs are Everywhere!
GPUDirect RDMA / Sync
[Diagram: CPU, chipset, system memory, GPU and GPU memory – GPUDirect RDMA / Sync bypasses the CPU and system memory on the GPU communication path]
GPUDirect™ RDMA (GPUDirect 3.0) – with GPUDirect™ RDMA using PeerDirect™
§ Eliminates CPU bandwidth and latency bottlenecks
§ Uses remote direct memory access (RDMA) transfers between GPUs
§ Results in significantly improved MPI efficiency between GPUs in remote nodes
§ Based on PCIe PeerDirect™ technology
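A hedged sketch of how GPUDirect RDMA is commonly used from the verbs API: memory allocated with cudaMalloc is registered directly with ibv_reg_mr, so the HCA can DMA to and from GPU memory without staging through host buffers. It assumes the NVIDIA peer-memory kernel module is loaded; device selection and most error handling are omitted.

    #include <infiniband/verbs.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        /* Open the first InfiniBand device (selection logic omitted). */
        struct ibv_device **devs = ibv_get_device_list(NULL);
        if (!devs || !devs[0])
            return 1;
        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_pd *pd = ibv_alloc_pd(ctx);

        /* Allocate a buffer in GPU memory; 1 MB is illustrative. */
        void *gpu_buf = NULL;
        size_t len = 1 << 20;
        cudaMalloc(&gpu_buf, len);

        /* With GPUDirect RDMA the GPU pointer can be registered like host
         * memory; the HCA then DMAs to/from GPU memory directly, bypassing
         * the CPU and system memory. */
        struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (!mr) {
            perror("ibv_reg_mr on GPU memory failed");
            return 1;
        }
        printf("Registered GPU buffer, rkey=0x%x\n", mr->rkey);

        ibv_dereg_mr(mr);
        cudaFree(gpu_buf);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }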
Mellanox GPUDirect RDMA Performance Advantage
§ HOOMD-blue is a general-purpose molecular dynamics simulation code accelerated on GPUs
§ GPUDirect RDMA allows direct peer-to-peer GPU communications over InfiniBand
• Unlocks performance between GPU and InfiniBand
• Significantly decreases GPU-GPU communication latency
• Provides complete CPU offload for all GPU communications across the network
2X Application Performance (102% improvement)!
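At the application level, a CUDA-aware MPI (such as HPC-X with GPUDirect RDMA enabled) lets device pointers be passed straight to MPI calls, so GPU-to-GPU exchanges avoid host staging. The sketch below is illustrative and not taken from the deck; buffer size, ranks and tag are arbitrary.

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* The buffer resides in GPU memory; 4096 doubles is illustrative. */
        double *d_buf;
        cudaMalloc((void **)&d_buf, 4096 * sizeof(double));

        /* With a CUDA-aware MPI the device pointer is passed directly;
         * GPUDirect RDMA lets the HCA move the data GPU-to-GPU without
         * copying through host memory. */
        if (rank == 0)
            MPI_Send(d_buf, 4096, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(d_buf, 4096, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }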
GPUDirect Sync (GPUDirect 4.0)
§ GPUDirect RDMA (3.0) – direct data path between the GPU and the Mellanox interconnect
• Control path still uses the CPU
- CPU prepares and queues communication tasks on the GPU
- GPU triggers communication on the HCA
- Mellanox HCA directly accesses GPU memory
§ GPUDirect Sync (GPUDirect 4.0)
• Both the data path and the control path go directly between the GPU and the Mellanox interconnect
[Chart: 2D stencil benchmark – average time per iteration (us) vs. number of nodes/GPUs; RDMA+PeerSync is 27% faster at 2 nodes and 23% faster at 4 nodes than RDMA only]
Maximum Performance For GPU Clusters
Remote GPU Access through rCUDA
[Architecture: on the client side, the CUDA application uses the rCUDA library in place of a local CUDA runtime; calls travel over the network interface to the server side, where the rCUDA daemon runs them on the real CUDA driver and runtime of the GPU servers – GPU as a Service]
rCUDA provides remote access from every node to any GPU in the system
Interconnect Architecture Comparison
Offload versus Onload (Non-Offload)
§ Two interconnect architectures exist – Offload-based and Onload-based
§ Offload architecture
• The interconnect manages and executes all network operations
• The interconnect can include application acceleration engines
• Offloads the CPU and therefore frees CPU cycles for the applications
• Development requires a large R&D investment
• Higher data center ROI
§ Onload architecture
• A CPU-centric approach – everything must be executed on and by the CPU
• The CPU is responsible for all network functions; the interconnect only pushes the data onto the wire
• Cannot support acceleration engines, no support for RDMA, and network transport is done by the CPU
• Onloads the CPU and reduces the CPU cycles available for the applications
• Does not require R&D investment or interconnect expertise
Sandia National Laboratory Paper – Offloading versus Onloading
Interconnect Throughput – Offload versus Onload
The Offloading Advantage!
Network Performance Dramatically Depends on CPU Frequency!
Data Throughput:
20% Higher at common Xeon Frequency
250% Higher at common Xeon Phi Frequency
Common Xeon Frequency: 2.6GHz
Common Xeon Phi Frequency: ~1GHz
Only Offload Architecture Can Enable Co-Processors
Offloading (Highest Performance for all Frequencies)
Onloading (performance loss with lower CPU frequency)
Common Xeon Frequency
Common Xeon Phi Frequency
Onloading Technology Not Suitable for Co-Processors!
Mellanox InfiniBand Leadership Over Omni-Path
§ Switch latency: 20% lower
§ Message rate: 44% higher
§ Power consumption per switch port: 25% lower
§ Scalability / CPU efficiency: 2X higher
100Gb/s link speed since 2014; 200Gb/s link speed in 2017
Gain Competitive Advantage Today – Protect Your Future
Smart Network for Smart Systems: RDMA, Acceleration Engines, Programmability
Higher Performance, Unlimited Scalability, Higher Resiliency – Proven!
Thank You