Technische Universität München

Hybrid Architectures – Why Should I Bother?

CSCS-FoMICS-USI Summer School on Computer Simulations in Science and Engineering

Michael Bader

July 8–19, 2013
The Simulation Pipeline

phenomenon, process, etc.
↓ modelling
mathematical model
↓ numerical treatment
numerical algorithm
↓ parallel implementation
simulation code
↓ visualization
results to interpret

[diagram: validation feeds the results back to the start of the pipeline; an "embedding" arrow leads on to statement/tool]
Parallel Computing – “Faster, Bigger, More”

Why parallel high performance computing?
• Response time: compute a problem in 1/p time (see the sketch after this list)
  – speed up engineering processes
  – real-time simulations (tsunami warning?)
• Problem size: compute a p-times bigger problem
  – simulation of large-/multi-scale phenomena
  – maximal problem size that “fits into the machine”
  – validation of smaller, “operational” models
• Throughput: compute p problems at once
  – case and parameter studies, statistical risk scenarios, etc. (hazard maps, database for tsunami warning, . . . )
  – massively distributed computing (SETI@home, e.g.)
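As a back-of-the-envelope illustration of the 1/p response-time claim (a sketch added here, not part of the original slides): 1/p is the ideal case; with a serial fraction s, Amdahl's law gives the usual upper bound. The function name and the chosen s = 0.01 are hypothetical examples.

  // Illustrative host-side sketch: ideal vs. Amdahl-limited speedup.
  #include <cstdio>

  double amdahl_speedup(double s, int p) {
      // normalized runtime on p processors: s + (1 - s)/p
      return 1.0 / (s + (1.0 - s) / p);
  }

  int main() {
      const int ps[] = {16, 256, 4096};
      for (int p : ps) {
          std::printf("p = %4d: ideal %4dx, Amdahl (s = 0.01) %.1fx\n",
                      p, p, amdahl_speedup(0.01, p));
      }
      return 0;
  }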
Part I

High Performance Computing in CSE –
Past(?) and Present Trends
The Seven Dwarfs of HPC

“dwarfs” = key algorithmic kernels in many scientific computing applications

P. Colella (LBNL), 2004:
1. dense linear algebra
2. sparse linear algebra
3. spectral methods
4. N-body methods
5. structured grids
6. unstructured grids
7. Monte Carlo

Tsunami & storm-surge simulation:
→ usually PDE solvers on structured or unstructured meshes
→ SWE: a simple shallow water solver on Cartesian grids
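To make the “structured grids” dwarf concrete, here is a minimal sketch of the kind of kernel such solvers revolve around: a generic Jacobi-style 5-point-stencil sweep on a Cartesian grid. This is illustrative only and not taken from the SWE code.

  // Hypothetical structured-grid kernel: one stencil sweep over a
  // Cartesian grid of nx * ny cells (boundary layer excluded).
  void sweep(int nx, int ny, const float* u, float* u_new) {
      for (int j = 1; j < ny - 1; ++j)
          for (int i = 1; i < nx - 1; ++i)
              u_new[j * nx + i] = 0.25f * (u[j * nx + i - 1] + u[j * nx + i + 1]
                                         + u[(j - 1) * nx + i] + u[(j + 1) * nx + i]);
  }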
Computational Science Demands a New Paradigm

Computational simulation must meet three challenges to become a mature partner of theory and experiment (Post & Votta, 2005):
1. performance challenge
   → exponential growth of performance, massively parallel architectures
2. programming challenge
   → new (parallel) programming models
3. prediction challenge
   → careful verification and validation of codes; towards reproducible simulation experiments
Four Horizons for Enhancing the Performance . . .

. . . of Parallel Simulations Based on Partial Differential Equations (David Keyes, 2000)
1. Expanded Number of Processors
   → in 2000: 1,000 cores; in 2010: 200,000 cores
2. More Efficient Use of Faster Processors
   → PDE working sets, cache efficiency
3. More Architecture-Friendly Algorithms
   → improve temporal/spatial locality (see the tiling sketch after this list)
4. Algorithms Delivering More “Science per Flop”
   → adaptivity (in space and time), higher-order methods, fast solvers
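A sketch of what “architecture-friendly” can mean in practice (an assumption added here, reusing the generic stencil sweep from the dwarfs slide): cache blocking. The tile size B would be tuned to the L1/L2 cache of the target machine.

  // Cache-blocked variant of the earlier sweep: the loops are tiled so that a
  // B x B block of the grid stays cache-resident (improves temporal locality).
  void sweep_tiled(int nx, int ny, const float* u, float* u_new, int B) {
      for (int jj = 1; jj < ny - 1; jj += B)
          for (int ii = 1; ii < nx - 1; ii += B)
              for (int j = jj; j < jj + B && j < ny - 1; ++j)
                  for (int i = ii; i < ii + B && i < nx - 1; ++i)
                      u_new[j * nx + i] = 0.25f * (u[j * nx + i - 1] + u[j * nx + i + 1]
                                                 + u[(j - 1) * nx + i] + u[(j + 1) * nx + i]);
  }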
Performance Development in Supercomputing
(source: www.top500.org)
Top 500 (www.top500.org) – June 2013
Top 500 Spotlights – Tianhe-2 and Titan

Tianhe-2/MilkyWay-2 → Intel Xeon Phi (NUDT)
• 3.1 mio cores(!) – Intel Ivy Bridge and Xeon Phi
• Linpack benchmark: 33.8 PFlop/s
• ≈ 17 MW power(!!)
• Knights Corner / Intel Xeon Phi / Intel MIC as accelerator
• 61 cores, roughly 1.1–1.3 GHz

Titan → Cray XK7, NVIDIA K20x (ORNL)
• 18,688 compute nodes; 300,000 Opteron cores
• 18,688 NVIDIA Tesla K20x GPUs
• Linpack benchmark: 17.6 PFlop/s
• ≈ 8.2 MW power
Top 500 Spotlights – Sequoia and K Computer

Sequoia → IBM BlueGene/Q (LLNL)
• 98,304 compute nodes; 1.6 mio cores
• Linpack benchmark: 17.1 PFlop/s
• ≈ 8 MW power

K Computer → SPARC64 (RIKEN, Kobe)
• 88,128 processors; 705,024 cores
• Linpack benchmark: 10.51 PFlop/s
• ≈ 12 MW power
• SPARC64 VIIIfx 2.0 GHz (8-core CPU)
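A derived metric worth noting, computed here from the slide numbers (not stated on the slides): Linpack performance per watt. Since 1 PFlop/s per MW equals 1 GFlop/s per W, the ratio can be read off directly.

  // Energy efficiency from the figures above: PFlop/s divided by MW.
  #include <cstdio>
  int main() {
      struct { const char* name; double pflops, mw; } sys[] = {
          {"Tianhe-2", 33.8, 17.0}, {"Titan", 17.6, 8.2},
          {"Sequoia", 17.1, 8.0},   {"K Computer", 10.51, 12.0},
      };
      for (const auto& s : sys)
          std::printf("%-10s ~%.1f GFlop/s per W\n", s.name, s.pflops / s.mw);
      return 0;
  }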
Performance Development in Supercomputing
(source: www.top500.org)
International Exascale Software Project Roadmap

Towards an Exa-Flop/s Platform in 2018 (www.exascale.org):
1. technology trends
   → concurrency, reliability, power consumption, . . .
   → blueprint of an exascale system: 10-billion-way concurrency, 100 million to 1 billion cores, 10-to-100-way concurrency per core, hundreds of cores per die, . . .
2. science trends
   → climate, high-energy physics, nuclear physics, fusion energy sciences, materials science and chemistry, . . .
3. X-stack (software stack for exascale)
   → energy, resiliency, heterogeneity, I/O and memory
4. politico-economic trends
   → exascale systems run by government labs, used by CSE scientists
Exascale Roadmap
“Aggressively Designed Strawman Architecture”

Level       What                          Perform.    Power    RAM
FPU         FPU, regs., instr.-memory     1.5 Gflops  30 mW
Core        4 FPUs, L1                    6 Gflops    141 mW
Proc. Chip  742 cores, L2/L3, interconn.  4.5 Tflops  214 W
Node        proc. chip, DRAM              4.5 Tflops  230 W    16 GB
Group       12 proc. chips, routers       54 Tflops   3.5 kW   192 GB
Rack        32 groups                     1.7 Pflops  116 kW   6.1 TB
System      583 racks                     1 Eflops    67.7 MW  3.6 PB

approx. 285,000 cores per rack; 166 mio cores in total

Source: ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems
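The per-rack and total core counts follow directly from the table; a quick arithmetic check (a sketch added here, not from the study):

  // 742 cores/chip x 12 chips/group x 32 groups/rack = 284,928 cores per rack;
  // x 583 racks = 166,113,024 cores, matching ~285,000 and ~166 million above.
  #include <cstdio>
  int main() {
      long per_rack = 742L * 12 * 32;
      std::printf("cores per rack: %ld\n", per_rack);        // 284,928
      std::printf("cores total:    %ld\n", per_rack * 583);  // 166,113,024
      return 0;
  }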
Exascale Roadmap – Should You Bother?

Your department’s compute cluster in 5 years?
• a Petaflop system!
• “one rack of the Exaflop system” → using the same/similar hardware
• extrapolated example machine:
  – peak performance: 1.7 PFlop/s
  – 6 TB RAM, 60 GB cache memory
  – “total concurrency”: 1.1 · 10^6
  – number of cores: 280,000
  – number of chips: 384

Source: ExaScale Software Study: Software Challenges in Extreme Scale Systems
Your Department’s PetaFlop/s Cluster in 5 Years?

Tianhe-1A (Tianjin, China; Top500 #10)
• 14,336 Xeon X5670 CPUs
• 7,168 NVIDIA Tesla M2050 GPUs
• Linpack benchmark: ≈ 2.6 PFlop/s
• ≈ 4 MW power

Stampede (TACC; Top500 #6)
• 102,400 cores (incl. Xeon Phi: MIC/“many integrated cores”)
• Linpack benchmark: ≈ 5 PFlop/s
• Knights Corner / Intel Xeon Phi / Intel MIC as accelerator
• 61 cores, roughly 1.1–1.3 GHz
• wider vector FP units: 64 bytes, i.e., 16 floats or 8 doubles (see the sketch after this list)
• ≈ 4.5 MW power
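What the 64-byte vector units mean for code: one instruction operates on 16 floats (or 8 doubles) at once. A minimal sketch of a loop a compiler can map onto such units; the OpenMP 4.0 simd pragma is one portable way to request this (illustrative, not Stampede-specific code).

  // SAXPY-style loop; with 512-bit vector units, each instruction covers
  // 16 floats (assuming the compiler vectorizes as requested).
  void saxpy(int n, float a, const float* x, float* y) {
      #pragma omp simd
      for (int i = 0; i < n; ++i)
          y[i] += a * x[i];
  }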
Free Lunch is Over(∗)

. . . actually already over for quite some time!

Speedup of your software can only come from parallelism:
• clock speed of CPUs has stalled
• instruction-level parallelism per core has stalled
• number of cores is growing
• size of vector units is growing

(∗) Quote and image taken from: H. Sutter, The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software, Dr. Dobb’s Journal 30(3), March 2005.
Manycore CPU – Intel MIC Architecture
Intel® MIC Architecture: An Intel Co-Processor Architecture

[Block diagram: many vector IA cores, each with a coherent cache, connected via an interprocessor network to fixed-function logic and to memory and I/O interfaces]

• many cores and many, many more threads
• standard IA programming and memory model

(source: Intel/K. Skaugen – SC’10 keynote presentation)
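“Standard IA programming model” means ordinary threaded code carries over to the coprocessor. A hedged sketch, assuming Intel’s compiler offload pragmas of that era; on other compilers the offload pragma is ignored and the loop simply runs on the host:

  // Offload an OpenMP loop to the MIC coprocessor; plain OpenMP otherwise.
  void scale(int n, float a, float* x) {
      #pragma offload target(mic) inout(x : length(n))
      #pragma omp parallel for
      for (int i = 0; i < n; ++i)
          x[i] *= a;
  }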
Manycore CPU – Intel MIC Architecture (2)

Diagram of a Knights Corner core (Figure 4 below), with the accompanying text from the same overview paper:
The cores are in-order dual issue x86 processor cores which trace some history to the original Pentium design, but with the addition of 64-bit support, four hardware threads per core, power management, ring interconnect support, 512-bit SIMD capabilities and other enhancements these are hardly the Pentium cores of 20 years ago. The x86-specific logic (excluding L2 caches) makes up less than 2% of the die area for an Intel Xeon Phi coprocessor.

Here are key facts about the first Intel Xeon Phi coprocessor product:
• A coprocessor (requires at least one processor in the system), in production in 2012.
• Runs Linux* (source code available at http://intel.com/software/mic).
• Manufactured using Intel’s 22 nm process technology with 3-D Trigate transistors.
• Supported by standard tools including Intel Cluster Studio XE 2013; a list of additional tools can be found online (http://intel.com/software/mic).
• Many cores:
  – More than 50 cores (it will vary within a generation of products, and between generations; it is good advice to avoid hard-coding applications to a particular number).
  – In-order cores support 64-bit x86 instructions with uniquely wide SIMD capabilities.
  – Four hardware threads on each core (resulting in more than 200 hardware threads available on a single device) are primarily used to hide latencies implicit in an in-order microarchitecture. In practice, use of at least two threads per core is nearly always beneficial. [. . . ]
Figure 4: Knights Corner Core (source: An Overview of Programming for Intel Xeon processors and Intel Xeon Phi coprocessors)
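Following the excerpt’s advice of at least two threads per core, a host-side OpenMP sketch (an assumption added here, with hypothetical numbers: 61 cores, 2 threads each):

  // Request 122 OpenMP threads, i.e. two of the four hardware threads on each
  // of 61 Knights Corner cores. The excerpt advises against hard-coding core
  // counts, so a real code would query the count at run time.
  #include <omp.h>
  void configure_threads(int cores /* e.g. 61 */) {
      omp_set_num_threads(2 * cores);
  }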
GPGPU – NVIDIA Fermi
Hardware Execution

CUDA’s hierarchy of threads maps to a hierarchy of processors on the GPU; a GPU executes one or more kernel grids; a streaming multiprocessor (SM) executes one or more thread blocks; and CUDA cores and other execution units in the SM execute threads. The SM executes threads in groups of 32 threads called a warp. While programmers can generally ignore warp execution for functional correctness and think of programming one thread, they can greatly improve performance by having threads in a warp execute the same code path and access memory in nearby addresses.
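The grid/block/thread hierarchy described above maps one-to-one onto CUDA source code; a minimal sketch (an illustrative kernel, not taken from the whitepaper):

  // Each thread handles one element; the uniform branch avoids warp divergence,
  // and neighboring threads access neighboring addresses (coalesced access).
  __global__ void axpy(int n, float a, const float* x, float* y) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          y[i] += a * x[i];
  }

  // host side: launch a kernel grid of thread blocks; blocks are distributed
  // to SMs, which execute them in warps of 32 threads:
  //   axpy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);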
An Overview of the Fermi Architecture

The first Fermi based GPU, implemented with 3.0 billion transistors, features up to 512 CUDA cores. A CUDA core executes a floating point or integer instruction per clock for a thread. The 512 CUDA cores are organized in 16 SMs of 32 cores each. The GPU has six 64-bit memory partitions, for a 384-bit memory interface, supporting up to a total of 6 GB of GDDR5 DRAM memory. A host interface connects the GPU to the CPU via PCI-Express. The GigaThread global scheduler distributes thread blocks to SM thread schedulers.

Fermi’s 16 SMs are positioned around a common L2 cache. Each SM is a vertical rectangular strip that contains an orange portion (scheduler and dispatch), a green portion (execution units), and light blue portions (register file and L1 cache).
(source: NVIDIA – Fermi Whitepaper)
GPGPU – NVIDIA Fermi (2)
Third Generation Streaming Multiprocessor

The third generation SM introduces several architectural innovations that make it not only the most powerful SM yet built, but also the most programmable and efficient.
512 high performance CUDA cores: Each SM features 32 CUDA processors, a fourfold increase over prior SM designs. Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating point unit (FPU). Prior GPUs used IEEE 754-1985 floating point arithmetic. The Fermi architecture implements the new IEEE 754-2008 floating-point standard, providing the fused multiply-add (FMA) instruction for both single and double precision arithmetic. FMA improves over a multiply-add (MAD) instruction by doing the multiplication and addition with a single final rounding step, with no loss of precision in the addition. FMA is more accurate than performing the operations separately. GT200 implemented double precision FMA.
In GT200, the integer ALU was limited to 24-bit precision for multiply operations; as a result, multi-instruction emulation sequences were required for integer arithmetic. In Fermi, the newly designed integer ALU supports full 32-bit precision for all instructions, consistent with standard programming language requirements. The integer ALU is also optimized to efficiently support 64-bit and extended precision operations. Various instructions are supported, including Boolean, shift, move, compare, convert, bit-field extract, bit-reverse insert, and population count.
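As an aside added here (not whitepaper text): the single-rounding behavior of FMA is visible directly in C/CUDA source, where fmaf is the standard single-precision fused multiply-add.

  #include <cmath>
  // MAD-style: a*b rounded to float, then the sum rounded again (two roundings).
  float mad(float a, float b, float c) { return a * b + c; }
  // FMA: a*b kept exact internally, one rounding at the end; on Fermi this maps
  // to a single instruction for both float and double.
  float fma1(float a, float b, float c) { return fmaf(a, b, c); }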
16 load/store units: Each SM has 16 load/store units, allowing source and destination addresses to be calculated for sixteen threads per clock. Supporting units load and store the data at each address to cache or DRAM.
[Figure: Fermi Streaming Multiprocessor (SM) – instruction cache; two warp schedulers, each with a dispatch unit; register file (32,768 × 32-bit); 32 CUDA cores, each with dispatch port, operand collector, FP unit, INT unit, and result queue; 16 LD/ST units; 4 SFUs; interconnect network; 64 KB shared memory / L1 cache; uniform cache]

(source: NVIDIA – Fermi Whitepaper)
GPGPU – NVIDIA Fermi (3)
General Purpose Graphics Processing Unit:
• 512 CUDA cores
• improved double precision performance
• shared vs. global memory
• new: L1 and L2 cache (768 KB)
• trend from GPU towards CPU?
Memory Subsystem Innovations

NVIDIA Parallel DataCache™ with configurable L1 and unified L2 cache: Working with hundreds of GPU computing applications from various industries, we learned that while shared memory benefits many problems, it is not appropriate for all problems. Some algorithms map naturally to shared memory, others require a cache, while others require a combination of both. The optimal memory hierarchy should offer the benefits of both shared memory and cache, and allow the programmer a choice over its partitioning. The Fermi memory hierarchy adapts to both types of program behavior.

Adding a true cache hierarchy for load/store operations presented significant challenges. Traditional GPU architectures support a read-only “load” path for texture operations and a write-only “export” path for pixel data output. However, this approach is poorly suited to executing general purpose C or C++ thread programs that expect reads and writes to be ordered. As one example: spilling a register operand to memory and then reading it back creates a read-after-write hazard; if the read and write paths are separate, it may be necessary to explicitly flush the entire write/“export” path before it is safe to issue the read, and any caches on the read path would not be coherent with respect to the write data.

The Fermi architecture addresses this challenge by implementing a single unified memory request path for loads and stores, with an L1 cache per SM multiprocessor and a unified L2 cache that services all operations (load, store and texture). The per-SM L1 cache is configurable to support both shared memory and caching of local and global memory operations. The 64 KB memory can be configured as either 48 KB of shared memory with 16 KB of L1 cache, or 16 KB of shared memory with 48 KB of L1 cache. When configured with 48 KB of shared memory, programs that make extensive use of shared memory (such as electrodynamic simulations) can perform up to three times faster. For programs whose memory accesses are not known beforehand, the 48 KB L1 cache configuration offers greatly improved performance over direct access to DRAM.

(source: NVIDIA – Fermi Whitepaper)
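The 48/16 vs. 16/48 KB split described above is chosen per kernel from the host; a sketch using the CUDA runtime call for this (myKernel is a hypothetical kernel name):

  #include <cuda_runtime.h>
  __global__ void myKernel() { /* ... */ }

  void choose_split() {
      // shared-memory-heavy kernels: 48 KB shared memory / 16 KB L1
      cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
      // unpredictable access patterns: 16 KB shared memory / 48 KB L1
      // cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);
  }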
Parallel Computing Paradigms

Not exactly sure what the hardware will look like . . . (CPU-style, GPU-style, something new?)

However: massively parallel programming required
• revival of vector computing
  → several/many FPUs performing the same operation
• hybrid/heterogeneous architectures
  → different kinds of cores; dedicated accelerator hardware
• different access to memory
  → cache and cache coherency
  → small amount of memory per core
• our concern in this course: data parallelism
  → vectorisation (and a look into GPU computing)