Technische Universität München
High Performance Computing – Programming Paradigms and Scalability
Part 1: Introduction
PD Dr. rer. nat. habil. Ralf-Peter Mundani
Computation in Engineering (CiE), Scientific Computing (SCCS)
Summer Term 2015



General Remarks

Ralf-Peter Mundani
- email: [email protected], phone: 289–25057, room: 3181
- consultation hour: by appointment
- lecture: Tuesday, 12:00–13:30, room 02.07.023

Christoph Riesinger
- email: [email protected]
- exercise: Wednesday, 10:15–11:45, room 02.07.023 (fortnightly)

examination
- written, 90 minutes
- all printed/written materials allowed (no electronic devices)

materials: http://www5.in.tum.de


General Remarks: content

- part 1: introduction
- part 2: high-performance networks
- part 3: foundations
- part 4: shared-memory programming
- part 5: distributed-memory programming
- part 6: examples of parallel algorithms


Overview

- motivation
- hardware excursion
- supercomputers
- classification of parallel computers
- quantitative performance evaluation

"If one ox could not do the job they did not try to grow a bigger ox, but used two oxen."
—Grace Murray Hopper


Motivation: numerical simulation (from phenomena to predictions)

starting point: a physical phenomenon or technical process

1. modelling: determination of parameters, expression of relations
2. numerical treatment: model discretisation, algorithm development
3. implementation: software development, parallelisation
4. visualisation: illustration of abstract simulation results
5. validation: comparison of results with reality
6. embedding: insertion into the working process

(diagram: the pipeline spans the disciplines mathematics, computer science,
and the application discipline)


Motivation: why numerical simulation?

- because experiments are sometimes impossible
  (life cycle of galaxies, weather forecast, terror attacks, e.g.)
- because experiments are sometimes not welcome
  (avalanches, nuclear tests, medicine, e.g.)

(image: bomb attack on WTC (1993))


Motivation: why numerical simulation? (cont'd)

- because experiments are sometimes very costly and time consuming
  (protein folding, material sciences, e.g.)
- because experiments are sometimes more expensive
  (aerodynamics, crash test, e.g.)

(image: Mississippi basin model (Jackson, MS))


Motivation: why parallel programming and HPC?

complex problems (especially the so-called "grand challenges") demand more
computing power
- climate or geophysics simulation (tsunami, e.g.)
- structure or flow simulation (crash test, e.g.)
- development systems (CAD, e.g.)
- large data analysis (Large Hadron Collider at CERN, e.g.)
- military applications (cryptanalysis, e.g.)

performance increase due to
- faster hardware, more memory ("work harder")
- more efficient algorithms, optimisation ("work smarter")
- parallel computing ("get some help")


Motivation: objectives (in case all resources were available N times)

- throughput: compute N problems simultaneously
  running N instances of a sequential program with different data sets
  ("embarrassing parallelism"); SETI@home, e.g.
  drawback: limited resources of single nodes
- response time: compute one problem in a fraction (1/N) of the time
  running one instance (i.e. N processes) of a parallel program for jointly
  solving a problem; finding prime numbers, e.g.
  drawback: writing a parallel program; communication
- problem size: compute one problem with N-times larger data
  running one instance (i.e. N processes) of a parallel program, using the
  sum of all local memories for computing larger problem sizes; iterative
  solution of SLE, e.g.
  drawback: writing a parallel program; communication


Motivation: levels of parallelism

qualitative meaning: level(s) on which work is done in parallel

levels, ordered by increasing granularity:
- sub-instruction level
- instruction level
- block level
- process level
- program level


Motivation: levels of parallelism (cont'd)

program level
- parallel processing of different programs
- independent units without any shared data
- organised by the OS

process level
- a program is subdivided into processes to be executed in parallel
- each process consists of a larger amount of sequential instructions and
  some private data
- communication in most cases necessary (data exchange, e.g.)
- such processes are often referred to as heavy-weight processes


Motivation: levels of parallelism (cont'd)

block level
- blocks of instructions are executed in parallel
- each block consists of few instructions and shares data with others
- communication via shared variables; synchronisation mechanisms
- such blocks are often referred to as light-weight processes (threads)

instruction level
- parallel execution of machine instructions
- optimising compilers can increase this potential by modifying the order
  of commands

sub-instruction level
- instructions are further subdivided into units to be executed in parallel
  or via overlapping (vector operations, e.g.)


Overview

- motivation
- hardware excursion
- supercomputers
- classification of parallel computers
- quantitative performance evaluation


Hardware Excursion: definition of parallel computers

"A collection of processing elements that communicate and cooperate to
solve large problems" (ALMASI and GOTTLIEB, 1989)

possible appearances of such processing elements
- specialised units (steps of a vector pipeline, e.g.)
- parallel features in modern monoprocessors (instruction pipelining,
  superscalar architectures, VLIW, multithreading, multicore, …)
- several uniform arithmetical units (processing elements of array
  computers, GPGPUs, accelerators, e.g.)
- complete stand-alone computers connected via LAN (workstation or PC
  clusters, so-called virtual parallel computers)
- parallel computers or clusters connected via WAN (so-called metacomputers)


Hardware Excursion: instruction pipelining

instruction execution involves several operations
1. instruction fetch (IF)
2. decode (DE)
3. fetch operands (OP)
4. execute (EX)
5. write back (WB)
which are executed successively

hence, only one part of the CPU works at a given moment

(diagram: without pipelining, instruction N passes through all five stages
IF DE OP EX WB before instruction N+1 starts)


Hardware Excursion: instruction pipelining (cont'd)

observation: while processing a particular stage of an instruction, the
other stages are idle

hence, multiple instructions can be overlapped in execution: instruction
pipelining (similar to assembly lines)

advantage: no additional hardware necessary

(diagram: with pipelining, instructions N, N+1, N+2, N+3, N+4 enter the
pipeline in successive cycles, so their IF DE OP EX WB stages overlap in
time)
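
To make the timing argument concrete, here is a small illustrative C sketch
(mine, not from the slides) under the ideal-pipeline assumption of one stage
per cycle and no stalls: n instructions need k*n cycles without pipelining,
but only k + (n-1) cycles with a k-stage pipeline.

    #include <stdio.h>

    /* Ideal pipeline timing model: k stages, one stage per cycle,
     * no hazards or stalls (an idealisation; real CPUs do stall). */
    int main(void) {
        const int k = 5;                       /* IF, DE, OP, EX, WB */
        for (long n = 1; n <= 1000000; n *= 10) {
            long sequential = k * n;           /* each instruction runs alone */
            long pipelined  = k + (n - 1);     /* one result per cycle after fill */
            printf("n=%8ld  sequential=%9ld  pipelined=%9ld  speed-up=%.2f\n",
                   n, sequential, pipelined, (double)sequential / pipelined);
        }
        return 0;
    }

For large n the speed-up approaches the number of stages, here k = 5.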


Hardware Excursion: superscalar

faster CPU throughput due to simultaneous execution of instructions within
one clock cycle via redundant functional units (ALU, multiplier, …)

a dispatcher decides (during runtime) which instructions read from memory
can be executed in parallel and dispatches them to different functional
units

for instance, PowerPC 970 (4 ALU, 2 FPU)

but, performance improvement is limited (intrinsic parallelism)

(diagram: instructions 1-4 issued to four ALUs and instructions A-B to two
FPUs in the same cycle)


Hardware Excursion: superscalar (cont'd)

pipelining for superscalar architectures is also possible

(diagram: two instructions enter the pipeline per cycle, i.e. pairs of
IF DE OP EX WB sequences run side by side, staggered over time for
instructions N to N+9)


Hardware Excursion: very long instruction word (VLIW)

in contrast to superscalar architectures, the compiler groups parallel
executable instructions during compilation (pipelining still possible)

advantage: no additional hardware logic necessary
drawback: not always fully usable (→ dummy filling (NOP))

(diagram: one VLIW instruction bundles instr. 1-4, which operate on a
shared register file)


Hardware Excursion: vector units

simultaneous execution of one instruction on a one-dimensional array of
data (= vector)

vector units first appeared in the 1970s and were the basis of most
supercomputers in the 1980s and 1990s

specialised hardware
- very expensive
- limited application areas (mostly CFD, CSD, …)

(diagram: one instruction applied element-wise to the operand vectors
(A1 B1, A2 B2, A3 B3, …, AN BN)^T yields the result vector
(C1, C2, C3, …, CN)^T)
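
As an illustrative sketch (mine, not the slides'): the element-wise work a
vector unit performs with a single instruction corresponds to the following
C loop, which vectorising compilers typically map onto SIMD/vector
instructions.

    #include <stddef.h>

    /* C[i] = A[i] + B[i] for all i -- the per-element operation a vector
     * unit applies to whole operand vectors at once. */
    void vector_add(const float *a, const float *b, float *c, size_t n) {
        for (size_t i = 0; i < n; ++i)
            c[i] = a[i] + b[i];
    }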


Hardware Excursion: dual core, quad core, many core, and multicore

observation: increasing frequency f (and thus core voltage v) over the past
years
problem: thermal power dissipation P ∝ f*v^2


Hardware Excursion: dual core, quad core, many core, and multicore (cont'd)

a 25% reduction in performance (i.e. core voltage) leads to approx. a 50%
reduction in dissipation

(chart: dissipation and performance of a normal CPU vs. a reduced CPU)
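
The slide's proportionality can be sanity-checked numerically; a minimal C
sketch (mine; P ∝ f*v^2 is the slide's model, the extra assumption that f
scales linearly with v is my own):

    #include <stdio.h>

    int main(void) {
        double scale = 0.75;    /* 25% reduction in performance (core voltage) */
        double p_voltage_only = scale * scale;          /* P ~ f*v^2, f fixed */
        double p_freq_scales  = scale * scale * scale;  /* assuming f ~ v too */
        printf("dissipation, voltage only: %2.0f%% of original\n", 100 * p_voltage_only);
        printf("dissipation, f ~ v:        %2.0f%% of original\n", 100 * p_freq_scales);
        return 0;   /* 56% resp. 42%, i.e. roughly the 50% reduction quoted */
    }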


Hardware Excursion: dual core, quad core, many core, and multicore (cont'd)

idea: installation of two cores per die with the same dissipation as a
single-core system

(chart: dissipation and performance of a single-core vs. a dual-core die)


Hardware Excursion: dual core, quad core, many core, and multicore (cont'd)

single vs. dual vs. quad core

(diagram: a single-core die (core 0 with L1 and L2 cache), a dual-core die
(cores 0-1 with private L1 caches and a shared L2), and a quad-core die
(two dual-core pairs, each pair sharing an L2), all attached to the FSB)

FSB: front side bus (i.e. connection to memory (via north bridge))


Hardware Excursion: INTEL Nehalem Core i7

(diagram: four cores, each with private L1 and L2 caches, sharing an L3
cache and a QPI link; source: www.samrathacks.com)

QPI: QuickPath Interconnect, replaces the FSB (QPI is a point-to-point
interconnection – with a memory controller now on-die – in order to allow
both reduced latency and higher bandwidth of up to (theoretically)
25.6 GByte/s data transfer, i.e. roughly twice the FSB)


Hardware Excursion: Intel E5-2600 Sandy Bridge series

2 CPUs connected by 2 QPIs (Intel QuickPath Interconnect)

QuickPath Interconnect (1 sending and 1 receiving port):
8 GT/s ∙ 16 bit/T payload ∙ 2 directions / 8 bit/byte = 32 GB/s max
bandwidth per QPI

2 QPI links: 2 ∙ 32 GB/s = 64 GB/s max bandwidth

source: G. Wellein, RRZE
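
The bandwidth arithmetic above, restated as a tiny C check (illustrative
only; the 8 GT/s rate and 16 bit payload are the slide's figures):

    #include <stdio.h>

    int main(void) {
        double gt_per_s   = 8.0;    /* transfers per second (in G)  */
        double bits_per_t = 16.0;   /* payload bits per transfer    */
        double directions = 2.0;    /* 1 sending + 1 receiving port */
        double gb_per_qpi = gt_per_s * bits_per_t * directions / 8.0; /* bits -> bytes */
        printf("per QPI link: %.0f GB/s, two links: %.0f GB/s\n",
               gb_per_qpi, 2.0 * gb_per_qpi);
        return 0;
    }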


Overview

- motivation
- hardware excursion
- supercomputers
- classification of parallel computers
- quantitative performance evaluation


Supercomputers: arrival of clusters

in the late eighties, PCs became a commodity market with rapidly increasing
performance, mass production, and decreasing prices

growing attractiveness for parallel computers

1994: Beowulf, the first parallel computer built completely out of
commodity hardware
- NASA Goddard Space Flight Centre
- 16 Intel DX4 processors
- multiple 10 Mbit Ethernet links
- Linux with GNU compilers
- MPI library

1996: Beowulf cluster performing more than 1 GFlops
1997: a 140-node cluster performing more than 10 GFlops


Supercomputers

supercomputing or high-performance scientific computing as the most
important application of the big number crunchers

national initiatives due to huge budget requirements
- Accelerated Strategic Computing Initiative (ASCI) in the U.S.
  - in the sequel of the nuclear testing moratorium in 1992/93
  - decision: develop, build, and install a series of five supercomputers
    of up to $100 million each in the U.S.
  - start: ASCI Red (1997, Intel-based, Sandia National Laboratory, the
    world's first TFlops computer)
  - then: ASCI Blue Pacific (1998, LLNL), ASCI Blue Mountain, ASCI White, …
  - meanwhile a new high-end computing memorandum (2004)


Supercomputers (cont'd)

federal "Bundeshöchstleistungsrechner" initiative in Germany
- decision in the mid-nineties
- three federal supercomputing centres in Germany (Munich, Stuttgart, and
  Jülich)
- one new installation every second year (i.e. a six-year upgrade cycle for
  each centre)
- the newest one to be among the top 10 of the world

overview and state of the art: Top500 list (updated every six months), see
http://www.top500.org

finally (a somewhat different definition):
"Supercomputer: Turns CPU-bound problems into I/O-bound problems."
—Ken Batcher


Supercomputers: MOORE's law

observation of Intel co-founder Gordon E. MOORE, describing an important
trend in the history of computer hardware (1965)

the number of transistors that can be placed on an integrated circuit
increases exponentially, doubling approximately every eighteen months


Supercomputers: some numbers (Top500)

(charts: Top500 statistics)


Supercomputers: some numbers (Top500, cont'd)

Citius, altius, fortius!


Supercomputers: some numbers (Top500, cont'd)

(charts: Top500 statistics, continued)


Supercomputers: the 10 fastest supercomputers in the world (as of November 2014)

(table: Top500 top 10, November 2014)


Supercomputers: The Earth Simulator – world's #1 from 2002–04

- installed in 2002 in Yokohama, Japan; ES building approx. 50 m × 65 m × 17 m
- based on the NEC SX-6 architecture, developed by three governmental agencies
- highly parallel vector supercomputer
- consists of 640 nodes (plus 2 control and 128 data-switching nodes)
  - 8 vector processors (8 GFlops each) and 16 GB shared memory per node
  - 5120 processors (40.96 TFlops peak performance) and 10 TB memory in
    total; 35.86 TFlops sustained performance (Linpack)
- nodes connected by a 640×640 single-stage crossbar (83,200 cables with a
  total extension of 2400 km; 8 TB/s total bandwidth)
- further 700 TB disc space and 1.60 PB mass storage


Supercomputers: BlueGene/L – world's #1 from 2004–08

- installed in 2005 at LLNL, CA, USA (beta system in 2004 at IBM)
- cooperation of DoE, LLNL, and IBM
- massively parallel supercomputer
- consists of 65,536 nodes (plus 12 front-end and 1204 I/O nodes)
  - 2 PowerPC 440d processors (2.8 GFlops each) and 512 MB memory per node
  - 131,072 processors (367.00 TFlops peak performance) and 33.50 TB memory
    in total; 280.60 TFlops sustained performance (Linpack)
- nodes configured as a 3D torus (32 × 32 × 64); global reduction tree for
  fast operations (global max/sum) in a few microseconds
- 1024 Gbps link to the global parallel file system
- further 806 TB disc space; operating system SuSE SLES 9


Supercomputers: Roadrunner – world's #1 from 2008–09

- installed in 2008 at LANL, NM, USA; installation costs about $120 million
- first "hybrid" supercomputer: dual-core Opteron + Cell Broadband Engine
- 129,600 cores (1456.70 TFlops peak performance) and 98 TB memory;
  1144.00 TFlops sustained performance (Linpack)
- standard processing (file system I/O, e.g.) handled by the Opterons,
  while mathematically and CPU-intensive tasks are handled by the Cells
- 2.35 MW power consumption (≈ 437 MFlops per Watt)
- primary usage: ensure safety and reliability of the nation's nuclear
  weapons stockpile; also real-time applications (cause and effect in
  capital markets, renderings of bone structures and tissues as patients
  are being examined, e.g.)


Supercomputers: HLRB II – world's #6 of 04/2006

- installed in 2006 at LRZ, Garching; installation costs 38 M€, monthly
  costs approx. 400,000 €; upgrade in 2007 (finished)
- one of Germany's 3 supercomputers
- SGI Altix 4700, consisting of 19 nodes (SGI NUMAlink 2D torus)
  - 256 blades per node (ccNUMA link with partition fat tree)
  - Intel Itanium2 Montecito dual core (12.80 GFlops), 4 GB memory per core
- 9728 cores (62.30 TFlops peak performance) and 39 TB memory; 56.50 TFlops
  sustained performance (Linpack)
- footprint 24 m × 12 m; total weight 103 metric tons


Supercomputers: SuperMUC – world's #4 of 06/2012

- installed in 2012 at LRZ, Garching; IBM System x iDataPlex
- (still) one of Germany's 3 supercomputers
- consists of 19 islands (Infiniband FDR10 pruned tree with 4:1
  intra-island / inter-island ratio)
  - 18 thin islands with 512 nodes each (total 288 TB memory),
    Sandy Bridge-EP Xeon E5 (2 CPUs (8 cores each) per node)
  - 1 fat island with 205 nodes (total 52 TB memory),
    Westmere-EX Xeon E7 (4 CPUs (10 cores each) per node)
- 147,456 cores (3.185 PFlops peak performance – thin islands only);
  2.897 PFlops sustained performance (Linpack)
- footprint 21 m × 26 m; warm-water cooling


Overview

- motivation
- hardware excursion
- supercomputers
- classification of parallel computers
- quantitative performance evaluation


Classification of Parallel Computers: standard classification according to FLYNN

global data and instruction streams as criterion
- instruction stream: sequence of commands to be executed
- data stream: sequence of data subject to instruction streams

two-dimensional subdivision according to
- the amount of instructions a computer can execute per time
- the amount of data elements a computer can process per time

hence, FLYNN distinguishes four classes of architectures
- SISD: single instruction, single data
- SIMD: single instruction, multiple data
- MISD: multiple instruction, single data
- MIMD: multiple instruction, multiple data

drawback: very different computers may belong to the same class


Classification of Parallel Computers: FLYNN (cont'd)

SISD
- one processing unit that has access to one data memory and to one
  program memory
- classical monoprocessor following VON NEUMANN's principle

(diagram: data memory ↔ processor ↔ program memory)


Classification of Parallel Computers: FLYNN (cont'd)

SIMD
- several processing units, each with separate access to a (shared or
  distributed) data memory; one program memory
- synchronous execution of instructions
- example: array computer, vector computer
- advantage: easy programming model due to control flow with a strictly
  synchronous-parallel execution of all instructions
- drawback: specialised hardware necessary, easily becomes outdated due to
  recent developments on the commodity market

(diagram: one program memory feeding several processors, each with its own
data memory)


Classification of Parallel Computers: FLYNN (cont'd)

MISD
- several processing units that have access to one data memory; several
  program memories
- not a very popular class (mainly for special applications such as digital
  signal processing)
- operates on a single stream of data, forwarding results from one
  processing unit to the next
- example: systolic array (network of primitive processing elements that
  "pump" data)

(diagram: one data memory feeding several processors, each with its own
program memory)


Classification of Parallel Computers: FLYNN (cont'd)

MIMD
- several processing units, each with separate access to a (shared or
  distributed) data memory; several program memories
- classification according to (physical) memory organisation
  - shared memory → shared (global) address space
  - distributed memory → distributed (local) address space
- example: multiprocessor systems, networks of computers

(diagram: several processors, each with its own data and program memory)


Classification of Parallel Computers: processor coupling

cooperation of processors/computers, as well as their shared use of various
resources, requires communication and synchronisation

the following types of processor coupling can be distinguished
- memory-coupled multiprocessor systems (MemMS)
- message-coupled multiprocessor systems (MesMS)

                              distributed memory    global memory
  shared address space        Mem-MesMS (hybrid)    MemMS, SMP
  distributed address space   MesMS                 -


Classification of Parallel Computers: processor coupling (cont'd)

uniform memory access (UMA)
- each processor P has direct access via the network to each memory module
  M, with the same access times to all data
- the standard programming model can be used (i.e. no explicit send/receive
  of messages necessary)
- communication and synchronisation via shared variables (inconsistencies
  (write conflicts, e.g.) have in general to be prevented by the programmer)

(diagram: processors P connected via a network to memory modules M)


Classification of Parallel Computers: processor coupling (cont'd)

symmetric multiprocessor (SMP)
- only a small number of processors, in most cases a central bus, one
  address space (UMA), but bad scalability
- cache coherence implemented in hardware (i.e. a read always provides a
  variable's value from its last write)
- example: double or quad boards, SGI Challenge

(diagram: processors P, each with a cache C, attached via a bus to a shared
memory M)


Classification of Parallel Computers: processor coupling (cont'd)

non-uniform memory access (NUMA)
- memory modules physically distributed among processors
- shared address space, but access times depend on the location of the data
  (i.e. local addresses are faster than remote addresses)
- differences in access times are visible in the program
- example: DSM/VSM, Cray T3E

(diagram: processor-memory pairs connected via a network)


Classification of Parallel Computers: processor coupling (cont'd)

cache-coherent non-uniform memory access (ccNUMA)
- caches for local and remote addresses; cache coherence implemented in
  hardware for the entire address space
- problem with scalability due to frequent cache actualisations
- example: SGI Origin 2000

(diagram: processor-cache-memory triples connected via a network)


Classification of Parallel Computers: processor coupling (cont'd)

cache-only memory access (COMA)
- each processor has only cache memory
- the entirety of all cache memories = the global shared memory
- cache coherence implemented in hardware
- example: Kendall Square Research KSR-1

(diagram: processors P, each with a cache C, connected via a network)


Classification of Parallel Computers: processor coupling (cont'd)

no remote memory access (NORMA)
- each processor has direct access to its local memory only
- access to remote memory is only possible via explicit message exchange
  (due to the distributed address space)
- synchronisation implicitly via the exchange of messages
- performance improvement between memory and I/O possible due to parallel
  data transfer (Direct Memory Access, e.g.)
- example: IBM SP2, ASCI Red/Blue/White

(diagram: processor-memory pairs connected via a network, with no direct
access to remote memories)


Classification of Parallel Computers: difference between processes and threads

(diagram: in the process model (NORMA), several instances of a program
(*.exe, *.out, e.g.) run as separate processes and exchange messages; in
the thread model (UMA, NUMA), one program instance is subdivided into
several threads that share a common address space)


Overview

- motivation
- hardware excursion
- supercomputers
- classification of parallel computers
- quantitative performance evaluation


Quantitative Performance Evaluation: execution time

time T of a parallel program: between the start of the execution on one
processor and the end of all computations on the last processor

during execution, all processors are in one of the following states
- compute: T_COMP, time spent for computations
- communicate: T_COMM, time spent for send and receive operations
- idle: T_IDLE, time spent waiting (for messages to be sent/received)

hence T = T_COMP + T_COMM + T_IDLE


Quantitative Performance Evaluation: multiprocessor vs. monoprocessor

correlation of multi- and monoprocessor systems' performance
important: a program that can be executed on both systems

definitions
- P(1): amount of unit operations of a program on the monoprocessor system
- P(p): amount of unit operations of a program on the multiprocessor system
  with p processors
- T(1): execution time of a program on the monoprocessor system (measured
  in steps or clock cycles)
- T(p): execution time of a program on the multiprocessor system (measured
  in steps or clock cycles) with p processors


Quantitative Performance Evaluation: multiprocessor vs. monoprocessor (cont'd)

simplifying preconditions
- T(1) = P(1): exactly one operation is executed per step on the
  monoprocessor system
- T(p) ≤ P(p): more than one operation can be executed per step (for
  p ≥ 2) on the multiprocessor system with p processors


Quantitative Performance Evaluation: multiprocessor vs. monoprocessor (cont'd)

speed-up
- S(p) indicates the improvement in processing speed

      S(p) = T(1) / T(p)      with 1 ≤ S(p) ≤ p

efficiency
- E(p) indicates the relative improvement in processing speed; the
  improvement is normalised by the amount of processors p

      E(p) = S(p) / p         with 1/p ≤ E(p) ≤ 1
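
A minimal C sketch (my own, using only the two definitions above) that
computes speed-up and efficiency from measured run times; the timing values
are hypothetical:

    #include <stdio.h>

    /* S(p) = T(1)/T(p) and E(p) = S(p)/p, straight from the definitions. */
    int main(void) {
        double t1 = 120.0;      /* hypothetical monoprocessor run time   */
        double tp = 18.0;       /* hypothetical run time on p processors */
        int    p  = 8;
        double s  = t1 / tp;    /* speed-up   */
        double e  = s / p;      /* efficiency */
        printf("S(%d) = %.2f, E(%d) = %.2f\n", p, s, p, e);
        return 0;
    }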


Quantitative Performance Evaluation: multiprocessor vs. monoprocessor (cont'd)

speed-up and efficiency can be seen in two different ways
- algorithm-independent: the best known sequential algorithm for the
  monoprocessor system is compared to the respective parallel algorithm
  for the multiprocessor system
  → absolute speed-up, absolute efficiency
- algorithm-dependent: the parallel algorithm is treated as a sequential
  one to measure the execution time on the monoprocessor system; "unfair"
  due to communication and synchronisation overhead
  → relative speed-up, relative efficiency


Quantitative Performance Evaluation: scalability

objective: adding further processing elements to the system shall reduce
the execution time without any program modifications
- i.e. a linear performance increase with an efficiency close to 1

important for scalability is a sufficient problem size
- one porter may carry one suitcase in a minute
- 60 porters won't do it in a second
- but 60 porters may carry 60 suitcases in a minute

in case of a fixed problem size and an increasing amount of processors,
saturation will occur for a certain value of p; hence scalability is limited

when scaling the amount of processors together with the problem size
(so-called scaled problem analysis) this effect will not appear for
well-scalable hard- and software systems


Quantitative Performance Evaluation: AMDAHL's law

the probably most important and most famous estimate for the speed-up
(even if quite pessimistic)

underlying model
- each program has a sequential part s, 0 ≤ s ≤ 1, that can only be
  executed in a sequential way: synchronisation, data I/O, …
- furthermore, each program consists of a parallelisable part 1-s that can
  be executed in parallel by several processes; finding the maximum value
  within a set of numbers, e.g.

hence, the execution time of the parallel program executed on p processors
can be written as

      T(p) = s*T(1) + ((1-s)/p)*T(1)


Quantitative Performance Evaluation: AMDAHL's law (cont'd)

the speed-up can thus be computed as

      S(p) = T(1) / T(p)
           = T(1) / (s*T(1) + ((1-s)/p)*T(1))
           = 1 / (s + (1-s)/p)

when increasing p we finally get AMDAHL's law

      lim(p→∞) S(p) = lim(p→∞) 1 / (s + (1-s)/p) = 1/s

- speed-up is bounded: S(p) ≤ 1/s
- the sequential part can have a dramatic impact on the speed-up
- therefore the central effort of all (parallel) algorithms: keep s small
- many parallel programs have a small sequential part (s < 0.1)
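
A small C sketch (my own illustration of the formula above) that evaluates
S(p) = 1/(s + (1-s)/p) and shows how quickly the bound 1/s is approached:

    #include <stdio.h>

    /* AMDAHL's law: S(p) = 1 / (s + (1 - s) / p), bounded by 1/s. */
    static double amdahl(double s, int p) {
        return 1.0 / (s + (1.0 - s) / p);
    }

    int main(void) {
        double s = 0.1;                     /* sequential fraction */
        for (int p = 1; p <= 1024; p *= 4)
            printf("p = %4d: S(p) = %.2f\n", p, amdahl(s, p));
        printf("bound 1/s = %.1f\n", 1.0 / s);
        return 0;
    }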


Quantitative Performance Evaluation: AMDAHL's law (cont'd)

example: s = 0.1

(plot: speed-up S(p) for up to 100 processes; the curve levels off below
the bound 10)

independent of p, the speed-up is bounded by this limit; where's the error?


Quantitative Performance Evaluation: GUSTAFSON's law

addresses the shortcomings of AMDAHL's law, as it states that any
sufficiently large problem can be efficiently parallelised

instead of a fixed problem size it supposes a fixed time concept

underlying model
- the execution time on the parallel machine is normalised to 1
- this contains a non-parallelisable part σ, 0 ≤ σ ≤ 1

hence, the execution time of the sequential program on the monoprocessor
can be written as

      T(1) = σ + p*(1 - σ)

the speed-up can thus be computed as

      S(p) = T(1) / T(p) = σ + p*(1 - σ) = p + (1 - p)*σ
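
The same as a C sketch for GUSTAFSON's model (again my own illustration of
the formula above): the scaled speed-up S(p) = p + (1-p)*σ grows linearly
with p instead of saturating.

    #include <stdio.h>

    /* GUSTAFSON's law: S(p) = p + (1 - p) * sigma, where sigma is the
     * non-parallelisable fraction of the normalised parallel run time. */
    static double gustafson(double sigma, int p) {
        return p + (1 - p) * sigma;
    }

    int main(void) {
        double sigma = 0.1;
        for (int p = 1; p <= 1024; p *= 4)
            printf("p = %4d: S(p) = %7.2f\n", p, gustafson(sigma, p));
        return 0;   /* grows like ~0.9*p, unbounded as p increases */
    }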


Quantitative Performance Evaluation: GUSTAFSON's law (cont'd)

difference to AMDAHL
- the sequential part s(p) is not constant, but gets smaller with
  increasing p

      s(p) = σ / (σ + p*(1 - σ)) → 0 for p → ∞

- often more realistic, because more processors are used for a larger
  problem size, and here the parallelisable parts typically increase (more
  computations, less declarations, …)
- speed-up is not bounded for increasing p


Quantitative Performance Evaluation: GUSTAFSON's law (cont'd)

some more thoughts about speed-up
- theory tells: a superlinear speed-up does not exist
  - each parallel algorithm can be simulated on a monoprocessor system by
    emulating in a loop always the next step of a processor from the
    multiprocessor system
- but superlinear speed-up can be observed
  - when improving an inferior sequential algorithm
  - when a parallel program (that does not fit into the main memory of the
    monoprocessor system) completely runs in the caches and main memory of
    the nodes of the multiprocessor system


Quantitative Performance Evaluation: communication-computation ratio (CCR)

- important quantity for measuring the success of a parallelisation
- relation of pure communication time to pure computing time
- a small CCR is favourable
- typically: the CCR decreases with increasing problem size

example
- an N×N matrix is distributed among p processors (N/p rows each)
- iterative method: in each step, each matrix element is replaced by the
  average of its eight neighbour values
- hence, the two neighbouring rows are always necessary
- computation time: 8*N*N/p
- communication time: 2*N
- CCR: p/(4*N); what does this mean?
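
The example's arithmetic as a small C check (my sketch; the cost model of
8*N*N/p computation vs. 2*N communication is the slide's): the measured
ratio reproduces CCR = p/(4N), so for fixed p the ratio shrinks as N grows,
i.e. larger problems communicate relatively less.

    #include <stdio.h>

    int main(void) {
        int p = 16;                             /* number of processors */
        for (int n = 64; n <= 4096; n *= 4) {
            double t_comp = 8.0 * n * n / p;    /* 8 ops per element, N*N/p elements */
            double t_comm = 2.0 * n;            /* two neighbouring rows of length N */
            printf("N = %4d: CCR = %.4f (= p/(4N) = %.4f)\n",
                   n, t_comm / t_comp, (double)p / (4.0 * n));
        }
        return 0;
    }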


Twelve ways...
...to fool the masses when giving performance results on parallel computers.
—David H. Bailey, NASA Ames Research Centre, 1991

1. Quote only 32-bit performance results, not 64-bit results.
2. Present performance figures for an inner kernel, and then represent
   these figures as the performance of the entire application.
3. Quietly employ assembly code and other low-level language constructs.
4. Scale up the problem size with the number of processors, but omit any
   mention of this fact.
5. Quote performance results projected to a full system.
6. Compare your results against scalar, unoptimised codes on Crays.


Twelve ways... (cont'd)

7. When direct run time comparisons are required, compare with an old code
   on an obsolete system.
8. If MFLOPS rates must be quoted, base the operation count on the parallel
   implementation, not on the best sequential implementation.
9. Quote performance in terms of processor utilisation, parallel speed-ups
   or MFLOPS per dollar.
10. Mutilate the algorithm used in the parallel implementation to match the
    architecture.
11. Measure parallel run times on a dedicated system, but measure
    conventional run times in a busy environment.
12. If all else fails, show pretty pictures and animated videos, and don't
    talk about performance.