© 2011 IBM Corporation
High Performance Computing Challenges on the Road to Exascale Computing
April 2011
H. J. Schick IBM Germany Research & Development GmbH
Agenda
Introduction
The What and Why of High Performance Computing
Exascale Challenges
Balanced Systems
Blue Gene Architecture and Blue Gene Active Storage
Supercomputers in a Sugar Cube
Origin of the “Jugene” Supercomputer
Supercomputer Satisfies Need for FLOPS
FLOPS = FLoating point OPerations per Second.
– Mega = 10^6, Giga = 10^9, Tera = 10^12, Peta = 10^15, Exa = 10^18
Simulation is a major application area.
Many simulations are based on the notion of a “timestep”.
– At each timestep, advance the constituent parts according to their physics or chemistry.
– Example challenge: molecular dynamics has a picosecond (10^-12 s) timescale, but many biological processes have a millisecond (10^-3 s) timescale.
• The simulation needs 10^9 timesteps!
Each timestep requires many operations!
Simulation pseudo-code:

double time = 0.0;               // simulated time in seconds
const double dt = 1.0e-12;       // one picosecond per timestep
// Initialize state of atoms.
while (time < 1.0e-3) {          // run to one millisecond
    // Calculate forces on 40,000 atoms.
    // Calculate velocities of all atoms.
    // Advance position of all atoms.
    time = time + dt;            // 10^9 iterations in total
}
// Write biology result.
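To make the commented steps concrete, here is a minimal, illustrative C sketch of one timestep using velocity-Verlet integration and a Lennard-Jones force in reduced units. The potential, the constants, and the O(N^2) force loop are assumptions for the sketch, not the method of any particular Blue Gene code.

#define N 40000                       /* atom count from the slide */

typedef struct { double x, y, z; } vec3;

static vec3 pos[N], vel[N], force[N];
static const double dt   = 1.0;       /* one timestep, reduced units */
static const double mass = 1.0;       /* illustrative unit mass */

/* Pairwise Lennard-Jones forces; O(N^2) for clarity. Production codes
   use neighbour or cell lists and parallelize this loop. */
static void compute_forces(void) {
    for (int i = 0; i < N; i++) force[i] = (vec3){0.0, 0.0, 0.0};
    for (int i = 0; i < N; i++) {
        for (int j = i + 1; j < N; j++) {
            vec3 d = { pos[i].x - pos[j].x,
                       pos[i].y - pos[j].y,
                       pos[i].z - pos[j].z };
            double r2   = d.x*d.x + d.y*d.y + d.z*d.z;
            double inv6 = 1.0 / (r2 * r2 * r2);
            /* force magnitude / r for U = 4(r^-12 - r^-6) */
            double f = 24.0 * inv6 * (2.0 * inv6 - 1.0) / r2;
            force[i].x += f * d.x;  force[j].x -= f * d.x;
            force[i].y += f * d.y;  force[j].y -= f * d.y;
            force[i].z += f * d.z;  force[j].z -= f * d.z;
        }
    }
}

/* One velocity-Verlet timestep: half-kick, drift, new forces, half-kick.
   Call compute_forces() once before the first step. */
static void timestep(void) {
    for (int i = 0; i < N; i++) {
        vel[i].x += 0.5 * dt * force[i].x / mass;
        vel[i].y += 0.5 * dt * force[i].y / mass;
        vel[i].z += 0.5 * dt * force[i].z / mass;
        pos[i].x += dt * vel[i].x;
        pos[i].y += dt * vel[i].y;
        pos[i].z += dt * vel[i].z;
    }
    compute_forces();
    for (int i = 0; i < N; i++) {
        vel[i].x += 0.5 * dt * force[i].x / mass;
        vel[i].y += 0.5 * dt * force[i].y / mass;
        vel[i].z += 0.5 * dt * force[i].z / mass;
    }
}

Run 10^9 of these steps and it is the total operation count, not the per-step cost, that demands supercomputer-class FLOPS.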
Supercomputing is Capability Computing
A single instance of an application using large, tightly-coupled computer resources.
– For example, a single 1000-year climate simulation.
Contrast to Capacity Computing:
– Many instances of one or more applications using large, loosely-coupled computer resources.
– For example, 1000 independent 1-year climate simulations.
– Often trivial parallelism.
– Often suited for GRID or SETI@Home-style systems.
Supercomputer Versus Your Desktop
Assume a 2000-processor supercomputer delivers a simulation result in 1 day.
Assuming memory size is not a problem, your 1-processor desktop would deliver the same result in 2000 days ≈ 5.5 years.
So supercomputers make results available on a human timescale.
But what could you do if all objects were intelligent…
…and connected?
What could you do with unlimited computing power… for pennies?
Could you predict the path of a storm down to the square kilometer? Could you identify another 20% of proven oil reserves without drilling one hole?
Grand Challenges
“A grand challenge is a fundamental problem in science or
engineering, with broad applications, whose solution would be
enabled by the application of high performance computing resources
that could become available in the near future.”
Computational fluid dynamics:
• Design of hypersonic aircraft, efficient automobile bodies, and extremely quiet submarines.
• Weather forecasting for short- and long-term effects.
• Efficient recovery of oil, and many other applications.
Electronic structure calculations for the design of new materials:
• Chemical catalysts
• Immunological agents
• Superconductors
Calculations to understand the fundamental nature of matter:
• Quantum chromodynamics
• Condensed matter theory
Enough Atoms to See Grains in Solidification of Metal http://www-phys.llnl.gov/Research/Metals_Alloys/news.html
Building Blocks of Matter
QPACE = QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
Quarks are the constituents of matter; they interact strongly by exchanging gluons.
Particular phenomena:
– Confinement
– Asymptotic freedom (Nobel Prize 2004)
Theory of strong interactions = Quantum Chromodynamics (QCD)
Projected Performance Development
Almost a doubling every year!
Extrapolating an Exaflop in 2018
Standard technology scaling will not get us there in 2018.

| Metric | Blue Gene/L (2005) | Exaflop, directly scaled | Exaflop compromise (traditional technology) | Assumption for “compromise guess” |
|---|---|---|---|---|
| Node peak performance | 5.6 GF | 20 TF | 20 TF | Same node count (64k). |
| Hardware concurrency/node | 2 | 8,000 | 1,600 | Assume 3.5 GHz. |
| System power in compute chips | 1 MW | 3.5 GW | 25 MW | Expected from technology improvement through 4 technology generations (only compute-chip power scaling; I/Os scaled the same way). |
| Link bandwidth (each unidirectional 3-D link) | 1.4 Gbps | 5 Tbps | 1 Tbps | Not possible to maintain the bandwidth ratio. |
| Wires per unidirectional 3-D link | 2 | 400 | 80 | A large wire count would eliminate high density and drive links onto cables, where they are 100x more expensive; assume 20 Gbps signaling. |
| Pins in network on node | 24 | 5,000 | 1,000 | 20 Gbps differential assumed; 20 Gbps over copper is limited to 12 inches, so in-rack interconnects will need optics (10 Gbps is possible today in both copper and optics). |
| Power in network | 100 kW | 20 MW | 4 MW | 10 mW/Gbps assumed. Today: 25 mW/Gbps for long distance (more than 2 feet on copper), both ends, one direction; 45 mW/Gbps for optics, both ends, one direction, plus 15 mW/Gbps electrical. Future: links separately optimized for power. |
| Memory bandwidth/node | 5.6 GB/s | 20 TB/s | 1 TB/s | Not possible to maintain external bandwidth per FLOP. |
| L2 cache/node | 4 MB | 16 GB | 500 MB | About 6-7 technology generations with expected eDRAM density improvements. |
| Data pins for memory/node | 128 | 40,000 | 2,000 | 3.2 Gbps per pin. |
| Power in memory I/O (not DRAM) | 12.8 kW | 80 MW | 4 MW | 10 mW/Gbps assumed; most current power is in the address bus. Future: probably about 15 mW/Gbps, maybe 10 mW/Gbps (2.5 mW/Gbps is C·V²·f for random data on the data pins); address power is higher. |
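A quick way to see where the “directly scaled” column comes from: multiply each Blue Gene/L per-node figure by the node-performance ratio 20 TF / 5.6 GF ≈ 3,571x. A minimal C sketch of that arithmetic, with the table's figures hard-coded (the 3.6 GW it prints matches the table's 3.5 GW up to rounding):

#include <stdio.h>

int main(void) {
    /* Blue Gene/L (2005) figures from the table above. */
    const double node_perf_gf  = 5.6;   /* GF per node                  */
    const double link_bw_gbps  = 1.4;   /* per unidirectional 3-D link  */
    const double mem_bw_gbs    = 5.6;   /* GB/s per node                */
    const double chip_power_mw = 1.0;   /* MW, whole 64k-node system    */

    /* Target: 20 TF per node at unchanged node count. */
    const double scale = 20.0e3 / node_perf_gf;          /* ~3571x */

    printf("scale factor      : %.0fx\n", scale);
    printf("link bandwidth    : %.1f Tbps\n", link_bw_gbps * scale / 1e3);
    printf("memory bandwidth  : %.1f TB/s\n", mem_bw_gbs   * scale / 1e3);
    printf("compute-chip power: %.1f GW\n",   chip_power_mw * scale / 1e3);
    return 0;
}

The “compromise” column then shows how far each directly scaled number must be cut back to stay buildable.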
The Big Leap from Petaflops to Exaflops
We will hit 20 Petaflops in 2011/2012; research is now beginning for an ~2018 Exascale system.
The IT/CMOS industry is trying to double performance every 2 years.
The HPC industry is trying to double performance every year.
Technology disruptions in many areas:
– BAD NEWS: Scalability of current technologies?
• Silicon power, interconnect, memory, packaging.
– GOOD NEWS: Emerging technologies?
• Memory technologies (e.g. storage class memory), 3D chips, etc.
Exploiting exascale machines:
– Want to maximize science output per €.
– Need multiple partner applications to evaluate HW trade-offs.
Exascale Challenges – Energy
Power consumption will increase in the future!
What is the critical limit?
– JSC has 5 MW today, with the potential for 10 MW.
– 1 MW costs roughly 1 M€ per year (see the sketch below).
– 20 MW is expected to be the critical limit.
Are exascale systems a Large Scale Facility?
– The LHC uses 100 MW.
Energy efficiency:
– Cooling uses a significant fraction of the power (PUE > 1.2 today; the goal is 1.0).
– Hot cooling water (40°C and more) might help.
– Free cooling: use outside air to cool the water.
– Heat recycling: use waste heat for heating, cooling, etc.
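The “1 MW ≈ 1 M€ per year” rule of thumb falls out of a simple calculation. In this sketch the electricity price and the PUE value are illustrative assumptions, not figures from the slide:

#include <stdio.h>

int main(void) {
    const double it_power_mw    = 1.0;    /* IT load in MW                 */
    const double pue            = 1.2;    /* facility power / IT power     */
    const double eur_per_kwh    = 0.10;   /* assumed electricity price     */
    const double hours_per_year = 24.0 * 365.0;

    double facility_kw = it_power_mw * 1000.0 * pue;
    double cost_eur    = facility_kw * hours_per_year * eur_per_kwh;

    /* 1 MW of IT at PUE 1.2 and 0.10 EUR/kWh: ~1.05 M EUR per year. */
    printf("annual energy cost: %.2f M EUR\n", cost_eur / 1e6);
    return 0;
}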
Exascale Challenges – Resiliency
Ever-increasing number of components:
– O(10,000) nodes
– O(100,000) DIMMs of RAM
Each component's MTBF will not increase:
– Optimistic: it remains constant.
– Realistic: smaller structures and lower voltages → it decreases.
The global MTBF will decrease:
– What is the critical limit? 1 day? 1 hour? The time to write a checkpoint! (See the sketch below.)
How to handle failures:
– Try to anticipate failures via monitoring.
– Software must help to handle failures:
• checkpoints, process migration, transactional computing
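A minimal sketch of the scaling argument: with N independent components, the system MTBF is roughly the component MTBF divided by N, and Young's first-order formula gives a near-optimal checkpoint interval of sqrt(2 × checkpoint-write time × system MTBF). The component MTBF and checkpoint-write time below are illustrative assumptions:

#include <stdio.h>
#include <math.h>

int main(void) {
    const double comp_mtbf_h  = 1.0e6;       /* assumed MTBF of one component, hours */
    const double n_components = 1.0e5;       /* O(100,000) DIMMs, nodes, ...         */
    const double ckpt_write_h = 10.0 / 60.0; /* assume 10 min to write a checkpoint  */

    /* Independent failures: system MTBF ~ component MTBF / N. */
    double sys_mtbf_h = comp_mtbf_h / n_components;

    /* Young's formula: T_opt ~ sqrt(2 * ckpt_time * system MTBF). */
    double t_opt_h = sqrt(2.0 * ckpt_write_h * sys_mtbf_h);

    printf("system MTBF        : %.1f hours\n", sys_mtbf_h); /* 10.0 h  */
    printf("checkpoint interval: %.1f hours\n", t_opt_h);    /* ~1.8 h  */
    return 0;
}

Once the checkpoint interval approaches the time needed to write the checkpoint itself, the machine makes no forward progress, which is exactly the critical limit the slide asks about.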
Exascale Challenges – Applications
Ever-increasing levels of parallelism:
– Thousands of nodes, hundreds of cores, dozens of registers
– Automatic parallelization vs. explicit exposure
– How large are coherency domains?
– How many languages do we have to learn?
MPI + X is most probably not sufficient:
– 1 process per core makes orchestration of processes harder.
– GPUs require explicit handling today (CUDA, OpenCL).
What is the future paradigm? (A minimal MPI + X sketch follows below.)
– MPI + X + Y? PGAS + X (+ Y)?
– PGAS: UPC, Co-Array Fortran, X10, Chapel, Fortress, …
Which applications are inherently scalable enough at all?
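For concreteness, here is a minimal “MPI + X” sketch with X = OpenMP: one MPI process per node and threads filling the cores inside it. This is only the common hybrid pattern, not a recommendation made by the slide:

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    int provided, rank, nranks;

    /* Ask MPI to coexist with threads (main thread makes MPI calls). */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* One MPI rank per node; OpenMP threads exploit the cores. */
    #pragma omp parallel
    {
        #pragma omp single
        printf("rank %d of %d runs %d threads\n",
               rank, nranks, omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}

Even this two-level scheme already forces the programmer to reason about two domains of parallelism, which is the orchestration problem the slide points at.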
Balanced Systems

Example caxpy (complex a·x + y) – sustained balance per processor:

| Processor | FPU throughput [FLOPS/cycle] | Memory bandwidth [words/cycle] | Balance [FLOPS/word] |
|---|---|---|---|
| apeNEXT | 8 | 2 | 4 |
| QCDOC (MM) | 2 | 0.63 | 3.2 |
| QCDOC (LS) | 2 | 2 | 1 |
| Xeon | 2 | 0.29 | 7 |
| GPU | 128 x 2 | 17.3 (*) | 14.8 |
| Cell/B.E. (MM) | 8 x 4 | 1 | 32 |
| Cell/B.E. (LS) | 8 x 4 | 8 x 4 | 1 |

(MM = main memory, LS = local store.)
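The balance column is the machine's FLOPS per cycle divided by its words per cycle. The kernel itself offers roughly 8 FLOPS per 6 words moved (≈ 1.3 FLOPS/word), so any machine whose balance number exceeds that leaves FPU cycles idle waiting for memory. A plain-C sketch of the kernel:

#include <complex.h>

/* caxpy: y <- a*x + y for complex single-precision vectors.
   Per element: 8 FLOPS (6 for the complex multiply, 2 for the add)
   against 6 words moved (load x, load y, store y; 2 words each),
   i.e. about 1.3 FLOPS per word of memory traffic. */
void caxpy(int n, float complex a,
           const float complex *x, float complex *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}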
Balanced Systems ???
… but are they Reliable, Available and Serviceable ???
Blue Gene/P

Chip: 4 processors, 13.6 GF/s, 8 MB eDRAM
Compute card: 1 chip, 20 DRAMs, 13.6 GF/s, 2.0 GB DDR2 (4.0 GB from 6/30/08)
Node card: 32 compute cards, 0-1 I/O cards, 435 GF/s, 64 (128) GB (32 chips, 4x4x2)
Rack: 32 node cards, 13.9 TF/s, 2 (4) TB
System: 72 racks (72x32x32, cabled 8x8x16), 1 PF/s, 144 (288) TB
Blue Gene/P Compute ASIC

[Block diagram.] Key elements:
– 4x PPC450 cores, each with 32 KB I1 / 32 KB D1 caches and a Double FPU
– L2 with snoop filter per core; shared SRAM; multiplexing switches; DMA engine feeding the networks
– 2x 4 MB eDRAM, usable as L3 cache or on-chip memory (512b data + 72b ECC paths, shared L3 directory with ECC)
– 2x DDR-2 controllers with ECC; 13.6 GB/s DDR-2 DRAM bus
– Torus network: 6 links, 3.4 Gb/s bidirectional each
– Collective network: 3 links, 6.8 Gb/s bidirectional each
– Global barrier network: 4 global barriers or interrupts
– 10 Gbit Ethernet; JTAG access
Blue Gene/P Compute Card
[Photo with callouts:]
– 2 x 16GB interface to 2 or 4 GB SDRAM-DDR2
– BGQ ASIC, 29 mm x 29 mm FC-PBGA
– NVRAM, monitors, decoupling, Vtt termination
– All network and I/O, power input
Blue Gene/P Node Board

[Photo with callouts:]
– 32 compute nodes
– Optional I/O card (one of 2 possible) with 10 Gb optical link
– Local DC-DC regulators (6 required, 8 with redundancy)
Blue Gene Interconnection Networks
Optimized for parallel programming and scalable management.

3D Torus (see the neighbor sketch after this list)
– Interconnects all compute nodes (65,536)
– Virtual cut-through hardware routing
– 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
– Communications backbone for computations
– 0.7/1.4 TB/s bisection bandwidth, 67 TB/s total bandwidth
Global Collective Network
– One-to-all broadcast functionality
– Reduction operations functionality
– 2.8 Gb/s of bandwidth per link; one-way global latency 2.5 µs
– ~23 TB/s total bandwidth (64k machine)
– Interconnects all compute and I/O nodes (1024)
Low-Latency Global Barrier and Interrupt
– Round-trip latency 1.3 µs
Control Network
– Boot, monitoring, and diagnostics
Ethernet
– Incorporated into every node ASIC
– Active in the I/O nodes (1:64)
– All external comm. (file I/O, control, user interaction, etc.)
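As an illustration of the torus topology, here is a sketch of how a node's six neighbors follow from its (x, y, z) coordinates with periodic wrap-around. The 32x32x64 dimensions match a 65,536-node machine, but the linear rank layout is an assumption made for the sketch:

#include <stdio.h>

/* Example torus dimensions for a 64k-node machine (32 x 32 x 64). */
enum { DX = 32, DY = 32, DZ = 64 };

/* Illustrative linearization of (x, y, z) to a node rank. */
static int rank_of(int x, int y, int z) {
    return (z * DY + y) * DX + x;
}

/* Fill out[6] with the ranks of the +/-x, +/-y, +/-z neighbours.
   The modulo arithmetic implements the torus wrap-around links. */
static void torus_neighbors(int x, int y, int z, int out[6]) {
    out[0] = rank_of((x + 1) % DX, y, z);
    out[1] = rank_of((x + DX - 1) % DX, y, z);
    out[2] = rank_of(x, (y + 1) % DY, z);
    out[3] = rank_of(x, (y + DY - 1) % DY, z);
    out[4] = rank_of(x, y, (z + 1) % DZ);
    out[5] = rank_of(x, y, (z + DZ - 1) % DZ);
}

int main(void) {
    int n[6];
    torus_neighbors(0, 0, 0, n);  /* corner node wraps to the far faces */
    for (int i = 0; i < 6; i++) printf("neighbor %d: rank %d\n", i, n[i]);
    return 0;
}

Each node thus has 6 neighbors reached over 12 unidirectional links (send and receive per neighbor), which is where the “12 node links” figure above comes from.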
Source: Kirk Borne, Data Science Challenges from Distributed Petabyte Astronomical Data Collections: Preparing for the Data Avalanche through Persistence, Parallelization, and Provenance
Blue Gene Architecture in Review
Blue Gene is not just FLOPs …
… it is also the torus network, power efficiency, and dense packaging. The focus on scalability rather than configurability gives the Blue Gene family's system-on-a-chip architecture unprecedented scalability and reliability.
Thought Experiment: A Blue Gene Active Storage Machine

• Integrate significant storage class memory (SCM) at each node.
• For now, Flash memory, with function perhaps similar to a Fusion-io ioDrive Duo.
• Future systems may deploy Phase Change Memory (PCM), Memristor, or …?
• Assume node density drops by 50% → 512 nodes/rack for embedded apps.
• Objective: balance Flash bandwidth against network all-to-all throughput.
• Resulting system attributes:
  • Rack: 0.5 petabyte, 512 Blue Gene processors, and embedded torus network.
  • 700 GB/s I/O bandwidth to Flash – competitive with ~70 large disk controllers.
  • An order of magnitude less space and power than equivalent performance via a disk solution.
  • Can configure fewer disk controllers and optimize them for archival use.
• With network all-to-all throughput at 1 GB/s per node, anticipate:
  • 1 TB sort from/to persistent storage in on the order of 10 seconds.
  • 130 million IOPS per rack, 700 GB/s I/O bandwidth.
• Inherit Blue Gene attributes: scalability, reliability, power efficiency.
• Research challenges (list not exhaustive):
  • Packaging – can the integration succeed?
  • Resilience – storage, network, system management, middleware.
  • Data management – need a clear split between on-line and archival data.
  • Data structures and algorithms can take specific advantage of the BGAS architecture – no one cares that it is not x86, since the software is embedded in storage.
• Related work:
  • Gordon (UCSD) http://nvsl.ucsd.edu/papers/Asplos2009Gordon.pdf
  • FAWN (CMU) http://www.cs.cmu.edu/~fawnproj/papers/fawn-sosp2009.pdf
  • RAMCloud (Stanford) http://www.stanford.edu/~ouster/cgi-bin/papers/ramcloud.pdf

| ioDrive Duo | One board | 512-node rack |
|---|---|---|
| SLC NAND capacity | 320 GB | 160 TB |
| Read BW (64K) | 1,450 MB/s | 725 GB/s |
| Write BW (64K) | 1,400 MB/s | 700 GB/s |
| Read IOPS (4K) | 270,000 | 138 Mega |
| Write IOPS (4K) | 257,000 | 131 Mega |
| Mixed R/W IOPS (75/25 @ 4K) | 207,000 | 105 Mega |
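The rack column of the table is simply the one-board figures multiplied by 512 nodes (with binary-prefix rounding for capacity and bandwidth). A minimal sketch of that scaling, using the board numbers quoted above:

#include <stdio.h>

int main(void) {
    const int nodes = 512;                 /* assumed nodes per rack */

    /* One ioDrive Duo board, figures quoted in the table above. */
    const double cap_gb    = 320.0;
    const double read_mbs  = 1450.0;
    const double read_iops = 270000.0;

    printf("rack capacity : %.0f TB\n",   cap_gb    * nodes / 1024.0); /* 160 */
    printf("rack read BW  : %.0f GB/s\n", read_mbs  * nodes / 1024.0); /* 725 */
    printf("rack read IOPS: %.0f M\n",    read_iops * nodes / 1e6);    /* 138 */
    return 0;
}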
From Individual Transistors to the Globe

Energy-consumption issues (and thermal issues) propagate through all hardware levels.
Energy Consumption of Datacenters Today

Current air-cooled datacenters are extremely inefficient: cooling needs as much energy as the IT equipment, and both end up as waste heat.
Provocative: a datacenter is a huge “heater with integrated logic”.
For a 10 MW datacenter, US$ 3-5M is wasted per year.
Source: APC, White Paper #154 (2008)
Hot-Water-Cooled Datacenters – Towards Zero Emission

[Diagram:] CMOS at 80°C → micro-channel liquid coolers → water at 60°C → heat exchanger → direct “waste”-heat usage, e.g. heating.
Paradigm Change: Moore's Law Goes 3D

From system-on-chip and multi-chip design to 3D integration: global wire-length reduction (Meindl et al., 2005); the brain's synapse network is the density benchmark.
Benefits:
– High core-cache bandwidth
– Separation of technologies
– Reduction in wire length
– Equivalent to two generations of scaling
– No impact on software development
Scalable Heat Removal by Interlayer Cooling

3D integration requires (scalable) interlayer liquid cooling.
Challenge: isolate the electrical interconnects from the liquid.
– Through-silicon-via electrical bonding and water insulation scheme
A large fraction of the energy in computers is spent on data transport, so shrinking computers saves energy.
[Figures: cross-section through fluid port and cavities; test vehicle with fluid manifold and connection; microchannel and pin-fin structures.]
On the Cube Road

Paradigm changes:
– Energy will cost more than servers.
– Coolers are a million-fold larger than transistors.
Moore's law goes 3D:
– Single-layer scaling slows down.
– Stacking of layers allows an extension of Moore's law.
– Approaching the functional density of the human brain.
Future computers will look different:
– Liquid cooling and heat re-use, e.g. Aquasar.
– Interlayer-cooled 3D chip stacks.
– Smarter energy by bionic designs.
Energy aspects are key:
– Cooling, power delivery, photonics.
– Shrink a rack to a “sugar cube”: 50x efficiency.
Thank you very much for your attention.