© 2011 IBM Corporation
High Performance Computing Challenges on the Road to Exascale Computing
April 2011
H. J. Schick IBM Germany Research & Development GmbH
Agenda
Introduction
The What and Why of High Performance Computing
Exascale Challenges
Balanced Systems
Blue Gene Architecture and Blue Gene Active Storage
Supercomputers in a Sugar Cube
Origin of the “Jugene” Supercomputer
Supercomputer Satisfies Need for FLOPS
FLOPS = FLoating point OPerations per Second.
– Mega = 10^6, Giga = 10^9, Tera = 10^12, Peta = 10^15, Exa = 10^18
Simulation is a major application area.
Many simulations are based on the notion of a “timestep”.
– At each timestep, advance the constituent parts according to their physics or chemistry.
– Example challenge: molecular dynamics has a picosecond (10^-12 s) timescale, but many biological processes have a millisecond (10^-3 s) timescale.
• The simulation needs 10^9 timesteps!
Each timestep requires many operations!
Simulation pseudo-code:

double time = 0.0;               // simulated time in seconds
const double dt = 1.0e-12;       // one picosecond per timestep
// Initialize state of atoms.
while (time < 1.0e-3) {          // run to one millisecond
    // Calculate forces on 40,000 atoms.
    // Calculate velocities of all atoms.
    // Advance position of all atoms.
    time = time + dt;            // 10^9 iterations in total
}
// Write biology result.
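To make the commented steps concrete, here is a minimal, illustrative C sketch of one timestep using velocity-Verlet integration and a Lennard-Jones force in reduced units. The potential, the constants, and the O(N^2) force loop are assumptions for the sketch, not the method of any particular Blue Gene code.

#define N 40000                       /* atom count from the slide */

typedef struct { double x, y, z; } vec3;

static vec3 pos[N], vel[N], force[N];
static const double dt   = 1.0;       /* one timestep, reduced units */
static const double mass = 1.0;       /* illustrative unit mass */

/* Pairwise Lennard-Jones forces; O(N^2) for clarity. Production codes
   use neighbour or cell lists and parallelize this loop. */
static void compute_forces(void) {
    for (int i = 0; i < N; i++) force[i] = (vec3){0.0, 0.0, 0.0};
    for (int i = 0; i < N; i++) {
        for (int j = i + 1; j < N; j++) {
            vec3 d = { pos[i].x - pos[j].x,
                       pos[i].y - pos[j].y,
                       pos[i].z - pos[j].z };
            double r2   = d.x*d.x + d.y*d.y + d.z*d.z;
            double inv6 = 1.0 / (r2 * r2 * r2);
            /* force magnitude / r for U = 4(r^-12 - r^-6) */
            double f = 24.0 * inv6 * (2.0 * inv6 - 1.0) / r2;
            force[i].x += f * d.x;  force[j].x -= f * d.x;
            force[i].y += f * d.y;  force[j].y -= f * d.y;
            force[i].z += f * d.z;  force[j].z -= f * d.z;
        }
    }
}

/* One velocity-Verlet timestep: half-kick, drift, new forces, half-kick.
   Call compute_forces() once before the first step. */
static void timestep(void) {
    for (int i = 0; i < N; i++) {
        vel[i].x += 0.5 * dt * force[i].x / mass;
        vel[i].y += 0.5 * dt * force[i].y / mass;
        vel[i].z += 0.5 * dt * force[i].z / mass;
        pos[i].x += dt * vel[i].x;
        pos[i].y += dt * vel[i].y;
        pos[i].z += dt * vel[i].z;
    }
    compute_forces();
    for (int i = 0; i < N; i++) {
        vel[i].x += 0.5 * dt * force[i].x / mass;
        vel[i].y += 0.5 * dt * force[i].y / mass;
        vel[i].z += 0.5 * dt * force[i].z / mass;
    }
}

Run 10^9 of these steps and it is the total operation count, not the per-step cost, that demands supercomputer-class FLOPS.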
Supercomputing is Capability Computing
A single instance of an application using large, tightly-coupled computer resources.
– For example, a single 1000-year climate simulation.
Contrast to Capacity Computing:
– Many instances of one or more applications using large, loosely-coupled computer resources.
– For example, 1000 independent 1-year climate simulations.
– Often trivial parallelism.
– Often suited for GRID or SETI@Home-style systems.
Supercomputer Versus Your Desktop
Assume a 2000-processor supercomputer delivers a simulation result in 1 day.
Assuming memory size is not a problem, your 1-processor desktop would deliver the same result in 2000 days ≈ 5.5 years.
So supercomputers make results available on a human timescale.
But what could you do if all objects were intelligent…
…and connected?
What could you do with unlimited computing power… for pennies?
Could you predict the path of a storm down to the square kilometer? Could you identify another 20% of proven oil reserves without drilling one hole?
Grand Challenges
“A grand challenge is a fundamental problem in science or
engineering, with broad applications, whose solution would be
enabled by the application of high performance computing resources
that could become available in the near future.”
Computational fluid dynamics:
• Design of hypersonic aircraft, efficient automobile bodies, and extremely quiet submarines.
• Weather forecasting for short- and long-term effects.
• Efficient recovery of oil, and many other applications.
Electronic structure calculations for the design of new materials:
• Chemical catalysts
• Immunological agents
• Superconductors
Calculations to understand the fundamental nature of matter:
• Quantum chromodynamics
• Condensed matter theory
Enough Atoms to See Grains in Solidification of Metal http://www-phys.llnl.gov/Research/Metals_Alloys/news.html
Building Blocks of Matter
QPACE = QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
Quarks are the constituents of matter; they interact strongly by exchanging gluons.
Particular phenomena:
– Confinement
– Asymptotic freedom (Nobel Prize 2004)
Theory of strong interactions = Quantum Chromodynamics (QCD)
Projected Performance Development
Almost a doubling every year!
Extrapolating an Exaflop in 2018
Standard technology scaling will not get us there in 2018.

| Metric | Blue Gene/L (2005) | Exaflop, directly scaled | Exaflop compromise (traditional technology) | Assumption for “compromise guess” |
|---|---|---|---|---|
| Node peak performance | 5.6 GF | 20 TF | 20 TF | Same node count (64k). |
| Hardware concurrency/node | 2 | 8,000 | 1,600 | Assume 3.5 GHz. |
| System power in compute chips | 1 MW | 3.5 GW | 25 MW | Expected from technology improvement through 4 technology generations (only compute-chip power scaling; I/Os scaled the same way). |
| Link bandwidth (each unidirectional 3-D link) | 1.4 Gbps | 5 Tbps | 1 Tbps | Not possible to maintain the bandwidth ratio. |
| Wires per unidirectional 3-D link | 2 | 400 | 80 | A large wire count would eliminate high density and drive links onto cables, where they are 100x more expensive; assume 20 Gbps signaling. |
| Pins in network on node | 24 | 5,000 | 1,000 | 20 Gbps differential assumed; 20 Gbps over copper is limited to 12 inches, so in-rack interconnects will need optics (10 Gbps is possible today in both copper and optics). |
| Power in network | 100 kW | 20 MW | 4 MW | 10 mW/Gbps assumed. Today: 25 mW/Gbps for long distance (more than 2 feet on copper), both ends, one direction; 45 mW/Gbps for optics, both ends, one direction, plus 15 mW/Gbps electrical. Future: links separately optimized for power. |
| Memory bandwidth/node | 5.6 GB/s | 20 TB/s | 1 TB/s | Not possible to maintain external bandwidth per FLOP. |
| L2 cache/node | 4 MB | 16 GB | 500 MB | About 6-7 technology generations with expected eDRAM density improvements. |
| Data pins for memory/node | 128 | 40,000 | 2,000 | 3.2 Gbps per pin. |
| Power in memory I/O (not DRAM) | 12.8 kW | 80 MW | 4 MW | 10 mW/Gbps assumed; most current power is in the address bus. Future: probably about 15 mW/Gbps, maybe 10 mW/Gbps (2.5 mW/Gbps is C·V²·f for random data on the data pins); address power is higher. |
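A quick way to see where the “directly scaled” column comes from: multiply each Blue Gene/L per-node figure by the node-performance ratio 20 TF / 5.6 GF ≈ 3,571x. A minimal C sketch of that arithmetic, with the table's figures hard-coded (the 3.6 GW it prints matches the table's 3.5 GW up to rounding):

#include <stdio.h>

int main(void) {
    /* Blue Gene/L (2005) figures from the table above. */
    const double node_perf_gf  = 5.6;   /* GF per node                  */
    const double link_bw_gbps  = 1.4;   /* per unidirectional 3-D link  */
    const double mem_bw_gbs    = 5.6;   /* GB/s per node                */
    const double chip_power_mw = 1.0;   /* MW, whole 64k-node system    */

    /* Target: 20 TF per node at unchanged node count. */
    const double scale = 20.0e3 / node_perf_gf;          /* ~3571x */

    printf("scale factor      : %.0fx\n", scale);
    printf("link bandwidth    : %.1f Tbps\n", link_bw_gbps * scale / 1e3);
    printf("memory bandwidth  : %.1f TB/s\n", mem_bw_gbs   * scale / 1e3);
    printf("compute-chip power: %.1f GW\n",   chip_power_mw * scale / 1e3);
    return 0;
}

The “compromise” column then shows how far each directly scaled number must be cut back to stay buildable.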
The Big Leap from Petaflops to Exaflops
We will hit 20 Petaflops in 2011/2012; research is now beginning for an ~2018 Exascale system.
The IT/CMOS industry is trying to double performance every 2 years.
The HPC industry is trying to double performance every year.
Technology disruptions in many areas:
– BAD NEWS: Scalability of current technologies?
• Silicon power, interconnect, memory, packaging.
– GOOD NEWS: Emerging technologies?
• Memory technologies (e.g. storage class memory), 3D chips, etc.
Exploiting exascale machines:
– Want to maximize science output per €.
– Need multiple partner applications to evaluate HW trade-offs.
Exascale Challenges – Energy
Power consumption will increase in the future!
What is the critical limit?
– JSC has 5 MW today, with the potential for 10 MW.
– 1 MW costs roughly 1 M€ per year (see the sketch below).
– 20 MW is expected to be the critical limit.
Are exascale systems a Large Scale Facility?
– The LHC uses 100 MW.
Energy efficiency:
– Cooling uses a significant fraction of the power (PUE > 1.2 today; the goal is 1.0).
– Hot cooling water (40°C and more) might help.
– Free cooling: use outside air to cool the water.
– Heat recycling: use waste heat for heating, cooling, etc.
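The “1 MW ≈ 1 M€ per year” rule of thumb falls out of a simple calculation. In this sketch the electricity price and the PUE value are illustrative assumptions, not figures from the slide:

#include <stdio.h>

int main(void) {
    const double it_power_mw    = 1.0;    /* IT load in MW                 */
    const double pue            = 1.2;    /* facility power / IT power     */
    const double eur_per_kwh    = 0.10;   /* assumed electricity price     */
    const double hours_per_year = 24.0 * 365.0;

    double facility_kw = it_power_mw * 1000.0 * pue;
    double cost_eur    = facility_kw * hours_per_year * eur_per_kwh;

    /* 1 MW of IT at PUE 1.2 and 0.10 EUR/kWh: ~1.05 M EUR per year. */
    printf("annual energy cost: %.2f M EUR\n", cost_eur / 1e6);
    return 0;
}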
Exascale Challenges – Resiliency
Ever-increasing number of components:
– O(10,000) nodes
– O(100,000) DIMMs of RAM
Each component's MTBF will not increase:
– Optimistic: it remains constant.
– Realistic: smaller structures and lower voltages → it decreases.
The global MTBF will decrease:
– What is the critical limit? 1 day? 1 hour? The time to write a checkpoint! (See the sketch below.)
How to handle failures:
– Try to anticipate failures via monitoring.
– Software must help to handle failures:
• checkpoints, process migration, transactional computing
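A minimal sketch of the scaling argument: with N independent components, the system MTBF is roughly the component MTBF divided by N, and Young's first-order formula gives a near-optimal checkpoint interval of sqrt(2 × checkpoint-write time × system MTBF). The component MTBF and checkpoint-write time below are illustrative assumptions:

#include <stdio.h>
#include <math.h>

int main(void) {
    const double comp_mtbf_h  = 1.0e6;       /* assumed MTBF of one component, hours */
    const double n_components = 1.0e5;       /* O(100,000) DIMMs, nodes, ...         */
    const double ckpt_write_h = 10.0 / 60.0; /* assume 10 min to write a checkpoint  */

    /* Independent failures: system MTBF ~ component MTBF / N. */
    double sys_mtbf_h = comp_mtbf_h / n_components;

    /* Young's formula: T_opt ~ sqrt(2 * ckpt_time * system MTBF). */
    double t_opt_h = sqrt(2.0 * ckpt_write_h * sys_mtbf_h);

    printf("system MTBF        : %.1f hours\n", sys_mtbf_h); /* 10.0 h  */
    printf("checkpoint interval: %.1f hours\n", t_opt_h);    /* ~1.8 h  */
    return 0;
}

Once the checkpoint interval approaches the time needed to write the checkpoint itself, the machine makes no forward progress, which is exactly the critical limit the slide asks about.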
Exascale Challenges – Applications
Ever-increasing levels of parallelism:
– Thousands of nodes, hundreds of cores, dozens of registers
– Automatic parallelization vs. explicit exposure
– How large are coherency domains?
– How many languages do we have to learn?
MPI + X is most probably not sufficient:
– 1 process per core makes orchestration of processes harder.
– GPUs require explicit handling today (CUDA, OpenCL).
What is the future paradigm? (A minimal MPI + X sketch follows below.)
– MPI + X + Y? PGAS + X (+ Y)?
– PGAS: UPC, Co-Array Fortran, X10, Chapel, Fortress, …
Which applications are inherently scalable enough at all?
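For concreteness, here is a minimal “MPI + X” sketch with X = OpenMP: one MPI process per node and threads filling the cores inside it. This is only the common hybrid pattern, not a recommendation made by the slide:

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    int provided, rank, nranks;

    /* Ask MPI to coexist with threads (main thread makes MPI calls). */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* One MPI rank per node; OpenMP threads exploit the cores. */
    #pragma omp parallel
    {
        #pragma omp single
        printf("rank %d of %d runs %d threads\n",
               rank, nranks, omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}

Even this two-level scheme already forces the programmer to reason about two domains of parallelism, which is the orchestration problem the slide points at.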
Balanced Systems

Example caxpy (complex a·x + y) – sustained balance per processor:

| Processor | FPU throughput [FLOPS/cycle] | Memory bandwidth [words/cycle] | Balance [FLOPS/word] |
|---|---|---|---|
| apeNEXT | 8 | 2 | 4 |
| QCDOC (MM) | 2 | 0.63 | 3.2 |
| QCDOC (LS) | 2 | 2 | 1 |
| Xeon | 2 | 0.29 | 7 |
| GPU | 128 x 2 | 17.3 (*) | 14.8 |
| Cell/B.E. (MM) | 8 x 4 | 1 | 32 |
| Cell/B.E. (LS) | 8 x 4 | 8 x 4 | 1 |

(MM = main memory, LS = local store.)
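The balance column is the machine's FLOPS per cycle divided by its words per cycle. The kernel itself offers roughly 8 FLOPS per 6 words moved (≈ 1.3 FLOPS/word), so any machine whose balance number exceeds that leaves FPU cycles idle waiting for memory. A plain-C sketch of the kernel:

#include <complex.h>

/* caxpy: y <- a*x + y for complex single-precision vectors.
   Per element: 8 FLOPS (6 for the complex multiply, 2 for the add)
   against 6 words moved (load x, load y, store y; 2 words each),
   i.e. about 1.3 FLOPS per word of memory traffic. */
void caxpy(int n, float complex a,
           const float complex *x, float complex *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}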
Balanced Systems ???
… but are they Reliable, Available and Serviceable ???
Blue Gene/P

Chip: 4 processors, 13.6 GF/s, 8 MB eDRAM
Compute card: 1 chip, 20 DRAMs, 13.6 GF/s, 2.0 GB DDR2 (4.0 GB from 6/30/08)
Node card: 32 compute cards, 0-1 I/O cards, 435 GF/s, 64 (128) GB (32 chips, 4x4x2)
Rack: 32 node cards, 13.9 TF/s, 2 (4) TB
System: 72 racks (72x32x32, cabled 8x8x16), 1 PF/s, 144 (288) TB
Blue Gene/P Compute ASIC

[Block diagram.] Key elements:
– 4x PPC450 cores, each with 32 KB I1 / 32 KB D1 caches and a Double FPU
– L2 with snoop filter per core; shared SRAM; multiplexing switches; DMA engine feeding the networks
– 2x 4 MB eDRAM, usable as L3 cache or on-chip memory (512b data + 72b ECC paths, shared L3 directory with ECC)
– 2x DDR-2 controllers with ECC; 13.6 GB/s DDR-2 DRAM bus
– Torus network: 6 links, 3.4 Gb/s bidirectional each
– Collective network: 3 links, 6.8 Gb/s bidirectional each
– Global barrier network: 4 global barriers or interrupts
– 10 Gbit Ethernet; JTAG access
Blue Gene/P Compute Card
[Photo with callouts:]
– 2 x 16GB interface to 2 or 4 GB SDRAM-DDR2
– BGQ ASIC, 29 mm x 29 mm FC-PBGA
– NVRAM, monitors, decoupling, Vtt termination
– All network and I/O, power input
Blue Gene/P Node Board

[Photo with callouts:]
– 32 compute nodes
– Optional I/O card (one of 2 possible) with 10 Gb optical link
– Local DC-DC regulators (6 required, 8 with redundancy)
Blue Gene Interconnection Networks
Optimized for parallel programming and scalable management.

3D Torus (see the neighbor sketch after this list)
– Interconnects all compute nodes (65,536)
– Virtual cut-through hardware routing
– 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
– Communications backbone for computations
– 0.7/1.4 TB/s bisection bandwidth, 67 TB/s total bandwidth
Global Collective Network
– One-to-all broadcast functionality
– Reduction operations functionality
– 2.8 Gb/s of bandwidth per link; one-way global latency 2.5 µs
– ~23 TB/s total bandwidth (64k machine)
– Interconnects all compute and I/O nodes (1024)
Low-Latency Global Barrier and Interrupt
– Round-trip latency 1.3 µs
Control Network
– Boot, monitoring, and diagnostics
Ethernet
– Incorporated into every node ASIC
– Active in the I/O nodes (1:64)
– All external comm. (file I/O, control, user interaction, etc.)
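As an illustration of the torus topology, here is a sketch of how a node's six neighbors follow from its (x, y, z) coordinates with periodic wrap-around. The 32x32x64 dimensions match a 65,536-node machine, but the linear rank layout is an assumption made for the sketch:

#include <stdio.h>

/* Example torus dimensions for a 64k-node machine (32 x 32 x 64). */
enum { DX = 32, DY = 32, DZ = 64 };

/* Illustrative linearization of (x, y, z) to a node rank. */
static int rank_of(int x, int y, int z) {
    return (z * DY + y) * DX + x;
}

/* Fill out[6] with the ranks of the +/-x, +/-y, +/-z neighbours.
   The modulo arithmetic implements the torus wrap-around links. */
static void torus_neighbors(int x, int y, int z, int out[6]) {
    out[0] = rank_of((x + 1) % DX, y, z);
    out[1] = rank_of((x + DX - 1) % DX, y, z);
    out[2] = rank_of(x, (y + 1) % DY, z);
    out[3] = rank_of(x, (y + DY - 1) % DY, z);
    out[4] = rank_of(x, y, (z + 1) % DZ);
    out[5] = rank_of(x, y, (z + DZ - 1) % DZ);
}

int main(void) {
    int n[6];
    torus_neighbors(0, 0, 0, n);  /* corner node wraps to the far faces */
    for (int i = 0; i < 6; i++) printf("neighbor %d: rank %d\n", i, n[i]);
    return 0;
}

Each node thus has 6 neighbors reached over 12 unidirectional links (send and receive per neighbor), which is where the “12 node links” figure above comes from.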
Source: Kirk Borne, Data Science Challenges from Distributed Petabyte Astronomical Data Collections: Preparing for the Data Avalanche through Persistence, Parallelization, and Provenance
Blue Gene Architecture in Review
Blue Gene is not just FLOPs …
… it is also the torus network, power efficiency, and dense packaging. The focus on scalability rather than configurability gives the Blue Gene family's system-on-a-chip architecture unprecedented scalability and reliability.
Thought Experiment: A Blue Gene Active Storage Machine

• Integrate significant storage class memory (SCM) at each node.
• For now, Flash memory, with function perhaps similar to a Fusion-io ioDrive Duo.
• Future systems may deploy Phase Change Memory (PCM), Memristor, or …?
• Assume node density drops by 50% → 512 nodes/rack for embedded apps.
• Objective: balance Flash bandwidth against network all-to-all throughput.
• Resulting system attributes:
  • Rack: 0.5 petabyte, 512 Blue Gene processors, and embedded torus network.
  • 700 GB/s I/O bandwidth to Flash – competitive with ~70 large disk controllers.
  • An order of magnitude less space and power than equivalent performance via a disk solution.
  • Can configure fewer disk controllers and optimize them for archival use.
• With network all-to-all throughput at 1 GB/s per node, anticipate:
  • 1 TB sort from/to persistent storage in on the order of 10 seconds.
  • 130 million IOPS per rack, 700 GB/s I/O bandwidth.
• Inherit Blue Gene attributes: scalability, reliability, power efficiency.
• Research challenges (list not exhaustive):
  • Packaging – can the integration succeed?
  • Resilience – storage, network, system management, middleware.
  • Data management – need a clear split between on-line and archival data.
  • Data structures and algorithms can take specific advantage of the BGAS architecture – no one cares that it is not x86, since the software is embedded in storage.
• Related work:
  • Gordon (UCSD) http://nvsl.ucsd.edu/papers/Asplos2009Gordon.pdf
  • FAWN (CMU) http://www.cs.cmu.edu/~fawnproj/papers/fawn-sosp2009.pdf
  • RAMCloud (Stanford) http://www.stanford.edu/~ouster/cgi-bin/papers/ramcloud.pdf

| ioDrive Duo | One board | 512-node rack |
|---|---|---|
| SLC NAND capacity | 320 GB | 160 TB |
| Read BW (64K) | 1,450 MB/s | 725 GB/s |
| Write BW (64K) | 1,400 MB/s | 700 GB/s |
| Read IOPS (4K) | 270,000 | 138 Mega |
| Write IOPS (4K) | 257,000 | 131 Mega |
| Mixed R/W IOPS (75/25 @ 4K) | 207,000 | 105 Mega |
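The rack column of the table is simply the one-board figures multiplied by 512 nodes (with binary-prefix rounding for capacity and bandwidth). A minimal sketch of that scaling, using the board numbers quoted above:

#include <stdio.h>

int main(void) {
    const int nodes = 512;                 /* assumed nodes per rack */

    /* One ioDrive Duo board, figures quoted in the table above. */
    const double cap_gb    = 320.0;
    const double read_mbs  = 1450.0;
    const double read_iops = 270000.0;

    printf("rack capacity : %.0f TB\n",   cap_gb    * nodes / 1024.0); /* 160 */
    printf("rack read BW  : %.0f GB/s\n", read_mbs  * nodes / 1024.0); /* 725 */
    printf("rack read IOPS: %.0f M\n",    read_iops * nodes / 1e6);    /* 138 */
    return 0;
}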
From Individual Transistors to the Globe

Energy-consumption issues (and thermal issues) propagate through all hardware levels.
Energy Consumption of Datacenters Today

Current air-cooled datacenters are extremely inefficient: cooling needs as much energy as the IT equipment, and both end up as waste heat.
Provocative: a datacenter is a huge “heater with integrated logic”.
For a 10 MW datacenter, US$ 3-5M is wasted per year.
Source: APC, White Paper #154 (2008)
Hot-Water-Cooled Datacenters – Towards Zero Emission

[Diagram:] CMOS at 80°C → micro-channel liquid coolers → water at 60°C → heat exchanger → direct “waste”-heat usage, e.g. heating.
Paradigm Change: Moore's Law Goes 3D

From system-on-chip and multi-chip design to 3D integration: global wire-length reduction (Meindl et al., 2005); the brain's synapse network is the density benchmark.
Benefits:
– High core-cache bandwidth
– Separation of technologies
– Reduction in wire length
– Equivalent to two generations of scaling
– No impact on software development
Scalable Heat Removal by Interlayer Cooling

3D integration requires (scalable) interlayer liquid cooling.
Challenge: isolate the electrical interconnects from the liquid.
– Through-silicon-via electrical bonding and water insulation scheme
A large fraction of the energy in computers is spent on data transport, so shrinking computers saves energy.
[Figures: cross-section through fluid port and cavities; test vehicle with fluid manifold and connection; microchannel and pin-fin structures.]
On the Cube Road

Paradigm changes:
– Energy will cost more than servers.
– Coolers are a million-fold larger than transistors.
Moore's law goes 3D:
– Single-layer scaling slows down.
– Stacking of layers allows an extension of Moore's law.
– Approaching the functional density of the human brain.
Future computers will look different:
– Liquid cooling and heat re-use, e.g. Aquasar.
– Interlayer-cooled 3D chip stacks.
– Smarter energy by bionic designs.
Energy aspects are key:
– Cooling, power delivery, photonics.
– Shrink a rack to a “sugar cube”: 50x efficiency.
Thank you very much for your attention.