

Sung Jong Lee (sjree@suwon.ac.kr)

Dept. of Physics, University of Suwon

Challenges in Parallel Super-computing

2011 1st KIAS Parallel Computation Workshop

February 22-23, 2011

Contents

• Brief History of Supercomputing
• www.top500.org
• Grand Challenge Problems
• Present Machine's Characteristics
• Challenges of Exascale Supercomputing
• Summary
• References

Automobile Crash Simulations at Audi

• A virtual car undergoes 100,000 crash simulations (48 months) before the first prototype is built. Then real crash tests are conducted.
• The Audi supercomputer ranks 260th among the top 500 supercomputers (Nov. 2010).

Performance Measure

Megaflops (MF/s) = 10^6 flops
Gigaflops (GF/s) = 10^9 flops
Teraflops (TF/s) = 10^12 flops
Petaflops (PF/s) = 10^15 flops
Exaflops (EF/s) = 10^18 flops
Zettaflops (ZF/s) = 10^21 flops
Yottaflops (YF/s) = 10^24 flops

Flops = Floating-Point Operations / Second

Milestones in Supercomputing

• GigaFlops: M13, Scientific Research Institute of Computer Complexes, Moscow (1984)
• TeraFlops: ASCI Red, Sandia National Lab. (1996)
• PetaFlops: Roadrunner, Los Alamos National Lab. (2008)

History

• Alan Turing (1912-1954)
  * Turing-Welchman Bombe (1938)
  * Used for breaking the German Enigma, etc.

History

• Seymour Cray (1925-1996)
  – Developed the CDC 1604, the first fully transistorized supercomputer (1958)
  – CDC 6600 (1965), 9 MFlops
  – Founded Cray Research in 1972
    • CRAY-1 (1976), 160 MFlops
    • CRAY-2 (1985)
    • CRAY-3 (1989)

Supercomputers in the USSR (Ukraine)

• M13, Scientific Research Institute of Computer Complexes, Moscow (1984)
• 2.4 Gigaflops
• Led by Mikhail A. Kartsev, developer of supercomputers for space observation

Mikhail A. Kartsev, 1923~1983 (?)

Architectures: Shared vs. Distributed

• Shared memory
  - Easy programming: one global memory
  - Bottleneck at the memory
• Distributed memory
  - Message passing: Send/Recv
  - Scalability
  - Programming: not easy

Architectural Transitions

• Vector Processors (70s ~ 90s): Cray-1, Cray-2, CRAY-XMP, CRAY-YMP, SX-2, VP-200, etc.
• Massively Parallel Processors (90s ~ 2000): Cray-T3E, CM5, VPP-500, nCUBE, SP2, PARAGON
• Clusters (2000 ~ )
• Multicore Processors (2003? ~ )

Cray-1 (1976)

Installed at Los Alamos National Lab.

$8.8 Million
Performance: 160 MFlops
Main Memory: 8 MB

Present Architectural Trends

* Transition to simplicity and parallelism, driven by three trends:

1) Single-processor performance is no longer improving significantly.
   - Explicit parallelism is the only way to increase performance.
2) Constant field scaling has come to an end.
   - Threshold voltage cannot be reduced (due to leakage current).
   - New processors are simpler (better performance per unit power).
3) The increase in main memory latency and the decrease in main memory bandwidth, relative to processor cycle time and execution rate, continue. Memory bandwidth and latency become the performance-limiting factor!


Multi-Core Processors

• Three classes of multi-core die microarchitectures

Recent Multi-core CPUs

Tilera's TILE-Gx CPU:
• 100 cores
• Performance: 750 × 10^9 32-bit ops
• Power consumption: 10~55 W
• Memory bandwidth: 500 Gb/s

Intel's CPU with 48 cores:
• Performance:
• Power consumption: 25 W ~ 125 W
• On-die power management
• Clock speed: 1.66~1.83 GHz
• Memory bandwidth:

GPU

Nvidia Tesla M2050/M2070 GPU:
• 448 CUDA cores
• 3 GB / 6 GB GDDR5 memory
• Power consumption: 225 W
• Memory bandwidth: 148 GB/s
• Performance: 515 Gflops (double precision)

State of the Art Summary

* 50 years of reliance on the von Neumann model:

1) Split between memory and CPUs, sequential threads, a model of sequential execution.
2) Memory Wall: the performance of memory has not kept up with the improvement in CPU clock rates, leading to multi-level caches and deep memory hierarchies. Complexity increases when multiple CPUs attempt to share the same memory.

CPU and Memory Cycle Time Trend

* DARPA report, 2008, p103

State of the Art Summary (2)

3) Power Wall: the rise of power as a first-class constraint. The concomitant flattening of CPU clock rates has led to multi-cores. Already several hundred cores on a die; thousands of cores on a die are expected. But more cores demand more memory bandwidth for memory access, and this is not possible due to power concerns.
4) Attempts to modify the von Neumann model by blurring the boundary between memory and processing logic.

TOP 500 Supercomputers (Nov 2010) (http://www.top500.org)

TOP Supercomputers (Nov 2010)

• 7 systems exceed 1 PFlop/s
• Entry level to the Top 10: 0.8 PFlop/s
• Entry level to the Top 100: 76 TFlop/s
• Entry level to the Top 500: 31.1 TFlop/s

Rmax: maximal LINPACK performance achieved
Rpeak: theoretical peak performance

• Top 1: Tianhe-1A (NSC, China)
  - Rmax = 2.57 Petaflops, 186,368 cores
  - Main Memory = 229.4 TB
  - 14,336 Xeon X5670 processors, 7,168 Nvidia Tesla M2050 GPGPUs, and 2,048 NUDT FT1000 heterogeneous processors

• Top 2: Jaguar (Oak Ridge National Lab., USA)
  - Rmax = 1.76 Petaflops, 224,162 cores (Cray XT5-HE, Opteron 6-core 2.6 GHz)
  - Main Memory = Not Available

Clock Rate in the Top 10 Supercomputers

Processor Parallelism in the Top 10 Supercomputers


How About Korea?

• Haedam (19th) & Haeon (20th) (Korea Meteorological Administration): 316.40 Tflops (45,120 cores)
• TachyonII (24th) (KISTI): 274.80 Tflops (26,232 cores), Main Memory = 157.392 TByte

TachyonII and IBM p6

System Performance by Countries

Countries Share Over Time (1993~2010)

Architecture Share Over time (1993~2010)

Interconnect Family Share over Time (1993~2010)

Special Purpose Supercomputer

• Anton (D. E. Shaw Research Group, 2008)
• 512 processing nodes with 3D-torus hypercube topology
• Each node includes a special MD engine as a single ASIC
• Theoretical Maximum Performance = Flops
• Net Power Consumption = KW

KIAS Cluster Case

• 418 nodes (44,826 cores)
• Theoretical Maximum Performance = 67 TeraFlops
• Net Power Consumption = 154.8 KW
• This includes a GPU cluster with 24 nodes (43,008 cores), 49 TFlops

Grand Challenge Problems

* Astrophysics problems
* High Energy Physics, Nuclear Physics
* Materials Science: design of novel materials, quantum structure calculations
* Atmospheric Science: weather forecasting, etc.
* Fusion Research: magnetohydrodynamics of plasmas, etc.
* Macromolecular Structure Modeling and Dynamics: protein structures and folding dynamics

How does a protein physically fold from a denatured state into its native conformation?

Example : Protein Folding


Computational Load of Folding Simulation by Molecular Dynamics

• Suppose a protein + 1000 water molecules
  - approximately 3,000 atoms
• Integration time step = 10^-15 s
• # of long-range force calculations at each time step ~ 1000 × 1000 = 10^6
• Then, a one-millisecond (10^-3 s) simulation corresponds to 10^12 time steps × 10^6 = 10^18 calculations!!

This is an Exascale Problem!
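The arithmetic above can be reproduced in a few lines. This is only an illustrative sketch; the pair count and time step are taken directly from the slide's assumptions.

# Rough estimate of the computational load of a 1 ms folding simulation
# (sketch only, using the slide's assumptions)
force_pairs_per_step = 1000 * 1000   # ~10^6 long-range force calculations per step
dt = 1e-15                           # integration time step in seconds
t_total = 1e-3                       # total simulated time: one millisecond

n_steps = t_total / dt               # 10^12 integration steps
total_calculations = n_steps * force_pairs_per_step
print(f"steps: {n_steps:.0e}, total force calculations: {total_calculations:.0e}")
# -> steps: 1e+12, total force calculations: 1e+18  (an exascale workload)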

Example 2a: WRF (Weather Research and Forecast) Model: Full-Scale Nature Run

At present:

(1) 5 km × 5 km horizontal resolution, 101 vertical levels on the hemisphere → 2 × 10^9 cells
(2) time step = ? milliseconds
    --- 10 Teraflops on 10,000 5-Gflops nodes (2007)

If the resolution is ~1 km, then 5 × 10^10 cells.

If sustained at Exascale, it would require 10 PB of main memory, with I/O requirements up to 1000 times greater.
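As a quick scaling check (a sketch only): the cell count grows as the inverse square of the horizontal grid spacing. Note that using the Earth's full surface area, rather than a strict hemisphere, is the assumption that reproduces the cell counts quoted above.

import math

# Cell-count scaling for the WRF nature run (illustrative sketch)
R_earth_km = 6371.0
surface_km2 = 4 * math.pi * R_earth_km**2   # full-globe area ~5.1e8 km^2 (assumed;
                                            # it reproduces the quoted cell counts)
levels = 101                                # vertical levels, as quoted above

def cells(dx_km):
    # horizontal columns times vertical levels
    return surface_km2 / (dx_km * dx_km) * levels

print(f"5 km grid: {cells(5.0):.1e} cells")   # ~2.1e9, matching ~2*10^9
print(f"1 km grid: {cells(1.0):.1e} cells")   # ~5.2e10, matching ~5*10^10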

Types of Challenge Problems

* Parallelism :

(a) Embarrassingly Parallel Problems

(b) Coarse-Grained Problems

(c) Fine-Grained Problems

* Computation vs. Memory:

(a) CPU-intensive Problems: Molecular Dynamics

(b) Memory-intensive Problems: Bioinformatics, Data Analysis in Large-Data Experiments (High Energy Experiments)

DARPA Report on Exascale Computing Challenges (2008)

* Objective: Understand the course of mainstream technology and determine the primary challenges to reaching a 1000x increase in computing capability by 2015.

ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems

Peter Kogge, Editor & Study Lead

Four Main Challenges for Exascale Supercomputing

(1) The Energy and Power Challenge
(2) The Memory and Storage Challenge
(3) The Concurrency and Locality Challenge
(4) The Resiliency Challenge

* DARPA Report 2008

Energy and Power Challenge

• Power / Performance (average over the Top 10 sites)
  = 2.67 kW / Teraflops = 2.67 nJ/flop = 2.67 × 10^-9 J/flop

• Simple extension to Exascale: need around 1~2 Gigawatts for 1 Exaflops!
  ~ the capacity of a whole nuclear power plant!!
  e.g., the 21 nuclear power plants in KOREA produced ~19 GW of electricity in 2009.
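The extrapolation is simply linear in the quoted 2.67 kW/Teraflops figure; a minimal sketch:

# Naive linear extrapolation of today's power efficiency to an exaflop machine
kw_per_teraflop = 2.67                            # Top-10 average quoted above
joules_per_flop = kw_per_teraflop * 1e3 / 1e12    # = 2.67e-9 J/flop = 2.67 nJ/flop

exaflops = 1e18                                   # target: 1 Exaflops sustained
power_watts = joules_per_flop * exaflops
print(f"{joules_per_flop*1e9:.2f} nJ/flop -> {power_watts/1e9:.2f} GW for 1 Exaflops")
# -> 2.67 nJ/flop -> 2.67 GW, i.e. on the order of a whole nuclear power plant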

Power Consumption vs. Performance of the top 10 supercomputers (Nov. 2010)

Energy and Power Challenge (continued)

• Consider the case of a recent GPU, the Tesla M2050/70:
  Power / Performance ≈ 229 W / 515 Gigaflops
  This means around 400 MW for 1 Exaflops, solely for the processing units alone!!!!
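The same kind of back-of-envelope estimate for the GPU-only figure (again just a sketch using the numbers quoted above):

# Energy per flop of the Tesla M2050/70 and the implied exaflop power budget
gpu_watts, gpu_flops = 229.0, 515e9     # double-precision figures quoted above
j_per_flop = gpu_watts / gpu_flops      # ~0.44 nJ/flop
print(f"{j_per_flop*1e9:.2f} nJ/flop -> {j_per_flop*1e18/1e6:.0f} MW per Exaflops")
# -> 0.44 nJ/flop -> 445 MW per Exaflops (GPUs only), consistent with ~400 MW above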

The Most Energy-Efficient Supercomputers (Nov. 2010)
http://www.green500.org

The Memory and Storage Challenge

• (1) Main Memory: Assume 1 GB per chip; then 1 PB = 1 million chips.
  Realistic sizes of main memory = 10 PB ~ 100 PB ---> 10M ~ 100M chips!!
  --> (a) Multiple power and resiliency issues (plus cost!)
      (b) Bandwidth challenge: how chips are organized and interfaced with other components
  * Need to increase memory densities and bandwidths by orders of magnitude.

• (2) Secondary Storage: need ~100 times the main memory size
  (a) Bandwidth challenge
  (b) Challenge of managing metadata (file descriptors, i-nodes, file control blocks, etc.)

• DARPA Report 2008, p213~214
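A sketch of the chip-count arithmetic above (1 GB of DRAM per chip is the slide's assumption; decimal units are used for simplicity):

# Number of DRAM chips needed for exascale-sized main memory, at 1 GB per chip
GB = 1e9
PB = 1e15
chips_per_PB = PB / GB                  # = 1e6, about one million chips per petabyte
for mem_PB in (1, 10, 100):
    print(f"{mem_PB:>3} PB of main memory -> {mem_PB * chips_per_PB:.0e} chips")
# -> 1 PB ~ 1e6 chips; 10 PB ~ 1e7 chips; 100 PB ~ 1e8 chips (the 10M~100M above)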

The Concurrency Challenge

• Total Concurrency ≡ the total # of operations (flops) that must be initiated on each and every cycle: billion-way concurrency is needed!!
• DARPA Report 2008, p214~215

Processor Parallelism

• Parallelism ≡ the number of distinct threads that make up the execution of a program
• Present maximum ~ order of 100,000
• Need to go to 10^8: 100~1000 times the present value!
• DARPA Report 2008, p216

The Resiliency Challenge

• Resiliency ≡ the property of a system to continue effective operation even in the presence of faults, either in hardware or software.
• More and different forms of faults and disruptions than in today's systems:
  * Huge number of components: 10^6 to 10^8 memory chips and ~10^6 disk drives
  * High clock rates increase bit error rates (BER) on data transmission
  * Aging effects in the fault characteristics of devices
  * Smaller feature sizes increase the sensitivity of devices to SEUs (Single Event Upsets), e.g., cosmic rays, radiation
  * Low operating voltage with low power increases the effect of noise sources, like the power supply
  * The increased levels of concurrency increase the potential for races, metastable states, and difficult timing problems
• DARPA Report 2008, p217

Aggressive Strawman Architecture
* DARPA Report 2008, p177

To achieve 1 Exaflops:

* 1 Core = 4 FPUs + L1 cache memory
* 1 Node = 742 cores on a 4.5 Tflops, 150 Watt (active power) processor chip
* 1 Group = 12 nodes + routing
* 1 Rack = 32 groups
* System = 583 racks
* Total # of nodes ~ 223,000
* Total # of cores ~ 223,000 × 742
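Multiplying out the strawman hierarchy confirms that these numbers land near 1 Exaflops (a simple arithmetic sketch):

# Strawman hierarchy: 12 nodes/group, 32 groups/rack, 583 racks
nodes = 12 * 32 * 583                 # = 223,872 processor chips (~223,000 nodes)
cores = nodes * 742                   # ~1.66e8 cores
peak  = nodes * 4.5e12                # 4.5 Tflops per node
print(f"nodes: {nodes:,}, cores: {cores:.2e}, peak: {peak/1e18:.2f} Exaflops")
# -> nodes: 223,872, cores: 1.66e+08, peak: 1.01 Exaflops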


System Interconnect

• DARPA Report 2008, p 128

Interconnect bandwidth requirements for an Exascale system

Characteristics of Aggressive Strawman Architecture
* DARPA Report 2008, p176

Performance = 1 Exaflops
Total Memory = 3.6 PB
Total Power Consumption = 67.7 MW!

* Main memory and the interconnect have important shares in power consumption.

Power Distribution in the Aggressive Strawman System

Characteristics of Aggressive Strawman Architecture
* DARPA Report 2008, p188

(1) Performance = 1 Exaflops
    Total DRAM Memory = 3.6 PB
    Disk Storage = 3,600 PB = 3.6 EB
    Performance per Watt = 14.7 Gflops/Watt
    Total # of Cores = 1.66 × 10^8
    # of Microprocessor Chips = 223,872
    Total Power Consumption = 67.7 MW!

(2) If scaled down to 20 MW of power:
    Performance = 0.303 Exaflops = 303 Petaflops
    Total DRAM Memory = 1.0 PB
    Disk Storage = 1,080 PB = 1.08 EB
    Total # of Cores = 5.04 × 10^7
    # of Microprocessor Chips = 67,968
    (Projected to the year 2015)
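A quick consistency check of the two configurations above, purely arithmetic on the quoted figures:

# Performance per watt of the full strawman, and the performance left after
# capping the power budget at 20 MW (assuming roughly linear scaling)
full_perf, full_power = 1e18, 67.7e6           # 1 Exaflops at 67.7 MW
gflops_per_watt = full_perf / full_power / 1e9
print(f"{gflops_per_watt:.1f} Gflops/Watt")    # -> 14.8 (the slide quotes 14.7)

capped_power = 20e6
print(f"{gflops_per_watt * 1e9 * capped_power / 1e15:.0f} Petaflops at 20 MW")
# -> ~295 Petaflops, close to the 303 Petaflops of the scaled-down configuration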

Possibilities of Exascale Hardware
* DARPA Report 2008

(a) Energy-efficient Circuits and Architecture in Silicon
    * Communication circuits & memory circuits
(b) Alternative Low-energy Devices and Circuits for Logic and Memory
    e.g., * Superconducting RSFQ (Rapid Single Flux Quantum) devices
          * Cross-bar architectures with novel bi-state devices
(c) Alternative Low-energy Systems for Memory and Storage
    * New levels in the memory hierarchy
    * Re-architecting conventional DRAMs
(d) 3D Interconnect, Packaging and Cooling
(e) Photonic Interconnect

Logic Devices and Memory Devices
* DARPA Report 2008, Ch. 6

(a) Alternative Low-energy Devices and Circuits for Logic and Memory
    e.g., * Superconducting RSFQ (Rapid Single Flux Quantum) devices: extremely low power consumption
(b) Alternative Memory Types (Non-Volatile RAMs)
    * Phase Change Memory (PCRAM) - two resistance states (crystalline vs. amorphous)
    * SONOS Memory
    * Magnetic Random Access Memory (MRAM) - fast non-volatile memory technology
    * FeRAM, Resistive RAM (RRAM)

3D Packaging (A)
* Potential Direction for 3D Packaging (A)
* DARPA Report 2008, p160

3D Packaging (B), (C)
* Potential Direction for 3D Packaging (B)
* Potential Direction for 3D Packaging (C)
* DARPA Report 2008, p161

Possible Aggressive Packaging of a Single Node

* DARPA Report 2008

Each chip consists of 36 super-cores (6 by 6), each of which contains 21 cores (742 cores per chip).

A Strawman Design with Optical Interconnects

* DARPA Report 2008, p191-198

Chip super-core organization and Photonic Interconnect

* On-Chip Optical Interconnect

* Off-Chip Optical Interconnect

* Rack-to-Rack Optical Interconnect

* Optically Connected Memory and Storage System

Rack-to-Rack Optical System Interconnect

* DARPA Report 2008, p195

Total Memory Power = 8.5 MW ~ 12 MW

A Possible Optically Connected Memory Stack

* DARPA Report 2008, p197

Exascale Architectures and Programming Models
* DARPA Report 2008

(a) System Architectures and Programming Models to Reduce Communication
    * Design in self-awareness of the status of energy usage at all levels, and the ability to maintain a specific power level
    * More explicit and dynamic program control over the contents of memory structures (such that minimal communication energy is expended)
    * Alternative execution and programming models
(b) Locality-aware Architectures
    * Optimize data placement and movement

Exascale Algorithm and Application Development
* DARPA Report 2008

Presently O(10^5) processors; we need O(10^8) processors and possibly O(10^10) threads at Exascale.

(a) Power and Resiliency Models in Application Models
(b) Understanding and Adapting Old Algorithms
(c) Inventing New Algorithms
(d) Inventing New Applications
(e) Making Applications Resiliency-Aware

Resilient Exascale Systems
* DARPA Report 2008

(a) Energy-efficient Error Detection and Correction Architecture
(b) Fail-in-place and Self-Healing Systems
(c) Checkpoint Roll-back and Recovery
(d) Algorithmic-level Fault Checking and Fault Resiliency
(e) Vertically-Integrated Resilient Systems

Summary

• Exascale Supercomputing Requires New Technology

• Possibly expected around ~2020

• Power Wall and Memory Wall should be overcome

• 3D Packaging and Optical Interconnect should be pursued

• Alternative Materials for Memory and Logic Devices

- e.g., Superconducting Devices, Spintronics-based Devices

• Different Programming Model

• Reversible Logic and Computing

References:

• TOP 500 Supercomputers (http://www.top500.org)
• Exascale Computing Study: Technology Challenges in Achieving Exascale Systems (DARPA Report, 2008)

Thank You
