An Overview of High Performance Computing and Challenges for the Future
Jack Dongarra
Innovative Computing Laboratory, University of Tennessee
Oak Ridge National Laboratory
University of Manchester
Outline
• Top500 Results
• Four Important Concepts that Will Affect Math Software:
  - Effective Use of Many-Core
  - Exploiting Mixed Precision in Our Numerical Computations
  - Self Adapting / Auto Tuning of Software
  - Fault Tolerant Algorithms
TOP500: H. Meuer, H. Simon, E. Strohmaier, & J. Dongarra
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from the LINPACK MPP benchmark, Ax = b, dense problem (TPP performance: rate as a function of problem size)
- Updated twice a year: SC'xy in the States in November, meeting in Germany in June
- All data available from www.top500.org
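Rmax is, in essence, the achieved rate on a timed dense solve. A minimal sketch of that kind of measurement in NumPy (illustrative only, not the actual HPL benchmark; 2/3 n^3 + 2 n^2 is the standard LINPACK operation count):

```python
import time
import numpy as np

n = 4000                                   # problem size; HPL runs use far larger n
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

t0 = time.perf_counter()
x = np.linalg.solve(A, b)                  # LU factorization plus triangular solves
elapsed = time.perf_counter() - t0

flops = 2.0 / 3.0 * n**3 + 2.0 * n**2      # standard LINPACK operation count
print(f"n = {n}: {flops / elapsed / 1e9:.1f} Gflop/s, "
      f"residual {np.linalg.norm(b - A @ x):.2e}")
```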
Performance Development, 1993-2007 (chart): on a log scale from 100 Mflop/s to 1 Pflop/s, the list total (SUM) grew from 1.17 TF/s to 4.92 PF/s, the #1 system (N=1) from 59.7 GF/s to 281 TF/s, and the #500 entry (N=500) from 0.4 GF/s to 4.0 TF/s. Milestone machines marked along the N=1 curve include the Fujitsu 'NWT', Intel ASCI Red, IBM ASCI White, NEC Earth Simulator, and IBM BlueGene/L; 'My Laptop' is plotted for reference, trailing the list curves by roughly 6-8 years.
29th List: The TOP10 (June 2007, www.top500.org)
1. IBM BlueGene/L (eServer Blue Gene), 280.6 TF/s, DOE/NNSA/LLNL, USA, 2005, 131,072 processors
2. Cray Jaguar (Cray XT3/XT4), 101.7 TF/s, DOE/ORNL, USA, 2007, 23,016 processors
3. Sandia/Cray Red Storm (Cray XT3), 101.4 TF/s, DOE/NNSA/Sandia, USA, 2006, 26,544 processors
4. IBM BGW (eServer Blue Gene), 91.29 TF/s, IBM Thomas Watson, USA, 2005, 40,960 processors
5. IBM New York Blue (eServer Blue Gene), 82.16 TF/s, Stony Brook/BNL, USA, 2007, 36,864 processors
6. IBM ASC Purple (eServer pSeries p575), 75.76 TF/s, DOE/NNSA/LLNL, USA, 2005, 12,208 processors
7. IBM BlueGene/L (eServer Blue Gene), 73.03 TF/s, Rensselaer Polytechnic Institute/CCNI, USA, 2007, 32,768 processors
8. Dell Abe (PowerEdge 1955, Infiniband), 62.68 TF/s, NCSA, USA, 2007, 9,600 processors
9. IBM MareNostrum (JS21 Cluster, Myrinet), 62.63 TF/s, Barcelona Supercomputing Center, Spain, 2006, 12,240 processors
10. SGI HLRB-II (SGI Altix 4700), 56.52 TF/s, LRZ, Germany, 2007, 9,728 processors
Performance Projection (chart): extrapolating the N=1, N=500, and SUM trend lines from 1993 out to 2015, on a log scale running from 100 MF/s through 1 PF/s, 10 PF/s, and 100 PF/s up to 1 EF/s.
Cores per System, June 2007 (chart): number of systems in each core-count bin, from 33-64 cores up through 64k-128k cores.
14 systems > 50 Tflop/s
88 systems > 10 Tflop/s
326 systems > 5 Tflop/s
Chips Used in Each of the 500 Systems (chart): Intel EM64T 46%, AMD x86_64 21%, IBM Power 17%, Intel IA-64 6%, Intel IA-32 6%, HP PA-RISC 2%, NEC 1%, Sun Sparc 1%, HP Alpha 0%, Cray 0%.
96% of systems use Intel (58%), IBM (17%), or AMD (21%) processors.
Interconnects / Systems, 1993-2007 (chart): interconnect families across the 500 systems over time (Gigabit Ethernet, Myrinet, Infiniband, Quadrics, Crossbar, SP Switch, Cray Interconnect, Others, N/A). On the June 2007 list the counts shown are Gigabit Ethernet 206, Infiniband 128, and Myrinet 46; GigE + Infiniband + Myrinet = 74% of the list.
Countries / Systems
Rank | Site           | Manufacturer | Computer                   | Procs | Rmax (GF/s) | Segment  | Interconnect
66   | CINECA         | IBM          | eServer 326 Opteron Dual   | 5,120 | 12,608      | Academic | Infiniband
132  | SCS S.r.l.     | HP           | Cluster Platform 3000 Xeon | 1,024 | 7,987       | Research | Infiniband
271  | Telecom Italia | HP           | SuperDome 875 MHz          | 3,072 | 5,591       | Industry | Myrinet
295  | Telecom Italia | HP           | Cluster Platform 3000 Xeon | 740   | 5,239       | Industry | GigE
305  | Esprinet       | HP           | Cluster Platform 3000 Xeon | 664   | 5,179       | Industry | GigE
Power is an Industry Wide Problem
• "Hiding in Plain Sight, Google Seeks More Power", by John Markoff, New York Times, June 14, 2006
• New Google plant in The Dalles, Oregon (from the NYT, June 14, 2006)
• Google facilities leverage hydroelectric power at sites such as old aluminum plants
• >500,000 servers worldwide
Gflop/kWatt in the Top 20 (chart)
IBM BlueGene/L packaging hierarchy:
• Chip (2 processors): 2.8/5.6 GF/s, 4 MB cache, 17 watts
• Compute Card (2 chips, 2x1x1): 4 processors, 5.6/11.2 GF/s, 1 GB DDR
• Node Board (32 chips, 4x4x2; 16 compute cards): 64 processors, 90/180 GF/s, 16 GB DDR
• Rack (32 node boards, 8x8x16): 2,048 processors, 2.9/5.7 TF/s, 0.5 TB DDR
• System (64 racks, 64x32x32): 131,072 processors, 180/360 TF/s, 32 TB DDR
IBM BlueGene/L: #1 with 131,072 cores; a total of 33 BlueGene systems appear in the Top500.
"Fastest Computer": BG/L at 700 MHz, 131K processors, 64 racks. Peak: 367 Tflop/s; Linpack: 281 Tflop/s, 77% of peak (the run took about 13K seconds, roughly 3.6 hours, at n = 1.8M).
The BlueGene/L compute ASIC includes all networking and processor functionality. Each compute ASIC contains two 32-bit superscalar PowerPC 440 embedded cores (note that L1 cache coherence is not maintained between these cores).
Power: 1.6 MWatts (about 1,600 homes). The sustained rate works out to roughly 43,000 flop/s for every person on the planet.
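Those headline figures are easy to sanity-check. A small sketch of the arithmetic, using only the numbers quoted above (the ~6.5 billion world population for 2007 is an assumption):

```python
rmax = 281e12            # sustained Linpack rate, flop/s
peak = 367e12            # peak rate, flop/s
n = 1.8e6                # Linpack problem size
power_kw = 1600.0        # system power: 1.6 MW
population = 6.5e9       # assumed 2007 world population

flops = 2.0 / 3.0 * n**3 + 2.0 * n**2        # LINPACK operation count
print(f"run time   : {flops / rmax:.0f} s (~{flops / rmax / 3600:.1f} hours)")
print(f"efficiency : {rmax / peak:.0%} of peak")
print(f"energy     : {rmax / 1e9 / power_kw:.0f} Gflop/s per kW")
print(f"per person : {rmax / population:,.0f} flop/s")
```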
Lower voltage has historically allowed both clock rate and transistor density to increase: we have seen increasing numbers of gates on a chip and increasing clock speed. Heat is becoming an unmanageable problem (Intel processors now exceed 100 watts). We will not see the dramatic increases in clock speed in the future; however, the number of gates on a chip will continue to increase.
(figure: rather than packing ever more gates into a tight knot and decreasing the cycle time of one processor, chips are evolving from a single core with its cache, to a few cores sharing a cache, to tiles of many small cores)
Power Cost of Frequency
• Power ∝ Voltage² x Frequency (V²F)
• Frequency ∝ Voltage
• Therefore Power ∝ Frequency³
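Because power grows roughly with the cube of frequency, trading clock rate for cores pays off. A minimal sketch of that trade-off under the P ∝ f³ model (illustrative numbers only, assuming perfect parallel scaling):

```python
def relative_power(freq_ratio, cores):
    """Power relative to one core at full clock, assuming P ~ f^3 per core."""
    return cores * freq_ratio**3

def relative_throughput(freq_ratio, cores):
    """Peak throughput relative to one core at full clock (perfect scaling assumed)."""
    return cores * freq_ratio

# One core at full speed vs. two cores clocked 20% slower:
print(relative_power(1.0, 1), relative_throughput(1.0, 1))   # 1.0   power, 1.0x throughput
print(relative_power(0.8, 2), relative_throughput(0.8, 2))   # ~1.02 power, 1.6x throughput
```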
What's Next?
Candidate chip organizations: all large cores; mixed large and small cores; all small cores; many small floating-point cores with SRAM and 3D stacked memory. Different classes of chips target different markets: home, games/graphics, business, scientific.
Novel Opportunities in Multicores
• We don't have to contend with uniprocessors
• This is not your same old multiprocessor problem: how does going from multiprocessors to multicores impact programs? What changed, and where is the impact?
  - Communication bandwidth
  - Communication latency
Communication Bandwidth
• How much data can be communicated between two cores?
• What changed: the number of wires, the clock rate, and multiplexing
• Impact on the programming model: massive data exchange is possible, data movement is not the bottleneck, and processor affinity is not that important
• Going from a multiprocessor to a multicore: roughly 32 Gbit/s to ~300 Tbit/s, a 10,000X change
Communication Latency
• How long does a round-trip communication take?
• What changed: the length of the wires and the number of pipeline stages
• Impact on the programming model: ultra-fast synchronization; real-time apps can run across multiple cores
• Going from a multiprocessor to a multicore: roughly ~200 cycles down to ~4 cycles, about 50X
80 Core
• Intel's 80-core chip: 1 Tflop/s, 62 watts, 1.2 TB/s internal bandwidth
NSF Track 1 - NCSA/UIUC
• $200M; 10 Pflop/s
• 40K 8-core 4 GHz IBM Power7 chips; 1.2 PB memory
• 5 PB/s global bandwidth; interconnect bandwidth of 0.55 PB/s
• 18 PB disk at 1.8 TB/s I/O bandwidth
• For use by a few people

NSF UTK/JICS Track 2 Proposal
• $65M over 5 years for a 1 Pflop/s system
  - $30M over 5 years for equipment: 36 cabinets of a Cray XT5 (AMD 8-core chips, 12 sockets/board, 3 GHz, 4 flops/cycle/core)
  - $35M over 5 years for operations: power cost about $1.1M/year, Cray maintenance about $1M/year
• To be used by the NSF community, thousands of users; joins UCSD, PSC, TACC
• Last year's Track 2 award went to the University of Texas
Major Changes to Software
• We must rethink the design of our software; this is another disruptive technology, similar to what happened with cluster computing and message passing
• Rethink and rewrite the applications, algorithms, and software
• Numerical libraries, for example, will change: both LAPACK and ScaLAPACK will undergo major changes to accommodate this
A New Generation of Software: Parallel Linear Algebra Software for Multicore Architectures (PLASMA)
Algorithms follow hardware evolution in time:
• LINPACK (80's): vector operations; relies on Level-1 BLAS operations
• LAPACK (90's): blocking, cache friendly; relies on Level-3 BLAS operations
• PLASMA (00's): new, many-core-friendly algorithms; relies on a DAG/scheduler, block data layout, and some extra kernels
The new algorithms have very low granularity and scale very well (multicore, petascale computing); they remove many of the dependencies among tasks (multicore, distributed computing); they avoid latency (distributed computing, out-of-core); and they rely on fast kernels. They need new kernels and depend on efficient scheduling algorithms.
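To make the DAG/scheduler idea concrete, here is a minimal, hypothetical sketch of event-driven task execution: each task starts as soon as its own inputs are finished, with no global barrier. The task names and the dependency structure are illustrative only and are not PLASMA's actual API.

```python
from concurrent.futures import ThreadPoolExecutor

# A toy dependency graph in the PLASMA spirit: each task lists the tasks
# whose results it needs before it can start (names are hypothetical).
deps = {
    "factor_panel_0": [],
    "update_01": ["factor_panel_0"],
    "update_02": ["factor_panel_0"],
    "factor_panel_1": ["update_01"],
    "update_12": ["factor_panel_1", "update_02"],
}

def work(name):
    return f"{name} done"        # stand-in for a tile kernel (panel factor, TRSM, GEMM, ...)

futures = {}
with ThreadPoolExecutor(max_workers=len(deps)) as pool:
    for name in deps:            # the dict is already in a valid topological order
        def run(name=name):
            for d in deps[name]:
                futures[d].result()      # block only on *this* task's inputs
            return work(name)
        futures[name] = pool.submit(run)
    results = {name: f.result() for name, f in futures.items()}
print(results)
```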
Steps in the LAPACK LU
• DGETF2 (LAPACK): factor a panel
• DLASWP (LAPACK): backward swap
• DLASWP (LAPACK): forward swap
• DTRSM (BLAS): triangular solve
• DGEMM (BLAS): matrix multiply
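The same step sequence can be sketched directly. Below is a minimal, illustrative blocked right-looking LU in NumPy that mirrors the DGETF2 / DLASWP / DTRSM / DGEMM structure; it is not LAPACK's code, and the routine names appear only as comments.

```python
import numpy as np

def blocked_lu(A, nb=64):
    """Right-looking blocked LU with partial pivoting (packed LU returned in place)."""
    A = np.array(A, dtype=np.float64, copy=True)
    n = A.shape[0]
    piv = np.arange(n)
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        # DGETF2: unblocked LU with partial pivoting on the panel A[k:, k:k+kb]
        swaps = []
        for j in range(kb):
            col = k + j
            p = k + j + np.argmax(np.abs(A[k + j:, col]))          # pivot row
            swaps.append((k + j, p))
            if p != k + j:
                A[[k + j, p], k:k + kb] = A[[p, k + j], k:k + kb]  # swap inside the panel
            A[k + j + 1:, col] /= A[k + j, col]                    # L multipliers
            A[k + j + 1:, col + 1:k + kb] -= np.outer(A[k + j + 1:, col],
                                                      A[k + j, col + 1:k + kb])
        # DLASWP: apply the panel's pivots to the columns left and right of the panel
        for i, p in swaps:
            if p != i:
                A[[i, p], :k] = A[[p, i], :k]
                A[[i, p], k + kb:] = A[[p, i], k + kb:]
                piv[[i, p]] = piv[[p, i]]
        if k + kb < n:
            L11 = np.tril(A[k:k + kb, k:k + kb], -1) + np.eye(kb)
            # DTRSM: U12 = L11^{-1} A12
            A[k:k + kb, k + kb:] = np.linalg.solve(L11, A[k:k + kb, k + kb:])
            # DGEMM: trailing update A22 -= L21 U12
            A[k + kb:, k + kb:] -= A[k + kb:, k:k + kb] @ A[k:k + kb, k + kb:]
    return A, piv

# sanity check: A0 with rows permuted by piv equals L*U
rng = np.random.default_rng(0)
A0 = rng.standard_normal((256, 256))
LU, piv = blocked_lu(A0, nb=32)
L = np.tril(LU, -1) + np.eye(256)
U = np.triu(LU)
print(np.allclose(A0[piv], L @ U))
```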
LU Timing Profile (4-processor system)
1D decomposition on an SGI Origin; time spent in each component (DGETF2, DLASWP(L), DLASWP(R), DTRSM, DGEMM) for each thread, with no lookahead: the threads proceed in bulk-synchronous phases.
Adaptive Lookahead (Dynamic)
Event-driven multithreading: reorganizing the algorithms to use this approach.
Fork-Join vs. Dynamic Execution
Experiments on Intel's quad-core Clovertown, 2 sockets with 8 threads:
• Fork-join: parallel BLAS, with a synchronization point after each call
• DAG-based: dynamic scheduling driven by task dependencies
The DAG-based schedule removes much of the idle time of the fork-join version; the difference is the time saved.
With the Hype on Cell & PS3, We Became Interested
• The PlayStation 3's CPU is based on the Cell processor. Each Cell contains a PowerPC processor and 8 SPEs (an SPE is a processing unit: SPU + DMA engine). An SPE is a self-contained vector processor that acts independently of the others.
• Each SPE has 4-way SIMD floating point units capable of about 25.6 Gflop/s at 3.2 GHz, or 204.8 Gflop/s peak across the 8 SPEs.
• The catch is that this is for 32-bit floating point (single precision, SP); 64-bit floating point runs at 14.6 Gflop/s total for all 8 SPEs. Roughly, divide the SP peak by 14: a factor of 2 for DP and a factor of 7 for latency issues.
Performance of Single Precision on Conventional Processors
• Single precision is faster because of higher parallelism in SSE/vector units, reduced data motion, and higher locality in cache.
• We realized we have a similar situation on commodity processors: SP is about 2X as fast as DP on many systems.
• The Intel Pentium and AMD Opteron have SSE2: 2 flops/cycle in DP, 4 flops/cycle in SP.
• The IBM PowerPC has AltiVec: 8 flops/cycle in SP, 4 flops/cycle in DP (AltiVec itself has no DP support).
Processor              Size   SGEMM/DGEMM   Size   SGEMV/DGEMV
AMD Opteron 246        3000   2.00          5000   1.70
Sun UltraSparc-IIe     3000   1.64          5000   1.66
Intel PIII Coppermine  3000   2.03          5000   2.09
PowerPC 970            3000   2.04          5000   1.44
Intel Woodcrest        3000   1.81          5000   2.18
Intel XEON             3000   2.04          5000   1.82
Intel Centrino Duo     3000   2.71          5000   2.21
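The same ratio is easy to measure on any machine with a decent BLAS underneath NumPy. A small sketch (illustrative; the ratio you see depends on the hardware and the BLAS build):

```python
import time
import numpy as np

def gemm_gflops(dtype, n=3000, reps=3):
    """Best-of-reps matrix-multiply rate in Gflop/s for the given precision."""
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n, n)).astype(dtype)
    B = rng.standard_normal((n, n)).astype(dtype)
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        C = A @ B
        best = min(best, time.perf_counter() - t0)
    return 2.0 * n**3 / best / 1e9          # GEMM flop count: 2 n^3

sp, dp = gemm_gflops(np.float32), gemm_gflops(np.float64)
print(f"SGEMM: {sp:.1f} Gflop/s  DGEMM: {dp:.1f} Gflop/s  ratio: {sp / dp:.2f}")
```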
32 or 64 bit Floating Point Precision?
• A long time ago, 32-bit floating point was the norm; it is still used in scientific apps, but in a limited way.
• Most apps use 64-bit floating point because of the accumulation of round-off error (a 10 Tflop/s computer running for 4 hours performs more than 10^17 operations), ill-conditioned problems, the small IEEE SP exponent range (8 bits, about 10^±38), and critical sections that need higher precision.
• Sometimes extended precision (128-bit floating point) is needed; on the other hand, some codes can get by with 32-bit floating point in some parts.
• Mixed precision is a possibility: approximate in lower precision and then refine or improve the solution to high precision.
Idea Goes Something Like This...
• Exploit 32-bit floating point as much as possible, especially for the bulk of the computation.
• Correct or update the solution with selective use of 64-bit floating point to provide a refined result.
• Intuitively: compute a 32-bit result, calculate a correction to the 32-bit result using selected higher precision, and perform the update of the 32-bit result with the correction in high precision.

L U = lu(A)          SINGLE   O(n^3)
x = L\(U\b)          SINGLE   O(n^2)
r = b - Ax           DOUBLE   O(n^2)
WHILE || r || not small enough
    z = L\(U\r)      SINGLE   O(n^2)
    x = x + z        DOUBLE   O(n^1)
    r = b - Ax       DOUBLE   O(n^2)
END
Mixed-Precision Iterative Refinement
• Iterative refinement for dense systems, Ax = b, can work exactly this way (the single/double loop above).
• Wilkinson, Moler, Stewart, and Higham provide error bounds for SP floating point results refined using DP floating point; it can be shown that this approach computes the solution to 64-bit floating point precision.
• It requires extra storage: 1.5 times the normal amount in total.
• The O(n^3) work is done in the lower precision; only O(n^2) work is done in the higher precision.
• There are problems if the matrix is ill-conditioned in single precision, i.e., a condition number around 10^8.
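A compact, runnable rendition of the refinement loop in NumPy/SciPy (a sketch of the idea, not the LAPACK DSGESV routine; the single precision factorization comes from scipy's LU):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_solve(A, b, tol=1e-12, max_iter=30):
    """Factor in float32, then refine the solution in float64."""
    lu32 = lu_factor(A.astype(np.float32))                  # O(n^3) work in single precision
    x = lu_solve(lu32, b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b - A @ x                                       # O(n^2) residual in double
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        z = lu_solve(lu32, r.astype(np.float32))            # O(n^2) correction in single
        x = x + z.astype(np.float64)                        # O(n) update in double
    return x

# usage
rng = np.random.default_rng(0)
n = 2000
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))        # small for well-conditioned A
```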
Results for Mixed Precision Iterative Refinement for Dense Ax = b
Architectures (BLAS) tested: 1. Intel Pentium III Coppermine (Goto); 2. Intel Pentium III Katmai (Goto); 3. Sun UltraSPARC IIe (Sunperf); 4. Intel Pentium IV Prescott (Goto); 5. Intel Pentium IV-M Northwood (Goto); 6. AMD Opteron (Goto); 7. Cray X1 (libsci); 8. IBM PowerPC G5, 2.7 GHz (VecLib); 9. Compaq Alpha EV6 (CXML); 10. IBM SP Power3 (ESSL); 11. SGI Octane (ATLAS).
• Single precision is faster than DP because of higher parallelism within the vector units (4 ops/cycle, usually, instead of 2), reduced data motion (32-bit data instead of 64-bit data), and higher locality in cache (more data items fit in cache).
Architecture (BLAS, MPI)           # procs   n        DP Solve/SP Solve   DP Solve/Iter Ref   # iter
AMD Opteron (Goto, OpenMPI MX)     32        22,627   1.85                1.79                6
AMD Opteron (Goto, OpenMPI MX)     64        32,000   1.90                1.83                6
What about the Cell?
• The PowerPC element runs at 3.2 GHz: DGEMM at 5 Gflop/s, an AltiVec peak of 25.6 Gflop/s, and about 10 Gflop/s achieved in SGEMM.
• The 8 SPUs give 204.8 Gflop/s peak. The catch is that this is for 32-bit floating point (single precision, SP); 64-bit floating point runs at 14.6 Gflop/s total for all 8 SPEs. Roughly, divide the SP peak by 14: a factor of 2 for DP and a factor of 7 for latency issues.
Moving Data Around on the Cell
Each SPE has a 256 KB local store, and the memory injection bandwidth is 25.6 GB/s. For worst-case, memory-bound operations with no reuse of data, such as SAXPY (3 data movements, 2 in and 1 out, for 2 ops), the Cell is limited to roughly 25.6 GB/s x 2 ops / 12 B ≈ 4.3 Gflop/s.
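That is the usual bandwidth-bound estimate: achievable rate is capped by bandwidth times arithmetic intensity. A small sketch of the same calculation (illustrative; the element sizes assume single precision):

```python
def bandwidth_bound_gflops(bandwidth_gb_s, flops_per_pass, bytes_per_pass):
    """Upper bound on Gflop/s for a memory-bound kernel with no data reuse."""
    return bandwidth_gb_s * flops_per_pass / bytes_per_pass

# SAXPY y = a*x + y in single precision: 2 flops per element,
# 3 values moved (read x, read y, write y) at 4 bytes each = 12 bytes.
print(bandwidth_bound_gflops(25.6, 2, 12))   # ~4.3 Gflop/s on the Cell's 25.6 GB/s link
```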
IBM Cell 3.2 GHz, Ax = b (chart): Gflop/s versus matrix size up to about 4,500. The SP solve (SP Ax=b) approaches the SP peak of 204 Gflop/s, benefiting from SGEMM across the 8 SPEs being embarrassingly parallel, while the DP solve sits near the DP peak of 15 Gflop/s. At the largest size shown, the SP solve takes about 0.30 seconds versus 3.9 seconds in DP.
The same chart with the mixed-precision solver (DSGESV) added: the refined solution reaches 64-bit accuracy in about 0.47 seconds, an 8.3X speedup over the 3.9-second DP solve and close to the 0.30-second SP solve.
Cholesky on the Cell, Ax = b, A = A^T, x^T A x > 0
• The SPE kernels use standard C code plus C-language SIMD extensions (intrinsics).
• The chart compares single precision performance with mixed precision performance using iterative refinement, the latter achieving 64-bit accuracy.
Cholesky using 2 Cell chips (chart).
Intriguing Potential
• Exploit lower precision as much as possible; the payoff is performance: faster floating point and less data to move.
• Automatically switch between SP and DP to match the desired accuracy: compute the solution in SP and then a correction to the solution in DP, with correction = -A\(b - Ax).
• There is potential for GPUs, FPGAs, and special purpose processors; what about 16-bit floating point?
• Use as little precision as you can get away with and improve the accuracy.
• The approach applies to sparse direct and iterative linear systems and to eigenvalue and optimization problems where Newton's method is used.
IBM/Mercury Cell Blade
• From IBM or Mercury: 2 Cell chips, each with 8 SPEs, 512 MB per Cell
• ~$8K - $17K, with some software
Sony PlayStation 3 Cluster (PS3-T)
• The Cell blade, for comparison: from IBM or Mercury, 2 Cell chips each with 8 SPEs, 512 MB per Cell, ~$8K - $17K, some software
• From WAL*MART: a PS3 with 1 Cell chip and 6 usable SPEs, 256 MB per PS3, $600, downloadable software, dual boot
Cell Hardware Overview
One PowerPC element and 8 SPEs at 3.2 GHz, each SPE delivering 25.6 Gflop/s in single precision; 25 GB/s injection bandwidth; 200 GB/s between the SPEs; 512 MiB of memory. 32-bit peak performance: 8 x 25.6 = 204.8 Gflop/s. 64-bit peak performance: 8 x 1.8 ≈ 14.6 Gflop/s.
PS3 Hardware Overview
One PowerPC element and 8 SPEs at 3.2 GHz, but only 6 SPEs are usable: one is disabled or broken (yield issues) and one is reserved by the GameOS hypervisor. 25 GB/s injection bandwidth; 200 GB/s between the SPEs; 256 MiB of memory; a 1 Gb/s NIC. 32-bit peak performance: 6 x 25.6 = 153.6 Gflop/s. 64-bit peak performance: 6 x 1.8 ≈ 10.8 Gflop/s.
PlayStation 3 LU Codes (chart): Gflop/s versus matrix size up to about 2,500, shown against the SP peak of 153.6 Gflop/s, the DP peak of 10.9 Gflop/s, the SP Ax=b solve, and the rate of 6 embarrassingly parallel SGEMMs.
The same chart with the mixed-precision DSGESV curve added alongside SP Ax=b and the SP (153.6 Gflop/s) and DP (10.9 Gflop/s) peak lines.
Cholesky on the PS3, Ax = b, A = A^T, x^T A x > 0 (chart).
HPC in the Living Room
Matrix multiply on a 4-node PlayStation 3 cluster.
What's good:
• Very cheap: about $4 per Gflop/s, counting the 32-bit theoretical peak
• Fast local computations among the SPEs
• Perfect overlap between communication and computation is possible (Open MPI running): the PPE does communication via MPI while the SPEs do computation via SGEMMs (see the sketch below)
What's bad:
• A Gigabit network card: 1 Gb/s is too little for this much computational power (150 Gflop/s per node), with extremely high network access latencies (120 usec) and low realized bandwidth (600 Mb/s)
• Linux can only run on top of the GameOS hypervisor
• Only 256 MB of local memory and only 6 SPEs
In the timing trace, gold is computation (8 ms) and blue is communication (20 ms).
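A minimal sketch of the overlap pattern described above, using mpi4py non-blocking sends and receives so the next tile moves while the current one is being multiplied. The tile size and count are hypothetical, and on the PS3 cluster the multiply itself would run on the SPEs rather than in NumPy.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
right, left = (rank + 1) % size, (rank - 1) % size

nb, ntiles = 512, 4                              # hypothetical tile size and count
A = np.random.rand(nb, nb)
B_cur = np.random.rand(nb, nb)                   # tile this rank currently holds
B_next = np.empty_like(B_cur)                    # buffer for the incoming tile
C = np.zeros((nb, nb))

for _ in range(ntiles):
    # post non-blocking communication for the next tile (the PPE's job on the PS3)
    reqs = [comm.Isend(B_cur, dest=right), comm.Irecv(B_next, source=left)]
    C += A @ B_cur                               # compute on the current tile (the SPEs' job)
    MPI.Request.Waitall(reqs)                    # communication completes under the compute
    B_cur, B_next = B_next, B_cur                # swap buffers

if rank == 0:
    print("done:", C.shape)
```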
Users Guide for SC on PS3
• SCOP3: A Rough Guide to Scientific Computing on the PlayStation 3
• See the web page for details
Conclusions
• For the last decade or more, the research investment strategy has been overwhelmingly biased in favor of hardware.
• This strategy needs to be rebalanced: barriers to progress are increasingly on the software side.
• Moreover, the return on investment is more favorable for software. Hardware has a half-life measured in years, while software has a half-life measured in decades.
• The high performance ecosystem is out of balance across hardware, OS, compilers, software, algorithms, and applications; there is no Moore's Law for software, algorithms, and applications.
Collaborators / Support
Alfredo Buttari, UTK; Julien Langou, U Colorado; Julie Langou, UTK; Piotr Luszczek, MathWorks; Jakub Kurzak, UTK; Stan Tomov, UTK