An Overview of High Performance Computing and Challenges for the Future
Jack Dongarra
Innovative Computing Laboratory, University of Tennessee
Oak Ridge National Laboratory
University of Manchester
Outline
• Top500 Results
• Four Important Concepts that Will Affect Math Software:
  - Effective Use of Many-Core
  - Exploiting Mixed Precision in Our Numerical Computations
  - Self Adapting / Auto Tuning of Software
  - Fault Tolerant Algorithms
TOP500: H. Meuer, H. Simon, E. Strohmaier, & J. Dongarra
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from the LINPACK MPP benchmark, Ax = b, dense problem (TPP performance: rate as a function of problem size)
- Updated twice a year: SC'xy in the States in November, meeting in Germany in June
- All data available from www.top500.org
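Rmax is, in essence, the achieved rate on a timed dense solve. A minimal sketch of that kind of measurement in NumPy (illustrative only, not the actual HPL benchmark; 2/3 n^3 + 2 n^2 is the standard LINPACK operation count):

```python
import time
import numpy as np

n = 4000                                   # problem size; HPL runs use far larger n
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

t0 = time.perf_counter()
x = np.linalg.solve(A, b)                  # LU factorization plus triangular solves
elapsed = time.perf_counter() - t0

flops = 2.0 / 3.0 * n**3 + 2.0 * n**2      # standard LINPACK operation count
print(f"n = {n}: {flops / elapsed / 1e9:.1f} Gflop/s, "
      f"residual {np.linalg.norm(b - A @ x):.2e}")
```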
Performance Development, 1993-2007 (chart): on a log scale from 100 Mflop/s to 1 Pflop/s, the list total (SUM) grew from 1.17 TF/s to 4.92 PF/s, the #1 system (N=1) from 59.7 GF/s to 281 TF/s, and the #500 entry (N=500) from 0.4 GF/s to 4.0 TF/s. Milestone machines marked along the N=1 curve include the Fujitsu 'NWT', Intel ASCI Red, IBM ASCI White, NEC Earth Simulator, and IBM BlueGene/L; 'My Laptop' is plotted for reference, trailing the list curves by roughly 6-8 years.
29th List: The TOP10 (June 2007, www.top500.org)
1. IBM BlueGene/L (eServer Blue Gene), 280.6 TF/s, DOE/NNSA/LLNL, USA, 2005, 131,072 processors
2. Cray Jaguar (Cray XT3/XT4), 101.7 TF/s, DOE/ORNL, USA, 2007, 23,016 processors
3. Sandia/Cray Red Storm (Cray XT3), 101.4 TF/s, DOE/NNSA/Sandia, USA, 2006, 26,544 processors
4. IBM BGW (eServer Blue Gene), 91.29 TF/s, IBM Thomas Watson, USA, 2005, 40,960 processors
5. IBM New York Blue (eServer Blue Gene), 82.16 TF/s, Stony Brook/BNL, USA, 2007, 36,864 processors
6. IBM ASC Purple (eServer pSeries p575), 75.76 TF/s, DOE/NNSA/LLNL, USA, 2005, 12,208 processors
7. IBM BlueGene/L (eServer Blue Gene), 73.03 TF/s, Rensselaer Polytechnic Institute/CCNI, USA, 2007, 32,768 processors
8. Dell Abe (PowerEdge 1955, Infiniband), 62.68 TF/s, NCSA, USA, 2007, 9,600 processors
9. IBM MareNostrum (JS21 Cluster, Myrinet), 62.63 TF/s, Barcelona Supercomputing Center, Spain, 2006, 12,240 processors
10. SGI HLRB-II (SGI Altix 4700), 56.52 TF/s, LRZ, Germany, 2007, 9,728 processors
Performance Projection (chart): extrapolating the N=1, N=500, and SUM trend lines from 1993 out to 2015, on a log scale running from 100 MF/s through 1 PF/s, 10 PF/s, and 100 PF/s up to 1 EF/s.
Cores per System, June 2007 (chart): number of systems in each core-count bin, from 33-64 cores up through 64k-128k cores.
14 systems > 50 Tflop/s
88 systems > 10 Tflop/s
326 systems > 5 Tflop/s
Chips Used in Each of the 500 Systems (chart): Intel EM64T 46%, AMD x86_64 21%, IBM Power 17%, Intel IA-64 6%, Intel IA-32 6%, HP PA-RISC 2%, NEC 1%, Sun Sparc 1%, HP Alpha 0%, Cray 0%.
96% of systems use Intel (58%), IBM (17%), or AMD (21%) processors.
Interconnects / Systems, 1993-2007 (chart): interconnect families across the 500 systems over time (Gigabit Ethernet, Myrinet, Infiniband, Quadrics, Crossbar, SP Switch, Cray Interconnect, Others, N/A). On the June 2007 list the counts shown are Gigabit Ethernet 206, Infiniband 128, and Myrinet 46; GigE + Infiniband + Myrinet = 74% of the list.
Countries / Systems
Rank | Site           | Manufacturer | Computer                   | Procs | Rmax (GF/s) | Segment  | Interconnect
66   | CINECA         | IBM          | eServer 326 Opteron Dual   | 5,120 | 12,608      | Academic | Infiniband
132  | SCS S.r.l.     | HP           | Cluster Platform 3000 Xeon | 1,024 | 7,987       | Research | Infiniband
271  | Telecom Italia | HP           | SuperDome 875 MHz          | 3,072 | 5,591       | Industry | Myrinet
295  | Telecom Italia | HP           | Cluster Platform 3000 Xeon | 740   | 5,239       | Industry | GigE
305  | Esprinet       | HP           | Cluster Platform 3000 Xeon | 664   | 5,179       | Industry | GigE
Power is an Industry Wide Problem
• "Hiding in Plain Sight, Google Seeks More Power", by John Markoff, New York Times, June 14, 2006
• New Google plant in The Dalles, Oregon (from the NYT, June 14, 2006)
• Google facilities leverage hydroelectric power at sites such as old aluminum plants
• >500,000 servers worldwide
Gflop/kWatt in the Top 20 (chart)
IBM BlueGene/L packaging hierarchy:
• Chip (2 processors): 2.8/5.6 GF/s, 4 MB cache, 17 watts
• Compute Card (2 chips, 2x1x1): 4 processors, 5.6/11.2 GF/s, 1 GB DDR
• Node Board (32 chips, 4x4x2; 16 compute cards): 64 processors, 90/180 GF/s, 16 GB DDR
• Rack (32 node boards, 8x8x16): 2,048 processors, 2.9/5.7 TF/s, 0.5 TB DDR
• System (64 racks, 64x32x32): 131,072 processors, 180/360 TF/s, 32 TB DDR
IBM BlueGene/L: #1 with 131,072 cores; a total of 33 BlueGene systems appear in the Top500.
"Fastest Computer": BG/L at 700 MHz, 131K processors, 64 racks. Peak: 367 Tflop/s; Linpack: 281 Tflop/s, 77% of peak (the run took about 13K seconds, roughly 3.6 hours, at n = 1.8M).
The BlueGene/L compute ASIC includes all networking and processor functionality. Each compute ASIC contains two 32-bit superscalar PowerPC 440 embedded cores (note that L1 cache coherence is not maintained between these cores).
Power: 1.6 MWatts (about 1,600 homes). The sustained rate works out to roughly 43,000 flop/s for every person on the planet.
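Those headline figures are easy to sanity-check. A small sketch of the arithmetic, using only the numbers quoted above (the ~6.5 billion world population for 2007 is an assumption):

```python
rmax = 281e12            # sustained Linpack rate, flop/s
peak = 367e12            # peak rate, flop/s
n = 1.8e6                # Linpack problem size
power_kw = 1600.0        # system power: 1.6 MW
population = 6.5e9       # assumed 2007 world population

flops = 2.0 / 3.0 * n**3 + 2.0 * n**2        # LINPACK operation count
print(f"run time   : {flops / rmax:.0f} s (~{flops / rmax / 3600:.1f} hours)")
print(f"efficiency : {rmax / peak:.0%} of peak")
print(f"energy     : {rmax / 1e9 / power_kw:.0f} Gflop/s per kW")
print(f"per person : {rmax / population:,.0f} flop/s")
```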
Lower voltage has historically allowed both clock rate and transistor density to increase: we have seen increasing numbers of gates on a chip and increasing clock speed. Heat is becoming an unmanageable problem (Intel processors now exceed 100 watts). We will not see the dramatic increases in clock speed in the future; however, the number of gates on a chip will continue to increase.
(figure: rather than packing ever more gates into a tight knot and decreasing the cycle time of one processor, chips are evolving from a single core with its cache, to a few cores sharing a cache, to tiles of many small cores)
Power Cost of Frequency
• Power ∝ Voltage² x Frequency (V²F)
• Frequency ∝ Voltage
• Therefore Power ∝ Frequency³
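Because power grows roughly with the cube of frequency, trading clock rate for cores pays off. A minimal sketch of that trade-off under the P ∝ f³ model (illustrative numbers only, assuming perfect parallel scaling):

```python
def relative_power(freq_ratio, cores):
    """Power relative to one core at full clock, assuming P ~ f^3 per core."""
    return cores * freq_ratio**3

def relative_throughput(freq_ratio, cores):
    """Peak throughput relative to one core at full clock (perfect scaling assumed)."""
    return cores * freq_ratio

# One core at full speed vs. two cores clocked 20% slower:
print(relative_power(1.0, 1), relative_throughput(1.0, 1))   # 1.0   power, 1.0x throughput
print(relative_power(0.8, 2), relative_throughput(0.8, 2))   # ~1.02 power, 1.6x throughput
```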
What's Next?
Candidate chip organizations: all large cores; mixed large and small cores; all small cores; many small floating-point cores with SRAM and 3D stacked memory. Different classes of chips target different markets: home, games/graphics, business, scientific.
Novel Opportunities in Multicores
• We don't have to contend with uniprocessors
• This is not your same old multiprocessor problem: how does going from multiprocessors to multicores impact programs? What changed, and where is the impact?
  - Communication bandwidth
  - Communication latency
Communication Bandwidth
• How much data can be communicated between two cores?
• What changed: the number of wires, the clock rate, and multiplexing
• Impact on the programming model: massive data exchange is possible, data movement is not the bottleneck, and processor affinity is not that important
• Going from a multiprocessor to a multicore: roughly 32 Gbit/s to ~300 Tbit/s, a 10,000X change
Communication Latency
• How long does a round-trip communication take?
• What changed: the length of the wires and the number of pipeline stages
• Impact on the programming model: ultra-fast synchronization; real-time apps can run across multiple cores
• Going from a multiprocessor to a multicore: roughly ~200 cycles down to ~4 cycles, about 50X
80 Core
• Intel's 80-core chip: 1 Tflop/s, 62 watts, 1.2 TB/s internal bandwidth
NSF Track 1 - NCSA/UIUC
• $200M; 10 Pflop/s
• 40K 8-core 4 GHz IBM Power7 chips; 1.2 PB memory
• 5 PB/s global bandwidth; interconnect bandwidth of 0.55 PB/s
• 18 PB disk at 1.8 TB/s I/O bandwidth
• For use by a few people

NSF UTK/JICS Track 2 Proposal
• $65M over 5 years for a 1 Pflop/s system
  - $30M over 5 years for equipment: 36 cabinets of a Cray XT5 (AMD 8-core chips, 12 sockets/board, 3 GHz, 4 flops/cycle/core)
  - $35M over 5 years for operations: power cost about $1.1M/year, Cray maintenance about $1M/year
• To be used by the NSF community, thousands of users; joins UCSD, PSC, TACC
• Last year's Track 2 award went to the University of Texas
Major Changes to Software
• We must rethink the design of our software; this is another disruptive technology, similar to what happened with cluster computing and message passing
• Rethink and rewrite the applications, algorithms, and software
• Numerical libraries, for example, will change: both LAPACK and ScaLAPACK will undergo major changes to accommodate this
A New Generation of Software: Parallel Linear Algebra Software for Multicore Architectures (PLASMA)
Algorithms follow hardware evolution in time:
• LINPACK (80's): vector operations; relies on Level-1 BLAS operations
• LAPACK (90's): blocking, cache friendly; relies on Level-3 BLAS operations
• PLASMA (00's): new, many-core-friendly algorithms; relies on a DAG/scheduler, block data layout, and some extra kernels
The new algorithms have very low granularity and scale very well (multicore, petascale computing); they remove many of the dependencies among tasks (multicore, distributed computing); they avoid latency (distributed computing, out-of-core); and they rely on fast kernels. They need new kernels and depend on efficient scheduling algorithms.
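To make the DAG/scheduler idea concrete, here is a minimal, hypothetical sketch of event-driven task execution: each task starts as soon as its own inputs are finished, with no global barrier. The task names and the dependency structure are illustrative only and are not PLASMA's actual API.

```python
from concurrent.futures import ThreadPoolExecutor

# A toy dependency graph in the PLASMA spirit: each task lists the tasks
# whose results it needs before it can start (names are hypothetical).
deps = {
    "factor_panel_0": [],
    "update_01": ["factor_panel_0"],
    "update_02": ["factor_panel_0"],
    "factor_panel_1": ["update_01"],
    "update_12": ["factor_panel_1", "update_02"],
}

def work(name):
    return f"{name} done"        # stand-in for a tile kernel (panel factor, TRSM, GEMM, ...)

futures = {}
with ThreadPoolExecutor(max_workers=len(deps)) as pool:
    for name in deps:            # the dict is already in a valid topological order
        def run(name=name):
            for d in deps[name]:
                futures[d].result()      # block only on *this* task's inputs
            return work(name)
        futures[name] = pool.submit(run)
    results = {name: f.result() for name, f in futures.items()}
print(results)
```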
Steps in the LAPACK LU
• DGETF2 (LAPACK): factor a panel
• DLASWP (LAPACK): backward swap
• DLASWP (LAPACK): forward swap
• DTRSM (BLAS): triangular solve
• DGEMM (BLAS): matrix multiply
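The same step sequence can be sketched directly. Below is a minimal, illustrative blocked right-looking LU in NumPy that mirrors the DGETF2 / DLASWP / DTRSM / DGEMM structure; it is not LAPACK's code, and the routine names appear only as comments.

```python
import numpy as np

def blocked_lu(A, nb=64):
    """Right-looking blocked LU with partial pivoting (packed LU returned in place)."""
    A = np.array(A, dtype=np.float64, copy=True)
    n = A.shape[0]
    piv = np.arange(n)
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        # DGETF2: unblocked LU with partial pivoting on the panel A[k:, k:k+kb]
        swaps = []
        for j in range(kb):
            col = k + j
            p = k + j + np.argmax(np.abs(A[k + j:, col]))          # pivot row
            swaps.append((k + j, p))
            if p != k + j:
                A[[k + j, p], k:k + kb] = A[[p, k + j], k:k + kb]  # swap inside the panel
            A[k + j + 1:, col] /= A[k + j, col]                    # L multipliers
            A[k + j + 1:, col + 1:k + kb] -= np.outer(A[k + j + 1:, col],
                                                      A[k + j, col + 1:k + kb])
        # DLASWP: apply the panel's pivots to the columns left and right of the panel
        for i, p in swaps:
            if p != i:
                A[[i, p], :k] = A[[p, i], :k]
                A[[i, p], k + kb:] = A[[p, i], k + kb:]
                piv[[i, p]] = piv[[p, i]]
        if k + kb < n:
            L11 = np.tril(A[k:k + kb, k:k + kb], -1) + np.eye(kb)
            # DTRSM: U12 = L11^{-1} A12
            A[k:k + kb, k + kb:] = np.linalg.solve(L11, A[k:k + kb, k + kb:])
            # DGEMM: trailing update A22 -= L21 U12
            A[k + kb:, k + kb:] -= A[k + kb:, k:k + kb] @ A[k:k + kb, k + kb:]
    return A, piv

# sanity check: A0 with rows permuted by piv equals L*U
rng = np.random.default_rng(0)
A0 = rng.standard_normal((256, 256))
LU, piv = blocked_lu(A0, nb=32)
L = np.tril(LU, -1) + np.eye(256)
U = np.triu(LU)
print(np.allclose(A0[piv], L @ U))
```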
LU Timing Profile (4-processor system)
1D decomposition on an SGI Origin; time spent in each component (DGETF2, DLASWP(L), DLASWP(R), DTRSM, DGEMM) for each thread, with no lookahead: the threads proceed in bulk-synchronous phases.
Adaptive Lookahead (Dynamic)
Event-driven multithreading: reorganizing the algorithms to use this approach.
Fork-Join vs. Dynamic Execution
Experiments on Intel's quad-core Clovertown, 2 sockets with 8 threads:
• Fork-join: parallel BLAS, with a synchronization point after each call
• DAG-based: dynamic scheduling driven by task dependencies
The DAG-based schedule removes much of the idle time of the fork-join version; the difference is the time saved.
With the Hype on Cell & PS3, We Became Interested
• The PlayStation 3's CPU is based on the Cell processor. Each Cell contains a PowerPC processor and 8 SPEs (an SPE is a processing unit: SPU + DMA engine). An SPE is a self-contained vector processor that acts independently of the others.
• Each SPE has 4-way SIMD floating point units capable of about 25.6 Gflop/s at 3.2 GHz, or 204.8 Gflop/s peak across the 8 SPEs.
• The catch is that this is for 32-bit floating point (single precision, SP); 64-bit floating point runs at 14.6 Gflop/s total for all 8 SPEs. Roughly, divide the SP peak by 14: a factor of 2 for DP and a factor of 7 for latency issues.
Performance of Single Precision on Conventional Processors
• Single precision is faster because of higher parallelism in SSE/vector units, reduced data motion, and higher locality in cache.
• We realized we have a similar situation on commodity processors: SP is about 2X as fast as DP on many systems.
• The Intel Pentium and AMD Opteron have SSE2: 2 flops/cycle in DP, 4 flops/cycle in SP.
• The IBM PowerPC has AltiVec: 8 flops/cycle in SP, 4 flops/cycle in DP (AltiVec itself has no DP support).
Processor              Size   SGEMM/DGEMM   Size   SGEMV/DGEMV
AMD Opteron 246        3000   2.00          5000   1.70
Sun UltraSparc-IIe     3000   1.64          5000   1.66
Intel PIII Coppermine  3000   2.03          5000   2.09
PowerPC 970            3000   2.04          5000   1.44
Intel Woodcrest        3000   1.81          5000   2.18
Intel XEON             3000   2.04          5000   1.82
Intel Centrino Duo     3000   2.71          5000   2.21
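The same ratio is easy to measure on any machine with a decent BLAS underneath NumPy. A small sketch (illustrative; the ratio you see depends on the hardware and the BLAS build):

```python
import time
import numpy as np

def gemm_gflops(dtype, n=3000, reps=3):
    """Best-of-reps matrix-multiply rate in Gflop/s for the given precision."""
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n, n)).astype(dtype)
    B = rng.standard_normal((n, n)).astype(dtype)
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        C = A @ B
        best = min(best, time.perf_counter() - t0)
    return 2.0 * n**3 / best / 1e9          # GEMM flop count: 2 n^3

sp, dp = gemm_gflops(np.float32), gemm_gflops(np.float64)
print(f"SGEMM: {sp:.1f} Gflop/s  DGEMM: {dp:.1f} Gflop/s  ratio: {sp / dp:.2f}")
```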
32 or 64 bit Floating Point Precision?
• A long time ago, 32-bit floating point was the norm; it is still used in scientific apps, but in a limited way.
• Most apps use 64-bit floating point because of the accumulation of round-off error (a 10 Tflop/s computer running for 4 hours performs more than 10^17 operations), ill-conditioned problems, the small IEEE SP exponent range (8 bits, about 10^±38), and critical sections that need higher precision.
• Sometimes extended precision (128-bit floating point) is needed; on the other hand, some codes can get by with 32-bit floating point in some parts.
• Mixed precision is a possibility: approximate in lower precision and then refine or improve the solution to high precision.
Idea Goes Something Like This...
• Exploit 32-bit floating point as much as possible, especially for the bulk of the computation.
• Correct or update the solution with selective use of 64-bit floating point to provide a refined result.
• Intuitively: compute a 32-bit result, calculate a correction to the 32-bit result using selected higher precision, and perform the update of the 32-bit result with the correction in high precision.

L U = lu(A)          SINGLE   O(n^3)
x = L\(U\b)          SINGLE   O(n^2)
r = b - Ax           DOUBLE   O(n^2)
WHILE || r || not small enough
    z = L\(U\r)      SINGLE   O(n^2)
    x = x + z        DOUBLE   O(n^1)
    r = b - Ax       DOUBLE   O(n^2)
END
Mixed-Precision Iterative Refinement
• Iterative refinement for dense systems, Ax = b, can work exactly this way (the single/double loop above).
• Wilkinson, Moler, Stewart, and Higham provide error bounds for SP floating point results refined using DP floating point; it can be shown that this approach computes the solution to 64-bit floating point precision.
• It requires extra storage: 1.5 times the normal amount in total.
• The O(n^3) work is done in the lower precision; only O(n^2) work is done in the higher precision.
• There are problems if the matrix is ill-conditioned in single precision, i.e., a condition number around 10^8.
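A compact, runnable rendition of the refinement loop in NumPy/SciPy (a sketch of the idea, not the LAPACK DSGESV routine; the single precision factorization comes from scipy's LU):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_solve(A, b, tol=1e-12, max_iter=30):
    """Factor in float32, then refine the solution in float64."""
    lu32 = lu_factor(A.astype(np.float32))                  # O(n^3) work in single precision
    x = lu_solve(lu32, b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b - A @ x                                       # O(n^2) residual in double
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        z = lu_solve(lu32, r.astype(np.float32))            # O(n^2) correction in single
        x = x + z.astype(np.float64)                        # O(n) update in double
    return x

# usage
rng = np.random.default_rng(0)
n = 2000
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))        # small for well-conditioned A
```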
Results for Mixed Precision Iterative Refinement for Dense Ax = b
Architectures (BLAS) tested: 1. Intel Pentium III Coppermine (Goto); 2. Intel Pentium III Katmai (Goto); 3. Sun UltraSPARC IIe (Sunperf); 4. Intel Pentium IV Prescott (Goto); 5. Intel Pentium IV-M Northwood (Goto); 6. AMD Opteron (Goto); 7. Cray X1 (libsci); 8. IBM PowerPC G5, 2.7 GHz (VecLib); 9. Compaq Alpha EV6 (CXML); 10. IBM SP Power3 (ESSL); 11. SGI Octane (ATLAS).
• Single precision is faster than DP because of higher parallelism within the vector units (4 ops/cycle, usually, instead of 2), reduced data motion (32-bit data instead of 64-bit data), and higher locality in cache (more data items fit in cache).
Architecture (BLAS, MPI)           # procs   n        DP Solve/SP Solve   DP Solve/Iter Ref   # iter
AMD Opteron (Goto, OpenMPI MX)     32        22,627   1.85                1.79                6
AMD Opteron (Goto, OpenMPI MX)     64        32,000   1.90                1.83                6
What about the Cell?
• The PowerPC element runs at 3.2 GHz: DGEMM at 5 Gflop/s, an AltiVec peak of 25.6 Gflop/s, and about 10 Gflop/s achieved in SGEMM.
• The 8 SPUs give 204.8 Gflop/s peak. The catch is that this is for 32-bit floating point (single precision, SP); 64-bit floating point runs at 14.6 Gflop/s total for all 8 SPEs. Roughly, divide the SP peak by 14: a factor of 2 for DP and a factor of 7 for latency issues.
Moving Data Around on the Cell
Each SPE has a 256 KB local store, and the memory injection bandwidth is 25.6 GB/s. For worst-case, memory-bound operations with no reuse of data, such as SAXPY (3 data movements, 2 in and 1 out, for 2 ops), the Cell is limited to roughly 25.6 GB/s x 2 ops / 12 B ≈ 4.3 Gflop/s.
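That is the usual bandwidth-bound estimate: achievable rate is capped by bandwidth times arithmetic intensity. A small sketch of the same calculation (illustrative; the element sizes assume single precision):

```python
def bandwidth_bound_gflops(bandwidth_gb_s, flops_per_pass, bytes_per_pass):
    """Upper bound on Gflop/s for a memory-bound kernel with no data reuse."""
    return bandwidth_gb_s * flops_per_pass / bytes_per_pass

# SAXPY y = a*x + y in single precision: 2 flops per element,
# 3 values moved (read x, read y, write y) at 4 bytes each = 12 bytes.
print(bandwidth_bound_gflops(25.6, 2, 12))   # ~4.3 Gflop/s on the Cell's 25.6 GB/s link
```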
IBM Cell 3.2 GHz, Ax = b (chart): Gflop/s versus matrix size up to about 4,500. The SP solve (SP Ax=b) approaches the SP peak of 204 Gflop/s, benefiting from SGEMM across the 8 SPEs being embarrassingly parallel, while the DP solve sits near the DP peak of 15 Gflop/s. At the largest size shown, the SP solve takes about 0.30 seconds versus 3.9 seconds in DP.
The same chart with the mixed-precision solver (DSGESV) added: the refined solution reaches 64-bit accuracy in about 0.47 seconds, an 8.3X speedup over the 3.9-second DP solve and close to the 0.30-second SP solve.
Cholesky on the Cell, Ax = b, A = A^T, x^T A x > 0
• The SPE kernels use standard C code plus C-language SIMD extensions (intrinsics).
• The chart compares single precision performance with mixed precision performance using iterative refinement, the latter achieving 64-bit accuracy.
Cholesky using 2 Cell chips (chart).
Intriguing Potential
• Exploit lower precision as much as possible; the payoff is performance: faster floating point and less data to move.
• Automatically switch between SP and DP to match the desired accuracy: compute the solution in SP and then a correction to the solution in DP, with correction = -A\(b - Ax).
• There is potential for GPUs, FPGAs, and special purpose processors; what about 16-bit floating point?
• Use as little precision as you can get away with and improve the accuracy.
• The approach applies to sparse direct and iterative linear systems and to eigenvalue and optimization problems where Newton's method is used.
IBM/Mercury Cell Blade
• From IBM or Mercury: 2 Cell chips, each with 8 SPEs, 512 MB per Cell
• ~$8K - $17K, with some software
Sony PlayStation 3 Cluster (PS3-T)
• The Cell blade, for comparison: from IBM or Mercury, 2 Cell chips each with 8 SPEs, 512 MB per Cell, ~$8K - $17K, some software
• From WAL*MART: a PS3 with 1 Cell chip and 6 usable SPEs, 256 MB per PS3, $600, downloadable software, dual boot
Cell Hardware Overview
One PowerPC element and 8 SPEs at 3.2 GHz, each SPE delivering 25.6 Gflop/s in single precision; 25 GB/s injection bandwidth; 200 GB/s between the SPEs; 512 MiB of memory. 32-bit peak performance: 8 x 25.6 = 204.8 Gflop/s. 64-bit peak performance: 8 x 1.8 ≈ 14.6 Gflop/s.
PS3 Hardware Overview
One PowerPC element and 8 SPEs at 3.2 GHz, but only 6 SPEs are usable: one is disabled or broken (yield issues) and one is reserved by the GameOS hypervisor. 25 GB/s injection bandwidth; 200 GB/s between the SPEs; 256 MiB of memory; a 1 Gb/s NIC. 32-bit peak performance: 6 x 25.6 = 153.6 Gflop/s. 64-bit peak performance: 6 x 1.8 ≈ 10.8 Gflop/s.
PlayStation 3 LU Codes (chart): Gflop/s versus matrix size up to about 2,500, shown against the SP peak of 153.6 Gflop/s, the DP peak of 10.9 Gflop/s, the SP Ax=b solve, and the rate of 6 embarrassingly parallel SGEMMs.
The same chart with the mixed-precision DSGESV curve added alongside SP Ax=b and the SP (153.6 Gflop/s) and DP (10.9 Gflop/s) peak lines.
Cholesky on the PS3, Ax = b, A = A^T, x^T A x > 0 (chart).
HPC in the Living Room
Matrix multiply on a 4-node PlayStation 3 cluster.
What's good:
• Very cheap: about $4 per Gflop/s, counting the 32-bit theoretical peak
• Fast local computations among the SPEs
• Perfect overlap between communication and computation is possible (Open MPI running): the PPE does communication via MPI while the SPEs do computation via SGEMMs (see the sketch below)
What's bad:
• A Gigabit network card: 1 Gb/s is too little for this much computational power (150 Gflop/s per node), with extremely high network access latencies (120 usec) and low realized bandwidth (600 Mb/s)
• Linux can only run on top of the GameOS hypervisor
• Only 256 MB of local memory and only 6 SPEs
In the timing trace, gold is computation (8 ms) and blue is communication (20 ms).
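A minimal sketch of the overlap pattern described above, using mpi4py non-blocking sends and receives so the next tile moves while the current one is being multiplied. The tile size and count are hypothetical, and on the PS3 cluster the multiply itself would run on the SPEs rather than in NumPy.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
right, left = (rank + 1) % size, (rank - 1) % size

nb, ntiles = 512, 4                              # hypothetical tile size and count
A = np.random.rand(nb, nb)
B_cur = np.random.rand(nb, nb)                   # tile this rank currently holds
B_next = np.empty_like(B_cur)                    # buffer for the incoming tile
C = np.zeros((nb, nb))

for _ in range(ntiles):
    # post non-blocking communication for the next tile (the PPE's job on the PS3)
    reqs = [comm.Isend(B_cur, dest=right), comm.Irecv(B_next, source=left)]
    C += A @ B_cur                               # compute on the current tile (the SPEs' job)
    MPI.Request.Waitall(reqs)                    # communication completes under the compute
    B_cur, B_next = B_next, B_cur                # swap buffers

if rank == 0:
    print("done:", C.shape)
```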
Users Guide for SC on PS3
• SCOP3: A Rough Guide to Scientific Computing on the PlayStation 3
• See the web page for details
Conclusions
• For the last decade or more, the research investment strategy has been overwhelmingly biased in favor of hardware.
• This strategy needs to be rebalanced: barriers to progress are increasingly on the software side.
• Moreover, the return on investment is more favorable for software. Hardware has a half-life measured in years, while software has a half-life measured in decades.
• The high performance ecosystem is out of balance across hardware, OS, compilers, software, algorithms, and applications; there is no Moore's Law for software, algorithms, and applications.
Collaborators / Support
Alfredo Buttari, UTK; Julien Langou, U Colorado; Julie Langou, UTK; Piotr Luszczek, MathWorks; Jakub Kurzak, UTK; Stan Tomov, UTK