2/25/2006
Developments with LAPACK and ScaLAPACK on Today's and Tomorrow's Systems

Jack Dongarra
University of Tennessee and Oak Ridge National Laboratory

Also hear Jim Demmel's talk at 2:30 today, MS47, Carmel room
Participants
♦ U Tennessee, Knoxville: Jack Dongarra, Julien Langou, Julie Langou, Piotr Luszczek, Jakub Kurzak, Stan Tomov, Remi Delmas, Peng Du
♦ UC Berkeley: Jim Demmel, Ming Gu, W. Kahan, Beresford Parlett, Xiaoye Li, Osni Marques, Christof Voemel, David Bindel, Yozo Hida, Jason Riedy, Jianlin Xia, Jiang Zhu, undergrads…
♦ Other academic institutions: UT Austin, UC Davis, Florida IT, U Kansas, U Maryland, North Carolina SU, San Jose SU, UC Santa Barbara, TU Berlin, FU Hagen, U Madrid, U Manchester, U Umeå, U Wuppertal, U Zagreb
♦ Research institutions: CERFACS, LBL
♦ Industrial partners: Cray, HP, Intel, MathWorks, NAG, SGI, Microsoft
Architecture/Systems Continuum (from tightly coupled to loosely coupled)
♦ Custom processor with custom interconnect: Cray X1, NEC SX-8, IBM Regatta, IBM Blue Gene/L
  Best processor performance for codes that are not "cache friendly"; good communication performance; simpler programming model; most expensive
♦ Commodity processor with custom interconnect: SGI Altix (Intel Itanium 2), Cray XT3 and XD1 (AMD Opteron)
  Good communication performance; good scalability
♦ Commodity processor with commodity interconnect: clusters of Pentium, Itanium, Opteron, or Alpha with GigE, Infiniband, Myrinet, or Quadrics; NEC TX7, IBM eServer, Dawning
  Best price/performance (for codes that work well with caches and are latency tolerant)
♦ More complex programming model
[Chart: fraction (0%-100%) of Top500 systems built from Custom, Commodity, and Hybrid designs, Jun-93 through Jun-04]
Parallelism in the Top500
[Chart: number of Top500 systems (0-500) by processor count, 1993-2005; bins range from 1 processor up to 64k-128k]
Lower Voltage, Increase Clock Rate & Transistor Density
♦ We have seen increasing numbers of gates on a chip and increasing clock speeds.
♦ Heat is becoming an unmanageable problem; Intel processors exceed 100 Watts.
♦ We will not see dramatic increases in clock speeds in the future.
♦ However, the number of gates on a chip will continue to increase.
[Diagram: evolution from a single core with its cache, to two cores sharing a cache, to multiple quad-core (C1-C4) chips, each chip with its own cache]
CPU Desktop Trends - Change is Coming
[Chart: projected cores per processor chip and hardware threads per chip, 2004-2010, 0-300]
♦ Relative processing power will continue to double every 18 months
♦ 256 logical processors per chip in late 2010
Commodity Processor Trends
Bandwidth/Latency is the Critical Issue, not FLOPS
Metric                          Annual increase   Typical value in 2006   Typical value in 2010   Typical value in 2020
Single-chip FP performance      59%               4 GFLOP/s               32 GFLOP/s              3300 GFLOP/s
Front-side bus bandwidth        23%               1 GWord/s               3.5 GWord/s             27 GWord/s
  (words per flop)                                0.25 word/flop          0.11 word/flop          0.008 word/flop
DRAM latency                    (5.5%)            70 ns                   50 ns                   28 ns
  (FP ops per latency)                            280 FP ops              1600 FP ops             94,000 FP ops
  (loads per latency)                             70 loads                170 loads               780 loads

Source: Getting Up to Speed: The Future of Supercomputing, National Research Council, 222 pages, 2004, National Academies Press, Washington DC, ISBN 0-309-09502-6.

Got Bandwidth?
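The "FP ops per DRAM latency" rows above follow from multiplying the latency by the chip's flop rate; a quick check of the 2006 and 2010 entries (the 2020 entry is rounded in the table):

```python
# FP ops that fit in one DRAM latency = latency (s) * flop rate (flop/s)
latency_2006, flops_2006 = 70e-9, 4e9    # 70 ns, 4 GFLOP/s
latency_2010, flops_2010 = 50e-9, 32e9   # 50 ns, 32 GFLOP/s

ops_2006 = latency_2006 * flops_2006     # 280 FP ops
ops_2010 = latency_2010 * flops_2010     # 1600 FP ops
```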
Parallelism in LAPACK / ScaLAPACK
♦ Shared memory: LAPACK, built on ATLAS or specialized BLAS, parallelized with threads
♦ Distributed memory: ScaLAPACK, built on the PBLAS, BLACS, and MPI
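As a concrete view of the shared-memory side of this stack, here is a minimal sketch of driving LAPACK's LU routine directly; it uses SciPy's LAPACK bindings (an assumption of this example, not part of the slides), and any parallelism happens inside the underlying BLAS:

```python
import numpy as np
from scipy.linalg import lapack

# LU factorization via the LAPACK routine DGETRF;
# threading, if any, comes from the BLAS library underneath
A = np.array([[4.0, 3.0],
              [6.0, 3.0]])
lu, piv, info = lapack.dgetrf(A)  # info == 0 on success
```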
LAPACK and ScaLAPACK Futures
♦ Widely used dense and banded linear algebra libraries
  Used in vendor libraries from Cray, Fujitsu, HP, IBM, Intel, NEC, SGI
  In Matlab (thanks to tuning…), NAG, PETSc, …
  Over 56M web hits at www.netlib.org (LAPACK, ScaLAPACK, CLAPACK, LAPACK95)
♦ NSF grant for new, improved releases
  Joint with Jim Demmel and many others; a community effort (academic and industry)
♦ Next major release scheduled for 2007
♦ See Jim Demmel's talk at 2:30 today, MS47, Carmel room
Goals (highlights)
♦ Putting more of LAPACK into ScaLAPACK: lots of routines are not yet parallelized
♦ New functionality, e.g. updating/downdating of factorizations
♦ Improving ease of use: life after F77?; bindings to other languages; callable from Matlab
♦ Automatic performance tuning: over 1300 calls to ILAENV() to get tuning parameters
♦ New algorithms: some faster, some more accurate, some new
Faster: λ's and σ's
♦ Nonsymmetric eigenproblem
  Incorporate the SIAM Prize-winning work of Byers / Braman / Mathias on faster HQR; up to 10x faster for large enough problems
♦ Symmetric eigenproblem and SVD
  Reduce from dense to narrow band, incorporating work of Bischof/Lang and Howell/Fulton; moves work from BLAS2 to BLAS3
♦ Narrow band (tri/bidiagonal) problem
  Incorporate the MRRR algorithm of Parlett/Dhillon (Voemel, Marques, Willems)
Recursive/Fractal Data Layouts & Recursive Algorithms
♦ Enable register blocking, L1 cache blocking, L2 cache blocking, and a natural layout for parallel algorithms
♦ Close to the 2D block cyclic distribution
♦ Proven efficient by experiments on recursive algorithms and recursive data layouts (see Gustavson et al.)
[Diagram: memory hierarchy, from registers through the L1, L2, and L3 caches to main memory, shared memory, and distant memory]
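The recursive/fractal layout idea can be illustrated with a Morton (Z-order) index, which interleaves the bits of the block-row and block-column indices so that blocks that are close in 2D stay close in memory at every level of the hierarchy. A small sketch (the function name and bit convention are this example's, not the slides'):

```python
def morton_index(row, col, bits=16):
    """Interleave the bits of (row, col) into a single Z-order index.

    Convention here: row bits occupy the odd positions, col bits the even.
    """
    z = 0
    for i in range(bits):
        z |= ((col >> i) & 1) << (2 * i)
        z |= ((row >> i) & 1) << (2 * i + 1)
    return z

# sorting blocks by Morton index yields the recursive (Z-shaped) order
order = sorted((morton_index(r, c), r, c) for r in range(4) for c in range(4))
```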
Right-Looking LU Factorization (LAPACK)
Steps per block column: DGETF2, DLASWP, DLASWP, DTRSM, DGEMM
♦ DGETF2: unblocked LU
♦ DLASWP: row swaps
♦ DTRSM: triangular solve with many right-hand sides
♦ DGEMM: matrix-matrix multiply
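A minimal NumPy sketch of this blocked right-looking scheme (an illustration, not LAPACK's actual DGETRF; the row swaps that LAPACK defers to separate DLASWP calls are folded into the panel step here):

```python
import numpy as np

def blocked_lu(A, nb=2):
    """Right-looking blocked LU with partial pivoting (L and U packed in one array)."""
    A = A.copy().astype(np.float64)
    n = A.shape[0]
    piv = np.arange(n)
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        # DGETF2: unblocked LU of the panel A[k:, k:k+kb]
        for j in range(k, k + kb):
            p = j + int(np.argmax(np.abs(A[j:, j])))
            if p != j:
                A[[j, p], :] = A[[p, j], :]   # full-row swap (DLASWP folded in)
                piv[[j, p]] = piv[[p, j]]
            A[j + 1:, j] /= A[j, j]
            A[j + 1:, j + 1:k + kb] -= np.outer(A[j + 1:, j], A[j, j + 1:k + kb])
        # DTRSM: solve L11 * U12 = A12 for the block row of U
        L11 = np.tril(A[k:k + kb, k:k + kb], -1) + np.eye(kb)
        A[k:k + kb, k + kb:] = np.linalg.solve(L11, A[k:k + kb, k + kb:])
        # DGEMM: rank-kb update of the trailing submatrix
        A[k + kb:, k + kb:] -= A[k + kb:, k:k + kb] @ A[k:k + kb, k + kb:]
    return A, piv
```

Almost all the flops land in the DGEMM update, which is why the blocked form runs at BLAS3 speed.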
Steps in the LAPACK LU
♦ DGETF2 (LAPACK)
♦ DLASWP (LAPACK), applied twice
♦ DTRSM (BLAS)
♦ DGEMM (BLAS)
LU Timing Profile: LAPACK + BLAS threads (1D decomposition, SGI Origin)
[Chart: time for each component: DGETF2, DLASWP(L), DLASWP(R), DTRSM, DGEMM]
LU Timing Profile: LAPACK + BLAS threads vs. threads with no lookahead (1D decomposition, SGI Origin)
[Chart: time for each component: DGETF2, DLASWP(L), DLASWP(R), DTRSM, DGEMM]
In this case the performance difference comes from parallelizing the row exchanges (DLASWP) and from the threads in the LU algorithm.
Right-Looking LU Factorization
Right-Looking LU with a Lookahead
Left-Looking LU factorization
Pivot Rearrangement and Lookahead (4-processor runs)
[Chart: performance for lookahead depths 0, 1, 2, 3, and ∞]
Things to Watch: PlayStation 3
♦ The PlayStation 3's CPU is based on a chip codenamed "Cell".
♦ Each Cell contains 8 APUs.
  An APU is a self-contained vector processor which acts independently from the others.
  Each has 4 floating-point units capable of a total of 32 Gflop/s (8 Gflop/s each).
♦ 256 Gflop/s peak in 32-bit floating point; 64-bit floating point at 25 Gflop/s.
  IEEE format, but 32-bit arithmetic only rounds toward zero, and overflow is set to the largest representable number.
  According to IBM, the SPE's double-precision unit is fully IEEE 754 compliant.
  Datapaths "lite"
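The 256 Gflop/s figure is just the product of the per-unit numbers above; a quick check:

```python
apus = 8                # APUs per Cell
fp_units_per_apu = 4    # floating-point units per APU
gflops_per_unit = 8.0   # single-precision Gflop/s per unit

per_apu = fp_units_per_apu * gflops_per_unit  # 32 Gflop/s per APU
peak = apus * per_apu                         # 256 Gflop/s single-precision peak
```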
32 and 64 Bit Floating Point Arithmetic
♦ Use 32-bit floating point whenever possible and resort to 64-bit floating point when needed to refine the solution.
♦ Iterative refinement for dense systems can work this way:
  Solve Ax = b in lower precision, saving the factorization (L*U = P*A); O(n^3)
  Compute the residual r = b - A*x in higher precision; O(n^2) (requires the original data A, stored in high precision)
  Solve Az = r using the lower-precision factorization; O(n^2)
  Update the solution x+ = x + z in high precision; O(n)
  Iterate until converged.
♦ Requires extra storage: the total is 1.5 times normal.
  The O(n^3) work is done in lower precision; the O(n^2) work in high precision.
  In the best case, the number of correct digits doubles per iteration.
  Problem if the matrix is ill-conditioned in single precision, i.e. condition number around 10^8.
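The steps above can be sketched in a few lines of NumPy/SciPy, with `lu_factor`/`lu_solve` standing in for the LAPACK factorization (the function name and tolerances are this example's own):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def refine_solve(A, b, max_iter=30, tol=1e-12):
    """Solve Ax = b: factor in single precision, refine in double."""
    # O(n^3): LU factorization in single precision (the expensive step)
    lu, piv = lu_factor(A.astype(np.float32))
    # initial low-precision solve
    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        # O(n^2): residual in double precision, using the original data A
        r = b - A @ x
        # O(n^2): correction from the single-precision factors
        z = lu_solve((lu, piv), r.astype(np.float32)).astype(np.float64)
        x += z
        if np.linalg.norm(z) <= tol * np.linalg.norm(x):
            break
    return x
```

For well-conditioned matrices this reaches double-precision accuracy in a handful of steps; for condition numbers near 10^8 the single-precision factors are too inaccurate for the iteration to converge.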
Another Look at Iterative Refinement
♦ On the Cell processor, single precision runs at 256 Gflop/s while double precision runs at 25 Gflop/s.
♦ On a Pentium using SSE2, single precision can perform 4 floating-point operations per cycle versus 2 in double precision.
♦ Reduced memory traffic (a factor of 2 on single-precision data)
[Plot: Gflop/s (0-2.5) vs. matrix size n (0-3000), Matlab comparison of 32- and 64-bit computation for Ax = b: double precision vs. single precision with iterative refinement]
Cluster w/3.2 GHz Xeons w/ScaLAPACK:

#procs   n       Speedup   #steps
32       32000   1.84      12
16       32000   1.77      18
16       16000   1.92      5
8        24000   1.83      5
8        16000   1.78      6
8        8000    1.64      5
4        16000   1.69      5
4        12000   1.69      6
4        8000    1.78      6
4        4000    1.66      4
2        8000    1.65      5
2        6000    1.66      4
2        4000    1.60      5
2        2000    1.52      4
A 1.9x speedup in Matlab on my laptop!
Refinement Technique Using Single/Double Precision
♦ Linear systems: LU (dense and sparse), Cholesky, QR factorization
♦ Eigenvalue problems: symmetric eigenvalue problem, SVD
  Same idea as with dense systems: reduce to tridiagonal/bidiagonal form in lower precision, retain the original data, and improve with an iterative technique, using the lower precision to solve systems and the higher precision to compute residuals with the original data. O(n^2) per value/vector.
♦ Iterative linear systems: relaxed GMRES, inner/outer scheme
LAPACK Working Note in progress
Summary
♦ Better / faster numerics: MRRR for the symmetric λ and SVD problems; HQR, QZ, reductions, packed
♦ Expanded content: ScaLAPACK to mirror LAPACK
♦ Extended precision version: variable precision, user controlled
♦ Callable from Matlab: invoke LAPACK routines from Matlab
♦ Recursive data structures, for performance
♦ Automate performance tuning
♦ Improve ease of use
♦ Better maintenance and support
♦ Involve the community: an open source effort
Collaborators / Support
♦ U Tennessee, Knoxville: Julien Langou, Julie Langou, Piotr Luszczek, Jakub Kurzak, Stan Tomov, Remi Delmas, Peng Du
♦ UC Berkeley: Jim Demmel, Ming Gu, W. Kahan, Beresford Parlett, Xiaoye Li, Osni Marques, Christof Voemel, David Bindel, Yozo Hida, Jason Riedy, Jianlin Xia, Jiang Zhu, undergrads…
♦ Other academic institutions: UT Austin, UC Davis, Florida IT, U Kansas, U Maryland, North Carolina SU, San Jose SU, UC Santa Barbara, TU Berlin, FU Hagen, U Madrid, U Manchester, U Umeå, U Wuppertal, U Zagreb
♦ Research institutions: CERFACS, LBL
♦ Industrial partners: Cray, HP, Intel, IBM, MathWorks, NAG, SGI, Microsoft