2/25/2006
Developments with LAPACK and ScaLAPACK on Today's and Tomorrow's Systems

Jack Dongarra
University of Tennessee and Oak Ridge National Laboratory

Also hear Jim Demmel's talk at 2:30 today, MS47, Carmel room
Participants
♦ U Tennessee, Knoxville: Jack Dongarra, Julien Langou, Julie Langou, Piotr Luszczek, Jakub Kurzak, Stan Tomov, Remi Delmas, Peng Du
♦ UC Berkeley: Jim Demmel, Ming Gu, W. Kahan, Beresford Parlett, Xiaoye Li, Osni Marques, Christof Voemel, David Bindel, Yozo Hida, Jason Riedy, Jianlin Xia, Jiang Zhu, undergrads…
♦ Other academic institutions: UT Austin, UC Davis, Florida IT, U Kansas, U Maryland, North Carolina SU, San Jose SU, UC Santa Barbara, TU Berlin, FU Hagen, U Madrid, U Manchester, U Umeå, U Wuppertal, U Zagreb
♦ Research institutions: CERFACS, LBL
♦ Industrial partners: Cray, HP, Intel, MathWorks, NAG, SGI, Microsoft
Architecture/Systems Continuum (from tightly coupled to loosely coupled)
♦ Custom processor with custom interconnect: Cray X1, NEC SX-8, IBM Regatta, IBM Blue Gene/L
  Best processor performance for codes that are not "cache friendly"; good communication performance; simpler programming model; most expensive
♦ Commodity processor with custom interconnect: SGI Altix (Intel Itanium 2), Cray XT3 and XD1 (AMD Opteron)
  Good communication performance; good scalability
♦ Commodity processor with commodity interconnect: clusters of Pentium, Itanium, Opteron, or Alpha with GigE, Infiniband, Myrinet, or Quadrics; NEC TX7, IBM eServer, Dawning
  Best price/performance (for codes that work well with caches and are latency tolerant)
♦ More complex programming model
[Chart: fraction (0%-100%) of Top500 systems built from Custom, Commodity, and Hybrid designs, Jun-93 through Jun-04]
Parallelism in the Top500
[Chart: number of Top500 systems (0-500) by processor count, 1993-2005; bins range from 1 processor up to 64k-128k]
Lower Voltage, Increase Clock Rate & Transistor Density
♦ We have seen increasing numbers of gates on a chip and increasing clock speeds.
♦ Heat is becoming an unmanageable problem; Intel processors exceed 100 Watts.
♦ We will not see dramatic increases in clock speeds in the future.
♦ However, the number of gates on a chip will continue to increase.
[Diagram: evolution from a single core with its cache, to two cores sharing a cache, to multiple quad-core (C1-C4) chips, each chip with its own cache]
CPU Desktop Trends - Change is Coming
[Chart: projected cores per processor chip and hardware threads per chip, 2004-2010, 0-300]
♦ Relative processing power will continue to double every 18 months
♦ 256 logical processors per chip in late 2010
Commodity Processor Trends
Bandwidth/Latency is the Critical Issue, not FLOPS
Metric                          Annual increase   Typical value in 2006   Typical value in 2010   Typical value in 2020
Single-chip FP performance      59%               4 GFLOP/s               32 GFLOP/s              3300 GFLOP/s
Front-side bus bandwidth        23%               1 GWord/s               3.5 GWord/s             27 GWord/s
  (words per flop)                                0.25 word/flop          0.11 word/flop          0.008 word/flop
DRAM latency                    (5.5%)            70 ns                   50 ns                   28 ns
  (FP ops per latency)                            280 FP ops              1600 FP ops             94,000 FP ops
  (loads per latency)                             70 loads                170 loads               780 loads

Source: Getting Up to Speed: The Future of Supercomputing, National Research Council, 222 pages, 2004, National Academies Press, Washington DC, ISBN 0-309-09502-6.

Got Bandwidth?
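The "FP ops per DRAM latency" rows above follow from multiplying the latency by the chip's flop rate; a quick check of the 2006 and 2010 entries (the 2020 entry is rounded in the table):

```python
# FP ops that fit in one DRAM latency = latency (s) * flop rate (flop/s)
latency_2006, flops_2006 = 70e-9, 4e9    # 70 ns, 4 GFLOP/s
latency_2010, flops_2010 = 50e-9, 32e9   # 50 ns, 32 GFLOP/s

ops_2006 = latency_2006 * flops_2006     # 280 FP ops
ops_2010 = latency_2010 * flops_2010     # 1600 FP ops
```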
Parallelism in LAPACK / ScaLAPACK
♦ Shared memory: LAPACK, built on ATLAS or specialized BLAS, parallelized with threads
♦ Distributed memory: ScaLAPACK, built on the PBLAS, BLACS, and MPI
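As a concrete view of the shared-memory side of this stack, here is a minimal sketch of driving LAPACK's LU routine directly; it uses SciPy's LAPACK bindings (an assumption of this example, not part of the slides), and any parallelism happens inside the underlying BLAS:

```python
import numpy as np
from scipy.linalg import lapack

# LU factorization via the LAPACK routine DGETRF;
# threading, if any, comes from the BLAS library underneath
A = np.array([[4.0, 3.0],
              [6.0, 3.0]])
lu, piv, info = lapack.dgetrf(A)  # info == 0 on success
```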
LAPACK and ScaLAPACK Futures
♦ Widely used dense and banded linear algebra libraries
  Used in vendor libraries from Cray, Fujitsu, HP, IBM, Intel, NEC, SGI
  In Matlab (thanks to tuning…), NAG, PETSc, …
  Over 56M web hits at www.netlib.org (LAPACK, ScaLAPACK, CLAPACK, LAPACK95)
♦ NSF grant for new, improved releases
  Joint with Jim Demmel and many others; a community effort (academic and industry)
♦ Next major release scheduled for 2007
♦ See Jim Demmel's talk at 2:30 today, MS47, Carmel room
Goals (highlights)
♦ Putting more of LAPACK into ScaLAPACK: lots of routines are not yet parallelized
♦ New functionality, e.g. updating/downdating of factorizations
♦ Improving ease of use: life after F77?; bindings to other languages; callable from Matlab
♦ Automatic performance tuning: over 1300 calls to ILAENV() to get tuning parameters
♦ New algorithms: some faster, some more accurate, some new
Faster: λ's and σ's
♦ Nonsymmetric eigenproblem
  Incorporate the SIAM Prize-winning work of Byers / Braman / Mathias on faster HQR; up to 10x faster for large enough problems
♦ Symmetric eigenproblem and SVD
  Reduce from dense to narrow band, incorporating work of Bischof/Lang and Howell/Fulton; moves work from BLAS2 to BLAS3
♦ Narrow band (tri/bidiagonal) problem
  Incorporate the MRRR algorithm of Parlett/Dhillon (Voemel, Marques, Willems)
Recursive/Fractal Data Layouts & Recursive Algorithms
♦ Enable register blocking, L1 cache blocking, L2 cache blocking, and a natural layout for parallel algorithms
♦ Close to the 2D block cyclic distribution
♦ Proven efficient by experiments on recursive algorithms and recursive data layouts (see Gustavson et al.)
[Diagram: memory hierarchy, from registers through the L1, L2, and L3 caches to main memory, shared memory, and distant memory]
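The recursive/fractal layout idea can be illustrated with a Morton (Z-order) index, which interleaves the bits of the block-row and block-column indices so that blocks that are close in 2D stay close in memory at every level of the hierarchy. A small sketch (the function name and bit convention are this example's, not the slides'):

```python
def morton_index(row, col, bits=16):
    """Interleave the bits of (row, col) into a single Z-order index.

    Convention here: row bits occupy the odd positions, col bits the even.
    """
    z = 0
    for i in range(bits):
        z |= ((col >> i) & 1) << (2 * i)
        z |= ((row >> i) & 1) << (2 * i + 1)
    return z

# sorting blocks by Morton index yields the recursive (Z-shaped) order
order = sorted((morton_index(r, c), r, c) for r in range(4) for c in range(4))
```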
Right-Looking LU Factorization (LAPACK)
Steps per block column: DGETF2, DLASWP, DLASWP, DTRSM, DGEMM
♦ DGETF2: unblocked LU
♦ DLASWP: row swaps
♦ DTRSM: triangular solve with many right-hand sides
♦ DGEMM: matrix-matrix multiply
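A minimal NumPy sketch of this blocked right-looking scheme (an illustration, not LAPACK's actual DGETRF; the row swaps that LAPACK defers to separate DLASWP calls are folded into the panel step here):

```python
import numpy as np

def blocked_lu(A, nb=2):
    """Right-looking blocked LU with partial pivoting (L and U packed in one array)."""
    A = A.copy().astype(np.float64)
    n = A.shape[0]
    piv = np.arange(n)
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        # DGETF2: unblocked LU of the panel A[k:, k:k+kb]
        for j in range(k, k + kb):
            p = j + int(np.argmax(np.abs(A[j:, j])))
            if p != j:
                A[[j, p], :] = A[[p, j], :]   # full-row swap (DLASWP folded in)
                piv[[j, p]] = piv[[p, j]]
            A[j + 1:, j] /= A[j, j]
            A[j + 1:, j + 1:k + kb] -= np.outer(A[j + 1:, j], A[j, j + 1:k + kb])
        # DTRSM: solve L11 * U12 = A12 for the block row of U
        L11 = np.tril(A[k:k + kb, k:k + kb], -1) + np.eye(kb)
        A[k:k + kb, k + kb:] = np.linalg.solve(L11, A[k:k + kb, k + kb:])
        # DGEMM: rank-kb update of the trailing submatrix
        A[k + kb:, k + kb:] -= A[k + kb:, k:k + kb] @ A[k:k + kb, k + kb:]
    return A, piv
```

Almost all the flops land in the DGEMM update, which is why the blocked form runs at BLAS3 speed.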
Steps in the LAPACK LU
♦ DGETF2 (LAPACK)
♦ DLASWP (LAPACK), applied twice
♦ DTRSM (BLAS)
♦ DGEMM (BLAS)
LU Timing Profile: LAPACK + BLAS threads (1D decomposition, SGI Origin)
[Chart: time for each component: DGETF2, DLASWP(L), DLASWP(R), DTRSM, DGEMM]
LU Timing Profile: LAPACK + BLAS threads vs. threads with no lookahead (1D decomposition, SGI Origin)
[Chart: time for each component: DGETF2, DLASWP(L), DLASWP(R), DTRSM, DGEMM]
In this case the performance difference comes from parallelizing the row exchanges (DLASWP) and from the threads in the LU algorithm.
Right-Looking LU Factorization
Right-Looking LU with a Lookahead
Left-Looking LU factorization
Pivot Rearrangement and Lookahead (4-processor runs)
[Chart: performance for lookahead depths 0, 1, 2, 3, and ∞]
Things to Watch: PlayStation 3
♦ The PlayStation 3's CPU is based on a chip codenamed "Cell".
♦ Each Cell contains 8 APUs.
  An APU is a self-contained vector processor which acts independently from the others.
  Each has 4 floating-point units capable of a total of 32 Gflop/s (8 Gflop/s each).
♦ 256 Gflop/s peak in 32-bit floating point; 64-bit floating point at 25 Gflop/s.
  IEEE format, but 32-bit arithmetic only rounds toward zero, and overflow is set to the largest representable number.
  According to IBM, the SPE's double-precision unit is fully IEEE 754 compliant.
  Datapaths "lite"
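The 256 Gflop/s figure is just the product of the per-unit numbers above; a quick check:

```python
apus = 8                # APUs per Cell
fp_units_per_apu = 4    # floating-point units per APU
gflops_per_unit = 8.0   # single-precision Gflop/s per unit

per_apu = fp_units_per_apu * gflops_per_unit  # 32 Gflop/s per APU
peak = apus * per_apu                         # 256 Gflop/s single-precision peak
```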
32 and 64 Bit Floating Point Arithmetic
♦ Use 32-bit floating point whenever possible and resort to 64-bit floating point when needed to refine the solution.
♦ Iterative refinement for dense systems can work this way:
  Solve Ax = b in lower precision, saving the factorization (L*U = P*A); O(n^3)
  Compute the residual r = b - A*x in higher precision; O(n^2) (requires the original data A, stored in high precision)
  Solve Az = r using the lower-precision factorization; O(n^2)
  Update the solution x+ = x + z in high precision; O(n)
  Iterate until converged.
♦ Requires extra storage: the total is 1.5 times normal.
  The O(n^3) work is done in lower precision; the O(n^2) work in high precision.
  In the best case, the number of correct digits doubles per iteration.
  Problem if the matrix is ill-conditioned in single precision, i.e. condition number around 10^8.
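The steps above can be sketched in a few lines of NumPy/SciPy, with `lu_factor`/`lu_solve` standing in for the LAPACK factorization (the function name and tolerances are this example's own):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def refine_solve(A, b, max_iter=30, tol=1e-12):
    """Solve Ax = b: factor in single precision, refine in double."""
    # O(n^3): LU factorization in single precision (the expensive step)
    lu, piv = lu_factor(A.astype(np.float32))
    # initial low-precision solve
    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        # O(n^2): residual in double precision, using the original data A
        r = b - A @ x
        # O(n^2): correction from the single-precision factors
        z = lu_solve((lu, piv), r.astype(np.float32)).astype(np.float64)
        x += z
        if np.linalg.norm(z) <= tol * np.linalg.norm(x):
            break
    return x
```

For well-conditioned matrices this reaches double-precision accuracy in a handful of steps; for condition numbers near 10^8 the single-precision factors are too inaccurate for the iteration to converge.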
Another Look at Iterative Refinement
♦ On the Cell processor, single precision runs at 256 Gflop/s while double precision runs at 25 Gflop/s.
♦ On a Pentium using SSE2, single precision can perform 4 floating-point operations per cycle versus 2 in double precision.
♦ Reduced memory traffic (a factor of 2 on single-precision data)
[Plot: Gflop/s (0-2.5) vs. matrix size n (0-3000), Matlab comparison of 32- and 64-bit computation for Ax = b: double precision vs. single precision with iterative refinement]
Cluster w/3.2 GHz Xeons w/ScaLAPACK:

#procs   n       Speedup   #steps
32       32000   1.84      12
16       32000   1.77      18
16       16000   1.92      5
8        24000   1.83      5
8        16000   1.78      6
8        8000    1.64      5
4        16000   1.69      5
4        12000   1.69      6
4        8000    1.78      6
4        4000    1.66      4
2        8000    1.65      5
2        6000    1.66      4
2        4000    1.60      5
2        2000    1.52      4
A 1.9x speedup in Matlab on my laptop!
Refinement Technique Using Single/Double Precision
♦ Linear systems: LU (dense and sparse), Cholesky, QR factorization
♦ Eigenvalue problems: symmetric eigenvalue problem, SVD
  Same idea as with dense systems: reduce to tridiagonal/bidiagonal form in lower precision, retain the original data, and improve with an iterative technique, using the lower precision to solve systems and the higher precision to compute residuals with the original data. O(n^2) per value/vector.
♦ Iterative linear systems: relaxed GMRES, inner/outer scheme
LAPACK Working Note in progress
Summary
♦ Better / faster numerics: MRRR for the symmetric λ and SVD problems; HQR, QZ, reductions, packed
♦ Expanded content: ScaLAPACK to mirror LAPACK
♦ Extended precision version: variable precision, user controlled
♦ Callable from Matlab: invoke LAPACK routines from Matlab
♦ Recursive data structures, for performance
♦ Automate performance tuning
♦ Improve ease of use
♦ Better maintenance and support
♦ Involve the community: an open source effort
Collaborators / Support
♦ U Tennessee, Knoxville: Julien Langou, Julie Langou, Piotr Luszczek, Jakub Kurzak, Stan Tomov, Remi Delmas, Peng Du
♦ UC Berkeley: Jim Demmel, Ming Gu, W. Kahan, Beresford Parlett, Xiaoye Li, Osni Marques, Christof Voemel, David Bindel, Yozo Hida, Jason Riedy, Jianlin Xia, Jiang Zhu, undergrads…
♦ Other academic institutions: UT Austin, UC Davis, Florida IT, U Kansas, U Maryland, North Carolina SU, San Jose SU, UC Santa Barbara, TU Berlin, FU Hagen, U Madrid, U Manchester, U Umeå, U Wuppertal, U Zagreb
♦ Research institutions: CERFACS, LBL
♦ Industrial partners: Cray, HP, Intel, IBM, MathWorks, NAG, SGI, Microsoft