Knoxville aug02-full

SPIRAL: Current StatusSPIRAL: Current Status

• José Moura (CMU)• Jeremy Johnson (Drexel)• Robert Johnson (MathStar)• David Padua (UIUC)• Viktor Prasanna (USC)• Markus Püschel (CMU)• Manuela Veloso (CMU)

• Gavin Haentjens (CMU)• Pinit Kumhom (Drexel)• Neungsoo Park (USC)• David Sepiashvili (CMU)• Bryan Singer (CMU)• Yevgen Voronenko (Drexel)• Edward Wertz (CMU)• Jianxin Xiong (UIUC)

Faculty

Students

http://www.ece.cmu.edu/~spiral

Markus Püschel

Collaborators

• Christoph Überhuber (TU Vienna)• Franz Franchetti (TU Vienna)

SponsorSponsor

Work supported by DARPA (DSO), Applied & Computational

Mathematics Program, OPAL, through grant managed by

research grant DABT63-98-1-0004 administered by the Army

Directorate of Contracting.

Moore’s Moore’s Law and High(est)Law and High(est) Performance Performance Scientific ComputingScientific Computing

arithmetic cost model not accurate for predicting runtime better performance models hard to get best code is machine dependent (registers/caches size, structure) hand-tuned code becomes obsolete as fast as it is written compiler limitations full performance requires (in part) assembly coding

Moore’s Law: processor-memory bottleneck short life cycles of computers very complex architectures

• vendor specific• special instructions (MMX, SSE, FMA, …)• undocumented features

(single processor, off-the-shelf)

Consequences for software/algorithms:

Portable performance requires automation

SPIRALSPIRALAutomates

cuts development costs code less error-prone

takes advantage of architecture specific features porting without loss of performance

systematic exploration of alternatives both at algorithmic and code level

are performance critical

Implementation

Platform-Adaptation

Optimization

of DSP algorithms

A library generator for highly optimized signal processing algorithms

SPIRAL SPIRAL systemsystem

DSP transformspecifies

user

goes for a coffee

Formula Generator

SPL Compiler Sea

rch

En

gin

e

runtime on given platform

controlsimplementation options

controlsalgorithm generation

fast algorithmas SPL formula

C/Fortran/SIMDcodeS P

I R

A L

(or an

espres

so fo

r small tran

sfo

rms)

platform-adaptedimplementation

comes back

Mathematic

ian

Expert

Program

mer

DSP Transform

Algorithm

DSPDSP Algorithms: Example 4-point DFT Algorithms: Example 4-point DFT

Cooley/Tukey FFT (size 4):

product of structured sparse matrices mathematical notation

1000

0010

0100

0001

1100

1100

0011

0011

000

0100

0010

0001

1010

0101

1010

0101

11

1111

11

1111

iii

ii

4222

42224 )()( LDFTITIDFTDFT

Fourier transform

Identity Permutation

Diagonal matrix (twiddles)

Kronecker product

DSPDSP Algorithms: Terminology Algorithms: Terminology

Transform

Rule

Formula

PDFTIDIDFTDFT mnmnnm

PFIIDIFDFT 222428

parameterized matrix

• a breakdown strategy • product of sparse matrices

• recursive application of rules• uniquely defines an algorithm• efficient representation• easy manipulation

nDFT

Ruletree 8DFT

2DFT 4DFT

2DFT 2DFT

• few constructs and primitives• uniquely defines an algorithm• can be translated into code

DSPDSP Transforms Transforms

nkliDFTn /2exp

222 DFTDFTWHT k

nlkDCT nII /2/1cos)(

nlkDCT nIV /)2/1(2/1cos)(

nklDST nI /sin)(

nlnkMDCT nn /)2/1)(2/)1((cos2

Others: filters, discrete wavelet transforms, Haar, Hartley, …

discrete Fourier transform

Walsh-Hadamard transform

discrete cosine and sineTransforms (16 types)

modified discrete cosine transform

TT two-dimensional transform

RulesRules = Breakdown Strategies = Breakdown Strategies

QnIV

nII

nII

n FIDCTDCTPDCT 22/)(

2/)(

2/)(

DDCTSDCT IIn

IVn )()(

2)(

2 2/1,1 FdiagDCT II

rIV

n MMDCT 1)(

1 1 12 2 2 21

( )n n n n n ni i i t

n

i

WHT I WHT I

))(()()( // hFIIIhF ddnkdk

dnn

EWIPIWDWTWDWT knnnn )())(()( 2/2/2/

PDFTIDIDFTDFT mnmnnm

CDSTDCTBDFT In

Inn )( )(

2/)(2/

PDCTSMDCT IVnnn

)(2

base case

recursive

translation

iterative

recursive

recursive

recursive

recursive

recursive

translation

iterative/recursive

EhCirchFn )()(

FormulaFormula for a DCT, size 16 for a DCT, size 16

Algorithm (Formula)

Implementation

DSP Transform

FormulasFormulas in SPL in SPL

( compose ( diagonal ( 2*cos(1/16*pi) 2*cos(3/16*pi) 2*cos(5/16*pi) 2*cos(7/16*pi) ) ) ( permutation ( 1 3 4 2 ) ) ( tensor ( I 2 ) ( F 2 ) ) ( permutation ( 1 4 2 3 ) ) ( direct_sum ( compose ( F 2 ) ( diagonal ( 1 sqrt(1/2) ) ) ) ( compose ( matrix ( 1 1 0 ) ( 0 (-1) 1 ) ) ( diagonal ( cos(13/8*pi)-sin(13/8*pi) sin(13/8*pi) cos(13/8*pi)+sin(13/8*pi) ) ) ( matrix ( 1 0 ) ( 1 1 ) ( 0 1 ) ) ( permutation ( 2 1 ) )

• • • •

• • • •

SPLSPL Syntax (Subset) Syntax (Subset) matrix operations: (compose formula formula ...) (tensor formula formula ...) (direct_sum formula formula ...) direct matrix description: (matrix (a11 a12 ...) (a21 a22 ...) ...) (diagonal (d1 d2 ...)) (permutation (p1 p2 ...)) parameterized matrices: (I n) (F n) scalars: 1.5, 2/7, cos(..), w(3), pi, 1.2e-04 definition of new symbols: (define name formula) (template formula (i-code-list) directives for code generation #codetype real/complex #unroll on/off

allows extension of SPL

controls loop unrolling

SPLSPL Compiler, 4-point FFT Compiler, 4-point FFT

(compose (tensor (F 2) (I 2)) (T 4 2) (tensor (I 2) (F 2)) (L 4 2))

f0 = x(1) + x(3)f1 = x(1) - x(3)f2 = x(2) + x(4)f3 = x(2) - x(4)f4 = (0.00d0,-1.00d0)*f(3)y(1) = f0 + f2y(2) = f0 - f2y(3) = f1 + f4y(4) = f1 - f4

r0 = x(1) + x(5)r1 = x(1) - x(5)r2 = x(2) + x(6)r3 = x(2) - x(6)r4 = x(3) + x(7)r5 = x(3) - x(7)r6 = x(4) + x(8)r7 = x(4) - x(8)y(1) = r0 + r4y(2) = r1 + r5y(3) = r0 - r4y(4) = r1 - r5y(5) = r2 + r7y(6) = r3 - r6y(7) = r2 - r7y(8) = r3 + r6

fast algorithmas

formulaas

SPL program#codetype

complex real

SPLSPL Compiler: Summary Compiler: Summary

Parsing

Intermediate Code Generation

Intermediate Code Restructuring

Target Code Generation

Symbol Table Abstract Syntax Tree

I-Code

I-Code

C, FORTRAN function

Template Table

SPL Formula Template DefinitionSymbol Definition

OptimizationI-Code

SPL Program

Built-in optimizations:

single static assignment code no reuse of temporary vars only scalar temporary vars constants precomputed limited CSE

Extensible through templates

Algorithm (Formula)

Implementation

DSP Transform

Search

WhyWhy Search? Search?

DCT, type IV, size 16

• maaaany different formulas• large spread in runtimes, even for modest size• not due to arithmetic cost• best formula is platform-dependent

~31000 formulas

SearchSearch Methods available in SPIRAL Methods available in SPIRAL

Exhaustive Search Dynamic Programming (DP) Random Search Hill Climbing STEER (similar to a genetic algorithm)

Possible FormulasSizes Timed Results

Exhaust Very small All Best

DP All 10s-100s (very) good

Random All User decided fair/good

Hill Climbing All 100s-1000s Good

STEER All 100s-1000s (very) good

Search over • algorithm space and • implementation options (degree of unrolling)

STEERSTEERPopulation n:

Population n+1:

……

……

Mutation

Cross-Breeding

expanddifferently

swapexpansions

Survival of Fittest

Experimental Experimental Results (C code)Results (C code)

high performance code(compared with FFTW)

different transforms

search methods(applicable to all transforms)

generated high quality code

SPIRALSPIRAL System System

Available for download (v3.1): www.ece.cmu.edu/~spiral

Easy installation (Unix: configure/make; Windows: install shield)

Unix/Linux and Windows 98/ME/NT/2000/XP

Current transforms: DFT, DHT, WHT, RHT, DCT/DST type I – IV,

MDCT, Filters, Wavelets, Toeplitz, Circulants

Extensible

New version (4.0) in preparation

Recent Work

LearningLearning to Generate Fast Algorithms to Generate Fast Algorithms

• Learns from given dataset (formulas+runtimes) how to design a fast algorithm (breakdown strategy)• Learns from a transform of one size, generates the best algorithm for many sizes• Tested for DFT and WHT

SIMDSIMD Short Vector Extensions Short Vector Extensions

+ x

vector length = 4(4-way)

Extension to instruction set architecture Available on most current architectures (SSE on Pentium, AltiVec on Motorola G4) Originally for multimedia (like MMX for integers) Requires fine grain parallelism Large potential speed-up

SIMD instructions are architecture specific No common API (usually assembly hand coding) Performance very sensitive to memory access Automatic vectorization very limited

Problems:

very difficult to use

Vector Vector code generation from SPL formulascode generation from SPL formulas

Naturally vectorizable construct

A

x y

4IAvector length

iiii

k

ii QEIADP )(

1

Pi, Qi permutationsDi, Ei diagonalsAi arbitrary formulasν SIMD vector length

(Current) generic construct completely vectorizable:

Vectorization in two steps:

1. Formula manipulation using manipulation rules2. Code generation (vector code + C code)

GeneratedGenerated Vector Code DFTs: Pentium 4 Vector Code DFTs: Pentium 4g

flo

ps

DFT 2^n, Pentium 4, 2.53 GHz, using Intel C compiler 6.0

n

speedups (to C code) up to factor of 3.3 beats hand-tuned vendor library

0

1

2

3

4

5

6

7

4 5 6 7 8 9 10 11 12 13 14

Spiral SSE

Intel MKL interl.

FFTW 2.1.3

Spiral C and F95

Spiral F95 vect

SIMD-FFT

GeneratedGenerated Vector Code, Other Transforms Vector Code, Other Transforms

no

rmal

ized

ru

nti

me

no

rmal

ized

ru

nti

me

transform size transform size

2-dim DCTWHT

speedups up to factor of 2.5

FlexibleFlexible Loop Interleaving (Runtime Gain WHT) Loop Interleaving (Runtime Gain WHT)

Alpha 21264: up to 70% UltraSparc III: up to 60%

Athlon XP: up to 55% Pentium 4: up to 45%

ParallelParallel Code Generation: Example WHT Code Generation: Example WHT

WHT size log(N)

0

2

4

6

8

10

1 6 11 16 21 26

Sp

ee

du

p

1 thread

8 threads

10 threads

PowerPC RS64 III

Parallelized constructs:In A, A In

CodeCode Scheduling Scheduling

Runtime histograms

DFT, size 16 ~6500 formulas DCT4, size 16 ~16000 formulas

unscheduledscheduled

FiltersFilters and Wavelets and Wavelets

New constructs: row/column overlapped tensor product

AAA

A

AAA

A

AI kn AI kn

))(()()( // hFIIIhF ddnkdk

dnn

EWIPIWDWTWDWT knnnn )())(()( 2/2/2/

Examples for rules:

ConclusionsConclusions

Automatic code generation for the entire domain of (linear) DSP algorithms

Portable high performance across platforms and across time

Integration of math (high) level and implementation (low) level

Intelligence through search and learning in the space of alternatives

FutureFuture Plans Plans

Transforms: Radon, Gabor, Hankel, structured matrices, …

Target platforms: parallel platforms, DSP processors, SW/HW architectures, FPGAs, ASICs

Instructions: Vector, FMAs, prefetching, OpenMP, MPI

Beyond transforms: entire DSP applications

Other domains amenable to SPIRAL approach?

http://www.ece.cmu.edu/~spiral

Questions?

Extra SlidesExtra Slides

Generating Parallel ProgramsGenerating Parallel Programs

Interpret constructs such as In A as parallel operations and transform formulas to obtain maximal parallelism.

Explore alternative data access patterns mathematically (e.g. different permutations in matrix factorizations)

Prototype implementation using WHT Build on existing sequential package

SMP implementation using OpenMP (IPDPS’02)

90% efficiency obtained on 12 processor PowerPC RS64 III Distributed memory implementation using MPI (POHLL’02)

Comparison of Parallel DDL Schemes Comparison of Parallel DDL Schemes

0

2

4

6

8

10

1 6 11 16 21 26

WHT size log(N)

Spe

edup Best sequential with

DDLParallel without DDL

Coarse-grained DDL

Fine-grained DDLwith ID shift

PowerPC RS64 III10 threads

Overall Parallel SpeedupOverall Parallel Speedup

0

2

4

6

8

10

1 6 11 16 21 26

WHT size log(N)

Sp

eed

up

1 thread

8 threads

10 threads

PowerPC RS64 III

Performance of Digit Permutations on CRAY T3EPerformance of Digit Permutations on CRAY T3E

0 50 100 150 200 250 300 3500

200

400

600

800

1000

1200N

um

ber

of te

nsor

perm

uta

tions

Bandwidth in MB/sec/node

Distribution of Global Tensor Permutations on 128 Processors (shmem put)

Parameters: Pi, 1 i n, no. of processor

Parameters: Dimension, and Ti, 1 i n, no. of processor 2m

AG

CU

Interconnection Network

CU AG

M

CU AG

M

CU AG

M

CU AG

M

Host

I/O

Int

erfa

ce

. . .

M = MemoryAG = Address GeneratorCU = Computation Unit

Parameters: Dimension, Pd

I/O Interface

dPPTFIPPTFIPPTFIPPTFIP 111281

122282

133283

144284 )()()()(

Architecture FrameworkArchitecture Framework

Pease Algorithm DataflowPease Algorithm Dataflow

dPFILTFILLTFILLTFIL )()()()( 28162128

168

164228

164

168328

162

Stage 0 Stage 1 Stage 2 Stage 3

0

2

4

6

8

10

12

14

1

3

5

7

9

11

13

15

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

162L

Optimal DataflowOptimal Dataflow

dPILFIILLTFILLILTFIILLLILTFIILL ))()(()()())(()())(( 282282

84

162128

168

1642

842282

82

164

1682

843282

82

162

Stage 3Stage 1Stage 0 Stage 2

Pease v.s. OptimalPease v.s. Optimal

Performance comparison of Pease algorithm and optimal algorithm

0

2000

4000

6000

8000

10000

12000

5 6 7 8

FFT size

no

. o

f clo

ck c

ycle

s

Pease

Optimal

Local Remote Local RemoteAccess Access Access Access

1 32 96 3900 128 0 6402 32 96 3860 128 0 6403 64 64 3900 128 0 6404 128 0 640 128 0 6405 64 64 3840 64 64 25606 32 96 3880 64 64 2560

PerformanceStage Peformance

Pease Optimal

Performance: SpeedupPerformance: Speedup

Speedup

0

0.5

1

1.5

2

2.5

3

3.5

4

5 6 7 8 9 10 11 12 13 14 15 16

Address Size (Bits)

Sp

eed

up

(4*N

*lo

g(N

)/N

o.

of

Clo

cks)

Speedup = S/C

C = no. of. clock cycles using 4-processor FFT engine

S = minimum no. of clock cycles using single processor (4*2n*n,

2n is the number of points)

SPIRALSPIRAL Approach Approach

DSP Transform (DFT, DCT, Wavelets etc.)

Computing Platform

given

given

PossibleImplementations

PerformanceEvaluation

Inte

llig

en

t S

earc

hPossibleAlgorithms

SP

IRA

L S

earc

h S

pac

e

(Pentium III, Pentium 4, Athlon, SUN, PowerPC, Alpha, … )

adaptedimplementation

ClassicalClassical Code Generation System Code Generation System

DSP Transform (DFT, DCT, Wavelets etc.)

Computing Platform

given

given

ExpertProgrammer

PerformanceEvaluation

Math/AlgorithmExpert

(Pentium III, Pentium 4, Athlon, SUN, PowerPC, Alpha, … )

adaptedimplementation

AlgorithmsAlgorithms = Ruletrees = Formulas = Ruletrees = Formulas)(

8IIDCT

)(4IIDCT

)(4IVDCT

R1)()( 2/2

)(2/

)(2/

)(n

IVn

IIn

IIn IFDCTDCTPDCT

2FR3 R6

2FR4

R3

R1

R6

2F

2FR4

2)(

22

1FDCT II

)(2IIDCT

)(2IIDST

)(2IVDCT

)(2IIDST

)(2IIDCT

R1R6 SDCTPDCT II

nIV

n )()(

)(4IIDCT)(

2IVDCT

MathematicalMathematical Framework: Summary Framework: Summary

fast algorithms represented as ruletrees (easy generation/manipulation) and as formulas (can be translated into code)

formulas built from few constructs and primitives

many different algorithms/formulas generated from few rules (combinatorial explosion)

these algorithms are (essentially) equal in arithmetic cost, but differ in data flow

FormulaFormula Generation Generation

recursive application

Formula Generator

data base (extensible!)

data type

rules control

transforms formulasruletrees

translation

search engine formula translation

(spl compiler)export

runtime

written in GAP/AREP (computer algebra system)

all computation/manipulation is symbolic

exact arithmetic

easy extensible rule and transform data base

verification of rules and formulas

cut here for otheroptimization problems

NumberNumber of Formulas/Algorithms of Formulas/Algorithms

k

123456789

# DFT, size 2^k

16

40296

27744162570361280~1.01 • 10^27~2.31 • 10^61

~2.86 • 10^133

# DCT-IV, size 2^k

110

12631242

19244433627343815121631354242

~1.07 • 10^38~2.30 • 10^76

~1.06 • 10^153

differ in data flow not in arithmetic cost exponential search space

ExtensibilityExtensibility of SPIRAL of SPIRAL

New constructs and primitives (potentially required by radically different transforms) are readily included in SPL

New transforms are readily included on the high level

New instructions sets available (e.g., SSE) are included by extending the SPL compiler

(easy, due to SPIRAL’s framework)

(moderate effort, due to template mechanism)

(doable one time effort)

GeneratedGenerated Vector Code DFTs: Pentium III Vector Code DFTs: Pentium IIIg

flo

ps

DFT 2^n, Pentium III, 1 GHz, using Intel C compiler 6.0

n

speedups (to C code) up to factor of 2.5

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

4 5 6 7 8 9 10 11 12 13

Intel MKL

Spiral SSE

Spiral C/F95 vect

Spiral C/F95

FFTW 2.1.3

Technology

Knoxville aug02-full