Upload
douglaslyon
View
58
Download
0
Embed Size (px)
DESCRIPTION
Citation preview
SPIRAL: Current StatusSPIRAL: Current Status
• José Moura (CMU)• Jeremy Johnson (Drexel)• Robert Johnson (MathStar)• David Padua (UIUC)• Viktor Prasanna (USC)• Markus Püschel (CMU)• Manuela Veloso (CMU)
• Gavin Haentjens (CMU)• Pinit Kumhom (Drexel)• Neungsoo Park (USC)• David Sepiashvili (CMU)• Bryan Singer (CMU)• Yevgen Voronenko (Drexel)• Edward Wertz (CMU)• Jianxin Xiong (UIUC)
Faculty
Students
http://www.ece.cmu.edu/~spiral
Markus Püschel
Collaborators
• Christoph Überhuber (TU Vienna)• Franz Franchetti (TU Vienna)
SponsorSponsor
Work supported by DARPA (DSO), Applied & Computational
Mathematics Program, OPAL, through grant managed by
research grant DABT63-98-1-0004 administered by the Army
Directorate of Contracting.
Moore’s Moore’s Law and High(est)Law and High(est) Performance Performance Scientific ComputingScientific Computing
arithmetic cost model not accurate for predicting runtime better performance models hard to get best code is machine dependent (registers/caches size, structure) hand-tuned code becomes obsolete as fast as it is written compiler limitations full performance requires (in part) assembly coding
Moore’s Law: processor-memory bottleneck short life cycles of computers very complex architectures
• vendor specific• special instructions (MMX, SSE, FMA, …)• undocumented features
(single processor, off-the-shelf)
Consequences for software/algorithms:
Portable performance requires automation
SPIRALSPIRALAutomates
cuts development costs code less error-prone
takes advantage of architecture specific features porting without loss of performance
systematic exploration of alternatives both at algorithmic and code level
are performance critical
Implementation
Platform-Adaptation
Optimization
of DSP algorithms
A library generator for highly optimized signal processing algorithms
SPIRAL SPIRAL systemsystem
DSP transformspecifies
user
goes for a coffee
Formula Generator
SPL Compiler Sea
rch
En
gin
e
runtime on given platform
controlsimplementation options
controlsalgorithm generation
fast algorithmas SPL formula
C/Fortran/SIMDcodeS P
I R
A L
(or an
espres
so fo
r small tran
sfo
rms)
platform-adaptedimplementation
comes back
Mathematic
ian
Expert
Program
mer
DSP Transform
Algorithm
DSPDSP Algorithms: Example 4-point DFT Algorithms: Example 4-point DFT
Cooley/Tukey FFT (size 4):
product of structured sparse matrices mathematical notation
1000
0010
0100
0001
1100
1100
0011
0011
000
0100
0010
0001
1010
0101
1010
0101
11
1111
11
1111
iii
ii
4222
42224 )()( LDFTITIDFTDFT
Fourier transform
Identity Permutation
Diagonal matrix (twiddles)
Kronecker product
DSPDSP Algorithms: Terminology Algorithms: Terminology
Transform
Rule
Formula
PDFTIDIDFTDFT mnmnnm
PFIIDIFDFT 222428
parameterized matrix
• a breakdown strategy • product of sparse matrices
• recursive application of rules• uniquely defines an algorithm• efficient representation• easy manipulation
nDFT
Ruletree 8DFT
2DFT 4DFT
2DFT 2DFT
• few constructs and primitives• uniquely defines an algorithm• can be translated into code
DSPDSP Transforms Transforms
nkliDFTn /2exp
222 DFTDFTWHT k
nlkDCT nII /2/1cos)(
nlkDCT nIV /)2/1(2/1cos)(
nklDST nI /sin)(
nlnkMDCT nn /)2/1)(2/)1((cos2
Others: filters, discrete wavelet transforms, Haar, Hartley, …
discrete Fourier transform
Walsh-Hadamard transform
discrete cosine and sineTransforms (16 types)
modified discrete cosine transform
TT two-dimensional transform
RulesRules = Breakdown Strategies = Breakdown Strategies
QnIV
nII
nII
n FIDCTDCTPDCT 22/)(
2/)(
2/)(
DDCTSDCT IIn
IVn )()(
2)(
2 2/1,1 FdiagDCT II
rIV
n MMDCT 1)(
1 1 12 2 2 21
( )n n n n n ni i i t
n
i
WHT I WHT I
))(()()( // hFIIIhF ddnkdk
dnn
EWIPIWDWTWDWT knnnn )())(()( 2/2/2/
PDFTIDIDFTDFT mnmnnm
CDSTDCTBDFT In
Inn )( )(
2/)(2/
PDCTSMDCT IVnnn
)(2
base case
recursive
translation
iterative
recursive
recursive
recursive
recursive
recursive
translation
iterative/recursive
EhCirchFn )()(
FormulaFormula for a DCT, size 16 for a DCT, size 16
Algorithm (Formula)
Implementation
DSP Transform
FormulasFormulas in SPL in SPL
( compose ( diagonal ( 2*cos(1/16*pi) 2*cos(3/16*pi) 2*cos(5/16*pi) 2*cos(7/16*pi) ) ) ( permutation ( 1 3 4 2 ) ) ( tensor ( I 2 ) ( F 2 ) ) ( permutation ( 1 4 2 3 ) ) ( direct_sum ( compose ( F 2 ) ( diagonal ( 1 sqrt(1/2) ) ) ) ( compose ( matrix ( 1 1 0 ) ( 0 (-1) 1 ) ) ( diagonal ( cos(13/8*pi)-sin(13/8*pi) sin(13/8*pi) cos(13/8*pi)+sin(13/8*pi) ) ) ( matrix ( 1 0 ) ( 1 1 ) ( 0 1 ) ) ( permutation ( 2 1 ) )
• • • •
• • • •
SPLSPL Syntax (Subset) Syntax (Subset) matrix operations: (compose formula formula ...) (tensor formula formula ...) (direct_sum formula formula ...) direct matrix description: (matrix (a11 a12 ...) (a21 a22 ...) ...) (diagonal (d1 d2 ...)) (permutation (p1 p2 ...)) parameterized matrices: (I n) (F n) scalars: 1.5, 2/7, cos(..), w(3), pi, 1.2e-04 definition of new symbols: (define name formula) (template formula (i-code-list) directives for code generation #codetype real/complex #unroll on/off
allows extension of SPL
controls loop unrolling
SPLSPL Compiler, 4-point FFT Compiler, 4-point FFT
(compose (tensor (F 2) (I 2)) (T 4 2) (tensor (I 2) (F 2)) (L 4 2))
f0 = x(1) + x(3)f1 = x(1) - x(3)f2 = x(2) + x(4)f3 = x(2) - x(4)f4 = (0.00d0,-1.00d0)*f(3)y(1) = f0 + f2y(2) = f0 - f2y(3) = f1 + f4y(4) = f1 - f4
r0 = x(1) + x(5)r1 = x(1) - x(5)r2 = x(2) + x(6)r3 = x(2) - x(6)r4 = x(3) + x(7)r5 = x(3) - x(7)r6 = x(4) + x(8)r7 = x(4) - x(8)y(1) = r0 + r4y(2) = r1 + r5y(3) = r0 - r4y(4) = r1 - r5y(5) = r2 + r7y(6) = r3 - r6y(7) = r2 - r7y(8) = r3 + r6
fast algorithmas
formulaas
SPL program#codetype
complex real
SPLSPL Compiler: Summary Compiler: Summary
Parsing
Intermediate Code Generation
Intermediate Code Restructuring
Target Code Generation
Symbol Table Abstract Syntax Tree
I-Code
I-Code
C, FORTRAN function
Template Table
SPL Formula Template DefinitionSymbol Definition
OptimizationI-Code
SPL Program
Built-in optimizations:
single static assignment code no reuse of temporary vars only scalar temporary vars constants precomputed limited CSE
Extensible through templates
Algorithm (Formula)
Implementation
DSP Transform
Search
WhyWhy Search? Search?
DCT, type IV, size 16
• maaaany different formulas• large spread in runtimes, even for modest size• not due to arithmetic cost• best formula is platform-dependent
~31000 formulas
SearchSearch Methods available in SPIRAL Methods available in SPIRAL
Exhaustive Search Dynamic Programming (DP) Random Search Hill Climbing STEER (similar to a genetic algorithm)
Possible FormulasSizes Timed Results
Exhaust Very small All Best
DP All 10s-100s (very) good
Random All User decided fair/good
Hill Climbing All 100s-1000s Good
STEER All 100s-1000s (very) good
Search over • algorithm space and • implementation options (degree of unrolling)
STEERSTEERPopulation n:
Population n+1:
……
……
Mutation
Cross-Breeding
expanddifferently
swapexpansions
Survival of Fittest
Experimental Experimental Results (C code)Results (C code)
high performance code(compared with FFTW)
different transforms
search methods(applicable to all transforms)
generated high quality code
SPIRALSPIRAL System System
Available for download (v3.1): www.ece.cmu.edu/~spiral
Easy installation (Unix: configure/make; Windows: install shield)
Unix/Linux and Windows 98/ME/NT/2000/XP
Current transforms: DFT, DHT, WHT, RHT, DCT/DST type I – IV,
MDCT, Filters, Wavelets, Toeplitz, Circulants
Extensible
New version (4.0) in preparation
Recent Work
LearningLearning to Generate Fast Algorithms to Generate Fast Algorithms
• Learns from given dataset (formulas+runtimes) how to design a fast algorithm (breakdown strategy)• Learns from a transform of one size, generates the best algorithm for many sizes• Tested for DFT and WHT
SIMDSIMD Short Vector Extensions Short Vector Extensions
+ x
vector length = 4(4-way)
Extension to instruction set architecture Available on most current architectures (SSE on Pentium, AltiVec on Motorola G4) Originally for multimedia (like MMX for integers) Requires fine grain parallelism Large potential speed-up
SIMD instructions are architecture specific No common API (usually assembly hand coding) Performance very sensitive to memory access Automatic vectorization very limited
Problems:
very difficult to use
Vector Vector code generation from SPL formulascode generation from SPL formulas
Naturally vectorizable construct
A
x y
4IAvector length
iiii
k
ii QEIADP )(
1
Pi, Qi permutationsDi, Ei diagonalsAi arbitrary formulasν SIMD vector length
(Current) generic construct completely vectorizable:
Vectorization in two steps:
1. Formula manipulation using manipulation rules2. Code generation (vector code + C code)
GeneratedGenerated Vector Code DFTs: Pentium 4 Vector Code DFTs: Pentium 4g
flo
ps
DFT 2^n, Pentium 4, 2.53 GHz, using Intel C compiler 6.0
n
speedups (to C code) up to factor of 3.3 beats hand-tuned vendor library
0
1
2
3
4
5
6
7
4 5 6 7 8 9 10 11 12 13 14
Spiral SSE
Intel MKL interl.
FFTW 2.1.3
Spiral C and F95
Spiral F95 vect
SIMD-FFT
GeneratedGenerated Vector Code, Other Transforms Vector Code, Other Transforms
no
rmal
ized
ru
nti
me
no
rmal
ized
ru
nti
me
transform size transform size
2-dim DCTWHT
speedups up to factor of 2.5
FlexibleFlexible Loop Interleaving (Runtime Gain WHT) Loop Interleaving (Runtime Gain WHT)
Alpha 21264: up to 70% UltraSparc III: up to 60%
Athlon XP: up to 55% Pentium 4: up to 45%
ParallelParallel Code Generation: Example WHT Code Generation: Example WHT
WHT size log(N)
0
2
4
6
8
10
1 6 11 16 21 26
Sp
ee
du
p
1 thread
8 threads
10 threads
PowerPC RS64 III
Parallelized constructs:In A, A In
CodeCode Scheduling Scheduling
Runtime histograms
DFT, size 16 ~6500 formulas DCT4, size 16 ~16000 formulas
unscheduledscheduled
FiltersFilters and Wavelets and Wavelets
New constructs: row/column overlapped tensor product
AAA
A
AAA
A
AI kn AI kn
))(()()( // hFIIIhF ddnkdk
dnn
EWIPIWDWTWDWT knnnn )())(()( 2/2/2/
Examples for rules:
ConclusionsConclusions
Automatic code generation for the entire domain of (linear) DSP algorithms
Portable high performance across platforms and across time
Integration of math (high) level and implementation (low) level
Intelligence through search and learning in the space of alternatives
FutureFuture Plans Plans
Transforms: Radon, Gabor, Hankel, structured matrices, …
Target platforms: parallel platforms, DSP processors, SW/HW architectures, FPGAs, ASICs
Instructions: Vector, FMAs, prefetching, OpenMP, MPI
Beyond transforms: entire DSP applications
Other domains amenable to SPIRAL approach?
http://www.ece.cmu.edu/~spiral
Questions?
Extra SlidesExtra Slides
Generating Parallel ProgramsGenerating Parallel Programs
Interpret constructs such as In A as parallel operations and transform formulas to obtain maximal parallelism.
Explore alternative data access patterns mathematically (e.g. different permutations in matrix factorizations)
Prototype implementation using WHT Build on existing sequential package
SMP implementation using OpenMP (IPDPS’02)
90% efficiency obtained on 12 processor PowerPC RS64 III Distributed memory implementation using MPI (POHLL’02)
Comparison of Parallel DDL Schemes Comparison of Parallel DDL Schemes
0
2
4
6
8
10
1 6 11 16 21 26
WHT size log(N)
Spe
edup Best sequential with
DDLParallel without DDL
Coarse-grained DDL
Fine-grained DDLwith ID shift
PowerPC RS64 III10 threads
Overall Parallel SpeedupOverall Parallel Speedup
0
2
4
6
8
10
1 6 11 16 21 26
WHT size log(N)
Sp
eed
up
1 thread
8 threads
10 threads
PowerPC RS64 III
Performance of Digit Permutations on CRAY T3EPerformance of Digit Permutations on CRAY T3E
0 50 100 150 200 250 300 3500
200
400
600
800
1000
1200N
um
ber
of te
nsor
perm
uta
tions
Bandwidth in MB/sec/node
Distribution of Global Tensor Permutations on 128 Processors (shmem put)
Parameters: Pi, 1 i n, no. of processor
Parameters: Dimension, and Ti, 1 i n, no. of processor 2m
AG
CU
Interconnection Network
CU AG
M
CU AG
M
CU AG
M
CU AG
M
Host
I/O
Int
erfa
ce
. . .
M = MemoryAG = Address GeneratorCU = Computation Unit
Parameters: Dimension, Pd
I/O Interface
dPPTFIPPTFIPPTFIPPTFIP 111281
122282
133283
144284 )()()()(
Architecture FrameworkArchitecture Framework
Pease Algorithm DataflowPease Algorithm Dataflow
dPFILTFILLTFILLTFIL )()()()( 28162128
168
164228
164
168328
162
Stage 0 Stage 1 Stage 2 Stage 3
0
2
4
6
8
10
12
14
1
3
5
7
9
11
13
15
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
162L
Optimal DataflowOptimal Dataflow
dPILFIILLTFILLILTFIILLLILTFIILL ))()(()()())(()())(( 282282
84
162128
168
1642
842282
82
164
1682
843282
82
162
Stage 3Stage 1Stage 0 Stage 2
Pease v.s. OptimalPease v.s. Optimal
Performance comparison of Pease algorithm and optimal algorithm
0
2000
4000
6000
8000
10000
12000
5 6 7 8
FFT size
no
. o
f clo
ck c
ycle
s
Pease
Optimal
Local Remote Local RemoteAccess Access Access Access
1 32 96 3900 128 0 6402 32 96 3860 128 0 6403 64 64 3900 128 0 6404 128 0 640 128 0 6405 64 64 3840 64 64 25606 32 96 3880 64 64 2560
PerformanceStage Peformance
Pease Optimal
Performance: SpeedupPerformance: Speedup
Speedup
0
0.5
1
1.5
2
2.5
3
3.5
4
5 6 7 8 9 10 11 12 13 14 15 16
Address Size (Bits)
Sp
eed
up
(4*N
*lo
g(N
)/N
o.
of
Clo
cks)
Speedup = S/C
C = no. of. clock cycles using 4-processor FFT engine
S = minimum no. of clock cycles using single processor (4*2n*n,
2n is the number of points)
SPIRALSPIRAL Approach Approach
DSP Transform (DFT, DCT, Wavelets etc.)
Computing Platform
given
given
PossibleImplementations
PerformanceEvaluation
Inte
llig
en
t S
earc
hPossibleAlgorithms
SP
IRA
L S
earc
h S
pac
e
(Pentium III, Pentium 4, Athlon, SUN, PowerPC, Alpha, … )
adaptedimplementation
ClassicalClassical Code Generation System Code Generation System
DSP Transform (DFT, DCT, Wavelets etc.)
Computing Platform
given
given
ExpertProgrammer
PerformanceEvaluation
Math/AlgorithmExpert
(Pentium III, Pentium 4, Athlon, SUN, PowerPC, Alpha, … )
adaptedimplementation
AlgorithmsAlgorithms = Ruletrees = Formulas = Ruletrees = Formulas)(
8IIDCT
)(4IIDCT
)(4IVDCT
R1)()( 2/2
)(2/
)(2/
)(n
IVn
IIn
IIn IFDCTDCTPDCT
2FR3 R6
2FR4
R3
R1
R6
2F
2FR4
2)(
22
1FDCT II
)(2IIDCT
)(2IIDST
)(2IVDCT
)(2IIDST
)(2IIDCT
R1R6 SDCTPDCT II
nIV
n )()(
)(4IIDCT)(
2IVDCT
MathematicalMathematical Framework: Summary Framework: Summary
fast algorithms represented as ruletrees (easy generation/manipulation) and as formulas (can be translated into code)
formulas built from few constructs and primitives
many different algorithms/formulas generated from few rules (combinatorial explosion)
these algorithms are (essentially) equal in arithmetic cost, but differ in data flow
FormulaFormula Generation Generation
recursive application
Formula Generator
data base (extensible!)
data type
rules control
transforms formulasruletrees
translation
search engine formula translation
(spl compiler)export
runtime
written in GAP/AREP (computer algebra system)
all computation/manipulation is symbolic
exact arithmetic
easy extensible rule and transform data base
verification of rules and formulas
cut here for otheroptimization problems
NumberNumber of Formulas/Algorithms of Formulas/Algorithms
k
123456789
# DFT, size 2^k
16
40296
27744162570361280~1.01 • 10^27~2.31 • 10^61
~2.86 • 10^133
# DCT-IV, size 2^k
110
12631242
19244433627343815121631354242
~1.07 • 10^38~2.30 • 10^76
~1.06 • 10^153
differ in data flow not in arithmetic cost exponential search space
ExtensibilityExtensibility of SPIRAL of SPIRAL
New constructs and primitives (potentially required by radically different transforms) are readily included in SPL
New transforms are readily included on the high level
New instructions sets available (e.g., SSE) are included by extending the SPL compiler
(easy, due to SPIRAL’s framework)
(moderate effort, due to template mechanism)
(doable one time effort)
GeneratedGenerated Vector Code DFTs: Pentium III Vector Code DFTs: Pentium IIIg
flo
ps
DFT 2^n, Pentium III, 1 GHz, using Intel C compiler 6.0
n
speedups (to C code) up to factor of 2.5
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
4 5 6 7 8 9 10 11 12 13
Intel MKL
Spiral SSE
Spiral C/F95 vect
Spiral C/F95
FFTW 2.1.3