Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
SPIRAL DSP Transform Compiler:Application Specific Hardware Synthesis
Peter A. Milder1 ([email protected])Franz Franchetti, James C. Hoe, and Markus Pueschel2
Department of ECEDepartment of ECECarnegie Mellon University
CMU/ECE/Hoe, February 2013, slide‐1
now with 1SUNY Stonybrook and 2ETH
The SPIRAL ProjectThe SPIRAL Project• High performance implementations of linear DSP
f (DFT DCT DWT fil )transforms (DFT, DCT, DWT, filters, etc) are an important class of design problemsH d d i d t i i t i k d i• Hand design and tuning is tricky and expensive– needs both math and implementation knowledge
i i d di– time‐consuming and tedious – needs to repeat effort for every new context
• SPIRAL research goal: A flexible push‐button design generator that produces SW & HW implementations comparable with expert hand design
CMU/ECE/Hoe, February 2013, slide‐2
comparable with expert hand design
Why we can do better than hand designy g
• SPIRAL is only focused on linear DSP transforms
• These transforms are highly structured, highly regular and very well understood mathematicallyegu a a d e y e u de stood at e at ca y
• Algorithmic implementations of a transform can be d f ll i k f lenumerated following a known set of rules
• For a given objective function and mapping target, aFor a given objective function and mapping target, a computer generates a solution at least as good as the best human effortby trying enough
CMU/ECE/Hoe, February 2013, slide‐3
implementations
SPIRAL Framework
SPIRAL
I want a DFT of size 1024on a {Xilinx, P4, Cell....}
SPIRAL automationstarts here
where mostwhere mosttools beginautomatingthe problem
CMU/ECE/Hoe, February 2013, slide‐4
Principle 1: Domain knowledge in the systemPrinciple 2: Optimization at a high level of abstraction
www.spiral.net/hardware/dftgen.html
CMU/ECE/Hoe, February 2013, slide‐5
High‐Level, Quality, and SpecializationHigh Level, Quality, and Specialization
High level:High‐level:tools know
better than you
RTL Synthesis: general purpose
y
RTL Synthesis: general‐purposebut special handling of
structures like FSM, arith, etc.
Place and Route: works the same
, ,
CMU/ECE/Hoe, February 2013, slide‐6
Place‐and‐Route: works the sameno matter what design
OutlineOutline• SPIRAL Formula Framework
• SPIRAL for HW FFT cores
• SPIRAL for HW FFT “un”‐coreSPIRAL for HW FFT un core
CMU/ECE/Hoe, February 2013, slide‐7
Linear TransformsLinear Transforms• Linear transform is a matrix‐vector multiplication
– computing by definition takes O(N2) operations– the matrix has structure
• E.g. discrete Fourier transform: y = DFTN xy0y1.
x0x1.
k 0 .. N-1
j
yj.
= xk.
.Njkie 2j
0 ..
CMU/ECE/Hoe, February 2013, slide‐8
.yN-1
.xN-1
N-1
“Fast” AlgorithmsFast Algorithms• a “fast” algorithm factors the matrix into a sequence of structured sparse matricesstructured, sparse matricescheaper sparse multiplies O(N log(N)) operations
• E g Cooley Tukey Factorization of DFT• E.g. Cooley‐Tukey Factorization of DFT4
11
1111
11
1111
111111ii
11
1
1111
111
1
1111
11
111111
11
iii
ii
• Matrix formula representation
44
CMU/ECE/Hoe, February 2013, slide‐9
4222
42224 LDFTIDIDFTDFT
Factorization RulesFactorization RulesE.g. Cooley‐Tukeyg y y
mnnmn
mnnmnmn LDFTIDIDFTDFT
11
– DFT2 is – D is a diagonal matrix of twiddle factors
1111
– L is a stride permutation matrix– AB=[aj,kB] is the tensor (or kronecker) product
A In
a0,0a0,0a0,000 a0,1a0,1a0,10
0
a1,0a1 00 a1,1a1 1
00BBB
e g I B
CMU/ECE/Hoe, February 2013, slide‐10
A In a1,0a1,00a1,1a1,10
0
B
e.g., In B
“Fast” Fourier Transform AlgorithmsFast Fourier Transform Algorithms• Recursively factorize by the Cooley‐Tukey rule until only leaf cases remain (e g DFT for radix r)only leaf cases remain (e.g. DFTr for radix‐r)
8242
82428 LDFTIDIDFTDFT
• Exponential number of alternatives
82
4222
42222
8242 LLDFTIDIDFTIDIDFT
Exponential number of alternatives 8DFT
2DFT 4DFT
8DFT
4DFT 2DFT
• Each ruletree corresponds a different algorithm
2DFT 4DFT2DFT 2DFT2DFT 2DFT
CMU/ECE/Hoe, February 2013, slide‐11
• Each ruletree corresponds a different algorithm• All cost O(N log(N))
A System of Transforms and Rulesy
2
)(2 2/1,1 FdiagDCT II
QnIV
nII
nII
n FIDCTDCTPDCT 22/)(
2/)(
2/)(
DDCTSDCT IIn
IVn )()(
IV )(r
IVn MMDCT 1
)(
PDFTIDIDFTDFT
CDSTDCTBDFT In
Inn )( )(
2/)(2/
50+ transforms150+ rules
))(()()( // hFIIIhF ddnkdk
dnn
PDFTIDIDFTDFT mnmnnm
EhCirchF )()(
( )n
WHT I WHT I
EWIPIWDWTWDWT knnnn )())(()( 2/2/2/
EhCirchFn )()(
CMU/ECE/Hoe, February 2013, slide‐12
1 1 12 2 2 21
( )n n n n n ni i i t
iWHT I WHT I
Algorithmic Design Spaceg g psize # of DFT # of DCT‐IV
248
1640
1101268
163264
40296
27744162570361280
12631242
1924443362734381512163135424264
128256512
162570361280~1.01 1027~2.31 1061~2 86 10133
7343815121631354242~1.07 1038~2.30 1076~1 06 10153512 2.86 10133 1.06 10153
Different characteristics: data flow, numerical stability operation ordering working set size
CMU/ECE/Hoe, February 2013, slide‐13
stability, operation ordering, working set size, datapath regularity
Design Space: SW DCT 32 on P4Design Space: SW DCT 32 on P4 Histogram of 10,000 randomly selected algorithms
histogram by runtime(P4, 3.2 GHz)
CMU/ECE/Hoe, February 2013, slide‐14
histogram by num. accuracy
OutlineOutline• SPIRAL Formula Framework
• SPIRAL for HW FFT cores
• SPIRAL for HW FFT “un”‐coreSPIRAL for HW FFT un core
CMU/ECE/Hoe, February 2013, slide‐15
Formula to HW (Combinational)Formula to HW (Combinational)• Given where is:
– apply then– apply , then – is a permutation permute– apply , times in parallel
i di l l– is a diagonal scale
B A
A×7
×8
CMU/ECE/Hoe, February 2013, slide‐16
A ×4
×2
DFT8 Datapath Example
xx
x
x x
x
x
x
x
x
CMU/ECE/Hoe, February 2013, slide‐17
82
4222
42222
82428 LLDFTIDIDFTIDIDFTDFT
(formula is applied from right to left)
Pease DFT8 Example8 p
x x x
x x xx
x
x
x
x
xx x x
x x x
stage 1 stage 2 stage 3
CMU/ECE/Hoe, February 2013, slide‐18
How about good HW?g• Matrix formulas have a natural mapping to dataflow and hence combinational datapathdataflow and hence combinational datapath
• However, real hardware designs must fit a given resource constraintresource constraint sequential datapath that reuse available HW
identify repeated kernels– identify repeated kernels– instantiate kernels under resource constraints
h d l t ti t i t ti t d– schedule computation to reuse instantiated kernels
We want to do the analysis and mapping at
CMU/ECE/Hoe, February 2013, slide‐19
We want to do the analysis and mapping at formula level, with high‐level algorithm knowledge
Tensor as Streaming Parallelismg
partially streamedfully parallel fully streamed
CMU/ECE/Hoe, February 2013, slide‐20
Pease DFT Example: DFTPease DFT Example: DFT8x x x
x x xx
x
x
x
x
xx x x
x x x
stage 1 stage 2 stage 3
CMU/ECE/Hoe, February 2013, slide‐21
Pease DFT Example: DFT StreamingPease DFT Example: DFT8 Streaming
x x xx x x
x x x
f(L82)f(L8
2) f(L82)f(L8
2) f(L82)f(L8
2) f(L82)f(R8)
x x x
x x xf(L8
2)f(L82) f(L8
2)f(L82) f(L8
2)f(L82) f(L8
2)f(R8)
stage 1 stage 2 stage 3
CMU/ECE/Hoe, February 2013, slide‐22
Regular Structure for HW• Simple regular structure embodied in Pease FFT
Regular Structure for HW
• Example:
CMU/ECE/Hoe, February 2013, slide‐23
Formally representing horizontal reuseFormally representing horizontal reuse
hr hrhr
not horizontally horizontally partially horizontally
CMU/ECE/Hoe, February 2013, slide‐24
yreused
yreused
p y yreused
Iterative Reuse of Logicg
latencyt
CMU/ECE/Hoe, February 2013, slide‐25
latencyFine-grained control over cost/latency tradeoff
cost
Example: rewriting rulesf ifor streaming reuse
CMU/ECE/Hoe, February 2013, slide‐26
26
Applicability to other transforms?pp y• DFT radix 2
1
0
22222 11
k
ii
k
kkk LDFTITR
• DFT radix 2r
0i
1/
0
22222
rk
i
k
rkrrkk LDFTITR
• 2‐D DFTnxn 1
2
nnnn DFTIL
0i
nxn
• WHT
0i
nnn
1/
2rk
k
kk LWHTIWHT
• DCT (type II) Hk
kkk
PLLLADP 221
2
0
222i
rkrrk LWHTI
CMU/ECE/Hoe, February 2013, slide‐27
• DCT (type II) H
iik kkk PLLLADP 2
22
22
0
22 11
FPGA: Area vs ThroughputFPGA: Area vs. Throughput
Pareto optimal
49x slices132x throughput
CMU/ECE/Hoe, February 2013, slide‐28
28
OutlineOutline• SPIRAL Formula Framework
• SPIRAL for HW FFT cores
• SPIRAL for HW FFT “un”‐coreSPIRAL for HW FFT un core
CMU/ECE/Hoe, February 2013, slide‐29
A 2D‐FFT AlgorithmA 2D‐FFT Algorithm• Row‐column algorithm:g
2D-
Row StageColumn Stage
Dataset:Dataset:(Logical abstractionof the 2D dataset)
… …
CMU/ECE/Hoe, February 2013, slide‐30
Off‐chip Data SetsOff chip Data Sets
off‐chipDRAM
DFTnk l
on‐chipSRAMDRAM kernelSRAM
• Need to balance– kernel processing bandwidth– off‐chip memory bandwidth
CMU/ECE/Hoe, February 2013, slide‐31
– on‐chip storage capacity
Inefficient DRAM Access Patterns• Row‐wise traversal ‐> Sequential accesses• Column‐wise traversal ‐> Large strided accesses
row‐major 2D array
lin
0
n
ear mem
…
Row buffersizem
space…
CMU/ECE/Hoe, February 2013, slide‐32
nn2
How to Optimize the Access Patterns p
…
0k2
row‐major “blocked”
nk
linear mn
…
…
kRow buffer
sizemem
spa
n
…
2
ace
CMU/ECE/Hoe, February 2013, slide‐33
n2in row‐buffersized chunks [Akin, et al., FCCM 2012]
Design Generator w/ Tensor Formalismg /
row column algorithm
row stagecolumn stage
2D row‐column algorithm
symmetric algorithm
2D-
symmetric algorithm
symmetric algorithm y gwith tiling
read tileslinearizehi
FFT processingtransposeand re‐tile
write tilescolumn‐wise
CMU/ECE/Hoe, February 2013, slide‐34
row‐wiseon‐chipand re‐tileon‐chip
column wise
[Akin, et al., FCCM 2012]
2D‐FFT (double) Raw Performance( )
CMU/ECE/Hoe, February 2013, slide‐35
Problem Size[Akin, et al., FCCM 2012]
2D‐FFT (double) BW Efficiency( ) y
CMU/ECE/Hoe, February 2013, slide‐36
Problem Size[Akin, et al., FCCM 2012]
2D‐FFT (double) Power Efficiency( ) y
CMU/ECE/Hoe, February 2013, slide‐37
Problem Size[Akin, et al., FCCM 2012]
ConclusionsConclusions• Encapsulating domain knowledge in a domain p g gspecific tool for high‐level design automation
• SPIRAL– mathematical approach to DSP transform implementation (cores and “un”‐core)
– generalizable to other linear DSP transforms– as good as best expert designer
• Thank you
CMU/ECE/Hoe, February 2013, slide‐38