SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter

SPIRAL DSP Transform Compiler:Application Specific Hardware Synthesis

Peter A. Milder1 ([email protected])Franz Franchetti, James C. Hoe, and Markus Pueschel2

Department of ECEDepartment of ECECarnegie Mellon University

CMU/ECE/Hoe, February 2013, slide‐1

now with 1SUNY Stonybrook and 2ETH

The SPIRAL ProjectThe SPIRAL Project• High performance implementations of linear DSP

f (DFT DCT DWT fil )transforms (DFT, DCT, DWT, filters, etc) are an important class of design problemsH d d i d t i i t i k d i• Hand design and tuning is tricky and expensive– needs both math and implementation knowledge

i i d di– time‐consuming and tedious – needs to repeat effort for every new context

• SPIRAL research goal: A flexible push‐button design generator that produces SW & HW implementations comparable with expert hand design


comparable with expert hand design

Why we can do better than hand designy g

• SPIRAL is only focused on linear DSP transforms

• These transforms are highly structured, highly regular and very well understood mathematicallyegu a a d e y e u de stood at e at ca y

• Algorithmic implementations of a transform can be d f ll i k f lenumerated following a known set of rules

• For a given objective function and mapping target, aFor a given objective function and mapping target, a computer generates a solution at least as good as the best human effortby trying enough


implementations

SPIRAL Framework

SPIRAL

I want a DFT of size 1024on a {Xilinx, P4, Cell....}

SPIRAL automationstarts here

where mostwhere mosttools beginautomatingthe problem


Principle 1: Domain knowledge in the systemPrinciple 2: Optimization at a high level of abstraction

www.spiral.net/hardware/dftgen.html


High‐Level, Quality, and SpecializationHigh Level, Quality, and Specialization

High level:High‐level:tools know

better than you

RTL Synthesis: general purpose

y

RTL Synthesis: general‐purposebut special handling of

structures like FSM, arith, etc.

Place and Route: works the same

, ,


Place‐and‐Route: works the sameno matter what design

OutlineOutline• SPIRAL Formula Framework

• SPIRAL for HW FFT cores

• SPIRAL for HW FFT “un”‐coreSPIRAL for HW FFT un core


Linear TransformsLinear Transforms• Linear transform is a matrix‐vector multiplication

– computing by definition takes O(N2) operations– the matrix has structure

• E.g. discrete Fourier transform: y = DFTN xy0y1.

x0x1.

k 0 .. N-1

j

yj.

= xk.

.Njkie 2j

0 ..


.yN-1

.xN-1

N-1

“Fast” AlgorithmsFast Algorithms• a “fast” algorithm factors the matrix into a sequence of structured sparse matricesstructured, sparse matricescheaper sparse multiplies O(N log(N)) operations

• E g Cooley Tukey Factorization of DFT• E.g. Cooley‐Tukey Factorization of DFT4

11

1111

11

1111

111111ii

11

1

1111

111

1

1111

11

111111

11

iii

ii

• Matrix formula representation

44


4222

42224 LDFTIDIDFTDFT

Factorization RulesFactorization RulesE.g. Cooley‐Tukeyg y y

mnnmn

mnnmnmn LDFTIDIDFTDFT

11

– DFT2 is – D is a diagonal matrix of twiddle factors

1111

– L is a stride permutation matrix– AB=[aj,kB] is the tensor (or kronecker) product

A In

a0,0a0,0a0,000 a0,1a0,1a0,10

0

a1,0a1 00 a1,1a1 1

00BBB

e g I B


A In a1,0a1,00a1,1a1,10

0

B

e.g., In B

“Fast” Fourier Transform AlgorithmsFast Fourier Transform Algorithms• Recursively factorize by the Cooley‐Tukey rule until only leaf cases remain (e g DFT for radix r)only leaf cases remain (e.g. DFTr for radix‐r)

8242

82428 LDFTIDIDFTDFT

• Exponential number of alternatives

82

4222

42222

8242 LLDFTIDIDFTIDIDFT

Exponential number of alternatives 8DFT

2DFT 4DFT

8DFT

4DFT 2DFT

• Each ruletree corresponds a different algorithm

2DFT 4DFT2DFT 2DFT2DFT 2DFT


• Each ruletree corresponds a different algorithm• All cost O(N log(N))

A System of Transforms and Rulesy

2

)(2 2/1,1 FdiagDCT II

QnIV

nII

nII

n FIDCTDCTPDCT 22/)(

2/)(

2/)(

DDCTSDCT IIn

IVn )()(

IV )(r

IVn MMDCT 1

)(

PDFTIDIDFTDFT

CDSTDCTBDFT In

Inn )( )(

2/)(2/

50+ transforms150+ rules

))(()()( // hFIIIhF ddnkdk

dnn

PDFTIDIDFTDFT mnmnnm

EhCirchF )()(

( )n

WHT I WHT I

EWIPIWDWTWDWT knnnn )())(()( 2/2/2/

EhCirchFn )()(


1 1 12 2 2 21

( )n n n n n ni i i t

iWHT I WHT I

Algorithmic Design Spaceg g psize # of DFT # of DCT‐IV

248

1640

1101268

163264

40296

27744162570361280

12631242

1924443362734381512163135424264

128256512

162570361280~1.01 1027~2.31 1061~2 86 10133

7343815121631354242~1.07 1038~2.30 1076~1 06 10153512 2.86 10133 1.06 10153

Different characteristics: data flow, numerical stability operation ordering working set size


stability, operation ordering, working set size, datapath regularity

Design Space: SW DCT 32 on P4Design Space: SW DCT 32 on P4 Histogram of 10,000 randomly selected algorithms

histogram by runtime(P4, 3.2 GHz)


histogram by num. accuracy





Formula to HW (Combinational)Formula to HW (Combinational)• Given where is:

– apply then– apply , then – is a permutation permute– apply , times in parallel

i di l l– is a diagonal scale

B A

A×7

×8


A ×4

×2

DFT8 Datapath Example

xx

x

x x

x

x

x

x

x


82

4222

42222

82428 LLDFTIDIDFTIDIDFTDFT

(formula is applied from right to left)

Pease DFT8 Example8 p

x x x

x x xx

x

x

x

x

xx x x

x x x

stage 1 stage 2 stage 3


How about good HW?g• Matrix formulas have a natural mapping to dataflow and hence combinational datapathdataflow and hence combinational datapath

• However, real hardware designs must fit a given resource constraintresource constraint sequential datapath that reuse available HW

identify repeated kernels– identify repeated kernels– instantiate kernels under resource constraints

h d l t ti t i t ti t d– schedule computation to reuse instantiated kernels

We want to do the analysis and mapping at


We want to do the analysis and mapping at formula level, with high‐level algorithm knowledge

Tensor as Streaming Parallelismg

partially streamedfully parallel fully streamed


Pease DFT Example: DFTPease DFT Example: DFT8x x x

x x xx

x

x

x

x

xx x x

x x x



Pease DFT Example: DFT StreamingPease DFT Example: DFT8 Streaming

x x xx x x

x x x

f(L82)f(L8

2) f(L82)f(L8

2) f(L82)f(L8

2) f(L82)f(R8)

x x x

x x xf(L8

2)f(L82) f(L8

2)f(L82) f(L8

2)f(L82) f(L8

2)f(R8)



Regular Structure for HW• Simple regular structure embodied in Pease FFT

Regular Structure for HW

• Example:


Formally representing horizontal reuseFormally representing horizontal reuse

hr hrhr

not horizontally horizontally partially horizontally


yreused

yreused

p y yreused

Iterative Reuse of Logicg

latencyt


latencyFine-grained control over cost/latency tradeoff

cost

Example: rewriting rulesf ifor streaming reuse


26

Applicability to other transforms?pp y• DFT radix 2

1

0

22222 11

k

ii

k

kkk LDFTITR

• DFT radix 2r

0i

1/

0

22222

rk

i

k

rkrrkk LDFTITR

• 2‐D DFTnxn 1

2

nnnn DFTIL

0i

nxn

• WHT

0i

nnn

1/

2rk

k

kk LWHTIWHT

• DCT (type II) Hk

kkk

PLLLADP 221

2

0

222i

rkrrk LWHTI


• DCT (type II) H

iik kkk PLLLADP 2

22

22

0

22 11

FPGA: Area vs ThroughputFPGA: Area vs. Throughput

Pareto optimal

49x slices132x throughput


28





A 2D‐FFT AlgorithmA 2D‐FFT Algorithm• Row‐column algorithm:g

2D-

Row StageColumn Stage

Dataset:Dataset:(Logical abstractionof the 2D dataset)

… …


Off‐chip Data SetsOff chip Data Sets

off‐chipDRAM

DFTnk l

on‐chipSRAMDRAM kernelSRAM

• Need to balance– kernel processing bandwidth– off‐chip memory bandwidth


– on‐chip storage capacity

Inefficient DRAM Access Patterns• Row‐wise traversal ‐> Sequential accesses• Column‐wise traversal ‐> Large strided accesses

row‐major 2D array

lin

0

n

ear mem

…

Row buffersizem

space…


nn2

How to Optimize the Access Patterns p

…

0k2

row‐major “blocked”

nk

linear mn

…

…

kRow buffer

sizemem

spa

n

…

2

ace


n2in row‐buffersized chunks [Akin, et al., FCCM 2012]

Design Generator w/ Tensor Formalismg /

row column algorithm

row stagecolumn stage

2D row‐column algorithm

symmetric algorithm

2D-

symmetric algorithm

symmetric algorithm y gwith tiling

read tileslinearizehi

FFT processingtransposeand re‐tile

write tilescolumn‐wise


row‐wiseon‐chipand re‐tileon‐chip

column wise

[Akin, et al., FCCM 2012]

2D‐FFT (double) Raw Performance( )


Problem Size[Akin, et al., FCCM 2012]

2D‐FFT (double) BW Efficiency( ) y



2D‐FFT (double) Power Efficiency( ) y



ConclusionsConclusions• Encapsulating domain knowledge in a domain p g gspecific tool for high‐level design automation

• SPIRAL– mathematical approach to DSP transform implementation (cores and “un”‐core)

– generalizable to other linear DSP transforms– as good as best expert designer

• Thank you


Documents

SPIRAL DSP Transform Compilerinst.eecs.berkeley.edu/~cs294-88/sp13/lectures/SPIRAL... · 2013-04-08 · SPIRAL DSP Transform Compiler: Application Specific Hardware Synthesis Peter