Layout Driven Data Communication Optimization for High Level Synthesis Ryan Kastner, Wenrui Gong, Xin Hao, Forrest Brewer Dept. of Electrical and Computer Engineering University of California, Santa Barbara Adam Kaplan, Philip Brisk and Majid Sarrafzadeh Computer Science Department University of California, Los Angeles


Layout Driven Data Communication Optimization

for High Level Synthesis

Ryan Kastner, Wenrui Gong, Xin Hao, Forrest Brewer

Dept. of Electrical and Computer Engineering

University of California, Santa Barbara

Adam Kaplan, Philip Brisk and Majid Sarrafzadeh

Computer Science Department

University of California, Los Angeles


High Level Synthesis

Input: Application description written in *C (C, SystemC, HandelC, SpecC)

    for (y_pos=ygrid_start-y_fmid-1,res_pos=0; y_pos<0; y_pos+=ygrid_step) {
      for (x_pos=xgrid_start-x_fmid-1; x_pos<0; x_pos+=xgrid_step,res_pos++) {
        (*reflect)(filt,x_fdim,y_fdim,x_pos,y_pos,temp,FILTER);
        sum=0.0;
        for (y_filt_lin=x_fdim,x_filt=y_im_lin=0;
             y_filt_lin<=filt_size;
             y_im_lin+=x_dim,y_filt_lin+=x_fdim)
          for (im_pos=y_im_lin; x_filt<y_filt_lin; x_filt++,im_pos++)
            sum+=image[im_pos]*temp[x_filt];
        result[res_pos] = sum;
      }
      first_col = x_pos+1;
      (*reflect)(filt,x_fdim,y_fdim,0,y_pos,temp,FILTER);

Internal filter of an image convolver

SSA CDFG

Maximize “performance” (area, latency, power, …) subject to input constraints

Output: “Hardware” (RTL Specification)


Target Architectures

“Spatial” architectures
  Local control between data path, global data flow between control nodes
  Lots of distributed computational units, memory
  Coarse/fine grained reconfigurable architectures

Techniques could be used for other architectures
  May not make sense
  Our design flow has little resource sharing

[Figures: fine grain configurable platform; coarse grain programmable platform]


Obligatory Design Flow Slide

Application Specification → SUIF: Syntactic & Semantic Analysis → AST → Machine SUIF: Compiler Backend → SSA CDFG → Behavioral Synthesis → Logical & Physical Synthesis

1. Create interface
2. Transform instruction list to dataflow graph
3. Transform dataflow graph to behavioral HDL code
4. Synthesize behavioral HDL code to RTL code
5. Create CFG interface
6. Determine structural control and data communication between basic block entities
7. Generate synthesizable RTL code
8. Synthesize RTL code

[Figure labels: Basic Block Entity (“entity basic_block is … architecture behavioral of basic_block …”); CFG Entity (“entity cfg is … architecture behavioral of cfg …”); Entity 1, Entity 2, Entity 3, Entity 4; a dataflow graph of + and * operations]


Design Example

    int FAST(real *b, int n) {
      real fn;
      int i, in, nn, n2pow, n4pow, nthpo;
      n2pow = fastlog2(n);
      if(n2pow <= 0)
        return 0;
      nthpo = n;
      fn = nthpo;
      n4pow = n2pow / 2;

      /* radix 2 iteration required; do it now */
      if(n2pow % 2) {
        nn = 2;
        in = n / nn;
        FR2TR(in, b, b + in);
      } else
        nn = 1;

      /* perform radix 4 iterations */
      for(i = 1; i <= n4pow; i++) {
        nn *= 4;
        in = n / nn;
        FR4TR(in, nn, b, b + in, b + 2 * in, b + 3 * in);
      }

      /* perform inplace reordering */
      FORD1(n2pow, b);
      FORD2(n2pow, b);

      /* take conjugates */
      for(i = 3; i < n; i += 2)
        b[i] = -b[i];

      return 1;
    }

[Figure: control flow graph of FAST, with nodes labeled Node 1 through Node 10]

“FAST” function from MediaBench

Some nodes missing - simple computation, merged into others

Lines below show data communication


Characterizing Data Communication

Examples of data communication schemes

[Figure: two schemes connecting Control Nodes 2, 3 and 4; figure labels: Memory (Register Bank, RAM), Bus]

Distributed: data communication = wire
Centralized: data communication = memory access


Identifying Data Communication

Determine relationship between place(s) where data is defined and where data is used

[Figure: example control flow graph whose blocks define the variables a, b and c and whose later blocks use them]

Naïve method: all use-points of a variable depend on all definitions of that variable

Not all use points “use” a variable

Need analysis to minimize the amount of data communication

Global Data Communication = 5 variables


Use of SSA in Compilation

Must determine relationship between where data is generated and where data is used

Problem formulations
  [DAC03]: Minimize the total number of bits communicated between all pairs of control nodes
  Today: Minimize overall wirelength

SSA (Static Single Assignment)
  Changes each variable to have a unique definition point
  Must add Φ-nodes to merge definitions

[Figure: the example CFG before and after SSA conversion; definitions are renamed a1, a2, a3, b1, b2, c1, and a Φ-node a4 ← Φ(a2, a3) merges a2 and a3]


SSA Fundamentals

SSA algorithms
  Find location of Φ-nodes
  Rename variables

Three main SSA algorithms
  Minimal, Pruned: Cytron et al.
  Semi-pruned: Briggs et al.

Differ in number and location of Φ-nodes
  Minimal: insert Φ-nodes at iterated dominance frontier (IDF)
  Semi-pruned: insert Φ-node at IDF if variable live outside some basic block
  Pruned: insert Φ-node at IDF if variable live at that time

[Figure: the example CFG in three SSA forms]
  Minimal: a4 ← Φ(a2, a3), b3 ← Φ(b1, b2), c2 ← Φ(c1)
  Semi-Pruned: a4 ← Φ(a2, a3), b3 ← Φ(b1, b2)
  Pruned: a4 ← Φ(a2, a3)
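All three SSA variants start from the iterated dominance frontier (IDF) of the blocks that define a variable. The following is a minimal sketch of that computation, not the authors' implementation: the CFG, the block names and the function names are hypothetical, and every block other than the entry is assumed to be reachable. Semi-pruned and pruned SSA would additionally filter the resulting blocks by liveness, which is not shown.

```python
def predecessors(cfg):
    """Map each block to the set of blocks that branch to it."""
    preds = {n: set() for n in cfg}
    for n, succs in cfg.items():
        for s in succs:
            preds[s].add(n)
    return preds


def dominators(cfg, entry):
    """Iterative solution of dom[n] = {n} | intersection of dom[p] over predecessors p."""
    preds = predecessors(cfg)
    dom = {n: set(cfg) for n in cfg}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in cfg:
            if n == entry:
                continue
            new = {n} | set.intersection(*(dom[p] for p in preds[n]))
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def dominance_frontiers(cfg, entry):
    """DF(x) = blocks y where x dominates a predecessor of y but not strictly y."""
    preds = predecessors(cfg)
    dom = dominators(cfg, entry)
    df = {n: set() for n in cfg}
    for y in cfg:
        for p in preds[y]:
            for x in dom[p]:
                strictly_dominates_y = x != y and x in dom[y]
                if not strictly_dominates_y:
                    df[x].add(y)
    return df


def iterated_dominance_frontier(df, def_blocks):
    """Close the dominance frontier under itself, starting from the defining blocks."""
    idf = set()
    work = list(def_blocks)
    while work:
        block = work.pop()
        for y in df[block]:
            if y not in idf:
                idf.add(y)
                work.append(y)
    return idf


# Hypothetical CFG: a diamond (A -> B/C -> D) inside a loop (D -> A).
cfg = {
    "entry": ["A"],
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["A", "exit"],
    "exit": [],
}
df = dominance_frontiers(cfg, "entry")
# A variable defined in B and C needs Φ-nodes in D (the merge) and in A (the loop header).
phi_blocks = iterated_dominance_frontier(df, {"B", "C"})
```

In this example the frontier of B and C is only D, but a Φ-node in D is itself a new definition, so the iteration also reaches the loop header A.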


Results: SSA for Data Comm. Minimization

Edge Weight w(i,j): number of bits communicated from node i to j

Total Edge Weight (TEW): corresponds to amount of data communication

$$\mathrm{TEW} = \sum_{i} \sum_{j} w(i,j)$$

[Figure: TEW ratio (vs. pruned) per benchmark, logarithmic axis from 0.1 to 100; series: Minimal, Semi-pruned]

“MediaBench”marks
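As a small worked instance of the TEW definition, the sketch below accumulates edge weights from a list of transfers and sums them. The transfer list, the node numbers and the bit widths are invented for illustration and are not taken from the benchmarks.

```python
from collections import defaultdict


def edge_weights(transfers):
    """Accumulate w(i, j): total bits communicated from control node i to node j."""
    w = defaultdict(int)
    for src, dst, bits in transfers:
        w[(src, dst)] += bits
    return dict(w)


def total_edge_weight(w):
    """TEW is the sum of w(i, j) over all ordered pairs of control nodes."""
    return sum(w.values())


# Hypothetical transfers: (source node, destination node, bit width of the value).
transfers = [(1, 2, 32), (1, 2, 32), (1, 3, 32), (2, 4, 16)]
w = edge_weights(transfers)
tew = total_edge_weight(w)
```

Here two 32-bit values travel from node 1 to node 2, so w(1,2) is 64 and the TEW is 64 + 32 + 16 = 112 bits.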


Further Minimizing Data Communication

Current SSA algorithms place Φ-nodes temporally
  In software compilation, live ranges should be short
  Appropriate in hardware?

[Figure: the example CFG with a4 ← Φ(a2, a3) under temporal Φ-node distribution and under spatial Φ-node distribution; TEW values shown: 4 and 3]


Spatial Φ-nodes Distribution Algorithm

s: number of Φ-node source values
d: number of uses of Φ-node destination

Number of temporal links: $C_T = s + d$
Number of spatial links: $C_S = s \cdot d$

[Figure: a3 ← Φ(a0, a1, a2) feeding two uses of a3; s = 3, d = 2]

Optimal assuming “ideal” n-dimensional floorplan

1. Given a CDFG G(N_cfg, E_cfg)
2. perform_SSA(G)
3. calculate_def_use_chains(G)
4. remove_back_edges(G)
5. topological_sort(G)
6. for each node n ∈ N_cfg
7.   for each Φ-node Φ ∈ n
8.     s ← |Φ.sources|
9.     d ← |def_use_chain(Φ.dest)|
10.    if s · d < s + d
11.      move_to_spatial_locations(Φ)
12. restore_back_edges(G)
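The test in step 10 compares the two link counts for one Φ-node. A minimal sketch of that comparison follows, with hypothetical function names; it only evaluates the counts and does not move any Φ-node.

```python
def temporal_links(s, d):
    """One Φ-node at the merge point: s incoming links plus d outgoing links."""
    return s + d


def spatial_links(s, d):
    """One Φ-node copy per use: each of the d copies receives all s sources."""
    return s * d


def prefer_spatial(s, d):
    """Step 10 of the algorithm: move the Φ-node only if that uses fewer links."""
    return spatial_links(s, d) < temporal_links(s, d)


# The slide's example: s = 3 sources, d = 2 uses.
slide_example = (temporal_links(3, 2), spatial_links(3, 2), prefer_spatial(3, 2))

# For positive integers, s*d < s+d is equivalent to (s-1)*(d-1) < 1,
# so the spatial form wins exactly when s == 1 or d == 1.
wins = [(s, d) for s in range(1, 6) for d in range(1, 6) if prefer_spatial(s, d)]
```

For the slide's example the counts are 5 temporal links against 6 spatial links, so that Φ-node stays where it is; a Φ-node with two sources and a single use (2 against 3) would be moved.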


Physically Aware Compiler Transforms

Let’s Get Physical!

Consider layout information during compilation
  Modify transforms to consider physical info

Ideal: full physical synthesis (extremely accurate, but way too time consuming)

Approximate using floorplanning
  Much faster
  Gives “good enough” high level physical picture

Our previous data comm. work
  No physical information
  Can lead to negative results

[Figure: application → Hardware Compilation, with a Floor-planner between Hardware Compilation and Physical Synthesis]


Physically Aware Data Communication

Modify placement of Φ-functions to consider wirelength

Φ-Placement Algorithm

1. Given a CFG G_cfg(V_cfg, E_cfg)
2. perform_ssa(G_cfg)
3. calculate_def_use_chains(G_cfg)
4. remove_back_edges(G_cfg)
5. topological_sort(G_cfg)
6. foreach vertex v ∈ V_cfg
7.   foreach Φ-node Φ ∈ v
8.     s ← Φ.sources
9.     d ← |def_use_chain(Φ.dest)|
10.    IDF ← iterated_dominance_frontier(s)
11.    PossiblePlacements ← findPlacementOptions(IDF)
12.    place(Φ) ← selectBest(PossiblePlacements)
13.    distribute/duplicate Φ to place(Φ)

FindPlacementOptions Algorithm

1. Given a set of CFG Nodes R
2. Φ-options ← ∅
3. insert(R) into Φ-options
4. foreach instruction i ∈ R
5.   if (i is a destination of Φ-function f)
6.     return Φ-options
7. temp_Φ-options ← ∅
8. foreach non-dominated child c of R
9.   temp_Φ-options ← crossProductJoin(temp_Φ-options, findPlacementOptions(c))
10. return Φ-options ∪ temp_Φ-options


Algorithm in Action

[Figure: the example CFG with candidate placements of a4 ← Φ(a2, a3), from the traditional (temporal) placement to the spatial [DAC03] placement]

Evaluate all options for Φ-nodes
  Replicate when necessary
  Limit amount of replication - most often leads to more wirelength
  Can play tricks to limit redundant placements

Any of these options could yield the best wirelength

Highly dependent on the floorplan
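To illustrate how the floorplan decides which placement wins, here is a rough sketch that scores each placement option by Manhattan wirelength between block centers. The cost model, block names, coordinates and the `select_best` helper are all assumptions made for this example; the slides do not specify how selectBest measures wirelength.

```python
def manhattan(p, q):
    """Manhattan distance between two (x, y) block centers."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def wirelength(option, sources, pos):
    """Sum the wires of one placement option.

    An option is a list of (phi_block, uses) pairs: each Φ-node copy sits in
    phi_block, receives every source value, and feeds the listed use blocks.
    """
    total = 0
    for phi_block, uses in option:
        total += sum(manhattan(pos[s], pos[phi_block]) for s in sources)
        total += sum(manhattan(pos[phi_block], pos[u]) for u in uses)
    return total


def select_best(options, sources, pos):
    """Return the name of the option with the smallest wirelength."""
    return min(options, key=lambda name: wirelength(options[name], sources, pos))


# Hypothetical Φ-node: sources defined in B2 and B3, merged value used in B5 and B6.
sources = ["B2", "B3"]
options = {
    "temporal": [("B4", ["B5", "B6"])],           # single Φ-node at the merge block
    "spatial": [("B5", ["B5"]), ("B6", ["B6"])],  # one copy inside each use block
}

# Floorplan A: the merge block sits between the sources and the uses.
floorplan_a = {"B2": (0, 0), "B3": (2, 0), "B4": (1, 1), "B5": (0, 2), "B6": (2, 2)}
# Floorplan B: the merge block ended up far from everything else.
floorplan_b = {"B2": (0, 0), "B3": (1, 0), "B4": (10, 10), "B5": (0, 1), "B6": (1, 1)}

best_a = select_best(options, sources, floorplan_a)
best_b = select_best(options, sources, floorplan_b)
```

With floorplan A the temporal placement costs 8 units against 12 for the spatial one; with floorplan B the costs are 76 against 6, so the same Φ-node is best placed differently once the blocks move.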


Algorithm in Action

FAST function from MediaBench testsuite

[Figure: CFG of FAST with T/F branch edges, highlighting nodes N3 and N9 and the SSA values nn_4, i_2 and nn_5, i_3]


Algorithm in Action

[Figure: two versions of the FAST CFG (T/F branch edges, nodes N3 and N9), each showing the SSA values nn_4, i_2 and nn_5, i_3]


Full Floorplanning Results

Simple iterative approach
1. Initial optimization minimizes data communication
2. Full SA based floorplanning
3. Reoptimization to minimize wirelength, based on the floorplan
4. Full SA based floorplanning

[Figure: flow with Hardware Compilation, Full Floor-planner and Physical Synthesis]

[Figure: Floorplan Wirelength per benchmark, logarithmic axis from 1 to 10,000,000; series: WL (first), WL (second)]

Spectacularly negative results


Incremental Floorplanning

Incremental Placement [Coudert et al]: Given an optimized placement and a set of changes to the netlist (e.g., due to technology remapping), modify the placement to improve it.

Equally applicable to floorplanning: given an optimized floorplan and a set of changes to the modules (e.g., due to Φ-function movement), modify the floorplan to improve it.

[Figure: Initial Floorplan → Perturbations → Modified Floorplan, with modules labeled 1, 2, 3, 4 and 6]


Our Incremental Floorplanner

[Figure: Initial Floorplan → Perturbations → Modified Floorplan → Incremental Floorplanner → Incremental Floorplan, with modules labeled 1, 2, 3, 4 and 6. The slicing tree nodes are annotated with the value pairs 2/2.3, 9/10.1, 11/12.4, 16/18, 5/5.6, 27/30.4 and 32/36.]


Our Incremental Floorplanner

1. Calculate area & room of each node: bottom up slicing tree traversal

2. Area redistribution
  Top down traversal
  Increase area if necessary
    Not enough space at root
    Aspect ratios become too distorted

[Figure: Modified Floorplan → Incremental Floorplan, with the slicing tree annotated with the value pairs 2/2.3, 9/10.1, 11/12.4, 16/18, 5/5.6, 27/30.4 and 32/36]

Simple, yet effective

Other more complicated algorithms might work better
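The two traversals can be sketched on a small slicing tree as follows. This is an illustration under several assumptions rather than the authors' floorplanner: the value pairs from the figure are read as area/room, the tree shape is inferred from how those pairs add up (2 + 9 = 11, 11 + 16 = 27, 27 + 5 = 32), the module names `m1` to `m4` are hypothetical, room is redistributed in proportion to required area, and the aspect-ratio check is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    area: float = 0.0   # area required by the modules below this node
    room: float = 0.0   # space currently allotted to this node
    children: list = field(default_factory=list)


def compute_area_and_room(node):
    """Step 1, bottom-up: an internal node sums the area and room of its children."""
    if node.children:
        for child in node.children:
            compute_area_and_room(child)
        node.area = sum(c.area for c in node.children)
        node.room = sum(c.room for c in node.children)


def redistribute(node, room):
    """Step 2, top-down: split the available room in proportion to required area."""
    node.room = room
    for child in node.children:
        redistribute(child, room * child.area / node.area)


def incremental_update(root):
    """Re-run both traversals; grow the floorplan only if the root lacks space."""
    compute_area_and_room(root)
    redistribute(root, max(root.room, root.area))
    return root


def build_tree():
    """Leaves use the area/room pairs from the figure; the shape is an assumption."""
    m1, m2 = Node("m1", 2, 2.3), Node("m2", 9, 10.1)
    m3, m4 = Node("m3", 16, 18), Node("m4", 5, 5.6)
    n1 = Node("n1", children=[m1, m2])
    n2 = Node("n2", children=[n1, m3])
    return Node("root", children=[n2, m4]), m1


root, m1 = build_tree()
compute_area_and_room(root)
initial = (root.area, root.room)   # about 32 and 36, matching the 32/36 root label

m1.area = 4                        # hypothetical perturbation: module m1 grows
incremental_update(root)           # 34 still fits in 36, so only room shifts

big_root, big_m1 = build_tree()
big_m1.area = 10                   # hypothetical perturbation that overflows the root
incremental_update(big_root)       # required area 40 exceeds 36, so the root grows
```

When the perturbed area still fits, the existing whitespace is simply shifted toward the module that grew; only when the root runs out of space does the overall floorplan expand.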


MediaBench Functions

 #  Benchmark         Blocks  (unlabeled)  Links  Weight  Initial WL
 1  adpcm coder           33           31     54    2688       35568
 2  adpcm decoder         26           23     44    1952       21588
 3  internal filter       10          143     60   17088      411637
 4  internal expand      101           94    257   14336      317031
 5  compress output       34           17     60    2368       29114
 6  mpeg2dec block        62           13     66    2272       34510
 7  mpeg2dec vector       16            4     26    1024        4366
 8  FAST                  14            4     15     704        3714
 9  FR4TR                 77           87    155     704      340697
10  det                   12            5     13    7936        3772


Incremental Floorplanning Results

[Figure: bar chart of normalized wirelength (axis from 0 to 1.2) for benchmarks 1 to 10 and their average; series: Initial, Overall Optimal, Overall Incremental, Phi Optimal, Phi Incremental]

“Optimal” Approach:
  12% Overall Wirelength Reduction
  25% Phi-node Wirelength Reduction

Our Approach:
  6% Overall Wirelength Reduction
  8% Phi-node Wirelength Reduction


Related Work

Hardware compilation projects using SSA
  PDG+SSA form [UCSB]
  CASH [CMU]
  SA-C [UCR]
  Sea Cucumber [BYU]

Physically aware behavioral synthesis techniques
  SA for scheduling, binding and floorplanning [Prabhakaran97]
  SA for binding and floorplanning [Yung-Ming94]
  Scheduling, allocation and binding [Dougherty00]
  Fasolt: bus topology [Knapp92]
  High level synthesis [Tarafdar00]

Incremental CAD
  Problem overview/challenges [Coudert00]
  Floorplanning [Crenshaw99]


Conclusions

It’s been a long strange trip…

SSA is a nice IR for hardware compilation
  Explicitly shows data flow
  Useful for exploiting parallelism

Compiler techniques applied to hardware design can reduce wirelength
  They must be aware of physical information
  They must use incremental floorplanning


Questions?

(and cue for applause)