Storage Assignment during High-level Synthesis for Configurable Architectures Wenrui Gong Gang Wang Ryan Kastner Department of Electrical and Computer Engineering University of California, Santa Barbara {gong, wanggang, kastner}@ece.ucsb.edu http://express.ece.ucsb.edu November 7, 2005


Storage Assignment during High-level Synthesis

for Configurable Architectures

Wenrui Gong Gang Wang Ryan Kastner

Department of Electrical and Computer Engineering
University of California, Santa Barbara
{gong, wanggang, kastner}@ece.ucsb.edu

http://express.ece.ucsb.edu

November 7, 2005


11/7/2005GONG et al: Storage Assignment 2

What are we dealing with?

FPGA-based reconfigurable architectures with distributed block RAM modules

Synthesizing high-level programs into designs

[Figure: four block RAM modules and configurable logic blocks]


Options of Storage Assignment

Given the same storage/logic resources, different storage assignments exist.

[Figure: two alternative storage assignments, separated by "OR" (labels: control logic, MUX, datapath)]


Objective

Different arrangements achieve different performance.

Objective: achieve the best performance (throughput) under the resource constraints, improve resource utilization, and meet design goals (time, frequency, etc.)


Outline

Target architectures
Data partitioning problem
Memory optimizations
Experimental results
Concluding remarks



Target Architecture

FPGA-based fine-grained reconfigurable computing architecture with distributed block RAM modules


Memory Access Latencies

Memory access delay = BRAM access delay + interconnect delay
  BRAM access time is fixed by the architecture.
  Interconnect delays are variable: one clock cycle to access nearby data, or two or even more to access data far away from the CLB.

As a result, it is difficult to estimate execution time precisely.
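A minimal sketch of this delay model, for illustration only. The one-cycle BRAM access, the `reach_per_cycle` parameter, and the step-shaped interconnect delay are assumptions; the slides give no concrete numbers beyond "one cycle near, two or more far".

```python
import math

BRAM_ACCESS_CYCLES = 1  # assumed: fixed by the architecture


def interconnect_cycles(distance: int, reach_per_cycle: int = 2) -> int:
    """Extra cycles spent on routing; grows in steps with distance (assumed model)."""
    return math.ceil(distance / reach_per_cycle)


def memory_access_cycles(distance: int) -> int:
    """Total access delay = BRAM access delay + interconnect delay."""
    return BRAM_ACCESS_CYCLES + interconnect_cycles(distance)


# A BRAM next to the logic (distance 0) costs one cycle; farther ones cost more.
near = memory_access_cycles(0)
far = memory_access_cycles(3)
```

Because the distance is only known after placement and routing, a pre-layout estimate has to guess it, which is why execution time is hard to predict.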


Outline

Target architectures
Data partitioning problem
  Problem formulation
  Data partitioning algorithm
Memory optimizations
Experimental results
Concluding remarks


Problem Formulation

Inputs:
  An l-level nested loop L
  A set of n data arrays N
  An architecture with BRAM modules M

Partitioning problem: partition data arrays N into a set of data portions P, and seek an assignment from P to block RAM modules M.

Objective: optimize latency

[Figure: four block RAM modules and configurable logic blocks]
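The formulation can be pictured with a small sketch: split an array into portions P, then map each portion to a module in M. The row-wise split, the one-portion-per-module mapping, and the module names are illustrative assumptions, not the authors' algorithm.

```python
def partition_rows(num_rows, num_portions):
    """Split row indices 0..num_rows-1 into contiguous, near-equal portions."""
    base, extra = divmod(num_rows, num_portions)
    portions, start = [], 0
    for p in range(num_portions):
        size = base + (1 if p < extra else 0)
        portions.append(range(start, start + size))
        start += size
    return portions


def assign(portions, modules):
    """Map each portion to one block RAM module (one-to-one, by assumption)."""
    if len(portions) > len(modules):
        raise ValueError("more portions than block RAM modules")
    return {modules[i]: portion for i, portion in enumerate(portions)}


portions = partition_rows(10, 4)
assignment = assign(portions, ["BRAM0", "BRAM1", "BRAM2", "BRAM3"])
```

A real assignment would also have to respect each module's capacity and its distance to the logic that uses the data.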


Overview of Data Partitioning Algorithm

Code analysis
  Determine possible partitioning directions

Architectural-level synthesis
  Resource allocation, scheduling, and binding
  Discover the design properties

Granularity adjustment
  Use an experimental cost function to estimate performance


Code Analysis

Iteration space and data spaces

Index functions determine access footprints

[Figure: iteration space mapped onto data space S]
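To make "access footprint" concrete, here is a small sketch that applies index functions to a block of iterations and collects the data elements they touch. The 3x3 stencil offsets are a hypothetical example chosen to resemble an image filter, not taken from the slides.

```python
from itertools import product

# Hypothetical index functions: iteration (i, j) reads A[i + di][j + dj]
# for every offset (di, dj) in a 3x3 window.
OFFSETS = list(product((-1, 0, 1), repeat=2))


def footprint(iterations):
    """Set of data-space elements touched by a set of iterations."""
    return {(i + di, j + dj) for (i, j) in iterations for (di, dj) in OFFSETS}


# A 2x2 block of the iteration space: (1,1), (1,2), (2,1), (2,2)
tile = list(product(range(1, 3), range(1, 3)))
fp = footprint(tile)  # covers rows 0..3 and columns 0..3 of the data space
```

Four iterations touch sixteen array elements here, so the footprint is larger than the iteration block that produces it.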


Iteration/Data Space Partitioning

Partitioning the iteration space derives a corresponding partitioning of the data spaces, using the index functions

Communication-free partitioning

[Figure: partitioned iteration space and the corresponding partitions of data space S]


Iteration/Data Space Partitioning

Communication-efficient partitioning
  Data access footprints overlap
  Overlapping footprints cause remote memory accesses when the data are not placed together

[Figure: partitioned iteration space and overlapping footprints in data space S]
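The difference between the two cases can be sketched by comparing the footprints of two neighbouring partitions. The row-wise split and the `halo` parameter (how many neighbouring rows each iteration reads) are assumptions made for illustration.

```python
def data_rows_read(iter_rows, num_rows, halo):
    """Data rows read by a block of iteration rows, clipped to the array bounds."""
    lo = max(min(iter_rows) - halo, 0)
    hi = min(max(iter_rows) + halo, num_rows - 1)
    return set(range(lo, hi + 1))


def remote_rows(iter_rows, num_rows, halo):
    """Rows a block reads that are stored with another block (assumed row-wise storage)."""
    return data_rows_read(iter_rows, num_rows, halo) - set(iter_rows)


upper, lower = range(0, 4), range(4, 8)

# halo = 1: footprints overlap, so each block needs one remote row
overlap = data_rows_read(upper, 8, 1) & data_rows_read(lower, 8, 1)

# halo = 0: footprints are disjoint, so the partitioning is communication-free
no_overlap = data_rows_read(upper, 8, 0) & data_rows_read(lower, 8, 0)
```

With a halo of one row, rows 3 and 4 are shared, and whichever block does not hold a shared row locally must fetch it from a farther block RAM.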


Architectural-level Synthesis

Synthesize the innermost iteration body
  Pipelining designs

Collect performance results
  Execution time T, initiation interval II, and resource utilizations u_mul, u_BRAM, and u_CLB


Estimating the Execution Time

Resource utilizations determine the performance of the pipelined designs

Execution time is linear in the number of initiation intervals and the granularity.

When more resources are unoccupied, more operations can be performed simultaneously.
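One way to see the linear relationship is the textbook cycle count for a pipelined loop, depth + II * (iterations - 1). This is a standard model used here as an assumption, not the authors' estimator; `depth` and the even split of iterations across partitions are hypothetical.

```python
import math


def pipelined_cycles(iterations: int, ii: int, depth: int) -> int:
    """Cycles for a pipelined loop: fill the pipeline once, then one result every II cycles."""
    if iterations <= 0:
        return 0
    return depth + ii * (iterations - 1)


def partitioned_cycles(total_iterations: int, partitions: int, ii: int, depth: int) -> int:
    """Each partition runs its share of the iterations in parallel (assumed even split)."""
    per_partition = math.ceil(total_iterations / partitions)
    return pipelined_cycles(per_partition, ii, depth)


single = pipelined_cycles(100, ii=2, depth=5)        # one datapath
split = partitioned_cycles(100, 4, ii=2, depth=5)    # four datapaths
```

Under this model a finer granularity shortens the run almost proportionally, as long as II does not grow because of remote accesses.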


Granularity Adjustment

For each possible partitioning direction, check different granularities to obtain the best performance
  Coarsest: use as few block RAM modules as possible

[Figure: control logic and datapath]


Granularity Adjustment

For each possible partitioning direction, check different granularities to obtain the best performance
  Finest: distribute data to all block RAM modules

[Figure: control logic and multiple datapaths]
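The search between the coarsest and finest settings can be sketched as a loop over directions and granularities. The power-of-two candidates, the direction names, and `toy_cost` are all hypothetical placeholders; the slides do not give the actual cost function.

```python
def candidate_granularities(num_modules):
    """From coarsest (1 module) to finest (all modules); powers of two by assumption."""
    g = 1
    while g <= num_modules:
        yield g
        g *= 2


def best_granularity(directions, num_modules, estimate_cost):
    """Return (cost, direction, granularity) with the lowest estimated cost."""
    return min(
        (estimate_cost(d, g), d, g)
        for d in directions
        for g in candidate_granularities(num_modules)
    )


def toy_cost(direction, g):
    """Hypothetical cost: work shrinks with g, communication overhead grows with g."""
    overhead = 3 if direction == "row" else 5
    return 64 / g + overhead * (g - 1)


best = best_granularity(["row", "column"], 8, toy_cost)
```

In this toy setting the best choice lies between the two extremes, which is the situation the granularity adjustment step is meant to detect.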


Cost Function

An empirical formulation based on our architectural-level synthesis results.
  Estimate global memory accesses m_r and total memory accesses m_t, and their ratio

A factor rewards memory accesses to nearby block RAM modules
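The ratio itself is simple to compute once accesses are classified. The sketch below assumes that a "global" access is one that targets a block RAM other than the datapath's own (home) module; the slides do not define the term precisely, and the full cost function is not given here.

```python
def remote_access_ratio(accesses, home_module):
    """Return (m_r, m_t, m_r / m_t) for a datapath placed next to home_module.

    accesses: iterable of block RAM ids, one entry per memory access.
    """
    accesses = list(accesses)
    m_t = len(accesses)                                   # total memory accesses
    m_r = sum(1 for m in accesses if m != home_module)    # assumed: non-local accesses
    ratio = m_r / m_t if m_t else 0.0
    return m_r, m_t, ratio


# Eight accesses, two of which go to a neighbouring module
m_r, m_t, ratio = remote_access_ratio([0, 0, 0, 1, 0, 1, 0, 0], home_module=0)
```

A lower ratio means more accesses stay in nearby block RAM, which is what the cost function favors.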


Outline

Target architectures
Data partitioning problem
Memory optimizations
  Scalar replacement
  Data prefetching
Experimental results
Concluding remarks


Scalar Replacement

Scalar replacement increases data reuse and reduces memory accesses
  The memory location was already accessed in the previous iteration
  Use the contents already in registers rather than accessing memory again
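A small before/after sketch of the transformation on a 3-tap sum, counting memory reads. The filter and the read counters are illustrative assumptions; plain variables stand in for registers.

```python
def filter_naive(a):
    """Every iteration reads three array elements from memory."""
    out, reads = [], 0
    for i in range(1, len(a) - 1):
        out.append(a[i - 1] + a[i] + a[i + 1])
        reads += 3
    return out, reads


def filter_scalar_replaced(a):
    """Values read in earlier iterations are kept in scalars (registers)."""
    if len(a) < 3:
        return [], 0
    prev, cur = a[0], a[1]   # two reads before the loop
    out, reads = [], 2
    for i in range(1, len(a) - 1):
        nxt = a[i + 1]       # the only memory read in the loop body
        reads += 1
        out.append(prev + cur + nxt)
        prev, cur = cur, nxt  # rotate registers for the next iteration
    return out, reads


data = list(range(10))
naive_out, naive_reads = filter_naive(data)
fast_out, fast_reads = filter_scalar_replaced(data)
```

Both versions produce the same output, but the second needs one read per iteration instead of three.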


Data Prefetching and Buffer Insertion

Buffer insertion shortens critical paths and improves clock frequencies.
  Schedule the global memory access one cycle earlier
  One cycle (or two, or more), depending on the size of the chip and the number of BRAMs
  Reduces the length of critical paths
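A behavioural sketch of the idea: the far-away read for the next iteration is issued one step early and parked in a buffer, so the computation never waits on it directly. This is a software model under assumed names (`remote`, `local`, `buffer`), not a description of the generated hardware.

```python
def sum_with_prefetch(remote, local):
    """Compute remote[i] + local[i], fetching remote[i] one iteration early."""
    n = len(remote)
    out = []
    if n == 0:
        return out
    buffer = remote[0]              # prologue: first remote read issued ahead of the loop
    for i in range(n):
        value = buffer              # consume the value fetched earlier
        if i + 1 < n:
            buffer = remote[i + 1]  # prefetch for the next iteration
        out.append(value + local[i])
    return out


result = sum_with_prefetch([1, 2, 3], [10, 20, 30])
```

In hardware the buffer is a register stage on the long route from the distant block RAM, which is what breaks up the critical path.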


Outline

Target architectures
Data partitioning problem
Memory optimizations
Experimental results
Concluding remarks


Experimental Setup

Target architecture: Xilinx Virtex II FPGA
Target frequency: 150 MHz
Benchmarks: image processing applications and DSP
  SOBEL edge detection
  Bilinear filtering
  2D Gauss blurring
  1D Gauss filter
  SUSAN principle


Collected Results

Pre-layout and post-layout timing and area results are collected
  Original: one block RAM is assigned to the entire data array
  Partitioned: the iteration/data spaces are partitioned under resource constraints
  Optimized: memory optimizations are applied to the partitioned designs

SOBEL (large)

                          Pre-layout timing/area                Post-layout timing/area
Design       # of cycles  Freq (MHz)  Latency (ms)  Area (%)    Freq (MHz)  Latency (ms)  Area (%)
original     29,718       160.9       184.7         3.32        151.19      196.6         4.10
partitioned  2,032        145.92      13.9          41.97       105.37      19.2          52.60
optimized    263          185.19      1.4           44.32       125.94      2.1           53.91
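As a quick check on how the SOBEL (large) columns relate, dividing the cycle count by the frequency in MHz reproduces the numeric values of the latency column to within rounding. The sketch below only uses numbers from the table; the resulting unit depends on how cycles are counted, which the slide does not state.

```python
# (cycles, pre-layout MHz, post-layout MHz) taken from the SOBEL (large) table
designs = {
    "original": (29_718, 160.9, 151.19),
    "partitioned": (2_032, 145.92, 105.37),
    "optimized": (263, 185.19, 125.94),
}


def latency(cycles: int, freq_mhz: float) -> float:
    """Latency value as cycles divided by clock frequency in MHz."""
    return cycles / freq_mhz


pre = {name: latency(c, f_pre) for name, (c, f_pre, _) in designs.items()}
post = {name: latency(c, f_post) for name, (c, _, f_post) in designs.items()}

# Speedup of each design over the original, after layout
post_speedup = {name: post["original"] / t for name, t in post.items()}
```

The partitioned and optimized designs run at lower post-layout frequencies than the original, yet their much smaller cycle counts dominate the latency.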


Results: Execution Time

Average speedup: 2.75 times
  Under the given resources, partitioned into 4 portions.

After further optimizations: 4.80 times faster.

[Figure: normalized latencies (0 to 1.2) of the original, partitioned, and optimized designs for SUSAN, Bilinear, 1D Gauss, and 2D Gauss]


Results: Achievable Clock Frequencies

Partitioned designs are about 10 percent slower than the original ones.
After optimizations, designs are about 7 percent faster than the partitioned ones.

[Figure: achievable clock frequencies (0 to 200) of the original, partitioned, and optimized designs for SUSAN, Bilinear, 1D Gauss, 2D Gauss, and two SOBEL configurations]


Outline

Target architectures
Data partitioning problem
Memory optimizations
Experimental results
Concluding remarks


Concluding Remarks

A data and iteration space partitioning approach for homogeneous block RAM modules
  Integrated with existing architectural-level synthesis techniques
  Parallelizes input designs
  Dramatically improves system performance


Thank You

Prof. Ryan Kastner and Gang Wang
Reviewers
All audiences


Questions