
Storage Assignment during High-level Synthesis

for Configurable Architectures

Wenrui Gong Gang Wang Ryan Kastner

Department of Electrical and Computer Engineering
University of California, Santa Barbara
{gong, wanggang, kastner}@ece.ucsb.edu

http://express.ece.ucsb.edu

November 7, 2005


What are we dealing with?

FPGA-based reconfigurable architectures with distributed block RAM modules

Synthesizing high-level programs into designs

[Figure: four block RAM modules and configurable logic blocks]

Options of Storage Assignment

Given the same storage and logic resources, different storage assignments exist.

[Figure: two alternative storage assignments (joined by "OR"), built from control logic, datapaths, and a MUX]


Objective

Different arrangements achieve different performance.

Objective: achieve the best performance (throughput) under the resource constraints, improve resource utilization, and meet design goals (time, frequency, etc.)


Outline

Target architectures
Data partitioning problem
Memory optimizations
Experimental results
Concluding remarks



Target Architecture

FPGA-based fine-grained reconfigurable computing architecture with distributed block RAM modules


Memory Access Latencies

Memory access delay = BRAM access delay + interconnect delay
BRAM access time is fixed by the architecture.
Interconnect delays are variable: one clock cycle to access nearby data, two or more to access data far away from the CLB.
This makes it difficult to estimate execution time precisely.
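The delay decomposition above can be illustrated with a small model. This is only a sketch: the grid coordinates, the Manhattan-distance measure, and the constants `BRAM_ACCESS_CYCLES` and `REACH_PER_CYCLE` are assumptions made for illustration, not values from the talk or from any device datasheet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """Grid coordinates of a CLB region or a block RAM (hypothetical units)."""
    x: int
    y: int


BRAM_ACCESS_CYCLES = 1  # assumed fixed BRAM access time, in clock cycles
REACH_PER_CYCLE = 8     # assumed distance a signal can travel per clock cycle


def access_cycles(clb: Location, bram: Location) -> int:
    """Cycles to access `bram` from `clb`: fixed BRAM delay plus interconnect delay."""
    distance = abs(clb.x - bram.x) + abs(clb.y - bram.y)  # Manhattan distance
    # Interconnect delay grows with distance; nearby data costs no extra cycle.
    interconnect_cycles = distance // REACH_PER_CYCLE
    return BRAM_ACCESS_CYCLES + interconnect_cycles
```

With these assumed constants, a block RAM next to the logic is reached in one cycle, while one on the far side of the chip takes several, which is why the placement of data matters for latency.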


Outline

Target architectures
Data partitioning problem
    Problem formulation
    Data partitioning algorithm
Memory optimizations
Experimental results
Concluding remarks


Problem Formulation

Inputs:
    An l-level nested loop L
    A set of n data arrays N
    An architecture with BRAM modules M

Partitioning problem: partition the data arrays N into a set of data portions P, and seek an assignment from P to the block RAM modules M.

Objective: optimize latency

[Figure: four block RAM modules and configurable logic blocks]
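One way to picture the formulation is as two small data structures: the set of portions P and a mapping from P to M. The sketch below assumes row-wise portions and a round-robin assignment; both choices, and the names `Portion`, `partition_rows`, and `assign`, are hypothetical and only illustrate the shape of the problem, not the algorithm presented in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Portion:
    array: str      # name of a data array in N
    first_row: int  # inclusive
    last_row: int   # exclusive


def partition_rows(array: str, rows: int, parts: int) -> list[Portion]:
    """Split `rows` rows of `array` into `parts` contiguous, nearly equal portions."""
    base, extra = divmod(rows, parts)
    portions, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        portions.append(Portion(array, start, start + size))
        start += size
    return portions


def assign(portions: list[Portion], brams: list[str]) -> dict[Portion, str]:
    """Map portions to BRAM modules round-robin (one naive assignment from P to M)."""
    return {p: brams[i % len(brams)] for i, p in enumerate(portions)}
```

For example, `assign(partition_rows("A", 10, 4), ["bram0", "bram1", "bram2", "bram3"])` places one portion of A in each of four modules. A real assignment would be chosen to optimize latency rather than round-robin.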


Overview of Data Partitioning Algorithm

Code analysis
    Determine possible partitioning directions

Architectural-level synthesis
    Resource allocation, scheduling, and binding
    Discover the design properties

Granularity adjustment
    Use an empirical cost function to estimate performance


Code Analysis

Iteration space and data spaces

Index functions determine access footprints

[Figure: iteration space and data space S]
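To make the idea of an access footprint concrete, the sketch below applies a set of index functions to every point of a rectangular iteration sub-space and collects the array elements touched. The loop body (a four-neighbor stencil on an array A) is an assumed example, not one taken from the benchmarks.

```python
from itertools import product

# Assumed loop body: B[i][j] = A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]
# Each index function maps an iteration (i, j) to one element of array A.
INDEX_FUNCTIONS = [
    lambda i, j: (i - 1, j),
    lambda i, j: (i + 1, j),
    lambda i, j: (i, j - 1),
    lambda i, j: (i, j + 1),
]


def footprint(i_range: range, j_range: range) -> set[tuple[int, int]]:
    """Elements of A touched by all iterations in i_range x j_range."""
    return {f(i, j) for i, j in product(i_range, j_range) for f in INDEX_FUNCTIONS}
```

Calling `footprint(range(1, 2), range(1, 2))` returns the four neighbors of element (1, 1). Larger iteration blocks yield footprints that extend one element beyond the block on every side.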


Iteration/Data Space Partitioning

Partitioning the iteration space derives a corresponding partitioning of the data spaces, using the index functions.

Communication-free partitioning



Iteration/Data Space Partitioning

Communication-efficient partitioning
    Data access footprints overlap
    The overlap causes remote memory accesses when the overlapping data are not placed together
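The difference between communication-free and communication-efficient partitioning can be checked mechanically: split the iteration space, derive each block's footprint, and look for elements that appear in more than one footprint. The sketch assumes a one-dimensional three-point body and contiguous blocks; the function names are hypothetical.

```python
def footprint(iterations: range) -> set[int]:
    """Elements of A read by the assumed body B[i] = A[i-1] + A[i] + A[i+1]."""
    return {i + d for i in iterations for d in (-1, 0, 1)}


def split(iterations: range, parts: int) -> list[range]:
    """Cut a unit-stride iteration space into `parts` contiguous blocks."""
    n = len(iterations)
    bounds = [iterations.start + n * p // parts for p in range(parts + 1)]
    return [range(bounds[p], bounds[p + 1]) for p in range(parts)]


def shared_elements(blocks: list[range]) -> set[int]:
    """Elements that fall in the footprint of more than one block."""
    seen, shared = set(), set()
    for block in blocks:
        fp = footprint(block)
        shared |= seen & fp  # already touched by an earlier block
        seen |= fp
    return shared
```

An empty result means the partitioning is communication-free. For this stencil, splitting iterations 1 to 8 into two blocks shares elements 4 and 5, so at least one block must reach into the other's memory unless those elements are duplicated or placed nearby.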



Architectural-level Synthesis

Synthesize the innermost iteration body
    Pipelined designs

Collect performance results: execution time T, initiation interval II, and resource utilizations u_mul, u_BRAM, and u_CLB


Estimating the Execution Time

Resource utilization determines the performance of the pipelined designs.

Execution time is linear in the number of initiation intervals and in the granularity.

When more resources are unoccupied, more operations can be performed simultaneously.
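The linear relationship can be illustrated with the standard textbook model of a pipelined loop, in which the pipeline is filled once and a new iteration then starts every II cycles. This model is a common approximation and is an assumption here; the talk does not give the authors' exact formula, and the parameter names are hypothetical.

```python
import math


def pipelined_cycles(iterations: int, ii: int, depth: int) -> int:
    """Cycles for a pipelined loop: fill once, then start one iteration every `ii` cycles."""
    if iterations <= 0:
        return 0
    return depth + ii * (iterations - 1)


def partitioned_cycles(total_iterations: int, parts: int, ii: int, depth: int) -> int:
    """Cycles when `parts` identical pipelines each process an equal share in parallel."""
    per_part = math.ceil(total_iterations / parts)
    return pipelined_cycles(per_part, ii, depth)
```

Under this model, 100 iterations with II = 2 and a pipeline depth of 5 take 203 cycles on one pipeline and 53 cycles when split across four, provided free resources allow four pipelines to run at once.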


Granularity Adjustment

For each possible partitioning direction, check different granularities to obtain the best performance.
    Coarsest: use as few block RAM modules as possible

[Figure: coarsest granularity, showing control logic and datapath]


Granularity Adjustment

For each possible partitioning direction, check different granularities to obtain the best performance.
    Finest: distribute data to all block RAM modules

[Figure: finest granularity, showing control logic and multiple datapaths]
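The search over directions and granularities amounts to evaluating a cost estimate for each candidate and keeping the cheapest. The sketch below assumes that granularity is the number of block RAM modules used, from 1 (coarsest) to all of them (finest); `toy_cost` is a made-up stand-in for the cost function and does not reproduce it.

```python
from typing import Callable, Iterable


def best_partitioning(
    directions: Iterable[str],
    max_brams: int,
    cost: Callable[[str, int], float],
) -> tuple[str, int]:
    """Try every direction at every granularity from 1 (coarsest) to max_brams (finest)."""
    candidates = [(d, g) for d in directions for g in range(1, max_brams + 1)]
    return min(candidates, key=lambda c: cost(*c))


def toy_cost(direction: str, granularity: int) -> float:
    """Made-up cost: parallel work shrinks with granularity, remote accesses grow."""
    remote_penalty = 2.0 if direction == "row" else 5.0
    return 100.0 / granularity + remote_penalty * (granularity - 1)
```

With this toy cost, the finest granularity is not always the winner: once the penalty for remote accesses outweighs the gain from parallelism, an intermediate granularity gives the lowest estimate.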


Cost Function

An empirical formulation based on our architectural-level synthesis results.
    Estimate the global memory accesses m_r, the total memory accesses m_t, and their ratio
    A factor rewards memory accesses to nearby block RAM modules
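The ratio of global to total accesses can be estimated by walking the iteration space and checking, for each access, whether the data live in the partition that issues the access. The sketch assumes that "global" means an access to another partition's block RAM, and uses an assumed three-point body with equal contiguous blocks; it shows only the ratio, not the full cost function.

```python
def owner(index: int, block_size: int, parts: int) -> int:
    """Partition whose BRAM holds A[index]; border elements are clamped to the edge blocks."""
    return min(max(index // block_size, 0), parts - 1)


def remote_ratio(n: int, parts: int) -> float:
    """m_r / m_t for the assumed body B[i] = A[i-1] + A[i] + A[i+1], i in [0, n)."""
    block_size = n // parts  # assumes parts divides n evenly
    m_r = m_t = 0
    for i in range(n):
        issuer = owner(i, block_size, parts)
        for d in (-1, 0, 1):
            m_t += 1
            if owner(i + d, block_size, parts) != issuer:
                m_r += 1  # data held by a different partition
    return m_r / m_t
```

The ratio is zero for a single partition and grows as the data are cut into more pieces, which is the trade-off the granularity adjustment has to balance against added parallelism.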


Outline

Target architectures
Data partitioning problem
Memory optimizations
    Scalar replacement
    Data prefetching
Experimental results
Concluding remarks


Scalar Replacement

Scalar replacement increases data reuse and reduces memory accesses.
    The memory was already accessed in the previous iteration
    Use the contents already in registers rather than accessing memory again
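A before-and-after pair shows the effect. The two-tap filter and the `CountingMemory` wrapper, which stands in for a block RAM and counts reads, are assumptions chosen for illustration.

```python
class CountingMemory:
    """A list wrapper that counts reads, standing in for a block RAM."""

    def __init__(self, data):
        self.data = list(data)
        self.reads = 0

    def __getitem__(self, i):
        self.reads += 1
        return self.data[i]


def filter_naive(a: CountingMemory, n: int) -> list[int]:
    """out[i] = a[i] + a[i+1]; every operand is read from memory."""
    return [a[i] + a[i + 1] for i in range(n - 1)]


def filter_scalar_replaced(a: CountingMemory, n: int) -> list[int]:
    """Same result, but a[i] is kept in a scalar from the previous iteration."""
    out = []
    prev = a[0]              # loaded once before the loop
    for i in range(n - 1):
        nxt = a[i + 1]       # the only memory read in this iteration
        out.append(prev + nxt)
        prev = nxt           # reused by the next iteration
    return out
```

On a five-element input, the naive version performs eight reads and the scalar-replaced version five, with identical results. In hardware, the scalar corresponds to a register, so the saved reads free a memory port in every iteration.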


Data Prefetching and Buffer Insertion

Buffer insertion shortens critical paths and improves clock frequencies.
    Schedule the global memory access one cycle earlier
    Whether one, two, or more cycles are needed depends on the size of the chip and the number of BRAMs
    This reduces the length of critical paths
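As a rough illustration of issuing global accesses earlier, the sketch below takes a schedule (operation name to start cycle) and moves the global loads forward by a chosen number of cycles, leaving room for a buffer register on the long path. The schedule representation and the function name `prefetch` are hypothetical; a real scheduler would also re-check dependencies and resource conflicts.

```python
def prefetch(
    schedule: dict[str, int],
    global_loads: set[str],
    extra_cycles: int = 1,
) -> dict[str, int]:
    """Return a new schedule with each global load issued `extra_cycles` earlier."""
    shifted = {
        op: (cycle - extra_cycles if op in global_loads else cycle)
        for op, cycle in schedule.items()
    }
    # Renumber so that the earliest operation starts at cycle 0.
    offset = min(shifted.values())
    return {op: cycle - offset for op, cycle in shifted.items()}
```

For a schedule where a local and a global load both start at cycle 0, the global load keeps cycle 0 after the shift while every other operation moves one cycle later, so the data from the distant block RAM has an extra cycle to arrive.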



Experimental Setup

Target architecture: Xilinx Virtex II FPGA
Target frequency: 150 MHz
Benchmarks: image processing and DSP applications
    SOBEL edge detection
    Bilinear filtering
    2D Gauss blurring
    1D Gauss filter
    SUSAN principle


Collected Results

Pre-layout and post-layout timing and area results are collected.
    Original: one block RAM is assigned to the entire data array
    Partitioned: the iteration/data spaces are partitioned under resource constraints
    Optimized: memory optimizations are applied to the partitioned designs

SOBEL (large)

                           Pre-layout timing/area              Post-layout timing/area
             # of cycles   Freq (MHz)  Latency (ms)  Area (%)   Freq (MHz)  Latency (ms)  Area (%)
original          29,718       160.9         184.7      3.32       151.19         196.6      4.10
partitioned        2,032      145.92          13.9     41.97       105.37          19.2     52.60
optimized            263      185.19           1.4     44.32       125.94           2.1     53.91
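The speedups implied by the SOBEL (large) rows follow directly from the latency columns. The short sketch below only divides the post-layout latencies reported above; the dictionary and the function name are introduced here for illustration.

```python
# Post-layout latencies for SOBEL (large), copied from the reported results.
POST_LAYOUT_LATENCY = {"original": 196.6, "partitioned": 19.2, "optimized": 2.1}


def speedup(
    baseline: str,
    design: str,
    latency: dict[str, float] = POST_LAYOUT_LATENCY,
) -> float:
    """How many times faster `design` is than `baseline`."""
    return latency[baseline] / latency[design]
```

For this benchmark, `speedup("original", "partitioned")` is roughly 10 and `speedup("original", "optimized")` roughly 94, at the price of the area growing from about 4 percent to about 54 percent of the device.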


Results: Execution Time

Average speedup: 2.75 times
    Under the given resources, data are partitioned into 4 portions
After further optimizations: 4.80 times faster

[Figure: normalized latencies of the original, partitioned, and optimized designs for SUSAN, Bilinear, 1D Gauss, and 2D Gauss]


Results: Achievable Clock Frequencies

Partitioned designs run about 10 percent slower than the original ones.
After optimizations, clock frequencies are about 7 percent higher than those of the partitioned designs.

[Figure: achievable clock frequencies of the original, partitioned, and optimized designs for SUSAN, Bilinear, 1D Gauss, 2D Gauss, and SOBEL (two entries)]



Concluding Remarks

A data and iteration space partitioning approach for homogeneous block RAM modules
    Integrated with existing architectural-level synthesis techniques
    Parallelizes input designs
    Dramatically improves system performance


Thank You

Prof. Ryan Kastner and Gang Wang
Reviewers
All audiences


Questions
