Storage Assignment during High-level Synthesis
for Configurable Architectures
Wenrui Gong, Gang Wang, Ryan Kastner
Department of Electrical and Computer Engineering
University of California, Santa Barbara
{gong, wanggang, kastner}@ece.ucsb.edu
http://express.ece.ucsb.edu
November 7, 2005
11/7/2005GONG et al: Storage Assignment 2
What are we dealing with?
FPGA-based reconfigurable architectures with distributed block RAM modules
Synthesizing high-level programs into designs
[Figure: four block RAM modules and configurable logic blocks]
Options of Storage Assignment
Given the same storage/logic resources, different storage assignments exist.
[Figure: two alternative storage assignments (labels: control logic, MUX, datapath)]
Objective
Different arrangements achieve different performance.
Objective: achieve the best performance (throughput) under the resource constraints, improve resource utilization, and meet design goals (time, frequency, etc.)
Outline
Target architectures
Data partitioning problem
Memory optimizations
Experimental results
Concluding remarks
Target Architecture
FPGA-based fine-grained reconfigurable computing architecture with distributed block RAM modules
Memory Access Latencies
Memory access delay = BRAM access delay + interconnect delay
  BRAM access time is fixed by the architecture.
  Interconnect delay varies: one clock cycle to access nearby data, two or even more to access data far away from the CLB.
This makes it difficult to precisely estimate execution time.
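The delay model above can be sketched in a few lines. Only the structure (a fixed BRAM delay plus a distance-dependent interconnect delay) comes from the slide; the grid coordinates, the Manhattan-distance metric, the one-cycle BRAM delay, and the `reach` threshold are illustrative assumptions.

```python
import math

BRAM_ACCESS_CYCLES = 1  # assumed fixed BRAM access delay, in cycles


def interconnect_cycles(clb_pos, bram_pos, reach=4):
    """Extra cycles spent on routing (hypothetical model).

    Assumes a signal can travel `reach` grid units within the cycle
    that already covers the BRAM access; every further `reach` units
    costs one more cycle.
    """
    dist = abs(clb_pos[0] - bram_pos[0]) + abs(clb_pos[1] - bram_pos[1])
    return max(0, math.ceil(dist / reach) - 1)


def memory_access_cycles(clb_pos, bram_pos, reach=4):
    """Memory access delay = BRAM access delay + interconnect delay."""
    return BRAM_ACCESS_CYCLES + interconnect_cycles(clb_pos, bram_pos, reach)


near = memory_access_cycles((0, 0), (1, 2))  # distance 3
far = memory_access_cycles((0, 0), (6, 5))   # distance 11
```

Under these assumptions the nearby access takes one cycle and the distant one takes three, which mirrors the "one cycle near, two or more far" behaviour described on the slide.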
Outline
Target architectures
Data partitioning problem
  Problem formulation
  Data partitioning algorithm
Memory optimizations
Experimental results
Concluding remarks
Problem Formulation
Inputs:
  An l-level nested loop L
  A set of n data arrays N
  An architecture with BRAM modules M
Partitioning problem: partition data arrays N into a set of data portions P, and seek an assignment from P to block RAM modules M.
Objective: optimize latency
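One way to make the formulation concrete is to represent portions and their assignment explicitly and check that a candidate assignment is legal. This is a minimal sketch, not the authors' implementation: the `Portion` type, one-dimensional arrays, and a uniform `bram_capacity` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Portion:
    array: str  # name of the source data array in N
    lo: int     # first element index covered (inclusive)
    hi: int     # end of the covered range (exclusive)

    @property
    def size(self):
        return self.hi - self.lo


def is_valid_assignment(array_sizes, assignment, bram_capacity):
    """Check that portions tile every array and fit into the BRAMs.

    array_sizes:   {array name: number of elements}   (the set N)
    assignment:    {Portion: BRAM module index}       (P -> M)
    bram_capacity: elements one BRAM can hold (assumed uniform)
    """
    # Every array must be covered by its portions without gaps or overlap.
    for name, size in array_sizes.items():
        parts = sorted((p for p in assignment if p.array == name),
                       key=lambda p: p.lo)
        pos = 0
        for p in parts:
            if p.lo != pos:
                return False
            pos = p.hi
        if pos != size:
            return False
    # No BRAM module may hold more than its capacity.
    load = {}
    for p, m in assignment.items():
        load[m] = load.get(m, 0) + p.size
    return all(total <= bram_capacity for total in load.values())


arrays = {"img": 1024}
good = {Portion("img", 0, 512): 0, Portion("img", 512, 1024): 1}
bad = {Portion("img", 0, 1024): 0}  # too large for one 512-element BRAM
ok_good = is_valid_assignment(arrays, good, bram_capacity=512)
ok_bad = is_valid_assignment(arrays, bad, bram_capacity=512)
```

A partitioning algorithm would then search over legal assignments for the one with the lowest estimated latency.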
[Figure: four block RAM modules and configurable logic blocks]
Overview of Data Partitioning Algorithm
Code analysis
  Determine possible partitioning directions
Architectural-level synthesis
  Resource allocation, scheduling, and binding
  Discover the design properties
Granularity adjustment
  Use an experimental cost function to estimate performance
Code Analysis
Iteration space and data spaces
Index functions determine access footprints
[Figure: iteration space mapped to data space S]
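As a small illustration of how index functions determine access footprints, the sketch below enumerates a two-level iteration space and collects the array elements each iteration touches. The three index functions are a hypothetical loop body, not one taken from the benchmarks.

```python
from itertools import product


def footprint(iterations, index_functions):
    """Data elements touched by a set of iterations.

    Each index function maps an iteration point (i, j) to the
    coordinates of one array element accessed in that iteration.
    """
    return {f(i, j) for (i, j) in iterations for f in index_functions}


# Hypothetical loop body reading A[i][j], A[i][j+1], and A[i+1][j].
index_functions = [
    lambda i, j: (i, j),
    lambda i, j: (i, j + 1),
    lambda i, j: (i + 1, j),
]

iteration_space = list(product(range(2), range(2)))  # 2 x 2 loop nest
data_space = footprint(iteration_space, index_functions)
```

Here a 2 x 2 iteration space touches eight elements of a 3 x 3 region of A; the corner element A[2][2] is never accessed.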
Iteration/Data Space Partitioning
Partitioning the iteration space derives a corresponding partitioning of the data spaces, using the index functions.
Communication-free partitioning
[Figure: iteration space mapped to data space S]
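The derivation can be sketched as follows: split the iteration space into blocks, push each block through the index functions, and call the result communication-free when no data element is needed by two blocks. The row-wise split and the example accesses are assumptions chosen for illustration.

```python
from itertools import combinations


def partition_rows(n_rows, n_cols, parts):
    """Split an n_rows x n_cols iteration space into row blocks."""
    step = -(-n_rows // parts)  # ceiling division
    return [
        [(i, j) for i in range(lo, min(lo + step, n_rows)) for j in range(n_cols)]
        for lo in range(0, n_rows, step)
    ]


def data_partitions(iteration_blocks, index_functions):
    """Map each iteration block to the data elements it accesses."""
    return [
        {f(i, j) for (i, j) in block for f in index_functions}
        for block in iteration_blocks
    ]


def is_communication_free(data_blocks):
    """True when no data element is needed by two different blocks."""
    return all(a.isdisjoint(b) for a, b in combinations(data_blocks, 2))


blocks = partition_rows(4, 4, parts=2)
# Hypothetical accesses that stay within a row: A[i][j] and A[i][j+1].
row_local = data_partitions(blocks, [lambda i, j: (i, j), lambda i, j: (i, j + 1)])
free = is_communication_free(row_local)
```

Because both accesses stay in row i, splitting by rows gives disjoint data partitions, so each block can live entirely in its own block RAM.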
Iteration/Data Space Partitioning
Communication-efficient partitioning
  Data access footprints overlap
  Overlapping data causes remote memory accesses when it is not placed together with the datapath that uses it
[Figure: iteration space mapped to data space S]
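To see where the remote accesses come from, the sketch below computes the footprints of two row blocks for a hypothetical vertical stencil and intersects them. The stencil, block sizes, and placement rule are assumptions for illustration only.

```python
def block_footprint(rows, n_cols, offsets):
    """Elements A[i+di][j+dj] read by iterations in the given rows."""
    return {(i + di, j + dj)
            for i in rows for j in range(n_cols)
            for (di, dj) in offsets}


# Hypothetical vertical stencil: each iteration reads A[i][j] and A[i+1][j].
offsets = [(0, 0), (1, 0)]
top = block_footprint(range(0, 2), 4, offsets)     # iterations in rows 0..1
bottom = block_footprint(range(2, 4), 4, offsets)  # iterations in rows 2..3

shared = top & bottom  # elements both partitions need
# If row 2 is stored only in the bottom partition's BRAM, each read of it
# by the top partition is a remote access (one read per shared element here).
remote_from_top = len(shared)
```

The overlap is exactly row 2 of A, so the top partition issues four remote reads; a communication-efficient partitioning tries to keep this overlap small.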
Architectural-level Synthesis
Synthesize the innermost iteration body
  Pipelined designs
Collect performance results
  Execution time T, initiation interval II, and resource utilizations u_mul, u_BRAM, and u_CLB
Estimating the Execution Time
Resource utilization determines the performance of the pipelined designs.
Execution time is linear in the initiation interval and the granularity.
When more resources are free, more operations can be performed simultaneously.
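The slide states only that execution time is linear; a common textbook estimate for a pipelined loop, used here as an assumption rather than as the authors' formula, is depth + II * (iterations - 1). The sketch also assumes that each portion gets its own identical pipeline and that the portions run in parallel.

```python
import math


def pipelined_cycles(iterations, ii, depth):
    """Cycles to run `iterations` loop bodies through one pipeline.

    The first result appears after `depth` cycles; each later
    iteration starts `ii` cycles after the previous one.
    """
    if iterations <= 0:
        return 0
    return depth + ii * (iterations - 1)


def partitioned_cycles(iterations, ii, depth, portions):
    """Assumes `portions` identical pipelines, each working on its own data."""
    per_pipeline = math.ceil(iterations / portions)
    return pipelined_cycles(per_pipeline, ii, depth)


single = pipelined_cycles(1024, ii=2, depth=6)               # 6 + 2 * 1023
split = partitioned_cycles(1024, ii=2, depth=6, portions=4)  # 6 + 2 * 255
```

With these made-up numbers, four portions cut the estimate from 2052 to 516 cycles. In practice the initiation interval itself may change with the partitioning, which is why the synthesis results are collected first.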
Granularity Adjustment
For each possible partitioning direction, check different granularities to obtain the best performance.
  Coarsest: use as few block RAM modules as possible
[Figure: control logic and datapath]
Granularity Adjustment
For each possible partitioning direction, check different granularities to obtain the best performance.
  Finest: distribute data to all block RAM modules
[Figure: control logic and multiple datapaths]
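The search between the coarsest and the finest granularity can be sketched as an exhaustive sweep. The `toy_cost` function below is invented purely to make the example runnable; it stands in for the cost function derived from synthesis results and does not reproduce it.

```python
def best_granularity(directions, max_brams, estimate_cost):
    """Try every direction and portion count; return the cheapest.

    estimate_cost(direction, portions) is a caller-supplied estimate
    (hypothetical here) of the latency for that configuration.
    """
    candidates = [
        (estimate_cost(d, g), d, g)
        for d in directions
        for g in range(1, max_brams + 1)  # coarsest (1) to finest (all BRAMs)
    ]
    cost, direction, portions = min(candidates)
    return direction, portions, cost


def toy_cost(direction, portions):
    """Made-up cost: work shrinks with more portions while remote
    accesses grow; column partitioning is assumed to overlap more."""
    work = 1024 / portions
    overlap_penalty = {"row": 8, "column": 32}[direction] * (portions - 1)
    return work + overlap_penalty


choice = best_granularity(["row", "column"], max_brams=16, estimate_cost=toy_cost)
```

With this toy cost the sweep settles on row partitioning with 11 portions, neither the coarsest nor the finest setting, which is the kind of trade-off the granularity adjustment is meant to find.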
Cost Function
An empirical formulation based on our architectural-level synthesis results.
  Estimate global memory accesses m_r and total memory accesses m_t, and their ratio
  A weighting factor favors memory accesses to nearby block RAM modules
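The exact cost formula is not recoverable from the slide text, so the sketch below only shows how m_r, m_t, and their ratio could be counted for a given placement. The access trace format and the rule that an access is global when the element lives in a different BRAM than the issuing datapath's are assumptions.

```python
def access_ratio(accesses, home_bram):
    """Return (m_r, m_t, m_r / m_t) for a list of memory accesses.

    accesses:  list of (datapath_bram, element) pairs, where
               datapath_bram is the BRAM next to the issuing datapath
    home_bram: {element: BRAM module that stores it}
    An access counts as global when the element lives in another BRAM.
    """
    m_t = len(accesses)
    m_r = sum(1 for bram, elem in accesses if home_bram[elem] != bram)
    return m_r, m_t, (m_r / m_t if m_t else 0.0)


home = {"a0": 0, "a1": 0, "a2": 1, "a3": 1}
trace = [(0, "a0"), (0, "a1"), (0, "a2"),  # datapath 0 reaches into BRAM 1 once
         (1, "a2"), (1, "a3")]
m_r, m_t, ratio = access_ratio(trace, home)
```

In this trace one of five accesses is global, giving a ratio of 0.2; a lower ratio means more accesses are served by nearby block RAM modules.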
Outline
Target architectures
Data partitioning problem
Memory optimizations
  Scalar replacement
  Data prefetching
Experimental results
Concluding remarks
Scalar Replacement
Scalar replacement increases data reuse and reduces memory accesses.
  The memory location was already accessed in the previous iteration
  Use the contents already held in registers rather than accessing memory again
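A minimal before-and-after sketch of scalar replacement on a hypothetical two-tap loop, with an explicit counter standing in for memory reads. The loop body is an illustration, not one of the benchmark kernels.

```python
def smooth_naive(a):
    """out[i] = a[i] + a[i+1], reading both operands from memory."""
    reads = 0
    out = []
    for i in range(len(a) - 1):
        x = a[i]
        y = a[i + 1]
        reads += 2
        out.append(x + y)
    return out, reads


def smooth_scalar_replaced(a):
    """Same result, but a[i] is carried over in a scalar ("register")."""
    if len(a) < 2:
        return [], 0
    prev = a[0]  # loaded once before the loop
    reads = 1
    out = []
    for i in range(len(a) - 1):
        cur = a[i + 1]  # the only memory read per iteration
        reads += 1
        out.append(prev + cur)
        prev = cur      # reused by the next iteration
    return out, reads


data = [1, 2, 3, 4, 5]
naive_out, naive_reads = smooth_naive(data)
fast_out, fast_reads = smooth_scalar_replaced(data)
```

Both versions produce the same output, but the second issues five reads instead of eight because the value loaded as `a[i + 1]` in one iteration is reused as `a[i]` in the next.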
Data Prefetching and Buffer Insertion
Buffer insertion shortens critical paths and improves clock frequencies.
  Schedule the global memory access one cycle earlier
  Whether it is one, two, or more cycles depends on the size of the chip and the number of BRAMs
  Reduce the length of critical paths
Experimental Setup
Target architecture: Xilinx Virtex II FPGA.
Target frequency: 150 MHz.
Benchmarks: image processing applications and DSP
  SOBEL edge detection
  Bilinear filtering
  2D Gauss blurring
  1D Gauss filter
  SUSAN principle
Collected Results
Pre-layout and post-layout timing and area results are collected.
  Original: assign one block RAM to the entire data array.
  Partitioned: the iteration/data spaces are partitioned under resource constraints.
  Optimized: memory optimizations applied to the partitioned designs.

SOBEL (large)
             | Pre-layout timing/area                            | Post-layout timing/area
             | # of cycles  Freq (MHz)  Latency (µs)  Area (%)   | Freq (MHz)  Latency (µs)  Area (%)
original     | 29,718       160.9       184.7         3.32       | 151.19      196.6         4.10
partitioned  | 2,032        145.92      13.9          41.97      | 105.37      19.2          52.60
optimized    | 263          185.19      1.4           44.32      | 125.94      2.1           53.91
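The latency columns follow from the cycle counts and clock frequencies. Dividing a cycle count by a frequency in MHz yields microseconds, and the values computed below from the SOBEL (large) pre-layout figures match the tabulated latencies numerically. The helper name `latency_us` is introduced here for illustration.

```python
def latency_us(cycles, freq_mhz):
    """Cycles divided by a frequency in MHz gives microseconds."""
    return cycles / freq_mhz


# (cycles, pre-layout frequency in MHz) for SOBEL (large), from the table.
pre_layout = {
    "original": (29718, 160.9),
    "partitioned": (2032, 145.92),
    "optimized": (263, 185.19),
}
latencies = {name: latency_us(c, f) for name, (c, f) in pre_layout.items()}
speedup = latencies["original"] / latencies["optimized"]
```

For this benchmark the optimized design is roughly 130 times faster than the original before layout, mostly because the cycle count drops from 29,718 to 263.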
Results: Execution Time
The average speedup: 2.75 times.
  Under the given resources, designs are partitioned into 4 portions.
After further optimizations: 4.80 times faster.
[Figure: normalized latencies of the original, partitioned, and optimized designs for SUSAN, Bilinear, 1D Gauss, and 2D Gauss]
Results: Achievable Clock Frequencies
Partitioned designs are about 10 percent slower than the original ones.
After optimizations, designs are about 7 percent faster than the partitioned ones.
[Figure: achievable clock frequencies of the original, partitioned, and optimized designs for SUSAN, Bilinear, 1D Gauss, 2D Gauss, and two SOBEL variants]
Concluding Remarks
A data and iteration space partitioning approach for homogeneous block RAM modules
  Integrated with existing architectural-level synthesis techniques
  Parallelizes input designs
  Dramatically improves system performance
Thank You
Prof. Ryan Kastner and Gang Wang
Reviewers
All audiences
Questions