Data Partitioning for Reconfigurable Architectures with Distributed Block RAM Wenrui Gong Gang Wang Ryan Kastner Department of Electrical and Computer Engineering University of California, Santa Barbara {gong, wanggang, kastner}@ece.ucsb.edu http://express.ece.ucsb.edu June 10, 2005


Data Partitioning for Reconfigurable Architectures with Distributed Block RAM

Wenrui Gong Gang Wang Ryan Kastner

Department of Electrical and Computer Engineering

University of California, Santa Barbara

{gong, wanggang, kastner}@ece.ucsb.edu

http://express.ece.ucsb.edu

June 10, 2005


3/22/2005GONG et al: Data Partitioning for Reconfigurable Architectures with Distributed Block RAM 2

What are we dealing with?

  • Mapping high-level programs onto FPGA-based reconfigurable computing architectures with distributed block RAM modules
  • Objective: improve the utilization of available storage resources, optimize system performance, and meet design goals


Outline

  • Target architectures
  • Data partitioning problem
  • Memory optimizations
  • Experimental results
  • Concluding remarks


Target Architecture

  • FPGA-based fine-grained reconfigurable computing architecture with distributed block RAM modules


Memory Access Latencies

  • Memory access delay includes the access delay and the propagation delays; the propagation delays are variable.
  • One clock cycle to access nearby data; two or even more cycles to access data far away from the CLB.
  • It is difficult to tell which block RAM modules are near and which are remote before physical synthesis.
    • More difficult than traditional data partitioning in parallelizing compilation for NUMA machines


Outline

  • Target architectures
  • Data partitioning problem
    • Problem formulation
    • Data partitioning algorithm
  • Memory optimizations
  • Experimental results
  • Concluding remarks


Problem Formulation

Inputs:
  • An l-level nested loop L
  • A set of n data arrays N
  • An architecture with m block RAM (BRAM) modules M

Assumptions:
  • Index expressions of array references are affine functions of the loop indices
  • No indirect array references or similar pointer operations
  • All data arrays are assigned to block RAM modules
  • No duplicate data


Problem Formulation (cont’d)

  • Partitioning problem: partition the data arrays N into a set of data portions P, and seek an assignment from P to the block RAM modules M.
  • Constraints:
    1) the hardware resource constraint
    2) the capacity constraint of each block RAM module
    3) all data arrays are assigned to block RAM, and each data element is assigned to one and only one block RAM module
  • Objective: minimize the total execution time (or maximize the system throughput) under the above constraints.
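
The capacity and single-assignment constraints can be checked mechanically for a candidate solution. The following is a minimal sketch, not the authors' tool: the function name `check_assignment`, the representation of a portion as a half-open range over a flattened array, and the reduction of the hardware resource constraint to a fixed module count are all assumptions made for illustration.

```python
def check_assignment(array_sizes, portions, assignment, bram_capacity, num_brams):
    """Return True if a candidate partitioning satisfies the stated constraints.

    array_sizes:   {array_name: number_of_elements}            (hypothetical layout)
    portions:      {portion_id: (array_name, start, stop)}, half-open ranges
    assignment:    {portion_id: bram_index}
    bram_capacity: elements that fit in one block RAM module
    num_brams:     number of available modules (stands in for the resource constraint)
    """
    # Every element of every array must be covered exactly once (no gaps, no duplicates).
    for name, size in array_sizes.items():
        covered = [0] * size
        for arr, start, stop in portions.values():
            if arr != name:
                continue
            if start < 0 or stop > size or start > stop:
                return False
            for element in range(start, stop):
                covered[element] += 1
        if any(count != 1 for count in covered):
            return False

    # Every portion must be mapped to exactly one existing module.
    if set(assignment) != set(portions):
        return False

    # No module may hold more elements than its capacity.
    used = [0] * num_brams
    for portion_id, bram in assignment.items():
        if not 0 <= bram < num_brams:
            return False
        _, start, stop = portions[portion_id]
        used[bram] += stop - start
    return all(u <= bram_capacity for u in used)


# Example: one 8-element array split into two halves on two modules.
sizes = {"A": 8}
halves = {"p0": ("A", 0, 4), "p1": ("A", 4, 8)}
mapping = {"p0": 0, "p1": 1}
print(check_assignment(sizes, halves, mapping, bram_capacity=4, num_brams=2))
```

The objective (execution time) is deliberately left out of this check; it only filters candidates before any performance estimate is made.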


Overview of Data Partitioning Algorithm

  • Code analysis: determine the possible partitioning directions
  • Architectural-level synthesis: discover the design properties
    • Resource allocation, scheduling, and binding
  • Granularity adjustment
    • Use an empirical cost function to estimate performance


Code Analysis

  • Calculate the iteration space IS(L)
  • Calculate the data space DS(N_i)
  • Obtain the data access footprint F using the affine functions of the loop indices
  • Analyze F and IS(L) to obtain a set of possible partitioning directions
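
One way to picture this analysis is to enumerate a small iteration space, apply the affine index functions to get the footprint, and see how much data two halves of the loop share when it is split along each dimension. This is a brute-force sketch under assumptions, not the analysis used in the work: the helper names, the rectangular loop bounds, and the "split in half and count shared elements" test are all illustrative.

```python
from itertools import product


def iteration_space(bounds):
    """All iteration vectors of a rectangular loop nest; bounds are (lo, hi), hi exclusive."""
    return list(product(*(range(lo, hi) for lo, hi in bounds)))


def affine(coeffs, offset):
    """Index function for one array dimension: iteration -> coeffs . iteration + offset."""
    return lambda it: sum(c * x for c, x in zip(coeffs, it)) + offset


def footprint(iterations, refs):
    """Set of array elements touched by the iterations; each ref is a tuple of affine functions."""
    return {tuple(f(it) for f in ref) for it in iterations for ref in refs}


def overlap_when_split(bounds, refs, dim):
    """Split loop dimension `dim` in half and count the elements both halves touch."""
    lo, hi = bounds[dim]
    mid = (lo + hi) // 2
    first, second = list(bounds), list(bounds)
    first[dim], second[dim] = (lo, mid), (mid, hi)
    shared = footprint(iteration_space(first), refs) & footprint(iteration_space(second), refs)
    return len(shared)


# Hypothetical loop: for i in [0,4), j in [1,7): uses A[i][j-1], A[i][j], A[i][j+1]
bounds = [(0, 4), (1, 7)]
refs = [
    (affine((1, 0), 0), affine((0, 1), -1)),
    (affine((1, 0), 0), affine((0, 1), 0)),
    (affine((1, 0), 0), affine((0, 1), 1)),
]
print(overlap_when_split(bounds, refs, dim=0))  # prints 0: splitting on i shares no data
print(overlap_when_split(bounds, refs, dim=1))  # prints 8: splitting on j shares two columns
```

In this example the i direction is the attractive partitioning direction, since the two halves touch disjoint rows and can live in separate block RAM modules without any remote accesses.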


Architectural-level Synthesis

  • Synthesize and pipeline the innermost iteration body, and collect the execution time T, the initiation interval II, and the resource utilizations u_m, u_r, and u_a
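
The reason T and II matter together is the usual textbook estimate for a pipelined loop: the first iteration takes T cycles and each later iteration starts II cycles after the previous one. The sketch below uses that standard estimate; it is not taken from the slides, and the function name and example numbers are hypothetical.

```python
def pipelined_cycles(latency, ii, iterations):
    """Cycle count of a pipelined loop: latency for the first iteration, II for each later one."""
    if iterations <= 0:
        return 0
    return latency + (iterations - 1) * ii


# Hypothetical loop body: T = 10 cycles, II = 2 cycles, 100 iterations.
print(pipelined_cycles(10, 2, 100))  # prints 208
print(10 * 100)                      # prints 1000, the unpipelined cycle count
```

For long loops the II term dominates, which is why the later steps focus on estimating II rather than T.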


Granularity Adjustment

  • For each possible partitioning direction, check different granularities to obtain the best performance
    • Calculate the finest and coarsest grain for a homogeneous partitioning
      • Finest: as few iterations as possible in one block RAM module; use all block RAM modules
      • Coarsest: use as few block RAM modules as possible
    • Estimate the global memory accesses m_r, the total memory accesses m_t, and their ratio
    • Use the cost function to estimate the execution time
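
To make the granularity range and the access ratio concrete, the sketch below computes the coarsest and finest module counts for one array and counts global versus total accesses for a row-wise split. Everything here is an illustrative assumption: the function names, the three-row stencil, the equal-sized contiguous blocks, and the module capacity of 1024 words are not from the slides.

```python
import math


def module_range(total_words, bram_words, num_brams):
    """Coarsest grain uses as few modules as capacity allows; finest grain uses all of them."""
    coarsest = math.ceil(total_words / bram_words)
    if coarsest > num_brams:
        raise ValueError("array does not fit in the available block RAM modules")
    return coarsest, num_brams


def access_counts(rows, cols, k, offsets=(-1, 0, 1)):
    """Return (global accesses, total accesses) when rows are split into k contiguous blocks.

    Iteration i reads rows i + d for d in offsets; a read is counted as global when
    the row it touches lives in a different block than row i.
    """
    block = math.ceil(rows / k)
    remote = total = 0
    for i in range(1, rows - 1):
        for d in offsets:
            total += cols
            if (i + d) // block != i // block:
                remote += cols
    return remote, total


# Hypothetical 64 x 64 array, 1024-word modules, 16 modules available.
coarsest, finest = module_range(64 * 64, 1024, 16)
for k in (coarsest, finest):
    m_r, m_t = access_counts(64, 64, k)
    print(k, m_r, m_t, round(m_r / m_t, 3))
```

Finer partitions expose more parallelism but raise the share of global accesses, which is the trade-off the cost function has to weigh.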


Cost Function

  • An empirical formulation based on our architectural-level synthesis results
    • Estimates the initiation intervals of pipelined designs
    • Favors memory accesses to nearby block RAM modules
    • Different resource utilizations and granularities affect the initiation intervals


Outline

  • Target architectures
  • Data partitioning problem
  • Memory optimizations
    • Scalar replacement
    • Data prefetching
  • Experimental results
  • Concluding remarks


Scalar Replacement

  • Scalar replacement increases data reuse and reduces memory accesses
    • Some memory locations were already accessed in the previous iteration
    • Use the contents already held in registers rather than accessing memory again
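
The effect is easy to see on a small sliding-window loop. The sketch below is an illustration in Python rather than hardware: `CountingArray` is a hypothetical stand-in for a block RAM that counts reads, and the three-tap filter is an assumed example, not one of the benchmarks.

```python
class CountingArray:
    """Hypothetical stand-in for a block RAM: a read-only array that counts its reads."""

    def __init__(self, data):
        self._data = list(data)
        self.reads = 0

    def __getitem__(self, index):
        self.reads += 1
        return self._data[index]

    def __len__(self):
        return len(self._data)


def filter_naive(a):
    """Three memory reads per iteration."""
    return [a[i - 1] + 2 * a[i] + a[i + 1] for i in range(1, len(a) - 1)]


def filter_scalar_replaced(a):
    """One memory read per iteration; the other two values are reused from scalars."""
    if len(a) < 3:
        return []
    out = []
    left, mid = a[0], a[1]  # loaded once before the loop
    for i in range(1, len(a) - 1):
        right = a[i + 1]  # the only new element this iteration
        out.append(left + 2 * mid + right)
        left, mid = mid, right  # rotate the "registers"
    return out


naive_mem = CountingArray(range(1, 9))
reuse_mem = CountingArray(range(1, 9))
assert filter_naive(naive_mem) == filter_scalar_replaced(reuse_mem)
print(naive_mem.reads, reuse_mem.reads)  # prints 18 8
```

In hardware the rotated scalars become registers, so the two reused values cost no block RAM port bandwidth.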


Data Prefetching and Buffer Insertion

  • Buffer insertion shortens critical paths and improves the achievable clock frequency
    • Schedule the global memory access one cycle earlier
    • Reduce the length of the critical paths
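
The frequency benefit follows from the fact that the clock period is set by the slowest register-to-register stage. The sketch below shows that arithmetic only; the delay values are hypothetical, and the model ignores register setup time and clock skew.

```python
def max_frequency_mhz(stage_delays_ns):
    """Clock frequency allowed by the slowest stage (delays in nanoseconds)."""
    return 1000.0 / max(stage_delays_ns)


# Hypothetical path: 4.0 ns of routing from a remote block RAM, then 3.5 ns of logic.
unbuffered = max_frequency_mhz([4.0 + 3.5])  # one long combinational path
buffered = max_frequency_mhz([4.0, 3.5])     # a register splits it into two stages
print(round(unbuffered, 1), buffered)        # prints 133.3 250.0
```

The buffered version needs the data one cycle sooner, which is why the global access is scheduled one cycle earlier.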


Experimental Setup

  • Target architecture: Xilinx Virtex II FPGA
  • Target frequency: 150 MHz
  • Benchmarks: image processing and DSP applications
    • SOBEL edge detection
    • Bilinear filtering
    • 2D Gauss blurring
    • 1D Gauss filter
    • SUSAN principle


Results: Architectural Exploration

  • Correlation bank: different partitions of the array S deliver a wide variety of candidate solutions
    • With quite different overall performance after synthesis and physical design


Results: Execution Time

  • The average speedup is 2.75 times; after further optimizations, the average speedup is 4.80 times.


Results: Achievable Clock Frequencies

  • Partitioned designs: achievable clock frequencies are about 10 percent lower than those of the original designs.
  • After optimizations: about 7 percent higher than those of the partitioned designs.
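
Combining the two percentages gives a rough sense of where the optimized designs land relative to the originals. This is only a back-of-the-envelope reading that assumes both figures compose multiplicatively; the rounded "about" values make the result approximate, and it is not a number reported in the slides.

```python
partitioned_vs_original = 1.0 - 0.10   # about 10 percent slower than the original
optimized_vs_partitioned = 1.0 + 0.07  # about 7 percent faster than the partitioned design
optimized_vs_original = partitioned_vs_original * optimized_vs_partitioned
print(f"{optimized_vs_original:.3f}")  # prints 0.963
```

Under that assumption the optimized designs recover most of the frequency loss, ending a few percent below the original clock rate.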


Concluding Remarks

  • A data and iteration space partitioning approach for homogeneous block RAM modules
    • Integrated with existing architectural-level synthesis techniques
    • Parallelizes input designs
    • Dramatically improves system performance
  • Future work
    • Irregular memory accesses
    • Heterogeneous block RAM modules


Thank You

  • Prof Ryan Kastner and Gang Wang
  • Reviewers
  • All audiences


Questions