Northwestern VLSI CAD Group
K. Bazargan R. Kastner M. Sarrafzadeh
Physical Design for Reconfigurable Computing Systems
using Firm Templates
Department of Electrical & Computer Engineering
Northwestern University
Sep 10, 99
Outline
• FPGA: What and why?
• What is Reconfigurable Computing System (RCS)?
• Application example
• RCS: System components
• Online placement: problem definition and our approach
• Offline placement and scheduling
• Flexible modules and firm templates
• Conclusion and future work
The Architecture of a Reconfigurable System
[Figure: block diagram with CPU, RFU, Data Memory, and Instruction Memory (Program). Labeled connections: Data, Control, CPU instructions, RFUOPs.]
Execution of a Sample Program
[Figure: the RFU drawn with axes x, y, and t.]
x = 3*a - b;…
c = RFUOP1(x,5);
y = 4*x - c;
for (i=0;i<3;i++){
x+=RFUOP2(y);
++y;
}
z = RFUOP1(x,3);
a = z - y;
b = RFUOP3(a,b);
c = a - b;…
[Figure: the code next to its data-flow graph (DFG). Successive animation steps mark statements as running on the CPU or on the RFU, some of them in parallel. Callout: no room on RFU to run all in parallel ==> run in sequence.]
Application Example: Image Restoration
The value of the center pixel in the next iteration:
  x_{k+1} = β*y + x_k - β*(d ** x_k)
• y: the pixel value from the original degraded image
• x_k: the pixel value from the previous iteration
• β: scalar coefficient applied to y and to the weighted sum
• d ** x_k denotes the weighted sum r1 * (eight neighbor pixels) + r0 * (center pixel), i.e. the 3 x 3 mask
  r1 r1 r1
  r1 r0 r1
  r1 r1 r1
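A minimal sketch of one iteration of the update above, in plain Python. The name `beta` for the scalar coefficient, the function name `restore_step`, and the choice to leave border pixels unchanged are assumptions made for this illustration; the slides do not say how borders are handled.

```python
def restore_step(x, y, beta, r0, r1):
    """One iteration: x_next = beta*y + x - beta*(d ** x) on interior pixels.

    x, y: 2-D lists of floats (current estimate and degraded image).
    Border pixels are copied unchanged (an assumption of this sketch).
    """
    rows, cols = len(x), len(x[0])
    x_next = [row[:] for row in x]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            # Sum of the eight neighbors of pixel (i, j).
            neighbors = sum(
                x[i + di][j + dj]
                for di in (-1, 0, 1)
                for dj in (-1, 0, 1)
                if (di, dj) != (0, 0)
            )
            # Weighted sum d ** x at (i, j): r1 * neighbors + r0 * center.
            weighted = r1 * neighbors + r0 * x[i][j]
            x_next[i][j] = beta * y[i][j] + x[i][j] - beta * weighted
    return x_next
```

Each interior pixel depends only on the previous iteration, which is why all pixels of a segment can be updated in parallel in hardware.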
Image Restoration (cont.)
• Incentive:
  – Processing of large images using FPGAs with limited resources
• Strategy:
  – Segmentation of the image into smaller images suitable for the FPGA
  – Segments of size m x n are surrounded by an overlap of o.
[Figure: an m x n segment surrounded by an overlap of width o.]
Image Restoration: Data Flow Strategy
• Data flow strategy
  – Pixels of individual segments are restored in parallel by hardware.
  – Restored segments are written back after the overlap is discarded.
[Figure: an m x n segment with overlap o moving between MEMORY and the RFU.]
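The segmentation can be sketched as index arithmetic. This is only an illustration: the function name `split_into_segments`, the reading of m as rows and n as columns, and the clamping of the overlap at the image border are assumptions, not details given on the slides.

```python
def split_into_segments(height, width, m, n, o):
    """Return (core, padded) index ranges for m x n segments with overlap o.

    Each range is (row_start, row_end, col_start, col_end), end-exclusive.
    'padded' is what would be sent to the RFU; 'core' is what is written
    back after the overlap is discarded.
    """
    segments = []
    for r in range(0, height, m):
        for c in range(0, width, n):
            core = (r, min(r + m, height), c, min(c + n, width))
            padded = (
                max(r - o, 0), min(r + m + o, height),
                max(c - o, 0), min(c + n + o, width),
            )
            segments.append((core, padded))
    return segments
```

For a 4 x 4 image with m = n = 2 and o = 1, this yields four segments; the first has core (0, 2, 0, 2) and padded range (0, 3, 0, 3).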
Image Restoration Example
[Figure: degraded image and restored image.]
System Components
[Figure: block diagram with the following labeled blocks: RFU Manager, Placement Engine, Cache Manager, Prefetch/Branch Prediction Unit, Configuration Memory (Config. Bits, RFUOPs), Program Manager, Instruction Mem. (Prog.), CPU, RFU, Data Memory. Labeled connections: Control, Data, CPU instructions.]
Online Placement: Problem Definition
• Input:
  – RFU dimensions (W, H)
  – List of RFUOP events: (w, h, arrival, departure)
• Output:
  – For each module, either
    • Rejected (not able to place) [penalty?]
    • Accepted: (x, y)
[Figure: modules arriving at and departing from the RFU, each marked accepted or rejected.]
Online Placement
• When a new RFUOP arrives,
  – Is there enough room?
  – If yes, which location is best?
• Previous work
  – Bin-packing heuristics (1-D) - O(n^2)
    • First Fit, Best Fit, Shelf, Look ahead, …
  – [Chazelle’83] The Bottom-Left heuristic. O(n^2)
  – [Healy-Creavin’97] O(n^2 lg n)
[Figure: current placement + new module to be inserted = ?]
Our Online Placement
• Our approach:
  – Divide the empty space into explicit “empty rectangles” (ERs)
• When a new RFUOP arrives
  – Is there enough room? (any ER large enough?)
  – If yes, which location is best? (which ER is best?)
• Packing rule
  – Best Fit, Bottom Left, First Fit
Heuristics for Choosing an Empty Rectangle
[Figure: current placement + new module to be inserted = ?, with two candidate empty rectangles A and B; points P1 and P2 mark the candidate positions.]
• BF (Best Fit): places the new module in the empty rectangle which causes less wasted space. In the figure the area comparison favors A => choose A.
• FF (First Fit): either A or B could be chosen for placing the new module.
• BL (Bottom Left): chooses the empty rectangle which is more to the bottom left. In the figure y(P2) < y(P1) => choose B.
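The three packing rules can be sketched as a selection over a list of empty rectangles. The `Rect` type, the function name `choose_empty_rect`, measuring wasted space as ER area minus module area, and breaking bottom-left ties by (y, x) are all assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    x: int  # lower-left corner
    y: int
    w: int
    h: int


def choose_empty_rect(empty_rects, w, h, rule="BF"):
    """Pick an empty rectangle for a w x h module, or None to reject."""
    fits = [r for r in empty_rects if r.w >= w and r.h >= h]
    if not fits:
        return None  # no ER is large enough: the module is rejected
    if rule == "FF":
        return fits[0]  # first ER that fits
    if rule == "BF":
        # Least wasted area (assumed: ER area minus module area).
        return min(fits, key=lambda r: r.w * r.h - w * h)
    if rule == "BL":
        # Lowest, then leftmost corner (tie-break is an assumption).
        return min(fits, key=lambda r: (r.y, r.x))
    raise ValueError(f"unknown rule: {rule}")
```

With A = Rect(0, 5, 4, 4) and B = Rect(6, 0, 5, 5) and a 3 x 3 module, BF picks A (less wasted area) and BL picks B (lower corner), mirroring the choices on the slide.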
Our Online Placement
• Managing the empty space
  – Keep empty rectangles explicitly; use a “range tree” to store/access empty rects.
  – Efficient use of RFU real estate
    • KAMER: Keep all O(n^2) maximal empty rectangles
Keeping All Empty Rectangles
Our Online Placement
• Managing the empty space: two options
  – Efficient use of RFU real estate
    • KAMER: Keep all O(n^2) maximal empty rectangles
  – Fast but sub-optimal
    • Keep only O(n) empty rectangles
      – Shorter Seg. (SSEG), Square Empty Rects. (SQR), ...
Keeping O(n) Empty Rectangles - SSEG
Heuristics for Choosing a Segment
[Figure: the empty space can be cut along one of two segments, S1 or S2; one cut yields empty rectangles A and B, the other yields C and D.]
• SSEG (Shorter Seg): chooses the shorter of the two segments. Figure: S1 < S2.
• LSEG (Longer Seg): chooses the longer of the two segments. Figure: S1 < S2.
• BER (Balanced Empty Rects): chooses the segment which creates less area difference. Figure: Area(B) - Area(A) > Area(D) - Area(C).
• LER (Large Empty Rects): chooses the segment which creates the larger empty rectangle. Figure: Area(B) > Area(D).
• SQR (Square Rects): chooses the segment which creates empty rectangles closer to squares. Figure: Max{AR(A), AR(B)} < Max{AR(C), AR(D)}, where AR = AspectRatio.
• LSQR (Larger Rect Square): chooses the segment which creates the larger rectangle closer to square. Figure: AspectRatio(B) > AspectRatio(D).
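A rough sketch of three of these rules, under a specific geometric assumption that the slides only show pictorially: the module (w x h) sits in the bottom-left corner of a W x H empty rectangle, and the L-shaped remainder is cut either by a vertical segment of length H - h or by a horizontal segment of length W - w. The function names are hypothetical.

```python
def candidate_splits(W, H, w, h):
    """Two ways to cut the L-shaped remainder of a W x H empty rectangle
    after a w x h module is placed in its bottom-left corner.

    Each candidate is (segment_length, [(x, y, width, height), ...]).
    """
    # Vertical cut: the right-hand rectangle keeps the full height.
    vertical = (H - h, [(0, h, w, H - h), (w, 0, W - w, H)])
    # Horizontal cut: the top rectangle keeps the full width.
    horizontal = (W - w, [(w, 0, W - w, h), (0, h, W, H - h)])
    return vertical, horizontal


def choose_split(W, H, w, h, rule="SSEG"):
    a, b = candidate_splits(W, H, w, h)
    if rule == "SSEG":  # shorter segment
        return min(a, b, key=lambda s: s[0])
    if rule == "LSEG":  # longer segment
        return max(a, b, key=lambda s: s[0])
    if rule == "LER":   # split that creates the largest empty rectangle
        return max(a, b, key=lambda s: max(rw * rh for _, _, rw, rh in s[1]))
    raise ValueError(f"unknown rule: {rule}")
```

Either cut leaves two empty rectangles per placed module, which is consistent with keeping only O(n) empty rectangles overall.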
How Good is a Placement?
• Acceptance rate
  – Percentage of modules accepted (placed)
• Volume penalty
  – Area (complexity)
  – Time-span in the system (loop iterations)
  – Penalty of rejecting a module: penalty = volume = area * time
• Input data
  – Randomly generated dimensions
  – Randomly generated enter/leave time
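Both quality measures are simple to compute from the event list. In this sketch the event tuple (w, h, arrival, departure) follows the problem definition; taking time as departure minus arrival and the function name `evaluate` are assumptions.

```python
def evaluate(events, accepted):
    """Return (acceptance rate in percent, total penalty).

    events:   list of (w, h, arrival, departure) tuples
    accepted: parallel list of booleans, True if the module was placed
    """
    total = len(events)
    acceptance_rate = 100.0 * sum(accepted) / total if total else 0.0
    # Penalty of a rejected module = volume = area * time in the system.
    penalty = sum(
        w * h * (departure - arrival)
        for (w, h, arrival, departure), ok in zip(events, accepted)
        if not ok
    )
    return acceptance_rate, penalty
```

For example, rejecting a 4 x 4 module that stays 2 time units and a 2 x 2 module that stays 2 time units out of four modules gives a 50% acceptance rate and a penalty of 32 + 8 = 40.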
Program snapshot
Online Placement Results
Percentage of accepted modules using different bin-packing and empty space partitioning rules

Bin-Pack  Data set  KAMER  SSEG   BER    LSQR   LSEG   LER    SQR
FF        ra2048    79.25  74.26  61.52  70.36  52.83  73.87  70.36
FF        ra4096    84.59  79.1   66.84  74.39  58.37  79.49  74.73
FF        ra8192    79.71  73.39  63.23  69.87  55.87  74.88  68.11
FF        ra16384   81.35  75.08  63.59  70.42  55.73  76.13  69.38
FF        Avg(FF)   81.23  75.46  63.80  71.26  55.70  76.09  70.65
BF        ra2048    82.52  77.49  67.18  75.05  58.93  76.46  74.66
BF        ra4096    87.06  81.76  73.22  80.32  64.57  81.66  79.78
BF        ra8192    82.28  77.57  67.85  73.91  59.04  76.12  73.77
BF        ra16384   84.04  78.81  68.5   75.36  60.92  78.25  75.44
BF        Avg(BF)   83.97  78.91  69.19  76.16  60.86  78.12  75.91
BL        ra2048    81.84  76.22  61.72  73.29  55.57  76.07  71.83
BL        ra4096    86.18  81.93  70.29  78.56  62.33  81.42  78.54
BL        ra8192    81.17  75.71  65.04  72.9   59.71  76.54  72.18
BL        ra16384   83.46  77.39  64.97  74.53  58.23  78.29  73.25
BL        Avg(BL)   83.16  77.81  65.50  74.82  58.96  78.08  73.95
Online Placement Results (cont.)
[Bar chart: Penalties for different partitioning heuristics when BF is used. x-axis: partitioning heuristic (KAMER, SSEG, BER, LSQR, LSEG, LER, SQR); y-axis: penalty, 0.0E+00 to 1.8E+08; series: A2048, A4096, A8192, A16384.]
Online Placement Results (cont.)
[Bar chart: Running Time Comparison (time to place "A16384" file). x-axis: KAMER, SSEG; y-axis: time (sec.), 0 to 40; series: BF, FF, BL. Bar labels: KAMER 35.77, 34.27, 34.74; SSEG 2.23, 2.12, 2.24.]
3-D Floorplanning
[Figure: a DFG and its schedule on the RFU and the CPU. RFUOPs appear as boxes in a 3-D space whose axes are the RFU area (x, y) and time (t).]
3-D Floorplanning
[Figure: DFG and schedule on the RFU and CPU, in (x, y, t) space. Callout: By deleting this RFUOP (CPU performs the operation)...]
3-D Floorplanning
[Figure: DFG and schedule on the RFU and CPU, in (x, y, t) space. Callout: This RFUOP can be moved on the RFU.]
3-D Floorplanning
[Figure: DFG and schedule on the RFU and CPU, in (x, y, t) space. Callout: These RFUOPs can be performed earlier...]
Our Current 3-D Floorplanners
• No change in the schedule
  – Fixed insertion and deletions of RFUOPs
• Annealing based
  – Move set
    • Move operation from CPU set to RFU set
    • Move operation from RFU set to CPU set
    • Displace an already placed RFUOP on the RFU
  – Cost function
    • Penalty in rejecting modules (sum of volumes of the RFUOPs in the CPU set)
    • No overlap allowed during annealing
• Greedy
  – Sort the modules on decreasing vol., apply KAMER
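The cost function and the no-overlap constraint can be sketched as follows. The `Op` type, its field names, and the (x, y) position arguments are hypothetical; the sketch only illustrates that each RFUOP is a box in (x, y, t) space and that the cost is the total volume left on the CPU.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    w: int      # width on the RFU
    h: int      # height on the RFU
    start: int  # scheduled start time (fixed: the schedule does not change)
    end: int    # scheduled end time

    @property
    def volume(self):
        return self.w * self.h * (self.end - self.start)


def overlaps(a, pos_a, b, pos_b):
    """True if two placed RFUOPs intersect in x, y, and time."""
    ax, ay = pos_a
    bx, by = pos_b
    return (
        ax < bx + b.w and bx < ax + a.w
        and ay < by + b.h and by < ay + a.h
        and a.start < b.end and b.start < a.end
    )


def cost(cpu_set):
    """Penalty: sum of volumes of the RFUOPs left in the CPU set."""
    return sum(op.volume for op in cpu_set)


def by_decreasing_volume(ops):
    """Ordering used by the greedy variant before placement."""
    return sorted(ops, key=lambda op: op.volume, reverse=True)
```

An annealing move would be accepted only if the moved RFUOP does not overlap any other placed RFUOP; moving an operation from the CPU set to the RFU set lowers the cost by its volume.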
Our Current 3-D Floorplanners (cont.)
• KAMER-BF-Decreasing (KAMER-BFD)
  – Sort the modules on their volumes
  – Use KAMER to find a fast placement of the modules
• Low-temp. annealing (LTSA)
  – Similar to KAMER-BFD, but use KAMER to place only the X% largest modules
  – Use low-temp. annealing to place the rest
• Zero-temp. annealing (ZTSA) -- Greedy
  – Use KAMER to place as many modules as you can
  – Use only the "displace" and "move from CPU to RFU" annealing moves
Our Current 3-D Floorplanners (cont.)
• BFOP - Best Fit Online Placement
  – Sort the RFUOPs on volume (decreasing)
  – For each RFUOP, find candidate “corners”
  – Choose the corner which results in min wasted area (similar to the well-studied 2-D Bin Packing problem)
[Figure: (x, y, t) space with a floor corresponding to time t1 and the candidate corners on that floor.]
Annealing-Based Offline vs. Online
Percentage of accepted modules and penalties using two offline parameters. The higher the RFU acceptance rate and the lower the penalty, the better the algorithm.

Algorithm    Dataset  Offline acc. rate  Online acc. rate  Ratio    Offline penalty  Online penalty  Ratio
LTSA X=100%  T50      70                 84                83.33%   147287           213153          69.10%
LTSA X=100%  T100     72                 83                86.75%   253566           307879          82.36%
LTSA X=100%  S100     86                 84                102.38%  464049           508923          91.18%
LTSA X=100%  S200     81                 89.5              90.50%   539435           612623          88.05%
LTSA X=100%  S1024    84.5               84.6              99.88%   4468662          4643786         96.23%
LTSA X=100%  A1024    87                 89                97.75%   427761           456627          93.68%
LTSA X=100%  Avg      80.08              85.68             93.43%   1050126          1123831         86.77%
LTSA X=20%   T50      76                 84                90.48%   148975           213153          69.89%
LTSA X=20%   T100     82                 83                98.79%   225603           307879          73.28%
LTSA X=20%   S100     81                 84                96.43%   287153           508923          56.42%
LTSA X=20%   S200     85.5               89.5              95.53%   359980           612623          58.76%
LTSA X=20%   A1024    81                 89                91.01%   213036           456627          46.65%
LTSA X=20%   Avg      81.10              85.90             94.45%   246949           419841          61.00%
Offline Placement Results - All
[Bar chart: Comparison of different offline algorithms. x-axis: data files (Tiny50, Tiny100, Small100, Small200, A100); y-axis: penalty of placement, 0 to 700000; series: KAMER-BFD, LTSA, ZTSA, BFOP.]
Flexible Modules
• Library of soft templates
  – Flexible shapes
    • Constant area, different width, height
    • Problem? Hard to build (PD should be done for each shape)
  – Median
    • Use the same area, but square shape
  – Rotation
• Placement method
  – Use best shape (min wasted area)
Using Flexible Modules in BFOP
[Bar chart: Quality improvement when using flexible modules. x-axis: data files (Tiny50, Tiny100, Small100, Small200, Small1024, A100, A2048, avg); y-axis: improvement (percentage), 0.00% to 120.00%; series: Median, Median/Rotation.]
Median uses a square module with the same area.
Flexible Modules (cont.)
• “Firm” templates
  – Slice the module into x horizontal or vertical strips
  – If the module cannot be placed, use the 2-split, 3-split, … until it fits.
• Problem?
  – Routing!
  – Only limited module types can be split (like carry chains, etc., with min communication between stages)
[Figure: vertical 3-split of a module.]
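The splitting loop can be sketched as below. This is a simplification under stated assumptions: strips are as equal as an integer grid allows, each strip must go into a different free rectangle (taken first-fit), and the limit of 6 splits simply mirrors the largest split shown in the results. All names are hypothetical.

```python
def split_module(w, h, k, vertical=True):
    """Slice a w x h module into k strips; vertical cuts divide the width."""
    total = w if vertical else h
    base, extra = divmod(total, k)
    sizes = [base + (1 if i < extra else 0) for i in range(k)]
    return [(s, h) if vertical else (w, s) for s in sizes if s > 0]


def first_fitting_split(w, h, free_rects, max_k=6):
    """Try no-split (k = 1), then 2-split, 3-split, ... until all strips fit.

    free_rects: list of (width, height) empty rectangles.
    Returns (k, vertical, strips) or None if nothing fits.
    """
    for k in range(1, max_k + 1):
        for vertical in (True, False):
            strips = split_module(w, h, k, vertical)
            remaining = list(free_rects)
            placed_all = True
            for sw, sh in strips:
                idx = next(
                    (i for i, (fw, fh) in enumerate(remaining)
                     if fw >= sw and fh >= sh),
                    None,
                )
                if idx is None:
                    placed_all = False
                    break
                remaining.pop(idx)  # one strip per free rectangle
            if placed_all:
                return k, vertical, strips
    return None
```

A 9 x 4 module with free rectangles (3, 4), (3, 5), and (4, 4) does not fit unsplit or as a 2-split, but fits as a vertical 3-split of three 3 x 4 strips. Routing between the strips is not modeled here.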
Quality Improvements Using Firm Templates
[Chart: Placement improvement when using firm templates (in OBFD). x-axis: Split-2, Split-3, Split-4, Split-5, Split-6; y-axis: percentage improvement over no-split, 0.00% to 100.00%; series: Tiny50, Tiny100, Small100, Small200, Small1024, A100, A1024, avg.]
Conclusion
• Which online algorithm?
  – If speed is an issue, SSEG; otherwise KAMER
• Online or offline?
  – If you have the schedule => offline
• Which offline algorithm?
  – BFOP is the best (faster + better quality)
• Median? Flexibility? Firm templates?
  – Surprisingly, median gives little improvement
  – If flexible shapes are available, they are better than splitting (no additional routing problem)
  – How many splits?
    • no-split => 2-split: 23% improvement
    • 5-split => 6-split: 3% improvement