8/12/2019 Stretching Performance Potentials Beyond Hardware Constraints
http://slidepdf.com/reader/full/stretching-performance-potentials-beyond-hardware-constraints 1/30
INTEGRATED SELECTION, PARTITIONING AND PLACEMENT FRAMEWORK FOR RECONFIGURABLE ARCHITECTURES
By: RANA MUHAMMAD BILAL
PRESENTATION CONTENTS
INTRODUCTION
PROBLEM MODEL
PRIOR ART + CONTRIBUTIONS OF WORK
DESCRIPTION OF WORK OVERVIEW
GENETIC SELECTION ALGORITHM
RECURSIVE BACKTRACKING PARTITIONING ALGORITHM
GREEDY PARTITIONING ALGORITHM
PRIORITY PLACEMENT ALGORITHM
RESULTS
INTRODUCTION
Market demands on performance, design turnaround and size
Research interest in reconfigurable computing
90% of time is spent in 10% of code [90/10 rule]
Port selected compute-intensive code blocks to hardware [Hot Areas]
[Figure: design to be implemented, split between General Purpose Processor (Software) and ASIC/FPGA (Hardware)]
INTRODUCTION
Port selected code to hardware (saves size + time)... Can we reduce size further?
Dynamic Reconfiguration: reuse hardware over time! (+ size reduced, - reconfiguration cost)
Partial Dynamic Reconfiguration: tailor-cut hardware reuse along space (only reconfigure when feasible + needed)
INTRODUCTION
Partial Reconfiguration: communication network reconfiguration overhead
Tiled Partially Reconfigurable Systems [16]
Intelligent choice of bin sizes to compensate for reduced flexibility (Contribution: first algorithm for partitioning)
[Figure: two block diagrams, each showing a GPP attached to a Recon. Fabric]
[16] Markus Koester, Wayne Luk, Jens Hagemeyer, Mario Porrmann and Ulrich Rückert, "Design Optimizations for Tiled Partially Reconfigurable Systems" in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 19, No. 6, June 2011
Problem Model
Formulate Design Problem
- Identify Hot Areas, generate CIS Table
- Extract Loop Trace
- Available area on fabric
Tasks to do
- Selection (choose Implementation Variants)
- Partitioning (partition Reconfigurable Area into Bins)
- Placement (assign Bins for the execution of Implementation Variants)
[Figure: GPP attached to a Recon. Fabric]
Circuit instantiated on a Tile can be Coprocessor/Custom Instruction (Contribution: framework applicable to both models)
(Contribution: integrated solution for Selection/Partitioning/Placement)
• Reconfiguration and Execution can overlap
• One Reconfiguration at a time
Prior Research Work
'Selection' Algorithms [9][10][19]
- Either specific to 'Coprocessor' model or 'Custom Instruction' model
- No consideration for multiple implementation alternatives
- Partial reconfiguration not supported / joint optimization with placement and partitioning not considered
'Placement' Algorithms [14][15]
- Communication overheads for partial reconfiguration neglected
- Multi-sized tiles/bins not supported
- Joint optimization with 'Selection' and 'Partitioning' not considered
No prior work on 'Partitioning' algorithm
Miaoqing Huang, Vikram K. Narayana, Mohamed Bakhouya, Jaafar Gaber, Tarek El-Ghazawi, "Efficient Mapping of Task Graphs onto Reconfigurable Hardware Using Architectural Variants" in IEEE Transactions on Computers, Aug 2011
Honglei Han, Wenju Liu, Wu Jigang and Guiyuan Jiang, "Efficient Algorithm for Hardware/Software Partitioning and Scheduling on MPSoC" in Journal of Computers, Vol. 8, No. 1, January 2013
Designing on a TPR Architecture
for (i = 0; i < 1; i++)
{
    first_term = i * i;                          /* Hot Area 1: square */
    for (j = 0; j < 2; j++)
    {
        second_term = j * j;                     /* Hot Area 2: square ... */
        answer[i][j] = first_term - second_term; /* ... and subtract */
    }
}
Hot Area 1
Implementation alternative 1:
- Execute on GPP
- Area requirement on fabric = 0 logic blocks
- Execution time = 4 clock cycles
Implementation alternative 2:
- Implement custom hardware to 'square'
- Area requirement on fabric = 1 logic block
- Execution time = 2 clock cycles
Hot Area 2
Implementation alternative 1:
- Execute on GPP
- Area requirement on fabric = 0 logic blocks
- Execution time = 7 clock cycles
Implementation alternative 2:
- Implement custom hardware to 'square' and subtract in GPP
- Area requirement on fabric = 2 logic blocks
- Execution time = 4 clock cycles
Implementation alternative 3:
- Implement custom hardware to 'square' and subtract
- Area requirement on fabric = 3 logic blocks
- Execution time = 3 clock cycles
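Loop trace/Execution sequence: 1 2 2
CIS Table: A_ij (area requirement) and T_ij (execution time), indexed by hot area i and implementation alternative j
Chromosome: 2 3 3 (each entry is a gene)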
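As a rough illustration of how a chromosome is read against the CIS table, the sketch below looks up the area and execution time of each selected alternative along the loop trace. It is only a sketch: the names `CIS`, `LOOP_TRACE` and `evaluate` are hypothetical, and reconfiguration cost and the area limit are deliberately ignored, so it is not the framework's actual fitness function.

```python
import math

# CIS table from the slide: (area in logic blocks, execution time in clock cycles)
CIS = {
    1: [(0, 4), (1, 2)],          # Hot Area 1, alternatives 1..2
    2: [(0, 7), (2, 4), (3, 3)],  # Hot Area 2, alternatives 1..3
}
LOOP_TRACE = [1, 2, 2]  # Hot Area 1 once, then Hot Area 2 twice


def evaluate(chromosome, trace=LOOP_TRACE, cis=CIS):
    """Return (total execution time, area needed at each step).

    Assumption: reconfiguration overhead and the fabric area limit are ignored.
    """
    picks = [cis[task][gene - 1] for task, gene in zip(trace, chromosome)]
    total_time = sum(time for _, time in picks)
    areas = [area for area, _ in picks]
    return total_time, areas


# Number of distinct selections: one gene per loop-trace entry.
SEARCH_SPACE = math.prod(len(CIS[task]) for task in LOOP_TRACE)

print(evaluate([1, 1, 1]))  # all-software selection
print(evaluate([2, 3, 3]))  # the chromosome shown on the slide
print(SEARCH_SPACE)
```

The all-software chromosome 1 1 1 sums to 18 clock cycles here, which agrees with the fitness listed for it in the evolution example later in the deck; chromosomes that use the fabric can score worse than this plain sum once reconfiguration is accounted for.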
Overview of Framework
[Figure: framework flowchart; recoverable label: "Generate new population"]
Step 1: Genetic Selection
P: Population Limit
Fitness: Execution Time
(Goal of the genetic optimizer is to minimize fitness)
Step 2: Partitioning 1 (Recursive Backtracking)
[Figure: backtracking search tree; surviving node labels: 3,2,1 / 3 / 3 / 2 / 1 / 1]
List GoalBacktracking (index, current)
    If index > length of list
        return
    for i from index to length of list
        If current + list[i] = goal
            add list[i] to candidate_solution
            add candidate_solution to solutions
            candidate_solution = [ ]
            return
        If current + list[i] < goal
            add list[i] to candidate_solution
            Backtracking (index + 1, current + list[i])
goal = Available Area
list = descending-order sorted list of all area requirements specified by the loop trace and the chromosome under consideration
index = entry number of 'list' under processing
current = cumulative sum of 'list' entries traversed in a particular
thread. stored in
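The backtracking idea above can be sketched as a runnable subset-sum search. This is an illustrative reading of the slide, not the author's code: the function name `goal_backtracking` is hypothetical, each list entry is assumed to be usable at most once, and the candidate is passed down the recursion instead of being kept in shared state.

```python
def goal_backtracking(areas, goal):
    """Return every combination of entries in `areas` that sums exactly to `goal`.

    Assumptions (not stated on the slide): entries are positive integers and
    each entry may be used at most once per combination.
    """
    areas = sorted(areas, reverse=True)  # descending order, as on the slide
    solutions = []

    def search(index, current, candidate):
        for i in range(index, len(areas)):
            # Skip equal values at the same depth to avoid duplicate solutions.
            if i > index and areas[i] == areas[i - 1]:
                continue
            total = current + areas[i]
            if total == goal:
                solutions.append(candidate + [areas[i]])
            elif total < goal:
                search(i + 1, total, candidate + [areas[i]])
            # total > goal: this entry does not fit, try the next smaller one

    search(0, 0, [])
    return solutions


print(goal_backtracking([3, 2, 1], 3))  # → [[3], [2, 1]]
```

Each returned combination is one candidate set of bin sizes that fills the available area exactly.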
Step 2: Partitioning 2 (Greedy Partitioning)
Step 3: Placement
Get Area model from partitioning algorithm
Repeat: until at end of chromosome
    Select next gene
    If: corresponding implementation variant already placed
        Use same bin placement
    Else:
        Loop: through all empty bins
            Place in smallest bin satisfying area requirement
        If: not placed until this step
            Loop: through all filled bins
                Determine future_reuse_index* of placed implementation variant
            Place in smallest bin with smallest future_reuse_index satisfying area requirement
*future_reuse_index: number of times the same task type reoccurs with the same implementation variant selection
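The placement steps above could be sketched as follows. This is a minimal interpretation under stated assumptions, not the framework's implementation: the function `place` and its arguments are hypothetical, zero-area variants are assumed to run on the GPP and need no bin, and when evicting from filled bins the smallest future reuse count is taken first with bin size as the tie-break (the slide does not say which criterion has priority).

```python
def place(trace, chromosome, area, bin_sizes):
    """Assign a bin index (or None for software) to each step of the loop trace.

    trace[k]      : task type executed at step k
    chromosome[k] : implementation variant selected for step k
    area[(task, variant)] : area requirement of that variant
    bin_sizes     : bin sizes produced by the partitioning step
    """
    steps = list(zip(trace, chromosome))
    occupant = [None] * len(bin_sizes)  # (task, variant) currently held by each bin
    placement = []
    for k, key in enumerate(steps):
        need = area[key]
        if need == 0:                   # assumption: runs on the GPP, no bin needed
            placement.append(None)
            continue
        if key in occupant:             # variant already placed: reuse its bin
            placement.append(occupant.index(key))
            continue
        fitting = [b for b, size in enumerate(bin_sizes) if size >= need]
        empty = [b for b in fitting if occupant[b] is None]
        if empty:                       # smallest empty bin that fits
            chosen = min(empty, key=lambda b: bin_sizes[b])
        elif fitting:                   # evict the variant reused least in the future
            def future_reuse(b):
                return steps[k + 1:].count(occupant[b])
            chosen = min(fitting, key=lambda b: (future_reuse(b), bin_sizes[b]))
        else:                           # hypothetical fallback: nothing fits
            placement.append(None)
            continue
        occupant[chosen] = key
        placement.append(chosen)
    return placement


areas = {(1, 2): 1, (2, 3): 3}
print(place([1, 2, 2], [2, 3, 3], areas, bin_sizes=[1, 3]))  # → [0, 1, 1]
```

In the usage line, the small variant takes the 1-block bin, the larger variant takes the 3-block bin, and the repeated gene reuses the same bin without a new placement.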
Step 3: Placement
Example: Place_n_Partition
Example: Evolution
Chromosome    Fitness Value (Execution Time)
223           14 clock cycles
131           14 clock cycles
111           18 clock cycles
123           11 clock cycles
Next population
Chromosome    Fitness Value (Execution Time)
213           13 clock cycles
133           10 clock cycles
Elite: 1 chromosome with best fitness is passed unchanged to the next generation as "Elite"
Crossover: choose two chromosomes at random for crossover; exchange genes around a randomly chosen position (figure values: 1 2 1, 1 3 1, 1 2 3, 1 3 3)
Mutation: choose one chromosome at random for mutation; randomly choose a gene and assign a random value (within bounds) to it (figure values: 2 2 3, 2 1 3)
Possible selection solutions for this simple problem: 2 x 3 x 3 = 18.
Combining with Partition & Placement: 72 points.
We only explored 8 points.
Execution time reduced from 18 clock cycles to 10 clock cycles.
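The three operators named in this example (elite, crossover, mutation) can be sketched as a small genetic loop. This is a generic illustration and not the author's optimizer: the function `evolve`, the 50/50 choice between crossover and mutation, and the default population size and generation count are all assumptions, and the fitness function is supplied by the caller.

```python
import random


def evolve(fitness, bounds, pop_size=4, generations=10, seed=0):
    """Minimize `fitness` over chromosomes whose gene g lies in 1..bounds[g].

    Assumptions: pop_size >= 2, and crossover/mutation are picked with equal
    probability for every non-elite slot of the next population.
    """
    rng = random.Random(seed)
    population = [[rng.randint(1, b) for b in bounds] for _ in range(pop_size)]
    for _ in range(generations):
        elite = min(population, key=fitness)
        next_population = [elite[:]]              # elite passes unchanged
        while len(next_population) < pop_size:
            if len(bounds) > 1 and rng.random() < 0.5:
                # Crossover: exchange genes around a randomly chosen position.
                parent_a, parent_b = rng.sample(population, 2)
                cut = rng.randint(1, len(bounds) - 1)
                child = parent_a[:cut] + parent_b[cut:]
            else:
                # Mutation: give one random gene a random value within bounds.
                child = rng.choice(population)[:]
                gene = rng.randrange(len(bounds))
                child[gene] = rng.randint(1, bounds[gene])
            next_population.append(child)
        population = next_population
    return min(population, key=fitness)


# Toy fitness: plain sum of execution times from the two-hot-area example,
# ignoring reconfiguration cost (an assumption made for this illustration).
TIMES = [[4, 2], [7, 4, 3], [7, 4, 3]]
best = evolve(lambda c: sum(TIMES[g][v - 1] for g, v in enumerate(c)), bounds=[2, 3, 3])
print(best)
```

Because the elite is always copied forward, the best fitness in the population can never get worse from one generation to the next.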
Results
Percent difference of PU between DP and GA
α = 2       α = 3       α = 4          α = 5          α = 6
10.43841    16.32047    12.27786753    15.42699725    17.46641
4.428044    17.07921    26.28062361    8.516886931    15.74194
4.74934     13.38983    11.43867925    10.92150171    9.143519
12.88344    12.38318    11.30434783    9.322033898    19.22111
6.220096    6.487696    7.267144319    11.42533937    14.86291
[Figure: bar chart of the percent difference of PU between DP and GA for α = 2 to α = 6]
Test Case: Locality Sensitive Hashing
Locality Sensitive Hashing: Area Requirement
Virtex-6 has 8 registers and 4 LUTs per slice
Slices used = max(Reg/8, LUT/4)
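The slice estimate above is a one-line formula; the sketch below writes it out. The function name `slices_used` is hypothetical, and rounding each ratio up to a whole slice is an assumption that the slide does not state.

```python
import math


def slices_used(registers, luts, regs_per_slice=8, luts_per_slice=4):
    """Estimate slices as max(Reg/8, LUT/4), rounding up to whole slices (assumption)."""
    return max(math.ceil(registers / regs_per_slice),
               math.ceil(luts / luts_per_slice))


print(slices_used(registers=16, luts=4))   # register-bound design → 2
print(slices_used(registers=8, luts=40))   # LUT-bound design → 10
```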
Locality Sensitive Hashing: Loop Trace/CIS
1111112333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333
3333333333333333333333333333333333333333331111112333333333333333333333333333333333333333333333333333333333333333333
3333333333333333333333333333333333333333333333333333333333333333333333333333333333331111112333333333333333333333333
3333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333
3333333333341111112333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333
3333333333333333333333333333333333333333333333333333331111112333333333333333333333333333333333333333333333333333333
3333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333331111112333333333333
3333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333
3333333333333333333333341111112333333333333333333333333333333333333333333333333333333333333333333333333333333333333
33333333333333333333333333333333333333333333333333333333333333333311111123333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333331111112
3333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333
333333333333333333333333333333333334
Substitutions (applied to the original loop trace above)
A = 111111
B = 2
C = 150 times 3
D = 4
Loop Trace
ABCABCABCDABCABCABCDABCABCABCD
CIS: Area Requirement (Slices)
Block    Alt. 1    Alt. 2    Alt. 3    Alt. 4
A        0         117       236       833
B        0         136       Inf       Inf
C        0         58        233       925
D        0         161       Inf       Inf
CIS: Execution Time (Cycles)
Block    Alt. 1    Alt. 2    Alt. 3    Alt. 4
A        1536      960       672       480
B        12        6         0         0
C        57600     38400     16800     12600
D        3         1         0         0
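As a cross-check of the CIS data, the sketch below sums execution times along the compressed loop trace for two reference cases: everything in software, and the fastest available alternative for every block with no reconfiguration cost. It assumes the table rows correspond to blocks A to D, the columns to alternatives 1 to 4, and that an area of Inf marks an alternative that does not exist; the names used are hypothetical.

```python
import math

INF = math.inf

# Rows assumed to be blocks A..D, columns alternatives 1..4.
AREA = {"A": [0, 117, 236, 833], "B": [0, 136, INF, INF],
        "C": [0, 58, 233, 925], "D": [0, 161, INF, INF]}
TIME = {"A": [1536, 960, 672, 480], "B": [12, 6, 0, 0],
        "C": [57600, 38400, 16800, 12600], "D": [3, 1, 0, 0]}
TRACE = "ABCABCABCD" * 3  # compressed loop trace from the slide


def software_time(trace=TRACE):
    """Total cycles when every block uses alternative 1 (area 0, software)."""
    return sum(TIME[block][0] for block in trace)


def best_time(trace=TRACE):
    """Total cycles with the fastest existing alternative per block, no reconfiguration."""
    def fastest(block):
        return min(t for a, t in zip(AREA[block], TIME[block]) if a != INF)
    return sum(fastest(block) for block in trace)


print(software_time())  # → 532341
print(best_time())      # → 117777
```

Both totals match the "Software Execution Time" and "Best Time (without Reconfig, Best CIS)" figures quoted on the LSH solution slides, which supports this reading of the tables.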
LSH: Solution
1 1 2 2 1 2 1 1 2 1 1 1 2 1 1 2 1 1 2 1 1 1 2 1 1 2 1 1 2 1
1 2 1 1 2 1 1 2 1 3 1 2 1 1 2 1 1 2 1 3 1 2 1 1 2 1 1 2 1 3
1 2 3 1 2 3 1 2 3 4 1 2 3 1 2 3 1 2 3 4 1 2 3 1 2 3 1 2 3 4
1 1 4 4 1 4 1 1 4 1 1 1 4 1 1 4 1 1 4 1 1 1 4 1 1 4 1 1 4 1
0 0 0 14148 0 15553 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 925 15073 0 16478 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1536 1548 15073 15553 16478 29078 30614 30626 43226 43229 44765 44777 57377 58913 58925 71525 73061 73073 85673 85676 87212 87224 99824 101360 101372 113972 115508 115520 128120
1536 1548 14148 15553 15565 29078 30614 30626 43226 43229 44765 44777 57377 58913 58925 71525 73061 73073 85673 85676 87212 87224 99824 101360 101372 113972 115508 115520 128120 128123
Rows shown: Loop Trace, Selection, Placement, Reconfiguration Map, Execution Map
Partitioning: 1 bin of Size 925 Slices
Software Execution Time = 532341 cycles
Best Time (without Reconfig, Best CIS) = 117777 cycles
Area Required for Best CIS = 17529 Slices
Achieved Time = 128123 cycles
Area used = 1028 Slices
Reconfig. Overhead: 1838 Cycles
Achieved time is only 10346 cycles above the best possible execution time (1.94 percent of the software execution time)
Using 17 times less area!
LSH: Solution
1 1 2 2 1 2 1 1 2 1 1 1 2 1 1 2 1 1 2 1 1 1 2 1 1 2 1 1 2 1
1 2 1 1 2 1 1 2 1 3 1 2 1 1 2 1 1 2 1 3 1 2 1 1 2 1 1 2 1 3
1 2 3 1 2 3 1 2 3 4 1 2 3 1 2 3 1 2 3 4 1 2 3 1 2 3 1 2 3 4
1 1 4 4 1 4 1 1 4 1 1 1 4 1 1 4 1 1 4 1 1 1 4 1 1 4 1 1 4 1
Rows shown: Loop Trace, Selection, Placement, Reconfiguration Map, Execution Map
Partitioning: 1 bin of Size 925 Slices
Software Execution Time = 532341 cycles
Best Time (without Reconfig, Best CIS) = 117777 cycles
Area Required for Best CIS = 17529 Slices
Achieved Time = 127341 cycles
Area used = 925 Slices
Reconfig. Overhead: 0 Cycles
Achieved time is only 9564 cycles above the best possible execution time (about 1.8 percent of the software execution time)
Using 19 times less area!
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 925 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1536 1548 14148 15684 15696 28296 29832 29844 42444 42447 43983 43995 56595 58131 58143 70743 72279 72291 84891 84894 86430 86442 99042 100578 100590 113190 114726 114738 127338
1536 1548 14148 15684 15696 28296 29832 29844 42444 42447 43983 43995 56595 58131 58143 70743 72279 72291 84891 84894 86430 86442 99042 100578 100590 113190 114726 114738 127338 127341
LSH: Results (Execution Time [Cycles])
Slices    Area Ratio    Software    Dynamic Programming    Genetic Algorithm
8765      α = 2         532341      125613                 118392
5843      α = 3         532341      136320                 119229
4382      α = 4         532341      131315                 119341
3506      α = 5         532341      139991                 119369
2922      α = 6         532341      145911                 119776
[Figure: chart of execution time (cycles) for Dynamic Programming and Genetic Algorithm, α = 2 to α = 6]
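To make the table easier to compare, the sketch below derives two ratios from the listed cycle counts: how much shorter the Genetic Algorithm result is than the Dynamic Programming result, and the Genetic Algorithm speedup over software. The metric names and the function `summarize` are hypothetical conveniences, not quantities defined in the deck.

```python
SOFTWARE_CYCLES = 532341

# alpha: (slices, dynamic programming cycles, genetic algorithm cycles)
RESULTS = {
    2: (8765, 125613, 118392),
    3: (5843, 136320, 119229),
    4: (4382, 131315, 119341),
    5: (3506, 139991, 119369),
    6: (2922, 145911, 119776),
}


def summarize(results=RESULTS, software=SOFTWARE_CYCLES):
    """Return, per alpha, the GA gain over DP (percent) and the GA speedup over software."""
    summary = {}
    for alpha, (_slices, dp, ga) in results.items():
        summary[alpha] = {
            "ga_vs_dp_percent": 100.0 * (dp - ga) / dp,
            "ga_speedup": software / ga,
        }
    return summary


for alpha, row in summarize().items():
    print(f"alpha={alpha}: GA is {row['ga_vs_dp_percent']:.2f}% faster than DP, "
          f"{row['ga_speedup']:.2f}x faster than software")
```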
QUESTIONS?
Dynamic Programming (Ref)
[Figure: dynamic programming formulation from [1]; surviving symbols: A, physical_area, area]
Push to previous configuration, if feasible
Selected CIS takes up entire available area in current configuration
Area available in current configuration, but not feasible for CIS of previous tasks
[1] Dynamic Reconfig of CFU by T. Mitra, 2009c
Dynamic Programming (Ref)
Line 3: Flooring takes you to the end of the last complete configuration. If it is empty, there is no need to explore more configurations.
Line 7: Similar logic as Line 3: if all loops are done and there is an empty configuration, then end.
[1] Dynamic Reconfig of CFU by T. Mitra, 2009c
M M B M M B M M B M M M B M M B M M B M M M B M M B M M B M
1 2 3 1 2 3 1 2 3 4 1 2 3 1 2 3 1 2 3 4 1 2 3 1 2 3 1 2 3 4
1 1 4 1 1 4 1 1 4 1 1 1 4 1 1 4 1 1 4 1 1 1 4 1 1 4 1 1 4 1
Partitioning: 1 bin of Size 925 Slices
Loop Trace
Selection
Placement
[Figure: Dynamic Reconfiguration versus Partial Dynamic Reconfiguration; panels show a Configurable Fabric and its Virtual Area, with the Reconfiguration Overhead marked]
[Figure: two block diagrams, each showing a GPP attached to a Recon. Fabric]
Partial Reconfigurable System
- Any desired chunk of fabric may be reconfigured
- Communication network can't be static
Tiled Reconfigurable System
- Pre-defined reconfigurable regions
- Communication network can be static