University of Michigan, Electrical Engineering and Computer Science
Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System
Kevin Fan, Manjunath Kudlur, Hyunchul Park, Scott Mahlke
Advanced Computer Architecture Laboratory, University of Michigan
Introduction
• Emerging applications have high performance, cost, and energy demands
– H.264, wireless, software radio, signal processing
– 10-100 Gops required
– 200 mW power budget
• Applications dominated by tight loops processing large amounts of streaming data
[Figure labels: 3.5G (HSDPA), WiMax, stereo headset, TV out, PC / Mac, memory card, 20 GB HD. Source: ARM 2005]
Loop Accelerators
• Order-of-magnitude performance and efficiency wins
– Viterbi: 100x speedup vs. ARM9
Automated C-to-gates solution
• Correct by construction
• Close designer productivity gap
• Achieve short time-to-market
Loop Accelerator Template
• Parameterized execution resources, storage, connectivity
• Hardware realization of modulo scheduled loop
Loop Accelerator Design Flow
[Figure: design flow]
Input: C code (.c) and performance (throughput) target
1. FU allocation → abstract architecture
2. Modulo schedule (ops placed on FUs over time) → scheduled ops
3. Build datapath (RF, FUs) → concrete architecture
4. Instantiate architecture → Verilog (.v), control signals
5. Synthesize → loop accelerator
Modulo Scheduling and Datapath Derivation
• Schedule to abstract architecture (FUs)
• Determine register and interconnect requirements from schedule
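Source code:
r1 = Mem[r2]
r3 = r1 + 12
[Figure: the schedule places LOAD and ADD on FU1 and FU2 (times 1 and 4 marked); the derived datapath connects a MEM unit to an adder whose other input is the constant 12]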
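As a rough illustration of reading register and interconnect requirements off a schedule, the sketch below uses the LOAD/ADD example. The dictionary layout, the function name `derive_datapath`, the placement of LOAD at time 1 and ADD at time 4, and the rule that register depth equals consumer issue time minus producer issue time are all assumptions made for illustration, not the system's actual data model.

```python
from collections import defaultdict

def derive_datapath(schedule, edges):
    """Return (register depth per producer FU, set of FU-to-FU wires).

    schedule: op name -> (fu, issue_time)
    edges:    list of (producer_op, consumer_op) data dependences
    """
    depth = defaultdict(int)   # longest value lifetime seen at each FU output
    wires = set()              # (source FU, destination FU) connections
    for producer, consumer in edges:
        src_fu, t_def = schedule[producer]
        dst_fu, t_use = schedule[consumer]
        # Assumed: a value lives from its producer's issue time to its consumer's.
        depth[src_fu] = max(depth[src_fu], t_use - t_def)
        wires.add((src_fu, dst_fu))
    return dict(depth), wires

# Hypothetical placement of the two ops from the source-code example.
schedule = {"LOAD": ("FU1", 1), "ADD": ("FU2", 4)}
depth, wires = derive_datapath(schedule, [("LOAD", "ADD")])
print(depth, wires)
```

Under these assumptions the LOAD result must be held for three cycles at FU1's output, and one wire from FU1 to FU2 is needed.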
Cost Sensitive Scheduling
• Different scheduling alternatives are not equal
[Figure: two alternative schedules of the operations +1, +2, LD1, and LD2 on FU1-FU3 over times 0-2, each shown with the FU1-FU3 datapath it implies]
• Traditional scheduling is hardware unaware
• Intelligent scheduling needed to reduce hardware cost
Scheduling to Reduce Cost
• Hardware cost is a function of the final schedule
• Increased hardware sharing = reduced cost
• Reusing hardware is “free”
• Traditional metrics (register pressure) are not sufficient
[Figure: FUs with numbered output registers (1-2 and 3-4); annotation: no additional cost for longer lifetime]
Initial Approach: Greedy
• Standard iterative modulo scheduler, augmented with hardware cost model
• Choose alternative which increases cost the least
while unscheduled ops remain {
    get valid alternatives for op
    for each alternative {
        get hardware cost
    }
    schedule op using min-cost alternative
    update hardware cost model
}
Hardware cost = FU cost + Storage cost + Wire cost
[Figure: FU operation types +, -, *, <<]
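The greedy loop on this slide could be sketched in Python roughly as follows. This is a simplified stand-in, not the authors' scheduler: the gate counts, the unit operation latency, the rule that an FU is charged once per opcode it executes, and the absence of eviction or backtracking are all assumptions, and every name here is hypothetical.

```python
from itertools import product

OP_COST = {"+": 100, "*": 1000, "LD": 300}    # hypothetical gate counts per opcode
REG_BIT_COST, WIRE_COST, WIDTH = 10, 50, 32   # hypothetical cost parameters

def hardware_cost(sched, ops, edges):
    """FU cost + storage cost + wire cost of a (possibly partial) schedule."""
    fu_ops, depth, wires = {}, {}, set()
    for op, (fu, _) in sched.items():
        fu_ops.setdefault(fu, set()).add(ops[op])
    for src, dst in edges:
        if src in sched and dst in sched:
            (sfu, st), (dfu, dt) = sched[src], sched[dst]
            depth[sfu] = max(depth.get(sfu, 0), dt - st)
            wires.add((sfu, dfu))
    fu_cost = sum(OP_COST[kind] for kinds in fu_ops.values() for kind in kinds)
    storage = sum(WIDTH * d * REG_BIT_COST for d in depth.values())
    return fu_cost + storage + WIRE_COST * len(wires)

def greedy_schedule(ops, edges, n_fus, ii):
    """Place each op at the (FU, time) alternative that raises cost the least."""
    sched, busy = {}, set()            # busy holds occupied (fu, time mod II) slots
    for op in ops:                     # ops assumed listed in dependence order
        earliest = max([sched[s][1] + 1 for s, d in edges if d == op], default=0)
        best = None
        for fu, t in product(range(n_fus), range(earliest, earliest + ii)):
            if (fu, t % ii) in busy:
                continue               # modulo resource conflict
            cost = hardware_cost({**sched, op: (fu, t)}, ops, edges)
            if best is None or cost < best[0]:
                best = (cost, fu, t)
        if best is None:
            raise RuntimeError(f"no free slot for {op}")  # a real scheduler would evict
        _, fu, t = best
        sched[op] = (fu, t)
        busy.add((fu, t % ii))         # update the resource model
    return sched

ops = {"ld1": "LD", "add1": "+", "ld2": "LD", "add2": "+"}
edges = [("ld1", "add1"), ("ld2", "add2")]
sched = greedy_schedule(ops, edges, n_fus=3, ii=2)
print(sched, hardware_cost(sched, ops, edges))
```

Because each decision looks only at the ops placed so far, the result depends on the order in which ops are visited.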
Results – Greedy Scheduling
• 5% average cost savings
• Local scope → local minima
• Much more cost savings possible
[Figure: normalized gate cost (FU, storage, MUX) for naïve and greedy scheduling on sobel, fir, dequant, dcac, viterbi, sharp, sha, and the average]
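Optimal Modulo Scheduling
[Figure: loop with operations +1, +2, LD3, -5, +4, and the search space in which each of Op1, Op2, Op3 takes a (FU #, time) value such as (1,0), (1,1), (2,0), (2,1), (3,0), (3,1)]
• Optimal modulo scheduling extends [Eichenberger ’97]
Storage cost = \sum_i width_i \times depth_i
FU cost = \sum_i cost(FU_i)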
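Once a schedule fixes each register structure's width and depth and each FU's type, the two cost terms can be evaluated directly. The following is a minimal sketch; the gate counts, FU type names, and the `(width, depth)` tuple layout are made-up placeholders rather than values from the talk.

```python
FU_GATES = {"adder": 100, "multiplier": 1000, "memory_port": 300}  # hypothetical

def storage_cost(registers):
    """Sum of width_i * depth_i over all register structures i."""
    return sum(width * depth for width, depth in registers)

def fu_cost(fus):
    """Sum of cost(FU_i) over all instantiated function units i."""
    return sum(FU_GATES[kind] for kind in fus)

registers = [(32, 3), (16, 1)]     # (width in bits, depth in entries)
fus = ["memory_port", "adder"]
print(storage_cost(registers), fu_cost(fus))
```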
Results – Optimal Scheduling
• 27% average cost savings
[Figure: normalized gate cost (FU, storage, MUX) for naïve, greedy, and optimal scheduling on sobel, fir, dequant, dcac, viterbi, sharp, sha, and the average]
Problem Decomposition
• Exact solutions are not practical
– (#FU × II × stages) ^ #ops possible schedules
– 20 lines of C code → 100 hours
– Excessive runtimes even for modest-size loops
• Decompose into more manageable sub-problems
– Partitioned scheduling
– Time-space decomposition
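To give a feel for how quickly the exact search space grows, the sketch below evaluates the count of possible schedules, reading it as (#FU × II × stages) raised to the number of ops, and checks it against brute-force enumeration on a tiny case. The function names are hypothetical, and the enumeration counts every placement, including ones that violate dependences or resource limits.

```python
from itertools import product

def search_space_size(n_fus, ii, stages, n_ops):
    """Number of (FU, time) placements: each op picks an FU and one of II*stages times."""
    return (n_fus * ii * stages) ** n_ops

def enumerate_schedules(n_fus, ii, stages, ops):
    """Yield every assignment of ops to (fu, time), feasible or not."""
    places = list(product(range(n_fus), range(ii * stages)))
    for choice in product(places, repeat=len(ops)):
        yield dict(zip(ops, choice))

tiny = sum(1 for _ in enumerate_schedules(2, 2, 1, ["a", "b"]))
print(tiny, search_space_size(2, 2, 1, 2))
# Illustrative parameters only: 3 FUs, II = 4, 3 stages, 20 ops.
print(search_space_size(3, 4, 3, 20))
```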
Partitioned Scheduling
• Partition the operations into small groups
• Schedule groups of operations sequentially
– Account for hardware contribution of previously scheduled groups
– Backtrack if infeasible state reached
[Figure: a dataflow graph of ops 1-5 is split into groups, and each group is passed to the optimal modulo scheduler in turn]
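The group-by-group flow described on this slide could look roughly like the sketch below. It is a simplification under stated assumptions: exhaustive search stands in for the optimal modulo scheduler, only FU cost is modeled (an FU is charged once per opcode it executes), latency is one cycle, and an infeasible group raises an error where the real flow would backtrack. All names and gate counts are hypothetical.

```python
from itertools import product

OP_COST = {"+": 100, "LD": 300}   # hypothetical gate counts

def fu_cost(sched, opcode):
    """Charge each FU once for every distinct opcode it must execute."""
    pairs = {(fu, opcode[op]) for op, (fu, _) in sched.items()}
    return sum(OP_COST[kind] for _, kind in pairs)

def feasible(sched, edges, ii):
    """No two ops share a (fu, time mod II) slot; consumers follow producers."""
    slots = [(fu, t % ii) for fu, t in sched.values()]
    deps_ok = all(sched[d][1] > sched[s][1]
                  for s, d in edges if s in sched and d in sched)
    return len(slots) == len(set(slots)) and deps_ok

def schedule_group(group, fixed, opcode, edges, n_fus, ii, horizon):
    """Exhaustively place one group on top of the already scheduled ops."""
    places = list(product(range(n_fus), range(horizon)))
    best = None
    for choice in product(places, repeat=len(group)):
        sched = {**fixed, **dict(zip(group, choice))}
        if feasible(sched, edges, ii):
            cost = fu_cost(sched, opcode)   # includes earlier groups' hardware
            if best is None or cost < best[0]:
                best = (cost, sched)
    return best

def partitioned_schedule(groups, opcode, edges, n_fus, ii, horizon):
    sched = {}
    for group in groups:
        result = schedule_group(group, sched, opcode, edges, n_fus, ii, horizon)
        if result is None:
            raise RuntimeError("infeasible state; the real flow would backtrack here")
        sched = result[1]
    return sched

opcode = {"ld1": "LD", "add1": "+", "ld2": "LD", "add2": "+"}
edges = [("ld1", "add1"), ("ld2", "add2")]
groups = [["ld1", "ld2"], ["add1", "add2"]]
sched = partitioned_schedule(groups, opcode, edges, n_fus=3, ii=2, horizon=4)
print(sched, fu_cost(sched, opcode))
```

Each call searches only the placements of one small group, so the work grows with the group size rather than with the whole loop.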
Operation Partitioning
• Traditional partitioning: minimize edge cuts
– Does not necessarily lead to good cost
• Goal: maximize hardware sharing opportunities within a group
[Figure: dataflow graph containing +, LD, <<, and * operations, partitioned into groups]
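One simple way to favor sharing within a group is to put operations of the same type together, since they can reuse the same FU. The sketch below shows that heuristic only as an assumption for illustration; it is not the partitioner used in the talk, and the function name and `max_size` parameter are hypothetical.

```python
from collections import defaultdict

def partition_by_opcode(opcode, max_size):
    """Group ops of the same type together, splitting large groups at max_size."""
    by_kind = defaultdict(list)
    for op, kind in opcode.items():
        by_kind[kind].append(op)
    groups = []
    for kind in sorted(by_kind):          # deterministic group order
        members = by_kind[kind]
        groups += [members[i:i + max_size]
                   for i in range(0, len(members), max_size)]
    return groups

opcode = {"a": "+", "b": "LD", "c": "+", "d": "LD", "e": "<<"}
groups = partition_by_opcode(opcode, max_size=2)
print(groups)
```

A heuristic like this ignores dependence edges entirely, which is the opposite trade-off from minimizing edge cuts.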
Results – Partitioned Scheduling
• 8% average cost savings
• With large number of partitions, similar to greedy
[Figure: normalized gate cost (FU, storage, MUX) for naïve, greedy, partitioned, and optimal scheduling on sobel, fir, dequant, dcac, viterbi, sharp, sha, and the average]
Partition Size for Sharp
• Improve cost by considering more ops at a time
[Figure: cost in gates (axis 0 to 30,000) versus partition size (3, 6, 9, ..., 30, full) for sharp]
Time-Space Decomposition
[Figure: ops 1-5 scheduled two ways onto FU1-FU3. Time, space: ops are first assigned to time slots (time 0, time 1), then to FUs. Space, time: ops are first assigned to FUs (FU 1, FU 2, FU 3), then to time slots.]
• Reduce scheduling complexity
• View all operations together
• Optimize for register depth during time assignment, register width and FU cost during space assignment
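The "time, space" ordering could be sketched as two separate passes, as below. This is only an illustration under assumptions: the time pass uses as-soon-as-possible placement as a stand-in for minimizing register depth, the space pass exhaustively picks FUs to minimize a hypothetical FU cost (one charge per opcode per FU), register width is not modeled, and latency is one cycle.

```python
from itertools import product
from collections import Counter

OP_COST = {"+": 100, "LD": 300}   # hypothetical gate counts

def assign_times(ops, edges, n_fus, ii):
    """Phase 1: ASAP times, keeping at most n_fus ops in each modulo slot."""
    if len(ops) > n_fus * ii:
        raise ValueError("not enough (FU, slot) pairs for this II")
    time, load = {}, Counter()
    for op in ops:                     # ops assumed listed in dependence order
        t = max([time[s] + 1 for s, d in edges if d == op], default=0)
        while load[t % ii] >= n_fus:   # slot already full, delay the op
            t += 1
        time[op] = t
        load[t % ii] += 1
    return time

def assign_fus(time, opcode, n_fus, ii):
    """Phase 2: with times fixed, choose FUs to minimize FU cost."""
    ops, best = list(time), None
    for fus in product(range(n_fus), repeat=len(ops)):
        slots = {(fu, time[op] % ii) for op, fu in zip(ops, fus)}
        if len(slots) < len(ops):
            continue                   # two ops collide on one FU in one slot
        kinds = {(fu, opcode[op]) for op, fu in zip(ops, fus)}
        cost = sum(OP_COST[kind] for _, kind in kinds)
        if best is None or cost < best[0]:
            best = (cost, dict(zip(ops, fus)))
    return best

opcode = {"ld1": "LD", "add1": "+", "ld2": "LD", "add2": "+"}
edges = [("ld1", "add1"), ("add1", "ld2"), ("ld2", "add2")]
time = assign_times(list(opcode), edges, n_fus=2, ii=4)
cost, fus = assign_fus(time, opcode, n_fus=2, ii=4)
print(time, fus, cost)
```

Each pass sees every operation at once, but searches only one dimension of the placement, which is what keeps the problem small.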
Results – Time-Space Scheduling
• Time, space: 19% average cost savings
• Space, time: 20% average cost savings
[Figure: normalized gate cost (FU, storage, MUX) for naïve, greedy, partitioned, time-space (ts), space-time (st), and optimal scheduling on sobel, fir, dequant, dcac, viterbi, sharp, sha, and the average]
Real Cost Savings
Viterbi, naïve scheduler, 0.66 mm²
Viterbi, space-time decomposed scheduler, 0.37 mm²
43.2% overall area savings
Conclusion
• Automated C-to-loop-accelerator synthesis system
• Modulo scheduler must be cost aware
• Decomposition methods make problem tractable
– 20% average cost savings with space-time decomposition
– Importance of global view of all operations
• Individual savings up to 43%
• Compile times of 1 minute to 30 minutes
Questions?
• For more information: http://cccp.eecs.umich.edu