Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

University of Michigan, Electrical Engineering and Computer Science

Kevin Fan, Manjunath Kudlur, Hyunchul Park, Scott Mahlke

Advanced Computer Architecture Laboratory, University of Michigan

Introduction

• Emerging applications have high performance, cost, and energy demands
– H.264, wireless, software radio, signal processing
– 10-100 Gops required
– 200 mW power budget

• Applications dominated by tight loops processing large amounts of streaming data

[Figure: example device with labels 3.5G (HSDPA), WiMax, Stereo Headset, TV out, PC / Mac, Memory, 20 GB HD] [ARM 2005]

Loop Accelerators

• Order-of-magnitude performance and efficiency wins
– Viterbi: 100x speedup vs. ARM9

Automated C-to-gates solution

• Correct by construction

• Close designer productivity gap

• Achieve short time-to-market

Loop Accelerator Template

• Parameterized execution resources, storage, connectivity

• Hardware realization of modulo scheduled loop

Loop Accelerator Design Flow

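[Figure: design flow from C code and a performance (throughput) target to a loop accelerator. Stages: FU allocation, modulo schedule, build datapath, instantiate architecture, synthesize. Intermediate artifacts: abstract architecture, scheduled ops, concrete architecture, Verilog and control signals.]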

Modulo Scheduling and Datapath Derivation

• Schedule to abstract architecture (FUs)
• Determine register and interconnect requirements from schedule

[Figure: source code (r1 = Mem[r2]; r3 = r1 + 12), its schedule on FU1 and FU2 (LOAD at time 1, ..., time 4), and the resulting datapath]
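A minimal sketch of how register and interconnect requirements could be read off a schedule, using the two-operation example from the slide. The function name, the 2-cycle load latency, and the placement of the add on FU2 at time 4 are assumptions made for illustration; the slide only shows the LOAD at time 1 and a row labeled time 4.

```python
def datapath_requirements(schedule, deps, latency):
    """schedule: op -> (fu, time); deps: (producer, consumer) pairs;
    latency: op -> cycles until its result is available (assumed values)."""
    depth = {}     # per-FU register depth = longest lifetime of a value it produces
    wires = set()  # (source FU, destination FU) connections the datapath needs
    for producer, consumer in deps:
        src_fu, t_issue = schedule[producer]
        dst_fu, t_use = schedule[consumer]
        lifetime = t_use - (t_issue + latency[producer])
        if lifetime < 0:
            raise ValueError(f"{consumer} reads {producer} before it is ready")
        depth[src_fu] = max(depth.get(src_fu, 0), lifetime)
        wires.add((src_fu, dst_fu))
    return depth, wires


# r1 = Mem[r2] on FU1 at time 1; r3 = r1 + 12 on FU2 at time 4 (assumed placement)
schedule = {"load": ("FU1", 1), "add": ("FU2", 4)}
depth, wires = datapath_requirements(schedule, [("load", "add")], {"load": 2, "add": 1})
print(depth, wires)
```

Under these assumptions the loaded value waits one cycle at the output of FU1, so FU1 needs one register stage and one connection to FU2.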

Cost Sensitive Scheduling

• Different scheduling alternatives not equal

[Figure: two alternative schedules of the same operations (including a load, LD) on FU1, FU2, and FU3 over time]

• Traditional scheduling is hardware unaware
• Intelligent scheduling needed to reduce hardware cost

Scheduling to Reduce Cost

• Hardware cost is a function of the final schedule
• Increased hardware sharing = reduced cost
• Reusing hardware is “free”

• Traditional metrics (register pressure) not sufficient

[Figure annotation: no additional cost for longer lifetime]

Initial Approach: Greedy

• Standard iterative modulo scheduler, augmented with hardware cost model

• Choose alternative which increases cost the least

while unscheduled ops remain {
    get valid alternatives for op
    for each alternative {
        get hardware cost
    }
    schedule op using min-cost alternative
    update hardware cost model
}

Hardware cost = FU cost + Storage cost + Wire cost

[Figure: FUs labeled +, -, *, <<]
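The greedy loop above can be sketched in runnable form as follows. This is an illustrative toy, not the authors' scheduler: it places operations once, in topological order, without the eviction step of a full iterative modulo scheduler. The cost model, the unit costs, the unit latency, and the four-operation loop body are all assumptions.

```python
II = 2                                   # initiation interval (assumed)
FUS = ["FU1", "FU2"]                     # abstract architecture (assumed)
OPCODE_COST = {"add": 1.0, "mul": 4.0}   # hypothetical unit costs
REG_COST, WIRE_COST = 0.25, 0.5          # hypothetical unit costs

OPS = {"a": "add", "b": "mul", "c": "add", "d": "mul"}  # op -> opcode, topological order
DEPS = [("a", "b"), ("b", "d"), ("a", "c")]             # producer -> consumer


def hardware_cost(placed):
    """Hardware cost = FU cost + storage cost + wire cost for a partial schedule."""
    fu_ops, depth, wires = {}, {}, set()
    for op, (fu, _) in placed.items():
        fu_ops.setdefault(fu, set()).add(OPS[op])   # each FU pays per opcode it supports
    for src, dst in DEPS:
        if src in placed and dst in placed:
            (src_fu, t_src), (dst_fu, t_dst) = placed[src], placed[dst]
            depth[src_fu] = max(depth.get(src_fu, 0), t_dst - t_src - 1)  # unit latency
            wires.add((src_fu, dst_fu))
    fu_cost = sum(OPCODE_COST[code] for codes in fu_ops.values() for code in codes)
    return fu_cost + REG_COST * sum(depth.values()) + WIRE_COST * len(wires)


def alternatives(op, placed):
    """Valid (fu, time) slots: after all placed producers, in a free modulo slot."""
    earliest = max((placed[s][1] + 1 for s, d in DEPS if d == op and s in placed), default=0)
    busy = {(fu, t % II) for fu, t in placed.values()}
    return [(fu, t) for t in range(earliest, earliest + II)
            for fu in FUS if (fu, t % II) not in busy]


def greedy_schedule():
    placed = {}
    for op in OPS:
        alts = alternatives(op, placed)
        if not alts:
            raise RuntimeError(f"no free slot for {op}")
        # choose the alternative that increases hardware cost the least
        placed[op] = min(alts, key=lambda alt: hardware_cost({**placed, op: alt}))
    return placed


placed = greedy_schedule()
print(placed, hardware_cost(placed))
```

On this toy input the greedy choice puts the first multiply on the same FU as the first add, so both FUs end up supporting both opcodes. Grouping the adds on one FU and the multiplies on the other would be cheaper under the same cost model, which is the kind of local minimum the results slide refers to.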

Results – Greedy Scheduling

• 5% average cost savings

• Local scope leads to local minima
• Much more cost savings possible

[Figure: hardware cost breakdown (FU, storage, MUX) for sobel, fir, dequant, dcac, viterbi, sharp, sha, and the average]

Optimal Modulo Scheduling

[Figure: loop and its search space; nodes are (FU #, time) pairs such as (1,1), (2,0), (2,1), (3,0), (3,1)]

• Optimal modulo scheduling extends [Eichenberger ’97]

Storage cost = $\sum_i \text{width}_i \times \text{depth}_i$

FU cost = $\sum_i \text{cost}(\text{FU}_i)$
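To make the search space of (FU #, time) pairs concrete, here is a minimal sketch that enumerates every assignment for a tiny loop and keeps the cheapest one. Exhaustive enumeration is a stand-in chosen for brevity; it is not the formulation the slide cites. The cost function follows the two sums above, with depth taken as the longest lifetime of a value produced by an FU. The widths, unit costs, unit latency, and example loop are assumptions.

```python
from itertools import product

WIDTH = 32                               # register width in bits (assumed)
REG_BIT_COST = 0.01                      # hypothetical cost per register bit
OPCODE_COST = {"add": 1.0, "mul": 4.0}   # hypothetical FU costs

OPS = {"a": "add", "b": "mul", "c": "add", "d": "mul"}
DEPS = [("a", "b"), ("b", "d"), ("a", "c")]


def cost(placed):
    """FU cost (sum over FUs) plus storage cost (sum of width * depth)."""
    fu_ops, depth = {}, {}
    for op, (fu, _) in placed.items():
        fu_ops.setdefault(fu, set()).add(OPS[op])
    for src, dst in DEPS:
        fu, t_src = placed[src]
        depth[fu] = max(depth.get(fu, 0), placed[dst][1] - t_src - 1)  # unit latency
    fu_cost = sum(OPCODE_COST[code] for codes in fu_ops.values() for code in codes)
    storage_cost = sum(WIDTH * d for d in depth.values())
    return fu_cost + REG_BIT_COST * storage_cost


def optimal_schedule(fus, ii, stages):
    slots = [(fu, t) for fu in fus for t in range(ii * stages)]
    best, best_cost = None, float("inf")
    for combo in product(slots, repeat=len(OPS)):
        placed = dict(zip(OPS, combo))
        # modulo resource constraint: one op per (FU, time mod II)
        if len({(fu, t % ii) for fu, t in combo}) < len(OPS):
            continue
        # dependence constraint: consumer strictly after producer
        if any(placed[d][1] <= placed[s][1] for s, d in DEPS):
            continue
        c = cost(placed)
        if c < best_cost:
            best, best_cost = placed, c
    return best, best_cost


best, best_cost = optimal_schedule(["FU1", "FU2"], ii=2, stages=2)
print(best, best_cost)
```

With 8 slots and 4 operations this visits 8^4 = 4096 assignments, which is fine for a toy but grows exponentially with the number of operations.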

Results – Optimal Scheduling

• 27% average cost savings

[Figure: hardware cost breakdown by FU, storage, and MUX]

Problem Decomposition

• Exact solutions are not practical
– (#FU × II × stages) ^ #ops possible schedules
– 20 lines of C code can take 100 hours
– Excessive runtimes even for modest-size loops

• Decompose into more manageable sub-problems
– Partitioned scheduling
– Time-space decomposition
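A quick way to see why exact search does not scale is to evaluate the schedule count from the slide, read here as (#FU × II × stages) ^ #ops; the multiplication signs are an assumption about symbols lost in extraction. The parameter values in the example are hypothetical.

```python
def schedule_count(num_fus: int, ii: int, stages: int, num_ops: int) -> int:
    """Number of (FU, time) assignments when each op may take any slot."""
    slots_per_op = num_fus * ii * stages
    return slots_per_op ** num_ops


# Tiny loop: 2 FUs, II = 2, 2 stages, 4 ops
print(schedule_count(2, 2, 2, 4))

# Modest loop (hypothetical sizes): 4 FUs, II = 4, 3 stages, 20 ops
print(f"{schedule_count(4, 4, 3, 20):.2e}")
```

The tiny loop has 4096 candidate schedules; the modest one has on the order of 10^33, before any pruning by dependence or resource constraints.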

Partitioned Scheduling

• Partition the operations into small groups
• Schedule groups of operations sequentially
– Account for hardware contribution of previously scheduled groups
– Backtrack if infeasible state reached

[Figure: groups of operations passed one after another to the optimal modulo scheduler]
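The sequential, backtracking structure of partitioned scheduling can be sketched as below. Each group is scheduled exhaustively on top of the groups already placed, candidates are tried cheapest first, and the search backs up when a later group has no feasible placement. The per-group exhaustive search, the toy wire-count cost, and all names are assumptions for illustration.

```python
from itertools import product


def schedule_group(group, placed, deps, fus, ii, horizon, cost):
    """Yield feasible placements of `group` on top of `placed`, cheapest first."""
    slots = [(fu, t) for fu in fus for t in range(horizon)]
    candidates = []
    for combo in product(slots, repeat=len(group)):
        trial = {**placed, **dict(zip(group, combo))}
        # modulo resource constraint, including previously scheduled groups
        if len({(fu, t % ii) for fu, t in trial.values()}) < len(trial):
            continue
        # dependences between ops placed so far (unit latency assumed)
        if any(s in trial and d in trial and trial[d][1] <= trial[s][1] for s, d in deps):
            continue
        candidates.append((cost(trial), trial))
    candidates.sort(key=lambda pair: pair[0])
    for _, trial in candidates:
        yield trial


def partitioned_schedule(groups, deps, fus, ii, horizon, cost, placed=None):
    placed = placed or {}
    if not groups:
        return placed
    for trial in schedule_group(groups[0], placed, deps, fus, ii, horizon, cost):
        result = partitioned_schedule(groups[1:], deps, fus, ii, horizon, cost, trial)
        if result is not None:
            return result
    return None  # infeasible here: the caller backtracks to its next candidate


DEPS = [("a", "b"), ("b", "d"), ("a", "c")]


def wire_count(placed):
    """Toy cost: number of distinct FU-to-FU connections among placed ops."""
    return len({(placed[s][0], placed[d][0]) for s, d in DEPS if s in placed and d in placed})


result = partitioned_schedule([["a", "b"], ["c", "d"]], DEPS, ["FU1", "FU2"],
                              ii=2, horizon=4, cost=wire_count)
print(result)
```

Because each group only sees the cost of what has been placed so far, small groups behave much like the greedy scheduler, while a single group containing every operation degenerates to the exhaustive search.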

Operation Partitioning

• Traditional partitioning: minimize edge cuts
– Does not necessarily lead to good cost

• Goal: maximize hardware sharing opportunities within a group

Results – Partitioned Scheduling

• 8% average cost savings
• With large number of partitions, similar to greedy

[Figure: hardware cost breakdown by FU, storage, and MUX]

Partition Size for Sharp

• Improve cost by considering more ops at a time

[Figure: results for sharp at partition sizes 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, and full; x-axis: partition size]

Time-Space Decomposition

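[Figure: two decomposition orders over FU1, FU2, and FU3. "Time, space" assigns operations to time slots first and then to FUs; "Space, time" assigns FUs first and then time slots]

• Reduce scheduling complexity
• View all operations together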

• Optimize for register depth during time assignment, register width and FU cost during space assignment
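A minimal sketch of the "time, space" order: the first phase picks issue times that keep value lifetimes (and so register depth) short, and the second phase assigns FUs to minimize FU cost with the times held fixed. Both phases use exhaustive search for brevity. Total lifetime as a proxy for register depth, the omission of register width in the space phase, the unit costs, and the example loop are assumptions.

```python
from collections import Counter
from itertools import product

OPS = {"a": "add", "b": "mul", "c": "add", "d": "mul"}
DEPS = [("a", "b"), ("b", "d"), ("a", "c")]
OPCODE_COST = {"add": 1.0, "mul": 4.0}   # hypothetical FU costs


def assign_times(num_fus, ii, horizon):
    """Phase 1: choose times only, minimizing total lifetime (unit latency)."""
    best, best_life = None, None
    for times in product(range(horizon), repeat=len(OPS)):
        t = dict(zip(OPS, times))
        if any(t[d] <= t[s] for s, d in DEPS):
            continue
        # no modulo slot may hold more ops than there are FUs
        if max(Counter(x % ii for x in times).values()) > num_fus:
            continue
        life = sum(t[d] - t[s] - 1 for s, d in DEPS)
        if best is None or life < best_life:
            best, best_life = t, life
    return best


def assign_fus(times, fus, ii):
    """Phase 2: choose FUs for fixed times, minimizing FU cost."""
    ops = list(times)
    best, best_cost = None, None
    for combo in product(fus, repeat=len(ops)):
        # two ops in the same modulo slot cannot share an FU
        if len({(fu, times[o] % ii) for o, fu in zip(ops, combo)}) < len(ops):
            continue
        fu_ops = {}
        for o, fu in zip(ops, combo):
            fu_ops.setdefault(fu, set()).add(OPS[o])
        c = sum(OPCODE_COST[code] for codes in fu_ops.values() for code in codes)
        if best is None or c < best_cost:
            best, best_cost = dict(zip(ops, combo)), c
    return best, best_cost


times = assign_times(num_fus=2, ii=2, horizon=4)
fu_of, fu_cost = assign_fus(times, ["FU1", "FU2"], ii=2)
print(times, fu_of, fu_cost)
```

Each phase searches over times or FUs alone rather than their product, which is where the complexity reduction comes from, while every operation is still visible to both phases.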


Results – Time-Space Scheduling

• Time, space: 19% average cost savings
• Space, time: 20% average cost savings

[Figure: hardware cost breakdown by FU, storage, and MUX]

Real Cost Savings

Viterbi, naïve scheduler, 0.66 mm²

Viterbi, space-time decomposed scheduler, 0.37 mm²

43.2% overall area savings

Conclusion

• Automated C-to-loop-accelerator synthesis system
• Modulo scheduler must be cost aware
• Decomposition methods make problem tractable
– 20% average cost savings with space-time decomposition
– Importance of global view of all operations
• Individual savings up to 43%
• Compile times of 1 to 30 minutes

Questions?

• For more information: http://cccp.eecs.umich.edu
