Transcript
Page 1: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

University of Michigan, Electrical Engineering and Computer Science

Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Kevin Fan, Manjunath Kudlur, Hyunchul Park, Scott Mahlke

Advanced Computer Architecture Laboratory, University of Michigan

Page 2: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Introduction

• Emerging applications have high performance, cost, and energy demands
– H.264, wireless, software radio, signal processing
– 10-100 Gops required
– 200 mW power budget

• Applications dominated by tight loops processing large amounts of streaming data

[Figure: mobile device with 3.5G (HSDPA), WiMax, stereo headset, TV out, PC / Mac, memory card, and 20 GB HD connections. Source: ARM 2005]

Page 3: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Loop Accelerators

• Order-of-magnitude performance and efficiency wins
– Viterbi: 100x speedup vs. ARM9

• Automated C-to-gates solution

• Correct by construction

• Close designer productivity gap

• Achieve short time-to-market

Page 4: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Loop Accelerator Template

• Parameterized execution resources, storage, connectivity

• Hardware realization of modulo scheduled loop

Page 5: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Loop Accelerator Design Flow

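[Figure: five-step design flow]
1. FU Alloc: C code (.c) and performance (throughput) target in, abstract architecture out
2. Modulo Schedule: ops (Op1, Op2, Op3, …) placed on FUs over time, scheduled ops out
3. Build Datapath: register files and FUs connected, concrete architecture out
4. Instantiate Arch: Verilog (.v) and control signals out
5. Synthesize: loop accelerator out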

Page 6: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Modulo Scheduling and Datapath Derivation

• Schedule to abstract architecture (FUs)
• Determine register and interconnect requirements from schedule

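[Figure: source code, schedule, and datapath. Source code: r1 = Mem[r2]; r3 = r1 + 12. Schedule: LOAD at time 1 and ADD at time 4 on FU1 and FU2. Datapath: a MEM unit feeding an adder (+) with constant input 12.]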

Page 7: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Cost Sensitive Scheduling

• Different scheduling alternatives not equal

[Figure: two alternative schedules of the ops +1, +2, LD1, and LD2 on FU1, FU2, and FU3 over times 0 to 2, each shown with the FU1 to FU3 datapath it implies]

• Traditional scheduling is hardware unaware
• Intelligent scheduling needed to reduce hardware cost

Page 8: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Scheduling to Reduce Cost

• Hardware cost is a function of the final schedule
• Increased hardware sharing = reduced cost
• Reusing hardware is “free”
• Traditional metrics (register pressure) not sufficient

[Figure: FUs with numbered elements 1 to 4. Annotation: no additional cost for longer lifetime]

Page 9: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Initial Approach: Greedy

• Standard iterative modulo scheduler, augmented with hardware cost model

• Choose alternative which increases cost the least

while unscheduled ops remain {
    get valid alternatives for op
    for each alternative {
        get hardware cost
    }
    schedule op using min-cost alternative
    update hardware cost model
}

Hardware cost = FU cost + Storage cost + Wire cost
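The greedy loop above can be sketched in Python. This is a minimal illustration, not the authors' scheduler: the gate costs in `OP_COST` are made up, the cost model counts FU cost only (the storage and wire terms are omitted), and dependences between operations are ignored. An alternative here is an (FU, slot) pair, where the slot is the issue time modulo II.

```python
# Hypothetical per-opcode gate costs; the slides give no numbers.
OP_COST = {"add": 100, "shl": 60, "mul": 900, "ld": 300}


def hardware_cost(assignment, num_fus):
    """Toy cost: each FU pays once for every distinct opcode it executes.

    Storage and wire costs from the slide's formula are left out.
    """
    per_fu = [set() for _ in range(num_fus)]
    for opcode, (fu, _slot) in assignment:
        per_fu[fu].add(opcode)
    return sum(OP_COST[o] for ops in per_fu for o in ops)


def greedy_schedule(ops, num_fus, ii):
    """Place ops one at a time, always taking the cheapest free (FU, slot)."""
    assignment = []  # list of (opcode, (fu, slot))
    used = set()     # occupied (fu, slot) pairs in the modulo table
    for opcode in ops:
        alternatives = [(f, s) for f in range(num_fus) for s in range(ii)
                        if (f, s) not in used]
        if not alternatives:
            raise ValueError("no free slot; II or FU count too small")
        # Evaluate the hardware cost of each alternative, keep the minimum.
        best = min(alternatives,
                   key=lambda alt: hardware_cost(assignment + [(opcode, alt)],
                                                 num_fus))
        assignment.append((opcode, best))
        used.add(best)
    return assignment


if __name__ == "__main__":
    for ops in (["add", "add", "ld", "ld"], ["add", "ld", "add", "ld"]):
        result = greedy_schedule(ops, num_fus=2, ii=2)
        print(ops, hardware_cost(result, 2))
```

On the second input, a cost tie on the first load places it on the adder FU, and the sketch ends at twice the cost of the first input. That is the kind of local minimum a purely local choice can fall into.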

Page 10: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Results – Greedy Scheduling

• 5% average cost savings

• Local scope leads to local minima
• Much more cost savings possible

[Figure: normalized gate cost (0 to 1.2), broken down into FU, storage, and MUX, for naïve vs. greedy scheduling on sobel, fir, dequant, dcac, viterbi, sharp, sha, and the average]

Page 11: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Optimal Modulo Scheduling

[Figure: an example loop with ops +1, +2, LD3, +4, and -5, and its search space. Each op (Op1, Op2, Op3, …) selects one (FU #, time) pair from (1,0), (1,1), (2,0), (2,1), (3,0), (3,1).]

• Optimal modulo scheduling extends [Eichenberger ’97]

$\text{Storage cost} = \sum_i \text{width}_i \times \text{depth}_i$

$\text{FU cost} = \sum_i \text{cost}(\text{FU}_i)$
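The two cost terms on this slide are plain sums, and a short Python sketch can make them concrete. The register shapes, FU names, and gate counts below are hypothetical; the slides do not give numeric values or units.

```python
def storage_cost(registers):
    """Sum of width * depth over all storage elements.

    `registers` is a list of (width_bits, depth) pairs (assumed layout).
    """
    return sum(width * depth for width, depth in registers)


def fu_cost(fus, cost_table):
    """Sum of the per-unit cost of every instantiated FU."""
    return sum(cost_table[fu] for fu in fus)


if __name__ == "__main__":
    # Hypothetical design: two storage elements and three FUs.
    registers = [(32, 2), (16, 3)]
    cost_table = {"adder": 100, "multiplier": 900, "memory_port": 300}
    fus = ["adder", "adder", "memory_port"]
    print(storage_cost(registers), fu_cost(fus, cost_table))
```

Under this model, lengthening a value's lifetime raises the depth term of the storage element that holds it, while sharing an FU between operations removes a whole term from the FU sum.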

Page 12: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Results – Optimal Scheduling

• 27% average cost savings

[Figure: normalized gate cost (0 to 1.2), broken down into FU, storage, and MUX, for naïve, greedy, and optimal (opt) scheduling on sobel, fir, dequant, dcac, viterbi, sharp, sha, and the average]

Page 13: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Problem Decomposition

• Exact solutions are not practical
– (#FU × II × stages) ^ #ops possible schedules
– 20 lines of C code: 100 hours
– Excessive runtimes even for modest-size loops

• Decompose into more manageable sub-problems
– Partitioned scheduling
– Time-space decomposition
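To get a feel for the size of the search space quoted above, the expression can be evaluated directly. The parameter values in the example are hypothetical and chosen only for illustration; they are not taken from the slides' benchmarks.

```python
def schedule_space_size(num_fus, ii, stages, num_ops):
    """Number of candidate schedules: each op independently picks one of
    num_fus * ii * stages (FU, time) positions."""
    return (num_fus * ii * stages) ** num_ops


if __name__ == "__main__":
    # Hypothetical loop: 4 FUs, II = 4, 3 stages, growing op counts.
    for num_ops in (5, 10, 20):
        size = schedule_space_size(4, 4, 3, num_ops)
        print(f"{num_ops} ops -> {size:.3e} candidate schedules")
```

Because the op count is the exponent, splitting the ops into smaller groups shrinks each sub-problem far more than trimming FUs or stages would.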

Page 14: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Partitioned Scheduling

• Partition the operations into small groups
• Schedule groups of operations sequentially
– Account for hardware contribution of previously scheduled groups
– Backtrack if infeasible state reached

[Figure: a dataflow graph of ops 1 to 5 is split into groups (for example ops 1, 3, and 5), and each group is passed in turn to the optimal modulo scheduler]
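A minimal Python sketch of group-at-a-time scheduling follows. It makes several assumptions that are not in the slides: groups are consecutive runs of `group_size` ops, each group is solved by plain enumeration rather than the optimal modulo scheduler, the cost model is FU cost only with made-up gate costs, dependences are ignored, and the backtracking step is left out (the sketch simply raises an error when a group cannot be placed).

```python
from itertools import product

# Hypothetical per-opcode gate costs; the slides give no numbers.
OP_COST = {"add": 100, "shl": 60, "mul": 900, "ld": 300}


def total_cost(placed, num_fus):
    """Toy cost: each FU pays once per distinct opcode it executes."""
    per_fu = [set() for _ in range(num_fus)]
    for opcode, (fu, _slot) in placed:
        per_fu[fu].add(opcode)
    return sum(OP_COST[o] for ops in per_fu for o in ops)


def schedule_group(group, placed, num_fus, ii):
    """Exhaustively place one group, charging for hardware already built."""
    used = {pos for _, pos in placed}
    free = [(f, s) for f in range(num_fus) for s in range(ii)
            if (f, s) not in used]
    best = None
    for choice in product(free, repeat=len(group)):
        if len(set(choice)) < len(group):  # two ops on the same (FU, slot)
            continue
        candidate = placed + list(zip(group, choice))
        cost = total_cost(candidate, num_fus)
        if best is None or cost < best[0]:
            best = (cost, candidate)
    if best is None:
        raise ValueError("infeasible state; a real scheduler would backtrack")
    return best[1]


def partitioned_schedule(ops, group_size, num_fus, ii):
    """Schedule consecutive groups of ops one after another."""
    placed = []
    for start in range(0, len(ops), group_size):
        placed = schedule_group(ops[start:start + group_size],
                                placed, num_fus, ii)
    return placed


if __name__ == "__main__":
    ops = ["add", "ld", "add", "ld"]
    for size in (1, 2, 4):
        print(size, total_cost(partitioned_schedule(ops, size, 2, 2), 2))
```

With `group_size=1` the sketch degenerates to a greedy choice per op, and with `group_size=len(ops)` it enumerates the whole loop at once, so the group size trades runtime against how much sharing the scheduler can see.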

Page 15: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Operation Partitioning

• Traditional partitioning: minimize edge cuts
– Does not necessarily lead to good cost

• Goal: maximize hardware sharing opportunities within a group

[Figure: example dataflow graph of +, LD, <<, and * operations divided into groups]

Page 16: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Results – Partitioned Scheduling

• 8% average cost savings
• With large number of partitions, similar to greedy

[Figure: normalized gate cost (0 to 1.2), broken down into FU, storage, and MUX, for naïve, greedy, partitioned (part), and optimal (opt) scheduling on sobel, fir, dequant, dcac, viterbi, sharp, sha, and the average]

Page 17: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Partition Size for Sharp

• Improve cost by considering more ops at a time

[Figure: cost in gates (0 to 30000) versus partition size (3, 6, 9, 12, 15, 18, 21, 24, 27, 30, full) for sharp]

Page 18: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Time-Space Decomposition

[Figure: a dataflow graph of ops 1 to 5 scheduled in two orders. Time, space: ops are first assigned to time slots (time 0, time 1), then to FU1, FU2, and FU3. Space, time: ops are first assigned to FUs (FU 1, FU 2, FU 3), then to time slots.]

• Reduce scheduling complexity
• View all operations together

• Optimize for register depth during time assignment, register width and FU cost during space assignment
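The "time, space" order can be sketched in Python as two separate phases. This is an illustration under stated assumptions, not the authors' formulation: the time phase below is a round-robin placeholder that does not model register depth, the space phase minimizes a toy FU cost only (register width is not modeled), the gate costs in `OP_COST` are made up, and dependences are ignored.

```python
from itertools import permutations, product

# Hypothetical per-opcode gate costs; the slides give no numbers.
OP_COST = {"add": 100, "shl": 60, "mul": 900, "ld": 300}


def assign_time(ops, num_fus, ii):
    """Phase 1 (time): give every op a modulo slot, at most num_fus per slot.

    Round-robin placeholder; a real time phase would optimize register depth.
    """
    if len(ops) > num_fus * ii:
        raise ValueError("not enough (FU, slot) positions for all ops")
    return [i % ii for i in range(len(ops))]


def assign_space(ops, slots, num_fus, ii):
    """Phase 2 (space): with slots fixed, pick FUs to minimize toy FU cost."""
    by_slot = [[i for i, s in enumerate(slots) if s == t] for t in range(ii)]
    # Ops sharing a slot must use distinct FUs, hence permutations per slot.
    options = [list(permutations(range(num_fus), len(group)))
               for group in by_slot]
    best = None
    for combo in product(*options):
        fu_of = {}
        for group, fus in zip(by_slot, combo):
            fu_of.update(zip(group, fus))
        per_fu = [set() for _ in range(num_fus)]
        for op_index, fu in fu_of.items():
            per_fu[fu].add(ops[op_index])
        cost = sum(OP_COST[o] for opcodes in per_fu for o in opcodes)
        if best is None or cost < best[0]:
            best = (cost, [fu_of[i] for i in range(len(ops))])
    return best  # (cost, FU index per op)


if __name__ == "__main__":
    ops = ["add", "add", "ld", "ld"]
    slots = assign_time(ops, num_fus=2, ii=2)
    print(slots, assign_space(ops, slots, num_fus=2, ii=2))
```

Each phase sees all operations at once but searches only one dimension, which is why the combined search is much smaller than enumerating (FU, time) pairs jointly. The result still depends on the first phase: a time assignment that puts two like ops in the same slot forces them onto different FUs.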

Page 19: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Results – Time-Space Scheduling

• Time, space: 19% average cost savings
• Space, time: 20% average cost savings

[Figure: normalized gate cost (0 to 1.2), broken down into FU, storage, and MUX, for naïve, greedy, partitioned (part), time-space (ts), space-time (st), and optimal (opt) scheduling on sobel, fir, dequant, dcac, viterbi, sharp, sha, and the average]

Page 20: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Real Cost Savings

Viterbi, naïve scheduler, 0.66 mm²

Viterbi, space-time decomposed scheduler, 0.37 mm²

43.2% overall area savings

Page 21: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Conclusion

• Automated C-to-loop-accelerator synthesis system
• Modulo scheduler must be cost aware
• Decomposition methods make problem tractable
– 20% average cost savings with space-time decomposition
– Importance of global view of all operations
• Individual savings up to 43%
• Compile times of 1 to 30 minutes

Page 22: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Questions?

• For more information: http://cccp.eecs.umich.edu