Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System. Kevin Fan, Manjunath Kudlur, Hyunchul Park, Scott Mahlke. Advanced Computer Architecture Laboratory, University of Michigan, Electrical Engineering and Computer Science.

Page 1: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Kevin Fan, Manjunath Kudlur, Hyunchul Park, Scott Mahlke

Advanced Computer Architecture Laboratory, University of Michigan
Electrical Engineering and Computer Science

Page 2: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Introduction

• Emerging applications have high performance, cost, energy demands
  – H.264, wireless, software radio, signal processing
  – 10-100 Gops required
  – 200 mW power budget

• Applications dominated by tight loops processing large amounts of streaming data

[Figure labels: 3.5G (HSDPA), WiMax, stereo headset, TV out, PC / Mac, memory card, 20 GB HD. Source: ARM 2005]

Page 3: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Loop Accelerators

• Order-of-magnitude performance and efficiency wins
  – Viterbi: 100x speedup vs. ARM9

Automated C → gates solution

• Correct by construction

• Close designer productivity gap

• Achieve short time-to-market

Page 4: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Loop Accelerator Template

• Parameterized execution resources, storage, connectivity

• Hardware realization of modulo scheduled loop

Page 5: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Loop Accelerator Design Flow

[Figure: design flow from C code (.c) and a performance (throughput) target to a loop accelerator.
  1. FU Alloc → abstract arch
  2. Modulo Schedule → scheduled ops (Op1, Op2, Op3, … placed on FUs over time)
  3. Build Datapath → concrete arch (FUs, RF)
  4. Instantiate Arch → Verilog (.v), control signals
  5. Synthesize → loop accelerator]

Page 6: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Modulo Scheduling and Datapath Derivation

• Schedule to abstract architecture (FUs)
• Determine register and interconnect requirements from schedule

[Figure: three panels. Source code: r1 = Mem[r2]; r3 = r1 + 12. Schedule: LOAD and ADD placed on FU1 and FU2 between time 1 and time 4. Datapath: a MEM unit feeding an adder (+) with constant input 12.]

Page 7: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Cost Sensitive Scheduling

• Different scheduling alternatives not equal
• Traditional scheduling is hardware unaware
• Intelligent scheduling needed to reduce hardware cost

[Figure: two alternative schedules of the operations LD1, LD2, +1, +2 on FU1, FU2, FU3 over times 0 to 2, each shown with the datapath it produces.]

Page 8: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Scheduling to Reduce Cost

• Hardware cost is function of final schedule
• Increased hardware sharing = reduced cost
• Reusing hardware is “free”
• Traditional metrics (register pressure) not sufficient

[Figure: FU diagrams with elements numbered 1 to 4, annotated "No additional cost for longer lifetime".]

Page 9: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Initial Approach: Greedy

• Standard iterative modulo scheduler, augmented with hardware cost model

• Choose alternative which increases cost the least

    while unscheduled ops remain {
        get valid alternatives for op
        for each alternative {
            get hardware cost
        }
        schedule op using min-cost alternative
        update hardware cost model
    }

Hardware cost = FU cost + Storage cost + Wire cost

[Figure: FU icons for the operations +, -, *, <<]
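The greedy loop on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' scheduler: the gate costs in `OP_COST`, the function names, and the cost model (FU cost only, with each FU paying once per distinct opcode it supports) are all assumptions, and dependences, latencies, storage cost, and wire cost are ignored.

```python
from itertools import product

# Hypothetical gate cost per opcode (illustrative numbers, not from the slides).
OP_COST = {"add": 100, "ld": 300}

def hardware_cost(schedule, ops):
    """Toy stand-in for FU + storage + wire cost: only the FU term is modeled.

    Each FU pays once for every distinct opcode placed on it, so placing
    operations of the same kind on one FU shares hardware.
    """
    per_fu = {}
    for name, (fu, _slot) in schedule.items():
        per_fu.setdefault(fu, set()).add(ops[name])
    return sum(OP_COST[kind] for kinds in per_fu.values() for kind in kinds)

def greedy_schedule(ops, num_fus, ii):
    """Place ops one at a time, always taking the cheapest free (FU, slot)."""
    schedule = {}
    for name in ops:  # while unscheduled ops remain
        taken = set(schedule.values())
        # Valid alternatives: every (FU, modulo slot) pair not yet occupied.
        alternatives = [alt for alt in product(range(num_fus), range(ii))
                        if alt not in taken]
        if not alternatives:
            raise RuntimeError("no free slot; a real scheduler would raise II")
        # Choose the alternative that increases hardware cost the least.
        best = min(alternatives,
                   key=lambda alt: hardware_cost({**schedule, name: alt}, ops))
        schedule[name] = best  # the cost model is recomputed from the schedule
    return schedule

ops = {"ld1": "ld", "add1": "add", "ld2": "ld", "add2": "add"}
result = greedy_schedule(ops, num_fus=3, ii=2)
print(result, hardware_cost(result, ops))
```

On this toy input the greedy choice co-locates the first load and the first add, which leaves no free slot on that FU for the second load. The final cost is 800 under this model, while putting both loads on one FU and both adds on another would cost 400. That is one way a purely local decision can end in a local minimum.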

Page 10: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Results – Greedy Scheduling

• 5% average cost savings
• Local scope → local minima
• Much more cost savings possible

[Figure: normalized gate cost (0 to 1.2), broken down into FU, storage, and MUX, for the naïve and greedy schedulers on sobel, fir, dequant, dcac, viterbi, sharp, sha, and the average.]

Page 11: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

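Optimal Modulo Scheduling

• Optimal modulo scheduling extends [Eichenberger ’97]

Storage cost = Σ_i width_i × depth_i

FU cost = Σ_i cost(FU_i)

[Figure: a loop with operations +1, +2, LD3, +4, -5 and its search space, a tree over Op1, Op2, Op3 whose branches are (FU #, time) choices such as (1,0), (1,1), (2,0), (2,1), (3,0), (3,1).]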
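The two cost terms on this slide can be evaluated as below, assuming they are sums over the registers and FUs of the datapath (the summation signs appear to have been lost in extraction, so this reading is an assumption). The `Register` type, the function names, and the cost table are hypothetical and only illustrate the arithmetic.

```python
from dataclasses import dataclass

@dataclass
class Register:
    width: int  # bits per entry
    depth: int  # number of entries

def storage_cost(registers):
    """Sum of width_i * depth_i over all registers (assumed reading)."""
    return sum(r.width * r.depth for r in registers)

def fu_cost(fus, cost_table):
    """Sum of cost(FU_i) over all instantiated FUs (assumed reading)."""
    return sum(cost_table[fu] for fu in fus)

# Hypothetical datapath: two registers and two FUs.
registers = [Register(width=32, depth=2), Register(width=16, depth=3)]
cost_table = {"add": 100, "mul": 900}  # illustrative gate counts

print(storage_cost(registers))             # 32*2 + 16*3
print(fu_cost(["add", "mul"], cost_table))
```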

Page 12: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Results – Optimal Scheduling

• 27% average cost savings

[Figure: normalized gate cost (0 to 1.2), broken down into FU, storage, and MUX, for the naïve, greedy, and opt schedulers on sobel, fir, dequant, dcac, viterbi, sharp, sha, and the average.]

Page 13: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Problem Decomposition

• Exact solutions are not practical
  – (#FU × II × stages) ^ #ops possible schedules
  – 20 lines of C code → 100 hours
  – Excessive runtimes even for modest-size loops

• Decompose into more manageable sub-problems
  – Partitioned scheduling
  – Time-space decomposition
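To get a feel for the search-space bound on this slide, the count can be computed directly. The sketch below reads the expression as a product of the number of FUs, the initiation interval (II), and the number of stages, raised to the number of operations; that reading, the function name, and the example parameters are assumptions.

```python
def num_candidate_schedules(num_fus: int, ii: int, stages: int, num_ops: int) -> int:
    """Each op independently picks an FU, a slot within II, and a stage."""
    return (num_fus * ii * stages) ** num_ops

# Hypothetical machine: 4 FUs, II = 2, 3 stages.
for n in (5, 10, 20):
    count = num_candidate_schedules(num_fus=4, ii=2, stages=3, num_ops=n)
    print(f"{n} ops: {count:.2e} candidate schedules")
```

The count grows exponentially in the number of operations, which is consistent with exhaustive search becoming impractical even for small loops.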

Page 14: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Partitioned Scheduling

• Partition the operations into small groups
• Schedule groups of operations sequentially
  – Account for hardware contribution of previously scheduled groups
  – Backtrack if infeasible state reached

[Figure: a five-operation graph (ops 1 to 5) is split into groups, and each group is passed in turn to the optimal modulo scheduler.]

Page 15: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Operation Partitioning

• Traditional partitioning: minimize edge cuts
  – Does not necessarily lead to good cost

• Goal: maximize hardware sharing opportunities within a group

[Figure: dataflow graph of +, LD, <<, and * operations illustrating operation grouping.]
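Partitioned scheduling and the effect of the partition choice can be illustrated with a small Python sketch. It is a toy under stated assumptions, not the authors' implementation: an exhaustive search stands in for the optimal modulo scheduler, the cost model counts only FU cost with hypothetical gate counts in `OP_COST`, dependences are ignored, and an infeasible group raises an error instead of backtracking.

```python
from itertools import permutations, product

OP_COST = {"add": 100, "ld": 300}  # hypothetical gate cost per opcode

def cost(schedule, ops):
    """Toy FU cost: each FU pays once per distinct opcode placed on it."""
    per_fu = {}
    for name, (fu, _slot) in schedule.items():
        per_fu.setdefault(fu, set()).add(ops[name])
    return sum(OP_COST[kind] for kinds in per_fu.values() for kind in kinds)

def schedule_group(group, fixed, ops, num_fus, ii):
    """Exhaustively place one group on top of the already fixed placements."""
    taken = set(fixed.values())
    free = [s for s in product(range(num_fus), range(ii)) if s not in taken]
    best, best_cost = None, None
    for slots in permutations(free, len(group)):
        # Hardware from earlier groups is part of every trial's cost.
        trial = {**fixed, **dict(zip(group, slots))}
        trial_cost = cost(trial, ops)
        if best_cost is None or trial_cost < best_cost:
            best, best_cost = trial, trial_cost
    return best  # None when the group does not fit

def partitioned_schedule(groups, ops, num_fus, ii):
    """Schedule the groups one after another, keeping earlier decisions."""
    schedule = {}
    for group in groups:
        schedule = schedule_group(group, schedule, ops, num_fus, ii)
        if schedule is None:
            raise RuntimeError("infeasible state; a full version would backtrack")
    return schedule

ops = {"ld1": "ld", "add1": "add", "ld2": "ld", "add2": "add"}
by_kind = partitioned_schedule([["ld1", "ld2"], ["add1", "add2"]], ops, 3, 2)
mixed = partitioned_schedule([["ld1", "add1"], ["ld2", "add2"]], ops, 3, 2)
print(cost(by_kind, ops), cost(mixed, ops))
```

With this toy model, grouping operations that can share an FU (loads together, adds together) gives a lower final cost than grouping one load with one add, which mirrors the goal of maximizing hardware sharing opportunities within a group.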

Page 16: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Results – Partitioned Scheduling

• 8% average cost savings
• With large number of partitions, similar to greedy

[Figure: normalized gate cost (0 to 1.2), broken down into FU, storage, and MUX, for the naïve, greedy, part, and opt schedulers on sobel, fir, dequant, dcac, viterbi, sharp, sha, and the average.]

Page 17: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Partition Size for Sharp

• Improve cost by considering more ops at a time

[Figure: cost in gates (0 to 30,000) versus partition size (3 to 30 in steps of 3, then full) for sharp.]

Page 18: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Time-Space Decomposition

• Reduce scheduling complexity
• View all operations together
• Optimize for register depth during time assignment, register width and FU cost during space assignment

[Figure: a five-operation graph (ops 1 to 5) scheduled onto FU1 to FU3 over times 0 and 1 in two orders. Time, space: operations are first assigned to time slots (time 0, time 1) and then to FUs. Space, time: operations are first assigned to FUs (FU 1, FU 2, FU 3) and then to time slots.]

Page 19: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Results – Time-Space Scheduling

• Time, space: 19% average cost savings
• Space, time: 20% average cost savings

[Figure: normalized gate cost (0 to 1.2), broken down into FU, storage, and MUX, for the naïve, greedy, part, ts, st, and opt schedulers on sobel, fir, dequant, dcac, viterbi, sharp, sha, and the average.]

Page 20: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Real Cost Savings

Viterbi, naïve scheduler, 0.66 mm²

Viterbi, space-time decomposed scheduler, 0.37 mm²

43.2% overall area savings
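As a quick check of the arithmetic, relative area savings can be computed from the two areas on this slide. The function name is hypothetical. The two-decimal areas give roughly 43.9%, close to the 43.2% stated; the small gap may come from the areas being rounded for display, which is an assumption.

```python
def area_savings(baseline_mm2: float, optimized_mm2: float) -> float:
    """Fraction of baseline area saved by the optimized design."""
    return (baseline_mm2 - optimized_mm2) / baseline_mm2

savings = area_savings(0.66, 0.37)
print(f"{savings:.1%}")  # prints 43.9%
```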

Page 21: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Conclusion

• Automated C → loop accelerator synthesis system
• Modulo scheduler must be cost aware
• Decomposition methods make problem tractable
  – 20% average cost savings with space-time decomposition
  – Importance of global view of all operations
• Individual savings up to 43%
• Compile times of 1 minute – 30 minutes

Page 22: Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System

Questions?

• For more information: http://cccp.eecs.umich.edu