57
UCIFF: Unified Clustering, Instruction scheduling and Fast Frequency selection for Heterogeneous Clustered VLIW Vasileios Porpodas and Marcelo Cintra University of Edinburgh LCPC 2012 slide 1 of 38 www.inf.ed.ac.uk

UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

UCIFF: Unified Clustering,

Instruction scheduling and

Fast Frequency selection for

Heterogeneous Clustered

VLIW

Vasileios Porpodas and Marcelo Cintra

University of Edinburgh

LCPC 2012

slide 1 of 38 www.inf.ed.ac.uk

Page 2: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

Scheduling, Scalability and Energy

• Energy becomes a major design constraint• Dynamic Instruction scheduling in hardware consumes a

large part of the energy budget• Statically scheduled processors are an energy-efficient

alternative to dynamically scheduled processors• VLIW processors are high-performance statically

scheduled processors

slide 2 of 38 www.inf.ed.ac.uk

Page 3: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

Scheduling, Scalability and Energy

• Energy becomes a major design constraint• Dynamic Instruction scheduling in hardware consumes a

large part of the energy budget• Statically scheduled processors are an energy-efficient

alternative to dynamically scheduled processors• VLIW processors are high-performance statically

scheduled processors

• Resource scalability is necessary for both energyand performance• Clustered VLIW processors operate at an attractive

energy-performance point• No global buses, only point-to-point communication• Instruction scheduling done by the compiler

slide 2 of 38 www.inf.ed.ac.uk

Page 4: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

Clustered VLIW

• Statically scheduled

• Scalable

• Energy efficient

• Inter-Cluster delay0

ClusterCluster

Cluster Cluster

3

1

2

slide 3 of 38 www.inf.ed.ac.uk

Page 5: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

Clustered VLIW

• Statically scheduled

• Scalable

• Energy efficient

• Inter-Cluster delay

• Relies on compiler

• Explicit ILP

0

ClusterCluster

Cluster Cluster

3

1

2

Instr0 Instr1 Instr2 Instr3VLIW

slide 3 of 38 www.inf.ed.ac.uk

Page 6: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

Cluster Utilization

• Few of the clusters arefully utilized

Free

10 2 3

Busy

Clusters

Ins

tr.

slide 4 of 38 www.inf.ed.ac.uk

Page 7: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

Cluster Utilization

• Few of the clusters arefully utilized

• Slack in schedule

• Opportunity to saveenergy withoutperformance impact

Free

10 2 3

Busy

Clusters

Ins

tr.

slide 4 of 38 www.inf.ed.ac.uk

Page 8: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

Heterogeneous Clustered VLIW

• Each cluster operates atits own frequency

0

ClusterCluster

Cluster Cluster

3

1

2Freq

Freq Freq

Freq

slide 5 of 38 www.inf.ed.ac.uk

Page 9: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

Heterogeneous Clustered VLIW

• Each cluster operates atits own frequency

• Exploit clusterunder-utilization

• Save energy by slowingdown under-utilizedclusters

0

ClusterCluster

Cluster Cluster

3

1

2Freq

Freq Freq

Freq

slide 5 of 38 www.inf.ed.ac.uk

Page 10: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

The Problem of Energy Efficiency on

Clustered VLIW

• Cluster utilization varies• Some clusters are fully utilized• Others are running idle

• Per-cluster DVFS required for energy efficiency• Hardware DVFS not applicable (breaks semantics)• Effective compile-time DVFS required

slide 6 of 38 www.inf.ed.ac.uk

Page 11: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

Outline

Introduction

Problem Definition and Existing Solutions

UCIFF

Experimental Setup and Results

Conclusion

slide 7 of 38 www.inf.ed.ac.uk

Page 12: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

Problem Definition

• How to determine“best” frequency foreach cluster

• ”Best” freq. is the onethat leads to bestDelay/Energy/ED/ED2

• Determine frequenciesduring InstructionScheduling

0

ClusterCluster

Cluster Cluster

3

1

2Freq

FreqFreq

Freq

slide 8 of 38 www.inf.ed.ac.uk

Page 13: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

Existing Solution [CGO’07] (Decoupled)

{f0 0, f }

{f0 , f }1

{f0 , f }2

{f 0, f }1

{f , f }1 1

{f , f }1 2

{f 0, f }2

{f , f }2 1

{f , f }2 2Fre

quen

y C

onfi

gura

tion

s

slide 9 of 38 www.inf.ed.ac.uk

Page 14: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

Existing Solution [CGO’07] (Decoupled)

{f0 0, f }

{f0 , f }1

{f0 , f }2

{f 0, f }1

{f , f }1 1

{f , f }1 2

{f 0, f }2

{f , f }2 1

{f , f }2 2Fre

quen

y C

onfi

gura

tion

s

1.Freq. Configuration

EstimateBest

Configuration

(Fx,Fy)

slide 9 of 38 www.inf.ed.ac.uk

Page 15: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

Existing Solution [CGO’07] (Decoupled)

1.Freq. Configuration

EstimateBest

Configuration

(Fx,Fy)

{f0 0, f }

{f0 , f }1

{f0 , f }2

{f 0, f }1

{f , f }1 1

{f , f }1 2

{f 0, f }2

{f , f }2 1

{f , f }2 2Fre

quen

y C

onfi

gura

tion

s

Clu

ster

ing

Inst

r Sc

hed

Scheduling

2.Clustered+ScheduledCode

slide 9 of 38 www.inf.ed.ac.uk

Page 16: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

Existing Solution [CGO’07] (Decoupled)

• Phase ordering problem

Clu

ster

ing

Inst

r Sc

hed

Scheduling

2.Clustered+ScheduledCode

1.Freq. Configuration

EstimateBest

Configuration

(Fx,Fy)

slide 9 of 38 www.inf.ed.ac.uk

Page 17: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

Existing Solution [CGO’07] (Decoupled)

• Phase ordering problem

• Estimation of Best freq. requires knowledge ofPerformance and Energy

• Performance and Energy measurement requiresschedule

• Scheduling requires that the frequencies are set

Clu

ster

ing

Inst

r Sc

hed

Scheduling

2.Clustered+ScheduledCode

1.Freq. Configuration

EstimateBest

Configuration

(Fx,Fy)

slide 9 of 38 www.inf.ed.ac.uk

Page 18: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

Outline

Introduction

Problem Definition and Existing Solutions

UCIFF

Experimental Setup and Results

Conclusion

slide 10 of 38 www.inf.ed.ac.uk

Page 19: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

UCIFF (unified)

• No cyclic dependency

{f0 0, f }

Code{f0 , f }1

{f0 , f }2

{f 0, f }1

{f , f }1 1

{f , f }1 2

{f 0, f }2

{f , f }2 1

{f , f }2 2Fre

quen

y C

onfi

gura

tion

s

UCIFFCluster assignment

+Clustered+ScheduledFreq. Configuration

Instruction SchedulingFrequency selection

slide 11 of 38 www.inf.ed.ac.uk

Page 20: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

Full-Search Solution

• Brute-force: Try (schedule) all configurations

• After trying all find the best

• Obviously the slowest method

slide 12 of 38 www.inf.ed.ac.uk

Page 21: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

Full-Search Solution

0{f 0, f }

{f 0, f }2

0{f , f }

0{f , f }{f , f }1

, f }{f 1 1

{f , f }1 2

{f , f }1

{f , f }2

2 2

ScheduledFre

quen

cy C

onfi

gura

tion

s

Instr.

1

2

0

slide 12 of 38 www.inf.ed.ac.uk

Page 22: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

Full-Search Solution

0{f 0, f }

{f 0, f }2

0{f , f }

0{f , f }{f , f }1

, f }{f 1 1

{f , f }1 2

{f , f }1

{f , f }2

2 2

ScheduledFre

quen

cy C

onfi

gura

tion

s

Instr.

1

2

0

Schedule

slide 12 of 38 www.inf.ed.ac.uk

Page 23: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

Full-Search Solution

0{f 0, f }

{f 0, f }2

Eva

luat

e

{f 0, f }2

0{f , f }

0{f , f }{f , f }1

, f }{f 1 1

{f , f }1 2

{f , f }1

{f , f }2

2 2

ScheduledFre

quen

cy C

onfi

gura

tion

s

Instr.

1

2

0

Schedule

B

Final Schedule

Best Freq ConfigurationB

slide 12 of 38 www.inf.ed.ac.uk

Page 24: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

Theoretical Oracle Solution

• A-priori knowledge of the best frequencyconfiguration

• Schedules only the configuration which willgenerate the best schedule

• Fastest but Non-implementable

slide 13 of 38 www.inf.ed.ac.uk

Page 25: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

Theoretical Oracle Solution

0{f 0, f }

{f 0, f }2

0{f , f }

0{f , f }{f , f }1

, f }{f 1 1

{f , f }1 2

{f , f }1

{f , f }2

2 2

ScheduledFre

quen

cy C

onfi

gura

tion

s

Instr.

1

2

0

slide 13 of 38 www.inf.ed.ac.uk

Page 26: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

Theoretical Oracle Solution

0{f 0, f }

{f 0, f }2

0{f , f }

0{f , f }{f , f }1

, f }{f 1 1

{f , f }1 2

{f , f }1

{f , f }2

2 2

ScheduledFre

quen

cy C

onfi

gura

tion

s

Instr.

1

2

0

Schedule

slide 13 of 38 www.inf.ed.ac.uk

Page 27: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

Theoretical Oracle Solution

{f 0, f }2 {f 0, f }2

ScheduledFre

quen

cy C

onfi

gura

tion

s

Instr.

Schedule

B

Final Schedule

slide 13 of 38 www.inf.ed.ac.uk

Page 28: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

UCIFF

• Finds a good solution without resorting tofull-search• Hill-Climbing over frequency configurations• Partial schedules “STEP” scheduling cycles each• At the end of each “STEP” it finds the best configuration

and schedules it along with its neighbors for the next“STEP”

slide 14 of 38 www.inf.ed.ac.uk

Page 29: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

UCIFF

0{f 0, f }

{f 0, f }2

STEP STEP STEP

Active Partial Schedule

0{f , f }1

0{f , f }2

{f 0, f }1

, f }{f 1 1

{f , f }1 2

{f , f }1

{f , f }2

2 2

Scheduled Instr.Fre

quen

cy C

onfi

gura

tion

s

slide 15 of 38 www.inf.ed.ac.uk

Page 30: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

UCIFF

0{f 0, f }

{f 0, f }2

STEP STEP STEP

Active Partial Schedule Best Freq ConfigurationB Neighbor to BN

0{f , f }1

0{f , f }2

{f 0, f }1

, f }{f 1 1

{f , f }1 2

{f , f }1

{f , f }2

2 2 B

N

NE

valu

ate

Scheduled Instr.Fre

quen

cy C

onfi

gura

tion

s

slide 15 of 38 www.inf.ed.ac.uk

Page 31: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

UCIFF

0{f 0, f }

{f 0, f }2

STEP STEP STEP

Active Partial Schedule Best Freq ConfigurationB Neighbor to BActive Set

NKill Freq. Configuration

0{f , f }1

0{f , f }2

{f 0, f }1

, f }{f 1 1

{f , f }1 2

{f , f }1

{f , f }2

2 2 B

N

N

Scheduled Instr.Fre

quen

cy C

onfi

gura

tion

s

slide 15 of 38 www.inf.ed.ac.uk

Page 32: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

UCIFF

0{f 0, f }

{f 0, f }2

STEP STEP STEP

Inactive Partial ScheduleActive Partial Schedule Best Freq ConfigurationB

Neighbor to BActive Set

NKill Freq. Configuration

0{f , f }1

0{f , f }2

{f 0, f }1

, f }{f 1 1

{f , f }1 2

{f , f }1

{f , f }2

2 2

Scheduled Instr.Fre

quen

cy C

onfi

gura

tion

s

slide 15 of 38 www.inf.ed.ac.uk

Page 33: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

UCIFF

0{f 0, f }

{f 0, f }2

STEP STEP STEP

N

N

N

BE

valu

ate

Inactive Partial ScheduleActive Partial Schedule Best Freq ConfigurationB

Neighbor to BActive Set

NKill Freq. Configuration

0{f , f }1

0{f , f }2

{f 0, f }1

, f }{f 1 1

{f , f }1 2

{f , f }1

{f , f }2

2 2

Scheduled Instr.Fre

quen

cy C

onfi

gura

tion

s

slide 15 of 38 www.inf.ed.ac.uk

Page 34: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

UCIFF

0{f 0, f }

{f 0, f }2

STEP STEP STEP

Inactive Partial ScheduleActive Partial Schedule Best Freq ConfigurationB

Neighbor to BActive Set

NKill Freq. Configuration

0{f , f }1

0{f , f }2

{f 0, f }1

, f }{f 1 1

{f , f }1 2

{f , f }1

{f , f }2

2 2

Scheduled Instr.Fre

quen

cy C

onfi

gura

tion

s

All instructionsscheduled

slide 15 of 38 www.inf.ed.ac.uk

Page 35: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

UCIFF

0{f 0, f }

{f 0, f }2

STEP STEP STEP

{f 0, f }2B

Eva

luat

e

Final Schedule

Inactive Partial ScheduleActive Partial Schedule Best Freq ConfigurationB

Neighbor to BActive Set

NKill Freq. Configuration

0{f , f }1

0{f , f }2

{f 0, f }1

, f }{f 1 1

{f , f }1 2

{f , f }1

{f , f }2

2 2

Scheduled Instr.Fre

quen

cy C

onfi

gura

tion

s

slide 15 of 38 www.inf.ed.ac.uk

Page 36: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

UCIFF

• Benefits of UCIFF hill-climbing:• More accurate frequency selection than CGO’07 (no

phase-ordering problem)• Not based on estimation of performance or energy

consumption, it measures the actual schedule.• Less scheduling time than full-search• Accuracy close to full-search

slide 16 of 38 www.inf.ed.ac.uk

Page 37: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

Outline

Introduction

Problem Definition and Existing Solutions

UCIFF

Experimental Setup and Results

Conclusion

slide 17 of 38 www.inf.ed.ac.uk

Page 38: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

Experimental Setup

• Compiler• GCC-4.5.0• Modified Haifa-Scheduler• Energy model built into the scheduler

• Architecture• IA64-based 4-cluster/4-issue VLIW 1-cycle inter-cluster

delay• 4 possible frequencies. Fastest:Slowest = 7:4

• Benchmarks• MediabenchII Video Benchmark suite

• Compare• Decoupled, Full-Search, Oracle, UCIFF

slide 18 of 38 www.inf.ed.ac.uk

Page 39: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

Time Complexity Comparison

0

50

100

150

200

250

300

Full−Search UCIFF Oracle/Decoupled

Normalized Scheduled Instructions

cjpegdjpeg

h263ench263dec

mpeg2encmpeg2dec

• 5× faster thanFull-Search

• 30× slower thantheoretical oracle

slide 19 of 38 www.inf.ed.ac.uk

Page 40: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

Delay Comparison

1

1.02

1.04

1.06

1.08

1.1

Decoupled UCIFF Oracle/Full-Search

Normalized DELAY• Decoupled about

5% worse thanOracle

• UCIFF almostidentical to Oracle

• Delay is biasedtowards highfrequencies

slide 20 of 38 www.inf.ed.ac.uk

Page 41: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

ED2 Comparison

1

1.2

1.4

1.6

1.8

2

2.2

Decoupled UCIFF Oracle/Full-Search

Normalized ED2

• Decoupled 1.6×worse than oracle

• UCIFF within 5%of Oracle

• ED2 is hard toestimate

slide 21 of 38 www.inf.ed.ac.uk

Page 42: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

Delay Estimation Accuracy

0%10%20%30%40%50%60%70%80%90%

100%

0% 5% 10% 25% 50%

% o

f est

imat

ions

with

in r

ange

Error Margin

(DECOUPLED, DELAY)

• Decoupled: only60% ofestimations arethe same asOracle

slide 22 of 38 www.inf.ed.ac.uk

Page 43: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

Delay Estimation Accuracy

0%10%20%30%40%50%60%70%80%90%

100%

0% 5% 10% 25% 50%

% o

f est

imat

ions

with

in r

ange

Error Margin

(UCIFF, DELAY)

• Decoupled: only60% ofestimations arethe same asOracle

• UCIFF very closeto Oracle (90%)

slide 22 of 38 www.inf.ed.ac.uk

Page 44: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

ED2 Estimation Accuracy

0%10%20%30%40%50%60%70%80%90%

100%

0% 5% 10% 25% 50%% o

f est

imat

ions

with

in m

argi

n

Error Margin

(DECOUPLED, ED2)

cjpegdjpeg

h263ench263dec

mpeg2encmpeg2dec

• Decoupled is veryinaccurate 40% ofestimations within25% of the Oracle

slide 23 of 38 www.inf.ed.ac.uk

Page 45: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

ED2 Estimation Accuracy

0%10%20%30%40%50%60%70%80%90%

100%

0% 5% 10% 25% 50%% o

f est

imat

ions

with

in m

argi

n

Error Margin

(UCIFF, ED2)

• Decoupled is veryinaccurate 40% ofestimations within25% of the Oracle

• UCIFF is at 95%within 25% ofOracle

slide 23 of 38 www.inf.ed.ac.uk

Page 46: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

Conclusion

• Proposed UCIFF, a unified scheduling algorithmthat• Performs cluster assignment• Performs instruction scheduling• Selects cluster frequencies

in a heterogeneous clustered VLIW

• UCIFF is more accurate and generates betterschedules than the current state-of-the-art

• UCIFF is faster than Full-Search while generatingcode of equivalent quality

slide 24 of 38 www.inf.ed.ac.uk

Page 47: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

UCIFF: Unified Clustering,

Instruction scheduling and

Fast Frequency selection for

Heterogeneous Clustered

VLIW

Vasileios Porpodas and Marcelo Cintra

University of Edinburgh

LCPC 2012

slide 25 of 38 www.inf.ed.ac.uk

Page 48: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

Backup slides

• Bibliography

• CGO’07 Estimations

• Scheduling for heterogeneous (various freq.)

• UCIFF Energy Model

• DVFS regions

• UCIFF Neighbors

• UCIFF Algorithm

slide 26 of 38 www.inf.ed.ac.uk

Page 49: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

Bibliography

• [CGO’07] A. Aleta, J. Codina, A. Gonzalez, and D.Kaeli. Heterogeneous clustered vliwmicroarchitectures. In CGO, pages 354-366, 2007.

Backup Slides

slide 27 of 38 www.inf.ed.ac.uk

Page 50: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

CGO’07 Energy & Performance Estimation• Performance Estimation:

1 Perform Scheduling on a homogeneous architecture2 Cycles = cycles of homogeneous multiplied by the

arithmetic mean of the clock periods of the heterogeneousclusters: Time = cycleshom × (

∑cl Tcl)/NumOfClusters

• Energy Estimation (similar to UCIFF except:)1 Dynamic energy of cluster is equal to a fraction of that of

the homogeneous cluster, proportional to the ratio of thecluster’s frequency to the average frequency:Edyn,ins(cl) =Edyn,ins hom(cl) × fcl/[

∑cl(fcl)/NumOfClusters]

2 Energy of interconnect is equal to that of thehomogeneous Edyn,icc = Picc × NumICCshomogeneous

Backup Slides

slide 28 of 38 www.inf.ed.ac.uk

Page 51: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

Scheduling for clusters of various Freq.• Scheduler’s internal frequency is the lowest integer

common multiple of all possible frequencies of allclusters (Tsched)

• Instruction latencies are specific to each cluster andare a multiple of the original latencies:Tcl/Tsched × Original Latency

11

22

1 1

Tsched

3

Scheduling SlotInstr. 1, latency 1Instr. 2, latency 2Instr. 3, latency 1

2

2

3

3

Tcl0Freq:

Tcl1Freq:1.5 ff

1

2

1

2

3 3

Freq:f Freq:

Tsched

Tcl0 Tcl1f

HeterogeneousHomogeneous

Backup Slidesslide 29 of 38 www.inf.ed.ac.uk

Page 52: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

DVFS Region

• Scheduling regions are too small for DVFS (H/Wlimitation)

• Possible solutions:• Micro-Architecture: Push DVFS points into a FIFO queue

and take the average• Software1: Sampling at a rate acceptable by H/W• Software2: Get an average single DVFS point for the

whole program

Backup Slides

slide 30 of 38 www.inf.ed.ac.uk

Page 53: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

UCIFF Energy Model

• Total Energy:• E =

∑clusters [Est(cl) + Edyn(cl)]

• Static Energy:• Est(cl) = Pst × cyclescl × Tcl

• Pst(cl) = Cst × Vcl

• Dynamic Energy:• Edyn(cl) = Edyn,ins(cl) + Edyn,icc

• Edyn,ins(cl) =∑

ins [Pins(cl) × Latency(ins, cl)]• Pins(cl) = Cdyn × fcl × V 2

cl

• Edyn,icc = Picc × NumICCs• Picc = Cdyn × ffastest × V 2

fastest

Backup Slides

slide 31 of 38 www.inf.ed.ac.uk

Page 54: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

UCIFF Neighbors

• The Configuration{fna, fnb, fnc , ...} is aUCIFF neighbor of{fa, fb, fc , ...} if nx = x

for all x except one (sayy) such that|ny − y | < NDistance. Freq. configuration

Neighbors of

f0 f1 f2

f0

f1

f2

f3

f3

Backup Slides

slide 32 of 38 www.inf.ed.ac.uk

Page 55: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

1 /* Unified Cluster assignment Instr. Scheduling and Fast Frequency selection.2 In1: METRIC_TYPE that the scheduler should optimize for.3 In2: Schedule STEP instructions before evaluating and getting the best.4 In3: STEPVAR: Decrement STEP by STEPVAR upon each evaluation.5 In4: NEIGHBORS: The number of neighbors per cluster.6 Out: Scheduled Code and Best Frequency Configuration. */7 uciff (METRIC_TYPE, STEP, STEPVAR, NEIGHBORS)8 {9 Schedule for STEP cycles and find the Best Freq Configuration (BFC)

10 do11 if (BFC not set) /* If first run */12 NEIGHBORS_SET = all frequency configurations13 else14 NEIGHBORS_SET = neighbors of BFC /*up to NEIGHBORS per cluster*/15 for FCONF in NEIGHBORS_SET16 /* Partially schedule the ready instructions of FCONF frequency

→֒configuration for STEP cycles, optimizing METRIC_TYPE */17 SCORE = cluster_and_schedule (METRIC_TYPE, STEP, FCONF)18 Store the scheduler’s calculated SCORE into SCORECARD [FCONF]19 Decrement STEP by STEPVAR until 1. /* Variable steps (optional) */20 BFC = Best Freq Configuration of SCORECARD, clear SCORECARD21 while there are unscheduled instructions in active set22 return BFC and scheduled code of BFC23 }

Backup Slides

slide 34 of 38 www.inf.ed.ac.uk

Page 56: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

1 /* In1: METRIC_TYPE: The metric type that the scheduler will optimize for.2 In2: STEP: Num of instrs to schedule before switching to next freq. conf.3 In3: FCONF: The architecture’s current frequency configuration.4 Out: Scheduled Code and metric value. */5 cluster_and_schedule (METRIC_TYPE, STEP, FCONF)6 {7 /* Restore ready list for this frequency configuration */8 READY_LIST = READY_LIST_ARRAY [FCONF]9 /* Restore current cycle. CYCLE is the scheduler’s internal cycle. */

10 CYCLE = LAST_CYCLE [FCONF]11 Restore the Reservation Table state that corresponds to FCONF12 while (instructions left to schedule && STEP > 0)13 update READY_LIST with ready to issue at CYCLE, include deferred14 sort READY_LIST based on list-scheduling priorities15 while (READY_LIST not empty)16 select INSN, the highest priority instruction from the READY_LIST17 create LIST_OF_CLUSTERS[] that INSN can be scheduled at on CYCLE18 BEST_CLUSTER=best of LIST_OF_CLUSTERS[] by comparing for each cluster

→֒calculate_heuristic(METRIC_TYPE,CLUSTER,FCONF,INSN,IPCL[])19 /* Try scheduling INSN on the best cluster */20 if (INSN can be scheduled on BEST_CLUSTER at CYCLE)21 schedule INSN, occupy LATENCY[FCONF][BEST_CLUSTER][INSN] slots22 IPCL [CLUSTER] ++ /* count number of instructions per cluster */23 remove INSN from READY_LIST24 /* If failed to schedule INSN on best cluster, defer to next cycle */25 if (INSN unscheduled)26 remove INSN from READY_LIST and re-insert it at CYCLE + 127 /* No instructions left in ready list for CYCLE, then CYCLE ++ */28 CYCLE ++29 /* If we have scheduled for STEP cycles, finalize and exit */30 if (CYCLE $>$ LAST_CYCLE[FCONF] + STEP)31 Update READY_LIST_ARRAY[], LAST_CYCLE[] and Reservation Table32 return33 }

slide 36 of 38 www.inf.ed.ac.uk

Page 57: UCIFF: Unified Clustering, Instruction scheduling and Fast ...vporpo.me/papers/uciff_lcpc2012_slides.pdf · • Estimation of Best freq. requires knowledge of Performance and Energy

1 /* In1: METRIC_TYPE: The metric type that the scheduler will optimize for.2 In2: CLUSTER: The cluster that INSN will be tested on.3 In3: FCONF: The architecture’s current frequency configuration.4 In4: INSN: The instruction currently under consideration.5 In5: IPCL: The Instruction count Per CLuster (for dyn energy).6 Out: metric value of METRIC_TYPE if INSN scheduled on CLUSTER under FCONF*/7 calculate_heuristic (METRIC_TYPE, CLUSTER, FCONF, INSN, IPCL[])8 {9 START_CYCLE = earliest cycle INSN can be scheduled at on CLUSTER

10 UCIFF_SC = START_CYCLE + LATENCY[FCONF][CLUSTER][INSN]11 switch (METRIC_TYPE)12 case ENERGY: return energy (CLUSTER, FCONF, UCIFF_SC, IPCL[])13 case EDP: return edp (CLUSTER, FCONF, UCIFF_SC, IPCL[])14 case ED2: return ed2 (CLUSTER, FCONF, UCIFF_SC, IPCL[])15 case DELAY: return UCIFF_SC16 }

Backup Slides

slide 38 of 38 www.inf.ed.ac.uk