Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems

International Symposium on Low Power Electronics and Design


Jieming Yin, Pingqiang Zhou, Anup Holey, Sachin S. Sapatnekar, and Antonia Zhai

University of Minnesota – Twin Cities


Network-on-Chips

• Leads to latency
• Leads to energy consumption
• Scalable
• Provides high bandwidth

[Figure: eight cores, each attached to a router (R)]

Heterogeneous System

[Figure: four Data Parallel units and four Super-scalar units]


Only some routers are fully utilized


DVFS for Reducing NoC Energy

Dynamic Voltage and Frequency Scaling
• Router energy dominates
• DVFS reduces router energy, but adds delay
• Previous work applies DVFS conservatively

We need more aggressive DVFS


Limitations of Aggressive DVFS

• DVFS to reduce energy
• Limitations of aggressive DVFS
  – Increases latency
  – Reduces throughput
• Works only for limited traffic patterns

[Figure: workloads plotted by latency (sensitive/insensitive) and throughput (high/low), with contention marked; regions labeled Dynamic Voltage Frequency Scaling, Our Previous Work *, and This Work]

* Zhou et al., NoC Frequency Scaling with Flexible-Pipeline Routers, ISLPED-2011

Flexible-Pipeline Routers

[Figure: two rows of router pipeline stages 1 2 3 4, each at Frequency = 0.5F, with intervals labeled T]

Flexible pipeline reduces router pipeline delay


Exploiting DVFS Opportunity

(a) Minimal path routing: path 1 from Src1 to Dest1

(b) Non-minimal path routing: path 1’ from Src1 to Dest1

[Figure legend: routers marked as high, mid, or low utilization]


Exploiting DVFS Opportunity (cont.)

• Dynamic Energy: $E_{Dyn} \propto V_{dd}^2$
• Static Energy: $E_{Sta} \propto V_{dd}$
• Clock Energy: $E_{Clk} \propto Freq \cdot V_{dd}^2$

DVFS parameters and normalized energy by router speed:

Router Speed   Freq (GHz)   Vdd (V)   Normalized Energy
High           1.5          1.2       1.0
Mid            0.75         1.0       0.67
Low            0.375        0.8       0.49

Operating at Mid-frequency gets most benefit
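The arithmetic behind the table can be checked with a short Python sketch. It applies the three proportionalities relative to the High level and reads the energy saved by each step down straight from the table's normalized-energy column. The function and dictionary names are illustrative assumptions. The slides do not give the weights of the dynamic, static, and clock components, so the sketch does not try to reproduce the normalized-energy column itself.

```python
# DVFS operating points copied from the table on this slide.
LEVELS = {
    "High": {"freq_ghz": 1.5, "vdd": 1.2, "norm_energy": 1.0},
    "Mid": {"freq_ghz": 0.75, "vdd": 1.0, "norm_energy": 0.67},
    "Low": {"freq_ghz": 0.375, "vdd": 0.8, "norm_energy": 0.49},
}


def component_scaling(level, ref="High"):
    """Scale factors implied by the proportionalities, relative to `ref`.

    Hypothetical helper: only the proportionalities come from the slides.
    """
    f = LEVELS[level]["freq_ghz"] / LEVELS[ref]["freq_ghz"]
    v = LEVELS[level]["vdd"] / LEVELS[ref]["vdd"]
    return {
        "dynamic": v ** 2,    # E_Dyn scales with Vdd^2
        "static": v,          # E_Sta scales with Vdd
        "clock": f * v ** 2,  # E_Clk scales with Freq * Vdd^2
    }


def marginal_saving(faster, slower):
    """Normalized energy saved by stepping from `faster` down to `slower`."""
    return LEVELS[faster]["norm_energy"] - LEVELS[slower]["norm_energy"]


if __name__ == "__main__":
    for name in LEVELS:
        print(name, component_scaling(name))
    print("High -> Mid saving:", round(marginal_saving("High", "Mid"), 2))
    print("Mid -> Low saving:", round(marginal_saving("Mid", "Low"), 2))
```

Using the table values, the step from High to Mid saves roughly 0.33 of the normalized energy, while the further step from Mid to Low saves roughly 0.18, which is consistent with the slide's remark about the Mid level.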



Exploiting DVFS Opportunity (cont.)

[Figure: path 1 and path 1’ from Src1 to Dest1 across routers running at 100%, 50%, and 25% frequency]

1. Performance

2. Dynamic Energy

3. Static Energy

More benefit with bigger network


Outline

• Introduction
• Non-minimal path selection
  - Issue
  - Solution
  - Challenges
• Infrastructure (CPU+GPU)
• Results
• Conclusion


Non-minimal Path Routing


[Figure: routes from Src to Dest across routers with high, mid, and low utilization; panel (b) shows non-minimal path routing]


Too Close!

[Figure: two Src-to-Dest routes across routers with high, mid, and low utilization; annotated with Performance, Static Energy, and Dynamic Energy]


Non-minimal path routing

Too Aggressive!

[Figure: route from Src1 to Dest1 across routers with high, mid, and low utilization; annotated with Static Energy and Dynamic Energy]


Dynamic Network Tuning

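[Flowchart, packet side: Input, then the checks "Slack == 1" and "Dx>=3 || Dy>=3"; the Y branch selects the least busy port, the N branches select the min. path port, then Output. A "Slack = 0" step also appears.]

[Flowchart, router side: Initial State, Utilization Monitor, V/F Scaling, linked by busy information propagation]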
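The packet-side decision can be sketched in Python as follows. This is one reading of the flowchart, not a confirmed implementation: it assumes a packet takes the least busy port only when it has slack and is at least three hops from its destination in X or Y, and takes the minimal-path port otherwise. The function name, the argument names, and the representation of busy metrics as a dictionary are hypothetical. The "Slack = 0" step is not modeled because its position in the flowchart cannot be recovered from the slide text.

```python
def select_output_port(slack, dx, dy, min_path_port, port_busy, min_distance=3):
    """Pick an output port for one packet at one router (illustrative sketch).

    slack         -- 1 if the packet can tolerate extra delay, else 0
    dx, dy        -- remaining hop distance to the destination in X and Y
    min_path_port -- port that lies on a minimal path
    port_busy     -- dict: candidate port name -> busy metric (lower = less busy)
    min_distance  -- distance threshold; 3 matches the "Dx>=3 || Dy>=3" check
    """
    far_enough = abs(dx) >= min_distance or abs(dy) >= min_distance
    if slack == 1 and far_enough:
        # Packet has slack and is far from its destination: detour is allowed.
        return min(port_busy, key=port_busy.get)
    # No slack, or too close to the destination: stay on the minimal path.
    return min_path_port


if __name__ == "__main__":
    busy = {"E": 0.9, "N": 0.2, "S": 0.5}
    print(select_output_port(1, 3, 0, "E", busy))  # slack and far: least busy port
    print(select_output_port(0, 3, 0, "E", busy))  # no slack: minimal-path port
    print(select_output_port(1, 2, 1, "E", busy))  # too close: minimal-path port
```

The distance check reflects the concern raised on the "Too Close!" slide: a detour between nearby nodes adds hops without much room to spread load.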

How to determine Slack?

Busy Information Propagation
• Busy Metrics
  - Buffer utilization
  - Crossbar utilization
  - Router utilization
• Propagation
  - Regional congestion awareness [Grot et al. HPCA08]


Regional Congestion Awareness


• Local data collection
• Propagation to neighboring routers
• Aggregation of local & non-local data
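The three steps can be illustrated with a minimal Python sketch. It assumes a single row of routers, information flowing from east to west, and an equal weighting of local and non-local data. The weighting, the one-dimensional simplification, and all names are assumptions made for illustration; the slides do not specify them.

```python
def propagate_row(local_busy, local_weight=0.5):
    """Aggregate busy estimates along one row of routers, east to west.

    local_busy[i] is router i's own busy metric (local data collection).
    The result holds, for each router, the estimate it would pass to its
    west neighbor: a weighted mix of its own metric and the estimate it
    received from its east neighbor (aggregation of local and non-local data).
    """
    aggregated = [0.0] * len(local_busy)
    incoming = None  # the easternmost router has no east neighbor
    for i in range(len(local_busy) - 1, -1, -1):
        if incoming is None:
            aggregated[i] = local_busy[i]
        else:
            aggregated[i] = (local_weight * local_busy[i]
                             + (1 - local_weight) * incoming)
        incoming = aggregated[i]  # propagation to the neighboring router
    return aggregated


if __name__ == "__main__":
    # A busy router at the east end is still visible, attenuated, two hops away.
    print(propagate_row([0.0, 0.0, 1.0]))
```

With equal weights, the influence of a distant router halves at every hop, so nearby congestion counts for more than remote congestion.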

Slack in Applications

Slack of a packet : The number of cycles the packet can be delayed without affecting the overall execution time

[Figure: timeline of Thread 0, Thread 1, Thread 2, Thread n, Thread 0, marking Thread 0 read miss, Thread 0 ready, and Thread 0 schedule]

• CPU: slack may exist, but assume NO slack
• GPU: Based on # of threads
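A minimal Python sketch of this policy is shown below. It assumes a one-bit slack value, as in the "Slack == 1" check of the tuning flowchart, and a fixed thread-count threshold. The threshold value, the function name, and the argument names are hypothetical; the slides state only that CPU packets are assumed to have no slack and that GPU slack depends on the number of threads.

```python
def predict_slack(is_gpu, available_threads=0, thread_threshold=4):
    """Return 1 if a packet is predicted to have slack, else 0 (illustrative).

    is_gpu            -- True if the packet was issued by a GPU SM
    available_threads -- threads the SM could schedule while this request waits
    thread_threshold  -- hypothetical cut-off; not given in the slides
    """
    if not is_gpu:
        return 0  # CPU traffic: assume no slack
    # GPU traffic: enough other threads to run means the delay can be hidden.
    return 1 if available_threads >= thread_threshold else 0


if __name__ == "__main__":
    print(predict_slack(False, available_threads=16))
    print(predict_slack(True, available_threads=16))
    print(predict_slack(True, available_threads=1))
```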


Tile-Based Multicore System

[Figure: tiled layout in which each router (R) connects to a CPU Core (C), GPU SM (G), L2 Cache (L2), or MC (M); four MEM blocks are also shown]


Benchmark

• Benchmarks
  – CPU: afi, ammp, art, equake, kmeans, scalparc
  – GPU: blackscholes, lps, lib, nn, bfs
• Evaluate ALL 30 CPU+GPU combinations
• For presentation purposes, classify
  - CPU: 1) Memory-bound 2) Computation-bound (based on: L1 cache miss rate)
  - GPU: 1) Latency-tolerant 2) Latency-intolerant (based on: slack cycles)


Benchmark Categorization


[Figure: axes of Latency and Throughput, each ranging from Low to High]

(I) memory-bound CPU + latency-tolerant GPU

(II) computation-bound CPU + latency-tolerant GPU

(III) memory-bound CPU + latency-intolerant GPU

(IV) computation-bound CPU + latency-intolerant GPU

Network Energy Saving

[Figure: network energy (y-axis 0.6 to 1) for Category I through Category IV under Baseline, DVFS, and DVFS+Non-min]

Energy saving is significant on certain workloads

Performance Impact (CPU)

[Figure: normalized IPC (y-axis 0.6 to 1) for Category I through Category IV under Baseline, DVFS, and DVFS+Non-min]

[Figure: normalized IPC (y-axis 0.9 to 1) for equake+LPS, art+NN, and ammp+LIB under Baseline, DVFS, and DVFS+Non-min]

Performance Impact (GPU)

[Figure: normalized IPC (y-axis 0.6 to 1) for Category I through Category IV under Baseline, DVFS, and DVFS+Non-min]

Performance penalty is minimal compared to DVFS


Conclusion

Non-minimal Path NoC
+ Balance on-chip workloads
+ Reduce NoC energy

Workload Mix
• High throughput
• Latency insensitive

[Figure: axes of Latency and Throughput, each ranging from Low to High]

Given the diverse traffic patterns in heterogeneous systems, non-minimal routing should be deployed judiciously


Thank You!

Exploiting Slack in GPU

[Figure: System SpeedUp (y-axis 0 to 1.2) versus Delay of Scheduling in cycles (x-axis 0 to 100) for BlackScholes, LPS, LIB, NN, BFS, RAY, and MUM]

Predict slack based on # of available warps

Exploiting Slack in GPU

[Figure: Avail Warps (y-axis 0 to 25) versus Tolerable Delay Cycles (x-axis 0 to 30) for BlackScholes, LPS, LIB, NN, BFS, RAY, and MUM]
