HAT: Heterogeneous Adaptive Throttling for On-Chip Networks
Kevin Kai-Wei Chang, Rachata Ausavarungnirun, Chris Fallin, Onur Mutlu



Page 1:

HAT: Heterogeneous Adaptive Throttling for On-Chip Networks

Kevin Kai-Wei Chang    Rachata Ausavarungnirun

Chris Fallin    Onur Mutlu

Page 2:

Executive Summary

Problem: Packets contend in on-chip networks (NoCs), causing congestion and thus reducing system performance

Approach: Source throttling (temporarily delaying packet injections) to reduce congestion

1) Which applications to throttle?
Observation: Throttling network-intensive applications leads to higher system performance
→ Key idea 1: Application-aware source throttling

2) How much to throttle?
Observation: There is no single throttling rate that works well for every application workload
→ Key idea 2: Dynamic throttling rate adjustment

Result: Improves both system performance and energy efficiency over state-of-the-art source throttling policies

Page 3:

Outline

• Background and Motivation
• Mechanism
• Comparison Points
• Results
• Conclusions

Page 4:

On-Chip Networks

• Connect cores, caches, memory controllers, etc.
  – Buses and crossbars are not scalable

[Figure: processing elements (PE: cores, L2 banks, memory controllers, etc.) connected by an interconnect]

Page 5:

On-Chip Networks

• Connect cores, caches, memory controllers, etc.
  – Buses and crossbars are not scalable
• Packet switched
• 2D mesh: the most commonly used topology
• Primarily serve cache misses and memory requests

[Figure: 3x3 2D mesh of routers (R), each attached to a processing element (PE: cores, L2 banks, memory controllers, etc.)]

Page 6:

Network Congestion Reduces Performance

• Limited shared resources (buffers and links), due to design constraints: power, chip area, and timing

Network congestion → lower system performance

[Figure: 2D mesh in which packets (P) contend for routers (R) and links, creating congestion]

Page 7:

Goal

• Improve system performance in a highly congested network

• Observation: Reducing network load (the number of packets in the network) decreases network congestion and hence improves system performance

• Approach: Source throttling (temporarily delaying new traffic injection) to reduce network load

Page 8:

Source Throttling

Long network latency when the network is congested

[Figure: processing elements (PE) injecting packets (P) into a congested network through injection ports (I)]

Page 9:

Source Throttling

• Throttling makes some packets wait longer to inject
• Average network throughput increases, hence higher system performance

[Figure: the same network with some sources throttled, so fewer packets contend inside the network]

Page 10:

Key Questions of Source Throttling

• Every cycle when a node has a packet to inject, source throttling blocks the packet with probability P
  – We call P the "throttling rate" (it ranges from 0 to 1; see the sketch below)
• The throttling rate can be set independently on every node

Two key questions:
1. Which applications to throttle?
2. How much to throttle?

Naïve mechanism: throttle every single node with a constant throttling rate
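As a concrete illustration, here is a minimal sketch of that per-cycle injection gate in Python (the function and parameter names are hypothetical; the slides do not prescribe an implementation):

```python
import random

def may_inject(throttling_rate: float) -> bool:
    """Per-cycle injection gate at one node.

    A packet that is ready to inject is blocked with probability
    `throttling_rate` (P, in [0, 1]); a blocked packet simply
    retries on a later cycle.
    """
    return random.random() >= throttling_rate
```

With P = 0 the node injects freely; with P = 1 injection is fully blocked, which is how the "Self-Tuned" comparison point throttles.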

Page 11:

Key Observation #1

Throttling network-intensive applications leads to higher system performance

Configuration: 16-node system, 4x4 mesh network, 8 copies of gromacs (network-non-intensive) and 8 copies of mcf (network-intensive)

Throttling gromacs decreases system performance by 2% due to minimal network load reduction

Page 12:

Key Observation #1

Throttling network-intensive applications leads to higher system performance

Configuration: 16-node system, 4x4 mesh network, 8 copies of gromacs (network-non-intensive) and 8 copies of mcf (network-intensive)

Throttling mcf increases system performance by 9% (gromacs: +14%, mcf: +5%) due to reduced congestion

Page 13:

Key Observation #1

Throttling network-intensive applications leads to higher system performance

Configuration: 16-node system, 4x4 mesh network, 8 copies of gromacs (network-non-intensive) and 8 copies of mcf (network-intensive)

• Throttling mcf reduces congestion; gromacs benefits more from the lower network latency
• Throttling network-intensive applications reduces congestion, which benefits both network-non-intensive and network-intensive applications

Page 14:

Key Observation #2

There is no single throttling rate that works well for every application workload

The network runs best at or below a certain network load, so the throttling rate must be adjusted dynamically to avoid both overload and under-utilization

[Figure: every node throttled at a fixed rate of 0.8, illustrating that one static rate cannot fit all workloads]

Page 15:

Outline

• Background and Motivation
• Mechanism
• Comparison Points
• Results
• Conclusions

Page 16:

Heterogeneous Adaptive Throttling (HAT)

1. Application-aware throttling: Throttle network-intensive applications that interfere with network-non-intensive applications

2. Network-load-aware throttling rate adjustment: Dynamically adjust the throttling rate to adapt to different workloads and program phases

Page 17:

Heterogeneous Adaptive Throttling (HAT)

1. Application-aware throttling: Throttle network-intensive applications that interfere with network-non-intensive applications

2. Network-load-aware throttling rate adjustment: Dynamically adjust the throttling rate to adapt to different workloads and program phases

Page 18:

Application-Aware Throttling

1. Measure applications' network intensity
   – Use L1 MPKI (misses per thousand instructions) to estimate network intensity

2. Throttle network-intensive applications
   – How to select the unthrottled applications? Leaving too many applications unthrottled overloads the network
   → Select unthrottled applications so that their total network intensity (Σ MPKI) stays below a threshold representing the total network capacity; applications with higher L1 MPKI are throttled (a sketch of this classification follows below)

[Figure: applications ordered by L1 MPKI; the low-MPKI group with Σ MPKI < threshold is left unthrottled (network-non-intensive), the rest are throttled (network-intensive)]
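A minimal sketch of that selection step, assuming a greedy pass from least to most network-intensive (the slides specify only the Σ MPKI < threshold condition):

```python
def classify_unthrottled(mpki_per_app: dict[str, float],
                         capacity_threshold: float) -> set[str]:
    """Select the set of applications left unthrottled.

    Admit applications in increasing order of L1 MPKI while their
    summed MPKI stays below a threshold standing in for the total
    network capacity; everything else gets throttled. The greedy
    low-MPKI-first order is an illustrative assumption.
    """
    unthrottled: set[str] = set()
    total_mpki = 0.0
    for app, mpki in sorted(mpki_per_app.items(), key=lambda kv: kv[1]):
        if total_mpki + mpki >= capacity_threshold:
            break
        unthrottled.add(app)
        total_mpki += mpki
    return unthrottled
```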

Page 19:

Heterogeneous Adaptive Throttling (HAT)

1. Application-aware throttling: Throttle network-intensive applications that interfere with network-non-intensive applications

2. Network-load-aware throttling rate adjustment: Dynamically adjust the throttling rate to adapt to different workloads and program phases

Page 20:

Dynamic Throttling Rate Adjustment

• Different workloads require different throttling rates to avoid overloading the network
• But network load (the fraction of occupied buffers/links) is an accurate indicator of congestion
• Key idea: Measure the current network load and dynamically adjust the throttling rate based on it:

  if network load > target:
      increase throttling rate    (network is congested: throttle more)
  else:
      decrease throttling rate    (network is not congested: avoid unnecessary throttling)
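A minimal sketch of one such update, assuming an additive step and clamping to [0, 1] (the slides do not give the exact update rule):

```python
def adjust_throttling_rate(rate: float, network_load: float,
                           target_load: float, step: float = 0.05) -> float:
    """One load-feedback update of the throttling rate.

    The additive `step` and the clamping to [0, 1] are illustrative
    assumptions; only the increase/decrease decision comes from the slide.
    """
    if network_load > target_load:
        rate += step  # congested: throttle more
    else:
        rate -= step  # not congested: avoid unnecessary throttling
    return min(1.0, max(0.0, rate))
```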

Page 21:

Heterogeneous Adaptive Throttling (HAT)

1. Application-aware throttling: Throttle network-intensive applications that interfere with network-non-intensive applications

2. Network-load-aware throttling rate adjustment: Dynamically adjust the throttling rate to adapt to different workloads and program phases

Page 22:

Epoch-Based Operation

• Application classification and throttling rate adjustment are expensive if done every cycle
• Solution: recompute at epoch granularity (100K-cycle epochs)

During an epoch, every node:
1) Measures its L1 MPKI
2) Measures the network load

At the beginning of each epoch, all nodes send the measured information to a central controller, which:
1) Classifies the applications
2) Adjusts the throttling rate
3) Sends the new classification and throttling rate to each node

(A sketch of this epoch loop follows below.)
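Putting the two previous sketches together, one epoch boundary might look like this (the node and policy objects are hypothetical stand-ins for the hardware; only the control flow mirrors the slide):

```python
EPOCH_CYCLES = 100_000  # epoch length used on the slide

def epoch_boundary(nodes, policy):
    """Central-controller work at the start of each epoch.

    Each node reports its measured L1 MPKI and network load; the
    controller reclassifies applications, updates the throttling rate,
    and pushes the new policy back out for the next epoch.
    """
    mpki = {node.app_id: node.measured_mpki for node in nodes}
    avg_load = sum(node.measured_load for node in nodes) / len(nodes)

    unthrottled = classify_unthrottled(mpki, policy.capacity_threshold)
    policy.rate = adjust_throttling_rate(policy.rate, avg_load,
                                         policy.target_load)

    for node in nodes:  # distribute classification and rate to every node
        node.set_policy(throttle=(node.app_id not in unthrottled),
                        rate=policy.rate)
```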

Page 23:

Putting It Together: Key Contributions

1. Application-aware throttling
   – Throttle network-intensive applications based on the applications' network intensities

2. Network-load-aware throttling rate adjustment
   – Dynamically adjust the throttling rate based on network load to avoid overloading the network

HAT is the first work to combine application-aware throttling and network-load-aware rate adjustment

Page 24:

Outline

• Background and Motivation
• Mechanism
• Comparison Points
• Results
• Conclusions

Page 25:

Comparison Points

• Source throttling for bufferless NoCs [Nychis+ HotNets'10, SIGCOMM'12]
  – Throttles network-intensive applications when other applications cannot inject
  – Does not take network load into account
  – We call this "Heterogeneous Throttling"

• Source throttling for buffered networks [Thottethodi+ HPCA'01]
  – Throttles every application when the network load exceeds a dynamically tuned threshold
  – Not application-aware
  – Fully blocks packet injections while throttling
  – We call this "Self-Tuned Throttling"

Page 26:

Outline

• Background and Motivation
• Mechanism
• Comparison Points
• Results
• Conclusions

Page 27:

Methodology

• Chip Multiprocessor Simulator
  – 64-node multi-core systems with a 2D-mesh topology
  – Closed-loop core/cache/NoC cycle-level model
  – 64KB L1, perfect L2 (always hits, to stress the NoC)

• Router Designs
  – Virtual-channel buffered router: 4 VCs, 4 flits/VC [Dally+ IEEE TPDS'92]
    • Input buffers hold contending packets
  – Bufferless deflection router: BLESS [Moscibroda+ ISCA'09]
    • Misroutes (deflects) contending packets

Page 28:

Methodology

• Workloads
  – 60 multi-core workloads of SPEC CPU2006 benchmarks
  – 4 workload categories based on the network intensity of the applications (Low/Medium/High)

• Metrics
  – System performance: $\text{Weighted Speedup} = \sum_i \frac{IPC_i^{shared}}{IPC_i^{alone}}$
  – Fairness: $\text{Maximum Slowdown} = \max_i \frac{IPC_i^{alone}}{IPC_i^{shared}}$
  – Energy efficiency: $\text{PerfPerWatt} = \frac{\text{Weighted Speedup}}{\text{Power}}$
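For concreteness, the three metrics written as straightforward Python functions over per-application IPC values (names are illustrative):

```python
def weighted_speedup(ipc_shared: list[float], ipc_alone: list[float]) -> float:
    """System performance: sum over applications of IPC_shared / IPC_alone."""
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))

def maximum_slowdown(ipc_shared: list[float], ipc_alone: list[float]) -> float:
    """Unfairness: the worst per-application slowdown, IPC_alone / IPC_shared."""
    return max(a / s for s, a in zip(ipc_shared, ipc_alone))

def perf_per_watt(weighted_speedup_value: float, power_watts: float) -> float:
    """Energy efficiency: weighted speedup per watt of network power."""
    return weighted_speedup_value / power_watts
```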

Page 29:

Performance: Bufferless NoC (BLESS)

[Figure: weighted speedup (0-50) for workload categories HL, HML, HM, H, and AVG under No Throttling, Heterogeneous, and HAT, with improvement callouts of +7.4%, +9.4%, +11.5%, +11.3%, and +8.6%]

1. HAT provides a better performance improvement than state-of-the-art throttling approaches
2. The highest improvement is on heterogeneous workloads, since L and M applications are more sensitive to network latency

Page 30:

Performance: Buffered NoC

[Figure: weighted speedup (0-50) for workload categories HL, HML, HM, H, and AVG under No Throttling, Self-Tuned, Heterogeneous, and HAT, with a +3.5% improvement callout]

HAT provides a better performance improvement than prior approaches

Page 31:

Application Fairness

[Figure: normalized maximum slowdown (lower is better) for BLESS (No Throttling, Heterogeneous, HAT) and for the buffered network (No Throttling, Self-Tuned, Heterogeneous, HAT), with -15% and -5% callouts for HAT]

HAT provides better fairness than prior works

Page 32:

Network Energy Efficiency

[Figure: normalized performance per watt for BLESS and Buffered under No Throttling and HAT; HAT gains +8.5% (BLESS) and +5% (Buffered)]

HAT increases energy efficiency by reducing network load by 4.7% (BLESS) or 0.5% (Buffered)

Page 33:

Other Results in the Paper

• Performance on CHIPPER [Fallin+ HPCA'11]
  – HAT improves system performance

• Performance on multithreaded workloads
  – HAT is not designed for multithreaded workloads, but it slightly improves system performance

• Parameter sensitivity sweep of HAT
  – HAT provides consistent system performance improvement across different network sizes

Page 34:

Conclusions

Problem: Packets contend in on-chip networks (NoCs), causing congestion and thus reducing system performance

Approach: Source throttling (temporarily delaying packet injections) to reduce congestion

1) Which applications to throttle?
Observation: Throttling network-intensive applications leads to higher system performance
→ Key idea 1: Application-aware source throttling

2) How much to throttle?
Observation: There is no single throttling rate that works well for every application workload
→ Key idea 2: Dynamic throttling rate adjustment

Result: Improves both system performance and energy efficiency over state-of-the-art source throttling policies

Page 35:

HAT: Heterogeneous Adaptive Throttling for On-Chip Networks

Kevin Kai-Wei Chang    Rachata Ausavarungnirun

Chris Fallin    Onur Mutlu

Page 36:

Throttling Rate Steps

Page 37:

Network Sizes

[Figure: weighted speedup (higher is better) and unfairness (lower is better) across network sizes of 16, 64, and 144 nodes for VC-buffered, BLESS, and CHIPPER routers, each with and without HAT; HAT improves performance at every network size, with callouts ranging from about 1.3% to 20.0%]

Page 38:

Performance on CHIPPER/BLESS

[Figure: weighted speedup (higher is better) and unfairness (lower is better) for workload categories HL, HML, HM, H, and AVG, comparing CHIPPER, CHIPPER-HETERO., CHIPPER-HAT, BLESS, BLESS-HETERO., and BLESS-HAT; improvement callouts range from about 5.4% to 25.6%]

Injection rate (number of injected packets / cycle): +8.5% (BLESS) or +4.8% (CHIPPER)

Page 39:

Multithreaded Workloads

Page 40:

Motivation

[Figure: weighted speedup (6-16) vs. throttling rate (80%-100%) for two workloads; Workload 1 peaks near a 90% throttling rate and Workload 2 near 94%, so the best rate is workload-dependent]

Page 41:

Sensitivity to Other Parameters

• Network load target: WS peaks between 50% and 65% network utilization
  – WS drops by more than 10% beyond that
• Epoch length: WS varies by less than 1% between 20K and 1M cycles
• Low-intensity workloads: HAT does not impact system performance
• Unthrottled network intensity threshold:

Page 42:

Implementation

• L1 MPKI: one L1 miss hardware counter and one instruction hardware counter per node
• Network load: one hardware counter to monitor the number of flits routed to neighboring nodes
• Computation: done at one central CPU node
  – At most several thousand cycles (a few percent overhead for a 100K-cycle epoch)

(A sketch of the per-node counter bookkeeping follows below.)
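A minimal software model of those per-node counters (field and method names are illustrative; the slide only specifies which counters exist):

```python
from dataclasses import dataclass

@dataclass
class NodeCounters:
    """Per-node hardware counters from the slide, modeled in software."""
    l1_misses: int = 0
    instructions: int = 0
    flits_routed: int = 0  # flits forwarded to neighboring nodes

    def l1_mpki(self) -> float:
        """L1 misses per thousand instructions for the current epoch."""
        return 1000.0 * self.l1_misses / max(1, self.instructions)

    def reset(self) -> None:
        """Clear counters at each epoch boundary."""
        self.l1_misses = self.instructions = self.flits_routed = 0
```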