Transcript

10/8/21

1

Battery

Sensor

ฮผP

Mem

Network-on-Chip Performance Analysis and Optimization for Deep Learning Applications

Preliminary Examination10 June 2021

Sumit K. Mandal

1

2

ยง Motivation

ยง Preliminary Work-1: Performance Analysis of NoCsโ€“ Priority-Aware NoCsโ€“ NoCs with Bursty Trafficโ€“ NoCs with Deflection Routing

ยง Preliminary Work-2: Communication-Aware Hardware Accelerator for DNNs

ยง Proposed Workโ€“ Multi-Objective Optimization to Design Optimized NoCโ€“ Accelerator for Graph Convolutional Network (GCN)

ยง Conclusions

Agenda

2

10/8/21

2

3

Motivation: Communication in Deep Learning

ยง Deep Neural Networks (DNNs) are growing deeper and denser day-by-day

ยง Hardware accelerators for DNNs reduce the execution time with respect to general purpose processors

ยง DNNs exhibit high communication volume between layersโ€“ Contributes up to 90% of the inference latencyโ€“ Contributes up to 40% of the energy

ยง Need for an efficient on-chip communication techniques for DNN accelerators

Our internal results from NeuroSim [1]

[1] Chen, Pai-Yu, Xiaochen Peng, and Shimeng Yu. "NeuroSim: A circuit-level macro model for benchmarking neuro-inspired architectures in online learning." IEEE Trans. on CAD of Integrated Circuits and Systems 37.12 (2018): 3067-3080.

3

4

ยง System modelling challengesโ€“ Both SW and HW are growing in complexityโ€“ Complex applications require long runs

for meaningful pre-silicon performance, power, and thermal analysis

ยง Research needโ€“ Communication fabric: Central shared resourceโ€“ Fast and accurate system level modelingโ€“ Automated generation of high-level performance

and power models for SoCs

Motivation: NoC Performance AnalysisOur internal results from gem5 [1]

[1] Binkert, Nathan, et al. "The gem5 simulator." ACM SIGARCH computer architecture news 39.2 (2011): 1-7.

4

10/8/21

3

5

ยง Motivation

ยง Preliminary Work-1: Performance Analysis of NoCsโ€“ Priority-Aware NoCsโ€“ NoCs with Bursty Trafficโ€“ NoCs with Deflection Routing

ยง Preliminary Work-2: Communication-Aware Hardware Accelerator for DNNs

ยง Proposed Workโ€“ Multi-Objective Optimization to Design Optimized NoCโ€“ Accelerator for Graph Convolutional Network (GCN)

ยง Conclusions

Agenda

5

6

ยง Industrial NoCs use routers with priority arbitration to achieve predictable latencyโ€“ Packets in NoC have higher priority than new packets

Priority-Aware Networks-on-Chip Basics

Ring

QnQ1 Q2

Abstract Model

Qn

๐๐

๐

1

21

2 2

1

Q1

Q2

Ring

12

๐

queue arbiter server splitModeling primitives

Physical Network

ยง Muxโ€™ in routers modeled as priority arbitersand servers

ยง Inputs with filled color denote higher priority

6

10/8/21

4

7

Example Fabric: Xeon Phi (KNL) Processor

Also used in client and Xeon server (e.g., Skylake, Icelake)VPU: Vector Processing Unit

7

8

ยง A 4x4 mesh with YX routingยง Source to destination latency (๐ฟ!") has four

componentsโ€“ Waiting time in source queue (๐‘Š!")โ€“ Deterministic vertical latency (๐ฟ#)โ€“ Waiting time at the junction (๐‘Š!

$)โ€“ Deterministic horizontal latency (๐ฟ%)

ยง ๐ฟ# and ๐ฟ$ depend on the source-destination pair and fabric topology

๐‘ณ๐‘บ๐‘ซ = ๐‘พ๐‘ธ๐‘บ + ๐‘ณ๐’— +๐‘พ๐‘ธ

๐‘ฑ + ๐‘ณ๐’‰

ยง ๐‘Š+! and ๐‘Š+

, depend on injection rates at different queues and need detailed analysis

Performance Analysis of Comm. Fabric

1

1

2

2

3

3

4

4

5

5

6

6

7

7

8

8

13

13

14

14

15

15

16

16

9

9

10

10

11

11

12

12

R

PEPE: Processing ElementR: Router

๐‘บ

๐‘ซ

8

10/8/21

5

9

Prior Work on NoC Performance AnalysisNoC Priority-

basedMultiple classes

Scalable BurstyTraffic

[1,2,5] รป โœ“ โœ“ โœ“[3,4] โœ“ รป โœ“ โœ“

Off-chipNetwork

Priority-based

Multiple classes

Scalable BurstyTraffic

[6,7] โœ“ รป โœ“ รป[8] โœ“ โœ“ รป รป

Priority-based

Multiple classes

Scalable BurstyTraffic

This work โœ“ โœ“ โœ“ โœ“NoC In addition, deflection routing is not considered by prior work

We extended our work to account for deflection routing

[1] Ogras et al. "An analytical approach for network-on-chip performance analysis." IEEE Trans. on Computer-Aided Design of IC and Systems (2010)

[2] Bogdan et al. "Non-stationary traffic analysis and its implications on multicore platform design." IEEE Trans. on Computer-Aided Design of IC and Systems (2011)

[3] Kiasari et al. "An analytical latency model for networks-on-chip." IEEE Trans. on Very Large-Scale Integration Systems21.1 (2013)

[4] Kashif et al. "Bounding buffer space requirements for real-time priority-aware networks." Design Automation Conference (ASP-DAC), 2014

[5] Qian et al. "A support vector regression (SVR)-based latency model for network-on-chip (NoC) architectures." IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems 35.3 (2016)

[6] Bertsekas et al. Data networks. Vol. 2. New Jersey: Prentice-Hall International, 1992.

[7] Walraevens, Joris. Discrete-time queueing models with priorities. Diss. Ghent University, 2004.

[8] G. Bolch et al. Queuing Networks and Markov Chains. Wiley. 2006

9

10

Overview of the Automated Flow

GenerateAnalytical

PerformanceModel

Analytical modelโ€ข Latencyโ€ข Throughput

Routing Algorithm(e.g., XY, adaptive)

Executable analytical models

Fabric Topology(e.g., Ring, Mesh, Torus)

Input Parameters(e.g.,injection rate, burstiness prob.)

Communication pattern (e.g., all-to-all)

Replace existing simulation models

10

10/8/21

6

11

Background: Queuing Systems

๐œ† = arrival rate, ๐œ‡ = service rate

Server utilization (๐œŒ) = &'

ยง Kendallโ€™s notation for queuing discipline: A/B/mยง Arrival and departure may have different distribution (e.g.

Poisson (M), Deterministic (D) , General (G)).

ยง ๐‘พ๐Ÿ =๐‘น๐Ÿ

๐Ÿ (๐†๐Ÿ

ยง ๐‘พ๐Ÿ = ๐‘น๐Ÿ+๐†๐Ÿ๐‘พ๐Ÿ๐Ÿ (๐†๐Ÿ(๐†๐Ÿ

ยง ๐œŒ* =&!'!

W: average waiting timeT: service timeR: average residual time

Geo/D/1 with T = 2, 1% average error

Simulation vs Analytical for Basic Priority

๐

๐€๐Ÿ

๐€๐Ÿ

Priority Rule: (1>2)

1

2

Q1

Q2

11

13

Limitations of Basic Priority Based Models (1)

Applying basic priority equation for class 3 tokens results in pessimistic solution.

Reason: Not all packets of class 1 will block packets of class 3. Packets of class 3 can occupy the server if a class 2 packets is being served.

๐€๐Ÿ, ๐€๐Ÿ

๐€๐Ÿ‘

๐

๐

๐€๐Ÿ1

2

๐€๐Ÿ

๐€๐Ÿ‘3 3

1 1 2 1

Split on high priority๐

๐€๐Ÿ

๐€๐Ÿ

Priority Rule: (1>2)

1

2

Reminder: basic priority

Q1

Q2

13

10/8/21

7

14

Limitations of Basic Priority Based Models (2)

๐€๐Ÿ

๐€๐Ÿ, ๐€๐Ÿ‘๐

๐€๐Ÿ

๐1

2

๐€๐Ÿ

๐€๐Ÿ‘2 3

1 1

2

1

3

๐๐€๐Ÿ

๐€๐Ÿ

Priority Rule: (1>2)

1

2

Reminder: basic priority

Split on low priority

Applying basic priority equation for class 3 tokens results in optimistic solution.

Reason: Packets of class 2 will have effect of class 1 indirectly as class 2 packets have to wait due to class 3 packets

Q1

Q2

14

15

ยง Extend decomposition method [1] to handle priority arbitration based multi-class networks in industry

ยง We identified two transformationsโ€“ ST: structural transformationโ€“ RT: service rate transformation

ยง Complex priority-based networks are decomposed iteratively to systems of equivalent queues using ST/RT

ยง Obtain a closed form analytical expression for the equivalent system

[1] G. Bolch et al. Queuing Networks and Markov Chains. Wiley. 2006

Proposed Network Transformations

15

10/8/21

8

16

Structural Transformation (ST)

๐

๐€๐Ÿ‘

๐€๐Ÿ

๐€๐Ÿ๐€๐Ÿ‘

๐1

2

ยง Packets of class 3 do not have to wait for all packets present in Q1

โ€“ Packets of class 3 and 2 can be served at a same timeยง Packets of class 3 will wait only when the server is

busy serving class 1 packetsโ€“ Need to decompose class 1 and class 2 packets

ยง Proposed transformation separates class 1 packets and puts them in a virtual queue (๐๐Ÿ- )

ยง Equivalence is demonstrated by the result shown on the right

๐๐Ÿ$

Q1

Q2

๐

๐

๐€๐Ÿ

๐€๐Ÿ

1

2

๐

Comparison of Original and Modified Network

๐€๐Ÿ, ๐€๐Ÿ Q1

Q2๐€๐Ÿ‘

๐€๐Ÿ, ๐€๐Ÿ

16

17

Analytical Model after ST

where ๐ถ/%0 : obtained from decomposition technique

๐๐Ÿ$

Q1

Q2

๐

๐

๐€๐Ÿ, ๐€๐Ÿ

๐€๐Ÿ‘

(๐€๐Ÿ, ๐‘ช๐‘ซ๐Ÿ)

๐€๐Ÿ

1

2

๐

ยง ST enables us to derive close-form analytical equationsยง Expression of residual time of class 1 in ๐๐Ÿ-

๐‘…2!!"=12๐œŒ3 ๐‘‡ โˆ’ 1 +

12๐œŒ3๐‘‡ ๐ถ/#

0 + 1 โˆ’ ๐œ‡2

โˆ’๐œŒ2๐œ‡2

ยง Waiting time of class 1 in ๐๐Ÿ-

๐‘Š2!!"=

๐‘…2!!"

1 โˆ’ ๐œŒ2ยง Residual time of class 3

๐‘…3 = ๐‘…2!!" + ๐œŒ2

ยง Finally, waiting time of class 3

๐‘พ๐Ÿ‘ =๐‘น๐Ÿ‘ + ๐†๐Ÿ๐‘พ๐Ÿ

๐‘ธ๐Ÿ"

๐Ÿ โˆ’ ๐†๐Ÿ โˆ’ ๐†๐Ÿ‘

Comparison of Simulation and Analytical Models

Geo/D/1 with T = 2, 3% average error

17

10/8/21

9

18

Service Rate Transformation (RT)

๐€๐Ÿ

๐€๐Ÿ‘

๐€๐Ÿ 1

2

Q1

Q2

(๐€๐Ÿ, ๐‘ช๐‘จ๐Ÿ)

(๐€๐Ÿ, ๐‘ช๐‘จ๐Ÿ),(๐€๐Ÿ‘, ๐‘ช๐‘จ๐Ÿ‘)

๐

๐โˆ—

๐œ‡3 = ๐œ‡, ๐œ‡0 = ๏ฟฝฬ‚๏ฟฝ

ยง Packets in Q1 effectively increase the service time of class 2 packetโ€“ Need to modify the service time of class 2 packet

ยง Challenging to model since not all packets in Q2 will wait for packets in Q1

ยง Insight: Modified service time of class 2 is independent of incoming distribution of class 3

Observations

ยง Decompose priority arbitration by modifying the service time of class 2

ยง Approximate first order moments of modified service rate (4๐)

ยง 4๐ = ๐Ÿ7๐‘ป= ๐Ÿ

๐‘ป9๐šซ๐“, ๐šซ๐“ is computed through busy-

period concept

Proposed Approach

Q1

Q2๐€๐Ÿ, ๐€๐Ÿ‘

18

20

Service Rate Transformation (RT):2nd Moment

๐‘Š0 =๐‘…0 + ๐œŒ2๐‘Š21 โˆ’ ๐œŒ2 โˆ’ ๐œŒ0

๐‘Š0 =;๐‘…0

1 โˆ’ <๐œŒ0+ ฮ”๐‘‡,๐‘คโ„Ž๐‘’๐‘Ÿ๐‘’ <๐œŒ0 = ๐œ†0 ๐‘‡ + ฮ”๐‘‡

Simple Priority Service rate transformation

;๐‘…0 = ๐‘Š0 โˆ’ ฮ”๐‘‡ 1 โˆ’ <๐œŒ0

Decoupled from the busy period. It is common for both traffic classes 2 and 3 in Figure (b), while the ฮ”๐‘‡ term applies only to traffic class 2

Re-arranging the terms to get ;๐‘น๐Ÿ:

๐‘Š0 =;๐‘น๐Ÿ + ๐‘…3

1 โˆ’ ๐œŒ3 โˆ’<๐œŒ0+ ฮ”๐‘‡

๐‘Š3 =;๐‘น๐Ÿ + ๐‘…3

1 โˆ’ ๐œŒ3 โˆ’<๐œŒ0

(a)

(b)

๐

๐€๐Ÿ

๐€๐Ÿ

Priority Rule: (1>2)

1

2

๐

๐๐€๐Ÿ, ๐€๐Ÿ‘

๐€๐Ÿ

๐€๐Ÿ‘

๐€๐Ÿ 1

2 Comparison of Simulation and Analytical Models

Geo/D/1 with T = 24% average error

20

10/8/21

10

21

Model Generation Flow

Input: Injection rates for all traffic classes (๐€), Network topology, Routing algorithmOutput: Waiting time for all traffic classes

For each Queue and traffic class:

Get all classes having higher priority and calculate ๐ถ<

Get reference waiting time expression (๐‘Š=>?) using ๐ถ<

1. Do structural transformation

Calculate modified service time ( D๐‘‡) using ๐‘Š=>?

Calculate effective residual time ( D๐‘…) using D๐‘‡

2. Do service rate transformation

Calculate waiting time (๐‘พ) using ๐€, E๐‘ป and E๐‘น

*A graphical illustration of a representative example given in the next slide

21

22

A Representative Example

RTโ†’ Service Rate TransformationSTโ†’ Structural Transformation

ST๐€๐Ÿ‘, ๐€๐Ÿ’

๐€๐Ÿ๐๐Ÿ

Q1

Q2๐€๐Ÿ“

๐€๐Ÿ‘๐€๐Ÿ, ๐€๐Ÿ’

๐๐Ÿ)

Q3

2

2

1 1

{๐๐Ÿ‘, ๐๐Ÿ“}{๐๐Ÿ, ๐๐Ÿ’}

(b)

๐๐Ÿ๐€๐Ÿ, ๐€๐Ÿ๐€๐Ÿ

๐€๐Ÿ‘, ๐€๐Ÿ’

๐€๐Ÿ, ๐€๐Ÿ

(c)

{๐3โˆ— , ๐@}๐€๐Ÿ“

๐€๐Ÿ‘

Q1

Q2

Q3

๐๐Ÿ’โˆ—๐€๐Ÿ’2

1

{๐๐Ÿ, ๐๐Ÿ}

ST

{๐*โˆ— , ๐,}

๐๐Ÿ)๐€๐Ÿ‘, ๐€๐Ÿ’

๐€๐Ÿ“

๐๐Ÿ‘โˆ—๐€๐Ÿ‘

Q1

Q2

Q3

๐€๐Ÿ‘. ๐๐Ÿ’โˆ—๐€๐Ÿ’

21

(d)

๐€๐Ÿ, ๐€๐Ÿ {๐๐Ÿ, ๐๐Ÿ}

RT

(e)

Fully decomposed

queuing system

Q1

๐€๐Ÿ‘, ๐€๐Ÿ’

๐€๐Ÿ“

Q2

Q3{๐*โˆ— , ๐-โˆ—}

๐๐Ÿ“โˆ—

๐€๐Ÿ, ๐€๐Ÿ{๐๐Ÿ, ๐๐Ÿ}

๐€๐Ÿ‘, ๐€๐Ÿ’

Q1

Q2

{๐*โˆ— , ๐-โˆ—}

๐€๐Ÿ, ๐€๐Ÿ{๐๐Ÿ, ๐๐Ÿ}

RT

๐€๐Ÿ, ๐€๐Ÿ๐€๐Ÿ

๐€๐Ÿ‘, ๐€๐Ÿ’

1

2

๐€๐Ÿ“

๐€๐Ÿ‘๐€๐Ÿ, ๐€๐Ÿ’

Q1

Q2

Q32{๐๐Ÿ‘, ๐๐Ÿ“}{๐๐Ÿ, ๐๐Ÿ’}

(a)

๐€๐Ÿ

\mu๐๐Ÿ

1

๐œ‡Bโˆ—:modified service rate of class-j

22

10/8/21

11

23

ยง We evaluated the proposed analytical models on โ€“ Ringโ€“ Mesh

ยง Simulation parametersโ€“ Simulation length: 10M cyclesโ€“ Warm-up period: 5000 cycles

ยง Traffic loadโ€“ Sweep from a very light load to ๐œ†345โ€“ ๐œ†345 is the injection rate at which the maximum server

utilization is 1

Experimental Setup

xPLORE

*Mandal, Sumit K., et al. "Analytical Performance Models for NoCs with Multiple Priority Traffic Classes." ACM Transactions on Embedded Computing Systems (TECS) 18.5s (2019): 1-21.

23

24

Evaluation on xPLORE: Mesh Topology (Geo Traffic)

ยง We observe less than 4% error between simulation and analysis for meshยง Analytical models are 2-3 orders of magnitude faster than simulation models

๐Ÿ”ร—๐Ÿ” Mesh ๐Ÿ–ร—๐Ÿ– Mesh

ยง Traffic pattern is all to all (i.e. each node is sending tokens to all nodes) with YX routingยง Injection rate for each source destination pair is equal

24

10/8/21

12

25

ยง One variation of scalable Intelยฎ Xeonยฎ

โ€“ 26 tiles with Core + LLC, 2 memory controllers on a 6x6 mesh

ยง Synthetic injectionโ€“ All cores send packets to all caches

ยง Validated latencies for cache-coherency flow

Verification with Intelยฎ Xeonยฎ Scalable Server (Geo Traffic)

2.2% average error

Model Accuracy (%)

LLC Hit Rate (%)

AddressNetwork

Data Network

100 98.8 93.9

50 97.7 98.1

0 97.7 98.0

25

26

ยง Collected real app traces from gem5 in Full System modeโ€“ Applications (16-threaded) from PARSEC suiteโ€“ Average statistics over 1M cycles

Evaluation with Real Applications

Modeling error is always less than 10%

26

10/8/21

13

27

ยง Motivation

ยง Preliminary Work-1: Performance Analysis of NoCsโ€“ Priority-Aware NoCsโ€“ NoCs with Bursty Trafficโ€“ NoCs with Deflection Routing

ยง Preliminary Work-2: Communication-Aware Hardware Accelerator for DNNs

ยง Proposed Workโ€“ Multi-Objective Optimization to Design Optimized NoCโ€“ Accelerator for Graph Convolutional Network (GCN)

ยง Conclusions

Agenda

27

28

ยง Industrial NoCs execute bursty trafficโ€“ Analytical models assuming input traffic as

geometric can not capture burstiness

ยง Generalized Geometric distribution (GGeo) has been studied in academia to address bursty traffic

ยง GGeo can be fully described by the first two moments

ยง Probability of burstiness (๐’‘๐’ƒ) is approximated from input traffic

Analytical Modeling with Bursty Traffic

๐‘ป๐’ƒ= 0

Geo

๐œC ๐œC92

Time

๐‘D

1 โˆ’ ๐‘D

โ€ข Selection of branches is Bernoulli trailโ€ข ๐’‘๐’ƒ: controls the level of burstinessโ€ข ๐’‘๐‘ฎ๐’†๐’: determines the inter-arrival time

between the bulk of arrivals

๐‘I>J

GGeo

28

10/8/21

14

29

Background: Queuing Systems with GGeo Input

ยง Kendallโ€™s notation for queuing discipline: A/B/mยง Arrival and departure may have different distribution (e.g.

Poisson (M), Deterministic (D) , General (G)).

๐œ† = arrival rate, ๐œ‡ = service rate

Server utilization (๐œŒ) = &'

๐

๐€๐Ÿ, ๐’‘๐’ƒ๐Ÿ

๐€๐Ÿ, ๐’‘๐’ƒ๐Ÿ

Priority Rule: (1>2)

1

2

ยง ๐‘พ๐Ÿ =๐‘น๐Ÿ+๐‘ป

๐’‘๐’ƒ๐Ÿ๐ŸL๐’‘๐’ƒ๐Ÿ

๐Ÿ (๐†๐Ÿ

ยง ๐‘พ๐Ÿ =๐‘น๐Ÿ+๐‘ป

๐’‘๐’ƒ๐Ÿ๐ŸL๐’‘๐’ƒ๐Ÿ

+๐†๐Ÿ๐‘พ๐Ÿ

๐Ÿ (๐†๐Ÿ(๐†๐Ÿ

ยง ๐œŒ* =&!'!

W: average waiting timeT: service timeR: average residual time

GGeo/D/1 with ๐‘/% = ๐‘/) = 0.4, T = 2, 1% average error

29

31

Analytical Model after Structural Transformation

where ๐ถ/%0 : obtained from decomposition technique

๐๐Ÿ$

Q1

Q2

๐

๐

(๐€๐Ÿ, ๐‘ช๐‘จ๐Ÿ),(๐€๐Ÿ, ๐‘ช๐‘จ๐Ÿ)

(๐€๐Ÿ‘, ๐‘ช๐‘จ๐Ÿ‘)

(๐€๐Ÿ, ๐‘ช๐‘ซ๐Ÿ)

๐€๐Ÿ

1

2

๐

ยง ST enables us to derive close-form analytical equationsยง Expression of residual time of class 1 in ๐๐Ÿ-

๐‘…2!!" =

12๐œŒ3 ๐‘‡ โˆ’ 1 +

12๐œŒ3๐‘‡ ๐ถ/#

0 + 1 โˆ’ ๐œ‡2

โˆ’๐œŒ2๐œ‡2

ยง Waiting time of class 1 in ๐๐Ÿ-

๐‘Š2!!"=๐‘…2!!"+

๐‘‡๐‘D#1 โˆ’ ๐‘D#

1 โˆ’ ๐œŒ2ยง Residual time of class 3

๐‘…3 = ๐‘…2!!"+ ๐œŒ2

ยง Finally, waiting time of class 3

๐‘พ๐Ÿ‘ =๐‘น๐Ÿ‘ + ๐†๐Ÿ๐‘พ๐Ÿ

๐‘ธ๐Ÿ"+

๐‘ป๐’‘๐’ƒ๐Ÿ‘๐Ÿ โˆ’ ๐’‘๐’ƒ๐Ÿ‘

๐Ÿ โˆ’ ๐†๐Ÿ โˆ’ ๐†๐Ÿ‘

Comparison of Simulation and Analytical Models

GGeo/D/1 with ๐‘/ = 0.4, T = 2, 6% average error

31

10/8/21

15

33

Computing ๐๐Ÿโˆ— , ๐‘ช๐’”๐Ÿโˆ—

ยง We first compute the probability that there is no packet of class-2 in Q2 (๐’‘๐Ÿ(๐ŸŽ)) through entropy maximization method [1]

๐’‘๐Ÿ ๐ŸŽ = ๐Ÿโˆ’๐†๐Ÿโˆ’๐†๐Ÿ๐’๐Ÿ๐Ÿ

๐†๐Ÿ+๐’๐Ÿ๐Ÿยง ๐๐Ÿโˆ— is computed using ๐’‘_๐Ÿ(๐ŸŽ)

๐‘ป๐Ÿโˆ— =๐Ÿโˆ’๐’‘๐Ÿ ๐ŸŽ

๐€๐Ÿ, ๐๐Ÿโˆ— = ๐Ÿ/๐‘ป๐Ÿโˆ—

ยง ๐‘ช๐‘บ๐Ÿโˆ— is computed through the methodology

described in [2]

๐‘ช๐‘บ๐Ÿโˆ— ๐Ÿ =

๐Ÿโˆ’๐†๐Ÿโˆ— ๐Ÿ๐’๐Ÿโˆ’๐†๐Ÿโˆ— โˆ’๐†๐Ÿโˆ—๐‘ช๐‘จ๐Ÿโˆ—

๐†๐Ÿโˆ— ๐Ÿ

[1] Kouvatsos, Demetres, and Irfan Awan. "Entropy maximisation and open queueing networks with priorities and blocking." Performance Evaluation 51.2-4 (2003): 191-227.

[2] Kouvatsos, Demetres D., and Nasreddine Tabet-Aouel. "GGeo-type approximations for general discrete-time queueing systems." Proceedings of the IFIP TC6 Task Group/WG6. 4 International Workshop on Performance of Communication Systems: Modelling and Performance Evaluation of ATM Technology. 1993.

GGeo/D/1 with ๐‘/ = 0.4, T = 2, 7% average error

33

34

Evaluations with Synthetic Bursty Traffic*

ยงAchieved <10% modeling error for rings & mesh with bursty traffic (๐’‘๐’ƒ = ๐ŸŽ. ๐Ÿ’)

6x1 ringAvg Err = 4% 6x6 mesh

Avg Err = 7%

ยง GGeo models without decomposition technique overestimates the latencyยง State-of-the-art models which ignore bursty traffic underestimates the latency

โ€ข *Mandal, Sumit K., et al. "Analytical Performance Modeling of NoCs under Priority Arbitration and BurstyTraffic." IEEE Embedded Systems Letters (2020).

โ€ข [1] Mandal, Sumit K., et al. "Analytical Performance Models for NoCs with Multiple Priority Traffic Classes,"ACM Transactions on Embedded Computing Systems (TECS) 18.5s (2019): 1-21.

[1] [1]

34

10/8/21

16

35

Evaluations with Real Applications with Burstiness

ยง Evaluated the proposed model with real applications with different probability of burstiness (shown on the top of each bar)

6x1 ring

6x6 mesh

ยง Achieve 4% modeling error on average

ยง State-of-the-art model does not consider burstinessโ€’ Errors are as high as 30%

35

36

ยง Motivation

ยง Preliminary Work-1: Performance Analysis of NoCsโ€“ Priority-Aware NoCsโ€“ NoCs with Bursty Trafficโ€“ NoCs with Deflection Routing

ยง Preliminary Work-2: Communication-Aware Hardware Accelerator for DNNs

ยง Proposed Workโ€“ Multi-Objective Optimization to Design Optimized NoCโ€“ Accelerator for Graph Convolutional Network (GCN)

ยง Conclusions

Agenda

36

10/8/21

17

37

Background: Deflection Routing

ยง Packets are deflected if ingress at the junction (Node 10) or the destination (Node 12) is fullโ€“ The deflected packets circulate within the same

row or columnโ€“ Increase congestion towards the source

ยง NoC analytical models need to take deflection into account โ€“ Average latency varies in the range of 6-70 cycles

for an injection rate of 0.25 when deflection probability varies from 0.1 to 0.5

37

38

Analytical Modeling for a System with Single Class

ยง The behavior of deflected packets is modeled by a buffer (๐‘ธ๐’…)

ยง Deflected packets and the packet in the egress queue (๐‘ธ๐’Š) form a priority-aware queuing system

ยง Rate of deflected packets (๐€๐’…๐’Š) and coefficient of variation of deflected packets (๐‘ช๐’…๐’Š๐‘จ ) are computed

Sinkdestination

Deflected traffic

1

Ring buffers (๐‘ธ๐’…)

Sink signal (๐’‘๐’…๐’Š)

๐€๐’Š, ๐‘ช๐’Š๐‘จ

๐€๐’…๐’Š

2

(a)

S

Egress queue(๐‘ธ๐’Š)

Priority arbiter

1

2

Port with alower index has a higher priority

๐‘ธ๐’Š

๐‘ธ๐’…;๐‘บ๐’…๐’Š๐€๐’…๐’Š , ๐‘ช๐’…

๐‘จ๐’Š

๐€๐’Š, ๐‘ช๐’Š๐‘จ;๐‘บ๐’Š

(E๐‘ป๐’Š, E๐‘ช๐’Š๐‘บ)

๐‘ช๐’…๐’Š๐‘ซ

๐‘ช๐’Š๐‘ซ๐‘ช๐’Š๐‘ด

(E๐‘ป๐’…๐’Š , E๐‘ช๐’…๐‘บ๐’Š)

๐Ÿ โˆ’ ๐’‘๐’…๐’Š(To sink)

(b)

๐’‘๐’…๐’Š(To ๐‘ธ๐’…)

38

10/8/21

18

41

Evaluation on 6x6 Mesh with Geometric Input*

*Mandal, Sumit K., et al. "Performance analysis of priority-aware NoCs with deflection routing under traffic congestion." Proceedings of the 39th International Conference on Computer-Aided Design. 2020.[1] Kiasari, Abbas Eslami, Zhonghai Lu, and Axel Jantsch. "An analytical latency model for networks-on-chip." IEEE Transactions on Very Large Scale Integration (VLSI) Systems 21.1 (2012): 113-123.[2] Mandal, Sumit K., et al. "Analytical Performance Models for NoCs with Multiple Priority Traffic Classes." ACM Transactions on Embedded Computing Systems (TECS) 18.5s (2019): 1-21.

ยงAchieved <8% modeling error on average for ๐’‘๐’… = ๐ŸŽ. ๐Ÿ and ๐’‘๐’… = ๐ŸŽ. ๐Ÿ‘

ยง Models without decomposition and without deflection overestimates the latencyยง Models which ignore deflection routing underestimates the latency

[1] [2]

41

42

Evaluation on Real Application (Bursty) with 6x6 Mesh

ยงAchieved <5% modeling error on average for ๐’‘๐’… = ๐ŸŽ. ๐Ÿ and ๐’‘๐’… = ๐ŸŽ. ๐Ÿ‘

[1] Kiasari, Abbas Eslami, Zhonghai Lu, and Axel Jantsch. "An analytical latency model for networks-on-chip." IEEE Transactions on Very Large Scale Integration (VLSI) Systems 21.1 (2012): 113-123.[2] Mandal, Sumit K., et al. "Analytical Performance Models for NoCs with Multiple Priority Traffic Classes." ACM Transactions on Embedded Computing Systems (TECS) 18.5s (2019): 1-21.

Accurate analytical model with bursty traffic

considering deflection routing

[1] [2]

42

10/8/21

19

43

ยง Motivation

ยง Preliminary Work-1: Performance Analysis of NoCsโ€“ Priority-Aware NoCsโ€“ NoCs with Bursty Trafficโ€“ NoCs with Deflection Routing

ยง Preliminary Work-2: Communication-Aware Hardware Accelerator for DNNs

ยง Proposed Workโ€“ Multi-Objective Optimization to Design Optimized NoCโ€“ Accelerator for Graph Convolutional Network (GCN)

ยง Conclusions

Agenda

43

44

ยง Communication volume increases with deeper and denser DNNs

ยง Target hardware: In-memory Computing basedโ€“ Exhibits less latency from off-chip memory accesses

ยง Up to 90% of the inference latency comes from communication

ยง Up to 40% of the energy is consumed by communication

ยง Need for an efficient on-chip communication strategy for DNN accelerators

Bottleneck of the DNN Accelerator: Communication

[1] Chen, Pai-Yu, Xiaochen Peng, and Shimeng Yu. "NeuroSim: A circuit-level macro model for benchmarking neuro-inspired architectures in online learning." IEEE Trans. on CAD of Integrated Circuits and Systems 37.12 (2018): 3067-3080.

44

10/8/21

20

45

ยง A set of nodes are mapped to network-on-chip (NoC) routersโ€“ Need to balance the computation and minimize the communication overhead

Background of In-Memory Computing (IMC)In

put

Out

put

T

R

Tile

Router

Neuron

T

R

T

R

T

R

T

R

T

R

T

R

T

R

T

R

45

46

ยง Hierarchical structure: NoC โ†’ Tiles โ†’ Computing Elements (CE)

Operations in IMC-based Hardware

Step-by-step Operation1. Compute: Tiles with Layer 1 nodes compute their

output2. Communicate: Send their output to the tiles with

Layer 2 nodes3. Compute: Tiles with Layer 2 nodes compute their

outputโ€ฆ.Communication between multiple tiles - Crucial in overall performance

46

10/8/21

21

47

ยง Target: to achieve minimum possible communication latency for a given DNN on IMC-based hardware platform

ยง Example of minimum possible communication latency

Achieving Minimum Possible Communication Latency: Example

Layer 1 โ€“ Layer 2 Layer 2 โ€“ Layer 3Link Cycle Link Cycle

๐ฟ2,2 โˆ’ ๐ฟ0,2 1 ๐ฟ0,2 โˆ’ ๐ฟ3,2 4๐ฟ2,0 โˆ’ ๐ฟ0,0 1 ๐ฟ0,0 โˆ’ ๐ฟ3,0 4๐ฟ2,2 โˆ’ ๐ฟ0,0 2 ๐ฟ0,2 โˆ’ ๐ฟ3,0 5๐ฟ2,3 โˆ’ ๐ฟ0,2 2 ๐ฟ0,0 โˆ’ ๐ฟ3,2 5๐ฟ2,0 โˆ’ ๐ฟ0,2 3 ๐ฟ0,0 โˆ’ ๐ฟ3,0 6๐ฟ2,3 โˆ’ ๐ฟ0,0 3 ๐ฟ0,2 โˆ’ ๐ฟ3,3 6

1,1

1,2

1,3

2,1

2,2

3,1

3,2

3,3

Inpu

ts

Out

puts

Layer 1 Layer 2 Layer 3

(a) (b)

47

48

ยง A single router can not send/receive more than one packet in a cycle. However, multiple routers can send/receive packets in parallel.

ยง Since a router can send/receive only one packet in a cycle, the congestion is minimum i.e., no more than a transaction is scheduled through a particular link.

Key Insights

1,1

1,2

1,3

2,1

2,2

3,1

3,2

3,3

Inpu

ts

Out

puts

Layer 1 Layer 2 Layer 3

48

10/8/21

22

49

ยง Two consecutive layers consists of equal number of routersโ€“ Three routers in this case

ยง Each router can send/receive only one packet at one cycle

ยง In one round of communication each routers in ๐’Œ๐’•๐’‰ layer send a packet to each router in ๐’Œ + ๐Ÿ ๐’•๐’‰ layer

ยง One transaction is finished in 3 cyclesโ€“ Minimum possible latency

Our Solution: Custom Interconnect (A Simple Case)

Link Cycle Transaction๐ฟC,2 โˆ’ ๐ฟ C92 ,2 1 ๐‘…C,2 โˆ’ ๐‘… C92 ,2

๐ฟC,0 โˆ’ ๐ฟ C92 ,0 1 ๐‘…C,0 โˆ’ ๐‘… C92 ,0

๐ฟC,3 โˆ’ ๐ฟ C92 ,3 1 ๐‘…C,3 โˆ’ ๐‘… C92 ,3

๐ฟ(C92),2 โˆ’ ๐ฟ C92 ,0 2 ๐‘…C,2 โˆ’ ๐‘… C92 ,0

๐ฟ(C92),0 โˆ’ ๐ฟ C92 ,3 2 ๐‘…C,0 โˆ’ ๐‘… C92 ,3

๐ฟ(C92),3 โˆ’ ๐ฟ C92 ,0 2 ๐‘…C,3 โˆ’ ๐‘… C92 ,0

๐ฟ(C92),0 โˆ’ ๐ฟ C92 ,2 2 ๐‘…C,0 โˆ’ ๐‘… C92 ,2

๐ฟ(C92),0 โˆ’ ๐ฟ C92 ,3 3 ๐‘…C,2 โˆ’ ๐‘… C92 ,3

๐ฟ(C92),0 โˆ’ ๐ฟ C92 ,2 3 ๐‘…C,3 โˆ’ ๐‘… C92 ,2(a) (b)

1

2

3

1

2

3

๐‘˜V%layer

(๐‘˜ + 1)V%layer

49

50

New Transactions:

Horizontal Link (L1N-L2N) will be active on cycle 1 which does not overlap with any of the transaction of the configuration with (N-1) routers

Upward Vertical Link: New transactions will carry the packet from the router at the position (1,N). The vertical link (L2,N-L2,(N-1)) will be activated on cycle 1. All other new transactions will happen 1 cycle after the last transaction for the configuration with (N-1) routers.

Downward Vertical Link: New transactions will happen only in (L2, (N-1) โ€“L2, N)

Schedules in general:

Horizontal Link: Lk,n-Lk+1,n in cycle 1 carries transaction Tk,n-Tk+1,n

Upward Vertical Link: Lk+1,n-Lk+1,n-1 in mth cycle carries the transaction

Tk,n+m-2-Tk+1,n-1, if the transaction exists

Downward Vertical Link: Lk+1,n-Lk+1,n+1 in mth cycle carries the transaction

Tk, n-m+2-Tk+1,n+1, if the transaction exists

Proof by Induction

1 1โ€ฆโ€ฆ

โ€ฆโ€ฆ

N N

(N-1) (N-1)

๐‘˜V%layer

(๐‘˜ + 1)V%layer

50

10/8/21

23

51

ยง We also construct the schedules when the number of the routers in two consecutive layers are not equal

ยง Detailed proof is shown in [1]

ยง We obtain minimum possible communication latency for a give DNN

Other Cases

1 1โ€ฆ โ€ฆ

N N

(N+j+1)

..

(N+j)

1 1โ€ฆ โ€ฆ

N N

(N+j+1)

(N+j)

..

๐‘˜V%layer

(๐‘˜ + 1)V%layer

๐‘˜V%layer

(๐‘˜ + 1)V%layer

[1] Mandal, Sumit K., Gokul Krishnan, Chaitali Chakrabarti, Jae-Sun Seo, Yu Cao, and Umit Y. Ogras. "A Latency-Optimized Reconfigurable NoC for In-Memory Acceleration of DNNs." IEEE Journal on Emerging and Selected Topics in Circuits and Systems 10, no. 3 (2020): 362-375.

51

52

ยง Performance of the computation fabric and communication fabric is estimated separatelyโ€“ Computation fabric: NeuroSimโ€“ Communication fabric: BookSim

ยง A wide range of DNNs is chosenโ€“ Number of parameters vary from 0.43M-45.6M

ยง Baseline designโ€“ Mesh NoCโ€“ Each tile is connected to a dedicated router

Experimental Setup

52

10/8/21

24

53

ยง Layer-wise improvements shown for VGG-19โ€“ Up to 8ร— improvement in

communication latency

ยง Improvement in communication latency is shown for different DNNsโ€“ Normalized with respect to the

baseline communication latency

ยง Results in up to 25% improvement in inference latency

Experimental Results*

*Mandal, Sumit K. et al. "A Latency-Optimized Reconfigurable NoC for In-Memory Acceleration of DNNs." IEEE Journal on Emerging and Selected Topics in Circuits and Systems 10, no. 3 (2020): 362-375.

53

54

ยง Motivation

ยง Preliminary Work-1: Performance Analysis of NoCsโ€“ Priority-Aware NoCsโ€“ NoCs with Bursty Trafficโ€“ NoCs with Deflection Routing

ยง Preliminary Work-2: Communication-Aware Hardware Accelerator for DNNs

ยง Proposed Workโ€“ Multi-Objective Optimization to Design Optimized NoCโ€“ Accelerator for Graph Convolutional Network (GCN)

ยง Conclusions

Agenda

54

10/8/21

25

55

ยง Majority of the NoC used for DNN accelerators incorporate round robin (RR) arbitration

ยง However, RR arbitration does not provide latency fairness

ยง Weighted Round Robin (WRR) arbitrationโ€“ Example: Packets originate from a source which is far from the destination is

provided higher weights during arbitration

ยง Challenge: Assigning weights to different flows in the NoCโ€“ Distribute weights to the flows to minimize overall latency

ยง Proposal: Construct analytical model of NoCs with WRRโ€“ Use the analytical model to perform multi-objective optimization

Proposed Work-1: Optimizing NoC with Analytical Models

55

56

ยง Graph Convolutional Network (GCN) exhibits very high volume of communication

ยง In-Memory Computing (IMC) worsens the communication latency since all memory elements are on-chip

ยง Challenge: Straightforward implementation is unrealisticโ€“ One router for each node of the GCNโ€“ Up to 200 J of communication energy

Proposed Work-2: Communication-Aware Accelerator for GCN

ยง Proposal: Construct an efficient NoC which balances computation as well as communication

56

10/8/21

26

57

ยง Communication is a critical component for recently emerged complex applications

ยง Cycle-accurate NoC simulation is a bottleneck in full system performance evaluation

ยง Constructed fast and accurate analytical model for priority-aware NoCs which can handle multiple classes, bursty traffic and deflection routing

ยง Constructed latency-optimized NoC for DNN accelerator

ยง Next steps:โ€“ Construct analytical models for NoCs with WRR arbitrationโ€“ Communication-aware accelerator for GCNs

Conclusion

57

58

1. Mandal, S. K., Ayoub, R., Kishinevsky, M., & Ogras, U. Y. (2019). Analytical performance models for NoCs with multiple priority traffic classes. ACM Transactions on Embedded Computing Systems (TECS), 18(5s), 1-21.

2. Mandal, S. K., Ayoub, R., Kishinevsky, M., Islam, M. M., & Ogras, U. Y. (2020). Analytical Performance Modeling of NoCs under Priority Arbitration and Bursty Traffic. IEEE Embedded Systems Letters.

3. Mandal, S. K., Krishnakumar, A., Ayoub, R., Kishinevsky, M., & Ogras, U. Y. (2020, November). Performance analysis of priority-aware NoCs with deflection routing under traffic congestion. In Proceedings of the 39th International Conference on Computer-Aided Design (pp. 1-9).

4. Ayoub, R., Kishinevsky, M., Mandal, S. K., & Ogras, U. Y. (2020, November). Analytical modeling of NoCs for fast simulation and design exploration. In Proceedings of the Workshop on System-Level Interconnect: Problems and Pathfinding Workshop (pp. 1-1).

List of Publications

NoC Performance Analysis

Communication-Aware Hardware Accelerator for DNNs5. Mandal, S. K., Krishnan, G., Chakrabarti, C., Seo, J. S., Cao, Y., & Ogras, U. Y. (2020). A Latency-Optimized Reconfigurable

NoC for In-Memory Acceleration of DNNs. IEEE Journal on Emerging and Selected Topics in Circuits and Systems (JETCAS), 10(3), 362-375.

6. Krishnan, G., Mandal, S. K., Chakrabarti, C., Seo, J. S., Ogras, U. Y., & Cao, Y. (2020). Interconnect-aware area and energy optimization for in-memory acceleration of DNNs. IEEE Design & Test, 37(6), 79-87.

7. Krishnan, G., Mandal, S. K., Chakrabarti, C., Seo, J. S., Ogras, U. Y., & Cao, Y. Impact of On-Chip Interconnect on In-Memory Acceleration of Deep Neural Networks. ACM Journal on Emerging Technologies in Computing Systems (JETC), To Appear.

58

10/8/21

27

59

8. Mandal, S. K., Bhat, G., Patil, C. A., Doppa, J. R., Pande, P. P., & Ogras, U. Y. (2019). Dynamic resource management of heterogeneous mobile platforms via imitation learning. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 27(12), 2842-2854.

9. Mandal, S. K., Bhat, G., Doppa, J. R., Pande, P. P., & Ogras, U. Y. (2020). An energy-aware online learning framework for resource management in heterogeneous platforms. ACM Transactions on Design Automation of Electronic Systems (TODAES), 25(3), 1-26.

10. Mandal, S. K., Ogras, U. Y., Doppa, J. R., Ayoub, R. Z., Kishinevsky, M., & Pande, P. P. (2020, July). Online adaptive learning for runtime resource management of heterogeneous SoCs. In 2020 57th ACM/IEEE Design Automation Conference (DAC) (pp. 1-6). IEEE.

11. Gupta, U., Mandal, S. K., Mao, M., Chakrabarti, C., & Ogras, U. Y. (2019). A deep Q-learning approach for dynamic management of heterogeneous processors. IEEE Computer Architecture Letters, 18(1), 14-17.

12. Bhat, G., Mandal, S. K., Gupta, U., & Ogras, U. Y. (2018, November). Online learning for adaptive optimization of heterogeneous SoCs. In Proceedings of the International Conference on Computer-Aided Design (pp. 1-6).

13. Pasricha, S., Ayoub, R., Kishinevsky, M., Mandal, S. K., & Ogras, U. Y. (2020). A survey on energy management for mobile and IoT devices. IEEE Design & Test, 37(5), 7-24.

Other Work: Power Management for Heterogeneous SoCs

59

60

14. Task scheduling with imitation learning (IL): Krishnakumar, A., Arda, S. E., Goksoy, A. A., Mandal, S. K., Ogras, U. Y., Sartor, A. L., & Marculescu, R. (2020). Runtime Task Scheduling Using Imitation Learning for Heterogeneous Many-Core Systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 39(11), 4064-4077.

15. Flexible Hybrid Electronics (FHE): Bhat, G., Gao, H., Mandal, S. K., Ogras, U. Y., & Ozev, S. (2020). Determining mechanical stress testing parameters for fhe designs with low computational overhead. IEEE Design & Test, 37(4), 35-41.

16. 3D Ultrasounds Imaging: Zhou, J., Mandal, S. K., West, B. L., Wei, S., Ogras, U. Y., Kripfgans, O. D., ... & Chakrabarti, C. (2020). Frontโ€“End Architecture Design for Low-Complexity 3-D Ultrasound Imaging Based on Synthetic Aperture Sequential Beamforming. IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

Book Chapter:

17. Mandal S.K., Krishnakumar A., Ogras U. Y. (2021). Energy-Efficient Networks-on-Chip Architectures: Design and Run-Time Optimization: Network-on-Chip Security and Privacy pp55-75, Springer

18. Mandal, S. K., โ€œNetwork-on-Chip (NoC) Performance Analysis and Optimization for Deep Learning Applications,โ€ Technical Report, June, 2021, Available [online] https://sumitkmandal.ece.wisc.edu/data/prelims_report_sumit_v3.pdf

Other Collaborative Work

60


Recommended