Quality-of-Service for Network-on-Chip-based Smartphone/Tablet Systems-on-Chip

by

Kai Feng

A thesis submitted in conformity with the requirements for the degree of Master of Applied Science

Electrical and Computer Engineering University of Toronto

© Copyright by Kai Feng 2012

Quality-of-Service for Network-on-Chip-based

Smartphone/Tablet Systems-on-Chip

Kai Feng

Master of Applied Science

Electrical and Computer Engineering

University of Toronto

2012

Abstract

Smartphone/tablet Systems-on-Chip (SoCs) integrate an increasing number of components to offer more functionality. The capacity and efficiency of data communication between memory and other hardware blocks have become a major concern in SoC design. To address this concern, we propose using Network-on-Chip (NoC) architectures to meet the demands for high bandwidth, low power and small area. We propose a Quality-of-Service (QoS) scheme that differentially provisions network resources to cater to the different performance requirements of different hardware blocks. Implementation and evaluation are performed on a simulation infrastructure that we construct specifically for this type of SoC. We demonstrate, via simulation results, that the proposed Dynamic QoS scheme can achieve better bandwidth provisioning, with good area and power efficiency.

Acknowledgments

This thesis may mark the end of my life in school, but not the end of my journey in pursuing

knowledge. I am very grateful that I have been blessed with support and encouragement from

numerous people.

First I would like to express my sincere thanks to my supervisor, Prof. Natalie Enright Jerger for

her guidance and patience for the past two years. I also want to extend my gratitude to Dr. Serag

Gadelrab and Prof. Andreas Moshovos for their great support in my research.

I want to thank my fellow graduate students in Natalie's research group, for their faithful

comments and suggestions on this project. Especially I owe thanks (as well as apologies) to

Sheng Ma for my infinite consultations.

Last but not least, I am extremely indebted to my parents and Emma for their love and constant encouragement, which got me through many tough moments. I thank them for always being there for me.

Table of Contents

Acknowledgments

Table of Contents

List of Tables

List of Figures

List of Acronyms

Chapter 1 Introduction

1.1 Motivations

1.2 Research Goals

1.3 Thesis Organization

Chapter 2 Related Work

Chapter 3 Simulation Infrastructure

3.1 Interconnect

3.2 DRAM

3.3 Workloads

3.3.1 CPU

3.3.2 Traffic Generator (TG)

3.3.3 Video-Conferencing Workload (VCW)

Chapter 4 Quality-of-Service Schemes

4.1 Hierarchical-Multiplexers Baseline

4.2 Dynamic QoS

Chapter 5 Experimental Evaluation

5.1 Experiment Setup

5.2 Experiment Results

5.2.1 Latencies

5.2.2 Case Study: a micro-experiment

5.2.3 Throughput

5.2.4 Area and Power

Chapter 6 Conclusions

6.1 Future Work

Bibliography

List of Tables

Table A: Main configurations of each hardware block

List of Figures

Figure 1: Exynos 4212

Figure 2: Microarchitecture of a generic credit-based NoC router

Figure 3: Simulation infrastructure for smartphone/tablet SoCs

Figure 4: Markov chain model for parameter collection and request generation

Figure 5: Markov chain TG verification results of address model with different configurations

Figure 6: Verification of self-similar timing model

Figure 7: VCW implementation

Figure 8: 16-node Hierarchical-multiplexers baseline network

Figure 9: 16-node Dynamic QoS network

Figure 10: Zoom-in view of satellite router's outputs and backbone router's inputs

Figure 11: Two examples of step-by-step procedures of token handshakes

Figure 12: Comparison of average latencies of packets in Dynamic QoS with different lengths of buffer queues

Figure 13: Average latencies of packets from hardware blocks associated with VCW

Figure 14: Latency distributions of packets from camera

Figure 15: Latency distributions of packets from display

Figure 16: Latency distributions of packets from encoder

Figure 17: Latency distributions of packets from decoder

Figure 18: Latency distributions of packets from modem

Figure 19: Average latencies of packets from non-VCW hardware blocks

Figure 20: Average total round-trip latencies of packets from VCW hardware blocks

Figure 21: Average round-trip latencies for every 1000 cycles

Figure 22: Average round-trip latencies for every 200 cycles

Figure 23: Network throughputs comparison

Figure 24: Router areas and channel areas

Figure 25: Router power consumptions

List of Acronyms

NoC Network-on-Chip

SoC System-on-Chip

QoS Quality-of-Service

WRR weighted round robin

TG traffic generator

VCW video-conferencing workload

ACK acknowledgement

BE best-effort

GS guaranteed services

UI user interface

ISA instruction set architecture

Chapter 1 Introduction

Smartphones differ from simple voice communication devices in that they offer far superior capabilities, such as navigation and video chatting. In fact, today's smart handheld devices, e.g. smartphones and tablets, enable increasingly sophisticated tasks that previously either required much larger machines or were simply not possible due to a lack of sensors. To achieve this functionality, the number of components integrated onto a system-on-chip (SoC), such as processing cores, specialized accelerators and various sensors, continues to increase.

1.1 Motivations

Figure 1 shows the scale of a smartphone/tablet SoC as of early 2012. The chip is the Exynos 4212 by Samsung Electronics, which is marketed as appropriate for either a smartphone or a tablet [1]. Each component requires that data be communicated between it and other parts of the system. More specifically, the majority of this communication traffic arises because the various processing cores, accelerators and sensors all require access to the same memory module. This forms a unique N-to-1 traffic pattern, in contrast to other types of SoCs, whose network heterogeneity and traffic patterns are rather different. The interconnection network that supports this data supply determines memory latency and memory bandwidth, two key performance factors in a system [2]. Therefore the interconnection network, especially its ability to meet low latency requirements and constraints on both size and power, has become a major concern in today's smartphone/tablet SoC design.

Figure 1: Exynos 4212 SoC by Samsung, featuring a 32nm dual-core ARM Cortex-A9 CPU

1.2 Research Goals

In this research, we investigate network designs for these specific SoCs, to meet the data communication requirements between the DRAM controller and other components. We propose to use many concepts from Network-on-Chip (NoC) architectures, driven by the demands for high bandwidth, low power and small area. In particular, we would like to avoid distributing network resources evenly across all applications or hardware blocks, as they have different performance requirements of the network. Therefore we focus on differentially provisioning network resources to cater to the different requirements of different hardware blocks, by using Quality-of-Service (QoS) schemes.

To implement and evaluate our NoC-based interconnection network and QoS framework design,

we also construct a simulation infrastructure, which is composed of simulators and workloads.

The workloads are implemented by first characterizing a traffic pattern and then abstracting it

into a specific model.

1.3 Thesis Organization

The rest of the thesis is organized as follows. In Chapter 2, we provide an overview of basic NoC

and QoS concepts, and an overview of related work. In Chapter 3, we introduce our simulation

infrastructure, with descriptions of each element, including simulators and workloads. Then in

Chapter 4, we present two QoS schemes, weighted round robin (WRR) and Dynamic QoS, as

well as their corresponding network designs. In Chapter 5, we evaluate these QoS designs

through experiments in our simulation infrastructure. Lastly in Chapter 6, we summarize our

contributions and discuss potential plans for the next step.

Chapter 2 Related Work

With the trend of increasing core counts in a single chip, Network-on-Chip (NoC) architectures

have been proposed [3] [4] and are employed in homogeneous, general-purpose chips [5] [6] [7]

[8], to provide high bandwidth and scalable on-chip interconnection networks. In a NoC

structure, one or several cores or memory controllers are bound with a router [9]. Figure 2 shows

the microarchitecture of a generic credit-based NoC router. Traffic injected into the network by these cores or controllers through their router is first packetized, and each packet is then further divided into a head flit, a tail flit and several body flits. Within the network, data is communicated between routers in units of flits. The flits are serialized and reassembled back into packets at their destination. A more comprehensive description of NoC concepts can be found in

Principles and Practices of Interconnection Networks, by Dally and Towles [2].
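The packetization step described above can be illustrated with a minimal Python sketch. The names Flit, packetize and reassemble, as well as the 16-byte flit size, are assumptions made for illustration only; they are not taken from BookSim or from any router described in this thesis.

```python
from dataclasses import dataclass


@dataclass
class Flit:
    packet_id: int
    kind: str       # "head", "body" or "tail"
    payload: bytes


def packetize(packet_id, data, flit_bytes=16):
    """Split one packet's payload into a head flit, body flits and a tail flit.

    Assumption: a payload that fits in a single flit yields only a head flit;
    real routers usually mark such a flit as both head and tail.
    """
    chunks = [data[i:i + flit_bytes] for i in range(0, len(data), flit_bytes)] or [b""]
    flits = []
    for i, chunk in enumerate(chunks):
        if i == 0:
            kind = "head"
        elif i == len(chunks) - 1:
            kind = "tail"
        else:
            kind = "body"
        flits.append(Flit(packet_id, kind, chunk))
    return flits


def reassemble(flits):
    """Rebuild the packet payload from its flits at the destination."""
    return b"".join(f.payload for f in flits)
```

For example, a 64-byte payload with 16-byte flits produces one head flit, two body flits and one tail flit.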

Figure 2: Microarchitecture of a generic credit-based NoC router

There has been an increasing trend in industry to have heterogeneous cores, i.e. different types of

cores, on a chip. AMD's Fusion multi-core processors [10] and the Cell chips [7] developed by

IBM, Sony and Toshiba serve as good examples. There is also a fair amount of research on NoC

that targets heterogeneous networks. Lee et al. [11] hierarchically design and implement a

heterogeneous NoC based on a topology named hierarchical star. They focus on low-power

communication in design levels such as circuits, signaling, channel coding, protocol and

topology, using various power-efficient techniques. Lambrechts et al. [12] provide a power

breakdown analysis for heterogeneous NoCs, and identify the power bottlenecks considering the

platform as a whole. They carefully map an MPEG2 video chain as well as other applications

onto a heterogeneous NoC-based platform, and point out that the global interconnect is not that

critical for a well-optimized mapping. Kreutz et al. [13] employ a mix of 3 types of routers to

optimize heterogeneous NoCs for latency and energy consumption, along with an optimization

algorithm to find optimal placements for application cores.

Another important concept in NoC is Quality-of-Service (QoS). It is defined as service

quantification that is provided by the NoC to the demanding core [14]. In other words, it refers to

reserving and provisioning different resources for applications or traffic streams with different

priorities. Goossens et al. [15] identify two basic QoS classes: best-effort (BE) services, which

improve average resource utilization but offer no commitment, and guaranteed services (GS),

which do. The general shortcoming of BE is its proneness to network congestion [16]. Avasare et

al. [17] present a centralized OS communication management scheme that addresses congestion

of a BE NoC. In the work, control data is immune to congestion due to its own separate NoC. On

the other hand, a good example of GS is Preemptive Virtual Clock [18], which uses packet

preemption and a dedicated ACK network to provide GS, by allocating network bandwidth to

threads or applications. Many actual implementations of NoCs choose a combination of both

basic QoS classes. Bjerregaard et al. [19] propose MANGO, which utilizes allocated virtual

channels to provide connection-oriented GS and connection-less BE routing. Similarly in

Æthereal NoC [20], routers provide both GS and BE services. GS are obtained by means of

TDMA slot reservations. BE traffic makes use of non-reserved slots and of any slots reserved but

not used.

Nevertheless, the functionality and connectivity of most modern NoC designs with QoS do not perfectly match the communication demands of our target SoC architecture. Our system is heterogeneous, containing a large variety of processing units, accelerators and sensors, while most previous NoC research focuses only on homogeneous cores. More recently, QoS for heterogeneous networks has become a more attractive topic to NoC researchers. Murali et al.

[21] exploit the heterogeneity of applications, based on their different communication

requirements and traffic patterns, and map them onto reconfigurable NoCs. Cheng et al. [22]

leverage a heterogeneous interconnect to map different coherence protocol messages onto wires

of different widths and thicknesses. Grot et al. [23] propose a heterogeneous network to support

a thousand connected components with high area and energy efficiency, and strong QoS

guarantees. They reduce router complexity by isolating shared resources in dedicated QoS-

equipped regions of the chip. Bolotin et al. [24] present QNoC, a low-cost customized NoC to

meet QoS requirements. Services are categorized into 4 levels, where signaling has the highest

priority, followed by real time, read/write and block transfer. However, the typical NoC application investigated in those papers is limited to supporting cache coherence protocols in shared-memory multi-core systems. The networks are usually of size N×N, in either mesh or torus topologies, whereas the SoC network we investigate here has a unique N-to-1 communication structure with rather different expected traffic patterns.

Regarding traffic characterization and generation, the related work is as follows. Soteriou et al. [25]

propose an empirically-derived statistical traffic model for NoCs. The model exposes both

spatial and temporal dimensions of traffic via 3 statistical parameters: hop count, burstiness, and

packet injection distribution. Hestness et al. [26] propose collecting application traces while preserving dependencies between network messages. They introduce Netrace, a trace-based simulation platform that achieves high fidelity by enforcing these dependencies. Again, the work mentioned above, including the benchmarks used to generate traces, targets homogeneous general-purpose interconnection networks. Although not focusing on NoCs, Gutierrez et al. [27] make an

interesting analysis of smartphone applications. They measure a variety of mobile applications

for audio, video, and interactive gaming. They conclude that the characteristics of these

applications markedly differ from those of general-purpose benchmarks. We adopt their BBench,

a web-browser benchmark, as the CPU workload in this research.

Chapter 3 Simulation Infrastructure

As the first step of the project, we have constructed a simulation infrastructure for

smartphones/tablets, as shown in Figure 3. This infrastructure allows us to simulate and evaluate

modern Android platform workloads, as well as the interactions between the on-chip network

and the DRAM controller. Each component of this infrastructure is described as follows.

Figure 3: Simulation infrastructure for smartphone/tablet SoCs

3.1 Interconnect

We implement our QoS schemes and their corresponding interconnect topologies in BookSim

2.0, a cycle-accurate interconnection network simulator [2]. We have made proper modifications

to channels, routers and routing functions to suit our designs. In addition, instead of using the

built-in synthetic traffic patterns, we have implemented an interface between BookSim and all

the other simulators or workloads (to be described in the following sections) to provide real-life

traffic patterns. At the interface, each hardware block can either inject available memory requests/responses into the network or receive them from it at each clock cycle. We adopt an open-loop measurement configuration [2], which incorporates an injection queue of infinite length at each network interface. These queues isolate the traffic processes from the network itself so that the traffic patterns are kept as originally specified. Since in Chapter 5 we evaluate our network designs on traffic patterns that we specify in various workloads, we use open-loop measurements for both latency and throughput.
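The role of the infinite injection queue can be illustrated with a small Python sketch. This is only a sketch of the open-loop idea, not the actual BookSim interface: the class name OpenLoopSource, the tick method and the network_ready flag are hypothetical. The key property is that requests are generated on the workload's schedule regardless of whether the network can accept them.

```python
from collections import deque


class OpenLoopSource:
    """Hypothetical network-interface model with an unbounded injection queue."""

    def __init__(self, schedule):
        # schedule: sorted cycles at which the workload creates a request
        self.schedule = deque(schedule)
        self.queue = deque()  # unbounded, so generation is never back-pressured

    def tick(self, cycle, network_ready):
        # Generation depends only on the workload, never on network state.
        while self.schedule and self.schedule[0] <= cycle:
            self.queue.append(self.schedule.popleft())
        # Injection happens only when the network can accept a packet.
        if network_ready and self.queue:
            created = self.queue.popleft()
            return created, cycle - created  # (creation cycle, source queueing delay)
        return None
```

Because the time spent in the source queue is counted from the creation cycle, a congested network shows up as longer measured latency rather than as a distorted traffic pattern.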

3.2 DRAM

We choose DRAMSim2 to model the DRAM memory controller of the system. It also models memory channels, DRAM ranks and banks, and provides timing for memory accesses based on its configuration [28]. It continuously monitors the network for memory read/write requests and transforms them into memory transactions. When a transaction completes, the generated response is packetized and sent back to the requester.

3.3 Workloads

3.3.1 CPU

GEM5, which can run in full-system mode, is chosen to simulate the CPU [29]. It provides CPU models for various ISAs; ARM best suits our case since most smartphone/tablet SoCs adopt ARM processors. We set it to boot the Android operating system and run BBench, a web-page rendering benchmark specifically for Android [27]. Due to some technical difficulties, instead of directly connecting GEM5 with BookSim, we have collected memory access traces of GEM5 running BBench, and use them as the CPU traffic model. Although we lose some fidelity due to the lack of dependency information, the traces still provide a usable CPU traffic pattern. In addition, dependencies make the CPU self-throttling: when the network becomes congested, the CPU becomes idle because few or none of its memory requests are fulfilled, which in turn reduces network congestion [2]. However, QoS is generally most useful during network congestion, which self-throttling tends to avoid.

3.3.2 Traffic Generator (TG)

Smartphone SoCs are heterogeneous networks which, in addition to the “traditional” on-chip elements mentioned above, also contain many specialized hardware elements such as a video encoder or a camera. When simulating the entire network, each of these elements also needs to be properly modeled. The most direct way is to perform complete hardware simulation, but it is usually slow and heavy on system resources. A comparatively simpler solution is trace replay. However, trace files tend to be very large. In addition, full simulation and trace replay share the same problem: for most hardware blocks in a smartphone SoC, it is impossible to find or implement a specific simulator that we could use directly or collect traces from.

Let’s revisit the reason we need to simulate the entire SoC. The most important goal of this research is to investigate Quality-of-Service schemes for this particular type of network. We use those hardware blocks as sources that inject traffic into the network. In this case, it is the traffic patterns that matter the most, while absolute precision in bytes for addressing or in exact cycles for timing is not as important. Therefore, we have implemented a series of probabilistic traffic generators that mimic the traffic patterns of the specialized hardware blocks in a lightweight way, rather than reproducing their exact memory traces. Memory requests are generated based on a selection of input parameters, e.g. injection frequency and correlation probabilities, which are either trained from real traffic traces or based on technical white papers.

One major kind of traffic generator we have implemented uses a history-based Markov chain address model as the basis for collecting parameters (e.g. probabilities) and reproducing the traffic pattern. A Markov chain is a finite-state system in which the probability of a transition from one state to another depends on the current state [30]. As shown in Figure 4, each node in the chain represents a particular sequence of loads and stores, with probabilistic transitions (edges) representing the type (load/store) of the next memory request. A queue of the latest address histories is maintained, and is updated after each request is learned or generated. The address of the next memory request is determined by its historical correlations with previous requests in the queue, because the probability of re-accessing the latest memory addresses is high, especially in photographic applications, where dependencies on previous pixels/frames are common. In addition, the correlation with the row adjacent to the current row is also evaluated, because accesses to the next row are expected to be fairly regular as well. Therefore, when generating new requests, the predicted address will either be the same address (or row+1) as one of the requests maintained in the history queue, or be random.
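The generation side of this model can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the thesis implementation: the class name MarkovTrafficGenerator, the 2048-byte row size, the address-space size, and the choice to apply the row+1 case only to the most recent request are all assumptions, and unseen request-type histories fall back to a read probability of 0.5.

```python
import random
from collections import deque


class MarkovTrafficGenerator:
    """History-based request generator (illustrative sketch)."""

    def __init__(self, read_prob, addr_probs, p_next_row, type_history,
                 addr_history, row_bytes=2048, addr_space=1 << 30, seed=0):
        self.read_prob = read_prob    # tuple of recent types -> P(next is a read)
        self.addr_probs = addr_probs  # P(reuse the i-th most recent address)
        self.p_next_row = p_next_row  # P(access the row after the latest one)
        self.types = deque(type_history, maxlen=len(type_history))
        self.addrs = deque(addr_history, maxlen=len(addr_probs))
        self.row_bytes = row_bytes
        self.addr_space = addr_space
        self.rng = random.Random(seed)

    def next_request(self):
        # Request type: one Markov transition out of the current history state.
        state = tuple(self.types)
        kind = "R" if self.rng.random() < self.read_prob.get(state, 0.5) else "W"

        # Address: reuse a recent address, step to the next row, or pick randomly.
        u = self.rng.random()
        acc = 0.0
        addr = None
        for p, old in zip(self.addr_probs, reversed(self.addrs)):
            acc += p
            if u < acc:
                addr = old
                break
        if addr is None:
            if self.addrs and u < acc + self.p_next_row:
                addr = (self.addrs[-1] + self.row_bytes) % self.addr_space
            else:
                addr = self.rng.randrange(self.addr_space)

        # Update both history queues after each generated request.
        self.types.append(kind)
        self.addrs.append(addr)
        return kind, addr
```

The remaining probability mass, 1 minus the sum of addr_probs and p_next_row, is implicitly the probability of a random address, which matches the constraint that all address probabilities sum to 1.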

Figure 4: Markov chain model for parameter collection and request generation. (a) shows a

sample of history-of-two Markov chain, with transitions representing next request to be

read/write. (b) is the queue of probabilities of the next request’s address equal to one of the

latest memory requests’ address. The sum of the probabilities, including the probabilities of the

next request address being random and row+1, is 1.

This Markov-chain address model is validated by collecting parameters twice. The first collection is taken from a benchmark’s traffic and used to configure a traffic generator; the second collection is taken from the traffic pattern that this generator reproduces. The idea is to compare the two sets of parameters and examine their similarity. The example hardware block we select is an h.264 video encoder, which is modeled by the 464.h264ref benchmark in SPEC CPU2006 [31] with the foreman_ref_encoder_baseline input. Parameters, such as read ratio and address correlations, are collected with the PIN binary instrumentation tool [32], which collects memory accesses (loads/stores) from applications and feeds them into the Markov-chain model. The model determines parameters on-the-fly, then saves them to a file.
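A simplified view of parameter collection is sketched below in Python. It assumes the trace has already been captured as a sequence of (type, address) pairs; the function name collect_parameters, the history depths, and the 2048-byte row size are hypothetical, and the row+1 check is made only against the most recent request as a simplifying assumption.

```python
from collections import Counter, defaultdict, deque


def collect_parameters(trace, type_history=2, addr_history=2, row_bytes=2048):
    """Estimate read ratios per history state and address-correlation probabilities.

    trace: iterable of (kind, addr) pairs with kind in {"R", "W"}.
    """
    reads = defaultdict(int)
    visits = defaultdict(int)
    addr_hits = Counter()
    types = deque(maxlen=type_history)
    addrs = deque(maxlen=addr_history)
    total = 0

    for kind, addr in trace:
        # Read/write statistics for the current history state (Markov node).
        if len(types) == type_history:
            state = tuple(types)
            visits[state] += 1
            reads[state] += int(kind == "R")

        # Address correlation with the history queue, the next row, or neither.
        if addrs:
            total += 1
            recent = list(reversed(addrs))  # index 0 is the latest request
            if addr in recent:
                addr_hits[recent.index(addr)] += 1
            elif addr // row_bytes == recent[0] // row_bytes + 1:
                addr_hits["row+1"] += 1
            else:
                addr_hits["random"] += 1

        types.append(kind)
        addrs.append(addr)

    read_prob = {s: reads[s] / visits[s] for s in visits}
    addr_prob = {k: v / total for k, v in addr_hits.items()} if total else {}
    return read_prob, addr_prob
```

Running the same function on the original trace and on the generator's output yields the first and second collections that the validation compares.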

Figure 5: Markov chain TG verification results of address model with different configurations. Panels (a) and (b) plot average deviation against (R/W history depth; address history depth) pairs ranging from (4;4) to (9;18).

Experimental results for the Markov chain address model are obtained for 6 different request sequence history lengths (from 4 to 9), with address history lengths either equal to or twice as large as the request sequence length. For each pair of history values, the test is run 3 times and the results are averaged. Mean square error is also calculated across runs for the same history pairs given the same input parameters, to examine whether the traffic generated is unique to the benchmark input. The results show that the mean square error is either close or equal to 0 in almost all cases, meaning the output parameters are consistent. The comparison of parameters is shown in Figure 5, where (a) shows the difference between the first and second collections regarding read/write speculation from each node. The y-axis is the mean square error, with respect to the average, between the read ratio parameters in the first and second collections, averaged across all nodes. It shows that the error stays constant until the request sequence history length reaches 8, when it starts to grow. This is most likely due to the increasing number of possible histories, which amplifies the impact of noise and randomness. Figure 5(b) shows a similar metric for address correlation percentages, with a trend that performance improves with both longer request sequence and address histories. One particular detail to note is that while a longer request type history lowers the error rate, a longer address history has a much stronger impact, as can be seen by comparing the values for combinations (4, 8) and (8, 8). The address history length is the same in both cases, but combination (4, 8) provides a lower error rate due to its smaller number of overall options, which provides better robustness and less sensitivity to noise.
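As a rough illustration of how two parameter collections can be compared, the Python sketch below computes a plain mean square error over the union of their keys. This is an assumption about the form of the metric, not the exact deviation measure plotted in Figure 5; the function name and the treatment of missing keys as 0.0 are hypothetical.

```python
def mean_square_error(first, second):
    """Mean square difference between two parameter dictionaries.

    Keys missing from one collection are treated as probability 0.0
    (an assumption made for this sketch).
    """
    keys = set(first) | set(second)
    if not keys:
        return 0.0
    return sum((first.get(k, 0.0) - second.get(k, 0.0)) ** 2 for k in keys) / len(keys)
```

A value close to 0 indicates that the generator reproduces the collected parameters faithfully.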

In addition to the address model, we have also modeled injection behavior in the time domain. The timing model varies across different types of hardware blocks. For example, some tend to be of a streaming type, where packets are sent intensely and periodically, while others may show self-similarity, meaning that zooming in on part of the pattern reveals a pattern similar to the overall one [33]. Therefore, the timing parameters generally need more manual tuning than the parameters of the address model. Figure 6 shows an example of an h.264 video encoder’s streaming and self-similar pattern, along with the output from a tuned traffic generator. The y-coordinate shows the time at which the request at the corresponding x-coordinate is issued.

Figure 6: Verification of self-similar timing model (x-axis: requests, from the 1st to the 1,000,000th; y-axis: issue time in cycles)

3.3.3 Video-Conferencing Workload (VCW)¹

¹ This work was done by Goran Narancic, a fellow M.A.Sc. candidate supervised by Prof. Andreas Moshovos.

This workload simulates the traffic injected into the network by the combination of hardware blocks involved in a video-conferencing application. It runs through a series of 1080p HD video frames. The workload has two sections: outgoing video and incoming video. The outgoing video section simulates the following procedure. The camera writes a frame to DRAM. Then the h.264 encoder reads this frame, encodes it, and writes the encoded frame to DRAM. Finally, the modem reads this encoded frame (to send it out via the antenna). The incoming video section simulates roughly the opposite order. The modem writes an encoded frame (received via the antenna) to DRAM. Then the decoder reads and decodes this frame, and writes it back to DRAM. Finally, the display reads this decoded frame. The sections are modeled by the h.264 reference implementation encoding or decoding frames, while the PIN tool is used to capture the memory requests it produces. Figure 7 shows how the memory requests are captured. The captured traffic is then used to create memory request streams from the camera, display and modem.
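The ordering of memory operations in the two sections can be written down as a small Python sketch. The step tables and the function vcw_steps are hypothetical names, and the strict per-frame sequencing of the outgoing section before the incoming section is a simplifying assumption; the text does not state whether the two sections run concurrently.

```python
# Each step is (hardware block, DRAM operation, frame buffer).
OUTGOING = [
    ("camera", "write", "raw"),
    ("encoder", "read", "raw"),
    ("encoder", "write", "encoded"),
    ("modem", "read", "encoded"),
]
INCOMING = [
    ("modem", "write", "encoded"),
    ("decoder", "read", "encoded"),
    ("decoder", "write", "decoded"),
    ("display", "read", "decoded"),
]


def vcw_steps(num_frames):
    """Yield (frame, section, block, operation, buffer) in workload order."""
    for frame in range(num_frames):
        for section, steps in (("outgoing", OUTGOING), ("incoming", INCOMING)):
            for block, op, buf in steps:
                yield frame, section, block, op, buf
```

Each yielded step would correspond to a burst of DRAM read or write requests from the named hardware block.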

Figure 7: VCW implementation

Chapter 4 Quality-of-Service Schemes

To accommodate the different performance requirements of different hardware modules during resource contention, we design networks that can perform two different QoS schemes: a baseline scheme, and our proposed dynamic scheme.

4.1 Hierarchical-Multiplexers Baseline

We build our baseline topology as a straightforward mapping of the N-to-1 communication structure, combined with the idea of decentralized arbitration. If we were to use a centralized arbiter to arbitrate packets from all the hardware blocks, both the size and the complexity of the arbiter would become unacceptably large, and the arbitration process would be very slow. Instead we use 3 levels of routers to fit the current scale (approximately 16 nodes) of smartphone SoC networks. Routing is simple, since paths are fixed. Inside each router, switch (a.k.a. crossbar) arbitration is performed independently and locally among a relatively small number of traffic streams. In this design, since the switches have 2 or 3 inputs and only 1 output, they function as multiplexers. This hierarchical-multiplexers network design is shown in Figure 8.

Figure 8: 16-node Hierarchical-multiplexers baseline network

We choose weighted round robin (WRR) [33] as our baseline QoS scheme to compare with our proposed QoS scheme. Each traffic stream injected from a hardware block is called a service in this context. WRR schedules services according to pre-assigned weights and is an effective and relatively lightweight method. It also includes a simple starvation avoidance mechanism that monitors how long packets have been waiting. In general, it is easy to implement and tune.
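The arbitration described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the arbiter used in the thesis: the class name `WrrArbiter`, the credit-based reading of the weights, and the `max_wait` starvation threshold are all hypothetical choices made for the example.

```python
from collections import deque


class WrrArbiter:
    """Illustrative weighted round-robin arbiter for one multiplexer-like router.

    Assumptions (not from the thesis): each input port may receive up to
    `weight` grants per round, and a packet that has waited longer than
    `max_wait` cycles is served ahead of the normal order.
    Weights are expected to be positive integers.
    """

    def __init__(self, weights, max_wait=1000):
        self.weights = list(weights)
        self.credits = list(weights)
        self.queues = [deque() for _ in weights]  # each entry is an arrival cycle
        self.max_wait = max_wait
        self.pointer = 0

    def enqueue(self, port, cycle):
        self.queues[port].append(cycle)

    def grant(self, cycle):
        """Return the port granted in this cycle, or None if nothing is waiting."""
        waiting = [p for p, q in enumerate(self.queues) if q]
        if not waiting:
            return None
        # Starvation avoidance: serve the longest-waiting overdue packet first.
        starved = [p for p in waiting if cycle - self.queues[p][0] > self.max_wait]
        if starved:
            port = max(starved, key=lambda p: cycle - self.queues[p][0])
            self.queues[port].popleft()
            return port
        # Start a new round once every waiting port has used up its credits.
        if all(self.credits[p] == 0 for p in waiting):
            self.credits = list(self.weights)
        n = len(self.queues)
        for i in range(n):
            p = (self.pointer + i) % n
            if self.queues[p] and self.credits[p] > 0:
                self.credits[p] -= 1
                self.pointer = (p + 1) % n
                self.queues[p].popleft()
                return p
        return None


# Two always-backlogged inputs with weights 3 and 1.
arbiter = WrrArbiter([3, 1])
for _ in range(20):
    arbiter.enqueue(0, 0)
    arbiter.enqueue(1, 0)
grants = [arbiter.grant(cycle) for cycle in range(8)]
```

With both inputs backlogged, the higher-weight input receives three grants for every grant given to the other input, which is the differential provisioning that the baseline relies on.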

Priorities, i.e. weights in WRR, of packets from different hardware blocks are assigned based on both industry insights and observations from experiments. Because we assume a VCW application is running, we intentionally prioritize services from the hardware blocks involved. Among these services, as will be shown in Section 5.2.1, we find that the latency of camera packets is relatively more vulnerable to network congestion, so we assign its packets top priority. The encoder is the bottleneck of the VCW workload because it is computationally heavy, so we also assign its packets a high priority. Beyond VCW, several other services are also crucial to the performance of the entire system. For example, user interface (UI) traffic from the GPU is important to user experience, so this service is prioritized. More detail on priority assignment can be found in Table A in Chapter 5. We should note that if any assumption about a hardware block's priority were wrong, it would not affect the correctness of our design. As long as our experimental results demonstrate the expected QoS functionality, even under faulty priority assumptions, our design can easily be adjusted to any service priority order.

To further ease the burden of arbitration at each subsequent level, we group services with similar priorities together and bind them to a specific level-1 router, as shown in Figure 8. For instance, camera, encoder and decoder have the highest priorities of all, so they all inject their packets through R3. From R3 to R7, the priorities of the packets from the attached hardware blocks decrease. For this reason, R1 can always assume that services from R3 take precedence over services from R4, and then R5. The same rule applies in R2, and further in R0.

The following example illustrates the benefit of applying QoS. Suppose the display and the streaming TG inject traffic at the same rate. In the hierarchical-multiplexer network without QoS, each router uses round-robin arbitration. This absolute fairness provisions 1/18 of the network resources to the display traffic but 1/12 to the less critical streaming TG traffic. With such a small portion of network resources, the data supply from DRAM to the display may be delayed, which would jeopardize user experience. When the display is assigned a WRR weight much higher than that of the streaming TG, it can reserve enough network resources to meet its tight performance requirement.
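The two fractions follow from multiplying the per-level round-robin shares along each path. The sketch below is a worked check under an assumed fan-in: it supposes that every level-1 router has 3 inputs, that the display's level-2 router has 3 inputs while the streaming TG's has 2, and that R0 has 2 inputs. These fan-ins are inferred from the 1/18 and 1/12 figures and from the statement that switches have 2 or 3 inputs; the function name `rr_share` is illustrative.

```python
from fractions import Fraction


def rr_share(fan_ins):
    """Share of DRAM bandwidth one source gets when every router on its
    path splits its output equally among `fan_in` backlogged inputs."""
    share = Fraction(1)
    for fan_in in fan_ins:
        share /= fan_in
    return share


# Assumed path fan-ins, ordered level-1 router, level-2 router, R0.
display_share = rr_share([3, 3, 2])
streaming_share = rr_share([3, 2, 2])

print(display_share, streaming_share)
```

Under these assumptions the less critical stream receives 1.5 times the share of the display, purely because of where it attaches to the tree.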

4.2 Dynamic QoS

The baseline topology, i.e. hierarchical-multiplexers, with static priority assignment is a simple and effective solution to the Quality-of-Service requirements of smartphone SoCs. However, it suffers from a major limitation, illustrated in Section 5.2.2: because each packet's priority and path are fixed, flow control cannot guarantee that router and channel resources are evenly utilized. For example, in VCW, when the local user decides to stop filming, the camera and the encoder are turned off. The resources in R5, e.g. buffers, are then underutilized, while the active services still contend fiercely for resources at the other routers. The situation is even worse when the entire video-conference workload is shut down. Therefore, the system needs more flexibility to adapt to different task combinations.

The scheme we propose in this project is called Dynamic QoS. It is based on the observation that in most cases not all services are active, so network resources should be allocated dynamically to the active services. In the topology, shown in Figure 9, one original input-queued router R0 is connected to DRAM at one end and, at the other end, to three slightly modified routers that we call backbone routers. Inside these four routers, weighted-round-robin arbitration is used, with the weights assigned to the ports decreasing from the topmost port to the bottom port. Instead of the level-1 routers of Figure 8 and the dedicated channels connecting each of them to its unique level-2 router, Dynamic QoS adopts an intermediate network between the hardware blocks' inputs and the backbone routers. This intermediate network could be fully connected, which allows packets from each hardware block to travel to different backbone routers via different paths. Each green arrow in Figure 9 represents a bundle of such paths, including a data channel from each network input to the backbone router.

Figure 9: 16-node Dynamic QoS network

At each hardware block's network input, as shown in Figure 9, there is control logic that directs packets through the appropriate output port. The number of output port options depends on the number of downstream backbone routers. For convenience of explanation, we regard this control logic and its ports together as a satellite router, even though it is not a typical router: it neither buffers request packets nor contains a large crossbar. No buffers are needed because there is no direct contention between requests from different network inputs. In fact, it would be more accurate to treat a satellite router as an extension of the injection channel from a hardware block to a backbone router. Hardware blocks are grouped in threes based on their priority, and each group is bound to a satellite router. Packets from R4 therefore have the highest priority, and packets from R8 have the lowest. As in the baseline, backbone routers are also assigned different priorities, with R1 being the highest and R3 the lowest. Note that it does not make sense to send high-priority packets to low-priority backbone routers, so R4 is connected only to R1. It is also nearly impossible for packets with the lowest priorities to reach R1 or R2, as will be explained later. After eliminating the redundant channels from R7 to R1, and from R8 to R1 and R2, the intermediate network is partially connected (as opposed to fully connected) but still fully functional, as shown in Figure 9.

Each backbone router keeps a number of tokens. A token represents the resources available to accommodate one service, i.e. one traffic stream from one hardware block. The maximum token count is preset based on the maximum number of services the router is expected to accommodate at one time. Each satellite router keeps a record of the current token count at each downstream backbone router. Each satellite router also monitors the injection activity of each local service, i.e. the input from each connected hardware block. For example, if after a certain time interval a satellite router detects that a service has “woken up” from silence and started to inject packets regularly, it redirects this service to the highest-priority downstream backbone router that has at least one token, and informs that backbone router that it needs to consume one token for the new service. The backbone router then decreases its token count by one and broadcasts this change to all its subscribing satellite routers, which update their local token records. Similarly, if a satellite router determines that a service has changed from active to inactive, it informs the service's backbone router to increase its token count by one, meaning that it can now accommodate one more active service. The backbone router also informs all its subscribers, i.e. connected satellite routers, of this change, though not immediately, as the following examples explain. The token handshaking signals travel through separate signaling channels, shown as dashed lines in Figure 9 beside their corresponding data channels.
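The basic token bookkeeping can be sketched in a few lines. This is a simplified illustration, not the thesis implementation: the class and method names, the token counts, and the service names are hypothetical, broadcasts are modeled as immediate, and the promotion, demotion and overload cases are left out.

```python
class BackboneRouter:
    """Holds a token count and pushes every change to its subscribers."""

    def __init__(self, name, max_tokens):
        self.name = name
        self.tokens = max_tokens   # free slots for active services
        self.subscribers = []      # satellite routers connected to this router

    def consume(self):
        if self.tokens == 0:
            raise RuntimeError(f"{self.name} has no token left")
        self.tokens -= 1
        self._broadcast()

    def release(self):
        self.tokens += 1
        self._broadcast()

    def _broadcast(self):
        # Simplification: subscribers see the new count immediately.
        for satellite in self.subscribers:
            satellite.record[self.name] = self.tokens


class SatelliteRouter:
    """Tracks downstream token counts and assigns active services."""

    def __init__(self, name, downstream):
        self.name = name
        self.downstream = downstream  # ordered from highest to lowest priority
        self.record = {}              # local copy of downstream token counts
        self.assignment = {}          # service -> backbone router in use
        for backbone in downstream:
            backbone.subscribers.append(self)
            self.record[backbone.name] = backbone.tokens

    def activate(self, service):
        """Send a newly active service to the highest router with a token."""
        for backbone in self.downstream:
            if self.record[backbone.name] > 0:
                backbone.consume()
                self.assignment[service] = backbone
                return backbone.name
        return None  # overload handling is not modeled in this sketch

    def deactivate(self, service):
        """Return the token held by a service that has gone inactive."""
        self.assignment.pop(service).release()


# Hypothetical setup: R1 has one token, R2 has two.
r1 = BackboneRouter("R1", max_tokens=1)
r2 = BackboneRouter("R2", max_tokens=2)
r5 = SatelliteRouter("R5", [r1, r2])
r6 = SatelliteRouter("R6", [r1, r2])

first = r5.activate("display")   # takes the only token at R1
second = r6.activate("tg")       # R1 is full, so falls through to R2
```

Each satellite router decides locally from its own record, so the only shared state is the token count that each backbone router broadcasts.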

To present a clearer picture of the structure, we magnify the output ports of R5, a satellite router, and the input ports of R1, a backbone router, and show them in Figure 10. R5 has two downstream backbone routers, so each hardware block bound to R5 has two possible destinations for its packets. R1, on the other hand, has reserved input ports and buffer queues for all possible services from its upstream satellite routers. Between each pair of connected backbone and satellite routers, there is an independent signaling channel for token handshakes. For every group of services bound to a satellite router, there is only one data channel for DRAM response packets, which is sufficient because these response packets are usually spread out in time due to differences in DRAM access time. As shown by the yellow arrows in Figure 10, R1 sends DRAM response packets to R4, and R5's DRAM response packets come from R2.

Figure 10: Zoom-in view of satellite router's outputs and backbone router's inputs

The following example illustrates the dynamic service rearranging and token handshaking procedures. Suppose a smartphone user decides to switch off only the display during a video-conference call, as shown in the first step of Figure 11(a), while the other hardware blocks keep working. After a certain time interval, R5 detects this change and marks the “display” service inactive. Suppose the “display” service originally went through R1. R5 therefore signals R1, which increases its token count by one. If R1's token count was greater than zero before the adjustment (though this is very unlikely), nothing needs to be done except broadcasting the change to R4-R7, because no other service needs to use R1 anyway. If R1 had zero tokens before the adjustment, R1 now has one available token and needs to query R5 through R7 to find a service that can be promoted from R2 or even R3. R4 is not queried because the services at R4 all have higher priorities than “display”, so they are either already using R1 or inactive. R1 then informs R5 of the available token. Suppose Audio is currently using R2; it is redirected to R1, and R2 now has one more token. A similar procedure is followed to find a new owner for R2's token. In the end, if no service needs the token, R2 broadcasts this to all its subscribers, and R5-R8 increase their local record of R2's token count by one.

Figure 11: Two examples of step-by-step procedures of token handshakes, panels (a) and (b)

On the other hand, suppose the user now switches the display back on, as shown in Figure 11(b), and R5 detects that the display has started to inject packets regularly again, represented by the green arrow. If R5 finds that R1's token count is greater than zero, it assigns R1 as the new downstream router of the “display” traffic. R1 then broadcasts this change to R4-R7, and the procedure is complete. If R1 is already fully reserved by traffic streams from R4, the “display” traffic is redirected to R2. It is possible that R2 was also fully reserved, so that R2 is now temporarily overloaded. In that case, R2 finds its service from the lowest-priority satellite router and sends two pulses via the signaling channel to deactivate and then reactivate this service. The satellite router then recognizes the service as newly activated and finds an appropriate downstream backbone router for it.

Things to note:

• Initially, all services are distributed to backbone routers so that each backbone router uses up its tokens to accommodate the highest-priority available services. For example, R2 has 5 tokens in total, so initially it accommodates all 3 services from R5 and 2 of the services from R6.

• When a satellite router receives more than one status change notification, from either newly activated or newly deactivated services, it handles them in order of priority. Similarly, when more than one newly activated service arrives at a backbone router during the same clock cycle, the router serves them in order of priority.

• Whenever a service is to be promoted or demoted to another backbone router, the satellite router must wait until the tail flit of the current packet has been sent. Otherwise, flits may be reordered, causing a serialization problem when packets exit the network.

• A threshold classifies services as active or inactive, based on the number of packets they send within a specific period of time. An inactive service may therefore still send a limited number of packets to DRAM. While the service is inactive, its packets are routed to the backbone router that was assigned to it when it was last active, until the service becomes active again and a new destination backbone router is decided. During this period, the service is assigned lower priority than the currently active services within the same router.

• The worst case of this token handshake protocol occurs when R1 and R2 are fully subscribed and R1 receives a newly active service from R4. This may subsequently force both R1 and R2 to demote a service to a lower backbone router. Rearranging the services takes 3 rounds of handshakes, plus the cycles spent waiting for the tail flit at each step. However, no packet actually needs to wait this long before being assigned a new backbone router. Instead, each newly active service waits for at most one round of handshakes until its new destination backbone router is decided.
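The threshold-based activity classification in the notes above can be sketched as follows. This is an illustration only: the class name `ActivityMonitor` and the threshold of 4 packets per detection interval are hypothetical, since the text does not state the threshold value.

```python
from collections import Counter


class ActivityMonitor:
    """Per-satellite-router activity detector (illustrative sketch).

    A service counts as active for the next interval if it injected at
    least `threshold` packets during the interval that just ended.
    The threshold value here is an assumption made for the example.
    """

    def __init__(self, services, threshold=4):
        self.threshold = threshold
        self.counts = Counter()
        self.active = {service: False for service in services}

    def on_packet(self, service):
        self.counts[service] += 1

    def end_of_interval(self):
        """Reclassify every service; return only the status changes."""
        changes = {}
        for service, was_active in self.active.items():
            now_active = self.counts[service] >= self.threshold
            if now_active != was_active:
                changes[service] = now_active
                self.active[service] = now_active
        self.counts.clear()
        return changes


monitor = ActivityMonitor(["display", "audio"], threshold=4)
for _ in range(5):
    monitor.on_packet("display")
monitor.on_packet("audio")              # below the threshold, stays inactive
woke_up = monitor.end_of_interval()

monitor.on_packet("display")            # a trickle of packets only
went_quiet = monitor.end_of_interval()
```

Only the status changes returned by `end_of_interval` would trigger token handshakes, so a service that keeps sending a trickle of packets while inactive causes no signaling.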

Dynamic QoS addresses the limitation of the baseline, i.e. the weighted-round-robin hierarchical-multiplexer design discussed previously. As its name suggests, this new QoS scheme dynamically allocates the best available resources to packets from different input sources, and therefore provides higher throughput for the whole network. In addition, satellite routers are much smaller than the level-1 routers of the baseline in terms of buffer area. As the experimental results will show, for similar performance Dynamic QoS reduces router buffer area by 35.2% and router buffer power consumption by 34.5%.

Chapter 5 Experimental Evaluation

We use the simulation infrastructure described in Chapter 3 to run simulations and collect results. This chapter first describes the parameters we use to set up each component in the experiments, and then presents and analyzes the results.

5.1 Experiment Setup

We set the network and TGs to run at 3.2GHz, while the other hardware blocks have different clock frequencies, which are shown in Table A. The table also shows the main configurations of the different simulators and workloads. We include 9 TGs in this network. Two of them model GPU traffic: one for user interface (UI) and the other for 3D graphics (3D). Although both are generated by the same hardware unit, we model these two services separately because they have different traffic patterns and, more importantly, different priorities. Another TG models streaming audio traffic, specifically 320kbps mp3 streaming. The remaining TGs are not assigned specific services. This does not mean they are unimportant. On the contrary, they play critical roles in stressing the network, since the network in a real smartphone system is stressed by a variety of less well-known traffic streams.

Network packets are uniformly 64 bytes long, and each is broken down into 16 4-byte flits. Each data channel is 4 bytes wide, allowing one flit to traverse it in one cycle. At each level of routers in the baseline network, memory request flits incur a 1-cycle switch allocation delay, while memory response flits incur a 1-cycle routing delay. In the Dynamic QoS network, the delay composition is the same, except that satellite routers have a 1-cycle routing delay instead of a switch allocation delay. Also taking the 15-cycle flit serialization delay into account, the zero-load round-trip latency of both the baseline and Dynamic QoS networks is 40 cycles. In Dynamic QoS, the signaling channels for token handshakes are 1 byte wide, which is enough to transfer a signal containing block ID bits and signal type bits within 1 cycle.

Hardware Block | Priority | Modeling   | Main Configurations
---------------|----------|------------|--------------------------------------------------
DRAM           | N/A      | DRAMSim2   | Memory size: 4GB; Frequency: 800MHz DDR3; Bus width: 64 bits; Controller policy: FCFRFS; Row buffer policy: open page
Camera         | 4        | VCW        | Frequency: 160MHz
Display        | 3        | VCW        | Frequency: 160MHz
Encoder        | 4        | VCW        | Frequency: 3.2GHz
Decoder        | 4        | VCW        | Frequency: 3.2GHz
Modem          | 2        | VCW        | Frequency: 800MHz
CPU            | 2        | GEM5 trace | Frequency: 1GHz; Caches: L1i 32KB, L1d 64KB; ISA: ARM; Mode: full-system; Benchmark: BBench
GPU(UI)        | 3        | TG         | Address: Markov-chain; Timing: self-similar
GPU(3D)        | 2        | TG         | Address: Markov-chain; Timing: self-similar
Audio          | 3        | TG         | Address: linear; Timing: streaming
unspecified    | 1        | TG         | Address: linear; Timing: streaming
unspecified    | 1        | TGx5       | Address: Markov-chain; Timing: random

Table A: Main configurations of each hardware block

Figure 12: Comparison of average latencies of packets in Dynamic QoS with different lengths of buffer queues

[Figure 12 bar chart. Y-axis: cycles. Bar labels as extracted, in the order Camera, Display, Encoder, Decoder, Modem: 8 buffers: 100, 117, 96, 116, 94; 10 buffers: 64, 107, 93, 108, 83; 12 buffers: 64, 107, 94, 107, 83.]

Routers have a buffer queue for each of their input ports. In the baseline and in R0 of Dynamic QoS, each buffer queue has 24 buffer slots. In the backbone routers of Dynamic QoS, there are 10 buffer slots in each queue. All the above parameters were chosen by running experiments with different configurations, as in the example shown in Figure 12, where fewer buffers per queue result in performance loss and more buffers bring no obvious benefit. A similar method is used to set the interval of active/inactive service detection to 1000 cycles. Smaller intervals mean finer granularity of control, but return only negligible performance gain.

We use the number of frames to represent the length of a simulation. Typically, we run simulations for 10 frames, which gives a good balance between corner-case coverage and simulation time: just over 1 billion network cycles with more than 40 million memory requests.

5.2 Experiment Results

5.2.1 Latencies

To evaluate how well the network serves each hardware block, we adopt average round-trip latency as the main metric. Lower average latency means smaller memory access delay, since the network functions as part of the data supply architecture. We compare the average round-trip in-network latencies across three network configurations: hierarchical-multiplexers with simple round-robin arbiters, hierarchical-multiplexers with weighted-round-robin arbiters, and Dynamic QoS.

Figure 13: Average latencies of packets from hardware blocks associated with VCW

Figure 13 shows the results for the hardware blocks in VCW. We can see performance gains, i.e. reductions in average latency, over the non-QoS baseline, since we intentionally prioritize the VCW services. After being assigned top priority over all other services, the camera shows the most significant latency reduction, 42.9%. This supports our claim in Section 4.1 that the camera is relatively more vulnerable to resource contention.

Figure 14 to Figure 18 show the latency distributions of packets from the VCW services as histograms. In general, latencies with QoS are more concentrated in the bins close to the zero-load latency, and the number of long-latency packets is reduced by the QoS schemes.
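The 42.9% figure can be checked against the bar labels of Figure 13. The short sketch below assumes that the camera's average latency is 112 cycles under HMux RR and 64 cycles with QoS, which is how the extracted bar labels read if the first bar of each series is the camera; the function name is illustrative.

```python
def latency_reduction_percent(baseline_cycles, qos_cycles):
    """Relative reduction in average latency, in percent."""
    return (baseline_cycles - qos_cycles) / baseline_cycles * 100


# Assumed camera values read from the Figure 13 bar labels.
camera_reduction = latency_reduction_percent(112, 64)
print(round(camera_reduction, 1))
```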

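[Figure 13 bar chart. Y-axis: cycles. Bar labels as extracted, five per series (order assumed from Figure 12: Camera, Display, Encoder, Decoder, Modem): HMux RR: 112, 120, 99, 118, 93; HMux WRR: 64, 114, 95, 110, 82; Dynamic QoS: 64, 107, 93, 108, 83.]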

Figure 14: Latency distributions of packets from camera

Figure 15: Latency distributions of packets from display

Figure 16: Latency distributions of packets from encoder

[Figures 14 to 16: histograms titled "In-Network Latency Distributions" for Camera, Display and Encoder. Y-axis: counts; series: HMux RR, HMux WRR, Dynamic.]

Figure 17: Latency distributions of packets from decoder

Figure 18: Latency distributions of packets from modem

[Figures 17 and 18: histograms titled "In-Network Latency Distributions" for Decoder and Modem. Y-axis: counts; series: HMux RR, HMux WRR, Dynamic.]

Similarly, the other services that are prioritized in the QoS schemes, e.g. GPU(UI), streaming audio and CPU, also perform better than in the non-QoS network. All of these performance gains come at the expense of the low-priority services. In particular, as Figure 19 shows, the average latencies of the streaming TG and the lowest-priority TG5 are severely affected. In general, the results of HMux-WRR and Dynamic QoS shown in both Figure 13 and Figure 19 are fairly close. The reason is that both schemes use weighted-round-robin arbitration in their routers, and the weights assigned to each service's packets are also the same.

Figure 19: Average latencies of packets from non-VCW hardware blocks

[Figure 19 bar chart. Y-axis: cycles (0 to 1000); series: HMux RR, HMux WRR, Dynamic QoS.]

In Figure 20, we include memory access delay in the round-trip delay. Comparing with Figure 13, we find that the in-network round-trip delay is a very important portion of the total delay. However, the performance improvements are partly masked by the so-far unpredictable memory access delays. We also show a comparison with the average deadline that guarantees 30 frames/sec performance for each VCW hardware block. The deadlines are calculated from the number of read/write requests per frame, which differs between hardware blocks. For example, the encoder reads 8MB/frame from memory, and writes back 0.16MB/frame according to the typical 50:1 compression ratio of h.264 video [34]. Since each memory read/write is 64 bytes and the system runs at 3.2GHz, packets may have an average latency of at most 837 cycles to achieve a minimum of 30 frames/sec. The modem only reads or writes encoded frames, which are very small, so we do not show its expected average latency for 30fps. The comparisons show that the average latencies we obtain are all well below their corresponding 30fps deadlines, indicating that 30fps can be achieved. This is also confirmed by calculating the frame rate from the total number of cycles (slightly over 1 billion) spent to finish a 10-frame simulation.
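The encoder deadline can be reproduced with a few lines of arithmetic. The sketch below is one reading of the calculation, not a statement of the exact method used: it assumes that 1 MB is 10^6 bytes and that the per-frame cycle budget is divided evenly over the frame's requests, which yields the 837-cycle figure. The function names are illustrative.

```python
def deadline_cycles(bytes_per_frame, clock_hz=3.2e9, fps=30, request_bytes=64):
    """Average cycles available per request to sustain `fps`,
    assuming the frame's cycle budget is split evenly over its requests."""
    requests_per_frame = bytes_per_frame / request_bytes
    cycles_per_frame = clock_hz / fps
    return cycles_per_frame / requests_per_frame


def frame_rate(frames, total_cycles, clock_hz=3.2e9):
    """Frames per second achieved by a simulation of `total_cycles` cycles."""
    return frames * clock_hz / total_cycles


# Encoder: 8 MB read plus 0.16 MB written per frame (1 MB taken as 10**6 bytes).
encoder_deadline = deadline_cycles(8e6 + 0.16e6)

# A 10-frame run of exactly 1 billion cycles gives an upper bound on the
# frame rate, since the actual run took slightly more than 1 billion cycles.
fps_upper_bound = frame_rate(10, 1e9)

print(round(encoder_deadline), fps_upper_bound)
```

A 10-frame run stays at or above 30 frames/sec as long as it finishes within about 1.067 billion cycles, which is consistent with the reported run length of slightly over 1 billion.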

Figure 20: Average total round-trip latencies of packets from VCW hardware blocks

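[Figure 20 bar chart. Y-axis: cycles. Bar labels as extracted (order assumed from Figure 12: Camera, Display, Encoder, Decoder, Modem): HMux RR: 474, 523, 372, 577, 451; HMux WRR: 471, 510, 374, 571, 437; Dynamic QoS: 458, 508, 360, 567, 451; 30fps deadline (four bars, none for Modem): 853, 2276, 837, 2231.]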

5.2.2 Case Study: a micro-experiment

In real-life scenarios, a high-intensity but low-priority service may suddenly appear and overwhelm network resources. How well and how quickly a QoS scheme responds to such a change should be a key criterion in its design. We therefore design the following micro-experiment to demonstrate that our Dynamic QoS scheme can handle such cases.

In this experiment, R4 has 3 active streaming TGs, each injecting a packet every 100 cycles; R6 and R7 each have another 3 active streaming TGs, each injecting a packet every 20-30 cycles; the other satellite routers have no active services. The active streaming TGs are given different initial waiting periods and different silence intervals, to keep steady, high pressure on DRAM. The silence intervals are small enough not to trigger inactive detection. TG1-TG3 start at cycle 0 and have a silence interval of 200 cycles after every 1500-2500 cycles of continuously sending packets. TG4-TG9 start 300 cycles later and have a silence interval of 80 cycles after every 1000-1500 cycles.

R1 has 3 tokens and R2 has 6 tokens. According to the routing algorithm, all of R4's services are served by R1, and all of R6's and R7's go to R2. Every 1,000 cycles, each satellite router scans locally to detect newly activated or deactivated services, and outputs the average round-trip latency of each service's packets over the last period. We start collecting results after 5,000 cycles to allow for initial stabilization, and set the total simulation time to 45,000 cycles.

Figure 21: Average round-trip latencies for every 1000 cycles

[Figure 21 line chart. Y-axis: Cycles, ticks from 30 to 65; x-axis: Time, ticks every 2,000 cycles from 5,000 to 49,000; series: TG1, TG2, TG3, TG4, TG4*.]

The results are shown in Figure 21. At cycle 20,000, we deactivate TG1 at R4. The token this frees at R1 is then consumed by TG4 from R6. As can be observed, the average latencies of TG2's and TG3's packets are not only unaffected by the newly joined TG4, but even decrease slightly due to the absence of TG1, which used to have the highest weight in arbitration. The same holds for the average latencies of packets from all the other services. One might not expect TG4 to show such a performance improvement by cycle 21,000, since its promotion to R1 is not performed until then. In fact, the gain during cycles 20,000 to 21,000 is simply due to fewer competitors in arbitration, the same reason as for the other active services. At cycle 36,000, after we set TG1 active again at cycle 35,000, R1 takes back the token from TG4 and assigns it to TG1, and all average latencies return to their original levels.

Figure 22: Average round-trip latencies for every 200 cycles

[Figure 22 line chart. Y-axis: Cycles, ticks from 30 to 65; x-axis: Time, ticks every 200 cycles from 35,000 to 36,400; series: TG1, TG2, TG3, TG4.]

It is more interesting to zoom in and observe what exactly happens to each service's packets at around cycle 35,000. As shown in Figure 22, when TG1 resumes its traffic injection, its packets are, according to the protocol, routed to R1 and assigned lower priority than TG2-TG4 in R1. Despite this, the performance of TG2-TG4 is still slightly affected by the additional competitor for router resources. When the active TG1 is finally detected at cycle 36,000, it is assigned top priority in R1, and TG4 is rerouted back to R2. All packet latencies then return to their original levels gradually, because service rerouting and priority reassignment must wait for tail flits.

As a comparison, we conduct the same experiment on the same network but without dynamic routing or dynamic priority assignment. As in the baseline with WRR, priorities are pre-assigned to the packets of TG1-TG9 from high to low. We focus mainly on the latency change of packets from TG4, shown as TG4* in Figure 21. Since TG1-TG3 still have higher priorities and are "isolated" in R1, their performance does not change. The results show that when TG1 is inactive, TG4* does not improve as much as TG4 does. The reason is that TG4* still competes with 5 other TGs in R2, even though R1 now has available resources. This is analogous to the WRR baseline, which also lacks such adaptivity.

5.2.3 Throughput

Figure 23: Network throughputs comparison

[Figure 23 chart. Y-axis: Packets (0 to 25,000); x-axis: Cycles, with ticks 800, 400, 200, 100, 50, 25, 12; series: HMux, Dynamic.]

Network throughput is an important metric for evaluating a NoC design. It represents the maximum communication capacity that the network can support. In typical N-to-N NoC research, throughput is evaluated by increasing the traffic injection rate of each node and measuring the average number of flits received at each node. Since we aim to improve the efficiency of data supply by DRAM, what interests us more is the maximum number of memory requests/responses the system can serve in a given time interval. Therefore, we place a counter at the output port to DRAM, which counts the number of packets arriving at DRAM within 1 million network cycles. At the other end, all hardware blocks are replaced by streaming TGs, and we gradually increase the injection rate of each streaming TG. As shown in Figure 23, the packet count of Dynamic QoS stabilizes at a higher level than that of the WRR baseline, regardless of further increases in injection rate, which demonstrates a 5.2% higher communication capacity between DRAM and its requesters. The difference most likely lies in the packets with mid-level priorities. In the baseline with WRR, these packets contend for resources in a limited number of mid-priority routers, even when high-priority routers are sometimes nearly idle. In the same scenarios with Dynamic QoS, some of these packets are rerouted to high-priority routers and thus prioritized, which also relieves the contention in the mid-priority routers.

5.2.4 Area and Power

Figure 24: Router areas and channel areas

Figure 25: Router power consumptions

[Figure 24 stacked bar chart. Y-axis: mm² (0 to 0.03); bars: HMux router, Dynamic router, HMux channel, Dynamic channel; components: Regular-Buffer, Regular-Crossbar, Regular-Output, BB-Buffer, BB-Crossbar, BB-Output, Sat-Buffer, Sat-Crossbar, Sat-Output, Channel.]

[Figure 25 stacked bar chart. Y-axis: watts (0 to 0.24); bars: HMux, Dynamic; components: Input, Switch, Output, Channel Leakage.]

We measure the area and power costs of both QoS schemes using the power_module in BookSim 2.0
with a 32nm CMOS process. Area and static power consumption are calculated from the
configuration parameters of each network. Dynamic power consumption is calculated from the
activity factors of each router component, recorded over the entire simulation. Figure 24 shows
a 6.7% saving in total router area for Dynamic QoS, driven mainly by a 35.2% reduction in buffer
area. The reason is that the satellite routers of Dynamic QoS do not have the request flit buffers
found in the level-1 routers of the WRR baseline. The same reason explains the 40.9% reduction in
input (buffer) power consumption for Dynamic QoS in Figure 25. The tradeoff is a larger channel
area in Dynamic QoS, which provides more channels so that each service has more routing options.
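As a rough illustration of how dynamic power follows from recorded activity factors, the sketch below applies the standard CMOS switching-power relation P = αCV²f per router component. It is not the BookSim power_module code; the function names, the component list, and every numeric value are hypothetical.

```python
def dynamic_power_watts(activity, capacitance_farads, vdd_volts, freq_hz):
    """Standard CMOS switching power: P = alpha * C * Vdd^2 * f."""
    return activity * capacitance_farads * vdd_volts ** 2 * freq_hz


# Hypothetical per-component (activity factor, switched capacitance in farads).
components = {
    "input_buffer": (0.30, 2.0e-12),
    "switch": (0.20, 1.0e-12),
    "output": (0.20, 0.5e-12),
}


def router_dynamic_power(parts, vdd_volts=1.0, freq_hz=1.0e9):
    """Per-component dynamic power for one router, in watts."""
    return {
        name: dynamic_power_watts(alpha, cap, vdd_volts, freq_hz)
        for name, (alpha, cap) in parts.items()
    }


breakdown = router_dynamic_power(components)
total = sum(breakdown.values())

# A router without request flit buffers simply drops the buffer term.
bufferless = {k: v for k, v in components.items() if k != "input_buffer"}
bufferless_total = sum(router_dynamic_power(bufferless).values())
```

Under this model, removing a buffer removes its whole contribution to dynamic power, which is consistent with the direction of the input power reduction reported above.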

It should be noted that routers and data channels reside in different layers. Router logic is in
silicon, while channels incur overhead in the metal layers and, through the insertion of
repeaters to meet cycle time constraints, in silicon. Given the large number of metal layers in
modern ASICs, the increased channel requirements should not be problematic. Repeater insertion
may impact logic density, but exploring that requires detailed layout, which is beyond the scope
of this research. The other cost of the adaptivity of the intermediate network in Dynamic QoS is
the larger number of switch inputs in the backbone routers. Figure 24 and Figure 25 show that the
switch area and switch power consumption of Dynamic QoS increase by 36.4% and 3.9x respectively
over those of the WRR baseline.


Chapter 6 Conclusions

In summary, in this research work we have made the following contributions:

We have investigated traffic patterns and analyzed prioritization for the streams of data
communication between the DRAM controller and other components.

We have implemented WRR and a newly designed Dynamic QoS scheme specifically for
smartphone/tablet SoC networks. We have also described their protocols and corresponding
network topologies.

We have constructed a simulation infrastructure for smartphone/tablet SoCs and evaluated our
QoS designs with it. Results show gains in average latency compared with the non-QoS baseline.
Dynamic QoS outperforms the WRR baseline on network throughput as well as on router area and
power consumption. As a tradeoff, Dynamic QoS has higher channel and in-router switch costs.
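For readers unfamiliar with weighted round-robin arbitration, the following is a minimal generic sketch of the idea. It is not the arbiter implemented in this work nor the patented design in [34]; the class name, the credit-refresh policy, and the example weights are assumptions made for illustration.

```python
class WeightedRoundRobinArbiter:
    """Generic WRR: input i may win up to weights[i] grants per visit."""

    def __init__(self, weights):
        self.weights = list(weights)
        self.current = 0                      # input currently holding the turn
        self.credits = self.weights[0]        # grants left in this turn

    def grant(self, requests):
        """requests[i] is True if input i has a packet; returns the winner or None."""
        n = len(self.weights)
        # n + 1 checks cover the worst case: the current input has no credits
        # left and is the only requester, so we must wrap around to it again.
        for _ in range(n + 1):
            if requests[self.current] and self.credits > 0:
                self.credits -= 1
                return self.current
            # Turn is over (no request or no credits): move on and refill.
            self.current = (self.current + 1) % n
            self.credits = self.weights[self.current]
        return None


# Hypothetical weights: input 0 gets twice the share of input 1.
arbiter = WeightedRoundRobinArbiter([2, 1])
grants = [arbiter.grant([True, True]) for _ in range(6)]
```

With both inputs always requesting, the grant sequence follows the 2:1 weight ratio. The weights are fixed here, which is the property that Dynamic QoS relaxes by letting traffic use otherwise idle higher-priority resources.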

6.1 Future Work

As future work, we plan to integrate a CPU simulator directly into the simulation infrastructure.
This would restore the data dependency information that can affect traffic patterns. Similarly,
we plan to integrate a GPU simulator to provide more realistic traffic patterns than those from
the current dummy model. Once such a credit-feedback mechanism is in place for all network
injections, we could switch to finite injection queues, i.e. closed-loop measurement [2]. In that
case, we could run benchmarks directly on the simulators and measure the sensitivity of their run
time to network parameters as another evaluation metric. Moreover, finite injection queues would
call for a more adaptive scheme. For instance, when a finite injection queue fills beyond a
certain level, the corresponding priority should be increased.
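The occupancy-triggered priority increase suggested above could look roughly like the sketch below. It is a hypothetical illustration only: the class, the 75% threshold, the single-step escalation, and the convention that larger numbers mean higher priority are all assumptions, not part of the implemented schemes.

```python
from collections import deque


class InjectionQueue:
    """Finite injection queue whose priority rises when it fills up."""

    def __init__(self, capacity, base_priority, threshold=0.75, max_priority=3):
        self.capacity = capacity
        self.base_priority = base_priority
        self.threshold = threshold            # hypothetical occupancy trigger
        self.max_priority = max_priority
        self.packets = deque()

    def try_enqueue(self, packet):
        """Return False when full, so the traffic source must stall (closed loop)."""
        if len(self.packets) >= self.capacity:
            return False
        self.packets.append(packet)
        return True

    def dequeue(self):
        return self.packets.popleft() if self.packets else None

    @property
    def priority(self):
        """Base priority, raised by one level once occupancy exceeds the threshold."""
        occupancy = len(self.packets) / self.capacity
        if occupancy > self.threshold:
            return min(self.base_priority + 1, self.max_priority)
        return self.base_priority


queue = InjectionQueue(capacity=4, base_priority=1)
for packet_id in range(3):
    queue.try_enqueue(packet_id)
priority_at_three = queue.priority      # occupancy 0.75, not beyond the threshold
queue.try_enqueue(3)
priority_when_full = queue.priority     # occupancy 1.0, escalated
accepted_when_full = queue.try_enqueue(4)
```

A real design would also need hysteresis or a cap on how many sources can escalate at once, so that escalation does not defeat the original priority ordering.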

Regarding our proposed Dynamic QoS, we have demonstrated its capability to reduce the average
latency of packets from high-priority traffic. We still need to examine the jitter, i.e. the
variance, of packet latencies, since it is also an important factor in overall performance. In
addition, we may be able to demonstrate its scalability by building scaled versions and comparing
them experimentally with the baseline. For area and power cost, RTL implementations of the routers
would give more accurate estimates, though the current method should be sufficient for
comparisons. Lastly, and more importantly, we will target a QoS co-design with the DRAM
controller. From a system perspective, this may lead to a more effective and possibly simpler
design.
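To show why jitter deserves separate attention from average latency, the sketch below summarizes two made-up latency traces that share the same mean. The function name and the sample values are hypothetical and do not come from the simulations in this work; population variance is used as the jitter measure, matching the definition above.

```python
import statistics


def latency_summary(latencies):
    """Return (mean latency, jitter), with jitter as population variance."""
    return statistics.mean(latencies), statistics.pvariance(latencies)


# Two hypothetical traces, in cycles, with identical average latency.
steady_trace = [10, 10, 10, 10]
bursty_trace = [5, 15, 5, 15]

steady_mean, steady_jitter = latency_summary(steady_trace)
bursty_mean, bursty_jitter = latency_summary(bursty_trace)
```

Both traces would look identical in an average-latency plot, yet the second one could underrun a display or audio buffer that assumes steady delivery.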


Bibliography

[1] J. Hruska, "Exynos 4212," Extreme Tech, 4 1 2012. [Online]. Available:
http://www.extremetech.com/computing/111315-blood-in-the-water-nvidia-qualcomm-samsung-and-ti-prepare-for-arm-war.

[2] W. J. Dally and B. Towles, Principles and Practices of Interconnection Networks, Elsevier,

Inc., 2004.

[3] W. J. Dally and B. Towles, "Route packets, not wires: On-chip interconnection networks,"

in Design Automation Conference, 2001.

[4] N. Enright Jerger and L.-S. Peh, On-Chip Networks, M. Hill, Ed. Morgan and Claypool

Publishers, 2009.

[5] Y. Hoskote, "A 5-GHz mesh interconnect for a Teraflops processor," IEEE MICRO, vol. 27,

no. 5, pp. 51-61, 2007.

[6] J. Howard et al., "A 48-core IA-32 message-passing processor with DVFS in 45nm CMOS,"

International Solid State Circuit Conference, 2010.

[7] J. A. Kahle et al., "Introduction to the cell multiprocessor," IBM Journal of Research and
Development, vol. 49, no. 4.

[8] D. Wentzlaff et al., "On-chip interconnection architecture of the tile processor," IEEE

MICRO, vol. 28, 2007.

[9] J. Kim, J. Balfour and W. J. Dally, "Flattened butterfly topology for on-chip networks,"

IEEE MICRO, 2007.

[10] N. Brookwood, "AMD fusion family of apus – enabling a superior, immersive pc," AMD
white paper, 2010.

[11] K. Lee, S.-J. Lee and H.-J. Yoo, "Low-power network-on-chip for high-performance SoC
design," in IEEE Trans. VLSI Syst., 2006.

[12] A. Lambrechts et al., "Power breakdown analysis for a heter. NoC," in ASAP, 2005.

[13] M. Kreutz et al., "Design space exploration comparing homogenous and heterogeneous

network-on-chip architectures," in SBCCI, 2005.

[14] T. Bjerregaard and S. Mahadevan, "A survey of research and practices of network-on-chip,"

ACM Comput. Surv., 2006.

[15] K. Goossens, "Networks on silicon: Combining best effort and guaranteed services," IEEE
DATE.

[16] J. W. van den Brand, C. Ciordas, K. Goossens and T. Basten, "Congestion-controlled
best-effort communication for networks-on-chip," in Des., Autom. Test Eur. Conf., 2007.

[17] P. Avasare et al., "Centralized end-to-end flow control in a best-effort network-on-chip," in
EMSOFT, 2005.

[18] B. Grot et al., "Preemptive virtual clock: A flexible, efficient, and cost-effective QoS

scheme for networks-on-a-chip," IEEE MICRO, 2009.

[19] T. Bjerregaard and J. Sparso, "A router architecture for connection-oriented service

guarantees in the MANGO clockless network-on-chip," IEEE DATE, vol. 2, pp. 1226-1231,

2005.

[20] K. Goossens et al., "The Æthereal network on chip: concepts, architectures, and

implementations," in IEEE Design and Test of Computers, 2005.

[21] S. Murali et al., "A methodology for mapping multiple use-cases onto networks on," in Des.

Autom. Test Eur. Conf., 2006.


[22] L. Cheng et al., "Interconnect-aware coherence protocols for chip multiprocessors,"

IEEE/ACM ISCA, pp. 339-351, 2006.

[23] B. Grot et al., "Kilo-NOC: a heterogeneous network-on-chip architecture for scalability and

service guarantees," IEEE/ACM ISCA, vol. 38, 2011.

[24] E. Bolotin et al., "QNoC: QoS architecture and design process for network on chip," J. Syst.

Architecture: EUROMICRO J., vol. 50, no. 2/3, pp. 105-128, 2004.

[25] V. Soteriou, H. Wang and L.-S. Peh, "A statistical traffic model for on-chip interconnection

networks," Int. Symp. Model., Anal., Simul. Comput. Telecommun. Syst., pp. 104-116, 2006.

[26] J. Hestness, B. Grot and S. W. Keckler, "Netrace: dependency-driven trace-based
network-on-chip simulation," the Third International Workshop on NoC Architectures, 2010.

[27] A. Gutierrez et al., "Full-system analysis and characterization of interactive smartphone
applications," IEEE Intl. Sym. on Workload Characterization, 2011.

[28] P. Rosenfeld, E. Cooper-Balis and B. Jacob, "DRAMSim2: A cycle accurate memory

system simulator," Computer Architecture Letters, vol. 10, no. 1, pp. 16-19, 2011.

[29] N. Binkert et al., "The gem5 simulator," SIGARCH Comput. Archit. News, vol. 39, 2011.

[30] S. Meyn, R. L. Tweedie and P. W. Glynn, Markov Chains and Stochastic Stability, 2 ed.,

Cambridge University Press, 2008.

[31] J. L. Henning, "SPEC CPU2006 benchmark descriptions," ACM SIGARCH Computer

Architecture News, 2005.

[32] V. J. Reddi et al., "Pin: a binary instrumentation tool for computer architecture research and

education," WCAE, 2004.

[33] "Self-similarity," Wikipedia, [Online]. Available:
http://en.wikipedia.org/wiki/Self-similarity.


[34] A. Lewin and T. K. Zvi, "Configurable Weighted Round Robin Arbiter". United States

Patent 6,032,218, 29 02 2000.

[35] "Compression Ratio Rules of Thumb," [Online]. Available:

http://www.kanecomputing.co.uk/pdfs/compression_ratio_rules_of_thumb.pdf.

[36] D. Pham et al., "The design and implementation of a first-generation cell processor," in

IEEE International Solid-State Circuits Conference, 2005.
