
Learning-Based Proactive Resource Allocation for Delay-Sensitive Packet Transmission

Jiayin Chen, Peng Yang, Member, IEEE, Qiang Ye, Member, IEEE, Weihua Zhuang, Fellow, IEEE, Xuemin (Sherman) Shen, Fellow, IEEE, and Xu Li

Abstract—In this paper, a learning-based proactive resource sharing scheme is proposed for next-generation core communication networks, where the available forwarding resources at a switch are proactively allocated to traffic flows in order to maximize the efficiency of resource utilization while satisfying delay requirements. The resource sharing scheme consists of two joint modules: estimation of resource demands and allocation of available resources. For service provisioning, the resource demand of each traffic flow is estimated based on its predicted packet arrival rate. Considering the distinct features of each traffic flow, a linear regression algorithm is developed for resource demand estimation, utilizing the mapping relation between traffic flow status and required resources, upon which a network switch makes decisions on allocating available resources for delay satisfaction and efficient resource utilization. To learn the implicit relation between allocated resources and delay, a multi-armed bandit learning-based resource allocation scheme is proposed, which enables fast adjustment of resource allocation to traffic arrival dynamics. The proposed algorithm is proved to asymptotically approach the optimal strategy, with polynomial time complexity. Extensive simulation results are presented to demonstrate the effectiveness of the proposed resource sharing scheme in terms of delay satisfaction, traffic adaptiveness, and resource allocation gain.

Index Terms—Delay requirements, efficient resource allocation, proactive resource sharing, traffic dynamics, multi-armed bandit learning.

I. INTRODUCTION

The proliferation of new applications has placed significant pressure on service provisioning in future core communication networks beyond the fifth generation (5G). Some applications are delay-sensitive, with strict requirements that are challenging to satisfy. For instance, video conferencing requires low latency and high reliability to ensure interactivity and video quality. To meet these requirements, the utilization of buffer spaces at each network switch needs to be properly controlled. Specifically, the dominant contributing factor to end-to-end (E2E) packet transmission delay is the queuing delay at network switches, and packet loss is

Jiayin Chen, Weihua Zhuang, and Xuemin (Sherman) Shen are with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, Canada, N2L 3G1 (email: {j648chen, wzhuang, sshen}@uwaterloo.ca).

Peng Yang is with the School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan, 430074, P.R. China (email: [email protected]).

Qiang Ye is with the Department of Electrical and Computer Engineering and Technology, Minnesota State University, Mankato, MN 56001 USA (email: [email protected]).

Xu Li is with Huawei Technologies Canada Inc., Ottawa, ON, Canada, K2K 3J1 (email: [email protected]).

mainly caused by buffer overflow during network congestion [1].

Recently, several congestion control methods have been proposed to meet stringent quality-of-service (QoS) requirements, e.g., performance-oriented congestion control (PCC) [2] and BBR [3], both of which are implemented at the packet source node to adjust the sending rate based on observed E2E performance (e.g., achieved goodput, packet loss rate, and average latency). On the other hand, in-network congestion control schemes (e.g., active queue management (AQM)) adjust the queue lengths at switches by dropping or marking packets [4]. In [5], the interplay of end congestion control and in-network AQM is investigated to enable real-time Web browsing. A variation of BBR is proposed for a multi-path transmission scenario [6]. Congestion control schemes in general assume a fixed amount of forwarding resources for the traffic flow of each service (i.e., an aggregation of packets of the same service type transmitted from a source node to a destination node). To guarantee QoS satisfaction for different applications, resources are usually over-provisioned to support the peak traffic volume, while resource multiplexing is exploited among traffic flows of different services to improve QoS with efficient resource utilization.

To support delay-sensitive services, packet transmission delay in core networks should be minimized in order to meet the E2E delay requirement. The delivery delay of each service flow in core networks is determined by the routing path and the allocated forwarding resources at each switch on the route. Software-defined networking (SDN) is an emerging technology to enhance routing performance in next-generation core networks, in which the routing path for packet transmission can be customized and optimized for different services [7] [8]. Specifically, the SDN controller calculates the routing path for each traffic flow and distributes the flow tables to all related switches, thereby guiding packet forwarding. In addition, the SDN controller can calculate the average delay of packets traversing each switch [9] [10], which can be utilized to configure the per-hop delay requirement. Given the routing path, the amounts of forwarding resources allocated at the passing switches collectively determine the E2E delay of a traffic flow. The decomposition of the E2E delay requirement enables the development of a per-hop resource allocation solution. Compared with an E2E solution, a per-hop resource allocation solution can be more flexible in dealing with varying network conditions.
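As a concrete illustration of how controller-side measurements could be turned into per-hop delay requirements, the sketch below splits an E2E delay budget across the switches on a route in proportion to the average per-hop delays the controller has measured. The proportional rule and the function name are illustrative assumptions, not the paper's specified decomposition method.

```python
def decompose_delay_budget(e2e_budget, avg_hop_delays):
    """Split an E2E delay budget (ms) into per-hop requirements,
    proportional to the measured average delay at each hop, so that
    the per-hop requirements sum to the E2E budget."""
    total = sum(avg_hop_delays)
    return [e2e_budget * d / total for d in avg_hop_delays]
```

For example, a 30 ms budget over three hops with measured averages of 2, 4, and 4 ms yields per-hop requirements of 6, 12, and 12 ms.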

To satisfy the decomposed per-hop delay requirement with efficient resource utilization, each switch makes decisions on resource sharing among traffic flows. Delay requirements for per-hop packet transmission are taken into account by the wait time priority (WTP) [11] and earliest due date (EDD) [12] algorithms, in which the switch makes a decision each time a packet is forwarded. Considering the high forwarding rate of core network switches, flow-level resource sharing schemes with lower computational complexity have been investigated for fairness and delay requirements, such as weighted fair queuing (WFQ) [13] and deficit round robin (DRR) [14] [15]. To adapt to varying traffic and network conditions, dynamic WFQ updates the allocation of resources at the beginning of each resource sharing interval [16], where the resource allocation decisions are made based on an estimate of the average packet queuing delay in the upcoming interval. To support applications with stringent delay requirements, resource allocation is expected to be conducted for delay satisfaction of individual packets. Furthermore, if resource sharing is conducted in line with packet arrival instants, QoS can be degraded by traffic bursts in the upcoming resource sharing interval. Hence, a proactive resource sharing and packet scheduling solution can potentially enable timely QoS enhancement based on traffic prediction [17] [18]. Different from packet-level decisions made based on traffic characteristics, flow-level resource sharing decisions require resource demand estimation. Thus, to accommodate the large data volume and traffic fluctuations in future core communication networks while achieving efficient resource utilization, we resort to generalized traffic prediction and resource demand estimation to develop a proactive resource sharing scheme.

To support delay-sensitive applications in a dynamic networking environment, learning-based scheduling algorithms have been investigated [19] [20]. To adapt to traffic dynamics, the resource scheduler applies machine learning techniques to categorize traffic flows with different characteristics [21]. Based on instantaneous system states (e.g., number of arrived packets and resource occupancy), decisions on allocating resources are made by a learning module, such as reinforcement learning (RL) [22], deep Q network (DQN) [23], or stacked auto-encoder (SAE) deep learning [24], through which the implicit relation between the allocated resources and the achieved QoS satisfaction can be learned. Due to the distinct delay requirements of applications, different models (e.g., parameters of neural networks) are trained to capture the relations between resource allocation and QoS performance separately, which increases computational cost. To accommodate the large number of flows in core communication networks, a resource sharing algorithm implemented at in-network switches needs to achieve satisfactory QoS performance with low computational complexity.

In this work, we aim at developing a lightweight online resource demand estimation model for adaptive resource sharing. The resources are for packet forwarding at the egress port of a network switch. On the basis of resource demand estimation to accommodate the differentiated delay requirements of different flows, a learning module is developed to learn the relation between allocated resources and achieved QoS for distinct applications. Based on the information of flows (including estimated resource demand, predicted traffic load, and resource occupancy), a resource allocation module facilitates resource sharing among flows to achieve maximal overall QoS satisfaction, with consideration of the tradeoff between exploitation and exploration. The multi-armed bandit (MAB) framework has the potential to balance the exploration-exploitation tradeoff, and has been applied to resource allocation [25], data offloading decisions [26], and beam alignment [27] in wireless networks. In the MAB framework, the player makes sequential decisions, choosing one arm from the arm set each time to maximize the obtained reward. The player either exploits the most rewarding arms based on historical performance or pulls new arms for exploration. The upper confidence bound (UCB) method [28] is used to balance the exploitation-exploration tradeoff for bandit problems, providing a theoretical confidence bound on regret. In the resource sharing problem, the amount of resources allocated to each flow can be formulated as an arm, while the obtained QoS satisfaction is the reward for pulling the arm. The resource demand can also be considered as a guide for resource sharing, providing context information for each arm. Thus, resource sharing is a feature-based exploration-exploitation problem, which can be formulated as a contextual bandit problem [29].

In this paper, we aim at maximizing the efficiency of resource sharing at a switch (i.e., the ratio of delay satisfaction to the allocated resources), and propose a bandit learning-based proactive forwarding resource sharing scheme, which maps the delay requirement and packet arrivals of each traffic flow to a forwarding resource demand. Then, the available resources are allocated to the traffic flows accordingly. The resource allocation is formulated as a bandit learning problem, and a regret-bounded allocation strategy is developed, which is shown to be efficient in resource utilization. We consider resource sharing at a software-programmable switch, which is capable of supporting virtual network functions and packet processing functions, in addition to packet forwarding [30] [31]. The main contributions of this paper are summarized in the following:

(1) Resource sharing framework – We develop a learning-based framework to make resource sharing decisions at each switch. In this framework, the resource demand is first extracted as a service feature, by considering both traffic arrivals and delay requirements. Then, the switch allocates an optimal amount of available resources to different traffic flows based on their features;

(2) Resource demand estimation – We propose an online resource demand estimation module in the resource sharing framework, which combines a linear regression model with an online gradient descent method. Utilizing the linear mapping relation between traffic loads and the required forwarding resources, the resource demand of each flow at a switch is estimated based on traffic prediction and its per-hop delay requirement;

(3) Allocation of available resources – We design a MAB-based scheme for allocating available resources in the resource sharing framework. Based on the estimated resource demands, the available resources are allocated to the flows accordingly. Using the measured delay as feedback, the parameters of the proposed learning module are updated.

The remainder of this paper is organized as follows. The system model is described and the research problem is formulated in Section II. In Section III, the learning-based available resource sharing solution is presented. The performance evaluation and comparison are given in Section IV. Finally, conclusions are drawn in Section V.

II. SYSTEM MODEL AND PROBLEM FORMULATION

A. Network Model

Fig. 1: Illustration of packet transmission at a switch. (Figure: traffic arrivals enter per-flow queues at the ingress port; queue length and arrival rate measurements feed the weight distribution of the packet scheduler, which forwards traffic departures at the egress port.)

Traffic flows of different services traverse a sequence of network switches in core networks before reaching their destinations for E2E service delivery. A network switch refers to a software switch, e.g., an OpenFlow switch, centrally managed by an SDN control module in the core networks [32] [33]. An OpenFlow switch is capable of both layer-2 and layer-3 network functionalities, depending on whether the device is located within a local area network or is interconnecting two public network segments. The E2E delay requirement for each traffic flow is considered, which is composed of the per-hop packet delays at all passing switches along the routing path from the source to the destination. This per-hop delay consists of packet queuing delay and transmission delay at a switch.

At each switch, packets from different flows enter corresponding transmission queues of the packet scheduler, as shown in Fig. 1. The packet scheduler makes packet forwarding decisions based on the resource allocation for traffic flows. If more forwarding resources are allocated to a flow, its forwarding rate becomes higher, leading to reduced queuing and transmission delays for the flow. Hence, the delay requirements of the traffic flows can be satisfied by adjusting the amount of allocated resources. Consider a set, $\mathcal{N}$, of in-network switches. The set of flows traversing an intermediate switch $n \in \mathcal{N}$ is denoted by $\mathcal{F}_n$, and the set of switches on the routing path of flow $f \in \mathcal{F}_n$ is denoted by $\mathcal{N}_f \subseteq \mathcal{N}$. We assume that the flow set, $\mathcal{F}_n$, and the switch set, $\mathcal{N}_f$, remain stable in the course of scheduling. The amount of forwarding resources of switch $n$ is denoted by $C^{(n)}$, and the amount of pre-allocated forwarding resources for flow $f$ is denoted by $C_f^{(n)}$, which is determined based on the long-term QoS satisfaction and is assumed to be always guaranteed. The amount of available resources of switch $n$ is $C_I^{(n)} = C^{(n)} - \sum_{f \in \mathcal{F}_n} C_f^{(n)}$.

Time is partitioned into resource sharing intervals of constant duration. At the beginning of the $m$th interval, switch $n$ makes a decision on the amount of available resources allocated to flow $f$, denoted by $\Delta C_{f,m}^{(n)} (\geq 0)$. Then, the allocated forwarding resource for flow $f$ is $C_{f,m}^{(n)} = C_f^{(n)} + \Delta C_{f,m}^{(n)}$, and $[C_{f,m}^{(n)}]_{f \in \mathcal{F}_n}$ are the allocated forwarding resources for all the flows in the $m$th interval. The weighted round-robin scheme is adopted for packet forwarding, where the weight of flow $f$, denoted by $v_{f,m}^{(n)}$, is proportional to its allocated forwarding resources. That is,

$$v_{f_1,m}^{(n)} : v_{f_2,m}^{(n)} = C_{f_1,m}^{(n)} : C_{f_2,m}^{(n)}, \quad \forall f_1, f_2 \in \mathcal{F}_n \qquad (1)$$

where $\sum_{f \in \mathcal{F}_n} v_{f,m}^{(n)} = 1$.
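The weight assignment in Eq. (1) can be sketched as follows; the dictionary interface and flow identifiers are illustrative choices, not part of the paper's notation.

```python
def wrr_weights(allocated):
    """Map per-flow allocated forwarding resources C_{f,m}^{(n)} to
    weighted round-robin weights v_{f,m}^{(n)} per Eq. (1): weights
    are proportional to the allocations and normalized to sum to 1."""
    total = sum(allocated.values())
    return {f: c / total for f, c in allocated.items()}
```

For instance, two flows allocated 30 and 10 units of forwarding resources receive weights 0.75 and 0.25.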

B. Per-Hop Delay Requirements

We consider traffic flows with differentiated delay requirements and packet arrival patterns. Suppose the E2E packet delay for flow $f$ is decomposed into per-hop delay requirements based on the pre-allocated forwarding resources at each switch. To support the applications with strict delay requirements, we consider SDN-enabled core networks, in which an SDN controller has global network information. Upon service requests, the SDN controller configures routing paths and distributes the flow tables to the passing switches for packet forwarding [34] [35]. In addition, the controller can calculate the average queuing delay and average transmission delay for packets traversing each switch [9] [10], the summation of which can be utilized to configure the per-hop delay requirement for the switch. Specifically, the delay requirement for flow $f \in \mathcal{F}_n$ at switch $n \in \mathcal{N}$ is denoted by $D_f^{(n)}$.

We denote the overall delay satisfaction ratio of switch $n$ in interval $m$ as $\rho_m^{(n)} = \sum_{f \in \mathcal{F}_n} \rho_{f,m}^{(n)}$, where $\rho_{f,m}^{(n)}$ is the delay satisfaction ratio for flow $f$, i.e., the ratio of the number of packets with experienced delay smaller than $D_f^{(n)}$ over the total number of transmitted packets belonging to flow $f$ in the $m$th interval. Let $\varepsilon$ be the tolerance of the per-hop delay violation ratio, i.e., $\rho_{f,m}^{(n)} \geq 1 - \varepsilon$.
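The per-flow delay satisfaction ratio $\rho_{f,m}^{(n)}$ defined above can be computed directly from the measured per-packet delays; a minimal sketch:

```python
def delay_satisfaction_ratio(delays, requirement):
    """Fraction of packets whose measured per-hop delay is below the
    per-hop requirement D_f^{(n)} (Section II-B). Returns 1.0 for an
    interval with no transmitted packets (a convention assumed here)."""
    if not delays:
        return 1.0
    return sum(1 for d in delays if d < requirement) / len(delays)
```

A flow then meets the tolerance constraint when `delay_satisfaction_ratio(...) >= 1 - eps`.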

C. Problem Formulation

We consider the resource allocation ratio when making the resource sharing decision, denoted by $\eta_m^{(n)}$ for switch $n$ in interval $m$, where $\eta_m^{(n)} = \sum_{f \in \mathcal{F}_n} C_{f,m}^{(n)} / C^{(n)}$. As shown in (1), the weighted resource sharing is conducted proportionally to the assigned weight for each flow.

Considering resource allocation for delay satisfaction, we describe the resource sharing efficiency as the delay satisfaction ratio achieved per unit of allocated resources, denoted by $\rho_m^{(n)} / \eta_m^{(n)}$. The resource sharing optimization problem is formulated as (P1), in which our objective is to maximize the resource sharing efficiency at switch $n$ under the resource constraints to determine the optimal decision of allocated available resources $(\{\Delta C_{f,m}^{(n)}\}_{f \in \mathcal{F}_n})$.
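The objective $\rho_m^{(n)} / \eta_m^{(n)}$ is straightforward to evaluate once the per-flow satisfaction ratios and allocations are known; a sketch, with the dictionary-based interface being an assumption:

```python
def sharing_efficiency(satisfaction_ratios, allocated, capacity):
    """Objective of (P1): overall delay satisfaction ratio
    rho = sum_f rho_{f,m} divided by the resource allocation ratio
    eta = sum_f C_{f,m} / C^{(n)}."""
    rho = sum(satisfaction_ratios.values())
    eta = sum(allocated.values()) / capacity
    return rho / eta
```

For example, two flows with ratios 0.9 and 0.8 that together use 80% of a switch's capacity yield an efficiency of 1.7 / 0.8 = 2.125.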


Fig. 2: Illustration of resource sharing for packet transmission at a switch. (Figure: per-flow inputs (per-hop delay requirements, packet arrivals, and queue lengths) feed the resource demand estimation module; the per-flow required transmission resources drive the allocation of available resources for all flows; packet scheduling runs on scheduling intervals shorter than the resource sharing interval; delay measurements provide the reward feedback.)

$$(\text{P1}): \quad \max_{\{\Delta C_{f,m}^{(n)}\}_{f \in \mathcal{F}_n}} \; \rho_m^{(n)} / \eta_m^{(n)}$$

$$\text{s.t.} \quad \Delta C_{f,m}^{(n)} \geq 0, \; f \in \mathcal{F}_n \qquad (2a)$$

$$\sum_{f \in \mathcal{F}_n} \Delta C_{f,m}^{(n)} \leq C_I^{(n)}. \qquad (2b)$$

Delay satisfaction ratio $\rho_m^{(n)}$ is determined by the forwarding resources and the packet-level traffic information. Due to the uncertainty of traffic arrival patterns, it is difficult to establish an analytical model for the ratio. Thus, a model-free learning-based method is expected to solve the optimization problem, in which the delay satisfaction ratio can be measured as feedback to the algorithm.

III. LEARNING-BASED PROACTIVE RESOURCE SHARING

In this section, we present a bandit learning-based proactive resource sharing framework to improve delay satisfaction of different services while achieving efficient utilization of network resources.

A. Proactive Resource Sharing Framework

For proactive resource sharing among all flows at each switch, we propose two modules, resource demand estimation and allocation of available resources, as shown in Fig. 2. Each resource sharing interval contains an integer number of packet scheduling intervals. The resource sharing framework for packet transmission at a switch consists of the following four functionalities:

• Resource demand estimation – To make the resource allocation decisions, the resource demands of different flows are estimated, based on their delay requirements for per-hop packet transmission, the queue lengths of each flow, and the predicted numbers of arriving packets;

• Allocation of available resources – At the beginning of each resource sharing interval, the allocation of resources for all flows passing through the switch is updated to maximize the efficiency of resource sharing at the switch, based on the estimated resource demands;

• Packet scheduling – For packet transmission, the switch schedules packets from different flows at the egress port according to their allocated resource shares. The decision on packet scheduling is made at each scheduling interval, which is shorter than the resource sharing interval;

• Delay measurements – When a packet is being transmitted at the egress port, its delay at the switch is measured as the time difference between the packet transmission instant and the packet arrival instant [32]. Based on the delay requirements, the overall delay satisfaction ratio at the switch is calculated at the end of each resource sharing interval. With the resource allocation ratio, the resource sharing efficiency is evaluated and fed back to the resource demand estimation module and the resource allocation module for updating the resource sharing decisions in the following intervals.
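The interaction of these functionalities over one interval can be sketched as a simple control loop. The callable interfaces below stand in for the three modules and are purely illustrative assumptions about how such a loop might be wired:

```python
def resource_sharing_interval(flows, estimate_demand, select_arm, measure_ratio):
    """One resource sharing interval of the framework (Fig. 2):
    estimate per-flow demands (context), allocate available resources
    (arm pull), then obtain the measured delay satisfaction ratio as
    the reward fed back to both learning modules."""
    demands = {f: estimate_demand(f) for f in flows}   # resource demand estimation
    allocation = select_arm(demands)                   # allocation of available resources
    reward = measure_ratio(allocation)                 # delay measurements -> feedback
    return demands, allocation, reward
```

In a full implementation the returned tuple would update the estimator and the bandit learner before the next interval.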

For each flow, the more forwarding resources allocated, the higher the delay satisfaction ratio. However, the improvement of delay satisfaction achieved by allocating the same amount of resources to different flows may differ. For a flow with sufficient allocated resources, which already has a high delay satisfaction ratio, the performance improvement from additional allocated resources is limited. On the contrary, allocating resources to a flow with a low delay satisfaction ratio leads to more significant performance improvement. Thus, when allocating resources to different flows, their resource demands should be considered as a feature of the flows, such that a higher resource demand indicates a larger potential improvement of the overall delay satisfaction ratio.

The resource demands of flows in the resource sharing problem are analogous to user preferences in an advertisement click problem, referred to as context information. This type of context-based decision-making problem can be formulated as a contextual bandit problem and solved through UCB methods [36]. Hence, we propose a MAB formulation with context information to describe the resource sharing problem in (P1). In our MAB-based resource sharing problem, an arm represents a potential resource allocation decision for all flows in each resource sharing interval. To maximize the resource sharing efficiency, the reward of the MAB problem is represented by the delay satisfaction ratio, and the optimal arm represents the resource allocation decision which achieves the highest ratio of the reward over the allocated resources. Before arm selection, the rewards of pulling different arms are estimated, considering the context information (i.e., the estimated resource demands in our problem). The achieved reward depends on both arm selection and context information, and the resource sharing among flows iteratively converges through the learning process with reward feedback.
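For intuition, a minimal non-contextual UCB1 arm selection, of the kind underlying the UCB methods cited above, might look like the sketch below. The data layout is an assumption, and the paper's scheme additionally conditions on the resource demand context rather than using plain UCB1:

```python
import math

def ucb1_select(counts, rewards, t):
    """Pick the arm maximizing: mean reward + sqrt(2 ln t / n_a),
    i.e., the classic UCB1 index balancing exploitation (mean) and
    exploration (uncertainty bonus). Unpulled arms are tried first."""
    for arm, n in counts.items():
        if n == 0:
            return arm                      # explore every arm once

    def index(arm):
        mean = rewards[arm] / counts[arm]   # empirical mean reward
        bonus = math.sqrt(2 * math.log(t) / counts[arm])
        return mean + bonus

    return max(counts, key=index)
```

Here `counts` maps each arm to its pull count, `rewards` to its cumulative reward, and `t` is the current round.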

At the beginning of each resource sharing interval, the learning module makes the resource sharing decision. The decision variables $\Delta C_{f,m}^{(n)}$ in (P1) are continuous, leading to an infinite number of arms. Since knowledge of the reward distribution is learned through pulling different arms, it is necessary to make the arm set finite. Thus, we discretize the available resources at switch $n$ into $I^{(n)} = \lfloor C_I^{(n)} / B \rfloor$ resource blocks (RBs), each with the same amount of forwarding resources $B$ (in units of bits per second). The decision variables $\Delta C_{f,m}^{(n)}$ are discretized correspondingly as $\Delta I_{f,m}^{(n)}$. Hence, (P1) is transformed to (P2).
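The discretization step is simple arithmetic; a sketch of the two mappings used above:

```python
def discretize(available, block_size):
    """Quantize the available capacity C_I^{(n)} into I^{(n)} resource
    blocks of B bit/s each: I = floor(C_I / B)."""
    return int(available // block_size)

def blocks_to_rate(delta_i, block_size):
    """Constraint (3b): Delta C = Delta I * B, mapping an integer RB
    decision back to a forwarding rate."""
    return delta_i * block_size
```

For example, 1050 units of available capacity with B = 100 yields 10 RBs, and a decision of 3 RBs corresponds to a rate of 300.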

$$(\text{P2}): \quad \max_{\{\Delta I_{f,m}^{(n)}\}_{f \in \mathcal{F}_n}} \; \rho_m^{(n)} / \eta_m^{(n)}$$

$$\text{s.t.} \quad \Delta I_{f,m}^{(n)} \in \mathbb{N}, \; f \in \mathcal{F}_n \qquad (3a)$$

$$\Delta C_{f,m}^{(n)} = \Delta I_{f,m}^{(n)} \times B \qquad (3b)$$

$$\sum_{f \in \mathcal{F}_n} \Delta C_{f,m}^{(n)} \leq C_I^{(n)}. \qquad (3c)$$

A contextual-bandit algorithm proceeds in the discrete resource sharing intervals. At the beginning of the m-th interval, switch n estimates its current resource demand, [C^(n)_{f,m}]_{f∈F_n}, based on the reward information [ρ^(n)_i]_{i=1,2,...,m−1} from the previous (m − 1) intervals. The arm is a vector of |F_n| integers, each element representing the number of available resource blocks allocated to a flow. Thus, each selected arm [ΔI^(n)_{f,m}]_{f∈F_n} determines the allocation of available resources. To avoid redundant resource allocation for a flow, its resource demand is considered: the number of RBs that can be allocated to flow f in the m-th interval should be less than ⌈(C^(n)_{f,m} − C̃^(n)_{f,m}) / B⌉, where C̃^(n)_{f,m} denotes the resources already allocated to flow f. As the total amount of available resources at switch n is I^(n) RBs, for each flow we obtain the upper bound on the number of allocated resource blocks ΔI^(n)_{f,m} as

U^(n)_{f,m} = min[ I^(n), max( ⌈(C^(n)_{f,m} − C̃^(n)_{f,m}) / B⌉, 0 ) ].

Since the arm describes the allocation of available resources for all flows traversing the switch, the set of potential arms is denoted by A^(n)_{f,m}|_{f∈F_n}, where A^(n)_{f,m} = {0, 1, 2, ..., U^(n)_{f,m}}, and the size of the arm set is Π_{f∈F_n} (U^(n)_{f,m} + 1). In each interval, an arm is selected from A^(n)_{f,m}|_{f∈F_n} to achieve the maximal potential reward ρ̂^(n)_m, i.e., the estimated value of the overall delay satisfaction ratio ρ^(n)_m.
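The upper bounds U^(n)_{f,m} and the resulting per-flow arm sets follow directly from the discretization above. Below is a minimal sketch of that computation; the function and variable names are our own, not the paper's:

```python
import math

def arm_sets(demands, allocated, avail_capacity, rb_size):
    """Per-flow RB upper bounds U_{f,m} and the joint arm-set size.

    demands:        estimated resource demands C_{f,m} (bits/s)
    allocated:      resources already allocated to each flow (bits/s)
    avail_capacity: available resources C_I at the switch (bits/s)
    rb_size:        resource block size B (bits/s)
    """
    n_rbs = math.floor(avail_capacity / rb_size)  # I = floor(C_I / B)
    # U_{f,m} = min[I, max(ceil((demand - allocated) / B), 0)]
    bounds = [min(n_rbs, max(math.ceil((c - a) / rb_size), 0))
              for c, a in zip(demands, allocated)]
    joint_size = 1
    for u in bounds:                 # |arm set| = prod_f (U_{f,m} + 1)
        joint_size *= u + 1
    return bounds, joint_size

bounds, size = arm_sets(demands=[30e6, 12e6, 8e6],
                        allocated=[15e6, 15e6, 15e6],
                        avail_capacity=20e6, rb_size=5e6)
# bounds == [3, 0, 0]: only the first flow has unmet demand; size == 4
```

Flows whose allocated resources already cover their demand get an upper bound of zero and thus a trivial arm set {0}, matching the per-flow arm sets sketched in Fig. 3.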

We propose the resource sharing framework to solve this contextual-bandit problem, with the resource demand estimation module for context information extraction and the allocation module of available resource blocks for arm selection. Fig. 3 shows the operation procedure of allocating available resource blocks to different flows based on their estimated resource demands. The switch executes the following steps in the m-th interval:
• For each flow f ∈ F_n, switch n estimates its current resource demand C^(n)_{f,m}, and determines its arm set A^(n)_{f,m} accordingly, as discussed in Subsection III-B. Only the flows with U^(n)_{f,m} > 0 are processed in the following steps, to reduce the computing complexity in a large-scale network scenario;
• Based on the rewards [ρ^(n)_i]_{i=1,2,...,m−1} in the previous (m − 1) intervals and C^(n)_{f,m}, switch n estimates the potential value of the reward, ρ̂^(n)_m, for each arm from A^(n)_{f,m}|_{f∈F_n}, as discussed in Subsection III-C. Then, the arm with the maximal potential reward is selected and the available resources are allocated;
• At the end of the interval, ρ^(n)_m is observed and used as the feedback reward. With this new observation, the tuple ([C^(n)_{f,m}]_{f∈F_n}, [ΔI^(n)_{f,m}]_{f∈F_n}, ρ^(n)_m), including the context information, the selected arm, and the reward, can be used to improve the arm-selection strategy. Note that only the selected arm receives reward feedback.
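The three steps above can be sketched as a per-interval driver loop. This is an illustrative skeleton only: the callables are hypothetical stand-ins for the modules of Fig. 3, and all names are our own:

```python
class Demand:
    """Context for one flow: estimated demand and RB upper bound U_{f,m}."""
    def __init__(self, demand, upper_bound):
        self.demand = demand
        self.upper_bound = upper_bound

def run_interval(flows, estimate, select_arm, apply_allocation, observe, update):
    # Step 1: per-flow resource demand estimation (context extraction);
    # flows with upper bound 0 are skipped to reduce complexity.
    demands = {f: estimate(f) for f in flows}
    active = [f for f in flows if demands[f].upper_bound > 0]
    # Step 2: per-arm reward prediction and arm selection, then allocation.
    allocation = {f: select_arm(f, demands[f]) for f in active}
    apply_allocation(allocation)
    # Step 3: observe the delay satisfaction ratio and feed back the reward
    # (only the selected arm receives feedback).
    for f in active:
        update(f, demands[f], allocation[f], observe(f))
    return allocation
```

Concrete estimator, bandit, and scheduler implementations (Algorithms 1-3) plug into the `estimate`, `select_arm`, and `apply_allocation`/`observe` slots respectively.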

B. Resource Demand Estimation

At the end of the m-th resource sharing interval, to obtain the delay satisfaction ratio ρ^(n)_{f,m} for flow f, switch n measures the delay of staying at the switch for all x_R packets from flow f arriving within interval m. Denote the delay of each packet staying at the switch as d_i, i = 1, 2, ..., x_R. The delay satisfaction ratio ρ^(n)_{f,m} is calculated as

ρ^(n)_{f,m} = (1 / x_R) Σ_{i=1}^{x_R} 1(D^(n)_f − d_i)                (4)

where 1(x) = 1 if x ≥ 0, and 1(x) = 0 otherwise. The delay measurement can be accomplished through pipelined packet processing, which includes the functionalities of reading and writing packet headers [32]. At the end of interval m, the current queue length (i.e., the initial queue length of interval (m + 1)) and the number of packets arrived during interval m, denoted by b^(n)_{f,m+1} and λ^(n)_{f,m} respectively, are measured by the switch.

At switch n, for packets from flow f arriving within interval m, the required forwarding resources for delay satisfaction are determined by a linear combination of λ^(n)_{f,m}/l and b^(n)_{f,m}/l as in [16], where l is the average per-hop delay requirement. For supporting applications with stringent delay requirements, resource allocation decisions are made by considering the delay requirement, D^(n)_f, for individual packet transmission.

[Fig. 3: Details of resource sharing framework. Per-flow inputs (per-hop delay requirements, packet arrivals, queue lengths) feed the resource demand estimation module, which builds per-flow arm sets; per-flow reward prediction and greedy available resource block allocation then produce the allocation of available resources, with the real allocated resources and delay feedback returned as reward.]

At switch n, the resource demand for flow f in interval m, C^(n)_{f,m}, is defined as the minimal amount of forwarding resources needed to satisfy ρ^(n)_{f,m} ≥ 1 − ε. Since both new packet arrivals and the packets waiting in the queue are to be processed in the next resource sharing interval, the resource demand depends on the predicted number of arrived packets λ^(n)_{f,m}, the queue length b^(n)_{f,m}, and D^(n)_f. To estimate C^(n)_{f,m}, we use a linear regression (LR) model to approximate the mapping relation between (D^(n)_f, λ^(n)_{f,m}, b^(n)_{f,m}) and C^(n)_{f,m} [37]. The LR model is widely used in engineering areas, such as signal processing and financial engineering [38]. To approximate the mapping relation, the number of arrived packets, the observed queue length, and the measured delay bound to satisfy the per-hop delay violation ratio are collected at each resource sharing interval to train the model parameters. For better approximation accuracy, we combine an online weight update method (e.g., the gradient descent algorithm in [39]) with linear regression. The detailed description of the linear regression based resource demand estimation is given in Algorithm 1, in which the initial model parameters are obtained in the training stage.

Algorithm 1: Online Linear Regression based Resource Demand Estimation Algorithm.
1   Initialize W_1
2   for m = 1, 2, 3, ... do
3       I_m ← [λ_m, λ_m/D, b_m, b_m/D, 1]
4       Estimate C_m = W_m^T I_m
5       Give C_m to the allocation of available resources module
6       At the end of the interval, observe λ_m, C_m, and D^R_m
7       I^R_m ← [λ_m, λ_m/D^R_m, b_m, b_m/D^R_m, 1]
8       Update W_{m+1} ← W_m − η_{m+1}(W_m^T I^R_m − C_m) I^R_m

The resource demand estimation module takes (D^(n)_f, λ^(n)_{f,m}, b^(n)_{f,m}) as input, and estimates the corresponding C^(n)_{f,m} as output. We omit n and f from the symbols used in Algorithm 1 for clarity. Based on the LR model, C_m is estimated as

C_m = w^1_m λ_m + w^2_m (λ_m / D) + w^3_m b_m + w^4_m (b_m / D) + w^5_m                (5)

where the weight vector in the m-th resource sharing interval is denoted as W_m = [w^1_m, ..., w^5_m], and the input vector is I_m = [λ_m, λ_m/D, b_m, b_m/D, 1].

As shown in Fig. 3, at the beginning of each resource sharing interval, say interval m, the amount of required resources is estimated and used as contextual information for the following allocation of available resources module. At the end of the interval, the switch can observe the actual number of arrived packets, λ_m, and the utilized forwarding resources, C_m. In addition, based on the measurement of the delay of each packet staying at the switch, d_i, we can determine the delay bound D^R_m that satisfies (1/x_R) Σ_{i=1}^{x_R} 1(D^R_m − d_i) = 1 − ε, with the allocated forwarding resources C_m in interval m. Hence, (D^R_m, λ_m, b_m) and C_m are used to iteratively refine the weight parameters of the LR model in (5). According to [39], W_m is updated as

W_{m+1} = W_m − η_{m+1}(W_m^T I^R_m − C_m) I^R_m                (6)

where I^R_m = [λ_m, λ_m/D^R_m, b_m, b_m/D^R_m, 1] and η_{m+1} = 1/(m + 1).
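The feature construction of (5), the online update of (6), and the measurement of the empirical delay bound D^R_m can be sketched as below. This is a simplified illustration with our own helper names, not the authors' code; the switch-side measurement pipeline is abstracted away, and a small constant step size replaces η_{m+1} = 1/(m+1) for the toy fit:

```python
import math

def delay_bound(delays, eps):
    """Empirical D^R_m: the smallest delay bound satisfied by a
    (1 - eps) fraction of the packets, mirroring the indicator sum in (4)."""
    k = math.ceil((1 - eps) * len(delays)) - 1
    return sorted(delays)[k]

def features(lam, b, d):
    """Input vector I_m = [lam, lam/d, b, b/d, 1] from (5)."""
    return [lam, lam / d, b, b / d, 1.0]

def estimate(weights, x):
    """C_m = W_m^T I_m."""
    return sum(w * xi for w, xi in zip(weights, x))

def sgd_update(weights, x, target, eta):
    """Online update (6): W <- W - eta * (W^T x - C) * x."""
    err = estimate(weights, x) - target
    return [w - eta * err * xi for w, xi in zip(weights, x)]

# fit a synthetic linear demand C = 2*lam + 3*b + 1 (d fixed to 1 here)
w = [0.0] * 5
for m in range(20000):
    lam, b = (m % 10) / 10.0, (m % 7) / 7.0
    x = features(lam, b, 1.0)
    w = sgd_update(w, x, 2 * lam + 3 * b + 1, eta=0.05)
```

Since the synthetic target is exactly linear in the features, the online updates drive the prediction error toward zero, illustrating the decreasing-error behavior reported for the online LR module in Section IV-B.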

C. Allocation of Available Resource Blocks

To determine the amount of allocated available resources, switch n first estimates the potential reward ρ̂^(n)_m of selecting each possible arm in A^(n)_{f,m}|_{f∈F_n}, which contains Π_{f∈F_n} (U^(n)_{f,m} + 1) arms. However, an increasing number of flows leads to a growth of the arm set with high computational complexity for reward estimation. To make the algorithm scalable in a large-scale network, we estimate the reward on a per-flow basis. As illustrated in Fig. 3, switch n estimates the potential per-flow reward ρ̂^(n)_{f,m} under the corresponding number of allocated available resource blocks (a ∈ A^(n)_{f,m}), i.e., ρ^(n)_{f,m} is estimated under the action of ΔI^(n)_{f,m} = a. After that, a greedy method is used to select the arm that achieves the highest ratio of the potential reward ρ̂^(n)_m over the allocated resources.


In the following, we explain in detail how the per-flow performance is estimated. This is similar to the reward estimation in a contextual bandit problem, which observes the user feature and the potentially selected arm as prior knowledge. The problem is well solved by LinUCB, a variation of the UCB method proposed as a generic contextual bandit algorithm in [36]. It is proved that a closed-form confidence interval (i.e., the deviation of reward estimation) can be computed efficiently when the reward model is linear. For each flow f, switch n runs one LinUCB instance to estimate the per-flow reward. For clarity, we present the symbols used in the LinUCB based reward estimation algorithm, where the indexes n and f are omitted. In the m-th interval, we have
1) Two-dimensional feature: x_m = [1, C^(n)_{f,m}], x_m ∈ R²;
2) Arm: a = ΔI^(n)_{f,m}, a ∈ A^(n)_{f,m};
3) Reward: r_{m,a} = ρ^(n)_{f,m}, with the action of choosing arm a.

Under the assumption that the potential reward of a given arm is linear in its observed feature x_m with an unknown parameter vector θ*_a, we have

E[r_{m,a} | x_m] = x_m^T θ*_a.                (7)

If arm a was selected m′ (m′ < m) times in the previous (m − 1) time intervals, the associated reward feedback can be used to improve the estimation of θ*_a. Let X_{m,a} and R_{m,a} denote the previous m′ feature vectors and observed rewards, where X_{m,a} ∈ R^{m′×2} and R_{m,a} ∈ R^{m′}. With training data X_{m,a} and R_{m,a}, the parameter vector θ*_a is estimated as

θ̂_{m,a} = (X_{m,a}^T X_{m,a} + I_2)^{−1} X_{m,a}^T R_{m,a}                (8)

where I_2 is the 2 × 2 identity matrix. For clarity, we denote A_{m,a} = X_{m,a}^T X_{m,a} + I_2 and b_{m,a} = X_{m,a}^T R_{m,a}. Based on the confidence interval given in [36], with probability at least (1 − δ), we have

| x_m^T θ̂_{m,a} − E[r_{m,a} | x_m] | ≤ α sqrt( x_m^T A_{m,a}^{−1} x_m )                (9)

for ∀δ > 0, where α = 1 + sqrt( ln(2/δ) / 2 ) is a constant. Based on the analysis in [36], we set α = 1.5. The inequality in (9) gives an upper bound on the reward estimated by (7)-(8). In the m-th interval, the potential reward for arm a is

r̂_{m,a} = x_m^T θ̂_{m,a} + α sqrt( x_m^T A_{m,a}^{−1} x_m ).                (10)

We define the marginal performance gain of allocating one more available RB to flow f as Δr^f_{m,a} = r̂_{m,a} − r̂_{m,a−1}, a = 1, 2, ..., U^(n)_{f,m}, where r̂_{m,a} is obtained by the LinUCB for flow f. Since a better delay satisfaction ratio should be achieved with more allocated forwarding resources, we have Δr^f_{m,a} > 0. Due to the constant RB size, a greater Δr^f_{m,a} indicates a larger delay satisfaction improvement achieved by the allocated RB. Thus, based on Δr^f_{m,a}, switch n allocates available RBs greedily to all flows in F_n. Algorithm 2 gives a detailed description of how available RBs are allocated at switch n, and the details of the greedy RB allocation are given in Algorithm 3.

Algorithm 2: Proposed Allocation Algorithm for Available Resource Blocks.
1   for m = 1, 2, 3, ... do
2       Estimate the resource demand C^(n)_{f,m} (by Algorithm 1) for all flows f ∈ F_n
3       for f = 1, 2, 3, ..., |F_n| do
4           Determine x_m = [1, C^(n)_{f,m}] and U^(n)_{f,m}
5           for a = 0, 1, 2, 3, ..., U^(n)_{f,m} do
6               if a is new then
7                   A_{m,a} ← I_2
8                   b_{m,a} ← 0_{2×1} (zero vector)
9               θ̂_{m,a} ← A_{m,a}^{−1} b_{m,a}
10              r̂_{m,a} ← x_m^T θ̂_{m,a} + α sqrt( x_m^T A_{m,a}^{−1} x_m )
11              if a = 0 then
12                  Δr^f_{m,a} = r̂_{m,a}
13              else
14                  Δr^f_{m,a} = r̂_{m,a} − r̂_{m,a−1}
15      Execute Algorithm 3
16      Implement the resource sharing decision [C^(n)_{f,m}]_{f∈F_n} through the packet scheduling module
17      At the end of the interval, observe ρ^(n)_{f,m} as the feedback for each flow f ∈ F_n, giving the real-valued reward r_{f,m}
18      for f = 1, 2, 3, ..., |F_n| do
19          a = a_{f,m}, r = r_{f,m}
20          A_{m,a} ← A_{m,a} + x_m x_m^T
21          b_{m,a} ← b_{m,a} + r x_m

[Fig. 4: Illustration of the greedy available resource block allocation.]

To demonstrate how available RBs are allocated, we use resource sharing among 5 flows as an example, shown in Fig. 4. The thick red line indicates the upper bound of the number of RBs that can be allocated to each flow, and the green blocks indicate the RBs already allocated to the flows. The blocks highlighted in orange indicate the potential RB allocation to the flows in each algorithm iteration, if the highest marginal gain (e.g., Δr^f_{m,a} for flow f) is observed. In Algorithm 3, one RB is allocated in each iteration, until all the available RBs at switch n have been allocated or all the flows have obtained sufficient RBs (i.e., their upper bounds are reached).

Algorithm 3: Greedy Allocation of Available Resource Blocks.
1   Input: Δr^f_{m,a}
2   Initialization: F_t ← {}
3   for f = 1, 2, 3, ..., |F_n| do
4       a_{f,m} ← 0
5       if U^(n)_{f,m} > 0 then
6           F_t ← F_t + {f}
7   for i = 1, 2, 3, ..., I^(n) do
8       if F_t is not empty then
9           f_i = arg max_{f∈F_t} Δr^f_{m,a}, where a = a_{f,m} + 1; randomly select f_i if multiple flows have the maximal Δr^f_{m,a}
10          a_{f_i,m} ← a_{f_i,m} + 1
11          if a_{f_i,m} = U^(n)_{f_i,m} then
12              F_t ← F_t − {f_i}
13  for f = 1, 2, 3, ..., |F_n| do
14      ΔI^(n)_{f,m} ← a_{f,m}
15  Based on [ΔI^(n)_{f,m}]_{f∈F_n}, obtain the corresponding resource sharing decision [C^(n)_{f,m}]_{f∈F_n}
16  Output: the resource sharing decision [C^(n)_{f,m}]_{f∈F_n} and the selected arm [a_{f,m}]_{f∈F_n}
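The greedy loop of Algorithm 3 can be sketched as follows (an illustrative sketch with our own names; `marginal_gain(f, a)` plays the role of Δr^f_{m,a} obtained from the per-flow LinUCB estimates):

```python
def greedy_allocate(n_available, upper_bounds, marginal_gain):
    """Allocate up to n_available RBs one at a time: each iteration gives
    one RB to the eligible flow with the highest marginal gain for its
    next RB; a flow leaves the eligible set once its upper bound is hit."""
    alloc = [0] * len(upper_bounds)
    eligible = {f for f, u in enumerate(upper_bounds) if u > 0}
    for _ in range(n_available):
        if not eligible:
            break
        best = max(eligible, key=lambda f: marginal_gain(f, alloc[f] + 1))
        alloc[best] += 1
        if alloc[best] == upper_bounds[best]:
            eligible.discard(best)
    return alloc

# toy diminishing gains: flow 0 benefits most from each extra RB, then 1, then 2
gain = lambda f, a: [0.5, 0.3, 0.2][f] / a
# 4 available RBs with upper bounds [3, 2, 2] -> allocation [2, 1, 1]
```

Ties are broken deterministically here by `max`; the paper's Algorithm 3 breaks ties randomly instead.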

D. Discussion

1) Regret analysis: As shown in (7), the potential reward is linear with respect to the observed feature x_m, and the true coefficient vector is θ*_a. Without loss of generality, we assume ‖x_m‖ ≤ L, where ‖·‖ denotes the l2-norm.

In the m-th interval, denote by [a*_{f,m}]_{f∈F_n} the best arm, which satisfies

[a*_{f,m}]_{f∈F_n} = arg max_{[a_f]_{f∈F_n}} Σ_{f∈F_n} x_m^T θ*_{a_f}                (11)

where a_f ∈ A^(n)_{f,m}. Then, comparing with the reward achieved by the selected arm [a_{f,m}]_{f∈F_n}, the M-trial regret of this resource allocation algorithm is calculated by

L_M = Σ_{m=1}^{M} ( Σ_{f∈F_n} x_m^T θ*_{a*_{f,m}} − Σ_{f∈F_n} x_m^T θ*_{a_{f,m}} ).                (12)

Considering the confidence interval in (9), with probability at least 1 − δ, we have

Σ_{f∈F_n} x_m^T θ*_{a_{f,m}} ≥ Σ_{f∈F_n} ( x_m^T θ̂_{m,a_{f,m}} − α sqrt( x_m^T A_{m,a_{f,m}}^{−1} x_m ) ).                (13)

According to the proposed available resource block allocation algorithm and (9), we obtain

Σ_{f∈F_n} x_m^T θ*_{a*_{f,m}} ≤ Σ_{f∈F_n} ( x_m^T θ̂_{m,a*_{f,m}} + α sqrt( x_m^T A_{m,a*_{f,m}}^{−1} x_m ) )
                              ≤ Σ_{f∈F_n} ( x_m^T θ̂_{m,a_{f,m}} + α sqrt( x_m^T A_{m,a_{f,m}}^{−1} x_m ) ).                (14)

Thus,

L_M ≤ Σ_{m=1}^{M} Σ_{f∈F_n} 2α sqrt( x_m^T A_{m,a_{f,m}}^{−1} x_m ).                (15)

To simplify (15), we apply Lemma 4.4 and Lemma 4.5 in [40]:

Σ_{m=1}^{M} x_m^T A_{m,a_{f,m}}^{−1} x_m ≤ 2 log( |A_{M,a_{f,m}}| / β^d )                (16)

|A_{m,a_{f,m}}| ≤ (β + mL²/d)^d                (17)

where d ≥ 1 and β ≥ max(1, L²). Applying the lemmas to (15), we have

L_M ≤ 2α Σ_{f∈F_n} ( Σ_{m=1}^{M} sqrt( x_m^T A_{m,a_{f,m}}^{−1} x_m ) )
    ≤ 2α Σ_{f∈F_n} sqrt( M Σ_{m=1}^{M} x_m^T A_{m,a_{f,m}}^{−1} x_m )
    ≤ 2α Σ_{f∈F_n} sqrt( 2M log( |A_{M,a_{f,m}}| / β^d ) )
    ≤ 2α Σ_{f∈F_n} sqrt( 2Md · log( 1 + ML²/(βd) ) )
    ≤ 2αF sqrt(2Md) · sqrt( log(1 + M/d) ).                (18)

In (18), the M-trial regret bound, O( sqrt( Md · log(1 + M/d) ) ), indicates the zero-regret property¹, i.e., lim_{M→∞} L_M / M = 0. A zero-regret strategy is guaranteed to converge to an optimal strategy after enough rounds are played [41]. Thus, the proposed algorithm is proved to asymptotically approach the optimal strategy [42].
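The zero-regret property can be checked numerically: the bound in (18) grows like sqrt(M log M), so the average regret per interval shrinks as M grows. A quick check with the paper's α = 1.5 (the values of F and d below are chosen arbitrarily for illustration):

```python
import math

def regret_bound(M, F=50, d=2, alpha=1.5):
    """Right-hand side of (18): 2*alpha*F*sqrt(2*M*d)*sqrt(log(1 + M/d))."""
    return 2 * alpha * F * math.sqrt(2 * M * d) * math.sqrt(math.log(1 + M / d))

avg_regret = [regret_bound(M) / M for M in (10**2, 10**4, 10**6)]
# the per-interval average regret bound is strictly decreasing toward zero
```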

2) Time complexity analysis: To analyze the scalability of the proposed resource sharing scheme in terms of the number of flows (F), we evaluate the time complexity of each stage in Fig. 3 in one interval. First, the resource demand estimation module runs Algorithm 1 for each flow separately, which leads to O(F) time complexity. Then, in Algorithm 2, the reward is calculated for all the arms, where flow f has (U^(n)_{f,m} + 1) arms, according to its resource demand. Thus, the complexity is proportional to the total number of arms, Σ_{f∈F_n} (U^(n)_{f,m} + 1). The overall number of resource blocks required by the flows should be comparable to the number of available resource blocks (I^(n)); otherwise, the network would incur unstable queue accumulation. We obtain the complexity of the reward calculation in Algorithm 2 as O(I^(n)). After that, each resource block is allocated by comparing the marginal reward among candidate flows, as given in Algorithm 3. The maximal time complexity is O(F), when all F flows belong to the candidate set. As the comparison is conducted at most I^(n) times, the time complexity of Algorithm 3 is O(F·I^(n)). Therefore, combining the complexity of the three algorithms, the time complexity of the resource sharing scheme is O(F·I^(n)).

¹A strategy whose average regret per round tends to zero as the horizon tends to infinity is defined as a zero-regret strategy [41].

TABLE I: Parameters of traffic flow
    Number of flows (F)                50
    Average traffic rate               15 Mbps
    Delay violation tolerance (ε)      0.5%
    Per-hop delay requirement          0.5 ms - 10 flows
      (per number of flows)            0.7 ms - 15 flows
                                       1.5 ms - 25 flows

Since I^(n) is determined by the amount of available resources and the resource block size, we can set the value of I^(n) by adjusting the block size. If we set I^(n) to a fixed value, the complexity is reduced to O(F). However, when the number of flows increases, the resource sharing scheme with a fixed I^(n) may suffer degraded performance. In particular, if there are more flows than resource blocks, i.e., F > I^(n), the resources allocated to different flows are distinct even if all the flows have the same features. To avoid this inconsistency between the allocated resources and the observed features due to insufficient resource blocks, we set I^(n) ∝ F, and polynomial time complexity, O(F²), is achieved. Note that if the resource block size has to be set to a small value as F increases in order to satisfy I^(n) ∝ F, the improvement in delay satisfaction from allocating one resource block could be insignificant, leading to a low marginal gain. Hence, a lower limit on the resource block size needs to be determined properly, considering the allocation efficiency, which remains an important and open issue for the proposed algorithm.

IV. NUMERICAL RESULTS

To demonstrate the effectiveness of the proposed resource sharing scheme, numerical results are presented.

A. Simulation Scenario

In our work, the resource allocation algorithm is designed for each switch, aiming at satisfying the per-hop delay requirement. To evaluate the performance of the proposed algorithm, we consider the packet transmission at a network switch in the simulation. For the F traffic flows traversing the switch, the resource sharing among them is simulated, and the delay of each packet passing the switch is measured, as shown in Fig. 5. The network switch, consisting of the transmission queues, packet scheduler, resource sharing module, and delay measurement module, is emulated on a high-performance computer server using the Python IDE PyCharm [43]. For each traffic flow, its packets enter the corresponding transmission queue and wait for transmission scheduling. Given the amount of resources allocated to each traffic flow, the packet scheduler makes packet forwarding decisions, i.e., adjusting the forwarding rate of each flow. During packet transmission, the delay measurement module records the packet reception time and departure time to measure the packet delay at the switch. At the end of each resource sharing interval, the delay measurement module calculates the delay satisfaction ratio for each flow, which is utilized by the resource sharing module to make future resource allocation decisions.

[Fig. 5: Illustration of the simulation scenario.]

The detailed parameters of the traffic flows are given in Table I. To simulate flows of different services, we divide the 50 flows into 3 subsets with different per-hop delay requirements. The duration of each resource sharing interval is 5 ms. From Table I, the average traffic rate at the switch is 750 Mbps, considering 50 flows traversing the switch with an average traffic rate of 15 Mbps each. In reality, traffic volumes differ between peak and off-peak hours, which leads to varying average resource utilization of the switch. To simulate these variations, we set the switch forwarding resources to 1250 Mbps for the off-peak hour case (i.e., 60% resource utilization) and to 833 Mbps for the peak hour case (i.e., 90% resource utilization). In addition to traffic load variations, we consider different pre-allocated resource conditions that impact the demand for resource sharing, including the over-provisioning and on-demand matching cases. The pre-allocated conditions are determined by both the pre-allocated forwarding resources and the average traffic rate of each flow (15 Mbps in our simulation). Parameters of these cases are given in Table II, where the minimal resource allocation ratio is the ratio of the overall pre-allocated resources over the switch forwarding resources.
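The traffic configuration of Table I can be written down directly; the sketch below (our own structure, not the simulator's code) also checks the 750 Mbps aggregate rate that fixes the 60% and 90% utilization points:

```python
# Three delay classes from Table I: 10 + 15 + 25 = 50 flows, 15 Mbps each.
FLOW_CLASSES = [
    {"per_hop_delay_ms": 0.5, "num_flows": 10},
    {"per_hop_delay_ms": 0.7, "num_flows": 15},
    {"per_hop_delay_ms": 1.5, "num_flows": 25},
]
AVG_RATE_MBPS = 15.0
INTERVAL_MS = 5.0  # resource sharing interval duration

def build_flows():
    """Expand the class table into per-flow records."""
    return [{"delay_req_ms": c["per_hop_delay_ms"], "rate_mbps": AVG_RATE_MBPS}
            for c in FLOW_CLASSES for _ in range(c["num_flows"])]

flows = build_flows()
total_rate = sum(f["rate_mbps"] for f in flows)   # 750 Mbps aggregate
# utilization: 750/1250 = 60% off-peak, 750/833 ~= 90% peak
```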

B. Performance of Resource Demand Estimation

To evaluate the accuracy of resource demand estimation, we divide the traffic data trace into two sets: one is used as the training data set (including 489,930 observations) and the remainder is used for testing. Each observation represents the flow transmission status during one resource sharing interval, consisting of an input vector (the number of arrived packets, the initial queue length, and the measured delay bound to satisfy the per-hop delay violation ratio) and the output (the allocated forwarding resources). In the training stage, all the observations are fed into the LR module given by (5), and a weight vector W ∈ R⁵ is obtained. Then, we apply the online resource demand estimation algorithm with the initial vector W to estimate the required resources. The estimation error in each interval is the difference between the estimated and the actual amount of allocated resources. To show the performance of resource demand estimation, the moving average of errors over 10 consecutive intervals is shown in Fig. 6, where "online LR" refers to the proposed method and "LR" to the method that uses W for estimation without weight updates. Without the weight updates, the error of the LR method varies between 15% and 35%. However, due to the weight updates in (6), the estimation error of online LR shows a decreasing trend in Fig. 6, and stabilizes around 1% after 1500 intervals. Therefore, the proposed resource estimation module is capable of providing an accurate resource demand estimate for the allocation of available resources module.

TABLE II: Parameters of switch
    Network traffic condition              Off-peak           Off-peak          Peak
    Switch forwarding resources (Mbps)     1250               1250              833
    Available resource block size (Mbps)   3.5                5                 0.83
    Pre-allocated resources (Mbps / flow)  18                 15                15
    Minimal resource allocation ratio      0.72               0.6               0.9
    Pre-allocated condition                Over-provisioning  Matching traffic  Matching traffic

[Fig. 6: Error of resource demand estimation (moving average of the online LR and LR methods over the resource sharing intervals).]

C. Delay Reduction via Resource Sharing

To evaluate the delay performance of the resource sharing scheme, we determine the proportion of flows that meet their delay requirements (i.e., satisfying ρ^(n)_{f,m} ≥ 1 − ε), referred to as the delay guaranteed proportion. In terms of cost, we measure the resource allocation fraction, including both the pre-allocated resources and the allocated available resources. For comparison between different resource sharing schemes, the cumulative distribution functions (CDFs) of the packet delay, the delay guaranteed proportion, and the resource allocation gain, which is the ratio of the delay guaranteed proportion over the resource allocation ratio, are calculated over 4,000 intervals.

The proposed MAB based resource sharing scheme is compared with an on-switch resource allocation approach, the dynamic WFQ (D-WFQ) method [16]. D-WFQ is an enhanced version of WFQ for dynamic traffic conditions, considering the differentiated per-hop delay requirements of traffic flows. We also compare with the resource sharing scheme without the resource demand estimation module (MAB-WO). In the MAB-WO method, the allocation of available resources module directly uses the per-hop delay requirements, the predicted traffic arrival, and the queue lengths as input, instead of the estimated resource demands shown in Fig. 3. To show the performance improvement achieved by utilizing available resources, we also simulate packet transmission with only the pre-allocated resources given in Table II (pre-allocated). Finally, with the forwarding resources fully used, we make a fixed resource sharing decision that achieves the optimal accumulative delay guaranteed proportion over the simulation intervals (optimal).

When the flows are pre-allocated with resources matching their packet arrival rates, the packet delay exhibits a long-tailed distribution, as shown in Fig. 7, which leads to a delay guaranteed proportion below 90% in 20% of the intervals. However, during off-peak hours, all four resource sharing schemes achieve nearly 100% delay guaranteed proportion by allocating available resources. Since the MAB and MAB-WO schemes require fewer forwarding resources than the D-WFQ and optimal schemes, they achieve higher resource allocation gain, as shown in Fig. 7. This is because the MAB schemes learn to allocate the available resources to the flows with the highest marginal gain, through the greedy allocation in Algorithm 3. Comparing the MAB and MAB-WO schemes, the resource demand estimation module makes it possible to identify the flows with stringent delay requirements and set higher upper bounds on the allocated available resource blocks. As more available resources can be allocated to the flows with stringent requirements, MAB outperforms MAB-WO in most resource sharing intervals in terms of resource allocation gain. In the over-provisioning situation, we simulate the case where each flow is pre-allocated with more resources than necessary on average. A better delay guaranteed proportion is shown in Fig. 8, compared with the on-demand matching case. However, due to the high resource allocation ratio in the over-provisioning case, its resource allocation gain is lower than in the on-demand matching case.

In the peak-hour scenario, all four resource sharing schemestend to make full use of the resources to support the heavytraffic, and experience delay guaranteed proportion higher

[Fig. 7: The performance in an off-peak hour condition with matching traffic resources: (a) packet delay; (b) delay guaranteed proportion; (c) resource allocation gain.]

[Fig. 8: The performance in an off-peak hour condition with over-provisioning resources: (a) packet delay; (b) delay guaranteed proportion; (c) resource allocation gain.]

[Fig. 9: The performance in a peak-hour condition with matching traffic resources: (a) packet delay; (b) delay guaranteed proportion; (c) resource allocation gain.]

than 90% in 80% of the simulation intervals, as shown in Fig. 9. Compared with transmission using only the pre-allocated resources, the MAB method achieves a higher delay guaranteed proportion by avoiding the long-tailed distribution of packet delay. Due to the small amount of available resources during peak hours, it is unlikely that the required amount of available resources can be allocated to a flow based on the estimated resource demand. Thus, compared with the off-peak hour case, the difference between the MAB and MAB-WO methods vanishes during peak hours.

The highest resource allocation gain is obtained when a 100% delay guaranteed proportion is achieved with the pre-allocated resources. Thus, the resource allocation gain is bounded by the reciprocal of the minimal resource allocation ratio (i.e., the ratio of the overall pre-allocated resources over the switch forwarding resources). Since the optimal and D-WFQ schemes make full use of the available forwarding resources, their resource allocation gain is bounded by 1. However, the MAB and MAB-WO schemes can achieve a better delay guaranteed proportion with fewer resources, and their gains outperform the optimal and D-WFQ schemes. Due to the errors in the starting stage of resource demand estimation, the MAB scheme experiences over-allocation of available resources compared with the subsequent stable stages. Thus, there is a step change in the resource allocation gain, which becomes negligible with more fine-grained available resource block sizes. As shown in Fig. 7 to Fig. 9, the step shrinks as the available resource block size decreases from 5 Mbps, to 3.5 Mbps, and to 0.83 Mbps, respectively.


D. Adaptive Resource Sharing

[Fig. 10: Resource sharing performance in a peak condition with matching traffic resources (delay guaranteed proportion of D-WFQ, MAB, and MAB-WO, and the normalized traffic rate, over the resource sharing intervals).]

Due to insufficient forwarding resources, instantaneous delay degradation occurs during traffic peaks, such as in intervals 25, 87, and 125, as shown in Fig. 10, where the real-time packet arrival rate is normalized to the average traffic rate. Although all three schemes experience degraded performance, both the MAB and MAB-WO schemes outperform D-WFQ, due to the proactive resource allocation.

[Fig. 11: The response of the proposed resource sharing scheme to traffic peaks: (a) traffic peak 1; (b) traffic peak 2 (delay guaranteed proportion and resource allocation ratio over the intervals).]

Furthermore, adapting to traffic dynamics requires a fast resource allocation response when a traffic peak arrives. Fig. 11 shows how the MAB scheme adaptively allocates resources. During a traffic peak, the packet delay increases and more forwarding resources are needed. To obtain the response time to a traffic peak, we focus on the time instants where the resource allocation ratio starts to increase or falls back to a stable level. Compared with the traffic peak duration, it is observed that the response time is at most one interval, i.e., 5 ms in our simulation, both when a traffic peak appears and when it disappears. Therefore, the proposed resource sharing scheme demonstrates adaptation capability to traffic dynamics by allocating suitable resources to the flows.

V. CONCLUSION

In this paper, we have proposed a novel learning-based proactive resource sharing scheme to maximize resource utilization efficiency with delay satisfaction. Two modules, for online resource demand estimation and available resource allocation, are jointly developed to achieve efficient resource sharing at each network switch. To learn the implicit relation between the allocated resources and the differentiated delay requirements of traffic flows from different services, a multi-armed bandit learning-based resource allocation scheme is proposed, which enables fast and proactive resource adjustment upon traffic variations. During data transmission, delay satisfaction ratios are measured as reward feedback to refine the learning parameters for better convergence. The proposed scheme is proved to asymptotically approach the optimal strategy with polynomial time complexity. Extensive simulation results are presented to demonstrate both the advantages of the proposed resource sharing scheme over conventional schemes and its robustness to traffic dynamics. For future work, the proposed joint resource demand estimation and resource allocation framework will be extended for reliable end-to-end packet transmission.
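As a rough illustration of the bandit loop summarized above, the sketch below runs a generic UCB1 learner whose arms are candidate resource-allocation levels and whose reward plays the role of a measured delay satisfaction ratio. The arm set, reward model, and horizon are illustrative assumptions; the paper's contextual features and exact update rules are not reproduced here.

```python
# Generic UCB1 loop: arms = candidate allocation ratios; reward = a stand-in
# for the measured delay-satisfaction ratio (higher allocation -> higher
# satisfaction, saturating at 1.0, with small noise). All numbers are
# illustrative, not the paper's configuration.
import math
import random

random.seed(0)
arms = [0.8, 1.0, 1.2]          # candidate allocation ratios (assumed)
counts = [0] * len(arms)        # pulls per arm
values = [0.0] * len(arms)      # running mean reward per arm

def reward(ratio):
    # Stand-in reward model for the delay-satisfaction feedback.
    return min(1.0, 0.7 * ratio + random.uniform(-0.05, 0.05))

for t in range(1, 201):
    if 0 in counts:             # play each arm once before using UCB indices
        a = counts.index(0)
    else:                       # pick the arm with the largest UCB1 index
        a = max(range(len(arms)),
                key=lambda i: values[i] + math.sqrt(2 * math.log(t) / counts[i]))
    r = reward(arms[a])
    counts[a] += 1
    values[a] += (r - values[a]) / counts[a]   # incremental mean update

best = arms[max(range(len(arms)), key=lambda i: values[i])]
print(best)  # allocation level with the highest estimated satisfaction
```

Under this stand-in reward model the learner converges toward the highest allocation level, since it yields the highest satisfaction ratio; in the paper's setting the trade-off against resource efficiency is what makes the learned allocation non-trivial.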

ACKNOWLEDGMENT

The authors would like to thank the undergraduate research assistant, Mingxi Zhang, for her help in conducting simulations.

REFERENCES

[1] S. Manjunath and G. Raina, “Stability and performance of compound TCP with a proportional integral queue policy,” IEEE Trans. Contr. Syst. Technol., vol. 27, no. 5, pp. 2139–2155, Sept. 2019.

[2] M. Dong, Q. Li, D. Zarchy, P. B. Godfrey, and M. Schapira, “PCC: Re-architecting congestion control for consistent high performance,” in Proc. USENIX, May 2015, pp. 395–408.

[3] N. Cardwell, Y. Cheng, C. S. Gunn, S. H. Yeganeh, and V. Jacobson, “BBR: Congestion-based congestion control,” Commun. ACM, vol. 60, no. 2, pp. 58–66, Feb. 2017.

[4] S. Floyd, K. Ramakrishnan, and D. L. Black, “The addition of explicit congestion notification (ECN) to IP,” IETF, RFC 3168, Sept. 2001.

[5] G. Carlucci, L. De Cicco, and S. Mascolo, “Controlling queuing delays for real-time communication: the interplay of E2E and AQM algorithms,” ACM SIGCOMM Comput. Commun. Rev., vol. 46, no. 3, pp. 1–7, Jul. 2018.

[6] J. Han, K. Xue, Y. Xing, P. Hong, and D. S. Wei, “Measurement and redesign of BBR-based MPTCP,” in Proc. ACM SIGCOMM, Aug. 2019, pp. 75–77.

[7] J. Chen, Q. Ye, W. Quan, S. Yan, P. T. Do, P. Yang, W. Zhuang, X. Shen, X. Li, and J. Rao, “SDATP: An SDN-based traffic-adaptive and service-oriented transmission protocol,” IEEE Trans. Cogn. Commun. Netw., vol. 6, no. 2, pp. 756–770, Jun. 2020.


[8] C. Gao, V. Rajabian-Schwart, W. Zhang, G. Xue, and J. Tang, “How would you like your packets delivered? An SDN-enabled open platform for QoS routing,” in Proc. IEEE/ACM IWQoS, Jun. 2018, pp. 1–10.

[9] J. W. Guck, A. Van Bemten, and W. Kellerer, “DetServ: Network models for real-time QoS provisioning in SDN-based industrial environments,” IEEE Trans. Netw. Service Manag., vol. 14, no. 4, pp. 1003–1017, Sept. 2017.

[10] Q. Ye, W. Zhuang, X. Li, and J. Rao, “End-to-end delay modeling for embedded VNF chains in 5G core networks,” IEEE Internet Things J., vol. 6, no. 1, pp. 692–704, Feb. 2019.

[11] C. Dovrolis, D. Stiliadis, and P. Ramanathan, “Proportional differentiated services: Delay differentiation and packet scheduling,” IEEE/ACM Trans. Networking, vol. 10, no. 1, pp. 12–26, Feb. 2002.

[12] E. Davies, M. A. Carlson, W. Weiss, D. Black, S. Blake, and Z. Wang, “An architecture for differentiated services,” IETF, RFC 2475, Dec. 1998.

[13] A. Demers, S. Keshav, and S. Shenker, “Analysis and simulation of a fair queueing algorithm,” ACM SIGCOMM Comput. Commun. Rev., vol. 19, no. 4, pp. 1–12, Sept. 1989.

[14] M. Shreedhar and G. Varghese, “Efficient fair queuing using deficit round-robin,” IEEE/ACM Trans. Networking, vol. 4, no. 3, pp. 375–385, Jun. 1996.

[15] M. Menth, M. Mehl, and S. Veith, “Deficit round robin with limited deficit savings (DRR-LDS) for fairness among TCP users,” in Proc. MMB, Jan. 2018, pp. 188–201.

[16] C.-C. Li, S.-L. Tsao, M. C. Chen, Y. Sun, and Y.-M. Huang, “Proportional delay differentiation service based on weighted fair queuing,” in Proc. IEEE ICCCN, Oct. 2000, pp. 418–423.

[17] K. Bao, J. D. Matyjas, F. Hu, and S. Kumar, “Intelligent software-defined mesh networks with link-failure adaptive traffic balancing,” IEEE Trans. Cogn. Commun. Netw., vol. 4, no. 2, pp. 266–276, Jun. 2018.

[18] K. Chen and L. Huang, “Timely-throughput optimal scheduling with prediction,” IEEE/ACM Trans. Networking, vol. 26, no. 6, pp. 2457–2470, Dec. 2018.

[19] M. D. F. De Grazia, D. Zucchetto, A. Testolin, A. Zanella, M. Zorzi, and M. Zorzi, “QoE multi-stage machine learning for dynamic video streaming,” IEEE Trans. Cogn. Commun. Netw., vol. 4, no. 1, pp. 146–161, Mar. 2018.

[20] J. Li, W. Shi, N. Zhang, and X. Shen, “Delay-aware VNF scheduling: A reinforcement learning approach with variable action set,” IEEE Trans. Cogn. Commun. Netw., DOI: 10.1109/TCCN.2020.2988908, pp. 1–15, Apr. 2020.

[21] L. Wang, X. Wang, M. Tornatore, K. J. Kim, S. M. Kim, D.-U. Kim, K.-E. Han, and B. Mukherjee, “Scheduling with machine-learning-based flow detection for packet-switched optical data center networks,” IEEE J. Opt. Commun. Netw., vol. 10, no. 4, pp. 365–375, Apr. 2018.

[22] I.-S. Comsa, S. Zhang, M. E. Aydin, P. Kuonen, Y. Lu, R. Trestian, and G. Ghinea, “Towards 5G: A reinforcement learning-based scheduling solution for data traffic management,” IEEE Trans. Netw. Service Manag., vol. 15, no. 4, pp. 1661–1675, Dec. 2018.

[23] J. Luo, X. Su, and B. Liu, “A reinforcement learning approach for multipath TCP data scheduling,” in Proc. IEEE CCWC, Jan. 2019, pp. 0276–0280.

[24] J. Zhu, Y. Song, D. Jiang, and H. Song, “A new deep-Q-learning-based transmission scheduling mechanism for the cognitive Internet of Things,” IEEE Internet Things J., vol. 5, no. 4, pp. 2375–2385, Aug. 2017.

[25] O. Avner and S. Mannor, “Multi-user communication networks: A coordinated multi-armed bandit approach,” IEEE/ACM Trans. Networking, vol. 27, no. 6, pp. 2192–2207, Dec. 2019.

[26] A. Mukherjee, S. Misra, V. S. P. Chandra, and M. S. Obaidat, “Resource-optimized multi-armed bandit based offload path selection in edge UAV swarms,” IEEE Internet Things J., vol. 6, no. 3, pp. 4889–4896, Jun. 2019.

[27] W. Wu, N. Cheng, N. Zhang, P. Yang, W. Zhuang, and X. Shen, “Fast mmWave beam alignment via correlated bandit learning,” IEEE Trans. Wireless Commun., vol. 18, no. 12, pp. 5894–5908, Dec. 2019.

[28] P. Auer, “Using confidence bounds for exploitation-exploration trade-offs,” J. Mach. Learn. Res., vol. 3, pp. 397–422, Nov. 2002.

[29] W. Chu, L. Li, L. Reyzin, and R. Schapire, “Contextual bandits with linear payoff functions,” in Proc. AISTATS, Apr. 2011, pp. 208–214.

[30] M. M. Tajiki, S. Salsano, L. Chiaraviglio, M. Shojafar, and B. Akbari, “Joint energy efficient and QoS-aware path allocation and VNF placement for service function chaining,” IEEE Trans. Netw. Service Manag., vol. 16, no. 1, pp. 374–388, Mar. 2019.

[31] Q. Ye, J. Li, K. Qu, W. Zhuang, X. Shen, and X. Li, “End-to-end quality of service in 5G networks: Examining the effectiveness of a network slicing framework,” IEEE Veh. Technol. Mag., vol. 13, no. 2, pp. 65–74, Jun. 2018.

[32] “OpenFlow switch specification,” Open Networking Foundation, Mar. 2015.

[33] M. Shahbaz, S. Choi, B. Pfaff, C. Kim, N. Feamster, N. McKeown, and J. Rexford, “PISCES: A programmable, protocol-independent software switch,” in Proc. ACM SIGCOMM, Aug. 2016, pp. 525–538.

[34] O. Alhussein, P. T. Do, Q. Ye, J. Li, W. Shi, W. Zhuang, X. Shen, X. Li, and J. Rao, “A virtual network customization framework for multicast services in NFV-enabled core networks,” IEEE J. Select. Areas Commun., vol. 38, no. 6, pp. 1025–1039, Jun. 2020.

[35] O. Alhussein and W. Zhuang, “Robust online composition, routing and NF placement for NFV-enabled services,” IEEE J. Select. Areas Commun., vol. 38, no. 6, pp. 1089–1101, Jun. 2020.

[36] L. Li, W. Chu, J. Langford, and R. E. Schapire, “A contextual-bandit approach to personalized news article recommendation,” in Proc. ACM WWW, Apr. 2010, pp. 661–670.

[37] X. Yan and X. Su, Linear Regression Analysis: Theory and Computing. World Scientific, 2009.

[38] Y. Feng and D. P. Palomar, “A signal processing perspective on financial engineering,” Found. Trends Signal Process., vol. 9, no. 1–2, pp. 1–231, 2016.

[39] E. Hazan, A. Rakhlin, and P. L. Bartlett, “Adaptive online gradient descent,” in Proc. NIPS, Dec. 2007, pp. 65–72.

[40] L. Zhou, “A survey on contextual multi-armed bandits,” arXiv:1508.03326, pp. 1–43, Aug. 2015.

[41] J. Vermorel and M. Mohri, “Multi-armed bandit algorithms and empirical evaluation,” in Proc. ECML, Oct. 2005, pp. 437–448.

[42] P. Yang, N. Zhang, S. Zhang, L. Yu, J. Zhang, and X. Shen, “Content popularity prediction towards location-aware mobile edge caching,” IEEE Trans. Multimedia, vol. 21, no. 4, pp. 915–929, Apr. 2019.

[43] “PyCharm.” [Online]. Available: https://www.jetbrains.com/pycharm/

Jiayin Chen received the B.E. degree and the M.S. degree from the School of Electronics and Information Engineering, Harbin Institute of Technology, Harbin, China, in 2014 and 2016, respectively. She is currently pursuing the Ph.D. degree with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, Canada. Her research interests are in the area of vehicular networks and machine learning, with current focus on intelligent transport systems and big data.

Peng Yang (IEEE S’16-M’18) received his B.E. degree in Communication Engineering and Ph.D. degree in Information and Communication Engineering from Huazhong University of Science and Technology (HUST), Wuhan, China, in 2013 and 2018, respectively. He was with the Department of Electrical and Computer Engineering, University of Waterloo, Canada, as a Visiting Ph.D. Student from Sept. 2015 to Sept. 2017, and a Postdoctoral Fellow from Sept. 2018 to Dec. 2019. Since Jan. 2020, he has been a faculty member with the School of Electronic Information and Communications, HUST. His current research focuses on next-generation networking, mobile edge computing, video streaming and analytics.


Qiang Ye received the Ph.D. degree in electrical and computer engineering from the University of Waterloo, Waterloo, ON, Canada, in 2016. He was a Postdoctoral Fellow with the Department of Electrical and Computer Engineering, University of Waterloo, from December 2016 to November 2018, where he was a Research Associate from December 2018 to September 2019. He has been an Assistant Professor with the Department of Electrical and Computer Engineering and Technology, Minnesota State University, Mankato, MN, USA, since September 2019. His current research interests include 5G networks, software-defined networking and network function virtualization, network slicing, artificial intelligence and machine learning for future networking, protocol design, and end-to-end performance analysis for the Internet of Things.

Weihua Zhuang (IEEE M’93–SM’01–F’08) has been with the Department of Electrical and Computer Engineering, University of Waterloo, Canada, since 1993, where she is currently a Professor and a Tier I Canada Research Chair of wireless communication networks. She was a recipient of the 2017 Technical Recognition Award from the IEEE Communications Society Ad Hoc and Sensor Networks Technical Committee, and several best paper awards from IEEE conferences. She was the Technical Program Symposia Chair of the IEEE GLOBECOM 2011 and the Technical Program Chair/Co-Chair of the IEEE VTC in 2016 and 2017. She is a Fellow of the Royal Society of Canada, the Canadian Academy of Engineering, and the Engineering Institute of Canada. She is an Elected Member of the Board of Governors and VP Publications of the IEEE Vehicular Technology Society. She was an IEEE Communications Society Distinguished Lecturer from 2008 to 2011. She was the Editor-in-Chief of the IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY from 2007 to 2013.

Xuemin (Sherman) Shen (IEEE M’97-SM’02-F’09) received the Ph.D. degree in electrical engineering from Rutgers University, New Brunswick, NJ, USA, in 1990. He is currently a University Professor with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, Canada. His research focuses on network resource management, wireless network security, social networks, 5G and beyond, and vehicular ad hoc and sensor networks. He is a registered Professional Engineer of Ontario, Canada, an Engineering Institute of Canada Fellow, a Canadian Academy of Engineering Fellow, a Royal Society of Canada Fellow, a Chinese Academy of Engineering Foreign Fellow, and a Distinguished Lecturer of the IEEE Vehicular Technology Society and Communications Society.

Dr. Shen received the R.A. Fessenden Award in 2019 from IEEE, Canada, the James Evans Avant Garde Award in 2018 from the IEEE Vehicular Technology Society, the Joseph LoCicero Award in 2015 and the Education Award in 2017 from the IEEE Communications Society. He has also received the Excellent Graduate Supervision Award in 2006 from the University of Waterloo, and the Premier’s Research Excellence Award (PREA) in 2003 from the Province of Ontario, Canada. He served as the Technical Program Committee Chair/Co-Chair for the IEEE Globecom’16, the IEEE Infocom’14, the IEEE VTC’10 Fall, the IEEE Globecom’07, the Symposia Chair for the IEEE ICC’10, the Tutorial Chair for the IEEE VTC’11 Spring, and the Chair for the IEEE Communications Society Technical Committee on Wireless Communications. He was the Editor-in-Chief of the IEEE INTERNET OF THINGS JOURNAL and Vice President on Publications of the IEEE Communications Society.

Xu Li is a senior principal researcher at Huawei Technologies Canada. He received a PhD (2008) degree from Carleton University, an MSc (2005) degree from the University of Ottawa, and a BSc (1998) degree from Jilin University, China, all in computer science. Prior to joining Huawei, he worked as a research scientist (with tenure) at Inria, France. His current research interests are focused in 5G. He contributed extensively to the development of 3GPP 5G standards through 90+ standard proposals. He has published 100+ refereed scientific papers and holds 30+ issued US patents. He is/was on the editorial boards of IEEE Communications Magazine, the IEEE Transactions on Parallel and Distributed Systems, the Wiley Transactions on Emerging Telecommunications Technologies, and a number of other international archive journals.