
Fast Markov Decision Process for Data Collection in Sensor Networks

Thai Duong and Thinh Nguyen
School of Electrical Engineering and Computer Science

Oregon State University

Abstract—We investigate the data collection problem in sensor networks. The network consists of a number of stationary sensors deployed at different sites for sensing and storing data locally. A mobile element moves from site to site to collect data from the sensors periodically. There are different costs associated with the mobile element moving from one site to another, and different rewards for obtaining data at different sensors. Furthermore, the costs and the rewards are assumed to change abruptly. The goal is to find a "fast" optimal movement pattern/policy of the mobile element that optimizes for the costs and rewards in non-stationary environments. We propose a novel optimization framework called Fast Markov Decision Process (FMDP) to solve this problem. The proposed FMDP framework extends the classical Markov Decision Process theory by incorporating the notion of mixing time, which allows for a tradeoff between the optimality of a policy and the convergence rate to that optimality. Theoretical and simulation results are provided to verify the proposed approach.

Keywords—Fast Markov Decision Process, Data Collection, Mobile Element, Sensor Networks, Mixing Time.

I. Introduction

Wireless Sensor Networks (WSNs) have been deployed in many applications, including but not limited to tracking and monitoring of cattle in agriculture, studies of species migration patterns in the biological sciences, and detection of hazardous conditions. A typical WSN consists of a large number of battery-operated sensors that sense and collect data from the environment. The data are then sent to a sink node via a multi-hop routing scheme [1]. As a result, the sensors that are closer to the sink consume more energy for relaying data from other sensors. Since communication energy constitutes a large percentage of the overall energy consumption, the batteries of these sensors drain much earlier than others. Consequently, the lifetime of the entire sensor network is shortened significantly.

A number of strategies have been proposed to address this energy problem [2][3][4][5]. One approach is to avoid multi-hop communication altogether. Instead, one or more mobile elements (MEs) travel around the network and collect data from the sensors. The MEs then hand off the collected data to the sink when they are in its vicinity. This approach not only enhances the lifetime of the sensor network, but also allows data collection in environments where wireless communication is not efficient or possible, e.g., when sensors are geographically far apart or when they are placed in a forest where water in the trees can attenuate the wireless signals significantly. On the other hand, relying on MEs to deliver the data increases the latency and the chance of buffer overflow at a sensor node, due to the possibility that the ME does not visit the node often enough to collect its data. Therefore, there have been a number of algorithms to find an optimal movement pattern of the ME that maximizes the amount of data collected in a given period of time while satisfying constraints on latency and buffer overflow [2][4][5].

This work is supported partially by the NSF CAREER grant CNS-0845476.

That said, an optimal movement pattern or policy is determined by a number of parameters, including (1) the cost associated with the ME moving from one sensor to another and (2) the various rates of useful data being collected at different sensors. In many cases, finding an optimal movement policy can be cast as a classic Markov Decision Process (MDP) problem [6]. In this setting, the solution to the MDP problem is a policy that tells the ME where to go next, given its current location, so that over time the optimal policy maximizes a pre-specified objective. We note that an optimal MDP policy aims to maximize the given objective in the long run under the assumption of a stationary environment. On the other hand, many real-world settings are non-stationary, where the parameters can vary unexpectedly. Therefore, the optimal movement policy should be designed not only to maximize the objective but also to be robust against these unexpected changes. In this paper, we formalize the problem of finding the optimal movement policy for the ME in non-stationary environments as a random walk on a graph [7][8] and solve it using a novel Fast Markov Decision Process (FMDP) framework. Our proposed FMDP framework extends the classical MDP theory by incorporating the theory of mixing time, which allows for a tradeoff between the optimality of a policy and how fast that optimality can be obtained. Thus, in non-stationary environments, an FMDP policy might be preferable since it can achieve the given objective quickly, before the environment changes. Specifically, if the costs of traveling between two sensors, modeled as the distance between them, change abruptly due to bad weather, or the rates of data being gathered at the sensors change significantly, then it is better to use the movement policy produced by the FMDP framework. We quantitatively show these results in Section VI. In addition, we show that under certain assumptions, finding the optimal movement policy of the ME is in fact a convex problem and can be solved effectively. As such, unlike many existing heuristic approaches, the optimality of the solution of the proposed algorithm can be quantified and obtained.

Our paper is organized as follows. Section II presents the related literature on sensor networks. In Section III, we provide a brief background on MDP, mixing time, and random walks on graphs, necessary for the development of the proposed solution approach. In Section IV, we describe the FMDP framework and formalize the problem in its context. An algorithmic solution is proposed in Section V. Section VI provides simulation results to demonstrate the benefits of the proposed approach. Finally, Section VII gives a few concluding remarks.

II. Related Work

There is a rich research literature on WSNs; we can only mention a few works in this paper. A good survey of WSNs can be found in [1]. Many WSNs are assumed to be dense. Thus, data collected by the sensors are sent to the sink using multi-hop routing schemes. As discussed in the Introduction, this approach severely limits the longevity of a WSN. To overcome this problem, MEs are introduced into WSNs. For example, mobile data collectors such as mobile sinks [2] or mobile relays [3][9][10] are used to collect data from the sensors periodically. They move around, connect to the sensors, and collect the data. In other applications such as animal tracking, the sensors are attached to cattle and thus naturally mobile [11].

There are two models for an ME's mobility: uncontrolled mobility and controlled mobility. For uncontrolled mobility, the movement of the MEs is either deterministic and cannot be changed, or follows a random distribution in trajectory and speed [1]. For controlled mobility, the trajectory and speed of an ME can be actively altered and controlled to a certain extent. There have been a number of studies on this controlled mobility model. The goal is to control the mobile element's speeds and trajectories to maximize the amount of data collected while minimizing latency and buffer overflow at the sensors. For example, many algorithms for designing deterministic trajectories have been proposed in [2]. For dynamic trajectories, there have been two main schemes to dynamically change the ME's trajectory. The first scheme is based on sensor demand [4]: the ME changes its course when a node request is received. The second scheme is based on node priority, which depends on the node's buffer overflow status [5]. Several speed control algorithms have also been proposed, including Stop to Collect, Adaptive Speed Control, etc. [12]. Some algorithms for controlling both trajectories and speeds have been investigated in [13].

Unlike many existing approaches with heuristic flavors, we cast the data collection problem in the framework of a novel Fast Markov Decision Process and solve it using convex optimization. As such, we can quantify the optimality of the proposed solution.

III. Preliminaries

A. Markov Decision Process (MDP)

MDP is a framework for studying optimal decision making under uncertainty [6]. In an MDP setting, there is a controller who interacts with the environment by taking actions based on its observations at every discrete time step. Each action by the controller induces a change in the environment. Typically, the environment is described by a finite set of states. An action will move the environment from the current state to some other state with certain probabilities. Associated with each action in each state is a reward given to the controller. The goal of the controller is to maximize the expected cumulative reward or average reward over some finite or infinite number of time steps by making sequential decisions based on its current observations.

Formally, a discrete-time MDP represents a dynamic system and is specified by a finite set of states S, representing the possible states of the system, a set of control actions A, a transition probability matrix P of size |S| × |S|, and a reward function R. The transition probabilities specify the dynamics of the system: each entry P(i, j) ≜ P(s_{n+1} = j | s_n = i, a_n = a) represents the conditional probability of the system moving to state s_{n+1} = j in the next time step after taking action a in the current state s_n = i. The dynamics are Markovian in the sense that the probability of the next state j depends only on the current state i and the action a, and not on any previous history. The reward function R(s, a) assigns a real number to the state s and action a, so that R(s, a) represents the immediate reward of being in state s and taking action a. A policy Π is a sequence of actions a_1, a_2, . . . , a_n taken by the controller, with n denoting the time index. Formally, a policy specifies a mapping from states to actions at each time step, Π_n : S → A. A policy Π is called stationary if its actions depend only on the state s, independent of the time index. A stationary policy induces a time-invariant transition probability matrix. Typically, time is assumed to be discrete and the control policy selects one action at each time step. Every policy Π is associated with a value function V_Π such that V_Π(s) gives the expected cumulative reward achieved by Π when starting in state s. The solution to an MDP problem is the optimal policy Π∗ that maximizes the average reward or expected cumulative reward over some finite or infinite number of time steps. In this paper, we use the average reward, defined in Section IV-B, as the criterion for the optimal policy.

B. Mixing Time

Proposition 1: For an irreducible, aperiodic, finite and discrete Markov chain with transition probability matrix P, there exists a unique stationary distribution π such that

lim_{n→∞} ν^T P^n = π^T.    (1)

In other words, the chain converges to a stationary distribution of states after a large number of steps, regardless of the initial distribution ν.

Definition 1 (Total variation distance): For any two probability distributions ν and π on a finite state space Ω, we define the total variation distance as

‖ν − π‖_TV = (1/2) Σ_{i∈Ω} |ν(i) − π(i)|.    (2)

The total variation distance measures the distance between two distributions. It is used to evaluate the convergence rate of the Markov chain to the stationary distribution.

Definition 2 (Mixing time): [8] For a discrete, aperiodic and irreducible Markov chain with transition probability P and stationary distribution π, given an ε > 0, the mixing time t_mix(ε) is defined as

t_mix(ε) = inf{n : ‖ν^T P^n − π^T‖_TV ≤ ε, for all probability distributions ν}.    (3)

Definition 3 (Reversible Markov chain): A discrete Markov chain with transition probability P is said to be reversible if

P(i, j) π(i) = P(j, i) π(j).    (4)

Theorem 1 (Bound on mixing time): [8] Let P be the transition matrix of a reversible, irreducible and aperiodic Markov chain with state space S, and let π_min := min_{x∈S} π(x). Then

t_mix(ε) ≤ (1 / (1 − μ(P))) log(1 / (ε π_min)),    (5)

where μ(P) is the second largest eigenvalue modulus (SLEM) of the matrix P.

As seen in the theorem, the larger the spectral gap 1 − μ(P) is, the faster the chain converges to the stationary distribution. This will be used in our FMDP framework to characterize the convergence rate of a protocol/policy.
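The bound in (5) is straightforward to evaluate numerically. Below is a minimal sketch in Python (assuming numpy; the 3-node weighted graph is a made-up illustration, not one of the paper's scenarios) that computes the SLEM μ(P) of a reversible chain and the resulting upper bound on the mixing time.

```python
import numpy as np

def stationary_distribution(P):
    """Left eigenvector of P for eigenvalue 1, normalized to sum to 1."""
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return pi / pi.sum()

def slem(P, pi):
    """Second largest eigenvalue modulus of a reversible transition matrix P,
    computed from the symmetrized matrix D^{1/2} P D^{-1/2}."""
    d = np.sqrt(pi)
    A = (d[:, None] * P) / d[None, :]              # D^{1/2} P D^{-1/2}, symmetric for reversible P
    eig = np.sort(np.abs(np.linalg.eigvalsh(A)))[::-1]
    return eig[1]                                   # eig[0] is the trivial eigenvalue 1

def mixing_time_bound(P, eps=0.01):
    """Upper bound (5): t_mix(eps) <= 1/(1 - mu(P)) * log(1/(eps * pi_min))."""
    pi = stationary_distribution(P)
    mu = slem(P, pi)
    return np.log(1.0 / (eps * pi.min())) / (1.0 - mu)

# Hypothetical 3-node weighted triangle; its random walk is reversible.
W = np.array([[0.0, 0.2, 0.1],
              [0.2, 0.0, 0.3],
              [0.1, 0.3, 0.0]])
P = W / W.sum(axis=1, keepdims=True)
print(mixing_time_bound(P, eps=0.01))
```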

C. Random Walk on Graphs and Reversible Transition Matrix

A random walk on a graph is a probabilistic model that describes the random movement of a walker from one vertex to another on a given graph. Specifically, the movement is modeled locally in the following sense: if the walker is at a certain vertex V, the probability that it will move to a particular neighboring vertex depends only on the weights of the edges connecting to V. It is shown that any reversible transition matrix can be represented as a weighted random walk on a graph [7][8]. Suppose a matrix P is reversible with stationary distribution π, i.e., π(s)P(s, s′) = π(s′)P(s′, s) = C(s, s′). Then we can use the conductance C(s, s′) as the weight of the edge (s, s′).

A weighted graph can be described by a symmetric weight matrix W, where the positive entry W(s, s′) = W(s′, s) is the weight of the edge (s, s′). Let H(s) = Σ_{s′} W(s, s′) be the total weight of all edges connected to node s. If H(s) = 0, then node s does not connect to any other node and can be ignored when calculating transition probabilities. Specifically, the probability of moving from node s to node s′ is P(s, s′) = W(s, s′)/H(s) if H(s) > 0, and 0 if H(s) = 0. The stationary distribution of the transition matrix is π(s) = H(s)/Σ_k H(k). Note that the transition matrix and its stationary distribution depend only on the ratios between weights. Therefore, without loss of generality, we normalize the sum of all weights in the weight matrix, i.e., Σ_{(i, j)} W(i, j) = Σ_i H(i) = 1.
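As a concrete illustration of this construction, the following sketch (assuming numpy; the 4-node weight matrix is hypothetical) builds the transition matrix P and the stationary distribution π from a symmetric weight matrix W.

```python
import numpy as np

def random_walk_from_weights(W):
    """Transition matrix and stationary distribution of the weighted random
    walk defined by a symmetric weight matrix W (Section III-C)."""
    W = np.asarray(W, dtype=float)
    W = W / W.sum()                        # normalize so that sum_{(i,j)} W(i,j) = 1
    H = W.sum(axis=1)                      # H(s): total weight of edges at node s
    P = np.zeros_like(W)
    nz = H > 0                             # isolated nodes (H(s) = 0) are ignored
    P[nz] = W[nz] / H[nz][:, None]         # P(s, s') = W(s, s') / H(s)
    pi = H / H.sum()                       # pi(s) = H(s) / sum_k H(k)
    return P, pi

# Hypothetical 4-node weighted graph (weights are illustrative only).
W = np.array([[0.0, 1.0, 0.0, 2.0],
              [1.0, 0.0, 3.0, 0.0],
              [0.0, 3.0, 0.0, 1.0],
              [2.0, 0.0, 1.0, 0.0]])
P, pi = random_walk_from_weights(W)
print(np.allclose(pi @ P, pi))             # pi is stationary: pi^T P = pi^T
```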

IV. Problem Modeling and Formulation

Modeling. We model a sensor network as a graph (V, E) (Figure 1a), with the set of vertices or nodes V representing the sensors, or groups of sensors close enough to each other that the ME can collect data from all nodes in the same group at the same time. The edges in E represent the paths the ME can travel from one node to another to collect data in discrete time steps. Once the ME reaches a node i, it collects data from that node and obtains a benefit modeled as a positive real number b(i). One way to interpret b(i) is as the amount of data collected at node i. Thus, if the rate of useful data being sensed and stored at node i per unit time is high, then b(i) should be large. In addition to rewards, there are costs. Specifically, traveling from one node to another incurs a cost c(i, j), which can represent, among many quantities, the energy spent to travel the actual distance between the two sensors i and j. The self-loop at a node represents the cost of staying at the node after collecting data. This self-loop models the situation where the data have already been collected, so if the ME stays at the same node at the next time step, there is a high chance that there is little data to collect, thus wasting precious time that could be spent collecting fresh data from other nodes. Finally, the benefits at each node and the costs on each edge might change abruptly.

Figure 1: An example of a graph modeling a sensor network and the random walk induced by an MDP policy. (a) Reward and cost structures on the graph. (b) Corresponding random walk on the graph.

Objective. The goal is to find an optimal movement policy of the ME that maximizes the average reward, defined as the difference between the benefit and the cost per unit time. At the same time, the policy must be robust to changes in the costs and benefits, i.e., it should still obtain a reasonably high average reward when they change. We will make this statement precise shortly when describing the FMDP framework.

A. Random Walk on Graph and Probabilistic MDP Policy

An optimal MDP policy in our problem is the rule for how the ME moves to another node given that it is currently at a certain node. Assume that the ME is at some node; there might be several neighboring nodes that the ME can move to. Each choice represents an action a in the MDP, as described in Section III-A. For example, consider node 5, connected to nodes 1 and 4 as shown in Figure 1b. For one policy, an action a1 could be "go to node 1", and for another policy, the action a2 is "go to node 4". One can also employ a probabilistic policy that includes an action a3 such as "go to node 1 with probability 0.4, go to node 4 with probability 0.6". As a result, we can compute the cost of selecting an action a based on the costs of the edges of the graph. For example, c(a1) = c(1, 5), c(a2) = c(4, 5) and c(a3) = 0.4 c(1, 5) + 0.6 c(4, 5). Since deterministic policies are special cases of randomized ones, without loss of generality, we only consider randomized policies that produce reversible transition matrices. As such, each randomized policy induces a random walk on a graph with different weights W(i, j), as shown in Figure 1b. Recall from Section III-C that the weights W(i, j) on the graph specify how the walker moves around the vertices of the graph.

Now we will formalize the connection between a random walk on a graph and a probabilistic policy. Consider an MDP with state space S, a set of actions A for each state, and a reward function R(s, a). We model the reward R(s, a) as the difference between the benefit b(s) and the cost of the action c(a): R(s, a) = b(s) − c(a). The number of possible discrete policies is |A|^|S|. For typical MDP problems, this is a very large number. As such, we parameterize the discrete policy space using a smaller subset of policies Φ. Each discrete policy i ∈ Φ can be characterized by a symmetric weight matrix Wi. Specifically, we define a probabilistic weight matrix as a weighted combination of all the weight matrices: W = Σ_i θi Wi, where θi ≥ 0 for all i and Σ_i θi = 1. Now, we define a probabilistic policy as follows. Let p(s, i) be the probability that we select policy i ∈ Φ while in state s. Then p(s, i) can be computed using the following proposition:

Proposition 2: For the combined weight matrix with parameters θ, the probability that we select policy i in state s is

p(s, i) = θi Hi(s) / H(s)   for all s, i,

where Hi(s) = Σ_j Wi(s, j) and H(s) = Σ_j W(s, j).

Proof: We omit the proof due to limited space.

Proposition 2 tells us that, for a given probabilistic policy parameterized by a specific set of θi's, the probability of picking policy i while in state s is θi Hi(s)/H(s). Then, since a given action a might appear in many policies i ∈ Φ for state s, the probability that action a is selected while in state s can be computed as

d(s, a) = Σ_{i : i(s) = a} p(s, i).    (6)
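The sketch below illustrates Proposition 2 and Eq. (6) numerically. It assumes each base policy i ∈ Φ is given as a symmetric weight matrix Wi and as a map from states to actions; the matrices, action indices, and function names are illustrative placeholders, not part of the paper.

```python
import numpy as np

def policy_mixture(W_list, theta):
    """Combine base weight matrices W_i with coefficients theta_i (Prop. 2):
    returns p[i, s] = theta_i * H_i(s) / H(s), plus H_i and H."""
    theta = np.asarray(theta, dtype=float)
    W_list = [np.asarray(Wi, dtype=float) for Wi in W_list]
    H_i = np.array([Wi.sum(axis=1) for Wi in W_list])      # H_i(s), shape (n_policies, n_states)
    W = sum(t * Wi for t, Wi in zip(theta, W_list))         # W = sum_i theta_i W_i
    H = W.sum(axis=1)                                       # H(s)
    p = (theta[:, None] * H_i) / H[None, :]                 # p(s, i), indexed here as [i, s]
    return p, H_i, H

def action_probabilities(p, policy_actions, n_actions):
    """Eq. (6): d(s, a) = sum of p(s, i) over base policies i with i(s) = a.
    policy_actions[i][s] is the action taken by base policy i in state s."""
    n_policies, n_states = p.shape
    d = np.zeros((n_states, n_actions))
    for i in range(n_policies):
        for s in range(n_states):
            d[s, policy_actions[i][s]] += p[i, s]
    return d

# Toy usage on a hypothetical 3-node graph with two base weight matrices.
W1 = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])   # path 0-1-2
W2 = np.array([[0., 0., 1.], [0., 0., 1.], [1., 1., 0.]])   # edges to node 2
p, H_i, H = policy_mixture([W1, W2], theta=[0.6, 0.4])
d = action_probabilities(p, policy_actions=[[1, 0, 1], [2, 2, 0]], n_actions=3)
print(d.sum(axis=1))                                        # each row of d(s, a) sums to 1
```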

B. Fast Markov Decision Process

There are many algorithms for finding optimal deterministic policies for an MDP problem, such as Value Iteration and Policy Iteration [6]. However, these policies are asymptotically optimal in the sense that they have to run for a long time in a stationary environment. The underlying reason is that the transition probability matrix associated with one of these policies has a small spectral gap (Section III-B), which leads to a slow convergence rate to its stationary distribution π and directly affects the optimal reward. Specifically, the average reward for a given probabilistic policy is

ρ1 = Σ_s π(s) Σ_a d(s, a) R(s, a) = Σ_s π(s) Σ_i p(s, i) R(s, i),    (7)

where d(s, a) (computed in Eq. (6)) is the probability of taking action a in state s, R(s, a) is the immediate reward of taking action a in state s [14], p(s, i) is the probability of picking policy i, and R(s, i) = R(s, i(s)). As seen, the reward depends on the stationary distribution, the probabilistic policy, and the immediate reward. However, a policy will obtain the reward faster if its transition probability matrix has a faster convergence rate, due to a larger spectral gap or, equivalently, a smaller SLEM. A faster policy is especially useful in a non-stationary environment in the sense that it attempts to get a large reward as quickly as possible before the environment changes. In contrast, an optimal but slow policy will not perform as well, since the optimal reward is never obtained: the policy quickly becomes sub-optimal in a new environment. On the other hand, a faster policy might have a lower average reward if the environment is stationary. Therefore, we propose the FMDP framework, which allows us to quantify the trade-off between obtaining a high reward and the ability to adapt to time-varying environments.

Now, we note that the mixing rate or convergence rate is characterized by the second largest eigenvalue modulus (SLEM) μ(P) of a reversible, irreducible and aperiodic transition matrix P [8][15], or its spectral gap (1 − μ(P)). For a fast policy characterized by its transition probability matrix P, (1 − μ(P)) needs to be large. Therefore, we want to maximize ρ2 = 1 − μ(P), or minimize μ(P).

To balance the average reward and fast convergence, we want to find θ1, θ2, . . . , θn that maximize the following objective function:

ρ = γ ρ1 + (1 − γ) ρ2 = γ Σ_s π(s) Σ_i p(s, i) R(s, i) + (1 − γ)(1 − μ(P)),    (8)

where the coefficient 0 < γ < 1 specifies the trade-off between the average reward and the convergence rate.

We note that once the θi's are found, they can be used to compute p(s, i) via Proposition 2, and then p(s, i) can be used to compute the probability of taking action a in state s via Eq. (6).
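A minimal sketch of evaluating the combined objective (8) for a given θ is shown below. It assumes each base weight matrix Wi is symmetric and normalized to total weight 1, and that R[i, s] stores the immediate reward R(s, i); these array conventions are ours, not the paper's.

```python
import numpy as np

def combined_objective(W_list, theta, R, gamma):
    """rho = gamma*rho1 + (1-gamma)*(1 - mu(P)), as in Eq. (8).

    Assumes each base weight matrix W_i is symmetric with entries summing to 1,
    theta sums to 1, and R[i, s] is the immediate reward R(s, i)."""
    theta = np.asarray(theta, dtype=float)
    R = np.asarray(R, dtype=float)
    W = sum(t * Wi for t, Wi in zip(theta, W_list))        # combined weight matrix
    H = W.sum(axis=1)
    H_i = np.array([Wi.sum(axis=1) for Wi in W_list])
    pi = H / H.sum()                                        # stationary distribution
    P = W / H[:, None]                                      # random-walk transition matrix
    p = (theta[:, None] * H_i) / H[None, :]                 # p(s, i), Proposition 2

    rho1 = np.sum(pi[None, :] * p * R)                      # sum_s pi(s) sum_i p(s,i) R(s,i)

    d = np.sqrt(pi)
    A = (d[:, None] * P) / d[None, :] - np.outer(d, d)      # D^{1/2} P D^{-1/2} - sqrt(pi)sqrt(pi)^T
    mu = np.max(np.abs(np.linalg.eigvalsh(A)))              # SLEM of P
    rho2 = 1.0 - mu                                          # spectral gap

    return gamma * rho1 + (1.0 - gamma) * rho2
```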

V. Solution Approach via Fast Markov Decision Process Framework

We describe an algorithm for finding the optimal FMDP policy, i.e., the θi's. The proposed algorithm is based on the projected gradient method used extensively in convex optimization. The main idea is to search in the direction of the gradient of the objective function. Sometimes the search direction results in a point outside the feasible set; in that case, the point is projected onto the closest point in the feasible set, and the process repeats. In our problem, the overall objective consists of two components: the average reward and the SLEM. We now show how to compute the gradient of the average reward and the directional gradient of the SLEM, since they are the essential components of the proposed algorithm.


A. Gradient for Average Reward

Since π(s) = H(s)/Σ_{k∈S} H(k), p(s, i) = θi Hi(s)/H(s), Σ_k H(k) = 1 and Σ_i θi = 1, the average reward ρ1 simplifies as in the following proposition:

Proposition 3: The average reward ρ1 can be represented as

ρ1 = Σ_s Hn(s) R(s, n) + Σ_{i=1}^{n−1} θi ( Σ_s ( Hi(s) R(s, i) − Hn(s) R(s, n) ) ).

Proof: We omit the proof due to limited space.

Then

dρ1/dθi = Σ_s ( Hi(s) R(s, i) − Hn(s) R(s, n) ),   for all i = 1, 2, . . . , n−1.
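This gradient is a simple array computation; a sketch follows, under the same array conventions as above (which are our own, not the paper's).

```python
import numpy as np

def grad_rho1(H_i, R):
    """Gradient of the average reward rho1 with respect to theta_1..theta_{n-1}.

    H_i[i, s] = H_i(s) (row sums of base weight matrix W_i) and R[i, s] = R(s, i).
    The last base policy (index n-1 here, policy n in the paper) is eliminated
    through the constraint sum_i theta_i = 1."""
    per_policy = (np.asarray(H_i) * np.asarray(R)).sum(axis=1)   # sum_s H_i(s) R(s, i)
    return per_policy[:-1] - per_policy[-1]                       # d rho1 / d theta_i, i = 1..n-1
```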

B. Directional Gradient for SLEM μ(P)

Consider the second largest eigenvalue modulus (SLEM) μ(P) of a reversible transition matrix P with stationary distribution π. μ(P) is also the largest eigenvalue of the symmetric matrix A = D^{1/2} P D^{−1/2} − √π √π^T [16]. Therefore, it can be calculated as

μ(P) = ‖D^{1/2} P D^{−1/2} − √π √π^T‖_2,

where D is the diagonal matrix with diagonal elements taken from the stationary distribution π.

Consider a symmetric matrix A(x) = A0 + x1 A1 + . . . + xn An, where x = (x1, . . . , xn) and each Ai is a symmetric matrix. Let f(x) be the largest eigenvalue of A(x), and let y be the eigenvector corresponding to the largest eigenvalue. Then y^T A1 y, y^T A2 y, . . . , y^T An y are the directional gradients of the largest eigenvalue of A at point x in the directions A1, A2, . . . , An, respectively, as proved in [17] and used as sub-gradients in [16][18]. In our problem, A = D^{1/2} P D^{−1/2} − √π √π^T. From Section III-C, we can see that D^{1/2} P D^{−1/2} = D^{−1/2} W D^{−1/2} (the normalized Laplacian of a graph [19]). Therefore, A can be represented as follows.

Proposition 4: The matrix A = D^{1/2} P D^{−1/2} − √π √π^T can be represented as

A = Σ_{i=1}^{n−1} θi D^{−1/2} (Wi − Wn) D^{−1/2} + D^{−1/2} Wn D^{−1/2} − √π √π^T.

Proof: We omit the proof due to limited space.

At each step, the stationary distribution π is considered fixed. Let A0 = D^{−1/2} Wn D^{−1/2} − √π √π^T and Ai = D^{−1/2} (Wi − Wn) D^{−1/2}, which are symmetric matrices. Then the directional gradient of μ(P) in the direction Ai is gi = y^T Ai y for all i. Note that, unlike the gradient of ρ1, computing gi requires an eigendecomposition of A.

Since Σ_i θi = 1, the gradient is taken only over the parameter vector θ = (θ1, θ2, . . . , θn−1). Let g = (y^T A1 y, y^T A2 y, . . . , y^T A_{n−1} y). Then the gradient of the combined objective function ρ with respect to θ is

dρ/dθ = γ dρ1/dθ + (1 − γ) dρ2/dθ = γ dρ1/dθ − (1 − γ) g.    (9)
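The following sketch (assuming numpy; array conventions are ours) assembles A as in Proposition 4, extracts the eigenvector y of its largest eigenvalue, and returns the directional gradients gi together with the combined gradient of Eq. (9).

```python
import numpy as np

def grad_slem(W_list, theta, pi):
    """Directional (sub)gradients g_i = y^T A_i y of the SLEM mu(P).

    A = A_0 + sum_{i<n} theta_i A_i with A_0 = D^{-1/2} W_n D^{-1/2} - sqrt(pi)sqrt(pi)^T
    and A_i = D^{-1/2} (W_i - W_n) D^{-1/2}; y is the eigenvector of A for its largest
    eigenvalue (the stationary distribution pi is held fixed, as in the paper).
    theta is the full vector (theta_1, ..., theta_n)."""
    d_inv = 1.0 / np.sqrt(pi)
    Wn = W_list[-1]
    A_dirs = [d_inv[:, None] * (Wi - Wn) * d_inv[None, :] for Wi in W_list[:-1]]
    A0 = d_inv[:, None] * Wn * d_inv[None, :] - np.outer(np.sqrt(pi), np.sqrt(pi))
    A = A0 + sum(t * Ai for t, Ai in zip(theta[:-1], A_dirs))
    vals, vecs = np.linalg.eigh(A)
    y = vecs[:, -1]                                   # eigenvector of the largest eigenvalue
    return np.array([y @ Ai @ y for Ai in A_dirs])

def grad_rho(grad_rho1_vec, g, gamma):
    """Eq. (9): d rho / d theta = gamma * d rho1/d theta - (1 - gamma) * g."""
    return gamma * np.asarray(grad_rho1_vec) - (1.0 - gamma) * np.asarray(g)
```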

Convexity. In many cases, the stationary distribution π∗ of the ME is given by the design requirements of the problem, for example when the rate of data generated at each node is known a priori. Then, to obtain a high average reward, the ME should visit the nodes with higher generating rates more often than others; in other words, the stationary distribution π∗ of the ME should be proportional to the data rate at each node. In this case, our optimization problem is concave. As an example, our simulation makes this assumption.

Proposition 5: The objective function ρ is a concave function of the θi's.

Proof: We omit the proof due to limited space.

Since ρ is concave, we can find a globally optimal solution using the gradient method discussed in Section V-C.

C. Algorithm

Consider the combined objective ρ = γ ρ1 + (1 − γ) ρ2. The method combines the gradient of the average reward and the directional gradient of the SLEM according to the coefficient γ. Denote by θ_k the parameter vector at step k. The algorithm is shown below (Algorithm 1).

Algorithm 1 Sub-gradient Method for Fast MDP
Require: θ_0, α, k = 0
1: repeat
2:   θ_{k+1} = θ_k + α dρ/dθ
3:   k = k + 1
4:   Project θ_{k+1} back onto the feasible set.
5: until convergence (the change in the objective function is smaller than a given ε)

Essentially, the algorithm searches in the gradient direction with step size α. If the resulting point is outside the feasible set, it is projected back to the nearest point in the feasible set. Specifically, the feasible set includes the constraint θi ≥ 0 for all i, the constraint on the sum of all θi's, Σ_i θi = 1, and the constraint on the stationary distribution π: π = π∗, where π∗ is the designed stationary distribution.

The projection can be done by solving the following convex optimization problem:

minimize_{θ′}  ‖θ′ − θ‖_1
subject to     θ′ ≥ 0,
               ‖θ′‖_1 ≤ 1,
               θ′ ΔH + Hn = π∗,

where the matrix ΔH is defined as ΔH(i, s) = Hi(s) − Hn(s).

The condition θ′ ΔH + Hn = π∗ comes from the constraint on the fixed stationary distribution π = π∗, as shown below:

π∗(s) = π(s) = H(s) / Σ_{k∈S} H(k) = H(s) = Σ_{i=1}^{n} θ′_i Hi(s)
       = Σ_{i=1}^{n−1} θ′_i Hi(s) + θ′_n Hn(s)
       = Σ_{i=1}^{n−1} θ′_i Hi(s) + (1 − Σ_{i=1}^{n−1} θ′_i) Hn(s)
       = Σ_{i=1}^{n−1} θ′_i (Hi(s) − Hn(s)) + Hn(s)   for all s,    (10)

or, in matrix form, θ′ ΔH + Hn = π∗, where the matrix ΔH is defined as ΔH(i, s) = Hi(s) − Hn(s).

In other words, we find the closest point to θ that satisfies all the constraints in the problem formulation.
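A sketch of Algorithm 1 with this projection step is given below. It assumes cvxpy is available for the ℓ1 projection and that a callable grad_rho(θ) returns dρ/dθ as in Eq. (9) (for instance, assembled from the sketches above); θ here denotes (θ1, . . . , θn−1), and the stopping rule uses the change in θ as a simple proxy for the change in the objective.

```python
import numpy as np
import cvxpy as cp   # assumed available for the L1 projection step

def project_to_feasible(theta, dH, Hn, pi_star):
    """Projection step of Algorithm 1: closest feasible point to theta in L1 norm,
    subject to theta' >= 0, ||theta'||_1 <= 1 and theta' dH + Hn = pi*."""
    t = cp.Variable(len(theta))
    constraints = [t >= 0,
                   cp.norm(t, 1) <= 1,
                   t @ dH + Hn == pi_star]
    cp.Problem(cp.Minimize(cp.norm(t - theta, 1)), constraints).solve()
    return t.value

def fast_mdp(theta0, grad_rho, dH, Hn, pi_star, alpha=0.005, eps=1e-6, max_iter=500):
    """Projected (sub)gradient ascent on the combined objective rho (Algorithm 1)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        theta_new = theta + alpha * grad_rho(theta)            # gradient step
        theta_new = project_to_feasible(theta_new, dH, Hn, pi_star)
        if np.linalg.norm(theta_new - theta, 1) < eps:         # convergence check (proxy)
            return theta_new
        theta = theta_new
    return theta
```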

Figure 2: The graph for scenario 1

Figure 3: The graph for scenario 2

Figure 4: The SLEM for scenario 1 as a function of algorithm steps

VI. Simulations and Results

In this section, we consider two WSN graphs: one with 9 nodes on a 3 × 3 grid and one with 10 nodes on a triangle-topology graph, as shown in Figures 2 and 3. The corresponding benefits and costs are also shown in Figures 2 and 3. Given the benefits, i.e., the data generating rates, we can compute the designed stationary distribution, which is proportional to the benefits. Note that this is only an example; in general, one can use any benefit structure. We show the results for various values of γ.

Figure 5: The average reward for scenario 1 as a function of algorithm steps

Figure 6: The combined objective function for scenario 1 as a function of algorithm steps

γ = 1 implies that we only care about the average reward, while γ = 0 implies we only care about the convergence rate. For 0 < γ < 1, we care about both the reward and the convergence rate, with weights γ and 1 − γ, respectively.

For the first scenario, 9 nodes are on a 3 × 3 grid with benefits and costs shown in Figure 2. The cost of staying at a node is 1 (not shown). The desired stationary distribution is π∗ = [0.14 0.12 0.10 0.12 0.08 0.12 0.14 0.07 0.11], which is proportional to the nodes' benefits. Assume that the ME starts at node 9, i.e., the initial distribution is [0 0 0 0 0 0 0 0 1]. Here, the algorithm step size is set at α = 0.005. The SLEM, the average reward, and the combined objective function ρ are plotted in Figures 4, 5, and 6 as functions of the number of algorithm steps.

For the second scenario, 10 nodes are on a triangle-topology network, as seen in Figure 3. The benefit at each node and the cost of moving along each edge are also shown in Figure 3. The cost of staying at a node is 1 (not shown). The desired stationary distribution is π∗ = [0.08 0.14 0.12 0.19 0.13 0.06 0.08 0.09 0.06 0.05], which is proportional to the nodes' benefits. Assume that the ME starts at node 10, i.e., the initial distribution is [0 0 0 0 0 0 0 0 0 1].


Figure 7: The SLEM for scenario 2 as a function of algorithm steps

Figure 8: The average reward for scenario 2 as a function of algorithm steps

Here, the algorithm step size α is also 0.005. The SLEM, the average reward, and the combined objective function ρ versus the number of algorithm steps are plotted in Figures 7, 8, and 9.

For both scenarios, we can see that the policy with γ = 1 provides the largest reward but also has a large SLEM, i.e., slow convergence (average reward ≈ 8.75 and SLEM ≈ 0.97 for scenario 1; average reward ≈ 3.05 and SLEM ≈ 0.99 for scenario 2). With γ = 0, we get the policy with the fastest convergence but a lower reward (average reward ≈ 8.6 and SLEM ≈ 0.72 for scenario 1; average reward ≈ 2.90 and SLEM ≈ 0.63 for scenario 2). With γ = 0.5, we get a policy that converges fast and at the same time collects a large reward (average reward ≈ 8.65 and SLEM ≈ 0.75 for scenario 1; average reward ≈ 2.95 and SLEM ≈ 0.65 for scenario 2). As seen in Figures 6 and 9, the combined objective function ρ of our method outperforms the others.

To show the advantage of the proposed FMDP approach under changing network conditions or limited collection time, we perform the following simulations. Based on the optimal policies for γ = 1 and γ = 0.5, we compute their corresponding transition matrices and the average rewards over time in a changing environment.

Figure 9: The combined objective function for scenario 2 as a function of algorithm steps

Figure 10: The average reward collected in the non-stationary environment for scenario 1

Specifically, the benefits of the data collected, which are proportional to the generating rates, change twice: at time steps 50 and 120 for scenario 1, and at time steps 73 and 140 for scenario 2. Each time the reward changes, we recompute the average reward. Figures 10 and 12 show that with the FMDP-based policy (γ = 0.5), the ME collects more reward than with the classical optimal MDP policy (γ = 1). This is due to the fact that the environment changes before the classical optimal MDP policy accumulates the optimal reward.

For completeness, we also show the rates of convergence of the two policies to the new stationary distributions when the reward structure changes in the above scenarios. Figures 11 and 13 show that the total variation distance decreases for the two policies as the ME moves around the nodes. As seen, the FMDP policy (γ = 0.5) has a faster convergence rate than the one with γ = 1. This explains why the FMDP policy obtains its reward faster. However, we note that in a stationary environment, the policy with γ = 1 would eventually outperform the other.
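The convergence curves of Figures 11 and 13 can be reproduced in principle by iterating ν^T P^n and tracking the total variation distance of Eq. (2); the sketch below (assuming numpy; the transition matrix and starting node are placeholders) shows the computation.

```python
import numpy as np

def tv_distance_curve(P, pi, start_node, n_steps=200):
    """Total variation distance ||nu^T P^n - pi^T||_TV over n time steps,
    starting from a point mass at start_node."""
    nu = np.zeros(len(pi))
    nu[start_node] = 1.0                          # the ME starts at a single node
    dists = []
    for _ in range(n_steps):
        dists.append(0.5 * np.abs(nu - pi).sum())  # Eq. (2)
        nu = nu @ P                                # one step of the chain
    return np.array(dists)
```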

VII. Conclusion

We introduced the FMDP framework for designing trajectories/policies for data collection in sensor networks in fast-changing environments. The proposed FMDP framework and its algorithmic tools allow engineers to systematically design trajectories/policies that can be optimized for a wide range of objectives. Importantly, the proposed FMDP framework enables one to design a protocol/policy that strikes a balance between maximizing a given objective and doing so in quantifiably fast time. The proposed framework can be extended to explicitly model many other real-world constraints, including limitations on latency and buffer size.

Figure 11: The total variation distance to the stationary distribution in the non-stationary environment for scenario 1

Figure 12: The average reward collected in the non-stationary environment for scenario 2

Figure 13: The total variation distance to the stationary distribution in the non-stationary environment for scenario 2

References

[1] M. Di Francesco, S. K. Das, and G. Anastasi, "Data collection in wireless sensor networks with mobile elements: A survey," ACM Trans. Sen. Netw., vol. 8, no. 1, pp. 7:1–7:31, Aug. 2011.

[2] J. Rao and S. Biswas, "Joint routing and navigation protocols for data harvesting in sensor networks," in Mobile Ad Hoc and Sensor Systems, 2008. MASS 2008. 5th IEEE International Conference on, Sept. 2008, pp. 143–152.

[3] R. C. Shah, S. Roy, S. Jain, and W. Brunette, "Data MULEs: Modeling and analysis of a three-tier architecture for sparse sensor networks," Ad Hoc Networks, vol. 1, no. 2-3, pp. 215–233, 2003.

[4] Y.-C. Tseng, Y.-C. Wang, K.-Y. Cheng, and Y.-Y. Hsieh, "iMouse: An integrated mobile surveillance and wireless sensor system," Computer, vol. 40, no. 6, pp. 60–66, June 2007.

[5] A. A. Somasundara, A. Ramamoorthy, and M. B. Srivastava, "Mobile element scheduling for efficient data collection in wireless sensor networks with dynamic deadlines," in Proceedings of the 25th IEEE International Real-Time Systems Symposium, ser. RTSS '04. Washington, DC, USA: IEEE Computer Society, 2004, pp. 296–305.

[6] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1st ed. New York, NY, USA: John Wiley & Sons, Inc., 1994.

[7] D. Aldous and J. A. Fill, Reversible Markov Chains and Random Walks on Graphs. In preparation. [Online]. Available: http://www.stat.berkeley.edu/~aldous/RWG/book.html

[8] D. A. Levin, Y. Peres, and E. L. Wilmer, Markov Chains and Mixing Times. American Mathematical Society, 2008.

[9] H. Jun, M. H. Ammar, and E. W. Zegura, "Power management in delay tolerant networks: A framework and knowledge-based mechanisms," 2005.

[10] W. Zhao, M. Ammar, and E. Zegura, "A message ferrying approach for data delivery in sparse mobile ad hoc networks," in Proceedings of the 5th ACM International Symposium on Mobile Ad Hoc Networking and Computing, ser. MobiHoc '04. New York, NY, USA: ACM, 2004, pp. 187–198.

[11] P. Juang, H. Oki, Y. Wang, M. Martonosi, L. S. Peh, and D. Rubenstein, "Energy-efficient computing for wildlife tracking: Design tradeoffs and early experiences with ZebraNet," SIGARCH Comput. Archit. News, vol. 30, no. 5, pp. 96–107, Oct. 2002.

[12] A. Kansal, A. A. Somasundara, D. D. Jea, M. B. Srivastava, and D. Estrin, "Intelligent fluid infrastructure for embedded networks," in Proc. ACM MobiSys '04, 2004.

[13] I. Papadimitriou and L. Georgiadis, "Energy-aware routing to maximize lifetime in wireless sensor networks with mobile sink," Journal of Communications Software and Systems, vol. 2, pp. 141–151, 2006.

[14] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," in Advances in Neural Information Processing Systems 12. MIT Press, 2000, pp. 1057–1063.

[15] P. Bremaud, Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues, ser. Texts in Applied Mathematics. Springer, 1999.

[16] S. Boyd, P. Diaconis, and L. Xiao, "Fastest mixing Markov chain on a graph," SIAM Review, vol. 46, pp. 667–689, 2003.

[17] M. Torki, "Second-order directional derivatives of all eigenvalues of a symmetric matrix," Nonlinear Anal., vol. 46, no. 8, pp. 1133–1150, Dec. 2001.

[18] S. Boyd and L. Vandenberghe, Convex Optimization. New York, NY, USA: Cambridge University Press, 2004.

[19] F. Chung, Spectral Graph Theory, ser. CBMS Regional Conference Series. American Mathematical Society, 1997, no. 92.