
arXiv:1602.04301v1 [cs.SI] 13 Feb 2016

Latent Space Model for Road Networks to Predict Time-Varying Traffic

Dingxiong Deng1, Cyrus Shahabi1, Ugur Demiryurek1, Linhong Zhu2, Rose Yu1, Yan Liu1
Department of Computer Science, University of Southern California1

    Information Sciences Institute, University of Southern California2

    {dingxiod, shahabi, demiryur, qiyu, Yanliu.cs}@usc.edu, [email protected]

ABSTRACT

Real-time traffic prediction from high-fidelity spatiotemporal traffic sensor datasets is an important problem for intelligent transportation systems and sustainability. However, it is challenging due to the complex topological dependencies and high dynamism associated with changing road conditions. In this paper, we propose a Latent Space Model for Road Networks (LSM-RN) to address these challenges holistically. In particular, given a series of road network snapshots, we learn the attributes of vertices in latent spaces which capture both topological and temporal properties. As these latent attributes are time-dependent, they can estimate how traffic patterns form and evolve. In addition, we present an incremental online algorithm which sequentially and adaptively learns the latent attributes from the temporal graph changes. Our framework enables real-time traffic prediction by 1) exploiting real-time sensor readings to adjust/update the existing latent spaces, and 2) training as data arrives and making predictions on-the-fly. By conducting extensive experiments with a large volume of real-world traffic sensor data, we demonstrate the superiority of our framework for real-time traffic prediction on large road networks over competitors as well as baseline graph-based LSMs.

1. INTRODUCTION

Recent advances in traffic sensing technology have enabled the acquisition of high-fidelity spatiotemporal traffic datasets. For example, at our research center, for the past five years, we have been collecting data from 15000 loop detectors installed on the highways and arterial streets of Los Angeles County, covering 3420 miles cumulatively (see the case study in [12]). The collected data include several main traffic parameters such as occupancy, volume, and speed at the rate of 1 reading/sensor/min. These data streams enable traffic prediction, which in turn improves route navigation, traffic regulation, urban planning, etc.

The traffic prediction problem is to predict the future travel speed of each and every edge of a road network, given the historical speed readings sensed from the sensors on these edges. To solve the traffic prediction problem, the majority of existing techniques utilize the historical information of an edge to predict its future travel speed using regression techniques such as Auto-regressive Integrated Moving Average (ARIMA) [16], Support Vector Regression (SVR) [18] and Gaussian Process (GP) [28]. There are also other studies that leverage spatial/topological similarities to predict the readings of an edge based on its neighbors in either Euclidean space [9] or network space [13]. Even though there are a few notable exceptions such as Hidden Markov Model (HMM) [13, 25] that predict traffic of edges by collectively inferring temporal information, these approaches simply combine the local information of neighbors with temporal information. Furthermore, existing approaches such as GP and HMM are computationally expensive and require repeated offline trainings. Therefore, it is very difficult to adapt these to real-time traffic forecasting.

Motivated by these challenges, we propose Latent Space Modeling for Road Networks (LSM-RN), which enables more accurate and scalable traffic prediction by utilizing both topology similarity and temporal correlations. Specifically, with LSM-RN, the vertices of a dynamic road network are embedded into a latent space, where two vertices that are similar in terms of both time-series traffic behavior and road network topology are close to each other in the latent space. Recently, several machine learning problems such as community detection [21, 26], link prediction [?, 30] and sentiment analysis [29] have been formulated using latent space modeling. Most relevant to our work is the latent space modeling for social networks (hereafter called LSM-SN), because road networks and social networks are both graphs where each vertex has different attributes. However, existing approaches for LSM-SN are not suitable for identifying the edge and/or sensor latent attributes in road networks and exploiting them for real-time traffic prediction, due to the following reasons.

First, road networks show significant topological (e.g., travel-speeds - weights - between two sensors on the same road segment are similar) and temporal (e.g., travel-speeds measured every 1 minute on a particular sensor are similar) correlations. These correlations can be exploited to alleviate the missing data problem, which is unique to road networks, due to the fact that some road segments may contain no sensors and any sensor may occasionally fail to report data. Second, unlike social networks, LSM-RN is fast evolving due to the time-varying traffic conditions. On the contrary, social networks evolve smoothly and frequent changes are very unlikely (e.g., it is very unlikely that a user changes political preferences twice a day). Instead, in road networks, traffic conditions on a particular road segment can change rapidly in a short time (i.e., time-dependent) because of rush/non-rush hours and traffic incidents. Third, LSM-RN is highly dynamic where fresh data come in a streaming fashion, whereas the connections (weights) between nodes in social networks are almost static. The dynamism requires partial updates of the model as opposed to the time-consuming full updates in LSM-SN. Finally, with LSM-RN, the ground truth can be observed shortly


after making the prediction (by measuring the actual speed later in the future), which also provides an opportunity to improve/adjust the model incrementally (i.e., online learning).

With our proposed LSM-RN, each dimension of the embedded latent space represents a latent attribute; thus the attribute distribution of vertices and how the attributes interact with each other jointly determine the underlying traffic pattern. To enforce the topology of the road network, LSM-RN adds a graph Laplacian constraint which not only enables global graph similarity, but also completes the missing data by a set of similar edges with non-zero readings. Subsequently, we incorporate the temporal properties into our LSM-RN model by considering time-dependent latent attributes and a global transition process. With these time-dependent latent attributes and the transition matrix, we are able to understand how traffic patterns form and evolve.

To infer the time-dependent latent attributes of our LSM-RN model, a typical method is to utilize multiplicative update algorithms [15], where we jointly infer the whole set of latent attributes via iterative updates until they become stable, termed as global learning. However, global learning is not only slow but also not practical for real-time traffic prediction. This is because the spatiotemporal traffic data are of high fidelity (i.e., updates are frequent, at every one minute) and the actual ground truth of traffic speed becomes available shortly afterwards (e.g., after making a prediction for the next five minutes, the ground truth data will be available instantly after five minutes). We thus propose an incremental online learning approach with which we sequentially and adaptively learn the latent attributes from the temporal traffic changes. In particular, each time our algorithm makes a prediction with the latent attributes learned from the previous snapshot, it receives feedback from the next snapshot (i.e., the ground truth speed reading we already obtained) and subsequently modifies the latent attributes for more accurate predictions. Unlike traditional online learning, which only performs one single update (e.g., updating one vertex per prediction) per round, our goal is to make predictions for the entire road network, and thus we update the latent attributes of many correlated vertices.

Leveraging the global and incremental learning algorithms, our LSM-RN model can strike a balance between accuracy and efficiency for real-time forecasting. Specifically, we consider a setting with a predefined time window where at each time span (e.g., 5 minutes), we learn our traffic model with the proposed incremental inference approach on-the-fly, and make predictions for the next time span. Meanwhile, we batch the recomputation of our traffic model at the end of one large time window (e.g., one hour). Under this setting, our LSM-RN model enables the following two properties: (1) real-time feedback information can be seamlessly incorporated into our framework to adjust for the existing latent spaces, thus allowing for more accurate predictions, and (2) our algorithms train and make predictions on-the-fly with a small amount of data rather than requiring large training datasets.

We conducted extensive experiments using a large volume of real-world traffic sensor data. We demonstrate that the LSM-RN framework achieves better accuracy than both existing time series methods (e.g., ARIMA and SVR) and the LSM-SN approaches. Moreover, we show that our algorithm scales to large road networks. For example, it only takes 4 seconds to make a prediction for a network with 19,986 edges. Finally, we show that our batch window setting works well for streaming data, alternating the executions of our global and incremental algorithms, which strikes a compromise between prediction accuracy and efficiency. For instance, incremental learning is one order of magnitude faster than global learning, and it requires less than 1 second to incorporate real-time feedback information.

The remainder of this paper is organized as follows. In Section 2, we discuss the related work. We define our problem in Section 3, and explain LSM-RN in Section 4. We present the global learning and incremental learning algorithms, and discuss how to adapt our algorithms for real-time traffic forecasting in Section 5. In Section 6, we report the experimental results, and Section 7 concludes the paper.

2. BACKGROUND AND RELATED WORKS

2.1 Traffic analysis

Many studies have focused on the traffic prediction problem, but no single study so far has tackled all the challenges in a holistic manner. Some focused on missing values [17] or missing sensors [11, 27], but not both. Some studies [16, 28] focus on utilizing temporal data, modeling each sensor (or edge) independently and making predictions using time series approaches (e.g., Auto-regressive Integrated Moving Average (ARIMA) [16], Support Vector Regression (SVR) [18] and Gaussian Process (GP) [28]). For instance, Pan et al. [16] learned an enhanced ARIMA model for each edge in advance, and then performed traffic prediction on top of these models. Very few studies [13, 25] utilize a spatiotemporal model with correlated time series based on Hidden Markov Models, but only for a small number of time series and not always using the network space as the spatial dimension (e.g., using Euclidean space [9]). In [24], Xu et al. consider using the newly arrived data as feedback to reward one classifier versus the other, but not for dynamically updating the model. Note that many existing studies [5, 11, 22, 25, 27] on traffic prediction are based on GPS datasets, which differ from our sensor dataset, where we have fine-grained and steady readings from road-equipped sensors. We are not aware of any study that uses latent space modeling (considering both time and network topology) for real-time traffic prediction from incomplete (i.e., missing sensors and values) sensor datasets.

2.2 Latent space model and NMF

Recently, many real data analytic problems such as community detection [21, 26], recommendation systems [6], topic modeling [20], image clustering [4], and sentiment analysis [29] have been formulated as problems of latent space learning. These studies assume that, given a graph, each vertex resides in a latent space with attributes, and vertices which are close to each other are more likely to be in the same cluster (e.g., community or topic) and form a link. In particular, the focus is to infer the latent matrix by minimizing the loss (e.g., squared loss [26, 29] or KL-divergence [4]) between observed and estimated links. However, existing methods are not designed for the highly correlated (topologically and temporally) and very dynamic road networks. Few studies [19] have considered the temporal relationships in social networks, under the assumption that networks evolve smoothly. In addition, the temporal graph snapshots in [19] are treated separately, and thus newly observed data will not be incorporated to improve the model. Compared with existing works, we explore the feasibility of modeling road networks with a time-varying latent space. The traffic speed of a road segment is determined by the latent attributes of its endpoints and the interaction between the corresponding attributes. To tackle the sparsity of the road network, we utilize the graph topology by adding a graph Laplacian constraint to impute the missing values. In addition, the latent position of each vertex varies with time, and allows for sudden movement from one timestamp to the next via a transition matrix.

Different techniques have been proposed to learn the latent properties, where Non-negative Matrix Factorization (NMF) is one of the most popular methods because of its ease of interpretation and flexibility. In this work, we explore the feasibility of applying dynamic NMF to the traffic prediction domain. We design a global algorithm to infer the latent space based on the traditional multiplicative algorithm [8, 15]. We further propose a topology-aware incremental algorithm, which adaptively updates the latent space representation for each node in the road network with topology constraints. The proposed algorithms differ from traditional online NMF algorithms such as [3], which perform each online update independently.

Table 1: Notations and explanations

Notations | Explanations
N, n      | road network, number of vertices of the road network
G         | the adjacency matrix of a graph
U         | latent space matrix
B         | attribute interaction matrix
A         | the transition matrix
k         | the number of dimensions of latent attributes
T         | the number of snapshots
span      | the gap between two continuous graph snapshots
h         | the prediction horizon
λ, γ      | regularization parameters for graph Laplacian and transition process

3. PROBLEM DEFINITION

We denote a road network as a directed graph N = (V, E), where V is the set of vertices and E ⊆ V × V is the set of edges. A vertex vi ∈ V models a road intersection or an end of road. An edge e(vi, vj), which connects two vertices, represents a directed network segment. Each edge e(vi, vj) is associated with a travel speed c(vi, vj) (e.g., 40 miles/hour). In addition, N has a corresponding adjacency matrix representation, denoted as G, whose (i, j)th entry represents the edge weight between the ith and jth vertices.

The road network snapshots are constructed from a large-scale, high resolution traffic sensor dataset (see the detailed description of the sensor data in Section 6). Specifically, a sensor s (i.e., a loop detector) is located at one segment of road network N, and provides a reading (e.g., 40 miles/hour) per sampling rate (e.g., 1 min). We divide one day into different intervals, where span is the length of each time interval. For example, when span = 5 minutes, we have 288 time intervals per day. For each time interval t, we aggregate the readings of each sensor. Subsequently, for each edge segment of network N, we aggregate all sensor readings located at that edge as its weight. Therefore, at each timestamp t, we have a road network snapshot Gt from traffic sensors.

Example. Figure 1 (a) shows a simple road network with 7 vertices and 10 edges at one timestamp. Three sensors (i.e., s1, s2, s3) are located on edges (v1, v2), (v3, v4) and (v7, v6) respectively, and each sensor provides an aggregated reading during the time interval. Figure 1(b) shows the corresponding adjacency matrix after mapping the sensor readings to the road segments. Note that the sensor dataset is incomplete, with both missing values (i.e., a sensor fails to report data) and missing sensors (i.e., edges without any sensors). Here sensor s3 fails to provide a reading, thus the edge weight c(v3, v4) is marked ? due to the missing value. In addition, the edge weight c(v3, v2) is marked × because of a missing sensor.

[Figure 1: An example of road network. (a) An abstract road network N with vertices v1, ..., v7 and sensors s1, s2, s3 located on three of its edges. (b) Adjacency matrix representation G, with one row/column per vertex; unobserved entries are marked ? (missing value) or × (missing sensor).]
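To make the snapshot construction concrete, the sketch below (a minimal illustration, not the authors' code; the data layout, field names, and the mean aggregation are assumptions) turns one interval's worth of per-sensor readings into an adjacency matrix Gt:

```python
import numpy as np

def build_snapshot(readings, sensor_to_edge, n):
    """Aggregate raw sensor readings for one time interval into an
    n x n adjacency matrix G. `readings` maps sensor id -> list of
    speed readings (mph) observed during the interval; `sensor_to_edge`
    maps sensor id -> (i, j) vertex pair. Edges with no working sensor
    stay 0, i.e., they are treated as missing."""
    G = np.zeros((n, n))
    for sid, speeds in readings.items():
        if not speeds:                 # sensor failed to report: missing value
            continue
        i, j = sensor_to_edge[sid]
        G[i, j] = np.mean(speeds)      # aggregate all readings on this edge
    return G

# Toy example loosely mirroring Figure 1 (edge assignments assumed).
readings = {"s1": [28.6], "s2": [40.0], "s3": []}
sensor_to_edge = {"s1": (0, 1), "s2": (6, 5), "s3": (2, 3)}
G_t = build_snapshot(readings, sensor_to_edge, n=7)
```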

Given a small number of road network snapshots, or a dynamic road network, our objective is to predict the future traffic conditions. Specifically, a dynamic road network is a sequence of snapshots (G1, G2, ..., GT) with edge weights denoting time-dependent travel cost.

With a dynamic road network, we formally define the problem of edge traffic prediction with missing data as follows:

Problem 1 Given a dynamic road network (G1, G2, ..., GT) with missing data at each timestamp, we aim to achieve the following two goals:

  • complete the missing data (i.e., both missing values and sensors) of Gi, where 1 ≤ i ≤ T;

  • predict the future readings of GT+h, where h is the prediction horizon. For example, when h = 1, we predict the traffic condition of GT+1 at the next timestamp.

For ease of presentation, Table 1 lists the notations we use throughout this paper. Note that since each dimension of a latent space represents a latent attribute, we use latent attributes and latent positions interchangeably.

4. LATENT SPACE MODEL FOR ROAD NETWORKS (LSM-RN)

In this section, we describe our LSM-RN model in the context of traffic prediction. We first introduce the basic latent space model (Section 4.1) by considering the graph topology, and then incorporate both temporal and transition patterns (Section 4.2). Finally, we describe the complete LSM-RN model to solve the traffic prediction problem with missing data (Section 4.3).

4.1 Topology in LSM-RN

Our traffic model is built upon the latent space model of the observed road network. Basically, each vertex of the road network has different attributes, and each vertex has an overlapping representation of attributes. The attributes of vertices and how each attribute interacts with others jointly determine the underlying traffic patterns. Intuitively, if two highway vertices are connected, their corresponding interaction generates a higher travel speed than that of two vertices located at arterial streets. In particular, given a snapshot of road network G, we aim to learn two matrices U and B, where matrix U ∈ R^{n×k}_+ denotes the latent attributes of vertices, and matrix B ∈ R^{k×k}_+ denotes the attribute interaction patterns. The product U B U^T represents the traffic speed between any two vertices, which we use to approximate G. Note that B is an asymmetric matrix since the road network G is directed. Therefore, the basic traffic model which considers the graph topology can be determined by solving the following optimization problem:

argmin_{U≥0, B≥0} J = ||G − U B U^T||_F^2    (1)

[Figure 2: An example of our traffic model: G (n × n) ≈ U (n × k) × B (k × k) × U^T (k × n). (a) Basic model. (b) Travel time of c(v1, v2), computed from the latent attributes of v1 and v2 (e.g., "highway" and "business") and the interaction matrix B. Here G represents a road network, U denotes the attributes of vertices in the road network, n is the number of nodes, k is the number of attributes, and B denotes how one type of attributes interacts with others.]

Figure 2 (a) illustrates the intuition of our static traffic model. As shown in Figure 2 (b), suppose we know that each vertex is associated with two attributes (e.g., highway and business area), and the interaction pattern between the two attributes is encoded in matrix B; then we can accurately estimate the travel speed between vertices v1 and v2 using their corresponding latent attributes and the matrix B.
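As a small numeric companion to Figure 2(b), the estimated travel speed of an edge is the bilinear form U(v1) B U^T(v2); all values below are made up for illustration, since the exact figures in the paper's example are not recoverable here:

```python
import numpy as np

# Two latent attributes per vertex, e.g., "highway" and "business area".
U_v1 = np.array([0.6, 0.1])        # latent attributes of v1 (assumed)
U_v2 = np.array([0.5, 0.3])        # latent attributes of v2 (assumed)
B = np.array([[50.0, 20.0],        # interaction strengths between the
              [15.0, 30.0]])       # two attribute types (assumed)

# Estimated travel speed c(v1, v2) = U(v1) B U^T(v2), cf. Figure 2(b).
c_v1_v2 = U_v1 @ B @ U_v2
print(round(c_v1_v2, 2))           # a speed estimate in miles/hour
```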

Overcome the sparsity of G. In our road network, G is very sparse (i.e., zero entries dominate the items in G) for the following reasons: (1) the average degree of a road network is small [23], and thus the edges of the road network are far from fully connected; (2) the distribution of sensors is non-uniform, and only a small number of edges are equipped with sensors; and (3) there exist missing values (for those edges equipped with sensors) due to the failure and/or maintenance of sensors.

Therefore, we define our loss function only on edges with observed readings, that is, the set of edges with travel cost c(vi, vj) > 0. In addition, we also propose an in-filling method to reduce the gap between the input road network and the estimated road network. We consider graph Laplacian dynamics, which is an effective smoothing approach for finding global structure similarity [14]. Specifically, we construct a graph Laplacian matrix L, defined as L = D − W, where W is a graph proximity matrix that is constructed from the network topology, and D is a diagonal matrix with Dii = Σj Wij. With these new constraints, our traffic model for one snapshot of road network G is expressed as follows:

argmin_{U,B} J = ||Y ⊙ (G − U B U^T)||_F^2 + λ Tr(U^T L U)    (2)

where Y is an indication matrix for all the non-zero entries in G, i.e., Yij = 1 if and only if G(i, j) > 0; ⊙ is the Hadamard product operator, i.e., (X ⊙ Z)ij = Xij × Zij; and λ is the Laplacian regularization parameter.
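The objective of Equation 2 is straightforward to evaluate. Here is a hedged NumPy sketch in which the proximity matrix W is assumed to be the binarized, symmetrized adjacency, one plausible construction from the topology (the paper does not pin W down):

```python
import numpy as np

def lsm_rn_loss(G, U, B, lam):
    """Eq. 2: masked reconstruction error plus Laplacian smoothing."""
    Y = (G > 0).astype(float)           # indicator of observed entries
    W = ((G + G.T) > 0).astype(float)   # assumed proximity matrix
    D = np.diag(W.sum(axis=1))
    L = D - W                           # graph Laplacian L = D - W
    R = Y * (G - U @ B @ U.T)           # Hadamard-masked residual
    return np.sum(R ** 2) + lam * np.trace(U.T @ L @ U)

rng = np.random.default_rng(0)
n, k = 7, 2
G = np.abs(rng.normal(30, 10, (n, n))) * (rng.random((n, n)) < 0.3)
U, B = rng.random((n, k)), rng.random((k, k))
print(lsm_rn_loss(G, U, B, lam=0.5))
```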

4.2 Time in LSM-RN

We now incorporate the temporal information, including time-dependent modeling of latent attributes and the temporal transition. With this model, each vertex is represented in a unified latent space, where each dimension either represents a spatial or a temporal attribute.

4.2.1 Temporal effect of latent attributes

The behavior of the vertices of road networks may evolve quickly. For instance, the behavior of a vertex that is similar to that of a highway vertex during normal traffic conditions may become similar to that of an arterial street node during congestion hours. Because the behavior of each vertex can change over time, we must employ time-dependent modeling for the attributes of vertices for real-time traffic prediction. Therefore, we add the time-dependent effect of attributes into our traffic model. Specifically, for each t ≤ T, we aim to learn a corresponding time-dependent latent attribute representation Ut. Although the latent attribute matrix Ut is time-dependent, we assume that the attribute interaction matrix B is an inherent property, and thus we opt to fix B for all timestamps. By incorporating this temporal effect, we obtain our model based on the following optimization problem:

argmin_{Ut,B} J = Σ_{t=1}^{T} ||Yt ⊙ (Gt − Ut B Ut^T)||_F^2 + Σ_{t=1}^{T} λ Tr(Ut^T L Ut)    (3)

4.2.2 Transition matrix

Due to the dynamics of traffic conditions, we not only want to learn the time-dependent latent attributes, but also learn a transition model to capture the evolving behavior from one snapshot to the next. The transition should be able to capture both periodic evolving patterns (e.g., morning/afternoon rush hours) and non-recurring patterns caused by traffic incidents (e.g., accidents, road construction, or work zone closures). For example, during the interval of an accident, a vertex transitions from a normal state to a congested state at the beginning, then becomes normal again after the accident is cleared.

We thus assume a global process to capture the state transitions. Specifically, we use a matrix A that approximates the changes of U from time t − 1 to time t, i.e., Ut = Ut−1 A, where U ∈ R^{n×k}_+ and A ∈ R^{k×k}_+. The transition matrix A represents how likely a vertex is to transit from attribute i to attribute j during that particular time interval.

4.3 LSM-RN Model

Considering all the above discussions, the final objective function for our LSM-RN model is defined as follows:

argmin_{Ut,B,A} J = Σ_{t=1}^{T} ||Yt ⊙ (Gt − Ut B Ut^T)||_F^2 + Σ_{t=1}^{T} λ Tr(Ut^T L Ut) + Σ_{t=2}^{T} γ ||Ut − Ut−1 A||_F^2    (4)

where λ and γ are the regularization parameters.

4.3.1 Edge traffic prediction with missing data

Suppose that by solving Equation 4 we obtain the learned matrices Ut, B and A from our LSM-RN model. Consequently, the task of both missing value and sensor completion can be accomplished as follows:

Gt = Ut B Ut^T, when 1 ≤ t ≤ T    (5)

Subsequently, the edge traffic for snapshot GT+h, where h is the number of future time spans, can be predicted as follows:

GT+h = (UT A^h) B (UT A^h)^T    (6)

[Figure 3: Overall Framework. Graph snapshots G1, G2, ..., GT feed (1) latent space learning of the road network, which drives (2) edge traffic prediction with missing data; (3) real-time feedback flows back into the latent space learning.]

Figure 3 shows an overview of the LSM-RN framework. Given a series of road network snapshots, LSM-RN processes them in three steps: (1) discover the latent attributes of vertices at each timestamp, which capture both the spatial and temporal properties; (2) understand the traffic patterns and build a predictive model of how these latent attributes change over time; and (3) exploit real-time traffic information to adjust the existing latent space models.

5. INFERENCE OF LSM-RN

In this section, we first propose a global multiplicative algorithm, and then discuss a fast incremental algorithm that scales to large road networks.

5.1 Global Learning Algorithm

We develop an iterative update algorithm to solve Equation 4, which belongs to the category of traditional multiplicative update algorithms [15]. By adopting the methods from [15], we can derive the update rules of Ut, B and A.

5.1.1 Update rule of Ut

We first consider updating the variable Ut while fixing all the other variables. The part of the objective function in Equation 4 that is related to Ut can be rewritten as follows:

J = Σ_{t=1}^{T} Tr((Yt ⊙ (Gt − Ut B Ut^T))(Yt ⊙ (Gt − Ut B Ut^T))^T) + Σ_{t=1}^{T} λ Tr(Ut^T (D − W) Ut) + Σ_{t=2}^{T} γ Tr((Ut − Ut−1 A)(Ut − Ut−1 A)^T)

Algorithm 1 Global-learning(G1, G2, ..., GT)
Input: graph matrices G1, G2, ..., GT.
Output: Ut (1 ≤ t ≤ T), A and B.
1: Initialize Ut, B and A
2: while Not Convergent do
3:   for t = 1 to T do
4:     update Ut according to Equation 10
5:   end for
6:   update B according to Equation 11
7:   update A according to Equation 12
8: end while

Because we have the non-negativity constraint on Ut, following the standard constrained optimization theory, we introduce the Lagrangian multiplier ψt ∈ R^{n×k} and minimize the Lagrangian function L:

L = J + Σ_{t=1}^{T} Tr(ψt Ut^T)    (7)

Taking the derivative of L with respect to Ut, we have the following expression (the details are described in Appendix 9.1):

∂L/∂Ut = −2(Yt ⊙ Gt)(Ut B^T + Ut B) + 2(Yt ⊙ (Ut B Ut^T))(Ut B^T + Ut B) + 2λ(D − W)Ut + 2γ(Ut − Ut−1 A) + 2γ(Ut A A^T − Ut+1 A^T) + ψt    (8)

By setting ∂L/∂Ut = 0 and using the KKT conditions (ψt)ij (Ut)ij = 0, we obtain the following equations for (Ut)ij:

[−(Yt ⊙ Gt)(Ut B^T + Ut B) + (Yt ⊙ (Ut B Ut^T))(Ut B^T + Ut B) + λ L Ut + γ(Ut − Ut−1 A) + γ(Ut A A^T − Ut+1 A^T)]_ij (Ut)_ij = 0    (9)

Following the updating rules proposed and proved in [15], we derive the following update rule of Ut:

(Ut) ← (Ut) ⊙ ( [(Yt ⊙ Gt)(Ut B^T + Ut B) + λ W Ut + γ(Ut−1 A + Ut+1 A^T)] / [(Yt ⊙ (Ut B Ut^T))(Ut B^T + Ut B) + λ D Ut + γ(Ut + Ut A A^T)] )^{1/4}    (10)

5.1.2 Update rule of B and A

The updating rules for B and A can be derived in a similar way (see Appendix 9.2 for the detailed calculation):

B ← B ⊙ ( [Σ_{t=1}^{T} Ut^T (Yt ⊙ Gt) Ut] / [Σ_{t=1}^{T} Ut^T (Yt ⊙ (Ut B Ut^T)) Ut] )    (11)

A ← A ⊙ ( [Σ_{t=2}^{T} Ut−1^T Ut] / [Σ_{t=2}^{T} Ut−1^T Ut−1 A] )    (12)

Algorithm 1 outlines the process of updating each matrix using the aforementioned multiplicative rules to optimize Eq. 4. The general idea is to jointly infer and cyclically update all the latent attribute matrices Ut, B and A. In particular, we first jointly learn the latent attributes for each time t from all the graph snapshots (Lines 2–4). Based on the sequence of time-dependent latent attributes <U1, U2, ..., UT>, we then learn the global attribute interaction pattern B and the transition matrix A (Lines 6–7).

From Algorithm 1, we now explain how our LSM-RN model jointly learns the spatial and temporal properties. Specifically, when we update the latent attribute of one vertex Ut(i), the spatial property is preserved by (1) considering the latent positions of its adjacent vertices (Yt ⊙ Gt), and (2) incorporating the local graph Laplacian constraint (i.e., matrices W and D). Moreover, the temporal property of one vertex is captured by leveraging its latent attributes in the previous and next timestamps (i.e., Ut−1(i) and Ut+1(i)), as well as the transition matrix.

In the following, we briefly discuss the time complexity and convergence of our global multiplicative algorithm. In each iteration, the computation is dominated by matrix multiplication operations: the multiplication between an n×n matrix and an n×k matrix, and another multiplication between an n×k matrix and a k×k matrix. Therefore, the worst case time complexity per iteration is dominated by O(T(nk^2 + n^2 k)). However, since all the matrices are sparse, the complexity of multiplying two sparse n×k matrices is much smaller than O(n^2 k). Following the proofs shown in previous works [4] [15] [29], we can prove that Algorithm 1 converges to a local minimum and that the objective value is non-increasing in each iteration.
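For concreteness, here is a minimal NumPy sketch of the global multiplicative updates, simplified to γ = 0 so that the Ut update drops the Ut−1/Ut+1 coupling terms of Equation 10; the proximity matrix, random initialization, fixed iteration count, and the ε guard against division by zero are implementation assumptions rather than the authors' choices:

```python
import numpy as np

def global_learning(Gs, k, lam=0.5, iters=100, eps=1e-9):
    """Multiplicative updates for Eq. 4 with gamma = 0 (sketch).
    Gs is a list of T snapshot matrices, each n x n."""
    n = Gs[0].shape[0]
    rng = np.random.default_rng(0)
    Us = [rng.random((n, k)) for _ in Gs]
    B, A = rng.random((k, k)), rng.random((k, k))
    Ys = [(G > 0).astype(float) for G in Gs]
    W = ((sum(Gs) + sum(Gs).T) > 0).astype(float)  # assumed proximity
    D = np.diag(W.sum(axis=1))
    for _ in range(iters):
        for t, (G, Y) in enumerate(zip(Gs, Ys)):
            U = Us[t]
            S = U @ B.T + U @ B                    # U B^T + U B
            num = (Y * G) @ S + lam * (W @ U)
            den = (Y * (U @ B @ U.T)) @ S + lam * (D @ U) + eps
            Us[t] = U * (num / den) ** 0.25        # Eq. 10 with gamma = 0
        num_B = sum(U.T @ (Y * G) @ U for U, G, Y in zip(Us, Gs, Ys))
        den_B = sum(U.T @ (Y * (U @ B @ U.T)) @ U for U, Y in zip(Us, Ys))
        B *= num_B / (den_B + eps)                 # Eq. 11
    for _ in range(iters):                         # Eq. 12: fit A last,
        num_A = sum(Us[t - 1].T @ Us[t] for t in range(1, len(Us)))
        den_A = sum(Us[t - 1].T @ Us[t - 1] @ A for t in range(1, len(Us)))
        A *= num_A / (den_A + eps)                 # decoupled when gamma = 0
    return Us, B, A
```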

[Figure 4: Illustration of algorithms. (a) Global learning: U1, ..., UT are jointly and cyclically inferred from all snapshots G1, ..., GT together with B and A. (b) Incremental learning: each Ut is adapted sequentially from Ut−1 as the snapshot Gt arrives.]

5.2 Incremental learning algorithm

Although the global multiplicative algorithm accurately captures the latent attributes, it is computationally expensive. As illustrated in Figure 4 (a), the latent attributes are jointly inferred from the entire set of graph snapshots and cyclically updated until they become stable. Unfortunately, this joint and cyclic inference is very time consuming. A naive method is to treat each snapshot independently, and learn the temporal latent attributes at each timestamp separately without considering the temporal relationships. To improve the efficiency as well as preserve the topological and temporal properties, we propose an incremental algorithm. As depicted in Figure 4 (b), our incremental algorithm sequentially and adaptively learns the latent attributes from the temporal graph changes.

5.2.1 Framework of incremental algorithm

The intuition of our incremental algorithm is based on the following observation: during a short time interval (e.g., 5 minutes), the overall traffic condition of the whole network tends to stay steady, and the travel costs of most edges change at a slow pace. For those edges with minor travel speed variations, their corresponding positions in the latent space do not change much either. Nevertheless, we still need to identify vertices with obvious variations in terms of their travel speeds, and adjust their corresponding latent attributes. For example, some edges could be in a transition state from non-rush hour to rush hour, and thus their travel speed reduces significantly. Therefore, instead of recomputing the latent attributes of each vertex from scratch at every timestamp, we perform a "lazy" adjustment, utilizing the latent attributes we have already learned in the previous snapshot. As shown in Figure 4, to learn the latent attribute Ut, the incremental algorithm utilizes the latent attributes we already learned in the previous snapshot (i.e., Ut−1) and the dynamism of the traffic condition.

Algorithm 2 presents the pseudo-code of the incremental learning algorithm. Initially, we learn the latent space of U1 with our global multiplicative algorithm (Line 1). With the learned latent matrix Ut−1, at each timestamp t between 2 and T, we incrementally update the latent space Ut from Ut−1 according to the observed graph changes (Lines 2-4). After that, we learn the global transition matrix A (Line 5).

Algorithm 2 Incremental-Learning(G1, G2, ..., GT)
Input: graph matrices G1, G2, ..., GT.
Output: Ut (1 ≤ t ≤ T), A and B.
1: (U1, B) ← Global-learning(G1)
2: for t = 2 to T do
3:   Ut ← Incremental-Update(Ut−1, Gt) (see Section 5.2.2)
4: end for
5: Iteratively learn transition matrix A using Equation 12 until A converges

5.2.2 Topology-aware incremental update

Given Ut−1 and Gt, we now explain how to calculate Ut incrementally from Ut−1, with which we can accurately approximate Gt. The main idea is similar to an online learning process. At each round, the algorithm predicts an outcome for the required task (i.e., predicting the speed of edges). Once the algorithm makes a prediction, it receives feedback indicating the correct outcome. Then, the online algorithm can modify its prediction mechanism, presumably improving the chances of making a more accurate prediction at subsequent timestamps. In our scenario, we first use the latent attribute matrix Ut−1 to make a prediction of Gt as if we did not know the observation; subsequently we adjust the model of Ut according to the true observation of Gt that we already have in hand.

However, in our problem, we are making predictions for the entire road network, not for a single edge. When we predict for one edge, we only need to adjust the latent attributes of two vertices, whereas in our scenario we need to update the latent attributes of many correlated vertices. Therefore, the effect of adjusting the latent attribute of one vertex could potentially affect its neighboring vertices, and influence the convergence speed of the incremental algorithm. Hence, the adjustment order of vertices also matters in our incremental update.

In a nutshell, our incremental update consists of the following two components: 1) identify candidate nodes based on feedback; 2) update their latent attributes and propagate the adjustment from one vertex to its neighbors. As outlined in Algorithm 3, given Ut−1 and Gt, we first make an estimation Ĝt based on Ut−1 (Line 1); subsequently we treat Gt as the feedback information, select the set of vertices for which we make inaccurate predictions, and insert them into a candidate set cand (Lines 3-9). Consequently, for each vertex i of cand, we adjust its latent attribute so that we can make more accurate predictions (Line 15), and then examine how this adjustment influences the candidate set from the following two aspects: (1) if the latent attribute of i does not change much, we remove it from the set cand (Lines 17-19); (2) if the adjustment of i also affects its neighbor j, we add vertex j to cand (Lines 20-25).

The remaining questions in our Incremental-Update algorithm are how to adjust the latent position of one vertex according to feedback, and how to decide the order of updates. In the following, we address each of them separately.

Algorithm 3 Incremental-Update(Ut−1, Gt)
Input: the latent matrix Ut−1 at t−1, observed graph reading Gt.
Output: updated latent space Ut.
1: Ĝt ← Ut−1 B Ut−1^T
2: cand ← ∅
3: for each i ∈ G do
4:   for each j ∈ out(i) do
5:     if |Gt(i, j) − Ĝt(i, j)| ≥ δ then
6:       cand ← cand ∪ {i, j}
7:     end if
8:   end for
9: end for
10: Ut ← Ut−1
11: while Not Convergent AND cand ≠ ∅ do
12:   for i ∈ cand do
13:     oldu ← Ut(i)
14:     for each j ∈ out(i) do
15:       adjust Ut(i) with Eq. 14
16:     end for
17:     if ||Ut(i) − oldu||_F^2 ≤ φ then
18:       cand ← cand \ {i}
19:     end if
20:     for each j ∈ out(i) do
21:       p ← Ut(i) B Ut(j)^T
22:       if |p − Gt(i, j)| ≥ δ then
23:         cand ← cand ∪ {j}
24:       end if
25:     end for
26:   end for
27: end while

[Figure 5: Two challenges of adjusting the latent attribute with feedbacks. (a) Adjustment method: moving Ut−1(v1) to Ut(v1) in the latent "highway"/"business" space, either in one step (option 1) or gradually (option 2). (b) Adjustment order: the vertices of the road network in Figure 1 grouped into strongly connected components.]

Adjust the latent attribute of one vertex. To achieve high efficiency when adjusting the latent attribute, we propose to make the smallest change of the latent space (as quickly as possible) that predicts the correct value. For example, as shown in Figure 5(a), suppose we already know the new latent position of v1; then the movement with fewer steps (Option 1) is preferable to the gradual adjustment (Option 2). Note that in our problem, when we move the latent position of a vertex to a new position, the objective of this movement is to produce a correct prediction for each of its outgoing edges. Specifically, given Ut−1(i), we want to find Ut(i) which can accurately predict the weight of each edge e(vi, vj) that is adjacent to vertex vi. We thus formulate our problem as follows:

Ut(i), ξ* = argmin_{U(i) ∈ R^k_+} (1/2) ||U(i) − Ut−1(i)||_F^2 + C ξ
s.t. |U(i) B U^T(j) − Gt(i, j)| ≤ δ + ξ    (13)

Note that we have a non-negativity constraint over the latent space of Ut(i). To avoid an overly aggressive update of the latent space, we add a non-negative slack variable ξ into the optimization problem. Therefore, we adopt the approach from [3]: when the predicted

value ŷt (i.e., Ut(i) B Ut^T(j)) is less than the correct value yt (i.e., Gt(i, j)), we use the traditional online passive-aggressive algorithm [7] because it still guarantees the non-negativity of U(i); otherwise, we update U(i) by solving a quadratic optimization problem. The detailed solution is as follows:

Ut(i) = max(Ut−1(i) + (k* − θ*) · B Ut−1(j)^T, 0)    (14)

k* and θ* are computed as follows:

k* = αt, θ* = 0           if ŷt < yt
k* = 0, θ* = C            if ŷt > yt and f(C) ≥ 0
k* = 0, θ* = f^{−1}(0)    if ŷt > yt and f(C) < 0    (15)

where

αt = min(C, max(|ŷt − yt| − δ, 0) / ||B Ut−1(j)^T||^2)
ft(θ) = max(Ut(i) − θ B Ut(j)^T, 0) · B Ut(j)^T − Gt(i, j) − δ

Updating order of cand. As already discussed, the update order is important because it influences the convergence speed of our incremental algorithm. Take the example of the road network shown in Figure 1: suppose our initial cand contains three vertices v7, v6 and v2, where we have two edges e(v7, v6) and e(v6, v2). If we randomly choose the update sequence < v7, v6, v2 >, that is, we first adjust the latent attribute of v7 so that c(v7, v6) has a correct reading, and subsequently adjust the latent attribute of v6 to correct our estimation of c(v6, v2), then unfortunately the adjustment of v6 could invalidate the correction we have already made to v7, leading to an inaccurate estimation of c(v7, v6) again. A desirable order is to update vertex v6 before updating v7.

Therefore, we propose to consider the reverse topology of the road network when we update the latent position of each candidate vertex v ∈ cand. The general principle is that, given an edge e(vi, vj), the update of vertex vi should proceed after the update of vj, since the position of vi depends on vj. This motivates us to derive a reverse topological order of the graph G. Unfortunately, the road network G is not a Directed Acyclic Graph (DAG), as it contains cycles. To address this issue, we first generate the Strongly Connected Components (SCC) of the graph G, and contract each SCC into a super node. We then derive a topological order on this condensed graph. For the vertex order within each SCC, we can either decide randomly, or use some heuristics to break the SCC into a DAG, thus generating an ordering of vertices inside each SCC. Figure 5(b) shows an example of the ordering for the road network of Figure 1, where each rectangle represents an SCC. After generating a reverse topological order based on the contracted graph and randomly deciding the vertex order within each SCC, we obtain the final ordering < v2, v6, v7, v1, v5, v4, v3 >. Each time we update the latent attributes of cand, we follow this ordering of vertices.
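The SCC-based ordering can be derived with a few graph primitives; the sketch below leans on networkx (an assumed dependency, not one the paper names) and its condensation of a digraph into its SCC DAG:

```python
import networkx as nx

def update_order(edges):
    """Return a vertex update order in which, for each edge (i, j),
    vertex j precedes vertex i whenever the SCC structure allows."""
    G = nx.DiGraph(edges)
    C = nx.condensation(G)          # SCC DAG; node attribute 'members'
    order = []
    # Reverse topological order over SCCs: downstream vertices first.
    for scc in reversed(list(nx.topological_sort(C))):
        order.extend(C.nodes[scc]["members"])   # arbitrary order inside SCC
    return order

# A fragment of Figure 1's road network (edge list assumed from the text).
print(update_order([(7, 6), (6, 2), (1, 2), (3, 4)]))
```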

Time complexity. For each vertex i, the computational complexity of adjusting its latent attributes using Eq. 14 is O(k), where k is the number of attributes. Therefore, to compute the latent attributes U, the time complexity per iteration is O(kT(Δn + Δm)), where Δn is the number of candidate vertices in cand, and Δm is the total number of edges incident to vertices in cand. In practice, Δn ≪ n and Δm ≪ m ≪ n^2; therefore, we conclude that the computational cost per iteration is significantly lower with Algorithm 2 than with the global approach. The time complexity of computing the transition matrix A using Algorithm 2 is the same as that of the global approach.

5.3 Real-time forecasting

In this section, we discuss how to apply our learning algorithms to real-time traffic prediction, where sensor readings are received in a streaming fashion. In practice, if we want to make a prediction for the current traffic, we cannot afford to apply our global learning algorithm to all the previous snapshots because it is computationally expensive. Moreover, it is not always true that more snapshots yield better prediction performance. The alternative method is to treat each snapshot independently: i.e., each time we only apply our learning algorithm to the most recent snapshot, and then use the learned latent attributes to predict the traffic condition. Obviously, this would yield poor prediction quality, as it totally ignores the temporal transitions.

[Figure 6: A batch recomputing framework for real-time forecasting. To provide a just-in-time prediction, at each time span (a time span is very short, e.g., 5 minutes), we train our traffic model with the proposed incremental approach on-the-fly, and return the prediction instantly. Meanwhile, in order to maintain good quality, we batch recompute our traffic model per time window (a time window is much longer, e.g., one hour).]

To achieve a good trade-off between the above two methods, we propose to adopt a sliding window-based setting for the learning of our LSM-RN model, where we apply the incremental algorithm at each timestamp during one time window, and only run our global learning algorithm at the end of the time window. As shown in Figure 6, we apply our global learning at timestamp T (i.e., the end of one time window), which learns the time-dependent latent attributes for the previous T timestamps. Subsequently, for each timestamp T + i between [T, 2T], we apply our incremental algorithm to adjust the latent attributes and make further predictions: i.e., we use UT+i to predict the traffic of GT+i+1. Each time we receive the true observation of GT+i+1, we calculate UT+i+1 via the incremental update of Algorithm 3. The latent attributes are re-computed at timestamp 2T (the end of the time window), and the recomputed latent attributes U2T are used for the next time window [2T, 3T].
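The window scheme of Figure 6 amounts to a simple orchestration loop. In the sketch below, global_learning and incremental_update stand in for Algorithms 1 and 3, and their exact signatures are assumptions made for illustration:

```python
def real_time_forecast(stream, T, global_learning, incremental_update):
    """Alternate global re-learning (once per window of T spans) with
    cheap incremental updates (every span), per Section 5.3.
    `stream` yields one observed snapshot matrix G_t per span."""
    window, predictions = [], []
    U = B = A = None
    for G_t in stream:
        window.append(G_t)
        if len(window) == T:                   # end of a time window:
            Us, B, A = global_learning(window)     # batch recompute
            U, window = Us[-1], []
        elif U is not None:
            U = incremental_update(U, B, G_t)      # adjust with feedback
        if U is not None:
            U_next = U @ A                         # advance one span
            predictions.append(U_next @ B @ U_next.T)  # predicted G_{t+1}
    return predictions
```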

Figure 7: Sensor distribution and the Los Angeles road network, where the green and blue lines depict the sensors and road network segments, respectively (best viewed in color).

    6. EXPERIMENT

6.1 Dataset

We used a large-scale and high resolution (both spatial and temporal) traffic sensor (loop detector) dataset collected from Los Angeles County highways and arterial streets. This dataset includes both inventory and real-time data for 15000 traffic sensors covering approximately 3420 miles. The sampling rate of the streaming data, which provides speed, volume (number of cars passing the sensor locations) and occupancy, is 1 reading/sensor/min. This sensor dataset has been continuously collected and archived since 2010.

We chose two months of the sensor dataset (i.e., March and April 2014) for our experiments, which include more than 60 million records of readings. As for the road network, we used the Los Angeles road network obtained from the HERE Map dataset [10]. We constructed two subgraphs of the Los Angeles road network, termed SMALL and LARGE. The SMALL network contains 5984 vertices and 12538 edges, and LARGE contains 8242 vertices and 19986 edges. As described in Section 3, the sensor data is mapped to the road network edges. Figure 7 shows the sensor locations and road network segments, where the green lines depict the sensors, and the blue lines represent the road network segments. After mapping the sensor data, we have two months of network snapshots for both SMALL and LARGE.

    6.2 Experimental setting

6.2.1 Algorithms

Our methods are termed LSM-RN-All (i.e., the global learning algorithm) and LSM-RN-Inc (i.e., the incremental learning algorithm).

For edge traffic prediction, we compare with LSM-RN-NAIVE, where we adapted the formulations from both [26] and [19]. Hence, LSM-RN-NAIVE considers both network and temporal relationships, but uses the naive incremental learning strategy described in ??, which independently learns the latent attributes of each timestamp first, then the transition matrix. We also compare our algorithms with two representative time series prediction methods: a linear model (i.e., ARIMA [16]) and a non-linear model (i.e., SVR [18]). We train each model independently for each time series with historical data. In addition, because these methods are affected negatively by missing values during the prediction stages (i.e., some of the input readings for ARIMA and SVR could be zero), for a fair comparison we also consider ARIMA-Sp and SVR-Sp, which use the completed readings from our global learning algorithm.

We consider the tasks of missing-value and missing-sensor completion. For the task of missing-value completion, we compare our algorithms with two methods: (1) KNN [9], which uses the average values of the nearby edges in Euclidean distance as the imputed value, and (2) LSM-RN-NAIVE, which independently learns the latent attributes of each snapshot, then uses them to approximate the edge readings. We also implemented the tensor method [1, 2] for missing value completion. However, it cannot address the sparsity problem of our dataset and thus produces meaningless results (i.e., most of the completed values are close to 0). Although our framework is very general and supports missing-sensor completion, we do not evaluate it in our experiments since we do not have ground truth values to verify against.

To evaluate the performance of online prediction, we consider the scenario of the batch-window setting described in Section 5.3. Considering a time window [0, 2T], we sequentially predict the traffic conditions for the timestamps during [T + 1, 2T], with the latent attributes UT and transition matrix A learned from the previous batch computation. Each time we make a prediction, we receive the true observations as feedback. We compare our incremental algorithm (Inc) with three baseline algorithms: Old, LSM-RN-Naive and LSM-RN-All. Specifically, in order to predict GT+i, LSM-RN-Inc utilizes the feedback of GT+i−1 to adjust the time-dependent latent attributes UT+i−1, whereas Old does not consider the feedback, and always uses the latent attributes UT and transition matrix A from the previous time window. On the other hand, LSM-RN-Naive ignores the previous snapshots, and only applies the inference algorithm to the most recent snapshot GT+i−1 (aka Mini-batch). Finally, LSM-RN-All applies the global learning algorithm consistently to all historical snapshots (i.e., G1 to GT+i−1) and then makes a prediction (aka Full-batch).

Table 2: Experiment parameters

Parameters | Value range
T          | 2, 4, 6, 8, 10, 12
span       | 5, 10, 15, 20, 25, 30
k          | 5, 10, 15, 20, 25, 30
λ          | 2^−7, 2^−5, 2^−3, 2^−1, 2^1, 2^3, 2^5
γ          | 2^−7, 2^−5, 2^−3, 2^−1, 2^1, 2^3, 2^5

6.2.2 Configurations and measures

For our missing value completion and edge traffic prediction experiments, we selected two different time ranges that represent rush hour (i.e., 7am-8am) and non-rush hour (i.e., 2pm-3pm), respectively. For the task of missing value completion, during each timestamp of one time range (e.g., rush hour), we randomly selected 20% of the values as unobserved and manipulated them as missing, with the objective of completing those missing values. For each traffic prediction task at one particular timestamp (e.g., 7:30 am), we also randomly selected 20% of the values as unknown and used them as ground-truth values.

We varied the parameters T and span, where T is the number of snapshots and span is the time gap between two continuous snapshots. We also varied k, λ, and γ, which are the parameters of our model. The default settings (shown in bold font) of the experiment parameters are listed in Table 2. Because of space limitations, the results of varying γ are not reported; they are similar to the results of varying λ. We use Mean Absolute Percentage Error (MAPE) and Root Mean Square Error (RMSE) to measure accuracy. In the following we only report the experimental results based on MAPE; the results based on RMSE are reported in Appendix 9.3. Specifically, MAPE is defined as follows:

MAPE = (1/N) Σ_{i=1}^{N} |yi − ŷi| / yi
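Both measures are one-liners over the N held-out readings; a minimal sketch (the sample values are made up):

```python
import numpy as np

def mape(y, y_hat):
    """Mean Absolute Percentage Error over held-out readings y > 0."""
    return np.mean(np.abs(y - y_hat) / y)

def rmse(y, y_hat):
    """Root Mean Square Error over the same held-out readings."""
    return np.sqrt(np.mean((y - y_hat) ** 2))

y = np.array([28.6, 40.0, 35.2])       # ground-truth speeds (made up)
y_hat = np.array([30.1, 37.5, 36.0])   # predicted speeds (made up)
print(mape(y, y_hat), rmse(y, y_hat))
```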

With ARIMA and SVR, we use the March dataset to train a model for each edge, and use 5-fold cross-validation to choose the best parameters. All the missing value completion and edge traffic prediction tasks are conducted on the April data. We conducted our experiments with C++ on a Linux PC with an i5-2400 CPU @ 3.10GHz and 24GB memory.

6.3 Comparison for missing value completion

In this set of experiments, we evaluate the completion accuracy of the different methods. The experimental results on SMALL are shown in Figure 8 (a) and (b). We observe that both LSM-RN-All and LSM-RN-Inc achieve much lower errors than the other methods. This is because LSM-RN-All and LSM-RN-Inc capture both spatial and temporal relationships, while LSM-RN-One and KNN only use the spatial property. LSM-RN-All performs better than LSM-RN-Inc by jointly inferring all the latent attributes. On the other hand, we note that LSM-RN-One and KNN have similar performance, which shows that utilizing either Euclidean

Figure 8: Missing value completion MAPE. (a) Rush hour on SMALL; (b) Non-rush hour on SMALL; (c) Rush hour on LARGE; (d) Non-rush hour on LARGE. Each panel plots MAPE (%) against time of day for LSM-RN-Inc, LSM-RN-Naive, LSM-RN-All and KNN.

This also indicates that utilizing both spatial and temporal properties yields a larger gain than utilizing the spatial property alone.

As shown in Figure 8 (b), the completion performance during the non-rush hour is better than during the rush hour interval. This is because during the rush hour range the traffic conditions are more dynamic, and the underlying patterns and transitions change frequently; these factors yield worse completion performance during rush hour. Figure 8 (c) and (d) depict the experiment results on LARGE, which are similar to those on SMALL.

6.4 Comparison for edge traffic prediction

In the following, we present the results of the edge traffic prediction experiments.

6.4.1 One-step ahead prediction

The experimental results on SMALL are shown in Figure 9 (a) and (b). Among all the methods, LSM-RN-All and LSM-RN-Inc achieve the best results, and LSM-RN-All performs slightly better than LSM-RN-Inc. This demonstrates the effectiveness of the time-dependent latent attributes and the transition matrix. We observe that without the imputation of missing values, the time series prediction techniques (i.e., ARIMA and SVR) perform much worse than LSM-RN-All and LSM-RN-Inc. Meanwhile, LSM-RN-Naive, which separately learns the latent attributes of each snapshot, cannot achieve prediction results as good as LSM-RN-All and LSM-RN-Inc. This indicates that simply combining the topological and temporal properties is not enough for making accurate predictions. We also note that even with completed readings, the accuracy of SVR-Sp and ARIMA-Sp is still worse than that of LSM-RN-All and LSM-RN-Inc. One reason is that a simple combination of the spatial and temporal properties does not necessarily yield better performance. Another reason is that both SVR-Sp and ARIMA-Sp also suffer from missing data during the training stage, which renders their predictions less accurate. In Appendix 9.4, we also show how the ratio of missing data influences the prediction performance.

Figure 9: One-step ahead prediction MAPE. (a) Rush hour on SMALL; (b) Non-rush hour on SMALL; (c) Rush hour on LARGE; (d) Non-rush hour on LARGE. Each panel plots MAPE (%) against time of day for LSM-RN-Inc, LSM-RN-Naive, LSM-RN-All, ARIMA, ARIMA-Sp, SVR and SVR-Sp.

Finally, we observe that SVR is more robust than ARIMA when encountering missing values at the prediction stage: ARIMA-Sp performs significantly better than ARIMA, while the improvement of SVR-Sp over SVR is modest. This is because ARIMA is a linear model that mainly uses a weighted average of previous readings for prediction, while SVR is a non-linear model that utilizes a kernel function. Figure 9 (c) and (d) show the experiment results on LARGE; the overall results are similar to those on SMALL.

Figure 10: Six-step ahead prediction MAPE. (a) Rush hour on SMALL; (b) Non-rush hour on SMALL; (c) Rush hour on LARGE; (d) Non-rush hour on LARGE. Each panel plots MAPE (%) against time of day for the same methods as in Figure 9.

6.4.2 Multi-step ahead prediction

We now present the experiment results of multi-step ahead prediction, in which we predict the traffic conditions for the next 30 minutes (i.e., h = 6). The prediction accuracy comparison among the different methods on SMALL is shown in Figure 10 (a) and (b). Although LSM-RN-All and LSM-RN-Inc still outperform the other methods, the margin between our methods and the baselines is smaller. The reason is that, when making multi-step ahead predictions, we use predicted values from the past for future prediction (see the sketch below). This leads to the problem of error accumulation: errors incurred in the past are propagated into future predictions. We observe similar trends on LARGE from the results reported in Figure 10 (c) and (d).
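Concretely, h-step prediction evolves the latent attributes from the previous estimate rather than from an observation, which is exactly where the error accumulates (a sketch, reusing the reconstruction G ≈ U B U^T; names are illustrative and U, B, A are numpy arrays):

```python
def multi_step_forecast(U, B, A, h=6):
    """Predict the next h snapshots with no intermediate feedback.

    Each step evolves the latents from the previous *estimate*, so an error
    made at step i propagates into steps i+1..h; h = 6 covers 30 minutes
    at a 5-minute span."""
    predictions = []
    for _ in range(h):
        U = U @ A                        # evolve from the predicted state
        predictions.append(U @ B @ U.T)  # predicted snapshot
    return predictions
```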

6.5 Scalability of different methods

Table 3 shows the running time of the different methods. Although ARIMA and SVR are fast for each individual prediction, they have much higher training costs. Note that our methods do not require extra training data, i.e., our methods train and predict at the same time. Among them, LSM-RN-Inc is the most efficient approach: it takes less than 500 milliseconds to learn the time-dependent latent attributes and make predictions for all edges of the road network. This is because our incremental learning algorithm conditionally adjusts the latent attributes of certain vertices and utilizes the topological order, which enables fast convergence. Even on the LARGE dataset, LSM-RN-Inc takes less than five seconds, which is acceptable considering that the span between two snapshots is at least five minutes in practice. This demonstrates that LSM-RN-Inc scales well to large road networks. LSM-RN-All and LSM-RN-Naive both require much longer running times than LSM-RN-Inc. In addition, LSM-RN-All is faster than LSM-RN-Naive. This is because LSM-RN-Naive independently runs the global learning algorithm for each snapshot (i.e., T times), while LSM-RN-All applies global learning over all the snapshots only once.

Table 3: Running time comparisons. For ARIMA and SVR, the training time is the total time to train the models of all edges for one-step ahead prediction, and the prediction time is the average prediction time per edge per query. For LSM-RN-Naive, LSM-RN-All and LSM-RN-Inc, we train (learn the latent attributes) and predict (the traffic conditions of the whole road network) at the same time with the given data on-the-fly.

Method        SMALL train (s)   SMALL pred. (ms)   LARGE train (s)   LARGE pred. (ms)
LSM-RN-One    -                 1353               -                 29439
LSM-RN-All    -                 869                -                 14247
LSM-RN-Inc    -                 407                -                 4145
ARIMA         484               0.00015            987               0.00024
SVR           47420             0.00042            86093.99          0.00051

Convergence analysis. Figure 11 (a) and (b) report the convergence rate of the iterative algorithm LSM-RN-All on both SMALL and LARGE. As shown in Figure 11, LSM-RN-All converges very fast: within about 20 iterations, the objective value of Equation 4 levels off. A simple stopping rule based on this observation is sketched below.
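The rule is to iterate the update rules until the objective stops decreasing (illustrative; `step` performs one round of the multiplicative updates and `objective` evaluates Equation 4):

```python
def run_until_converged(step, objective, max_iter=50, tol=1e-4):
    """Iterate until the relative drop of the objective falls below `tol`;
    in our experiments roughly 20 iterations suffice."""
    prev = objective()
    for it in range(1, max_iter + 1):
        step()                                   # one round of U_t, B, A updates
        cur = objective()
        if abs(prev - cur) / max(abs(prev), 1e-12) < tol:
            return it                            # converged
        prev = cur
    return max_iter
```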

6.6 Comparison for Real-time Forecasting

In this set of experiments, we evaluate our online setting algorithms. As shown in Figure 12 (a) and (b), LSM-RN-Inc achieves accuracy comparable to LSM-RN-All (Full-batch). This is because LSM-RN-Inc effectively leverages the real-time feedback information to adjust the latent attributes. We observe that LSM-RN-Inc performs significantly better than Old and LSM-RN-Naive (Mini-batch), which ignore either the feedback information (i.e., Old) or the previous snapshots (i.e., LSM-RN-Naive).

Figure 11: Convergence rate of LSM-RN-All on (a) SMALL and (b) LARGE. Each panel plots the objective value against the number of iterations (0 to 50).

One interesting observation is that Old performs better than LSM-RN-Naive at the initial timestamps, whereas LSM-RN-Naive (Mini-batch) surpasses Old at the later timestamps. This indicates that the latent attributes learned in the previous time window are reliable for predicting near-future traffic conditions, but not for multi-step ahead prediction, because of the error accumulation problem. Similar results were also observed in Figure 10 for multi-step ahead prediction. Figure 12 (c) and (d) show similar effects on LARGE.

Figure 13 (a) and (b) show the running time comparisons of the different methods. One important conclusion from this experiment is that LSM-RN-Inc is the most efficient approach: it is on average two times faster than LSM-RN-Naive and one order of magnitude faster than LSM-RN-All. This is because LSM-RN-Inc performs a conditional update of the latent attributes for vertices within a small portion of the road network, whereas LSM-RN-Naive and LSM-RN-All both recompute the latent attributes from at least one entire road network snapshot. Because in the real-time setting LSM-RN-All utilizes all the up-to-date snapshots while LSM-RN-Naive considers only the most recent snapshot, LSM-RN-Naive is faster than LSM-RN-All. Figure 13 (c) and (d) show the running time on LARGE. We observe that LSM-RN-Inc takes less than one second to incorporate the real-time feedback information, while LSM-RN-Naive and LSM-RN-All take much longer.

Therefore, we conclude that LSM-RN-Inc achieves a good trade-off between prediction accuracy and efficiency, which makes it applicable to real-time traffic prediction applications.

6.7 Varying parameters of our methods

In the following, we test the performance of our methods by varying the parameters of our model. Since the effect of the parameters shows little correlation with the size of the road network, we only show the experimental results on SMALL.

6.7.1 Effect of varying T

Figure 14 (a) and (b) show the prediction performance and the running time of varying T, respectively. We observe that the prediction error decreases with more snapshots. In particular, when we increase T from 2 to 6, the results improve significantly. However, the performance stays stable for T ≥ 6. This indicates that a small number of snapshots (i.e., two or fewer) is not enough to capture the traffic patterns and their evolving changes. On the other hand, more snapshots (i.e., more historical data) do not necessarily yield much gain, considering that the running time increases with the number of snapshots. Therefore, to achieve a good trade-off between running time and prediction accuracy, we suggest using at least 6 but no more than 12 snapshots.

    6.7.2 Effect of varying span

Figure 12: Online prediction MAPE. (a) Rush hour on SMALL; (b) Non-rush hour on SMALL; (c) Rush hour on LARGE; (d) Non-rush hour on LARGE. Each panel plots MAPE (%) against time of day for LSM-RN-Inc, LSM-RN-Naive, LSM-RN-All and Old.

Figure 13: Online prediction time. (a) Rush hour on SMALL; (b) Non-rush hour on SMALL; (c) Rush hour on LARGE; (d) Non-rush hour on LARGE. Each panel plots the running time (milliseconds, log scale) against time of day for LSM-RN-Inc, LSM-RN-Naive and LSM-RN-All.

Figure 14: Effect of varying T. (a) Prediction error: MAPE (%) against T for LSM-RN-All and LSM-RN-Inc; (b) Running time (ms) against T.

The results of varying span are shown in Figure 15.

Figure 15: Effect of varying span. (a) Prediction error: MAPE (%) against span (minutes) for LSM-RN-All and LSM-RN-Inc; (b) Running time (ms) against span (minutes).

Figure 16: Effect of varying k and λ on prediction accuracy, where k is the number of latent attributes and λ is the graph regularization parameter. (a) Prediction MAPE against k; (b) Prediction MAPE against λ, both for LSM-RN-All and LSM-RN-Inc.

It is clear that as the time gap between two snapshots increases, the performance declines. This is because with a larger span the evolution of the underlying traffic no longer stays smooth, and the transition process learned from the previous snapshots is not applicable to the next prediction. Fortunately, sensor datasets usually have high resolution, so it is always better to use a smaller span to learn the latent attributes. In addition, span does not affect the running time of either algorithm.

6.7.3 Effect of varying k and λ

Figure 16 (a) shows the effect of varying k. We have two main observations from this experiment: (1) we achieve better results with an increasing number of latent attributes; (2) the performance stays stable when k ≥ 20. This indicates that a low-rank latent space representation can already capture the latent attributes of the traffic data. In addition, our results show that while the running time increases with k, it does not change much as we vary k from 5 to 30. Therefore, setting k to 20 achieves a good balance between computational cost and accuracy.

Figure 16 (b) depicts the effect of varying λ, the regularization parameter of our graph Laplacian dynamics. We observe that the graph Laplacian has a larger impact on LSM-RN-All than on LSM-RN-Inc. This is because λ controls how much the global structure similarity contributes to the latent attributes, and LSM-RN-All jointly learns all the time-dependent latent attributes, so λ has a larger effect on it. In contrast, LSM-RN-Inc adaptively updates the latent positions of a small number of changed vertices within a limited, localized view, and is therefore less sensitive to the global structure similarity than LSM-RN-All. In terms of parameter choices, λ = 2 and λ = 8 yield the best results for LSM-RN-All and LSM-RN-Inc, respectively.

7. CONCLUSION

In this paper, we studied the problem of real-time traffic prediction using real-world sensor data for road networks. We proposed LSM-RN, in which each vertex is associated with a set of latent attributes that capture both the topological and temporal properties of road networks. We showed that the latent space modeling of road networks with time-dependent weights accurately estimates traffic patterns and their evolution over time. To efficiently infer these time-dependent latent attributes, we developed an incremental online learning algorithm that enables real-time traffic prediction for large road networks. With extensive experiments, we verified the effectiveness, flexibility and scalability of our model. For example, our incremental learning algorithm is orders of magnitude faster than the global learning algorithm, taking less than one second to incorporate real-time feedback information for a large road network.

For future work, we plan to embed the current framework into real applications such as ride-sharing or vehicle routing systems, to enable better navigation using accurate time-dependent traffic patterns. Another interesting direction is to incorporate other data sources (e.g., GPS, incidents) for more accurate traffic prediction.

    8. REFERENCES

[1] E. Acar, D. M. Dunlavy, T. G. Kolda, and M. Mørup. Scalable tensor factorizations for incomplete data. Chemometrics and Intelligent Laboratory Systems, 106(1):41–56, March 2011.

[2] B. W. Bader, T. G. Kolda, et al. Matlab tensor toolbox version 2.6. Available online, February 2015.

[3] M. Blondel, Y. Kubo, and U. Naonori. Online passive-aggressive algorithms for non-negative matrix factorization and completion. In AISTATS, pages 96–104, 2014.

[4] D. Cai, X. He, J. Han, and T. S. Huang. Graph regularized nonnegative matrix factorization for data representation. IEEE Trans. Pattern Anal. Mach. Intell., 33(8):1548–1560, 2011.

[5] H. Cheng and P.-N. Tan. Semi-supervised learning with data calibration for long-term time series forecasting. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 133–141. ACM, 2008.

[6] F. C. T. Chua, R. J. Oentaryo, and E.-P. Lim. Modeling temporal adoptions using dynamic matrix factorization. In ICDM, pages 91–100. IEEE, 2013.

[7] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. The Journal of Machine Learning Research, 7:551–585, 2006.

[8] C. Ding, T. Li, W. Peng, and H. Park. Orthogonal nonnegative matrix t-factorizations for clustering. In KDD, pages 126–135. ACM, 2006.

[9] J. Haworth and T. Cheng. Non-parametric regression for space-time forecasting under missing data. Computers, Environment and Urban Systems, 36(6):538–550, 2012.

[10] HERE. https://company.here.com/here/.

[11] T. Idé and M. Sugiyama. Trajectory regression on road networks. In AAAI, 2011.

[12] H. Jagadish, J. Gehrke, A. Labrinidis, Y. Papakonstantinou, J. M. Patel, R. Ramakrishnan, and C. Shahabi. Big data and its technical challenges. Communications of the ACM, 57(7):86–94, 2014.

[13] J. Kwon and K. Murphy. Modeling freeway traffic with coupled HMMs. Technical report.

[14] R. Lambiotte, J.-C. Delvenne, and M. Barahona. Laplacian dynamics and multiscale modular structure in networks. arXiv preprint arXiv:0812.1770, 2008.

[15] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In NIPS, pages 556–562, 2001.

[16] B. Pan, U. Demiryurek, and C. Shahabi. Utilizing real-world transportation data for accurate traffic prediction. In ICDM, pages 595–604, Washington, DC, USA, 2012.

[17] L. Qu, Y. Zhang, J. Hu, L. Jia, and L. Li. A BPCA based missing value imputing method for traffic flow volume data. In Intelligent Vehicles Symposium, 2008 IEEE, pages 985–990. IEEE, 2008.

[18] G. Ristanoski, W. Liu, and J. Bailey. Time series forecasting using distribution enhanced linear regression. In Advances in Knowledge Discovery and Data Mining, volume 7818 of Lecture Notes in Computer Science, pages 484–495. Springer Berlin Heidelberg, 2013.

[19] R. A. Rossi, B. Gallagher, J. Neville, and K. Henderson. Modeling dynamic behavior in large evolving graphs. In WSDM, pages 667–676. ACM, 2013.

[20] A. Saha and V. Sindhwani. Learning evolving and emerging topics in social media: a dynamic NMF approach with temporal regularization. In WSDM, pages 693–702. ACM, 2012.

[21] F. Wang, T. Li, X. Wang, S. Zhu, and C. Ding. Community discovery using nonnegative matrix factorization. Data Mining and Knowledge Discovery, 22(3):493–521, 2011.

[22] Y. Wang, Y. Zheng, and Y. Xue. Travel time estimation of a path using sparse trajectories. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 25–34. ACM, 2014.

[23] L. Wu, X. Xiao, D. Deng, G. Cong, A. D. Zhu, and S. Zhou. Shortest path and distance queries on road networks: An experimental evaluation. VLDB, 5(5):406–417, 2012.

[24] J. Xu, D. Deng, U. Demiryurek, C. Shahabi, and M. van der Schaar. Mining the situation: spatiotemporal traffic prediction with big data.

[25] B. Yang, C. Guo, and C. S. Jensen. Travel cost inference from sparse, spatio-temporally correlated time series using Markov models. Proceedings of the VLDB Endowment, 6(9):769–780, 2013.

[26] Y. Zhang and D.-Y. Yeung. Overlapping community detection via bounded nonnegative matrix tri-factorization. In KDD, pages 606–614. ACM, 2012.

[27] J. Zheng and L. M. Ni. Time-dependent trajectory regression on road networks via multi-task learning. In AAAI, 2013.

[28] J. Zhou and A. K. Tung. Smiler: A semi-lazy time series prediction system for sensors. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, pages 1871–1886, 2015.

[29] L. Zhu, A. Galstyan, J. Cheng, and K. Lerman. Tripartite graph clustering for dynamic sentiment analysis on social media. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pages 1531–1542. ACM, 2014.

[30] L. Zhu, D. Guo, J. Y. G. V. Steeg, and A. Galstyan. Scalable link prediction in dynamic networks via non-negative matrix factorization. CoRR, abs/1411.3675, 2014.

    9. APPENDIX


9.1 Derivatives of L with respect to U_t in Equation 7

The objective L can be rewritten as follows:

\[
\mathcal{L} = J_1 + J_2 + J_3 + \sum_{t=1}^{T} \mathrm{Tr}(\psi_t U_t)
\]

where:

\[
\begin{aligned}
J_1 &= \sum_{t=1}^{T} \mathrm{Tr}\Big(\big(Y_t \odot (G_t - U_t B U_t^T)\big)\big(Y_t \odot (G_t - U_t B U_t^T)\big)^T\Big) \\
J_2 &= \sum_{t=1}^{T} \lambda\, \mathrm{Tr}(U_t L U_t^T) \\
J_3 &= \sum_{t=2}^{T} \mathrm{Tr}\big((U_t - U_{t-1}A)(U_t - U_{t-1}A)^T\big)
\end{aligned}
\tag{16}
\]

J_1 can also be rewritten as follows:

\[
\begin{aligned}
J_1 &= \sum_{t=1}^{T} \mathrm{Tr}\Big(\big(Y_t \odot (G_t - U_t B U_t^T)\big)\big(Y_t \odot (G_t - U_t B U_t^T)\big)^T\Big) \\
&= \sum_{t=1}^{T} \mathrm{Tr}\Big((Y_t \odot G_t)^T (Y_t \odot G_t) - 2\,(Y_t^T \odot G_t^T)(Y_t \odot U_t B U_t^T) \\
&\qquad\quad + (Y_t^T \odot U_t B^T U_t^T)(Y_t \odot U_t B U_t^T)\Big) \\
&= \mathrm{const} - 2 \sum_{t=1}^{T} \mathrm{Tr}\big((Y_t^T \odot G_t^T)(Y_t \odot U_t B U_t^T)\big) \\
&\qquad\quad + \sum_{t=1}^{T} \mathrm{Tr}\big((Y_t^T \odot U_t B^T U_t^T)(Y_t \odot U_t B U_t^T)\big)
\end{aligned}
\tag{17}
\]

The second term of Equation 17 can be transformed as follows:

\[
\begin{aligned}
O_1 &= \sum_{t=1}^{T} \mathrm{Tr}\big((Y_t^T \odot G_t^T)(Y_t \odot U_t B U_t^T)\big) \\
&= \sum_{t=1}^{T} \sum_{k=1}^{n} \big((Y_t^T \odot G_t^T)(Y_t \odot U_t B U_t^T)\big)_{kk} \\
&= \sum_{t=1}^{T} \sum_{k=1}^{n} \sum_{i=1}^{m} (Y_t^T \odot G_t^T)_{ki}\,(Y_t \odot U_t B U_t^T)_{ik} \\
&= \sum_{t=1}^{T} \sum_{i=1}^{m} \sum_{k=1}^{n} (Y_t \odot G_t \odot Y_t \odot U_t B U_t^T)_{ik} \\
&= \sum_{t=1}^{T} \mathrm{Tr}\big((Y_t^T \odot G_t^T \odot Y_t^T)\, U_t B U_t^T\big)
\end{aligned}
\tag{18}
\]

Now J_1 can be written as follows:

\[
J_1 = \mathrm{const} - 2 \sum_{t=1}^{T} \mathrm{Tr}\big((Y_t^T \odot G_t^T \odot Y_t^T)\, U_t B U_t^T\big) + \sum_{t=1}^{T} \mathrm{Tr}\big((Y_t^T \odot U_t B^T U_t^T)(Y_t \odot U_t B U_t^T)\big)
\tag{19}
\]

We now take the derivative of L with respect to U_t:

\[
\frac{\partial \mathcal{L}}{\partial U_t} = \frac{\partial J_1}{\partial U_t} + \frac{\partial J_2}{\partial U_t} + \frac{\partial J_3}{\partial U_t} + \frac{\partial \sum_{t=1}^{T} \mathrm{Tr}(\psi_t U_t)}{\partial U_t}
\tag{20}
\]

The derivative \(\partial J_1 / \partial U_t\) can now be calculated as follows:

\[
\frac{\partial J_1}{\partial U_t} = -2\,(Y_t \odot G_t \odot Y_t)(U_t B^T + U_t B) + \frac{\partial \sum_{t=1}^{T} \mathrm{Tr}\big((Y_t^T \odot U_t B^T U_t^T)(Y_t \odot U_t B U_t^T)\big)}{\partial U_t}
\tag{21}
\]

Suppose \(O_2 = \sum_{t=1}^{T} \mathrm{Tr}\big((Y_t^T \odot U_t B^T U_t^T)(Y_t \odot U_t B U_t^T)\big)\); the derivative of O_2 with respect to a single entry U_t(pq) can be written as:

\[
\begin{aligned}
\frac{\partial O_2}{\partial U_t(pq)} &= \frac{\partial \sum_{k=1}^{n} \sum_{i=1}^{m} (Y_t^T \odot U_t B^T U_t^T)_{ki}\,(Y_t \odot U_t B U_t^T)_{ik}}{\partial U_t(pq)} \\
&= \frac{\partial \sum_{i=1}^{m} \sum_{k=1}^{n} (Y_t \odot Y_t \odot U_t B U_t^T \odot U_t B U_t^T)_{ik}}{\partial U_t(pq)} \\
&= \frac{\partial \sum_{i=1}^{m} \sum_{k=1}^{n} Y_t^2(ik)\,(U_t B U_t^T)^2(ik)}{\partial U_t(pq)}
\end{aligned}
\tag{22}
\]

Because only the p-th row of \(U_t B U_t^T\) is related to \(\partial O_2 / \partial U_t(pq)\), we have the following:

\[
\begin{aligned}
\frac{\partial O_2}{\partial U_t(pq)} &= \frac{\partial \sum_{k=1}^{n} Y_t^2(pk)\,(U_t B U_t^T)^2(pk)}{\partial U_t(pq)} \\
&= 2 \sum_{k=1}^{n} (Y_t^2)_{pk}\,(U_t B U_t^T)_{pk}\,\frac{\partial (U_t B U_t^T)_{pk}}{\partial (U_t)_{pq}} \\
&= 2 \sum_{k=1}^{n} (Y_t^2)_{pk}\,(U_t B U_t^T)_{pk}\,(U_t B^T + U_t B)_{kq}
\end{aligned}
\tag{23}
\]

In matrix form, the derivative is expressed as (note that Y_t is a binary indicator matrix, so \(Y_t \odot Y_t = Y_t\)):

\[
\frac{\partial O_2}{\partial U_t} = 2\,(Y_t \odot Y_t \odot U_t B U_t^T)(U_t B^T + U_t B) = 2\,(Y_t \odot U_t B U_t^T)(U_t B^T + U_t B)
\tag{24}
\]

Now the derivative \(\partial J_1 / \partial U_t\) is as follows:

\[
\frac{\partial J_1}{\partial U_t} = -2\,(Y_t \odot G_t \odot Y_t)(U_t B^T + U_t B) + 2\,(Y_t \odot U_t B U_t^T \odot Y_t)(U_t B^T + U_t B)
\tag{25}
\]

Similarly, we can calculate \(\partial J_2 / \partial U_t = 2\lambda L U_t\), \(\partial J_3 / \partial U_t = 2(U_t - U_{t-1}A) + 2(U_t A A^T - U_{t+1} A^T)\), and \(\partial \sum_{t=1}^{T} \mathrm{Tr}(\psi_t U_t^T) / \partial U_t = \psi_t\). Putting these together, we have:

\[
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial U_t} = &-2\,(Y_t \odot G_t)(U_t B^T + U_t B) + 2\,(Y_t \odot U_t B U_t^T)(U_t B^T + U_t B) \\
&+ 2\lambda L U_t + 2(U_t - U_{t-1}A) + 2(U_t A A^T - U_{t+1} A^T) + \psi_t
\end{aligned}
\tag{26}
\]
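For reference, Equation 26 transcribes directly into matrix code (a sketch for an interior timestamp 1 < t < T, omitting the boundary terms at t = 1 and t = T; `lap` stands for the graph Laplacian L, `Y_t` for the binary observation indicator, and all names are illustrative):

```python
import numpy as np

def grad_U_t(U_prev, U_t, U_next, G_t, Y_t, B, A, lap, lam, psi_t):
    """Gradient of the objective L with respect to U_t (Equation 26)."""
    recon = U_t @ B @ U_t.T              # current reconstruction U_t B U_t^T
    S = U_t @ B.T + U_t @ B              # common factor (U_t B^T + U_t B)
    return (-2 * (Y_t * G_t) @ S         # data-fitting term (observed entries)
            + 2 * (Y_t * recon) @ S
            + 2 * lam * lap @ U_t        # graph Laplacian regularization
            + 2 * (U_t - U_prev @ A)     # temporal transition terms
            + 2 * (U_t @ A @ A.T - U_next @ A.T)
            + psi_t)                     # Lagrangian multiplier
```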

9.2 Update rule of A and B

Similar to the derivation of U_t, we add Lagrangian multipliers φ ∈ R^{k×k} and ω ∈ R^{k×k}, and calculate the derivatives of L with respect to A and B:

\[
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial B} &= -2 \sum_{t=1}^{T} U_t^T (Y_t \odot G_t)\, U_t + 2 \sum_{t=1}^{T} U_t^T (Y_t \odot U_t B U_t^T)\, U_t + \phi \\
\frac{\partial \mathcal{L}}{\partial A} &= -2 \sum_{t=2}^{T} U_{t-1}^T U_t + 2 \sum_{t=2}^{T} U_{t-1}^T U_{t-1} A + \omega
\end{aligned}
\tag{27}
\]

Using the KKT conditions \(\phi_{ij} B_{ij} = 0\) and \(\omega_{ij} A_{ij} = 0\), we get the following equations for B_{ij} and A_{ij}:

\[
-\Big(\sum_{t=1}^{T} U_t^T (Y_t \odot G_t)\, U_t\Big)_{ij} B_{ij} + \Big(\sum_{t=1}^{T} U_t^T (Y_t \odot U_t B U_t^T)\, U_t\Big)_{ij} B_{ij} = 0
\tag{28}
\]

\[
-\Big(\sum_{t=2}^{T} U_{t-1}^T U_t\Big)_{ij} A_{ij} + \Big(\sum_{t=2}^{T} U_{t-1}^T U_{t-1} A\Big)_{ij} A_{ij} = 0
\tag{29}
\]

    These lead us to the following update rules:

\[
B_{ij} \leftarrow B_{ij}\, \frac{\big[\sum_{t=1}^{T} U_t^T (Y_t \odot G_t)\, U_t\big]_{ij}}{\big[\sum_{t=1}^{T} U_t^T \big(Y_t \odot (U_t B U_t^T)\big)\, U_t\big]_{ij}}
\tag{30}
\]

\[
A_{ij} \leftarrow A_{ij}\, \frac{\big[\sum_{t=2}^{T} U_{t-1}^T U_t\big]_{ij}}{\big[\sum_{t=2}^{T} U_{t-1}^T U_{t-1} A\big]_{ij}}
\tag{31}
\]
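These rules translate directly into matrix code (a sketch; `Us`, `Ys` and `Gs` hold the per-snapshot latent attributes, observation indicators and traffic matrices, and the small epsilon guarding the division is our addition):

```python
import numpy as np

def update_B_A(Us, Ys, Gs, B, A, eps=1e-12):
    """One pass of the multiplicative updates in Equations 30 and 31."""
    num_B = sum(U.T @ (Y * G) @ U for U, Y, G in zip(Us, Ys, Gs))
    den_B = sum(U.T @ (Y * (U @ B @ U.T)) @ U for U, Y in zip(Us, Ys))
    B = B * num_B / (den_B + eps)

    num_A = sum(Us[t - 1].T @ Us[t] for t in range(1, len(Us)))
    den_A = sum(Us[t - 1].T @ Us[t - 1] @ A for t in range(1, len(Us)))
    A = A * num_A / (den_A + eps)
    return B, A
```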

9.3 Extra experiment results based on RMSE

In this set of experiments, we report the experiment results using the RMSE measure, which indicates how close the predictions are to the true observations. RMSE is defined as follows:

\[
\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}
\]
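A direct implementation, analogous to the MAPE sketch in Section 6.2.2 (a minimal sketch):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error over held-out readings (here in mph)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```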

Figure 17 shows the experiment results for missing value completion. The one-step ahead prediction results are shown in Figure 18, and the six-step ahead prediction results are depicted in Figure 19. Figure 20 shows the experiment results for the online setting. The results based on RMSE are similar to those based on MAPE. We also observe that the values predicted by our methods deviate only within a small range (e.g., 4 to 11 mph) from the ground truth values.

Figure 17: Missing value completion RMSE on SMALL. (a) Rush hour; (b) Non-rush hour. Each panel plots RMSE (mph) against time of day for LSM-RN-Inc, LSM-RN-Naive, LSM-RN-All and KNN.

9.4 Effect of missing data

In this set of experiments, we analyze the effect of missing data in the training set on the time series prediction techniques (i.e., ARIMA and SVR). The results are shown in Figure 21. As shown in Figure 21 (a) and (b), the prediction error of both approaches increases as the amount of missing data grows. Similar to the effect of missing values at the prediction stage shown in Figure 9, ARIMA is less robust than SVR because of its linear model. One interesting observation is that ARIMA performs better than SVR when the missing ratio is less than 10%; this indicates that ARIMA is a good candidate for accurate traffic prediction in the presence of complete data, which also conforms with the experimental results of [16].

Figure 18: One-step ahead edge traffic prediction RMSE on SMALL. (a) Rush hour; (b) Non-rush hour. Each panel plots RMSE (mph) against time of day for LSM-RN-Inc, LSM-RN-Naive, LSM-RN-All, ARIMA, ARIMA-Sp, SVR and SVR-Sp.

Figure 19: Six-step ahead edge traffic prediction RMSE on SMALL. (a) Rush hour; (b) Non-rush hour. Each panel plots RMSE (mph) against time of day for the same methods as in Figure 18.

Figure 20: Online prediction RMSE on SMALL. (a) Rush hour; (b) Non-rush hour. Each panel plots RMSE (mph) against time of day for LSM-RN-Inc, LSM-RN-Naive, LSM-RN-All and Old.

Figure 21: Effect of the missing rate during the training stage on SVR and ARIMA. (a) SMALL; (b) LARGE. Each panel plots MAPE against the missing rate (0 to 0.5) for ARIMA and SVR.

However, ARIMA is sensitive to missing values during both the training and prediction stages, which renders poor performance on incomplete datasets.
