Investigate graph and network algorithms in transport vehicle GPS data to detect and quantify hubs and flow

Swedish study group Mathematics in Industry 2015

Catarina Dudas1, Christopher Engstrom2, Johan Karlsson1, Frida Nellros3, Luyuan Qi-Gautier4, Sergei Silvestrov2, and Jesper Ulke3

1 Fraunhofer Chalmers Research Centre Industrial Mathematics

2 Malardalen University

3 Scania

4 KTH Royal Institute of Technology

September 4, 2015

1 Background

The aim of this project is to investigate how graph and network algorithms, together with vehicle GPS data, can be used efficiently and, in the longer run, support Scania’s customers with route optimization and logistics planning. Clustering methods and graph methods will be studied to identify appropriate methods that fulfill the requirements of the application at Scania. These methods can be used to identify bottlenecks or other points of interest in transportation networks, to find better routes, and/or to detect anomalies.

The motivation is to apply the results to transport planning tasks in the future. Examples include improved driver coaching through more relevant comparisons between vehicles, and route planning that takes into account traffic flows, popular stops, fuel consumption, etc., which could be of use for Scania’s customers.

2 Workflow

The workflow of the main analysis conducted during the workshop is described by the flow diagram in Fig. 1. We start by describing the data supplied by Scania in Sect. 2.1. Next, in Sect. 2.2, we describe how we define a "stop" and how to find stops, which we then cluster into what we call hubs using DBSCAN in Sect. 2.3. For each hub we calculate a number of features, which are then used to cluster the hubs themselves into different types of hubs using the EM-algorithm in Sect. 2.4. Finally, in Sect. 2.5, we show how we clustered vehicles by first calculating how often each vehicle travels between every pair of hubs compared to every other pair, and then clustering the vehicles on this basis using DBSCAN.

2.1 Data description

The data consists of GPS positional data (longitude, latitude) of Scania vehicles over time. Most vehicles report their position approximately every minute while turned on, but some only report their position every 4 hours. For every observation the position, time, a boolean denoting whether the vehicle is standing still, and the vehicle id are stored. A large number of additional features, such as velocity, heading, trigger type (why the position is reported), fuel level, etc., are stored as well and could be used in a more in-depth analysis.

Figure 1: Overall workflow during the workshop.

It is worth noting that the data is generally of rather good quality, but there are some errors to take into account:

• Sometimes the data indicates that the vehicle is standing still while also having a non-zero velocity. This usually happens while the vehicle is in fact standing still (1-2 such observations surrounded by data indicating "stop" and zero velocity). It is not a big problem to correct, but definitely something to keep in mind, especially if the length of time a vehicle is stationary is needed.

• There are some big outliers in the recorded positional data (usually when the vehicle is turned on), such as the data suddenly indicating that the vehicle is in the North Sea. The really big errors are easy to find simply by considering the difference in position and time between two observations and checking whether there is a too large jump in the position of the vehicle.

• For some of the features there is a lot of missing data or the estimates are unreliable (the features used during the workshop did not have many problems with this).

• Missing observations because of disturbance, hardware failure, etc.
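The jump check described in the second item above can be sketched as follows. This is a minimal illustration in Python, not the code used during the workshop. The function names, the tuple layout `(t_seconds, lat, lon)` and the speed threshold of 150 km/h are assumptions made for this example, and the sketch trusts the first observation of each vehicle, which is a limitation given that outliers often occur when the vehicle is turned on.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def flag_jumps(obs, max_speed_kmh=150.0):
    """Return indices of observations that imply an implausible speed.

    obs: list of (t_seconds, lat, lon) for one vehicle, sorted by time.
    Each observation is compared with the last accepted one (assumption:
    the first observation is correct). Non-increasing timestamps are flagged too.
    """
    flagged = []
    last = None
    for i, (t, lat, lon) in enumerate(obs):
        if last is not None:
            hours = (t - last[0]) / 3600.0
            dist = haversine_km(last[1], last[2], lat, lon)
            if hours <= 0 or dist / hours > max_speed_kmh:
                flagged.append(i)
                continue  # do not use a flagged point as the new reference
        last = (t, lat, lon)
    return flagged


# Hypothetical one-minute data with one point far out at sea.
observations = [(0, 59.330, 18.060), (60, 59.331, 18.061),
                (120, 56.000, 3.000), (180, 59.332, 18.062)]
print(flag_jumps(observations))  # -> [2]
```

Comparing against the last accepted observation rather than the immediately preceding one avoids also flagging the first correct point after an outlier.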

A plot of the "raw" data points over Stockholm can be seen in Fig. 2. We also looked at a couple of other datasets over varying time periods, one over a larger part of Sweden and another over Germany; however, we have chosen to present only the results for the Stockholm dataset.

Figure 2: Plot of all datapoints in the Stockholm area over one year.

2.2 Stop finding

In order to estimate when a vehicle is standing still, we define a stop as a period of time during which the vehicle is standing still in the same position. To find all stops for a single vehicle, we sorted all observations for the vehicle in time and then looked at sequences where the vehicle is stationary. The length of the stop is the difference in time between the first and last observation in which the vehicle is standing still, while for the position of the stop we simply picked the position of the first observation in the sequence. This finds stops both where the vehicle is turned on and where it is turned off (since the latter should be surrounded by stationary observations). However, there are some small errors which should be handled given more time than we had during the workshop.

• Single observations which indicate that the vehicle is moving in between stops at the same location and within a short time interval are likely errors, which will split a longer stop into multiple shorter stops.

• If the vehicle somehow moves while "turned off" (such as while on a ferry, or during heavy disturbance) we would get a single long stop at the first location, while in reality we would like to split it into two separate stops.

• This method of finding stops does not work well for vehicles which only report their position every 4 hours. Fortunately, these are easy to identify and are usually mostly stationary anyway.
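The basic stop-finding procedure described above (before any of the listed corrections) can be sketched in a few lines of Python. This is an illustration under assumptions, not the workshop code: the function name `find_stops`, the tuple layout `(time, lat, lon, standing_still)` and the returned dictionary keys are all hypothetical.

```python
def find_stops(observations):
    """Extract stops for a single vehicle.

    observations: iterable of (time, lat, lon, standing_still) tuples.
    A stop is a maximal run of consecutive stationary observations. Its
    length is the time between the first and last observation of the run,
    and its position is that of the first observation.
    """
    def close(run):
        first, last = run[0], run[-1]
        return {"start": first[0], "length": last[0] - first[0],
                "lat": first[1], "lon": first[2]}

    stops, run = [], []
    # Sort by time only, so ties are not broken on position.
    for t, lat, lon, still in sorted(observations, key=lambda o: o[0]):
        if still:
            run.append((t, lat, lon))
        elif run:  # a moving observation ends the current run
            stops.append(close(run))
            run = []
    if run:  # the data may end while the vehicle is still stationary
        stops.append(close(run))
    return stops
```

Note that a single erroneous "moving" observation inside a run splits the stop in two, which is exactly the first problem listed above.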

After doing this we get a large number of stops, as seen in Fig. 3, where we have data on a part of Stockholm during a single year.

Figure 3: Plot of all stops in the Stockholm area over one year.

For every stop a number of features were calculated, such as the length of the stop and the time of day when the stop started. Many more potential features are available for a more in-depth analysis, such as weight, fuel level, odometer, etc. Unfortunately, we did not have time to consider these additional features during the workshop.

2.3 Clustering of stops

In order to find places where vehicles stop, we clustered the stops using the density-based clustering method DBSCAN [4]. To do this we used the DBSCAN Matlab routine written by M. Daszykowski [1], available at http://www.chemometria.us.edu.pl.

In order to cluster points, DBSCAN classifies every point into one of three types: core points, edge points and outliers. To this end two parameters, α and ε, are used as follows.

• Core points: A point is considered a core point if there are at least α points within distance ε of the point.

• Edge points: A point is considered an edge point if it is not a core point, but there is at least one core point within distance ε of the point.

• Outliers: A point is considered an outlier if it is neither a core point nor an edge point (fewer than α points within distance ε and no core point within distance ε).

An image describing the different types of points can be seen in Fig. 4.

Figure 4: Visualization of the different types of points and their neighborhoods of radius ε, using α = 4: core points (red), edge points (blue) and outliers (black).

To form a cluster, every core point is merged into the same cluster as all other core points within distance ε, after which all edge points within distance ε of any of the core points of the cluster are added to the cluster as well. For example, in Fig. 4 the two core points (red) would first form a cluster since they are within each other’s neighborhood, after which the three edge points (blue) would all be added to the cluster since they are all within the neighborhood of at least one core point. The last point (black) is an outlier since it is not within the neighborhood of any of the core points of the cluster.
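To make the definitions above concrete, the following is a minimal brute-force sketch of DBSCAN in Python. It is not the Matlab routine of [1] that was used during the workshop. The function name `dbscan` and the return format are assumptions for this example, and the neighbor count is assumed to include the point itself.

```python
import math


def dbscan(points, eps, alpha):
    """Brute-force DBSCAN on a list of coordinate tuples.

    Returns (labels, core): labels[i] is the cluster id of point i, or -1
    for outliers, and core[i] tells whether point i is a core point.
    Assumption: a point counts as its own neighbor.
    """
    n = len(points)
    # All pairwise distances are evaluated, hence O(n^2) time and memory.
    neigh = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
             for i in range(n)]
    core = [len(nb) >= alpha for nb in neigh]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        # Grow a new cluster from an unlabeled core point.
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()  # p is always a core point
            for q in neigh[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    if core[q]:  # only core points spread the cluster further
                        stack.append(q)
        cluster += 1
    return labels, core
```

An edge point that lies within distance ε of core points from two different clusters is assigned to whichever cluster reaches it first, so the result can depend on the order of the points.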

After clustering the stops we found the hubs seen in Fig. 5. Most of the hubs created seem reasonable, such as the city terminal (large hub in the middle) and the nearby city center. Others correspond to fire stations, industrial areas (large western cluster), or other less obvious hubs such as the E4, distribution centers or other parts of Stockholm with generally heavy traffic (similar to the city terminal).

Figure 5: Plot of all hubs in the Stockholm area using α = 100 and ε = 0.0001 for the one year data.

DBSCAN has a number of nice properties that make it suitable for our type of analysis, but also a couple of potential problems that would have to be looked at, as listed below.

• It can be adapted to different scales, such as the country or city level, by changing the parameters α and ε.

• It is robust to noise (important given the high number of outliers when clustering the stop data).

• The number of clusters does not need to be pre-determined, and clusters can have varying shapes and densities.

• The time complexity of the method is O(n^2), which is a problem as the size of the data increases (more on this in Sect. 3.2).

• Two parameters need to be specified. While both are easy to interpret, the "best" parameters differ between datasets, hence choosing the right parameters can be a problem.

After having clustered the stops into hubs, it is easy to visualize the traffic between the hubs by counting the number of trips between stops belonging to hubs. After doing this, and letting the width of the edge between two hubs be proportional to the number of trips in either direction, we got the graph seen in Fig. 6.

Figure 6: Plot of all hubs and the amount of traffic between them in the Stockholm area using α = 100 and ε = 0.0001 for the one year data.

We can see heavy traffic between the city terminal and the city center, which is to be expected. The heavy traffic in the northern area is not as obvious, but it is likely to be busses going between st Eriksplan, Odenplan and Sveavagen, although it is worth noting that there is also a fire station at one of these hubs which could explain part of the traffic.

2.4 Hub clustering

While we would have liked to do hub classification, we do not have any ground truth. We therefore clustered the hubs, primarily in order to determine whether different types of hubs are visible in the data with just a few of the possible features.

In order to cluster the hubs we used a Gaussian mixture model fitted using the Expectation-Maximization algorithm (EM-algorithm) [2] in Matlab’s Statistics and Machine Learning Toolbox [5]. The following features were considered.

• Area of hub: Calculated as the area of the convex hull of all stops belonging to the hub.

• Number of stops made in the hub throughout the whole time period.

• Density of stops: Calculated as the number of stops divided by the area of the hub.

• Length of stops: Mean length of all stops belonging to the hub. This is inappropriate because of the large number of outliers (very long stops). Stop lengths overall followed approximately a heavy-tailed exponential distribution.

• Corrected length of stops: Calculated as the median of the log of the stop lengths. The median is used in order to be more robust to outliers, the logarithm in order to transform the data as close as possible to a Gaussian distribution so that it is appropriate for the EM algorithm.

• Time of day: Average time of day at which stops in the hub start, calculated using the circular mean. For the clustering task the corresponding point on the unit disc was used instead, in order to take into account the variance within the hub as well.

• Many other features were considered, such as fuel level, weight, etc., but we had no time to test them.
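Two of the features above, the time of day and the corrected length of stops, can be illustrated with a short Python sketch. The function names and the choice of hours in [0, 24) as the time unit are assumptions for this example. The time of day is mapped to a point on the unit circle, and the mean of these points lies inside the unit disc: its angle gives the circular mean, and its length is close to one when the start times are concentrated.

```python
import cmath
import math
import statistics


def time_of_day_feature(start_hours):
    """Mean point on the unit disc (as a complex number) for start times in hours."""
    points = [cmath.exp(2j * math.pi * h / 24.0) for h in start_hours]
    return sum(points) / len(points)


def circular_mean_hour(start_hours):
    """Circular mean of the start times, returned in hours in [0, 24)."""
    angle = cmath.phase(time_of_day_feature(start_hours)) % (2 * math.pi)
    return angle * 24.0 / (2 * math.pi)


def corrected_stop_length(lengths):
    """Median of the logarithm of the (positive) stop lengths."""
    return statistics.median(math.log(x) for x in lengths)


# Stops starting at 23:00 and 01:00 average to midnight rather than noon.
z = time_of_day_feature([23.0, 1.0])
print(abs(z))  # length of the mean vector, about 0.966 here
```

The ordinary arithmetic mean of 23 and 1 would be 12, which is why the circular mean is needed for times that wrap around midnight.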

Overall it would probably be best not to aggregate the 1 minute and 4 hour data as we have done, since they differ a lot from each other both in how much the vehicles move and in stop length and its distribution. The distribution of the logarithm of the stop lengths, seen in Fig. 7, gives a strong hint that there should in fact be two separate distributions rather than one.

Figure 7: Histogram of the logarithm of the stop length for all stops

The clustering algorithm works by first assigning a random multivariate Gaussian distribution over the selected features to each cluster (in our case 3 clusters). The EM-algorithm then iterates over two steps: 1) calculate the likelihood of each point belonging to each cluster and assign each point to the cluster with the highest likelihood, 2) estimate new parameters for the Gaussian distribution of each cluster using the points belonging to the respective cluster.
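The two steps above can be sketched in Python for the simplest possible case. This is an illustration only and makes several simplifying assumptions that differ from the analysis in this report: the data is one-dimensional rather than multivariate, the initial means are supplied by the caller instead of being random, all initial variances are set to one, and the function names are hypothetical. It follows the description above in assigning each point to its most likely cluster; the textbook EM algorithm for Gaussian mixtures instead weights every point by its posterior probability of belonging to each cluster.

```python
import math


def log_gauss(x, mean, var):
    """Log-density of a one-dimensional Gaussian (logs avoid numerical underflow)."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


def hard_em_1d(xs, init_means, n_iter=50, min_var=1e-6):
    """Alternate hard assignment and parameter re-estimation.

    Returns (labels, means, variances). min_var keeps variances away from
    zero when a cluster collapses onto identical points.
    """
    means = list(init_means)
    variances = [1.0] * len(means)
    labels = [0] * len(xs)
    for _ in range(n_iter):
        # Step 1: assign every point to the cluster with the highest likelihood.
        new_labels = [max(range(len(means)),
                          key=lambda k: log_gauss(x, means[k], variances[k]))
                      for x in xs]
        # Step 2: re-estimate mean and variance from the assigned points.
        for k in range(len(means)):
            members = [x for x, lab in zip(xs, new_labels) if lab == k]
            if members:  # an empty cluster keeps its previous parameters
                means[k] = sum(members) / len(members)
                var = sum((x - means[k]) ** 2 for x in members) / len(members)
                variances[k] = max(var, min_var)
        if new_labels == labels:  # assignments are stable, so stop
            break
        labels = new_labels
    return labels, means, variances
```

For example, `hard_em_1d([0.0, 0.1, -0.1, 5.0, 5.1, 4.9], [0.0, 1.0])` separates the three values near 0 from the three values near 5.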

After clustering the hubs into 3 clusters using three features (time of day, corrected length of stops and area of hub), we plotted their cluster membership and position in the part of the feature space spanned by time of day (x- and y-axis, represented as a complex number) and the log of the length of stops in Fig. 8a, as well as the EM-scores in Fig. 8b.

Figure 8: (a) Plot of all hubs and their cluster membership in feature space and (b) plot of the EM-scores for the different hubs.

From the data it is quite easy to see the purple cluster of hubs, but it is harder to say whether the hubs in the blue and red clusters have much in common or are simply outliers. Regardless, there are large differences between the clusters, which should be useful for classification when looking for specific types of hubs.

Data describing the different types of hubs found can be seen in Table 1 and the geographical position of the clusters can be seen in Fig. 9. As seen, the first cluster is characterized by longer average stops and by most stops starting around 7:30 in the morning. The second cluster contains some larger hubs such as the city center and has no clear time of day at which vehicles are standing still. The hubs of the last cluster, located in the bottom left corner, are similar to those of the first, but their stops start even earlier, around 6:30 in the morning, and both hubs are very small. It is interesting to note that even with a simple clustering task we see that all three hubs corresponding to fire stations are of the same hub type (blue), although there are also 3 other hubs in the same cluster. The two bottom red hubs correspond to the E4 and a distribution center with heavy traffic between them. These are characterized by an early start of stops, meaning that vehicles probably often start their morning by picking up their load at the eastern hub and then have to stop briefly while entering the E4 before leaving this part of the map for delivery elsewhere.

Table 1: Mean and variance of the length of stops in days, mean and variance of the area in lat/long, average time of arrival during the day, and length of the corresponding vector (closer to one corresponds to lower variance) for the hubs in the three hub clusters.

mean(stop l.)   var(stop l.)   mean(area)      var(area)   Arrival (hour)   Arrival l.
0.044           2.42           0.52 · 10^-6    10^-13      7.5              0.71
0.019           17.40          1.34 · 10^-6    10^-13      9.9              0.38
0.012           1.89           0.157 · 10^-6   10^-17      6.5              0.67

Figure 9: Plot of all hubs and their cluster membership.

2.5 Trajectory clustering

After having located the main hubs, it is also possible to cluster the trajectories themselves in order to find groups of vehicles that behave similarly. Since we do not (yet) have a good way to find where a "job" starts and ends, we chose instead to look at the whole trajectory over the time period and check how often a vehicle travels between every pair of hubs. The trajectory clustering is done through the following steps.

1. For each vehicle: calculate the number of trips made between every pair of hubs (without passing through any other hub in between) and then normalize the result by dividing by the total number of such trips, to get the proportion of times the vehicle travels between every pair of hubs. This gives a vector of values for every vehicle.

2. Use the vectors calculated in the previous step to cluster the vehicles using DBSCAN and some distance measure between such probability distributions. During the workshop we used the Euclidean distance, but a distance measure designed for comparing probability distributions might be more appropriate and would have been interesting to try given more time.

3. After all vehicles have been clustered using DBSCAN, the average over all vehicles in each cluster is calculated in order to be able to compare the "distance" between any vehicle and said cluster.
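Steps 1 and 3, together with the Euclidean distance from step 2, can be sketched in Python as follows. This is an illustration under assumptions, not the workshop code: each vehicle is represented by the sequence of hubs it stops in, pairs of hubs are treated as unordered, consecutive stops in the same hub are not counted as trips, and the vectors are stored sparsely as dictionaries. All function names are hypothetical.

```python
import math
from collections import Counter


def trip_proportions(hub_sequence):
    """Step 1: proportion of a vehicle's trips made between each pair of hubs.

    hub_sequence: hub ids in the order the vehicle stopped in them, so
    consecutive entries are trips with no other hub in between.
    """
    counts = Counter()
    for a, b in zip(hub_sequence, hub_sequence[1:]):
        if a != b:  # assumption: staying within a hub is not a trip
            counts[tuple(sorted((a, b)))] += 1
    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.items()} if total else {}


def euclidean(p, q):
    """Euclidean distance between two sparse proportion vectors."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))


def cluster_average(vectors):
    """Step 3: average proportion vector over all vehicles in one cluster."""
    keys = set().union(*vectors)
    return {k: sum(v.get(k, 0.0) for v in vectors) / len(vectors) for k in keys}


vehicle = trip_proportions(["A", "B", "A", "C", "C", "A"])
print(vehicle)  # -> {('A', 'B'): 0.5, ('A', 'C'): 0.5}
```

The `euclidean` function can be passed to any DBSCAN implementation that accepts a custom distance, and the same function gives the "distance" between a single vehicle and the output of `cluster_average`.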

The resulting clusters of trajectories using DBSCAN for the previous hubs can be seen in Fig. 10.

Figure 10: Plot of trajectory clusters using DBSCAN with α = 5 and ε = 0.27. Large cluster (left) and multiple small clusters (right).

As seen, we got one large cluster of similar vehicles (we suspect busses) as well as a couple of smaller clusters. Notably, we once again see the connection between the two southern hubs which were clustered together in Sect. 2.4 as well. Another thing to note is that most of the small clusters have one end in the E4 hub, probably because vehicles moving out of Stockholm (long haulage / long distance busses) mostly take the same routes out of Stockholm even if they end up at completely different destinations.

3 Other approaches

Apart from the main work we also had a couple of side projects which we did not have time to look into in more depth, but which we would like (or need) to pursue if working with the data further.

3.1 Hierarchical clustering

Hierarchical clustering is a class of clustering algorithms that, rather than partitioning the points into clusters, creates a whole hierarchy of clusters and clusters within clusters. This can be useful in order to find clusters at different levels, e.g. first find all cities and then find all major hubs within each city.

This can be done in several ways, but the most common methods can be summarized in three choices:

• Agglomerative (iteratively combine clusters into larger clusters) or divisive (iteratively divide clusters into smaller clusters).

• Choice of similarity or distance used to compare different points with each other.

• Choice of linkage criterion (how we compare clusters with each other), most commonly single (min. distance between points in the two clusters), mean (distance between centroids) or complete linkage (max. distance between points in the two clusters).
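As an illustration of one combination of these choices, the sketch below implements agglomerative clustering with complete linkage and Euclidean distance in plain Python. It is a naive version written for readability, far too slow for real GPS data, and it is not the PRTools implementation. The function name and the stopping rule (a fixed number of clusters) are assumptions for this example.

```python
import math


def complete_linkage(points, n_clusters):
    """Agglomerative clustering with complete linkage.

    Starts with every point in its own cluster and repeatedly merges the two
    clusters whose maximum pairwise point distance is smallest, until
    n_clusters remain. Returns a list of clusters as lists of point indices.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Complete linkage: the largest distance between any two members.
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


points = [(0, 0), (0, 1), (10, 0), (10, 1), (50, 50)]
print(complete_linkage(points, 2))  # -> [[0, 1, 2, 3], [4]]
```

In the small example the isolated point ends up as a cluster of its own, since every point is forced into some cluster and there is no notion of noise.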

To do some preliminary tests we used the Matlab package PRTools [3] developed at Delft University of Technology. In order to have a dataset covering a larger area, we used a dataset of Germany over one day instead of the previous Stockholm dataset. For the actual clustering task we used agglomerative clustering together with complete linkage (which penalizes non-circular clusters) and the usual Euclidean distance between points. The results for three levels (5, 50 and 100 clusters respectively) can be seen in Fig. 11.

Figure 11: Clustering using hierarchical clustering with 5, 50 and 100 clusters respectively.

While the results look OK, at least in the 50 and 100 cluster examples, the method has large problems with outliers. In order to avoid clustering outliers, other hierarchical clustering methods could be used or, to a certain extent, another linkage criterion or distance together with a careful choice of when to "cut the tree" in choosing the final clusters. One good alternative would probably be to use available hierarchical implementations of DBSCAN instead, since DBSCAN is, as we have already seen, much more robust to outliers in the data.

3.2 DBSCAN speedup of implementation

During the workshop, our group mainly worked with the classic DBSCAN clustering algorithm. Generally speaking, the clusters obtained from the classic DBSCAN algorithm contain the core points and the reachable (edge) points, as shown in Fig. 4 earlier.

To identify the core points, reachable points and outliers, (1) the distance between the candidate point and the other points has to be calculated, and (2) the number of points surrounding the candidate point has to be counted. This is a problem when the number of points becomes large, because the first step is in O(n^2). Thus the classic DBSCAN algorithm can be extremely slow. Already for some of the datasets we tried, some of the laptops took a very long time or crashed during computations.

In order to improve on the classical DBSCAN algorithm, a potentially more efficient implementation would be to adopt a "grid-to-grid distance" instead of the "point-to-point distance". Our first idea is illustrated in Fig. 12. Instead of directly calculating point-to-point distances, we first rasterize the data region and remove the sparse areas in the grid. Then we cluster the areas themselves by merging neighboring non-sparse areas. Assuming the number of grid cells is much smaller than the number of original points, the proposed algorithm is expected to perform better for large datasets, although some information will be lost unless a very fine-grained grid is used.

Figure 12: The illustration of the proposed algorithm
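The grid idea can be sketched in Python as follows. This is one possible reading of the proposal, not an implementation that was tested during the workshop. The function name, the parameters `cell` (cell side length) and `min_count` (the threshold below which a cell counts as sparse), and the use of the 8 surrounding cells as "neighboring" are all assumptions for this example.

```python
from collections import Counter


def grid_cluster(points, cell, min_count):
    """Cluster 2D points by merging neighboring non-sparse grid cells.

    Returns one label per point; -1 marks points that fall in sparse cells.
    """
    def cell_of(x, y):
        return (int(x // cell), int(y // cell))

    # Rasterize: count the points in every grid cell (one pass over the data).
    counts = Counter(cell_of(x, y) for x, y in points)
    # Remove sparse cells.
    dense = {c for c, n in counts.items() if n >= min_count}

    # Merge dense cells that touch each other, including diagonally.
    label = {}
    next_id = 0
    for start in sorted(dense):
        if start in label:
            continue
        label[start] = next_id
        stack = [start]
        while stack:
            cx, cy = stack.pop()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in label:
                        label[nb] = next_id
                        stack.append(nb)
        next_id += 1
    return [label.get(cell_of(x, y), -1) for x, y in points]
```

The work after the counting pass depends only on the number of non-empty cells, which is where the expected speedup comes from. The price is that cluster boundaries follow the grid rather than the points.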

While the method used during the workshop was the classical algorithm, it is worth noting that it can be implemented more efficiently (average runtime of O(n log n)) using a more efficient data representation such as R-trees, which can be used to calculate only the distances to points that are somewhat "close" rather than to all points in the dataset in step (1). This would be another thing to look into in a real application.
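A full R-tree is beyond the scope of a short example, but the underlying idea of only computing distances to nearby points can be shown with a simpler stand-in, a uniform grid index with cell size ε. The sketch below is in Python, is not an R-tree, and uses hypothetical function names. It returns exactly the same neighbors as the brute-force search in step (1), because any point within distance ε must lie in the same cell or in one of the 8 surrounding cells.

```python
import math
from collections import defaultdict


def build_grid_index(points, eps):
    """Map each grid cell of side eps to the indices of the points inside it."""
    index = defaultdict(list)
    for i, (x, y) in enumerate(points):
        index[(math.floor(x / eps), math.floor(y / eps))].append(i)
    return index


def region_query(points, index, i, eps):
    """Indices of all points within distance eps of point i (including i itself)."""
    x, y = points[i]
    cx, cy = math.floor(x / eps), math.floor(y / eps)
    result = []
    # Only the 3 x 3 block of cells around the point can contain neighbors.
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for j in index.get((cx + dx, cy + dy), ()):
                if math.dist(points[i], points[j]) <= eps:
                    result.append(j)
    return sorted(result)
```

How much this helps depends on how evenly the points are spread: if most stops fall in a few cells, as they may in a dense hub, each query still has to examine many points.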

4 Conclusions and future work

We have seen that it is possible to find reasonable hubs in a network by looking at where vehicles stop and clustering these points. DBSCAN seems to work well for this kind of application, given that it is (or can be made) reasonably fast and is very robust to noise (of which there is a lot). There are also hierarchical variants of the method, which should prove useful when looking at different scales such as the country or city level.

We could also see that it is possible to find different types of hubs and to roughly identify some of these types. In reality, classification rather than clustering of hubs would likely be the best approach when you know what kind of hubs you are looking for. Since we could already see some clear differences between hubs using only a couple of features, this gives a clear indication that a classification of hubs should give reasonable results as well for most types of hubs.

Some more work would need to be done in order to evaluate the vehicle clustering, but the approach looks reasonable from the limited experiments conducted during the workshop. Including information at and between hubs, or what types of hubs are visited, would probably be a good next step.

While we have identified a couple of problems (errors in the data, parameter selection and speed of DBSCAN, etc.), most of these are either not essential to the rest of the analysis or not too hard to solve given a little more time, with solutions already available but not yet implemented in our analysis.

During the workshop we also discussed a number of things which could be useful from Scania’s point of view and which we believe should be possible to do given more time. A couple of examples follow below.

• Fleet management: After finding hubs and classifying them, tasks such as finding bottlenecks, calculating optimal routes and finding vehicle deviations from normal behavior could be done better, for example by taking into consideration popular places to stop, congestion spots at different times of the day, etc. when planning a route. It would also be possible to give more accurate comparisons for driver coaching, for example by only comparing a vehicle with other vehicles that drive similar routes rather than with the whole population.

• Business intelligence: On a higher level, by using a hub/graph representation it would be possible to conduct flow analysis and prediction on the overall network much more easily than on the original data, given the size of the original dataset. This could for example be used to detect and/or predict changes in the flow of goods or traffic on a larger scale.

Acknowledgments

Many thanks to the main organizers of the workshop at KTH, as well as Institute Mittag-Leffler for providing a very nice environment and venue for the week’s work. We would also like to show our appreciation to CIAM (Center for Industrial and Applied Mathematics) for sponsoring the workshop, as well as to Scania for providing a nice problem to work with during the workshop.

References

[1] M. Daszykowski, B. Walczak, and D. Massart. Looking for natural patterns in data: Part 1. Density-based approach. Chemometrics and Intelligent Laboratory Systems, 56(2):83–92, 2001.

[2] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.

[3] R. Duin, P. Juszczak, P. Paclik, E. Pekalska, D. de Ridder, D. Tax, and S. Verzakov. PRTools4.1, a Matlab toolbox for pattern recognition, 2007. http://prtools.org.

[4] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. Pages 226–231. AAAI Press, 1996.

[5] MATLAB. Statistics and Machine Learning Toolbox, version (R2015a). The MathWorks Inc., Natick, Massachusetts, 2015.
