1
Have we met?
Trying to find similar events in an archival database in the context of travel time estimation
Rafael J. Fernández-Moctezuma
April 3, 2007
2
Acknowledgements
• Kristin Tufte for introducing me to the “fetch similar” problem – and for helping break the problem into smaller pieces to solve.
• Umut Ozertem (OGI) for valuable discussion from a Machine Learning perspective.
3
Contents
• Issues in estimating travel time
– “Instantaneous estimate”
– A posteriori estimate
• Combining archived data efficiently
• Imposing structure to minimize the search space
• Preliminary results
• Concurrent / Future work
4
Motivation: Travel time
[Diagram: freeway segment, direction of flow]
5
Travel time
[Diagram: stations A, B, C along the direction of flow, with detectors MA, MB, MC]
Inductive loop detectors measure speed, occupancy, and volume. These are the three fundamental quantities used to reason theoretically about traffic flow.
6
Travel time
[Same diagram, with a Region of Influence marked around each detector]
7
Travel time
[Diagram: stations A, B, C with detectors MA, MB, MC; a VMS at entry point α, exit at Ω; regions of influence marked]
A Variable Message Sign could inform drivers of estimated travel times. This is useful before intersections: drivers may select alternate routes if considerable delay lies ahead.
8
Travel time
[Same diagram, with region boundaries AB and BC marked between α and Ω]
9
Instantaneous estimate
[Same diagram, with timeline t0 → tf]
At time t0, calculate the travel time from α to Ω as the sum of times between the regions (i.e., time from α to AB + time from AB to BC + time from BC to Ω).
However, by the time the vehicle arrives at AB, conditions measured at B may have changed.
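The instantaneous estimate can be sketched in a few lines. This is a minimal illustration, not the author's MATLAB prototype; the segment lengths and speeds are made-up numbers.

```python
# Instantaneous travel-time estimate: sum of (sub-segment length / speed
# measured right now at the detector covering that sub-segment).
def instantaneous_estimate(lengths_mi, speeds_mph):
    """Travel time in minutes, assuming each current speed holds for the whole trip."""
    return sum(60.0 * length / speed for length, speed in zip(lengths_mi, speeds_mph))

# Hypothetical segment split into three regions of influence.
lengths = [1.2, 1.5, 1.0]          # miles, covered by MA, MB, MC
speeds_at_t0 = [55.0, 30.0, 45.0]  # mph measured at time t0
eta = instantaneous_estimate(lengths, speeds_at_t0)
```

The weakness named on the slide is visible here: all three speeds are read at t0, even though the vehicle reaches the later sub-segments minutes afterwards.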
10
A posteriori estimate
[Same diagram, with timeline t0 → t1 → t2 → tf]
At time t0, calculate the travel time from α to AB (say, t1). At time t1, look at the condition reported by B, and calculate the time between AB and BC (and so on).
This is closer to the travel time a vehicle experienced, but this estimate cannot be computed online at α for the complete segment, since we cannot see the future.
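The a posteriori estimate walks forward in time, reading the speed that was in effect at each detector at the (simulated) arrival time. A minimal sketch, with a hypothetical speed history rather than real archive data:

```python
# A posteriori estimate: advance through the sub-segments, reading the speed
# measured at each downstream detector at the time we actually arrive there.
def a_posteriori_estimate(lengths_mi, speed_history, t0):
    """speed_history[i] maps a time (minutes) to the speed (mph) at detector i.
    Returns total travel time in minutes; only computable after the fact."""
    t = t0
    for i, length in enumerate(lengths_mi):
        speed = speed_history[i](t)      # speed in effect when we arrive
        t += 60.0 * length / speed
    return t - t0

# Hypothetical history: the middle detector slows down after minute 2.
history = [
    lambda t: 60.0,
    lambda t: 30.0 if t < 2.0 else 20.0,
    lambda t: 45.0,
]
tt = a_posteriori_estimate([1.0, 1.0, 1.0], history, t0=0.0)
```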
11
Archived data is useful
• It is possible to compute a posteriori estimates for previously observed measurements.
• This opens the possibility for incorporating previously seen travel times (associated with instantaneous measurements) for online estimation.
12
Identical vs. Similar
• We cannot guarantee that all possible combinations of measured values have been observed already.
• We would also like the recall of relevant data points to be fast – we don’t want to go through the entire history at every refresh.
• ML people constantly complain about “not having enough data”. In this case, we have a lot of data and we wish to quickly extract a representative sample.
13
Let’s step out and think in general terms: A model system
[Diagram: a data stream of measurements feeds a selection mechanism, which retrieves similar historical measurements from the archive; a fusion mechanism combines them into an estimation]
14
A model system
[Same model-system diagram]
How can we do this efficiently?
15
A model system
[Same model-system diagram]
What is a reasonable strategy?
16
A model system
[Same model-system diagram]
Our effort so far has concentrated on this section.
17
Impose structure in the data archive
• Databases are very efficient when we know what to ask (e.g., “value >= 20” benefits greatly from an index lookup, if the index exists).
• Can we index “similarity”?
• Consider imposing structure on previously seen information. We can be clever about what to index and reduce the search space on the fly.
18
Similar as “close” in a vector space
Retrieve the 3 nearest points to the new point
For n existing points in the space, we need n comparisons.
This problem is referred to as k-nearest neighbors (KNN).
We can define “close” in terms of Euclidean distance.
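The baseline search described above is easy to state in code: n distance computations, no pruning. A minimal sketch with made-up 2-D points:

```python
import math

# Brute-force k-nearest neighbors under Euclidean distance:
# n comparisons against the n stored points, with no pruning at all.
def knn(points, query, k):
    return sorted(points, key=lambda p: math.dist(p, query))[:k]

pts = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.5, 0.0)]
nearest3 = knn(pts, (0.0, 0.1), k=3)
```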
19
What if we can prune off some points?
Suppose we are given regional boundaries. It is now feasible to look first at which one is the closest region, and then perform the search within it.
The outcome of clustering can give us such boundaries.
20
K-means
One of the simplest clustering algorithms, k-means attempts to minimize the variance within each cluster while maximizing the variance between clusters.
It finds the centroids of k clusters. These points may not be observed points.
For this example, an initial comparison with three centroids reduces the search space to ~ 1/3.
21
K-means
The random-start property of k-means implies several limitations, three of which stand out:
(1) For a particular choice of k, there can be one or more solutions.
(2) It is possible to end up with empty clusters.
(3) The initial choice of centroids can be problematic (a bad start can converge to a poor solution).
We can cope with these limitations by doing several Monte Carlo runs.
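The combination above – Lloyd's k-means iteration plus several random restarts, keeping the lowest-cost run – can be sketched as follows. This is an illustrative plain-Python version, not the author's MATLAB code, and the toy data points are invented.

```python
import random

# Minimal k-means (Lloyd's algorithm) with random restarts ("Monte Carlo runs"):
# keep the run with the smallest total within-cluster squared distance.
def kmeans(data, k, restarts=5, iters=50, seed=0):
    rng = random.Random(seed)
    d2 = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    best_cost, best_centroids = float("inf"), None
    for _ in range(restarts):
        centroids = rng.sample(data, k)              # random start
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for p in data:                           # assign to nearest centroid
                clusters[min(range(k), key=lambda i: d2(p, centroids[i]))].append(p)
            centroids = [                            # recompute means; keep old
                tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
                for i, cl in enumerate(clusters)     # centroid if cluster empty
            ]
        cost = sum(min(d2(p, c) for c in centroids) for p in data)
        if cost < best_cost:
            best_cost, best_centroids = cost, centroids
    return best_centroids

data = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
cents = kmeans(data, k=2)
```

Note the centroids are means of cluster members, so (as the slide says) they need not be observed points.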
22
May not be a small enough set for KNN
What if we could further reduce the search within a cluster?
Suppose a torus is defined in terms of distances to the cluster centroid – we have already pre-computed them in pre-processing, and we have computed the novelty point’s distance when classifying as well.
Only do KNN with points between the two radii (d + λ and d – λ).
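The torus (in 2-D, an annulus) filter can be sketched as follows: a stored point survives only if its precomputed distance to the centroid lies within λ of the query's distance d; KNN then runs over the survivors. The points and λ here are hypothetical.

```python
import math

# "Torus" pruning: keep only stored points whose distance to the cluster
# centroid is within lam of the query's centroid distance d, then run
# brute-force KNN on the (much smaller) surviving set.
def annulus_knn(points, centroid, query, k, lam):
    d = math.dist(query, centroid)
    candidates = [p for p in points
                  if d - lam <= math.dist(p, centroid) <= d + lam]
    return sorted(candidates, key=lambda p: math.dist(p, query))[:k]

centroid = (0.0, 0.0)
pts = [(1.0, 0.0), (0.0, 1.1), (3.0, 0.0), (0.9, 0.1)]
hits = annulus_knn(pts, centroid, query=(1.0, 0.1), k=3, lam=0.5)
```

In a real deployment the centroid distances of archived points would be precomputed once, and d is already known from classifying the novelty, so the filter costs only comparisons.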
23
Back to transportation
• Input vector for a particular time t, in a segment s that contains n sensor stations:

v(t, s) = (speed(1), volume(1), occupancy(1), …, speed(n), volume(n), occupancy(n), traveltime)
• What are we clustering?
– The input vectors, looking only at the fundamental measurements from the n stations.
– Travel time is our measure of interest, i.e., the target for prediction. Clustering is “blind” to it.
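The split described above – fundamental measurements in, travel time held out as the target – can be sketched directly. The record layout (speed, volume, occupancy per station, travel time last) follows the input vector on the slide; the numbers are invented.

```python
# Each archived record is (speed, volume, occupancy) per station, followed by
# the a posteriori travel time.  Clustering sees only the first 3n values;
# travel time is kept aside as the prediction target.
def split_record(record, n_stations):
    features = record[: 3 * n_stations]   # what k-means clusters on
    travel_time = record[3 * n_stations]  # target; clustering is "blind" to it
    return features, travel_time

record = (55.0, 120, 0.08,   # station 1: speed, volume, occupancy
          42.0, 150, 0.15,   # station 2
          60.0,  90, 0.05,   # station 3
          7.5)               # a posteriori travel time (minutes)
feats, tt = split_record(record, n_stations=3)
```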
24
Considerations
• I said that clustering is “blind” to the a posteriori estimate of travel time
• This much is true, but we want clusters that help us predict travel time.
• The core assumption is that the fundamental measurements at a time t are related to the travel time (they are, we just haven’t expressed that yet).
25
Considerations
• Look at the difference in travel times among members of a cluster. This helps us choose a suitable number of clusters.
• We could use variance, but (1) it grows quadratically, and (2) it is not intuitive.
• Proposed error function should decrease smoothly as the number of clusters increases.
• The error function is saying “+/- 3 is all the same to me.”

For a cluster k with N_k elements, where x_i is the travel time of member i and x̄_k is the cluster mean:

error_k = (1 / N_k) · Σ_{i=1}^{N_k} | x_i − x̄_k |

Average over clusters for each choice of K:

Error(K) = (1 / K) · Σ_{k=1}^{K} error_k
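Reading the error function as the mean absolute deviation of travel time within each cluster, averaged over the K clusters, it is a few lines of code. The two clusters of travel times below are hypothetical.

```python
# Error(K): for each cluster, the mean absolute deviation of its members'
# travel times from the cluster mean; then the average over the K clusters.
def cluster_error(travel_times):
    mean = sum(travel_times) / len(travel_times)
    return sum(abs(x - mean) for x in travel_times) / len(travel_times)

def total_error(clusters):
    return sum(cluster_error(c) for c in clusters) / len(clusters)

# Two hypothetical clusters of travel times (minutes).
err = total_error([[4.0, 6.0], [10.0, 10.0, 13.0]])
```

Unlike variance, the deviation is in the same units as travel time, so “+/- 3 minutes” reads directly off the value.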
26
Prototype implementation
• Looked at a US 26 E sub-segment.
• Morning period that includes the peak: 06:00 – 11:00.
• Treated one day as historical, tested on another day (Oct. 9, 2006 and Oct. 12, 2006).
• Careful: if we just shuffle points to estimate performance, we are fooling ourselves – the fitting process may have seen past and future. For this domain, if we are to simulate data loss, always leave the test day(s) out.
27
US 26 E
Image from http://maps.google.com/
Looked at three stations: Cornell, Murray, Cedar Hills.
Hypothetical VMS between 185th and Cornell, with target destination between Cedar Hills and Parkway.
Segment length: 3.7 miles.
Believe it or not, it can take up to 30 minutes during rush hour (been there, done that).
28
Choice of k
• As expected, error function drops
[Plot “Choice of k”: error (on the order of 10^4) vs. number of clusters, 0–16; the error drops as k increases]
29
Choice of k
• As expected, error function drops
[Plot “Choice of k”, zoomed: error (100–600) vs. number of clusters (2–10); a suitable k is marked]
30
Simplifying criteria
• Radii determining the torus centered around the centroid:

R1 = d/2
R2 = 3d/2

where d is the distance from the novelty to the centroid.
• “Fusion mechanism” is the average of 3-nearest neighbors within the torus.
31
Experimental results
The trends during the peak period are followed correctly.
Unsurprisingly, the early peak is only somewhat captured – the fitting set did not have one. Still, ups and downs are discovered.
[Plot “Measured vs. Estimated travel times”: travel time in minutes (2–18) vs. time of day (06:00–11:00); legend: a posteriori estimate, on-line cluster-based estimate. ERRATA: legend labels reversed.]
32
Fitting set time series
[Plot: a posteriori travel time estimates for 10/9/2007; travel time in minutes (0–30) vs. time of day (6:00–11:00)]
33
Ongoing work (suggested by Kristin)
• Looking at probe runs and comparing the measured times with a posteriori estimates – previous efforts were made with instantaneous estimates only. Curious as to whether the a posteriori estimate is significantly different from the instantaneous one.
• Pick a larger dataset – OR 217 has better sensor density. Test over one month or so.
34
Future work
• Current prototype is in MATLAB – should I start getting familiar with Niagara?
• Any better ideas for the “fusion”? Should this just be an extra parameter? (“pick k nearest neighbors”)
• The radius estimate can be a potential problem. Any suggestions? Should this just be one more parameter to find during fitting?
35
Future work
• Is a torus the right shape? How about a hypercone? Could it be easily derived on the fly from pre-computed information?
• We have considered expanding the feature vector (temperature, precipitation, etc.). These measurements are updated hourly, and are sometimes available only the next day. Any other sources that may make sense?