Smarter Outlier Detection and Deeper Understanding of Large-Scale Taxi Trip Records: A Case Study of...

Preview:

Citation preview

Outline

•Introduction

•Background and Related Work

•Method and Discussions

•Experiments and Results

•Summary

Introduction

3

Taxi trip records•~300 million trips in about two years•~170 million trips (300 million passengers) in 2009•1/5 of that of subway riders and 1/3 of that of bus riders in NYC •The dataset is not perfect...

•13,000 Medallion taxi cabs

•License priced at $600, 000 in 2007

•Car services and taxi services are separate

•Only taxis with Medallion license are for hail (the rule could be under changing outside Manhattan...)

Introduction

Medallion#Shift#Trip#

Trip_Pickup_DateTimeTrip_Dropoff_DateTime

Trip_Pickup_LocationTrip_Dropoff_Location

Start_LonStart_LatEnd_LonEnd_Lat

Payment_TypeSurchargeTotal_AmtRate_Code

Passenger_CountFare_AmtTolls_AmtTip_AmtTrip_Time

Trip_Distance

vendor_namedate_loadedstore_and_forward

time_between_servicedistance_between_service

Start_Zip_CodeEnd_Zip_Code

start_xstart_yend_xend_y (local projection)

1

2

3

4

5

6

78 9

1110

Meshed up on purpose due to privacy concerns

Introduction• In addition:

– Some of the data fields are empty– Pickup and drop-off locations can

be in Hudson River – The recorded trip distance/duration

can be unreasonable– ...

Outlier detections for data cleaning are needed

Mission can be easier to handle 170 million trips with the help of U2SOD-DB

Background and Related Work

•Existing approaches for outlier detection for urban computing

•Thresholding: e.g. 200m < dist < 30km

•Locating in unusual ranges of distributions

•Spatial analysis: within a region or a land use type

•Matching trajectory with road segments – treat unmatched ones as outliers

•Some techniques require complete GPS traces while we only have O-D locations

•Large-scale Shortest path computing has not been used for outlier detection

Background and Related Work• Shortest path computation

– Dijkstra and A* – New generation algorithms– Contraction Hierarchy (CH) based

•Open source implementations of CH:

•MoNav

•OSRM

•Much faster than ArcGIS NA module

Background and Related Work

• Network Centrality (Brandes, 2008)

Vtvs st

stB

vvC

)(

)(Node based

Edge based

Vtvs st

stB

eeC

)(

)(

•Can be easily derived after shortest paths are computed

•Mapping node/edge between centrality can reveal the connection strengths among different parts of cities

Method and DiscussionsRaw Taxi trip data

Match pickup/drop-off point locations to street

segments within Distance D0

CD >D1 AND CD>W*RD?

Assign pickup/drop-off nodes by picking closer ones

Type I outlier (spatial analysis)

Compute shortest path

CD: Compute shortest distanceRD: Recorded trip distance

Type II outlier (network analysis)

Successful?

Aggregate unique (sid,tid) pairs

Update centrality measurements

Method and Discussions

• The approach is approximate in nature– Taxi drivers do not always follow shortest path– Especially for short trips and heavily congested areas – But we only care about aggregated centrality

measurements and the errors have a chance to be cancelled out by each other

• Increasing D0 will reduce # of type I outliners, but the locations might be mismatched with segments

• Reducing D1 and/or W will increase # of type II outliers but may generate false positives.

Experiments and Results

Count-Distance Distribution

0

5000000

10000000

15000000

20000000

<= 0

.0

( 0

.8,

1.0]

( 1

.8,

2.0]

( 2

.8,

3.0]

( 3

.8,

4.0]

( 4

.8,

5.0]

( 5

.8,

6.0]

( 6

.8,

7.0]

( 7

.8,

8.0]

( 8

.8,

9.0]

( 9

.8, 1

0.0]

( 10

.8, 1

1.0]

( 11

.8, 1

2.0]

( 12

.8, 1

3.0]

( 13

.8, 1

4.0]

( 14

.8, 1

5.0]

( 15

.8, 1

6.0]

( 16

.8, 1

7.0]

( 17

.8, 1

8.0]

( 18

.8, 1

9.0]

( 19

.8, 2

0.0]

Trip Distance (mile)

Cou

nt

Count-Time Distribution

0

5000000

10000000

15000000

20000000

<=

0

.0

( 2

.0,

3.0

]

( 5

.0,

6.0

]

( 8

.0,

9.0

]

( 1

1.0

, 1

2.0

]

( 1

4.0

, 1

5.0

]

( 1

7.0

, 1

8.0

]

( 2

0.0

, 2

1.0

]

( 2

3.0

, 2

4.0

]

( 2

6.0

, 2

7.0

]

( 2

9.0

, 3

0.0

]

( 3

2.0

, 3

3.0

]

( 3

5.0

, 3

6.0

]

( 3

8.0

, 3

9.0

]

( 4

1.0

, 4

2.0

]

( 4

4.0

, 4

5.0

]

( 4

7.0

, 4

8.0

]

> 5

0.0

TripTime (Minute)

Co

un

t

Count-Speed Distribution

0

5000000

10000000

15000000

20000000

<= 0

.0( 1

.0, 2

.0]( 3

.0, 4

.0]( 5

.0, 6

.0]( 7

.0, 8

.0]( 9

.0, 10

.0]( 1

1.0, 1

2.0]

( 13.0

, 14.0

]( 1

5.0, 1

6.0]

( 17.0

, 18.0

]( 1

9.0, 2

0.0]

( 21.0

, 22.0

]( 2

3.0, 2

4.0]

( 25.0

, 26.0

]( 2

7.0, 2

8.0]

( 29.0

, 30.0

]( 3

1.0, 3

2.0]

( 33.0

, 34.0

]( 3

5.0, 3

6.0]

( 37.0

, 38.0

]( 3

9.0, 4

0.0]

( 41.0

, 42.0

]( 4

3.0, 4

4.0]

( 45.0

, 46.0

]( 4

7.0, 4

8.0]

( 49.0

, 50.0

]

Speed (MPH)

Coun

t

Count-Fare Distribution

05000000

1000000015000000200000002500000030000000

<= 0

.0(

1.0,

2.0

](

3.0,

4.0

](

5.0,

6.0

](

7.0,

8.0

](

9.0,

10.

0]( 1

1.0,

12.

0]( 1

3.0,

14.

0]( 1

5.0,

16.

0]( 1

7.0,

18.

0]( 1

9.0,

20.

0]( 2

1.0,

22.

0]( 2

3.0,

24.

0]( 2

5.0,

26.

0]( 2

7.0,

28.

0]( 2

9.0,

30.

0]( 3

1.0,

32.

0]( 3

3.0,

34.

0]( 3

5.0,

36.

0]( 3

7.0,

38.

0]( 3

9.0,

40.

0]( 4

1.0,

42.

0]( 4

3.0,

44.

0]( 4

5.0,

46.

0]( 4

7.0,

48.

0]( 4

9.0,

50.

0]

Fare ($)

Coun

t

Over all distributions of trip distance, time, speed and fare

Experiments and Results

Mapping of Computed Shortest Paths Overlaid with NYC Community Districts Map

•D0=200 feet, D1=3 miles, W=2

•166 million trips, 25 million unique

•~2.5 millions (1.5%) type I outliers

•~ 18,000 type II outliers

•Shortest path computation completes in less than 2 hours (5,952 seconds) on a single CPU core (2.26 GHZ)

Experiments and Results

Examples of Detected Type II Outliers

Experiments and ResultsMapping Betweenness Centralities (All hours)

Experiments and Results

00H 02H 04H 06H 08H 10H

12H 14H 16H18H

20H 22H

Legend:

Mapping Betweenness Centralities (bi-hourly)

Summary

• Large-scale taxi trip records are error-prone due to a combination of device, human and information system induced errors – outlier detection and data cleaning are important preprocess steps.

• Our approach detects outliers that can not be snapped to street segments (through spatial analysis) and/or have significant differences between computed shortest distances and recorded trip distances (network analysis)

• The work is preliminary - a more comprehensive framework is needed (e.g., incorporating pickup and drop-off times, trip duration and fare information)

• It would be interesting to generate dynamics of betweenness maps at different traffic conditions, e.g., peak/non-peak, morning/afternoon and weekdays/weekends, and explore connection strengths among NYC regions.

Q&A

jzhang@cs.ccny.cuny.edu

Recommended