Smarter Outlier Detection and Deeper Understanding of Large-Scale Taxi Trip Records: A Case Study of NYC

Jianting Zhang

Department of Computer Science

The City College of New York

Outline

•Introduction

•Background and Related Work

•Method and Discussions

•Experiments and Results

•Summary

Introduction

Taxi trip records

•~300 million trips in about two years

•~170 million trips (300 million passengers) in 2009

•1/5 of that of subway riders and 1/3 of that of bus riders in NYC

•The dataset is not perfect...

•13,000 Medallion taxi cabs

•License priced at $600,000 in 2007

•Car services and taxi services are separate

•Only taxis with a Medallion license may be hailed on the street (this rule may be changing outside Manhattan...)

Introduction

Data fields in each trip record:

•Medallion#, Shift#, Trip#

•Trip_Pickup_DateTime, Trip_Dropoff_DateTime

•Trip_Pickup_Location, Trip_Dropoff_Location

•Start_Lon, Start_Lat, End_Lon, End_Lat

•Payment_Type, Surcharge, Total_Amt, Rate_Code

•Passenger_Count, Fare_Amt, Tolls_Amt, Tip_Amt, Trip_Time

•Trip_Distance

•vendor_name, date_loaded, store_and_forward

•time_between_service, distance_between_service

•Start_Zip_Code, End_Zip_Code

•start_x, start_y, end_x, end_y (local projection)

Messed up on purpose due to privacy concerns

Introduction

• In addition:

– Some of the data fields are empty

– Pickup and drop-off locations can be in the Hudson River

– The recorded trip distance/duration can be unreasonable

– ...

Outlier detection is needed for data cleaning

Handling 170 million trips becomes easier with the help of U2SOD-DB

Background and Related Work

•Existing approaches for outlier detection for urban computing

•Thresholding: e.g. 200m < dist < 30km

•Locating in unusual ranges of distributions

•Spatial analysis: within a region or a land use type

•Matching trajectory with road segments – treat unmatched ones as outliers

•Some techniques require complete GPS traces while we only have O-D locations

•Large-scale shortest path computation has not been used for outlier detection
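The thresholding approach listed above can be illustrated with a minimal sketch. It uses the example bounds from the slide (200 m and 30 km); the function name and the sample distances are hypothetical and only serve the illustration.

```python
def within_thresholds(dist_m, lo_m=200.0, hi_m=30_000.0):
    """Return True when a trip distance in meters lies strictly inside (lo_m, hi_m)."""
    return lo_m < dist_m < hi_m


# Made-up trip distances in meters, not taken from the taxi dataset.
trip_distances_m = [50.0, 1200.0, 45_000.0]

# A trip that fails the test would be treated as an outlier.
flags = [within_thresholds(d) for d in trip_distances_m]
```

A fixed threshold is cheap to apply but ignores where the trip took place, which is one reason the slides turn to spatial and network analysis.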

Background and Related Work

• Shortest path computation

– Dijkstra and A*

– New generation algorithms

– Contraction Hierarchy (CH) based

•Open source implementations of CH:

•MoNav

•OSRM

•Much faster than ArcGIS NA module
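For reference, the classic baseline named above can be sketched in a few lines. This is plain Dijkstra with a binary heap, not a Contraction Hierarchy and not the code used by MoNav or OSRM; the adjacency-list layout and the toy graph are assumptions made for the example.

```python
import heapq


def dijkstra(graph, source):
    """Shortest distances from source; graph maps node -> list of (neighbor, weight)."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, a shorter route to u was already settled
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Hypothetical three-node street graph with edge lengths in arbitrary units.
toy_graph = {"a": [("b", 1.0), ("c", 4.0)], "b": [("c", 2.0)], "c": []}
distances = dijkstra(toy_graph, "a")
```

CH-based engines answer the same query after a preprocessing step, which is what makes them practical for millions of origin-destination pairs.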

Background and Related Work

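• Network Centrality (Brandes, 2008)

Node based: $C_B(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}$

Edge based: $C_B(e) = \sum_{s, t \in V} \frac{\sigma_{st}(e)}{\sigma_{st}}$

Here $\sigma_{st}$ is the number of shortest paths from $s$ to $t$, and $\sigma_{st}(v)$ (or $\sigma_{st}(e)$) is the number of those paths that pass through node $v$ (or edge $e$).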

•Can be easily derived after shortest paths are computed

•Mapping node/edge betweenness centrality can reveal the connection strengths among different parts of cities
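A minimal sketch of how node and edge scores might be accumulated once shortest paths are available. It assumes one computed path per origin-destination pair and weights each path by its trip count; both choices, along with the function name and the toy paths, are assumptions for illustration rather than the author's implementation.

```python
from collections import Counter


def accumulate_centrality(paths_with_counts):
    """Sum trip counts over the interior nodes and the edges of each path."""
    node_score = Counter()
    edge_score = Counter()
    for path, count in paths_with_counts:
        # Node-based: skip the endpoints, mirroring s != v != t.
        for v in path[1:-1]:
            node_score[v] += count
        # Edge-based: every consecutive node pair on the path.
        for u, v in zip(path, path[1:]):
            edge_score[(u, v)] += count
    return node_score, edge_score


# Hypothetical paths as node lists, each with a trip count.
paths = [(["a", "b", "c"], 3), (["b", "c", "d"], 2)]
node_score, edge_score = accumulate_centrality(paths)
```

Because the scores are simple sums, they can be updated one path at a time as the shortest-path computation proceeds.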

Method and Discussions

•Raw taxi trip data

•Match pickup/drop-off point locations to street segments within distance D0

•Successful?

– No: Type I outlier (spatial analysis)

– Yes: assign pickup/drop-off nodes by picking the closer ones

•Compute shortest path

•CD > D1 AND CD > W*RD? (CD: computed shortest distance; RD: recorded trip distance)

– Yes: Type II outlier (network analysis)

– No: aggregate unique (sid,tid) pairs and update centrality measurements
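The two decision points in the workflow can be written as a small function. This is a sketch only: the function name, the string labels, and the boolean `snapped` argument (standing in for the street-segment matching step) are hypothetical, while the test CD > D1 AND CD > W*RD comes from the slide.

```python
def classify_trip(snapped, cd, rd, d1, w):
    """Label one trip using the workflow's two tests.

    snapped: whether both endpoints matched a street segment within D0
    cd: computed shortest distance; rd: recorded trip distance (same unit as d1)
    """
    if not snapped:
        return "type_I"   # spatial analysis: endpoints could not be matched
    if cd > d1 and cd > w * rd:
        return "type_II"  # network analysis: shortest path far exceeds the record
    return "ok"           # trip contributes to the centrality measurements


# Parameter values reported in the experiments: D1 = 3 miles, W = 2.
label = classify_trip(snapped=True, cd=5.0, rd=1.0, d1=3.0, w=2.0)
```

Requiring both conditions keeps short trips out of the Type II set, since a computed distance below D1 never qualifies no matter how small the recorded distance is.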

Method and Discussions

• The approach is approximate in nature

– Taxi drivers do not always follow the shortest path

– Especially for short trips and in heavily congested areas

– But we only care about aggregated centrality measurements, and the errors have a chance to cancel each other out

• Increasing D0 will reduce the number of type I outliers, but the locations might be mismatched with segments

• Reducing D1 and/or W will increase the number of type II outliers but may generate false positives
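To make the role of D0 concrete, here is a sketch of matching a point to the nearest street segment within a distance limit. It assumes planar coordinates (such as the start_x/start_y fields in the local projection) and a brute-force scan over segments; the function names and the two toy segments are hypothetical, and a real system would use a spatial index.

```python
import math


def point_segment_distance(px, py, ax, ay, bx, by):
    """Planar distance from point P to the segment from A to B."""
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(px - ax, py - ay)  # degenerate segment
    # Project P onto the line through A and B, clamped to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def snap(px, py, segments, d0):
    """Return (segment_id, distance) of the nearest segment within d0, else None."""
    best = None
    for seg_id, (ax, ay, bx, by) in segments.items():
        d = point_segment_distance(px, py, ax, ay, bx, by)
        if d <= d0 and (best is None or d < best[1]):
            best = (seg_id, d)
    return best


# Two made-up horizontal street segments, coordinates in feet.
segments = {"s1": (0.0, 0.0, 100.0, 0.0), "s2": (0.0, 300.0, 100.0, 300.0)}
match = snap(50.0, 100.0, segments, d0=200.0)
```

A point that returns None here would be a type I outlier; raising `d0` turns some of those into matches, at the risk of attaching the point to the wrong segment.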

Experiments and Results

[Figure: four histograms of trip counts. Count-Distance Distribution (x-axis: Trip Distance (mile)); Count-Time Distribution (x-axis: Trip Time (Minute)); Count-Speed Distribution (x-axis: Speed (MPH)); Count-Fare Distribution (x-axis: Fare ($))]

Overall distributions of trip distance, time, speed and fare

Mapping of Computed Shortest Paths Overlaid with NYC Community Districts Map

•D0=200 feet, D1=3 miles, W=2

•166 million trips, 25 million unique (sid,tid) pairs

•~2.5 million (1.5%) type I outliers

•~18,000 type II outliers

•Shortest path computation completes in less than 2 hours (5,952 seconds) on a single 2.26 GHz CPU core
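The reported figures can be cross-checked with a few lines of arithmetic. The sketch below uses the rounded numbers from the slide and assumes that the 25 million unique items are origin-destination pairs, each needing one shortest-path query; the variable names are hypothetical.

```python
trips = 166_000_000
unique_pairs = 25_000_000   # assumed to be unique origin-destination pairs
type1_outliers = 2_500_000
type2_outliers = 18_000
seconds = 5_952

type1_share = type1_outliers / trips       # about 1.5% of all trips
type2_share = type2_outliers / trips       # about 0.01% of all trips
pairs_per_second = unique_pairs / seconds  # about 4,200 queries per second
```

Under that assumption, the single-core throughput is on the order of a few thousand shortest-path queries per second.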

Examples of Detected Type II Outliers

Experiments and Results

Mapping Betweenness Centralities (All hours)

Mapping Betweenness Centralities (bi-hourly)

[Figure: twelve map panels with a legend, one per two-hour interval: 00H, 02H, 04H, 06H, 08H, 10H, 12H, 14H, 16H, 18H, 20H, 22H]

Summary

• Large-scale taxi trip records are error-prone due to a combination of device-, human- and information-system-induced errors, so outlier detection and data cleaning are important preprocessing steps.

• Our approach detects outliers that cannot be snapped to street segments (through spatial analysis) and/or have significant differences between computed shortest distances and recorded trip distances (through network analysis).

• The work is preliminary; a more comprehensive framework is needed (e.g., incorporating pickup and drop-off times, trip duration and fare information).

• It would be interesting to map how betweenness changes under different traffic conditions, e.g., peak/non-peak, morning/afternoon and weekdays/weekends, and to explore connection strengths among NYC regions.

jzhang@cs.ccny.cuny.edu
