Analyzing NYC Transit Data: Taxis, Ubers, and Citi Bikes Todd Schneider April 8, 2016 [email protected]


Page 1: Analyzing NYC Transit Data

Analyzing NYC Transit Data: Taxis, Ubers, and Citi Bikes

Todd Schneider

April 8, 2016

[email protected]

Page 2: Analyzing NYC Transit Data

Where to find me

toddwschneider.com

github.com/toddwschneider

@todd_schneider

toddsnyder

Page 3: Analyzing NYC Transit Data

Things I’ll talk about

• Taxi, Uber, and Citi Bike data

• Medium data analysis tools and tips

• Where does R fit in?

Page 4: Analyzing NYC Transit Data

Taxi and Uber Data

http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/

Page 5: Analyzing NYC Transit Data

Citi Bike Data

http://toddwschneider.com/posts/a-tale-of-twenty-two-million-citi-bikes-analyzing-the-nyc-bike-share-system/

Page 6: Analyzing NYC Transit Data

NYC Taxi and Uber Data

• Taxi & Limousine Commission released public, trip-level data for over 1.1 billion taxi rides 2009–2015

• Some public Uber data available as well, thanks to a FOIL request by FiveThirtyEight

http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

Page 7: Analyzing NYC Transit Data

Citi Bike Data

• Citi Bike releases monthly data for every individual ride

• Data includes timestamps and locations, plus rider’s subscriber status, gender, and age

https://www.citibikenyc.com/system-data

Page 8: Analyzing NYC Transit Data

Generic Analysis Overview

1. Get raw data

2. Write code to process raw data into something more useful

3. Analyze data

4. Write about what you found out

Page 9: Analyzing NYC Transit Data

Analysis Tools

• PostgreSQL

• PostGIS

• R

• Command line

• JavaScript

https://github.com/toddwschneider/nyc-taxi-data https://github.com/toddwschneider/nyc-citibike-data

Page 10: Analyzing NYC Transit Data

Raw data processing goals

• Load flat files of varying file formats into a unified, persistent PostgreSQL database that we can use to answer questions about the data

• Do some one-time calculations to augment the raw data

• We want to answer neighborhood-based questions, so we’ll map latitude/longitude coordinates to NYC census tracts

Page 11: Analyzing NYC Transit Data

Processing raw data: The reality

• Often messy, raw data can require massaging

• Not fun, takes a while, but is essential

• Specifically: we have to plan ahead a bit, anticipate usage patterns and the questions we’re going to ask, then decide on a schema

Page 12: Analyzing NYC Transit Data

Raw Data

Page 13: Analyzing NYC Transit Data

Specific issues encountered with raw taxi data

• Some files contain empty lines and unquoted carriage returns 😐

• Raw data files have different formats even within the same cab type 😕

• Some files contain extra columns in every row 😠

• Some files contain extra columns in only some rows 😡
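
A minimal Python sketch of the kind of cleanup these issues call for. It is an illustration only, not the code from the linked repository; the function name `clean_rows` and the `expected_columns` parameter are hypothetical, and dropping rows that are too short is an assumption about how you might want to treat malformed input.

```python
import csv
import io


def clean_rows(raw_text, expected_columns):
    """Yield rows with exactly `expected_columns` fields.

    - Windows line endings are normalised and stray carriage returns removed
    - empty lines are skipped
    - extra trailing columns are dropped
    - rows with too few columns are skipped (an assumption, see lead-in)
    """
    # Normalise "\r\n" first, then remove any bare "\r" left inside a line.
    text = raw_text.replace("\r\n", "\n").replace("\r", "")
    for row in csv.reader(io.StringIO(text)):
        if not row or all(field.strip() == "" for field in row):
            continue  # empty line
        if len(row) < expected_columns:
            continue  # malformed row, too few fields
        yield row[:expected_columns]


raw = "a,b,c\r\n\r\n1,2,3,,\n4,5\r,6\n"
cleaned = list(clean_rows(raw, expected_columns=3))
```

Here `cleaned` holds three rows of three fields each: the header, the row that had two extra empty columns, and the row that contained a stray carriage return.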

Page 14: Analyzing NYC Transit Data

How do we load a bunch of files into a database?

• One at a time!

• A Bash script loops through each raw data file; for each file, it executes code to process the data and insert records into a database table

https://github.com/toddwschneider/nyc-taxi-data/blob/master/import_trip_data.sh
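
The same one-file-at-a-time pattern, sketched in Python rather than Bash. This is not the linked import script: it uses SQLite from the standard library as a stand-in for PostgreSQL so the example runs anywhere, and the three-column `trips` table and the `load_files` name are hypothetical.

```python
import csv
import sqlite3
from pathlib import Path


def load_files(data_dir, conn):
    """Load every CSV file in `data_dir` into one table, one file at a time.

    Returns the total number of rows inserted.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trips ("
        "pickup_datetime TEXT, pickup_longitude REAL, pickup_latitude REAL)"
    )
    total = 0
    for path in sorted(Path(data_dir).glob("*.csv")):
        with open(path, newline="") as f:
            rows = [
                (
                    r["pickup_datetime"],
                    float(r["pickup_longitude"]),
                    float(r["pickup_latitude"]),
                )
                for r in csv.DictReader(f)
            ]
        conn.executemany("INSERT INTO trips VALUES (?, ?, ?)", rows)
        conn.commit()  # commit after each file, so finished files stay loaded
        total += len(rows)
    return total
```

Committing per file means a failure partway through leaves the already-loaded files in place, which matters when the full load takes days.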

Page 15: Analyzing NYC Transit Data

How do we map latitude and longitude to census tracts?

• PostGIS!

• Geographic information system (GIS) for PostgreSQL

• Can do calculations of the form, “is a point inside a polygon?”

• Every pickup/drop off is a point, NYC’s census tracts are polygons

Page 16: Analyzing NYC Transit Data

NYC Census Tracts

• 2,166 tracts

• 196 neighborhood tabulation areas (NTAs)

Page 17: Analyzing NYC Transit Data

Shapefiles

• Shapefile format describes geometries like points, lines, polygons

• Many shapefiles publicly available, e.g. NYC provides a shapefile that contains definitions for all census tracts and NTAs

• PostGIS includes functionality to import shapefiles

Page 18: Analyzing NYC Transit Data

Shapefile Example

Page 19: Analyzing NYC Transit Data

PostGIS: ST_Within()

• ST_Within(geom A, geom B) function returns true if and only if A is entirely within B

• A = pickup or drop off point

• B = NYC census tract polygon
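
To make the "is a point inside a polygon?" question concrete, here is a small Python sketch of the classic ray casting test. It illustrates the idea only; it is not how PostGIS implements ST_Within(), and its treatment of points that lie exactly on an edge is not guaranteed to match PostGIS. The function name and the flat (x, y) coordinates are assumptions made for the example.

```python
def point_in_polygon(x, y, polygon):
    """Return True if (x, y) is inside `polygon`, a list of (x, y) vertices.

    Casts a horizontal ray from the point towards +x and counts how many
    polygon edges it crosses: an odd count means the point is inside.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


square = [(0, 0), (4, 0), (4, 4), (0, 4)]
pickup_inside = point_in_polygon(2, 2, square)
pickup_outside = point_in_polygon(5, 2, square)
```

The loop visits every edge of the polygon, which is why the test gets expensive for detailed census tract boundaries evaluated over a billion points.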

Page 20: Analyzing NYC Transit Data

Spatial Indexes

• Problem: determining whether a point is inside an arbitrary polygon is computationally intensive and slow

• PostGIS spatial indexes to the rescue!

Page 21: Analyzing NYC Transit Data

Spatial indexes in a nutshell

[Figure: a census tract shown inside its bounding box]

Page 22: Analyzing NYC Transit Data

Spatial Indexes

• Determining whether a point is inside a rectangle is easy!

• Spatial indexes store rectangular bounding boxes for polygons, then when determining if a point is inside a polygon, calculate in 2 steps:

1. Is the point inside the polygon’s bounding box?

2. If so, is the point inside the polygon itself?

• Most of the time the cheap first check will be false, so we can skip the expensive second step
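
The two-step check can be sketched in Python as follows. This is a simplified illustration: a real PostGIS spatial index also arranges the bounding boxes in a tree so that most boxes are never examined at all, whereas this sketch scans them in a plain loop. All names here (`find_tract`, the `tracts` dictionary layout) are hypothetical.

```python
def bounding_box(polygon):
    """Return (min_x, min_y, max_x, max_y) for a list of (x, y) vertices."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def in_box(x, y, box):
    """Cheap check: four comparisons."""
    min_x, min_y, max_x, max_y = box
    return min_x <= x <= max_x and min_y <= y <= max_y


def in_polygon(x, y, polygon):
    """Expensive check: ray casting over every edge of the polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            if x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
                inside = not inside
    return inside


def find_tract(x, y, tracts):
    """Return (tract name or None, number of expensive checks performed).

    `tracts` maps a name to a (bounding box, polygon) pair.
    """
    expensive_checks = 0
    for name, (box, polygon) in tracts.items():
        if not in_box(x, y, box):
            continue  # step 1 failed, skip the expensive step entirely
        expensive_checks += 1
        if in_polygon(x, y, polygon):  # step 2
            return name, expensive_checks
    return None, expensive_checks


polygons = {
    "A": [(0, 0), (4, 0), (0, 4)],
    "B": [(10, 10), (12, 10), (12, 12), (10, 12)],
}
tracts = {name: (bounding_box(p), p) for name, p in polygons.items()}
```

A point such as (3.5, 3.5) passes tract A's bounding box check but fails the polygon check, which is exactly the case the second step exists for.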

Page 23: Analyzing NYC Transit Data

Putting it all together

• Download NYC census tracts shapefile, import into database, create spatial index

• Download raw taxi/Uber/Citi Bike data files and loop through them, one file at a time

• For each file: fix data issues, load into database, calculate census tracts with ST_Within()

• Wait 3 days and voila!

Page 24: Analyzing NYC Transit Data

Analysis, a.k.a. “the fun part”

• Ask fun and interesting questions

• Try to answer them

• Rinse and repeat

Page 25: Analyzing NYC Transit Data

Taxi maps

• Question: what does a map of every taxi pickup and drop off look like?

• Each trip has a pickup and drop off location, plot a bunch of dots at those locations

• Made entirely in R using ggplot2

Page 26: Analyzing NYC Transit Data

Taxi maps

Page 27: Analyzing NYC Transit Data

Taxi maps preprocess

• Problem: R can’t fit 1.1 billion rows in memory

• Solution: preprocess data by rounding lat/long to 4 decimal places (~10 meters), count number of trips at each aggregated point

https://github.com/toddwschneider/nyc-taxi-data/blob/master/analysis/prepare_analysis.sql#L194-L215
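
The linked SQL does this aggregation inside PostgreSQL. As a small illustration of the same rounding-and-counting idea, here is a Python sketch; the function name and the sample coordinates are made up for the example.

```python
from collections import Counter


def aggregate_points(points, decimals=4):
    """Count trips per rounded location.

    `points` is an iterable of (latitude, longitude) pairs; the result maps
    each rounded pair to the number of trips that fell on it.
    """
    return Counter(
        (round(lat, decimals), round(lon, decimals)) for lat, lon in points
    )


pickups = [
    (40.75001, -73.99002),
    (40.75004, -73.99001),
    (40.75120, -73.99000),
]
counts = aggregate_points(pickups)
```

The first two pickups collapse onto the same rounded point, so the map needs one dot with a count of 2 instead of two separate dots. Applied to the full data set, this is what shrinks the input to something R can hold.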

Page 28: Analyzing NYC Transit Data

Render maps in R

https://github.com/toddwschneider/nyc-taxi-data/blob/master/analysis/analysis.R

Page 29: Analyzing NYC Transit Data

Data reliability

Every other comment on reddit:

Page 30: Analyzing NYC Transit Data

Citi Bike Animation

• Map the position of every Citi Bike over the course of a single day

• Google Maps Directions API for cycling directions

• Leaflet.js for mapping

• Torque.js by CartoDB for animation

Page 31: Analyzing NYC Transit Data

Citi Bike Assumptions

• Google Maps cycling directions have a strong bias for dedicated bike lanes on 1st, 2nd, 8th, and 9th Avenues

• Not necessarily true!

Page 32: Analyzing NYC Transit Data

Modeling the relationship between the weather and Citi Bike ridership

Page 33: Analyzing NYC Transit Data

Modeling the relationship between the weather and Citi Bike ridership

• Daily ridership data from Citi Bike

• Daily weather data from National Climatic Data Center: temperature, precipitation, snow depth

• Devise and calibrate model in R

Page 34: Analyzing NYC Transit Data

Modeling the relationship between the weather and Citi Bike ridership

Page 35: Analyzing NYC Transit Data

Model specification

Page 36: Analyzing NYC Transit Data

Calibration in R

• Uses nlsLM() function from minpack.lm package for Levenberg–Marquardt algorithm to minimize nonlinear squared error

https://gist.github.com/toddwschneider/bac3350f84b2ff99969d

Page 37: Analyzing NYC Transit Data

Model Results

Page 38: Analyzing NYC Transit Data

Airport traffic

• Question: how long will my taxi take to get to the airport?

• LGA, JFK, and EWR are each their own census tracts

• Get all trips that dropped off in one of those tracts

• Calculate travel times from neighborhoods to airports
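
A rough Python sketch of the steps above, assuming each trip already carries its pickup neighborhood and drop off tract. The field names, the airport labels used as tract identifiers, and the choice of the median as the summary statistic are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

# Hypothetical identifiers for the three airport census tracts.
AIRPORT_TRACTS = {"LGA", "JFK", "EWR"}


def median_minutes_to_airports(trips):
    """Return {(pickup neighborhood, airport): median trip minutes}.

    Each trip is a dict with keys pickup_neighborhood, dropoff_tract,
    pickup_datetime and dropoff_datetime (the last two are datetimes).
    """
    durations = defaultdict(list)
    for trip in trips:
        if trip["dropoff_tract"] not in AIRPORT_TRACTS:
            continue  # keep only trips that ended at an airport
        elapsed = trip["dropoff_datetime"] - trip["pickup_datetime"]
        key = (trip["pickup_neighborhood"], trip["dropoff_tract"])
        durations[key].append(elapsed.total_seconds() / 60)
    return {key: median(values) for key, values in durations.items()}


sample_trips = [
    {"pickup_neighborhood": "Midtown", "dropoff_tract": "JFK",
     "pickup_datetime": datetime(2015, 6, 1, 8, 0),
     "dropoff_datetime": datetime(2015, 6, 1, 8, 30)},
    {"pickup_neighborhood": "Midtown", "dropoff_tract": "JFK",
     "pickup_datetime": datetime(2015, 6, 1, 17, 0),
     "dropoff_datetime": datetime(2015, 6, 1, 17, 40)},
    {"pickup_neighborhood": "Midtown", "dropoff_tract": "Chelsea",
     "pickup_datetime": datetime(2015, 6, 1, 9, 0),
     "dropoff_datetime": datetime(2015, 6, 1, 9, 10)},
]
travel_times = median_minutes_to_airports(sample_trips)
```

Grouping additionally by hour of day would be the natural next step for answering the question at a specific departure time.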

Page 39: Analyzing NYC Transit Data

Airport traffic

Page 40: Analyzing NYC Transit Data

More fun stuff in the full posts

• On the realism of Die Hard 3

• Relationship between age, gender, and cycling speed

• Neighborhoods with most nightlife

• East Hampton privacy concerns

• What time do investment bankers arrive at work?

Page 41: Analyzing NYC Transit Data

“Medium data” analysis tips

Page 42: Analyzing NYC Transit Data

What is “medium data”?

No clear answer, but my rough thinking:

• Tiny: fits in spreadsheet

• Small: doesn’t fit in spreadsheet, but fits in RAM

• Medium: too big for RAM, but fits on local hard disk

• Big: too big for local disk, has to be distributed across many nodes

Page 43: Analyzing NYC Transit Data

Use the right tool for the job

My personal toolkit (yours may vary!):

• PostgreSQL for storing and aggregating data. Geospatial calculations with PostGIS extension

• R for modeling and plotting

• Command line tools for looping through files, loading data, text processing on input data with sed, awk, etc.

• Ruby for making API calls, scraping websites, running web servers, and sometimes using local rails apps to organize relational data

• JavaScript for interactivity on the web

Page 44: Analyzing NYC Transit Data

R + PostgreSQL

• The R ↔ Postgres link is invaluable! Use R and Postgres for the things they’re respectively best at

• Postgres: persisting data in tables, rote number crunching

• R: calibrating models, plotting

• RPostgreSQL package allows querying Postgres from within R

Page 45: Analyzing NYC Transit Data

Tip: pre-aggregate

• Think about how you’re going to access the data, and consider creating intermediate aggregated tables which can be used as building blocks for later analysis

• Example: number of taxi trips grouped by pickup census tract and date/time truncated to the hour

• Resulting table is only 30 million rows, easier to work with than full trips table, and can still answer lots of interesting questions

Page 46: Analyzing NYC Transit Data

Pre-aggregating example

CREATE TABLE hourly_pickups AS
SELECT
  date_trunc('hour', pickup_datetime) AS pickup_hour,
  cab_type_id,
  pickup_nyct2010_gid,
  COUNT(*)
FROM trips
WHERE pickup_nyct2010_gid IS NOT NULL
GROUP BY pickup_hour, cab_type_id, pickup_nyct2010_gid;

https://github.com/toddwschneider/nyc-taxi-data/blob/master/analysis/prepare_analysis.sql#L30-L38
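
For readers less familiar with SQL, the same grouping logic can be sketched in Python. This only illustrates what the query computes; on the real data the work belongs in PostgreSQL, since the trips table does not fit in memory. The function name and the tuple layout of each trip are assumptions.

```python
from collections import Counter
from datetime import datetime


def hourly_pickups(trips):
    """Count trips per (pickup hour, cab type, pickup census tract).

    `trips` is an iterable of (pickup_datetime, cab_type_id, tract_gid)
    tuples; trips with no census tract are skipped, like the WHERE clause.
    """
    counts = Counter()
    for pickup_datetime, cab_type_id, tract_gid in trips:
        if tract_gid is None:
            continue
        # Equivalent of date_trunc('hour', pickup_datetime)
        pickup_hour = pickup_datetime.replace(minute=0, second=0, microsecond=0)
        counts[(pickup_hour, cab_type_id, tract_gid)] += 1
    return counts


sample = [
    (datetime(2015, 1, 1, 8, 5), 1, 1201),
    (datetime(2015, 1, 1, 8, 55), 1, 1201),
    (datetime(2015, 1, 1, 9, 0), 1, 1201),
    (datetime(2015, 1, 1, 8, 30), 2, None),
]
result = hourly_pickups(sample)
```

The two trips picked up between 8:00 and 9:00 in the same tract collapse into one row with a count of 2, which is how a billion-row table shrinks to tens of millions of rows.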

Page 47: Analyzing NYC Transit Data

How to get people to read your work

• It has to be interesting. If you’re not excited, probably nobody else is either

• Most people are distracted, and they read things in “fast scroll” mode. Optimize for them

• The questions you ask are more important than the methods you use to answer them

Page 48: Analyzing NYC Transit Data

Specific tips

• Write in short paragraphs with straightforward language

• Use plenty of section headers

• Good ratio of pictures to text

• Avoid the dreaded “wall of text”

Page 49: Analyzing NYC Transit Data

Above all…

• Have fun!

• Keep an inquisitive mind. Observe stuff happening around you, ask questions about it, try to answer those questions