Analyzing NYC Transit Data

Analyzing NYC Transit Data: Taxis, Ubers, and Citi Bikes

Todd Schneider

April 8, 2016

todd@toddwschneider.com

Where to find me

toddwschneider.com

github.com/toddwschneider

@todd_schneider

toddsnyder

Things I’ll talk about

• Taxi, Uber, and Citi Bike data

• Medium data analysis tools and tips

• Where does R fit in?

Taxi and Uber Data

http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/

Citi Bike Data

http://toddwschneider.com/posts/a-tale-of-twenty-two-million-citi-bikes-analyzing-the-nyc-bike-share-system/

NYC Taxi and Uber Data

• Taxi & Limousine Commission released public, trip-level data for over 1.1 billion taxi rides from 2009 to 2015

• Some public Uber data available as well, thanks to a FOIL request by FiveThirtyEight

http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

Citi Bike Data

• Citi Bike releases monthly data for every individual ride

• Data includes timestamps and locations, plus rider’s subscriber status, gender, and age

https://www.citibikenyc.com/system-data

Generic Analysis Overview

1. Get raw data

2. Write code to process raw data into something more useful

3. Analyze data

4. Write about what you found out

Analysis Tools

• PostgreSQL

• PostGIS

• Command line

• JavaScript

https://github.com/toddwschneider/nyc-taxi-data

https://github.com/toddwschneider/nyc-citibike-data

Raw data processing goals

• Load flat files of varying file formats into a unified, persistent PostgreSQL database that we can use to answer questions about the data

• Do some one-time calculations to augment the raw data

• We want to answer neighborhood-based questions, so we’ll map latitude/longitude coordinates to NYC census tracts

Processing raw data: The reality

• Often messy, raw data can require massaging

• Not fun, takes a while, but is essential

• Specifically: we have to plan ahead a bit, anticipate usage patterns, questions we’re going to ask, then decide on schema

Raw Data

Specific issues encountered with raw taxi data

• Some files contain empty lines and unquoted carriage returns 😐

• Raw data files have different formats even within the same cab type 😕

• Some files contain extra columns in every row 😠

• Some files contain extra columns in only some rows 😡
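Issues like these can be handled in a small cleanup pass before loading. The following is a minimal sketch in Python, not the code from the linked repository: the function name `clean_rows` and the `expected_columns` parameter are hypothetical, and it assumes that any extra columns are trailing and safe to drop.

```python
import csv
import io

def clean_rows(raw_text, expected_columns):
    """Yield cleaned rows from messy CSV text.

    Strips carriage returns, skips empty lines, and truncates rows that
    carry extra trailing columns (assumption: extras are always trailing).
    """
    # Remove every carriage return so stray ones cannot split or pad a record
    normalized = raw_text.replace("\r", "")
    for row in csv.reader(io.StringIO(normalized)):
        # csv.reader yields [] for a blank line; also skip all-blank rows
        if not row or all(field.strip() == "" for field in row):
            continue
        # Keep only the columns we expect, dropping any extras
        yield row[:expected_columns]
```

Truncating is only one possible policy; rows with extra columns could instead be logged and inspected by hand.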

How do we load a bunch of files into a database?

• One at a time!

• A Bash script loops through the raw data files; for each file, it runs code to process the data and insert records into a database table

https://github.com/toddwschneider/nyc-taxi-data/blob/master/import_trip_data.sh

How do we map latitude and longitude to census tracts?

• PostGIS!

• Geographic information system (GIS) for PostgreSQL

• Can do calculations of the form, “is a point inside a polygon?”

• Every pickup/drop off is a point; NYC’s census tracts are polygons

NYC Census Tracts

• 2,166 tracts

• 196 neighborhood tabulation areas (NTAs)

Shapefiles

• Shapefile format describes geometries like points, lines, polygons

• Many shapefiles publicly available, e.g. NYC provides a shapefile that contains definitions for all census tracts and NTAs

• PostGIS includes functionality to import shapefiles

Shapefile Example

PostGIS: ST_Within()

• ST_Within(geom A, geom B) function returns true if and only if A is entirely within B

• A = pickup or drop off point

• B = NYC census tract polygon

Spatial Indexes

• Problem: determining whether a point is inside an arbitrary polygon is computationally intensive and slow

• PostGIS spatial indexes to the rescue!

Spatial indexes in a nutshell

[Figure: a census tract polygon enclosed by its rectangular bounding box]

Spatial Indexes

• Determining whether a point is inside a rectangle is easy!

• Spatial indexes store rectangular bounding boxes for polygons, then when determining if a point is inside a polygon, calculate in 2 steps:

1. Is the point inside the polygon’s bounding box?

2. If so, is the point inside the polygon itself?

• Most of the time the cheap first check will be false, so we can skip the expensive second step
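The two-step check can be illustrated in plain Python. This is a sketch of the idea only, not how PostGIS implements it: a real spatial index also avoids scanning every bounding box, and the function names and the `(name, polygon, box)` tuple layout here are hypothetical. The exact test uses ray casting, one common point-in-polygon technique.

```python
def bounding_box(polygon):
    """Return (min_x, min_y, max_x, max_y) for a list of (x, y) vertices."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)

def in_bounding_box(point, box):
    """Cheap check: four comparisons."""
    x, y = point
    min_x, min_y, max_x, max_y = box
    return min_x <= x <= max_x and min_y <= y <= max_y

def in_polygon(point, polygon):
    """Expensive check: ray casting, visiting every edge of the polygon."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def find_tract(point, tracts):
    """tracts: list of (name, polygon, box) with boxes precomputed once."""
    for name, polygon, box in tracts:
        if not in_bounding_box(point, box):
            continue  # step 1 failed, skip the expensive step
        if in_polygon(point, polygon):
            return name
    return None
```

Precomputing each box once is what makes the first step cheap: the per-point cost is a few comparisons, regardless of how many vertices the tract has.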

Putting it all together

• Download NYC census tracts shapefile, import into database, create spatial index

• Download raw taxi/Uber/Citi Bike data files and loop through them, one file at a time

• For each file: fix data issues, load into database, calculate census tracts with ST_Within()

• Wait 3 days and voila!

Analysis, a.k.a. “the fun part”

• Ask fun and interesting questions

• Try to answer them

• Rinse and repeat

Taxi maps

• Question: what does a map of every taxi pickup and drop off look like?

• Each trip has a pickup and drop off location; plot a bunch of dots at those locations

• Made entirely in R using ggplot2

Taxi maps

Taxi maps preprocess

• Problem: R can’t fit 1.1 billion rows in memory

• Solution: preprocess data by rounding lat/long to 4 decimal places (~10 meters), count number of trips at each aggregated point

https://github.com/toddwschneider/nyc-taxi-data/blob/master/analysis/prepare_analysis.sql#L194-L215
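The rounding-and-counting step can be sketched in a few lines of Python. The talk does this in SQL (linked above); the version below is only an illustration of the same idea, and the function name `aggregate_points` is hypothetical.

```python
from collections import Counter

def aggregate_points(points, decimals=4):
    """Count trips per rounded (lat, long) cell.

    points: iterable of (latitude, longitude) pairs.
    Rounding to 4 decimal places groups nearby points into one cell.
    """
    counts = Counter()
    for lat, lon in points:
        counts[(round(lat, decimals), round(lon, decimals))] += 1
    return counts
```

The output has one row per distinct rounded location rather than one row per trip, which is what makes it small enough to plot.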

Render maps in R

https://github.com/toddwschneider/nyc-taxi-data/blob/master/analysis/analysis.R

Data reliability

Every other comment on reddit:

Citi Bike Animation

• Map the position of every Citi Bike over the course of a single day

• Google Maps Directions API for cycling directions

• Leaflet.js for mapping

• Torque.js by CartoDB for animation

Citi Bike Assumptions

• Google Maps cycling directions have a strong bias for dedicated bike lanes on 1st, 2nd, 8th, and 9th Avenues

• Not necessarily true!

Modeling the relationship between the weather and Citi Bike ridership

• Daily ridership data from Citi Bike

• Daily weather data from National Climatic Data Center: temperature, precipitation, snow depth

• Devise and calibrate model in R

Modeling the relationship between the weather and Citi Bike ridership

Model specification

Calibration in R

• Uses the nlsLM() function from the minpack.lm package, which applies the Levenberg–Marquardt algorithm to minimize the squared error of a nonlinear model

https://gist.github.com/toddwschneider/bac3350f84b2ff99969d

Model Results

Airport traffic

• Question: how long will my taxi take to get to the airport?

• LGA, JFK, and EWR are each their own census tracts

• Get all trips that dropped off in one of those tracts

• Calculate travel times from neighborhoods to airports
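The steps above can be sketched in Python as a filter followed by a group-by. This is an illustration under assumptions, not the analysis code from the talk: the dictionary field names, the tract labels, and the choice of the median as the summary statistic are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

def median_minutes_to_airports(trips, airport_tracts):
    """Median trip duration in minutes per (pickup neighborhood, airport).

    trips: iterable of dicts with hypothetical keys
      pickup_neighborhood, dropoff_tract, pickup_time, dropoff_time
    airport_tracts: set of tract labels that correspond to airports
    """
    durations = defaultdict(list)
    for trip in trips:
        # Keep only trips that dropped off in an airport tract
        if trip["dropoff_tract"] not in airport_tracts:
            continue
        elapsed = trip["dropoff_time"] - trip["pickup_time"]
        key = (trip["pickup_neighborhood"], trip["dropoff_tract"])
        durations[key].append(elapsed.total_seconds() / 60)
    return {key: median(values) for key, values in durations.items()}
```

Grouping additionally by hour of day or day of week would be a natural extension for answering "how long will my taxi take right now?".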

Airport traffic
