Capital Bikeshare Presentation

Page 1: Capital Bikeshare Presentation

Capital Bikeshare

MAY 2, 2015

GEORGETOWN UNIVERSITY
SCHOOL OF CONTINUING STUDIES
CERTIFICATE IN DATA ANALYTICS
CAPSTONE PROJECT

SELMA ORR
RYAN DONAHUE
ODETTE RIVERA
NORA GOEBELBECKER
KARINA HIDALGO

Page 2: Capital Bikeshare Presentation

Problem Statement

Cities want to build bike systems for economic development and sustainability, but they face serious fiscal constraints.

Page 3: Capital Bikeshare Presentation

Problem Statement

As a result, the public has little tolerance for error – even though usage, and therefore profitability, is highly variable.

Most popular bikeshare station, 2014:

Union Station – 131,700 trips

Least popular bikeshare station, 2014:

34th St & Minnesota Ave SE – 112 trips

For roughly the same fixed costs, the Union Station location serves more than 1,000 times as many riders each year as the least popular station.
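The size of that gap can be checked directly from the two trip counts on this slide. The variable names below are illustrative, not taken from the project code.

```python
# Trip counts for 2014 as quoted on the slide.
union_station_trips = 131_700
least_popular_trips = 112

# Ratio of the busiest station to the quietest one.
ratio = union_station_trips / least_popular_trips
print(f"Union Station saw about {ratio:,.0f} times as many trips")
```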

Page 4: Capital Bikeshare Presentation

Problem Statement

Currently, there is no standardized, rigorous methodology for accurately predicting which stations will be most heavily used.

Page 5: Capital Bikeshare Presentation

Goal

Develop a model, based on Washington, DC but applicable to other U.S. cities, that will predict the popularity of bikeshare stations based on characteristics of the area surrounding each station.

Such a model could be used to increase the popularity of new and existing bikeshare systems, making them more financially sustainable.

Page 6: Capital Bikeshare Presentation

Hypotheses

• Bikeshare station popularity is influenced by station location – specifically, the economic, demographic, and geographic characteristics of surrounding neighborhoods.

• Certain determinants of bikeshare station popularity hold true across cities, allowing for the construction of a model that could accurately predict the popularity of bikeshare stations in other cities.

• Regression models, such as linear regression, are a good fit to predict station popularity.

Page 7: Capital Bikeshare Presentation

Project Background

This project builds primarily upon two previous analyses of Washington, DC:

1. Maximizing Bicycle Sharing: An Empirical Analysis of Capital Bikeshare Usage
• Multivariate regression to identify five factors that influenced bikeshare station popularity: population aged 20-39, non-white population, retail proximity, Metro proximity, and distance from system center.

2. Predicting the Popularity of Bicycle Sharing Stations: An Accessibility-Based Approach Using Linear Regression and Random Forests
• Linear regression and random forest analysis to understand how job and residential proximity influenced station popularity. Although the study attempted to extend the model to San Francisco and Minneapolis, it found that the model was a poor predictor of station popularity in those cities.

Why revisit this topic?
• Our team identified other characteristics that might drive usage.
• The bikeshare system has expanded considerably since those studies, offering a larger and more varied sample.

Page 8: Capital Bikeshare Presentation

Data Sources and Ingestion

Data | Variable Type | Source | Year | Geography | Format
Bikeshare trips | Dependent | Capital Bikeshare | 2014 | N/A | CSV
Bikeshare stations | Dependent | DC Open Data | 2014 | Point | CSV/shapefile
Population - age | Independent | U.S. Census/ACS | 2013 | Block Group | CSV
Population - race | Independent | U.S. Census/ACS | 2013 | Block Group | CSV
DC liquor license locations | Independent | DC Open Data | 2014 | Point | CSV/shapefile
DC Metro stations (train and bus) | Independent | WMATA | 2014 | Point | CSV/shapefile
Parks (DC, National) | Independent | DC Open Data | 2014 | Polygon | CSV/shapefile
Campuses (college, university) | Independent | DC Open Data | 2014 | Polygon | CSV/shapefile
Historic landmarks | Independent | DC Open Data | 2014 | Polygon | CSV/shapefile
Starbucks and McDonald’s locations | Independent | Various online sources | 2014 | Point | CSV

Page 9: Capital Bikeshare Presentation

Data Sources and Ingestion

Data | Variable Type | Source | Year | Geography | Format
Total trips per bike per year | Dependent | Capital Bikeshare | 2014 | N/A | CSV
Bikeshare stations | Dependent | DC Open Data | 2014 | Point | CSV/shapefile
Population - age | Independent | U.S. Census/ACS | 2013 | Block Group | CSV
Population - race | Independent | U.S. Census/ACS | 2013 | Block Group | CSV
DC liquor license locations | Independent | DC Open Data | 2014 | Point | CSV/shapefile
DC Metro stations (train and bus) | Independent | WMATA | 2014 | Point | CSV/shapefile
Parks (DC, National) | Independent | DC Open Data | 2014 | Polygon | CSV/shapefile
Campuses (college, university) | Independent | DC Open Data | 2014 | Polygon | CSV/shapefile
Historic landmarks | Independent | DC Open Data | 2014 | Polygon | CSV/shapefile
Starbucks and McDonald’s locations | Independent | Various online sources | 2014 | Point | CSV

Data | Variable Type | Format
Amenities within walking distance from bikeshare stations (sum) | Independent | CSV
Distance from bikeshare station to each type of amenity (closest amenity within walking distance of station, 0.5 miles or less) | Independent | CSV
Socioeconomic characteristics of the population sharing the same census block group as the bikeshare station | Independent | CSV/shapefile

Page 10: Capital Bikeshare Presentation

Capital Bikeshare Data

• Data attributes for each bike trip: Trip Duration, Start/End Station (address), Start/End Date Time, Bike Number, and Member Type

• Divided by time period: a year of trip data downloaded from the Capital Bikeshare website was divided into four files, each representing a quarter.

• A separate dataset with the coordinates of bikeshare station locations was obtained from the DC Open Data website.

• How to measure popularity? Trips leaving, trips arriving, total trips? Capacity differs at each station, so popularity = total trips (arriving + departing) per bike per year.
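The popularity measure described above can be sketched in a few lines. This is a minimal illustration, not the team's code: the trip records and the `capacity` figures are invented, and station capacity is assumed to be the "per bike" denominator.

```python
from collections import Counter

# Hypothetical trip records for one year: (start_station, end_station).
trips = [
    ("Union Station", "Lincoln Memorial"),
    ("Lincoln Memorial", "Union Station"),
    ("Union Station", "Dupont Circle"),
    ("Dupont Circle", "Union Station"),
]

# Hypothetical number of bikes each station holds (assumed denominator).
capacity = {"Union Station": 8, "Lincoln Memorial": 2, "Dupont Circle": 1}


def station_popularity(trips, capacity):
    """Return (arrivals + departures) per bike for each station."""
    totals = Counter()
    for start, end in trips:
        totals[start] += 1  # one departure
        totals[end] += 1    # one arrival
    return {station: totals[station] / capacity[station] for station in capacity}


popularity = station_popularity(trips, capacity)
```

Normalizing by capacity keeps a large station from looking popular only because it holds more bikes.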

Page 11: Capital Bikeshare Presentation

Census Data

• Socioeconomic and demographic data collected for all Census block groups within DC, in the form of CSV files downloaded from American FactFinder

• Census block groups are the smallest geographic area for which sample data is published (typically 600 to 3,000 residents)

• Challenge 1: how to link stations (discrete points) with block groups (boundaries)?
◦ Solution: ArcGIS was able to identify block groups by the lat/long of each bikeshare station

• Challenge 2: how to deal with missing data?
◦ For missing rent (4 instances): impute by calculating the average rent/income ratio across the city
◦ For missing income (2 instances): impute by averaging two adjacent block groups
◦ For missing population (10 instances, national park areas): leave blank
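The two imputation rules above can be sketched as follows. The block-group records, the `neighbours` mapping, and the function name are all hypothetical; the slides do not say how the adjacent block groups were chosen or in what order the rules were applied.

```python
from statistics import mean

# Hypothetical block-group records; None marks a missing value.
block_groups = {
    "A": {"income": 50_000, "rent": 1_000},
    "B": {"income": 100_000, "rent": 2_000},
    "C": {"income": 80_000, "rent": None},   # missing rent
    "D": {"income": None, "rent": 1_500},    # missing income
}

# Hypothetical adjacency: the two neighbours used to fill a missing income.
neighbours = {"D": ("A", "B")}


def impute(block_groups, neighbours):
    """Fill missing income from two neighbours and missing rent from the citywide ratio."""
    filled = {name: dict(values) for name, values in block_groups.items()}

    # Missing income: average of the two adjacent block groups.
    for name, (first, second) in neighbours.items():
        if filled[name]["income"] is None:
            filled[name]["income"] = mean(
                [block_groups[first]["income"], block_groups[second]["income"]]
            )

    # Citywide average rent/income ratio, using only complete original records.
    ratios = [
        v["rent"] / v["income"]
        for v in block_groups.values()
        if v["rent"] is not None and v["income"] is not None
    ]
    avg_ratio = mean(ratios)

    # Missing rent: income times the citywide ratio.
    for values in filled.values():
        if values["rent"] is None and values["income"] is not None:
            values["rent"] = values["income"] * avg_ratio
    return filled


filled = impute(block_groups, neighbours)
```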

Page 12: Capital Bikeshare Presentation

Nearby Amenity Data

• Selection of amenities based largely on past studies and common-sense drivers of bikeshare usage: metro and bus stations, college campuses, DC and national parks, entertainment, restaurants, bars (proxied by liquor licenses)

• Two ways to think about the importance of amenities as drivers of usage:
◦ How close the single closest location of each type of amenity is (likely most important for metro stations)
◦ How many locations of each type of amenity there are within a half mile (likely most important for restaurants, bars)

• One challenge: whenever there wasn’t a single location of an amenity within a half mile (a common result for metro stations), ArcGIS recorded the distance as “0”
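The two amenity measures, and one possible way around the ambiguous “0”, can be sketched in plain Python. This is not the ArcGIS workflow the team used: the haversine formula stands in for the GIS distance calculation, the coordinates are made up, and returning `None` when nothing is in range is an assumption about how the missing case might be flagged.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/long points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def amenity_features(station, amenities, radius_miles=0.5):
    """Return (count within radius, distance to nearest within radius or None)."""
    distances = [haversine_miles(*station, *amenity) for amenity in amenities]
    within = [d for d in distances if d <= radius_miles]
    # None, not 0, when nothing is in range, so "no amenity nearby" cannot be
    # mistaken for "amenity at the station itself".
    nearest = min(within) if within else None
    return len(within), nearest


# Hypothetical coordinates: one amenity about 0.2 miles away, one well over a mile away.
station = (38.9000, -77.0000)
amenities = [(38.9030, -77.0000), (38.9200, -77.0000)]
count, nearest = amenity_features(station, amenities)
```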

Page 13: Capital Bikeshare Presentation

Data Wrangling

The primary challenge in the data wrangling process was to create an architecture that links each individual station with its census block group (and associated socioeconomic characteristics) as well as with distances to surrounding physical amenities.

Page 14: Capital Bikeshare Presentation

Spatial Analysis

Page 15: Capital Bikeshare Presentation

Spatial Analysis

Page 16: Capital Bikeshare Presentation

Spatial Analysis

Union Station 34th and Minnesota Ave SE

Page 17: Capital Bikeshare Presentation

Data Wrangling

All four files with quarterly ridership data were loaded into a PostgreSQL table and merged with the Census/Bike station.csv file to add the stationID, latitude, and longitude of each station.

The resulting file contained all trips in 2014 with reference information about each station, and it provided a common field that uniquely links records between the files.
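The load-and-merge step can be sketched with Python's built-in `sqlite3` module standing in for PostgreSQL. The table names, column names, sample rows, and coordinates below are illustrative assumptions; the slides only say that trips were joined to the station file to pick up stationID, latitude, and longitude.

```python
import sqlite3

# Hypothetical stand-ins for the four quarterly trip files: (start_station, end_station).
quarterly_trips = [
    [("Union Station", "Lincoln Memorial")],
    [("Lincoln Memorial", "Union Station")],
    [("Union Station", "Unknown Dock")],
    [("Lincoln Memorial", "Lincoln Memorial")],
]

# Hypothetical station file rows: (station_id, address, latitude, longitude).
stations = [
    (1, "Union Station", 38.8977, -77.0064),
    (2, "Lincoln Memorial", 38.8893, -77.0502),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (quarter INTEGER, start_station TEXT, end_station TEXT)")
conn.execute(
    "CREATE TABLE stations (station_id INTEGER PRIMARY KEY, address TEXT, "
    "latitude REAL, longitude REAL)"
)

# Load all four quarters into a single trips table.
for quarter, rows in enumerate(quarterly_trips, start=1):
    conn.executemany(
        "INSERT INTO trips VALUES (?, ?, ?)", [(quarter, s, e) for s, e in rows]
    )
conn.executemany("INSERT INTO stations VALUES (?, ?, ?, ?)", stations)

# LEFT JOIN keeps trips whose station address has no match, so they can be reviewed.
merged = conn.execute(
    """
    SELECT t.quarter, t.start_station, s.station_id, s.latitude, s.longitude,
           e.station_id AS end_station_id
    FROM trips AS t
    LEFT JOIN stations AS s ON s.address = t.start_station
    LEFT JOIN stations AS e ON e.address = t.end_station
    ORDER BY t.quarter
    """
).fetchall()
```

Unmatched addresses show up as `None` in the joined columns, which makes naming mismatches between the trip files and the station file easy to spot.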

Page 18: Capital Bikeshare Presentation

Data Exploration

Correlated independent variables with one another to anticipate collinearity. Shown here are metro station proximity vs. density, and bar proximity vs. single households.
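A pairwise correlation check of this kind can be sketched with NumPy. The data here are synthetic, the variable names are borrowed from the slides only as labels, and the 0.7 cutoff is an assumed threshold, not one stated in the presentation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 201  # number of stations in the study sample

# Synthetic predictors: DENSITY is built to be strongly correlated with METRO_D.
metro_d = rng.normal(size=n)
density = -0.8 * metro_d + 0.2 * rng.normal(size=n)
single = rng.normal(size=n)

names = ["METRO_D", "DENSITY", "SINGLE"]
X = np.column_stack([metro_d, density, single])

# Correlation matrix across columns (rowvar=False treats columns as variables).
corr = np.corrcoef(X, rowvar=False)

# Flag predictor pairs whose absolute correlation exceeds the assumed threshold.
threshold = 0.7
flagged = [
    (names[i], names[j], float(corr[i, j]))
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > threshold
]
```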

Page 19: Capital Bikeshare Presentation

Data Exploration

A few variables that appear to have little correlation with station popularity: metro station proximity and % of households in Census block that commute by public transit.

Page 20: Capital Bikeshare Presentation

Data Exploration

A few variables that seem to be correlated with station popularity: % of population in Census block that drives to work, and % of population in Census block with a college degree.

Page 21: Capital Bikeshare Presentation

Data Exploration: Correlations with y

[Bar chart: correlation of each independent variable with y; correlation axis runs from -0.600 to 0.600. Variables shown: DRIVE, N_PARK_N, LIQUOR_N, WHITE, LANDMARK_D, MCDON_D, LANDMARK_N, WALK, AGE, BA, STARB_D, BUS_N, METRO_N, STARB_N, DENSITY, MCDON_N, RENT, SINGLE, CAMPUS_D, N_PARK_D, METRO_D, LIQUOR_D, INCOME, CAMPUS_N, RENTER, BUS_D, TRANSIT, DC_PARK_D, POP, DC_PARK_N]

Page 22: Capital Bikeshare Presentation

2014 Capital Bikeshare Member Survey

“Compared with all commuters in the region, they were, on average, considerably younger, more likely to be male, Caucasian, and slightly less affluent.”

“Two-thirds (64%) of respondents said that at least one of the Capital Bikeshare trips they made last month either started or ended at a Metrorail station and 21% had used bikeshare six or more times for this purpose. About a quarter (24%) of respondents used Capital Bikeshare to access a bus in the past month.”

Page 23: Capital Bikeshare Presentation

Data Exploration: Excerpt from cross-correlation matrix

Page 24: Capital Bikeshare Presentation

Study Methodology: Feature Selection or better said “Feature Wrangling”

Step 1: Ran a full Ordinary Least Squares regression with all thirty independent variables using StatsModels in Python.

◦ R-squared: 0.636
◦ Adjusted R-squared: 0.577
◦ Eight variables with significant p-values: DRIVE, WHITE, DENSITY, AGE, WALK, BUS_N, TRANSIT, MCDON_N
◦ As expected, a very large condition number (1.11e+06), indicating strong multicollinearity
◦ F-statistic: 10.73; Prob(F-statistic): 3.42e-25
◦ Designated this as Model 2

Step 2: Beginning with DRIVE, the variable with the highest linear correlation with y, sequentially added the other variables to the OLS regression in order of correlation.

◦ Any variable that triggered a multicollinearity warning was left out
◦ Any variable without a significant p-value was left out
◦ Seven variables with significant p-values: DRIVE, WHITE, LIQUOR_N, BUS_N, MCDON_N, SINGLE, CAMPUS_N
◦ Designated this as Model 1
◦ Note: Experimented quite a bit with the k features module, but it adds features sequentially by descending correlation and does not take multicollinearity into account

Page 25: Capital Bikeshare Presentation

Definitions of relevant features

Census:
DRIVE: share of population in Census block that drives to work (Model 1 and Model 2)
WALK: share of population in Census block that walks to work (Model 2)
TRANSIT: share of population in Census block that takes transit to work (Model 2)
WHITE: white share of population in Census block (Model 1 and Model 2)
DENSITY: population density in Census block (Model 2)
AGE: median age in Census block (Model 2)
SINGLE: share of households in block group that are single (i.e., non-family) (Model 1)

Amenities:
BUS_N: number of bus stations within half a mile (Model 1 and Model 2)
MCDON_N: number of McDonald’s within half a mile (Model 1 and Model 2)
LIQUOR_N: number of establishments with a liquor license within half a mile (Model 1)
CAMPUS_N: number of college campuses within half a mile (Model 1)

Page 26: Capital Bikeshare Presentation

OLS Output for Model 1

Page 27: Capital Bikeshare Presentation

Model 1 Results: y-Actual versus y-Pred

Page 28: Capital Bikeshare Presentation

OLS Output for Model 2

Page 29: Capital Bikeshare Presentation

Model 2 Results: y-Actual versus y-Pred

Page 30: Capital Bikeshare Presentation

Study Methodology: Machine Learning

Step 1: Selected the following regression types for Machine Learning on Model 1 and Model 2

◦ OLS
◦ Ridge
◦ RidgeCV
◦ Lasso
◦ LassoCV
◦ Decision Tree
◦ Random Forest

Step 2: Prepared the data

◦ Because there were only 201 stations (rows of data), opted against K-fold cross-validation
◦ Used repeated random sub-sampling validation with 20% of the data held out for testing and 80% used for training
◦ Ran n=15 iterations for each regression type and averaged the results of the 15 trials
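The validation scheme in Step 2 can be sketched with NumPy alone. This is an illustration under stated assumptions: it fits only plain OLS (via least squares) rather than the seven regression types listed, it uses synthetic data, and the function names and seed are invented. scikit-learn's `ShuffleSplit` provides the same kind of splitting if that library is preferred.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination on held-out data."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def repeated_subsampling_r2(X, y, n_trials=15, test_fraction=0.2, seed=0):
    """Average test-set R-squared over repeated random 80/20 splits."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = int(round(test_fraction * n))
    scores = []
    for _ in range(n_trials):
        perm = rng.permutation(n)  # a fresh random split for every trial
        test_idx, train_idx = perm[:n_test], perm[n_test:]

        # Fit OLS with an intercept on the training rows only.
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)

        # Score on the held-out rows.
        A_test = np.column_stack([np.ones(n_test), X[test_idx]])
        scores.append(r_squared(y[test_idx], A_test @ coef))
    return float(np.mean(scores)), scores


# Synthetic data sized like the study sample (201 rows).
rng = np.random.default_rng(1)
X = rng.normal(size=(201, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=201)

mean_r2, trial_scores = repeated_subsampling_r2(X, y)
```

Unlike K-fold cross-validation, the test sets from different trials may overlap, so some rows can be tested several times and others not at all.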

Page 31: Capital Bikeshare Presentation

Study Methodology: Machine Learning

Model 1

Regression | R-Squared Averages, Ex. 1 | R-Squared Averages, Ex. 2 | R-Squared Averages, Ex. 3
OLS | 0.466372594903 | 0.472123692839 | 0.399698146173
Ridge | 0.469910211714 | 0.46817885095 | 0.406793762824
Ridge CV | 0.466525097058 | 0.472095157445 | 0.399928079436
Decision Tree (depth = 2) | 0.383743276391 | 0.395203148425 | 0.359422535555
Decision Tree (depth = 5) | 0.396411487916 | 0.399304877207 | 0.396627898939
Lasso | 0.213675679287 | 0.192417630297 | 0.20648584883
Lasso CV | 0.466325652215 | 0.472022317547 | 0.399672003362
Random Forest | 0.513764682165 | 0.521753949359 | 0.510745984545

Page 32: Capital Bikeshare Presentation

Model 1 Random Forest Results: y-Actual versus y-Predict

Page 33: Capital Bikeshare Presentation

Study Methodology: Machine Learning

Model 2

Regression | R-Squared Averages, Ex. 1 | R-Squared Averages, Ex. 2 | R-Squared Averages, Ex. 3
OLS | 0.520828019361 | 0.540831229205 | 0.513422473234
Ridge | 0.497919255131 | 0.516490091569 | 0.506607344645
Ridge CV | 0.516873054267 | 0.54031441364 | 0.515242580429
Decision Tree (depth = 2) | 0.32349467429 | 0.311292315645 | 0.285600505202
Decision Tree (depth = 5) | 0.353623405764 | 0.435172086349 | 0.350751443572
Lasso | 0.191216516526 | 0.217208742758 | 0.196249828561
Lasso CV | 0.52129767449 | 0.541115629648 | 0.506104546918
Random Forest | 0.468915321109 | 0.533377083505 | 0.42975292961

Page 34: Capital Bikeshare Presentation

Model 2 Results, Ridge CV: y-Actual versus y-Pred

Page 35: Capital Bikeshare Presentation

Model 2 Results, Lasso CV: y-Actual versus y-Pred

Page 36: Capital Bikeshare Presentation

Data Product

Groundwork: By analyzing the correlation between bikeshare station popularity and factors such as station location and geographic and demographic information, we obtained results that allow us to create a data product that predicts the likelihood of success or failure of a new bikeshare station in the DC area prior to implementation.

Results: Our models succeed in explaining about half of the variance in bikeshare station popularity as measured by our utilization factor. This should at least help in predicting the potential popularity of a station based on the combination of demographic and geographic factors we identified as significant.

Further applications: In addition to identifying promising locations for new bikeshare stations in DC, the results may also generalize to other cities. Using data on the demographic and geographic factors we identified as significant, a user could predict promising locations for bike stations during an initial roll-out, thus enhancing the overall success of a new project without costly experimentation.

Page 37: Capital Bikeshare Presentation

What worked?

GIS was critical to creating an effective architecture: it linked stations to amenities by distance, and to Census blocks and associated data. Since the data volume we handled was small, using a local machine rather than a powerful database or cloud environment helped to achieve faster results.

Spending time exploring the variables before beginning analysis. This, plus domain knowledge, allowed the team to identify and manually address data issues that the software had not calculated accurately.

Trying many different types of regressions using different variables (including different forms of independent variable – log, natural log).

Page 38: Capital Bikeshare Presentation

What didn’t work? And lessons learned.

Data sample was relatively small – started at roughly 350 stations, but shrank to roughly 200 once MD and VA locations were removed from the analysis.

Data and feature wrangling take a long time (80% of the process); domain expertise makes this easier.

It would have been harder to detect and address anomalies and missing values had the sample been larger; familiarity with DC allowed us to understand why data were missing for national parks or institutional land uses.

Couldn’t do k-fold cross validation given small sample size.

Decision tree model didn’t work for our analysis.

Page 39: Capital Bikeshare Presentation

Conclusion

Model offers a good starting point for assessing likely popularity of station locations, using data that is readily available for most major U.S. cities. Some subjective decision-making will still be required around major parks (e.g., National Mall) or institutional land uses (campuses, hospitals).

Bikeshare continues to gain momentum across the country. Future studies should:

• Use a larger sample
◦ Idea: use DC as model, test against NYC and Chicago. Instead: use DC, NYC, and Chicago to build a model with a larger sample, and test against a fourth city.
• Categorize stations by function within network
◦ Stations have different functions: residential feeders to metro stations, tourism. With a larger sample, these station types could be separated and the drivers of popularity independently determined.

Important to note that there are valid reasons other than current popularity that should determine station placement (e.g., equity, driving changes in travel behavior). This model helps ensure financial viability so that these outcomes can be pursued.