Analysis on Bike Rental Data to Predict Future Use

MA 575

By: Miles Avila, Kevin Choi, JungTak Joo, Kimberly Nguyen, Tianyuan Zhou

12/9/2014

Casual Model Building: J.J., T.Z.
Registered Model Building: K.C., J.J., K.N.
Introduction & Background: M.A.
Modeling and Analysis: K.N.
Prediction & Discussion: T.Z.
Proofread & formatting: M.A., K.C., K.N.

Abstract

The goal of this analysis is to predict the number of bike users on any given day of the year using linear modeling techniques. Given the increasing popularity of bike sharing and the amount of available data, predictive models are increasingly important for understanding bike users and programs. Our analysis begins with exploratory techniques, including scatterplots of the original data, which provided preliminary insight into the dataset and guided our early models. We then improved the models through variable selection, transformation, model comparison, and tests for non-constant variance. Our final predictive model is divided into two separate models, one for casual and one for registered bike users. The final casual model is biased, mostly because of the increase in bike users in 2012; the registered model, after a mean shift, is unbiased but has large variance. For both models, the worst predictions occurred around holidays and during extreme weather conditions.

Introduction

Bike sharing is an innovative transportation program, ideal for short-distance point-to-point trips, that lets users pick up a bicycle at any self-serve bike station and return it to any other station within the system's service area. These systems have become popular in major metropolitan areas around the world. Currently, there are over 500 bike-sharing programs worldwide, together comprising over 500 thousand bicycles. There is great interest in these systems because of their important role in traffic, environmental, and health issues. Rentals are automated, which, when coupled with other sensor data such as temperature and weather characteristics, facilitates predicting future bike use. From the perspective of the companies that own these systems, it is of interest to build accurate models that predict bike use on any given day. In contrast to other modes of transportation, such as bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded in these systems. This unique feature lets a bike-sharing system act as a virtual sensor network for sensing mobility in the city. It may even be possible to detect which events are most important in a city by monitoring these data.

Background

In this study, we create a model that uses the number of bike-sharing users on any given day in one year to predict the number on the same day in a different year. In general, such predictions are difficult because many relevant variables are not in our dataset. These include, but are not limited to, business affairs among bike-sharing companies, growth in the popularity of the services (i.e., whether bike sharing has become a societal trend), and especially fluctuations in the cost of the services. The data are a mix of numerical and categorical variables: the count of users on any given day, split into casual and registered users; the state of the weather (temperature, feeling temperature, humidity, wind speed, and weather situation, weathersit); and categorical variables describing the kind of day (weekday, holiday, season, and month). The data were collected in 2011 and 2012 in Washington, D.C.

The core data set is the two-year historical log for 2011 and 2012 from the Capital Bikeshare system, Washington, D.C., USA, which is publicly available at http://capitalbikeshare.com/system-data. The UCI Machine Learning Repository aggregated the data into hourly and daily datasets and added the corresponding weather and seasonal information. Weather information was extracted from http://www.freemeteo.com.

The essential goal of this study was to create a linear model, with constant variance and minimal residuals, that predicts the number of bike users on a given day.

Modeling & Analysis

The first step we took was to examine a scatterplot matrix in order to understand the correlations among the variables (A1).

[Figure: scatterplot matrix of the variables (A1)]

From here, we created an initial model with the count of users (cnt) as the response and all other variables in the dataset as regressors (A2). To assess the model, we first tested whether it violates the assumption of constant variance (A3). The non-constant variance test gives a p-value of 0.05701636, so at a significance level of .05 we narrowly fail to reject the hypothesis of constant variance. The residual plot shows no evident departure from linearity (A4).
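As an illustration of the kind of test used here, the following is a minimal base-R sketch of a score test for non-constant variance against the fitted values. The data are synthetic and the coefficients are made up for the example; the variable names only echo the dataset. It is a hand-rolled illustration of the idea, not the code of the ncvTest function called in the appendix.

```r
# Synthetic data only; none of these values come from the bike-sharing dataset.
set.seed(1)
n <- 200
temp <- runif(n)
hum <- runif(n)
cnt <- 2000 + 3000 * temp - 1300 * hum + rnorm(n, sd = 400)

fit <- lm(cnt ~ temp + hum)

# Score test of H0: constant error variance, against the alternative that
# the variance changes with the fitted values.
e <- resid(fit)
u <- e^2 / (sum(e^2) / n)        # squared residuals scaled by the ML variance estimate
aux <- lm(u ~ fitted(fit))       # auxiliary regression on the fitted values
score_stat <- sum((fitted(aux) - mean(u))^2) / 2   # half the regression sum of squares
p_value <- pchisq(score_stat, df = 1, lower.tail = FALSE)

c(statistic = score_stat, p.value = p_value)
```

A small p-value is evidence against constant variance; a p-value just above .05, as reported for the initial model, is a borderline result rather than strong support for the assumption.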

Next, we applied a logarithmic transformation to the response variable, by convention (A5). We tested once more for constant variance and, contrary to our expectations, this model was far from having constant variance (A6). The residual plot also indicated that this model is not linear (A7).

Understanding that neither of these is the best model, we used the AIC to determine which variables should be included. We ran AIC-based selection in the backward direction (A8). Fitting a linear model to the selected variables gives the following model (A9):

cnt = 1975.08
      + 424.48*season2 + 850.09*season3 + 1151.59*season4
      + 185.36*month2 + 354.96*month3 + 897.26*month4 + 1637.06*month5
      + 1337.41*month6 + 573.99*month7 + 699.64*month8 + 1125.10*month9
      + 960.06*month10 + 552.16*month11 + 495.29*month12
      - 386.34*holiday + 3084.11*temp - 1330.70*humidity - 2015.69*windspeed
      - 280.06*weathersit2 - 1596.43*weathersit3

We also tested this model for constant variance; the p-value is large enough that we fail to reject the null hypothesis at the .05 level (A10). With the assumptions of constant variance and normality met, we used the preceding model to predict the 2012 bicycle data.
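Backward selection by AIC can be sketched in base R with step(). The example below uses synthetic data with invented coefficients, and the regressor named noise is a hypothetical variable added only so that the procedure has a candidate to consider dropping; it is not part of the bike-sharing dataset.

```r
# Synthetic data only; coefficients and the 'noise' column are illustrative assumptions.
set.seed(2)
n <- 300
d <- data.frame(
  temp = runif(n),
  hum = runif(n),
  windspeed = runif(n),
  noise = rnorm(n)        # hypothetical regressor unrelated to the response
)
d$cnt <- 2000 + 3000 * d$temp - 1300 * d$hum - 2000 * d$windspeed +
  rnorm(n, sd = 300)

# Start from the full model and drop terms while doing so lowers the AIC.
full <- lm(cnt ~ temp + hum + windspeed + noise, data = d)
reduced <- step(full, direction = "backward", trace = 0)

formula(reduced)   # terms kept by the selection
AIC(full)
AIC(reduced)
```

Because each backward step is accepted only if it lowers the AIC, the selected model never has a higher AIC than the full model it started from.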

We found that, on average, our predictions were lower than the actual user counts (cnt) on any given day in 2012 (A11).

To explain this result, we hypothesized that casual and registered users respond differently to the same factors. For example, on an extremely cold day, a casual user may decide to take a car rather than use the bike-share service, whereas a registered user may decide to use the service despite the bad weather because they have already paid for their account. We also thought that advertising would have a different impact on casual and registered users. This led us to create separate models for casual and registered users in an attempt to obtain smaller residuals when predicting the 2012 data.

We started by creating a model for casual users only. Backward selection on all our variables gave the following model (A12):

casual = 1975.0791
      + 185.3567*month2 + 354.9600*month3 + 897.2602*month4 + 1637.0600*month5
      + 1337.4082*month6 + 573.9875*month7 + 699.6399*month8 + 1125.0984*month9
      + 960.0629*month10 + 552.1595*month11 + 495.2866*month12
      - 280.0560*weathersit2 - 1596.4329*weathersit3
      + 424.4753*season2 + 850.0882*season3 + 1151.5869*season4
      - 1330.7019*humidity - 2015.6888*windspeed - 386.3378*holiday + 3084.1052*temp

However, the backward selection model violates the assumptions of constant variance (A13) and linearity (A14). To address these violations, we applied the Box-Cox method and chose to raise the response variable to the power 0.4 (A15). This power is sensible because the inverse response plot showed a roughly square-root relationship between the number of casual users and the chosen regressors.

The model for casual users after the power transformation is (A16):

casual^0.4 = 1975.0791
      + 185.3567*month2 + 354.9600*month3 + 897.2602*month4 + 1637.0600*month5
      + 1337.4082*month6 + 573.9875*month7 + 699.6399*month8 + 1125.0984*month9
      + 960.0629*month10 + 552.1595*month11 + 495.2866*month12
      - 280.0560*weathersit2 - 1596.4329*weathersit3
      + 424.4753*season2 + 850.0882*season3 + 1151.5869*season4
      - 1330.7019*humidity - 2015.6888*windspeed - 386.3378*holiday + 3084.1052*temp

Furthermore, we checked the above model for linearity (A17) and non-constant variance (A18). Compared with the original backward-selected model for casual users (A14), the Box-Cox-transformed model is more nearly linear. The p-value of the non-constant variance test also rose from 3.42573E-05 in the original model (A13) to 0.006777438 in the transformed model (A18), indicating less severe non-constant variance, although the test still rejects constant variance at the .05 level. On both counts, the transformed model is the better model for casual users.

We tried to further improve the variance of the casual model by removing outliers. Using the outlier test, we removed two potential outliers and re-ran the transformed backward-selected model, but this did not improve the constancy of the variance. Therefore, we reverted to the transformed casual model above (A16) to predict the 2012 bicycle dataset.
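Predicting with a power-transformed response means the fitted values live on the transformed scale and must be raised to the inverse power to return to user counts. The sketch below illustrates this on synthetic data, with a single made-up regressor and invented coefficients; only the power 0.4 and the inverse power 5/2 are taken from the text and appendix.

```r
# Synthetic data only; the response is constructed so that casual^0.4 is linear in temp.
set.seed(3)
n <- 200
temp <- runif(n)
casual <- (4 + 6 * temp + rnorm(n, sd = 0.5))^(1 / 0.4)

train <- data.frame(temp = temp[1:100], casual = casual[1:100])
test  <- data.frame(temp = temp[101:200], casual = casual[101:200])

# Fit on the transformed scale.
fit_t <- lm(I(casual^0.4) ~ temp, data = train)

pred_t <- predict(fit_t, newdata = test)   # predictions of casual^0.4
pred   <- pred_t^(1 / 0.4)                 # back-transformed to counts (power 5/2)

mean(test$casual^0.4 - pred_t)             # mean residual on the transformed scale
mean(test$casual - pred)                   # mean residual on the original scale
```

The two mean residuals are on different scales and are not directly comparable, which is why a small value on the transformed scale can correspond to a much larger one in user counts.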

On the original scale, the mean residual of our 2012 predictions of casual users is approximately 300; on the transformed scale, it is approximately 1.92 (A19). Although these predictions are not ideal, we set the casual model aside for the moment and turned to the registered users, to see whether that group behaves better and, possibly, to find out why our predictions have large residuals.

In addition to the casual model, we also created a model for registered users. Initially, we put all the variables into a backward selection algorithm to decide which variables are most significant (A20). Fitting a linear model to the selected variables yields the following model (A21):

registered = 1071.5075
      + 380.9067*season2 + 736.0770*season3 + 1135.6424*season4
      + 131.9103*month2 + 129.0006*month3 + 534.9357*month4 + 1220.9336*month5
      + 1121.6707*month6 + 404.6830*month7 + 611.2622*month8 + 891.2946*month9
      + 563.7859*month10 + 342.5887*month11 + 457.4749*month12
      - 853.9828*holiday
      + 714.6518*weekday1 + 816.7486*weekday2 + 803.8267*weekday3
      + 771.9176*weekday4 + 728.6853*weekday5 + 119.0951*weekday6
      - 207.3430*weathersit2 - 1335.0019*weathersit3
      + 1952.3536*atemp - 906.3608*hum - 961.8719*windspeed

The test for non-constant variance yielded a p-value of 0.7760796, so we conclude that the model has constant variance (A22). In addition, the model satisfies the linearity assumption (A23).

The mean of the residuals from this model is 1765 (A24). Like the casual model, the registered model underestimates the 2012 counts. Before considering any transformations to fix the underestimation in our models, we took a second look at the data to see whether there was another cause. We noticed that the numbers of both registered and casual users in 2012 seem to be much larger than those in 2011, so we calculated the average numbers of registered and casual users in both years. We found that, on average, there were 342 more casual users and 1859 more registered users per day in 2012 than in 2011 (A25). At the same time, temperature, humidity, and weather situation did not change significantly overall (and month, day of the week, and holidays obviously do not change either). Therefore, we have strong evidence that these increases are not due to any of the variables available to us in the dataset, but to other factors for which we have no information, such as the increasing popularity of the system or advertising. To capture these increases, we applied a mean shift to the model for registered users. In other words, our model for registered users has now become (A24):

registered = 1071.5075
      + 380.9067*season2 + 736.0770*season3 + 1135.6424*season4
      + 131.9103*month2 + 129.0006*month3 + 534.9357*month4 + 1220.9336*month5
      + 1121.6707*month6 + 404.6830*month7 + 611.2622*month8 + 891.2946*month9
      + 563.7859*month10 + 342.5887*month11 + 457.4749*month12
      - 853.9828*holiday
      + 714.6518*weekday1 + 816.7486*weekday2 + 803.8267*weekday3
      + 771.9176*weekday4 + 728.6853*weekday5 + 119.0951*weekday6
      - 207.3430*weathersit2 - 1335.0019*weathersit3
      + 1952.3536*atemp - 906.3608*hum - 961.8719*windspeed
      + 1764.549*year

In the model above, we added a "year" variable whose coefficient is the mean residual of our predicted values for the 2012 data. However, we decided against applying a similar mean shift to the casual data because our casual users model has a transformed response variable. On the transformed scale, a mean shift is harder to interpret and less useful for prediction.

Prediction

From the models constructed using 2011 data, we were able to explain a fair amount of the variability in both registered and casual users of the Capital Bikeshare system in 2012 (R^2 of around .66 in both cases). The casual user model is biased because it underestimates the number of users in 2012. The underestimation could be attributable to many factors, including bicycle trends and advertising, that are not included in our dataset. However, the variance of the casual user model is rather small, with an MSE of 8.78 on the casual^0.4 scale. Our registered user model is unbiased after the mean shift, with a mean residual of essentially zero. However, because of the large number of registered users (and thus large fluctuations in the data), our estimates of registered users in 2012 have large variance, with an MSE of 754653. Overall, the worst predictions for both models occurred around holidays, when either very many or very few people were using bikes, and in extreme weather conditions (such as when Hurricane Sandy hit in October 2012), when very few users, if any, were using the bike system. Nonetheless, our models predicted well overall (A26).

Discussion

One should note that the mean shift we applied to our registered model is a special case for this project. Here we had the luxury of observing the 2012 data, knowing about the average increase, and therefore being able to adjust our model accordingly. In most real-life situations, however, we would be using the data we have to build a model that predicts future outcomes, and we would not know the future values of the response variable ahead of time. Therefore, we need to be especially careful when we build these models, gathering as much information as possible to maximize our chance of capturing all the relevant predictors. Furthermore, for datasets that are likely to see an increase in values (both predictor and response), we should monitor the data closely and update the model frequently and promptly as new information arrives. Finally, for data that show a strong and clear trend or pattern over time, other statistical techniques such as time series modeling would be more appropriate and would result in better predictions.
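The mean shift applied to the registered model, and the reason it depends on having already observed the second year, can be illustrated with a minimal sketch on synthetic data. The helper make_year, the single regressor, and all numeric values below are assumptions made for the example, not quantities from the bike-sharing dataset.

```r
# Synthetic data only; make_year() is a hypothetical helper for this illustration.
set.seed(4)
n <- 365
make_year <- function(shift) {
  atemp <- runif(n)
  data.frame(atemp = atemp,
             registered = 1000 + 2000 * atemp + shift + rnorm(n, sd = 300))
}
year1 <- make_year(shift = 0)      # plays the role of the training year
year2 <- make_year(shift = 1800)   # second year with an unexplained level increase

fit <- lm(registered ~ atemp, data = year1)
pred <- predict(fit, newdata = year2)

# The shift is estimated from the second year's own outcomes,
# so it is not available when forecasting a truly unseen year.
shift_hat <- mean(year2$registered - pred)
pred_shifted <- pred + shift_hat

mean(year2$registered - pred_shifted)    # zero up to rounding, by construction
mse_before <- mean((year2$registered - pred)^2)
mse_after  <- mean((year2$registered - pred_shifted)^2)
c(before = mse_before, after = mse_after)
```

The shifted predictions are unbiased on the second year only because the shift was computed from that same year, which is the caveat raised at the start of this discussion.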

Appendix

# Assumption: ncvTest() and invResPlot() are taken from the car package.
library(car)

1:
# Scatterplot matrix of the response and all candidate regressors
pairs(~ cnt + season + mnth + holiday + weekday + workingday + weathersit + temp + atemp + hum + windspeed)

2:
# Initial model with all variables
lm1 <- lm(cnt ~ factor(season) + factor(mnth) + holiday + factor(weekday) + workingday + factor(weathersit) + temp + atemp + hum + windspeed)

3:
ncvTest(lm1)

4:
plot(TestingSet$cnt, resid(lm1))

5:
# Log-transformed response
logcnt <- log(cnt)
lm2 <- lm(logcnt ~ factor(season) + factor(mnth) + holiday + factor(weekday) + workingday + factor(weathersit) + temp + atemp + hum + windspeed)
summary(lm2)

6:
ncvTest(lm2)

7:
plot(TestingSet$cnt, resid(lm2))

8:
# AIC-based selection, forward and backward
starting.model <- lm(cnt ~ 1, data = TestingSet)
step(starting.model, scope = ~ factor(season) + factor(mnth) + holiday + factor(weekday) + workingday + factor(weathersit) + temp + atemp + hum + windspeed, direction = "forward")
backward.model <- step(lm1, scope = ~ 1, direction = "backward")

9:
summary(backward.model)

10:
ncvTest(backward.model)

11:
# Predict the 2012 counts and inspect the residuals
fit1 <- predict(backward.model, TestingSet)
residuals1 <- TestingSet$cnt - fit1
plot(residuals1 ~ TestingSet$instant)
mean(residuals1)

12:
# Backward selection for casual users
starting.casual1 <- lm(casual ~ factor(season) + factor(mnth) + holiday + factor(weekday) + workingday + factor(weathersit) + temp + atemp + hum + windspeed)
step(starting.casual1, scope = ~ 1, direction = "backward")
backwardCasual <- lm(casual ~ factor(mnth) + holiday + factor(weekday) + factor(weathersit) + temp + hum + windspeed)

13:
ncvTest(backwardCasual)

14:
plot(backwardCasual)

15:
invResPlot(backwardCasual)

16:
# Casual model with the response raised to the power 0.4
backwardCasual3 <- lm((casual)^0.4 ~ factor(mnth) + holiday + factor(weekday) + factor(weathersit) + temp + hum + windspeed)

17:
plot(backwardCasual3)

18:
ncvTest(backwardCasual3)

19:
# Residuals on the transformed scale, then back-transformed to the original scale
fitTCasual <- predict(backwardCasual3, TestingSet)
residualTCasual <- (TestingSet$casual)^0.4 - fitTCasual
mean(residualTCasual)
fitCasual <- (fitTCasual)^(5/2)
residualCasual <- TestingSet$casual - fitCasual
mean(residualCasual)

20:
# Backward selection for registered users
starting.registered1 <- lm(registered ~ factor(season) + factor(mnth) + holiday + factor(weekday) + workingday + factor(weathersit) + temp + atemp + hum + windspeed)
step(starting.registered1, scope = ~ 1, direction = "backward")

21:
backwardRegistered <- lm(registered ~ factor(season) + factor(mnth) + holiday + factor(weekday) + factor(weathersit) + atemp + hum + windspeed)
summary(backwardRegistered)

22:
ncvTest(backwardRegistered)

23:
plot(backwardRegistered)

24:
fitregistered <- predict(backwardRegistered, TestingSet)
residualRegistered <- TestingSet$registered - fitregistered
mean(residualRegistered)

25:
# Mean increase in users from 2011 (TrainingSet) to 2012 (TestingSet)
mean(TestingSet$casual) - mean(TrainingSet$casual)
mean(TestingSet$registered) - mean(TrainingSet$registered)

26:
# MSE and R^2 of the 2012 predictions (casual on the transformed scale)
RSSCasual <- sum(((TestingSet$casual)^0.4 - fitTCasual)^2)
MSECasual <- RSSCasual / 341
SYYCasual <- sum(((TestingSet$casual)^0.4 - mean(TestingSet$casual)^0.4)^2)
SSRegCasual <- SYYCasual - RSSCasual
R2Casual <- SSRegCasual / SYYCasual
RSSRegistered <- sum((TestingSet$registered - fitregistered)^2)
MSERegistered <- RSSRegistered / 337
SYYRegistered <- sum((TestingSet$registered - mean(TestingSet$registered))^2)
SSRegRegistered <- SYYRegistered - RSSRegistered
R2Registered <- SSRegRegistered / SYYRegistered
