Eun Seuk Choi (Eric)
Statistical Methods & Data Analytics
Final Project
Professor Alan Huebner
December 10, 2015
<Analysis of NBA Real Plus-Minus for the 2014-2015 Regular Season>
Table of Contents
1. Introduction
a. Describe data
b. About variables
c. Purpose of analysis
2. Data
a. More details about data
b. The source of data
3. Regression Analysis
a. Exploratory data analysis
i. Scatterplots of each of X variables vs. Y variable
ii. Most highly correlated X variables
b. Linear Regression Analysis
i. Fit a full model and report the R^2
ii. Conduct one F-test to test for the removal of a subset of variables
iii. Use stepAIC()
iv. Find outliers
v. Choose the “final” model
vi. Perform model diagnostics (a residuals vs. fitted plot and a normal plot)
vii. Validate the model by cross validation or bootstrapping
4. Results
a. Three inferences about the final model and importance of each inference
i. A confidence interval for a fitted value
ii. A prediction interval for a fitted value
iii. A confidence interval for one or more slope parameters
5. Conclusion
a. How well the model describes Y variable
b. Factors that can improve the predictive power of the model
1. Introduction
a. Describe data
The file “NBA real plus-minus for 2014-2015 regular seasons” contains data
extracted from ESPN.com on each NBA player’s influence on his team’s wins, based on
the number of games he played during the season, his minutes per game, his on-court
impact on team offensive performance, and his on-court impact on team defensive
performance. The data consists of 474 NBA players who played at least one game during
the 2014-2015 regular season.
b. About variables
There are 5 variables in total: GP, M, ORPM, DRPM, and WINS. WINS is the
response variable; the other four are predictors. GP is the number of games played
(out of 82) in the 2014-2015 regular season. M is minutes per game for each player.
ORPM is a player’s estimated on-court impact on his team’s offensive performance,
measured in points scored per 100 offensive possessions, while DRPM is his estimated
on-court impact on his team’s defensive performance. WINS estimates the number of
wins each player has contributed to his team’s win total for the season; it is based on
the player’s Real Plus-Minus and the number of possessions he played.
c. Purpose of analysis
By interpreting the results of a linear regression on these 5 variables (WINS as the
response and the other four as predictors), I want to identify the primary factors that
positively affect WINS. I will find the optimal model for predicting WINS by conducting
an F-test for removing a subset of variables from the model, identifying outliers in the
data, performing model diagnostics on my final model, and validating it using cross
validation. Based on these results, I will make inferences pertinent to my topic using a
confidence interval for a fitted value, a prediction interval, and a confidence interval for
slope parameters. After evaluating my final model, I will finish this project by suggesting
ways to improve its predictive power.
2. Data
a. More details about data
The data was extracted from the ESPN.com website. The original data includes 6
variables, namely the 5 variables mentioned above plus RPM. I excluded RPM because it
is simply ORPM+DRPM; since RPM is perfectly correlated with ORPM+DRPM, there is no
need to include it in my model.
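As a quick sanity check (not part of the original write-up), this redundancy could be verified on the unfiltered download, here assumed to be a hypothetical data frame raw.RPM that still contains the RPM column:

```r
# Hypothetical check, assuming a data frame raw.RPM that still has the RPM column:
# RPM should equal ORPM + DRPM up to rounding in the published figures.
all.equal(raw.RPM$RPM, raw.RPM$ORPM + raw.RPM$DRPM, tolerance = 0.01)
```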
b. The source of data
The underlying source of the data is Basketball-Reference.com, which provided
play-by-play data to ESPN; data analysts at ESPN assembled this play-by-play data to
construct the ORPM and DRPM figures for the 2014-2015 regular season using their own
methods. According to ESPN, the ORPM and DRPM model sifts through more than 230,000
possessions each NBA season to tease apart the real plus-minus effects attributable to
each player.
3. Regression Analysis
a. Exploratory data analysis
i. Scatterplots of each of X variables vs. Y variable
RPM1<-read.table("NBARPM.txt",header=T)
attach(RPM1)
plot(GP,WINS)
plot(M,WINS)
plot(ORPM,WINS)
[Figures: scatterplots of GP vs. WINS, M vs. WINS, and ORPM vs. WINS]
ii. Most highly correlated X variables
cor(cbind(GP,M,ORPM,DRPM))
According to the correlation matrix, GP and M are the most highly correlated predictors
(cor = 0.66).
b. Linear Regression Analysis
i. Fit a full model and report the R^2
mod.RPM<-lm(WINS~GP+M+ORPM+DRPM)
summary(mod.RPM)
R^2 = 0.8575, Adjusted R^2 = 0.8563
ii. Conduct one F-test to test for the removal of a subset of variables
Taking mod.RPM as the full model, I want to find out whether the set of three
variables M, ORPM, and DRPM can be removed from the model by conducting an F-test
for comparing nested models.
mod.reduced<-lm(WINS~GP)
summary(mod.reduced)
SSE.r<-sum(mod.reduced$residuals^2)
SSE.c<-sum(mod.RPM$residuals^2)
Fstat<-((SSE.r-SSE.c)/3)/(SSE.c/(474-5))  #3 parameters dropped; 469 residual df
#Fstat = 763.4384
pf(Fstat,3,469,lower.tail=FALSE) # very small p-value
Given the very small p-value, we reject the null hypothesis that the three coefficients are
zero; the group of 3 predictors cannot be removed from the model.
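The same nested-model F-test is available in base R via anova(), which avoids the manual SSE arithmetic; shown here as a supplementary check using the two models fit above:

```r
# Equivalent nested-model F-test with base R's anova();
# mod.reduced and mod.RPM are the models defined above.
anova(mod.reduced, mod.RPM)
# The "F" column reproduces the hand-computed statistic (about 763.4)
# and "Pr(>F)" gives the p-value directly.
```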
iii. Use stepAIC()
library(MASS)
optimal.bp <- stepAIC(mod.RPM)
optimal.bp$anova
Initial Model : WINS~GP+M+ORPM+DRPM
Final Model : WINS~GP+M+ORPM+DRPM
iv. Find outliers
rstandard(mod.RPM)
I found that two players, Draymond Green (observation 121) and Stephen Curry
(observation 421), have standardized residuals greater than 3. They are outliers.
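The outliers can also be flagged programmatically rather than by scanning the full rstandard() output; a short sketch using the model fit above:

```r
# Indices of observations whose standardized residual exceeds 3 in absolute value
which(abs(rstandard(mod.RPM)) > 3)
```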
v. Choose the “final” model
I chose the initial model (WINS~GP+M+ORPM+DRPM) as the final model since it
has the highest adjusted R^2 among the candidate models:
Adjusted R^2 for WINS~M+GP+ORPM+DRPM = 0.8563
Adjusted R^2 for WINS~M+GP+ORPM = 0.6335
Adjusted R^2 for WINS~M+GP+DRPM = 0.5572
Adjusted R^2 for WINS~GP+ORPM+DRPM = 0.8502, and so on.
In addition, according to stepAIC(), the initial model is the optimal model for this data.
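The adjusted R^2 values quoted above can be pulled directly from summary() rather than read off the printed output; a sketch using the variables attached from RPM1:

```r
# Extract adjusted R^2 for each candidate model via summary()$adj.r.squared
summary(lm(WINS~M+GP+ORPM+DRPM))$adj.r.squared
summary(lm(WINS~M+GP+ORPM))$adj.r.squared
summary(lm(WINS~M+GP+DRPM))$adj.r.squared
summary(lm(WINS~GP+ORPM+DRPM))$adj.r.squared
```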
vi. Perform model diagnostics (a residuals vs. fitted plot and a normal plot)
plot(mod.RPM$fitted.values,mod.RPM$residuals)
[Figures: fitted values vs. residuals for mod.RPM and for mod.RPM3]
Since the residuals vs. fitted plot for mod.RPM does not show a random pattern, I
modified the model based on plot(GP,WINS) and plot(M,WINS). Because those two
scatterplots show a quadratic pattern, I tried GP^2 and M^2 in a new model.
GP1<-GP^2
M1<-M^2
mod.RPM3<-lm(WINS~(GP1)+(M1)+ORPM+DRPM)
summary(mod.RPM3)
plot(mod.RPM3$fitted.values,mod.RPM3$residuals)
However, I obtained a similar plot, meaning that the assumptions of linearity and
constant error variance might not hold for my model. I also tried quadratic, log, and
exponential transformations of my predictors, but I could not obtain a better plot than
the original model’s. Therefore, I decided to stick with my original model.
qqnorm(mod.RPM$residuals)
On the other hand, the normal Q-Q plot from qqnorm(mod.RPM$residuals) is
approximately a straight line, which indicates that the residuals are approximately normal.
hist(mod.RPM$residuals)
In addition, the histogram of the residuals is approximately bell-shaped, supporting
the claim that the residuals are normal.
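As a supplementary check not in the original analysis, a Shapiro-Wilk test gives a formal test of residual normality (with n = 474 the test is quite sensitive, so a small p-value would not necessarily contradict the plots):

```r
# Shapiro-Wilk test of normality on the residuals of the final model
shapiro.test(mod.RPM$residuals)
```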
vii. Validate the model by cross validation
Using cross validation (code attached in the Appendix), rsquared.Group2 = 0.851 and
rsquared.Group1 = 0.845. Since the mean of these two values, 0.848, is close to the
R^2 = 0.8575 of the final model, I concluded that the model is valid.
4. Results
a. Three inferences about the final model and importance of each inference
i. A confidence interval for a fitted value
I chose to compute a 95% confidence interval for the mean WINS of all players who
have ORPM=0, DRPM=0, M=20.43, and GP=54.29.
predict(mod.RPM,data.frame(ORPM=0,DRPM=0,M=20.43,GP=54.29),interval="confidence",level=0.95)
The result shows that the mean WINS for all players with ORPM=0, DRPM=0,
M=20.43, and GP=54.29 falls within [2.705, 2.968] with 95% confidence. I chose
ORPM=0 and DRPM=0 because these values are greater than mean(ORPM) = -0.646 and
mean(DRPM) = -0.278, and because ORPM=DRPM=0 is where a player breaks even in his
offensive and defensive contribution to the team. The means mean(M) = 20.43 and
mean(GP) = 54.29 were chosen for M and GP so that the interval isolates the effect of the
ORPM and DRPM values. I conclude that the mean WINS for all players with ORPM=0,
DRPM=0, and average GP and M (players who perform better than average on both ends
of the floor) falls within [2.705, 2.968].
ii. A prediction interval for a fitted value
I chose to compute a 95% prediction interval for the WINS of a “new” player who
has ORPM=0, DRPM=0, M=20.43, and GP=54.29.
predict(mod.RPM,data.frame(ORPM=0,DRPM=0,M=20.43,GP=54.29),interval="prediction",level=0.95)
The result indicates that the WINS of a “new” player having ORPM=0, DRPM=0,
M=20.43, and GP=54.29 falls within [0.183, 5.490] with 95% confidence. I chose
ORPM=0, DRPM=0, M=20.43, and GP=54.29 for the same reasons as for the confidence
interval above. I conclude that the WINS of a new player with ORPM=0, DRPM=0, and
average GP and M (a player who performs better than average on both ends of the floor)
falls within [0.183, 5.490]. As expected, the prediction interval is much wider than the
confidence interval, since it accounts for the variability of an individual player’s WINS
rather than of a mean.
iii. A confidence interval for one or more slope parameters
I chose to compute a 95% confidence interval for the slope of the ORPM variable,
using the estimate and standard error from summary(mod.RPM):
Lower = 1.142995 - 1.96*0.03653 = 1.071396
Upper = 1.142995 + 1.96*0.03653 = 1.214594
Therefore, I am 95% confident that the slope of ORPM falls within [1.071396,
1.214594]. Since this interval does not contain 0, I conclude that ORPM is a significant
predictor in this model. This is also confirmed by the low p-value for the ORPM
coefficient.
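The same interval can be obtained with base R’s confint(), which uses the exact t critical value (df = 469) rather than the normal approximation 1.96, so it comes out very slightly wider:

```r
# 95% confidence interval for the ORPM slope in the final model
confint(mod.RPM, "ORPM", level = 0.95)
```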
5. Conclusion
a. How well the model describes the Y variable
In general, my model describes the response variable (WINS) satisfactorily, as both
R^2 and adjusted R^2 are about 0.85. The results are also consistent with my intuition:
WINS increases as GP, M, ORPM, and DRPM increase, and the increase in WINS is most
strongly driven by ORPM and DRPM, which have larger slopes than GP and M.
b. Factors that can improve the predictive power of the model
It would be better if I could find a model with a random pattern in its fitted values
vs. residuals plot. I transformed some of my predictor variables in search of a better fit,
but I could not find a model better than my final one.
Additionally, if I had PER (Player Efficiency Rating) as one of my predictor variables,
the predictive power of the model might have increased, since PER also correlates
positively with WINS. If PER turned out not to be strongly correlated with my original
predictors, adding it as a variable would let me better interpret how each player’s
performance affects WINS.
<Appendix>
#attach data
RPM1<-read.table("NBARPM.txt",header=T)
attach(RPM1)
#scatterplots of each of predictor variables
plot(GP,WINS)
plot(M,WINS)
plot(ORPM,WINS)
#correlation matrix among predictor variables
cor(cbind(GP,M,ORPM,DRPM))
#Fit full model using all X’s and report R^2
mod.RPM<-lm(WINS~GP+M+ORPM+DRPM)
summary(mod.RPM)
#Use a reduced model to conduct F-test
mod.reduced<-lm(WINS~GP)
summary(mod.reduced)
SSE.r<-sum(mod.reduced$residuals^2)
SSE.c<-sum(mod.RPM$residuals^2)
Fstat<-((SSE.r-SSE.c)/3)/(SSE.c/(474-5))
Fstat
pf(Fstat,3,469,lower.tail=FALSE)
#Note: do not name the statistic "F": that masks R's shorthand F for FALSE, so
#lower.tail=F would pass the (truthy) statistic instead and pf() wrongly returns ~1.
#With lower.tail=FALSE the true p-value, about 3.28e-180, is obtained.
#Use StepAIC to find final (optimal) model
library(MASS)
optimal.bp <- stepAIC(mod.RPM)
optimal.bp$anova # display results
#Find out outliers
rstandard(mod.RPM)
#model diagnostics (fitted values vs residuals and normal plot)
plot(mod.RPM$fitted.values,mod.RPM$residuals)
qqnorm(mod.RPM$residuals)
hist(mod.RPM$residuals)
#model diagnostics with changed model (fitted values vs residuals and normal plot)
GP1<-GP^2
M1<-M^2
mod.RPM3<-lm(WINS~(GP1)+(M1)+ORPM+DRPM)
summary(mod.RPM3)
plot(mod.RPM3$fitted.values,mod.RPM3$residuals)
qqnorm(mod.RPM3$residuals)
#validate the final model by using cross validation
set.seed(5)
#obtain total sample size and split the data into two random halves
n<-dim(RPM1)[1]
Group1.index<-sample(1:n,round(n/2),replace=FALSE)
Group2.index<-setdiff(1:n,Group1.index)
Group1<-RPM1[Group1.index,]
Group2<-RPM1[Group2.index,]
#Fit a linear model on Group1 and a separate one on Group2
mod.Group1<-lm(WINS~GP+M+ORPM+DRPM,data=Group1)
mod.Group2<-lm(WINS~GP+M+ORPM+DRPM,data=Group2)
###Compute fitted values on Group2 using the model fit on Group1
fitted.Group2<-NULL
for (i in 1:dim(Group2)[1]){
  fitted.Group2<-c(fitted.Group2,
    mod.Group1$coef[1]
    +mod.Group1$coef[2]*Group2$GP[i]
    +mod.Group1$coef[3]*Group2$M[i]
    +mod.Group1$coef[4]*Group2$ORPM[i]
    +mod.Group1$coef[5]*Group2$DRPM[i])
}
##Compute R^2 comparing these Group2 fitted values to the Group2 y's, using 1-(SSE/SSTo)
rsquared.Group2<-1-sum((Group2$WINS-fitted.Group2)^2)/sum((Group2$WINS-mean(Group2$WINS))^2)
rsquared.Group2
###Compute fitted values on Group1 using the model fit on Group2
fitted.Group1<-NULL
for (i in 1:dim(Group1)[1]){
  fitted.Group1<-c(fitted.Group1,
    mod.Group2$coef[1]
    +mod.Group2$coef[2]*Group1$GP[i]
    +mod.Group2$coef[3]*Group1$M[i]
    +mod.Group2$coef[4]*Group1$ORPM[i]
    +mod.Group2$coef[5]*Group1$DRPM[i])
}
##Compute R^2 comparing these Group1 fitted values to the Group1 y's, using 1-(SSE/SSTo)
rsquared.Group1<-1-sum((Group1$WINS-fitted.Group1)^2)/sum((Group1$WINS-mean(Group1$WINS))^2)
rsquared.Group1
###Compute the mean of both R^2 values
mean(c(rsquared.Group2,rsquared.Group1))
#A confidence interval for a fitted value
predict(mod.RPM,data.frame(ORPM=0,DRPM=0,M=20.43,GP=54.29),interval="confidence",level=0.95)
#A prediction interval for a fitted value
predict(mod.RPM,data.frame(ORPM=0,DRPM=0,M=20.43,GP=54.29),interval="prediction",level=0.95)
# A confidence interval for one or more slope parameters is calculated manually
# from the estimate and standard error shown in summary(mod.RPM)