NBA Total Rebound Prediction

December 4, 2015

Carter, Jon; Espinola, David; Haddock, Sierra

Table Of Contents

Introduction

Descriptive Statistics

Variable Pre-Processing

Residual Analysis

Model Exploration

Model Description

Conclusion and Additional Exploration

Appendix

Introduction: We are all Golden State Warriors fans, so we decided to analyze NBA data for our group project. Our data was taken from the pre-draft measurements of NBA players who were drafted into the league within the past five years. Our data collection began with the 2010 pre-draft measurements and athletic testing measurements, and ended with the 2014 pre-draft measurements and athletic testing measurements. We did not use the 2015 pre-draft measurements because that season is still in progress, and those players have not yet played enough minutes to give reliable rebounding figures, which could skew our data. The goal of our data analysis is to predict NBA player rebounds per 36 minutes from African heritage (categorical), height (inches), weight (pounds), wingspan (inches), standing reach (inches), body fat (percent), no step vertical (inches), no step jump height (inches), maximum vertical (inches), maximum jump height (inches), lane agility drill (seconds), 3/4 court sprint (seconds), hand length (inches), hand width (inches), players who play point guard (binary), players who play shooting guard (binary), players who play small forward (binary), players who play power forward (binary), and players who play center (binary). For clarity, throughout this report whenever we refer to “total rebounds” or “rebounds,” we are referring to total rebounds per 36 minutes of playing time, a standard normalization that roughly matches the minutes a starter plays in an NBA game. We believe that height and wingspan will have a high positive correlation with rebounds because taller players are more likely to reach the ball. We believe that weight will have a positive correlation with rebounds because heavier players have the ability to push other players out of the way. We believe that African heritage may not be a necessary factor in our predictions, but we are interested to see if running different models will prove otherwise. We believe that our binary measurements of position will be correlated with rebounds (with the direction depending on the position) because players who are placed closer to the basket can get to the ball quicker than players positioned near the 3-point line. We are interested to see which predictor variables are necessary to our model, which may or may not interact, and which could be correlated with one another.

Descriptive Statistics: Below is a scatterplot matrix of all of our variables, as well as individual scatterplots of the different variables. The variables that stand out at this stage are height, weight, wingspan, standing vertical height, hand size, and whether or not the player can play the position of center. These variables stand out because of our previous knowledge of the NBA, larger R-squared values compared to the other variables, and less variance around the prediction line compared to the other variables. The variable of age seemed to have next to no predictive ability since our data is from pre-draft measurements and athletic testing, so all players were around the same age when they were tested. We go more in depth about our findings in the individual scatterplots below.

Looking at the graph between Height and Rebounds per 36 minutes we can see that there is a fairly strong correlation between the two. The R-squared value tells us that 55.77% of the variation in total rebounds per 36 minutes can be explained by the variable height. The points also appear a little curved, which suggests that we should try transforming the variable.

Looking at weight and rebounds, we can see that there is a fairly strong linear correlation. The R-squared tells us that 62.79% of the variation in rebounds is explained by the variable of weight. Looking at this plot, there may be some outliers that could be influencing the line.

Wingspan and rebounds also have a fairly strong correlation. The r-squared value tells us that 66.82% of variation in rebounds is explained by the variable of wingspan. There may be an outlier influencing the output at the bottom of the plot.

Regressing rebounds on standing reach shows a linear relationship between the two, with tails toward the lower and upper ends of standing reach that almost indicate a quadratic relationship. We plan to further explore this curvature. The r-squared value tells us that 62.97% of the variation in rebounds is explained by the variable standing reach.

When looking at the scatterplot, there appears to be a weak correlation between body fat and rebounds. This can also be exemplified by the small r-squared value of 3.6%. There appears to be a major outlier near the end of the x-axis.

No step vertical compared to rebounds looks very weakly correlated, if at all correlated. This is shown in the r-squared telling us that a mere 7.1% of the variation in rebounds is explained by the no step vertical. This is interesting because we originally anticipated that there would be a high positive correlation since we assumed that the higher you can jump, the more likely you will be to rebound.

No step jump appears to have a relatively strong correlation with rebounds. The r-squared value tells us that 51.03% of the variation in rebounds is explained by the variable of no step jump. The no step jump height measurement takes into account a player’s vertical, wingspan, and height - hence why it is correlated - while no step vertical just includes the vertical measurement and appears less correlated.

The max vertical scatterplot shows a slight negative linear association. The r-squared value tells us that 21.61% of the variation in rebounds is explained by max vertical, which is not a very large value. There may be some outliers and large residuals that we should investigate.

The max jump height appears to have a positive linear association with rebounds when looking at the scatterplot. There appears to be one major outlier influencing the data near the end of the x-axis. The r-squared value tells us 33.00% of the variation in rebounds is explained by max jump height. We were surprised by this small r-squared value, but we think it will change once we address this major outlier.

The lane agility drill scatterplot looks incredibly scattered and only slightly positively correlated, if at all. The r-squared value tells us that 16.97% of the variation in rebounds is explained by the lane agility drill variable. We believe that this r-squared value is so small because speed is not always necessary when grabbing a rebound even though it could sometimes help.

The ¾ court sprint also does not appear to have a strong linear correlation with rebounds. The R-squared value tells us that only 13.38% of the variation in rebounds is explained by ¾ court sprint. This makes sense because we believe that speed is not necessarily indicative of rebounding ability.

Hand length and rebounds are a little harder to analyze since many players share the same hand length. There does not appear to be much of a positive linear association with this variable. The points are scattered and there is not much of a trend. The r-squared value tells us that only 35.66% of the variation in rebounds is explained by hand length.

Hand width and rebounds are also very weakly correlated. This can be seen in the R-squared value telling us that 20.02% of the variation in rebounds can be explained by hand width. The data is scattered even though there is a slight positive trend.

Looking at the scatterplot of NBA player race and rebounds, we can see that there is a wide range between the differing races. This graph is a little more difficult to analyze, but the small R-squared value tells us that only a mere 2.21% of the variation in rebounds is explained by race, which leads us to believe that NBA player race is not as critical as other variables when predicting rebounds.

From this plot of point guards and rebounds, we see that the point guard column is not spread and they do not get as many rebounds compared to other positions. The R-squared value is also small and explains only 23.42% of the variation in rebounds from this position.

This is a plot of shooting guards and rebounds, and it gives us results similar to those for point guards. The r-squared value tells us that 27.42% of the variation in rebounds is explained by this position.

By looking at this graph we can see that the small forward variable is also not highly correlated with rebounds. In fact, the minuscule R-squared value tells us that only 2.46% of the variation in rebounds is explained by this position variable.

The power forward position also has a fairly low R-squared value, which tells us that only 28.72% of the variation in rebounds is explained by this position variable. However, the data is spread out among power forwards, and there are some players with high rebounds and some with low rebounds.

The center position has the highest R-squared value among all the positions, which tells us that 47.18% of the variation in rebounds is explained by this position variable. This is likely because centers tend to have large values for many of our most highly correlated variables, such as wingspan and height.
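Each single-predictor R-squared quoted in this section is the squared Pearson correlation between that predictor and rebounds. The following is a minimal sketch of that calculation in Python; the function name and the sample numbers are made up for illustration and are not taken from our data set.

```python
def r_squared(x, y):
    """R-squared of a simple linear regression of y on x (squared Pearson r)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy ** 2) / (sxx * syy)


# Hypothetical wingspans (inches) and rebounds per 36 minutes, illustration only.
wingspan = [76.0, 79.5, 82.0, 84.5, 87.0, 89.0]
rebounds = [3.1, 4.0, 5.2, 6.9, 7.4, 9.8]
print("Share of variation explained:", round(r_squared(wingspan, rebounds), 3))
```

A value near 1 means the points sit close to the fitted line, while a value near 0 means the predictor explains almost none of the variation on its own.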

Variable Pre-Processing: We gathered 19 different variables in our search for the best model. Some variables were more promising (height, wingspan, etc.) than others (hand width, lane agility drill, etc.). Using a best subset search, we were able to minimize BIC to approximately 360 for the model. In doing this, we eliminated 9 variables. Two of these variables were categorical, African heritage and players who can play PG (point guard), and the other seven variables were standing reach, no step vertical, no step jump height, lane agility, ¾ court sprint, hand length, and hand width. Some of these dropped variables were not a surprise to us since we had addressed concerns about some of them while examining their scatterplots; however, we expected standing reach, no step vertical, no step jump height, and ¾ court sprint to be correlated with rebounds. What we failed to realize was that these variables would also be correlated with the other explanatory variables. Through initial exploration we found that many of the variables had high VIFs, which signals that multicollinearity is present. There were 6 variables with VIFs higher than 4, and this is excluding the zeroed variables. This is why the best subset dropped four promising variables. For example, standing reach is highly correlated with wingspan; however, wingspan was initially more promising, with an R-squared of 66% versus 62%. The same effect can be seen in the other three variables that were removed (no step vertical, no step jump height, and ¾ court sprint).
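The VIF of a predictor is 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing that predictor on all of the other predictors. The short Python sketch below (the function name is our own illustration, not output from the statistics package) shows why a cutoff of 4 corresponds to a predictor that is 75% explained by the others.

```python
def vif(r2_j):
    """Variance inflation factor from R_j^2, the R-squared obtained by
    regressing predictor j on all of the other predictors."""
    return 1.0 / (1.0 - r2_j)


# A predictor that is 75% explained by the other predictors sits at the cutoff.
print(vif(0.75))
# Illustrative values only: the VIF grows quickly as R_j^2 approaches 1.
for r2 in (0.50, 0.75, 0.90, 0.95):
    print(r2, round(vif(r2), 2))
```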

Above-Parameter estimates for the 19 variables

Above- Best subsets for the 19 variables with response variable (rebounds per 36 minutes of play)

First Model:

Above- Initial model, 10 variables

This is our initially proposed model, which incorporates 10 different variables. As you can see, the R-squared is quite large and shows that 86.45% of the variation in rebounds is accounted for by this model. The RMSE or s value is 1.199, which means that a typical residual is about 1.199 rebounds in size. This is found by taking the square root of the MSE (1.437). The BIC for this model is 360.372. In addition, we also looked at the parameter estimates. It was fairly surprising to find that height had a negative coefficient of -0.247. That coefficient means that for every 1 inch increase in height, holding all other variables constant, the expected number of rebounds decreases by 0.247. We can also say from the confidence interval that we are 95% confident that for every 1 inch increase in height, holding all other variables constant, there is a decrease of between 0.05 and 0.44 in the expected number of rebounds. That being said, there could be some interactions among the variables. Also, many of the variables (height, weight, wingspan, etc.) have high VIF values, so we will need to continue to improve this model.
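As a quick arithmetic check of the RMSE quoted for this first model, the Python lines below take the square root of the reported MSE. The variable names are our own.

```python
from math import sqrt

mse = 1.437        # mean squared error reported for the first model
rmse = sqrt(mse)   # root mean squared error, the "s" value in the output

print(round(rmse, 3))  # 1.199
```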

Residual Analysis: Our first goal was to check the normality of the residuals of our initial model. We looked at four different outputs to determine this: a histogram of the residuals, a box plot of the residuals, a normal quantile plot of the residuals, and a Shapiro-Wilk W Test. We can see from both the histogram and the normal quantile plot that normality may not be met. The histogram appears to be left skewed, and the normal quantile plot does not follow the line closely. The box plot confirmed the skew, and we also discovered an outlier. The outlier, Hassan Whiteside, now starting center for the Miami Heat, was located in row 7. We removed this data point. Finally, we looked at the Shapiro-Wilk W Test, which had a p-value of 0.0419. With Ho: the residuals are normally distributed and Ha: the residuals are not normally distributed, we rejected Ho at the 0.05 level; there is evidence that the residuals are not normally distributed.

Above-a histogram of the residuals, a box-plot of the residuals, and a normal quantile plot of the residuals

Above- Shapiro-Wilk W Test

So, based on the above data, we confirmed the strong possibility of a lack of normality in the residuals. The best way to deal with this problem is to transform our response/y variable, rebounds. We used a logarithmic transformation on rebounds. The data for the new model is as follows.

Second Model

Above- Second model, log transformation and removal of row 7

The logarithmic transformation of rebounds and dropping the outlier created this new model, which has an R-squared value of 84.6%. Although this R-squared value is lower than that of the original model, the two are not comparable because the transformation changed our response variable. The new RMSE is 0.189, which is expected since the log of rebounds will be smaller than rebounds itself. The BIC is -5.44. Also, a few of the coefficients and their corresponding p-values have changed. Players who can play SF (small forward) no longer looks relevant to our model since it has a slope of 0.0065 and a p-value of 0.790. In addition, max vertical and height seem to have lost relevance in this new model. They both have coefficients closer to 0 and larger p-values than in our original model. In addition, the VIF values tell us that there is still a lot of multicollinearity present.

Above- a histogram of the residuals, a box-plot of the residuals, a normal quantile plot of the residuals and a Shapiro-Wilk W Test

In this new Shapiro-Wilk W Test, the p-value of 0.6673 gives no evidence against normality, so the transformation appears to have solved our problem with the normality of the residuals.

The next thing we need to look at is autocorrelation and independence. To do this, we examined a time series of the residuals for the previous model. The first-order lag autocorrelation is -0.0701, which is extremely low. We also looked at the residuals versus time, or, in this case, row, to find any sort of pattern. It appears that the residuals tend to slope down over time, but just barely. This may be the reason that the Durbin-Watson statistic, the second thing we looked at, is 2.084. This statistic is extremely close to 2, the value that signifies no autocorrelation. The p-value for positive autocorrelation is 0.66, so we fail to reject our Ho that there is no autocorrelation. The Durbin-Watson statistic therefore gives us no evidence that a residual depends on the previous residual, so we treat the residuals as independent.
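For reference, the Durbin-Watson statistic and the lag-1 autocorrelation discussed above can be computed directly from the residuals in row order. The sketch below is a plain Python illustration with made-up residuals; the function names are our own, and it does not produce the p-value that the statistics package reports.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by sum of squared residuals."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den


def lag1_autocorrelation(residuals):
    """Lag-1 autocorrelation, assuming the residuals have mean zero
    (true for least-squares residuals when the model has an intercept)."""
    num = sum(residuals[i] * residuals[i - 1] for i in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den


# Hypothetical residuals in row order, for illustration only.
resid = [0.12, -0.08, 0.05, -0.15, 0.10, 0.02, -0.11, 0.05]
print("Durbin-Watson:", round(durbin_watson(resid), 3))
print("Lag-1 autocorrelation:", round(lag1_autocorrelation(resid), 3))
```

The two quantities are linked approximately by DW ≈ 2(1 - r), which is why a lag-1 autocorrelation near zero goes with a statistic near 2.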

Above- Durbin Watson statistic and Time Series for Residuals

Above- Residuals versus fits

When testing for linearity of the residuals, we can also look at the same graph of residuals versus predicted; however, when looking for linearity, we are not looking for fanning. Instead, we are looking for a lack of pattern and a lack of curvature. Fortunately, this graph does not have a pattern and does not exhibit curvature in the data. Unfortunately, in our analysis, we could not get a lack of fit test because we had no replication of data points. Had we used a probability tree to run the model then we would get a lack of fit, but we addressed this later in our analysis.

This output shows no linear pattern. The R-squared value of 0 tells us that there is no linear relationship between the residuals and the predicted values, and the parameter estimates show a slope that is approximately zero. A least-squares fit always produces this result, so the lack of visible curvature in the plot is the more informative check. So, despite the fact that there is no lack of fit test, we were fairly certain that the linearity condition was met for this second model.

The final condition that we checked was equal variance of the residuals. Although our initial analysis led us to believe that there was no fanning in the data, more thorough exploration proved otherwise. We split the residuals into two halves by x value and compared their distributions to test whether the standard deviations were equal. They were not. The half with the lower x values had a smaller standard deviation of the residuals (about 0.15), and the half with the higher x values had a larger one (about 0.20).

Above- One way analysis for unequal variance.

Above- Distribution numbers and F-tests for significance.

By looking at the O’Brien, Brown-Forsythe, Levene, Bartlett, and 2-sided F-test statistics, we can see that all the p-values are fairly small. They aren't extremely small, hovering around the .10 mark, but they are small enough for us to have doubts about the equal variance condition. This requires us to transform the y variable again. We discussed our options and decided that we might have over-transformed the y variable when trying to fix normality, so we went with the new transformation of Y^½.
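The split-half comparison and one of the tests named above, Brown-Forsythe, can be sketched in a few lines of Python. This is an illustration under our own assumptions: the residuals are invented, the function name is hypothetical, and only the F statistic is computed (turning it into a p-value requires an F distribution with 1 and N - 2 degrees of freedom).

```python
from statistics import median, stdev


def brown_forsythe(group_a, group_b):
    """Brown-Forsythe F statistic for two groups: a one-way ANOVA F computed
    on absolute deviations from each group's median."""
    z = []
    for group in (group_a, group_b):
        center = median(group)
        z.append([abs(v - center) for v in group])
    n_total = sum(len(g) for g in z)
    k = len(z)
    grand_mean = sum(sum(g) for g in z) / n_total
    group_means = [sum(g) / len(g) for g in z]
    between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(z, group_means))
    within = sum((v - m) ** 2 for g, m in zip(z, group_means) for v in g)
    return ((n_total - k) / (k - 1)) * between / within


# Hypothetical residuals for the lower and upper halves of the predicted values.
low_half = [-0.20, -0.05, 0.00, 0.10, 0.15]
high_half = [-0.35, -0.10, 0.05, 0.20, 0.30]
print("SD, low half: ", round(stdev(low_half), 3))
print("SD, high half:", round(stdev(high_half), 3))
print("Brown-Forsythe F:", round(brown_forsythe(low_half, high_half), 3))
```

A larger F statistic points toward unequal spread between the two halves; an F near zero points toward equal spread.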

Above-This is the new one way analysis of the residuals

Above- These are the new distributions of the residuals as well as the O’Brien, Brown-Forsythe, Levene, Bartlett, and 2-sided F-tests

This new analysis shows an improvement in the data, exemplified in the extremely high p-values (0.60-0.90) and the standard deviations (0.2128 and 0.2098). This solves the problem of unequal variance of the residuals; however, we need to check the other conditions again.

Above- Normality is met. High P-value

Above-Linearity is met. No pattern.

Above-Independence is met.

Third Model

Above- Data from the third model with a y transformation of y^½

Above- Parameter estimates from the third model

This third model meets all the conditions. It has a new R-squared value of 86.43% (not comparable to the earlier models because the response variable changed), an RMSE of 0.221892, and a BIC of 26.33. That being said, the parameter estimates are still troubling. A few have high p-values, and there are still 5 variables with VIF values over 4. We decided to do another best subset to improve this third model.

Model Exploration: The first thing we did was run another best subset search, minimizing the BIC value, to eliminate possible noise in the model. We got the same result by restricting the p-value cutoff to 0.05. This best subset eliminated the height variable, which is likely redundant with the wingspan variable. It also eliminated max vertical, a measure of how high the player can get his feet off the ground. We assumed max vertical was correlated with body fat, which we consider a strong measure of athleticism. Max jump height, a measure of how high a player can reach when jumping, is basically just a combination of body fat and wingspan (how high a player can reach), so that variable was dropped from the model as well. The final variable dropped was the position variable SF (small forward). We had anticipated dropping this variable from the model because SF is the middle position in basketball, and players from that position are not significantly different from the average.
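To make the idea of a BIC-minimizing best subset concrete, here is a small Python sketch that fits every subset of predictors by least squares and keeps the one with the lowest BIC. It is an illustration only: the data are invented, the function names are our own, and the BIC form used (n·ln(SSE/n) + k·ln(n)) is one common version that drops additive constants, so its values will not match those reported by our statistics package.

```python
from itertools import combinations
from math import log


def ols_sse(X, y):
    """Sum of squared errors for a least-squares fit of y on X plus an intercept."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    M = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    # Gaussian elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, p):
            factor = M[r][col] / M[col][col]
            for c in range(col, p + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution for the coefficients.
    b = [0.0] * p
    for i in range(p - 1, -1, -1):
        b[i] = (M[i][p] - sum(M[i][j] * b[j] for j in range(i + 1, p))) / M[i][i]
    return sum((yi - sum(bj * xj for bj, xj in zip(b, r))) ** 2
               for r, yi in zip(rows, y))


def bic(sse, n, k):
    """Assumed Gaussian BIC up to a constant; k counts coefficients incl. intercept."""
    return n * log(sse / n) + k * log(n)


def best_subset(columns, y):
    """Return (BIC, predictor names) for the subset with the lowest BIC."""
    n = len(y)
    best = None
    for size in range(1, len(columns) + 1):
        for subset in combinations(columns, size):
            X = [[columns[name][i] for name in subset] for i in range(n)]
            score = bic(ols_sse(X, y), n, len(subset) + 1)
            if best is None or score < best[0]:
                best = (score, subset)
    return best


# Invented data: y tracks x1 closely, while x2 is unrelated noise.
data = {"x1": [1, 2, 3, 4, 5, 6, 7, 8], "x2": [5, 1, 4, 2, 3, 1, 5, 2]}
y = [2.1, 3.9, 6.1, 7.9, 10.1, 11.9, 14.1, 15.9]
print(best_subset(data, y))
```

Enumerating every subset is only practical for a modest number of predictors, since the number of candidate models doubles with each added variable.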

Above-Best subset

Fourth Model

Above-Fourth model data removed 4 variables

Above- Parameter estimates

This model is a slight improvement, considering that the BIC went from 26 to 20. We got rid of all the multicollinearity except for weight, whose VIF is barely above 4. All the p-values are highly significant in this model, unlike the last few models that we have tried. The downside of this model, however, is a slightly higher RMSE (0.23 versus 0.22) and an R-squared of 84.67% instead of 86% (not a large change).

We decided that we wanted to continue to improve our model, so we ran another best subset with all possible interactions and quadratic terms. By minimizing BIC, we found that the best model, in addition to the initial variables, included (weight)^2, wingspan*players who can play SG, and players who can play PF*players who can play C.

Above- Best subset to minimize BIC

Fifth Model-potential

Above- Data for potential model 5

Above- Parameter estimates for potential model 5

This model improved in a few ways compared to model 4. For example, the R-squared increased from 84.6% to 85.7%. This means that the regression model accounts for 85.7% of the variability in rebounds^½. The RMSE is also lower (0.226 compared to 0.231). However, that does not mean that the model improved overall. For example, the BIC went up from 20 to 26. The VIFs went up, pushing three variables over the threshold of 4, and two variables do not appear significant, with large p-values of 0.67 and 0.90 and extremely small coefficients. This is not an improvement to the model; however, we were curious about how the interaction between wingspan and players who can play SG looked, considering it was the best interaction the statistics package could produce.

Above- Surface plot with graphed residuals.

This 3 dimensional graph shows the expected number of rebounds given wingspan and the binary variable players who can play SG. The residuals appear on the right hand side of the surface planes. Players who can play SG appear on the left plane, and players who cannot play this position appear on the right plane. This shows a slight interaction between wingspan and SG. The slope of the left plane (SG players) is marginally steeper than the slope of the right plane.

We decided to run our best subset in a different way. This time, we looked at a p-value threshold of 0.10 and examined variables that were under that threshold. The only variable that JMP added was weight^2.

Sixth Model

Above- Data for the 6th model

Above- Parameter estimates

This model appears to be a lot better than the previous two. The R-squared value of 85.6% is basically as good as the value from model 5 (85.7%). The RMSE is 0.224, which is better than the previous 0.226, and much better than model 4’s 0.231. In addition to that, the BIC is now lower (17.80 as opposed to 26 in model 5 and 20 in model 4). Model 6, unlike model 5, has only low p-values for its estimates, so we know that all of these variables are statistically significant within this model. The VIF of weight is still a little bit high, but there is much less multicollinearity than there was in model 5. We believe that there is not a significant amount of multicollinearity to warrant another change. We also noticed that none of the coefficients for the variables switched signs in the 95% confidence interval output.

Above-Factor profiler for weight

Above-surface profiler for weight and wingspan.

These two graphs illustrate the quadratic effects of weight. The first one shows the factor profiler, which holds all other variables at certain values and allows you to change one variable at a time. The second graph is a 3 dimensional surface profiler that looks at how wingspan and weight are associated with rebounds^½. We can see in this graph that wingspan has a positive linear relationship with rebounds^½, but along the weight axis the surface curves quadratically, in line with the weight*weight term.

Now that we have decided to include weight^2 in model 6, while eliminating other key variables, we need to recheck the LINE assumptions.

Above- This goodness-of-fit test produces a p-value of 0.4434, which is high enough that normality is not a problem. It also shows us that we could potentially exclude two more outliers. Normality is met.

Above- This Durbin-Watson statistic is 2.0017, which is extremely close to 2. The autocorrelation of -0.0114 is negligible, and the p-value of 0.4954 gives no evidence of autocorrelation. Independence is met.

Above- Similar to earlier problems, we can not get a lack of fit test, so we resort to looking at patterns in the residual graph. We do not see any, so we conclude that the linearity condition is met.

Above- We split the data into two fairly equal groups, one with higher predicted rebounds and one with lower predicted rebounds. The standard deviation is 0.2062 in the first group and 0.2264 in the second group. We can see by the Levene p-value of 0.6470 as well as the other unequal variance tests that the difference in standard deviation and variance is not significant.

Model Description:

Model 7 (edited to exclude the three outliers in rows 7, 8, and 18)

Above- Actual by predicted graph for the final model.

Above- Data for the final model

Above- Parameter estimates for the final model.

By removing the two new outliers - Celtics SG Evan Turner and Nuggets PF Kenneth Faried - we vastly improved our model. This model now excludes rows 7, 8, and 18. By doing this we were able to create a final model with an R-squared of 87.7%. This means that 87.70% of the variance in rebounds is accounted for by the regression model. This is higher than the 86.56% found from model 6, which is the same as this model, except this model excludes the three outliers. The RMSE in this model went down from 0.224 to 0.207. This is the standard deviation of the residuals. The BIC also went down in this model to a value of 2.68, which is lower than the 17.80 seen before. In addition, all of the parameter estimates are significant, with a maximum p-value of 0.0201. This is the best model that we have produced.

There is one problem with this model that will lead us to make a simpler exploratory model. Player position can change when new players move from college basketball into the NBA, and sometimes player position is not known when measurements are taken. This rarely happens, but we created a smaller model that excluded position just to explore.

Above- Data for the smaller hypothetical model

Above- parameter estimates for the smaller hypothetical model

This would be useful in the absence of position data. In this model, you can see that the BIC has gone up to 38.612, the R-squared has decreased to 79.47%, and the RMSE is 0.2631. From this output, we deduced that this is a weaker model than the last one we produced.

Above- repeated parameter estimates for the final model.

Weight

Equation- (rebounds per 36 minutes)^½ = .005853(weight) + 8.1379E-5(weight)^2

Weight was the first variable that we explored after looking at the final model. It is the only variable that also has a quadratic term. As you can see in the graph, the expected rebounds^½ grows quadratically with weight. The coefficient for weight is .005853, which means that for each 1 pound increase in weight, the expected number of rebounds^½ will increase by .005853. We could also say that we are 95% confident that each one pound increase in weight, holding the other variables fixed, is associated with an increase of .002626 to .00908 units in the expected number of rebounds^½. However, that does not take into account the quadratic term (8.1379E-5). Because of this term, the slope of the relationship between weight and rebounds^½ is not constant: for each 1 pound increase in weight, we expect the slope to increase by about twice the quadratic coefficient (2 × 8.1379E-5 ≈ 1.63E-4). Players at the lower end of the graph (pictured above) weighed between 180-200 pounds and averaged about 2.2 rebounds^½. When squared, this turns out to be approximately 4.84 rebounds. Players on the high end of this variable (weights between 240-260 pounds) averaged around 2.6 rebounds^½. When squared, this turns out to be approximately 6.76 rebounds.
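Because the model predicts rebounds^½, a prediction is converted back to rebounds by squaring it. The Python sketch below works through the two weight groups mentioned above and shows how fast the weight slope changes under the quadratic term. The names are our own, and the coefficient is copied from the equation above.

```python
B_WEIGHT_SQ = 8.1379e-5  # quadratic weight coefficient from the final model


def to_rebounds(sqrt_rebounds):
    """Undo the square-root transformation of the response."""
    return sqrt_rebounds ** 2


# Average predictions on the square-root scale for lighter and heavier players.
for sqrt_value in (2.2, 2.6):
    print(sqrt_value, "rebounds^(1/2) ->", round(to_rebounds(sqrt_value), 2), "rebounds")

# With a quadratic term, the slope in weight changes by 2 * coefficient per pound.
slope_change_per_pound = 2 * B_WEIGHT_SQ
print("Change in slope per pound:", slope_change_per_pound)
```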

Wingspan

Equation- (rebounds per 36 minutes)^½ =.0580778(wingspan)

The next variable we explored was wingspan. The coefficient for wingspan is 0.0580778, which means that for every inch increase in wingspan, the expected rebounds^½ will increase by 0.0580778. Also, we are 95% confident that for every inch increase in a player’s wingspan, holding all other variables constant, there is an increase of 0.040957 to 0.075198 in expected total rebounds^½. Wingspan had the highest coefficient of our continuous variables and the highest single-variable R-squared value (around 66%). For players at the lower end of this data set (wingspans around 75 inches) the expected rebounds^½ is about 1.9. This translates into 3.61 expected rebounds. The players at the higher end of the spectrum (wingspans around 90 inches) have expected rebounds^½ of 2.75. This translates into about 7.56 expected rebounds.

Body fat

Equation- (rebounds per 36 minutes)^½ = -0.027279(bodyfat)

Body fat was the next variable that we looked into. The coefficient for body fat is -0.027279. This means that for every percentage point increase in body fat there will be a decrease of 0.027279 in total rebounds^½. Also, our output tells us that we are 95% confident that for every percentage point increase in a player’s body fat, holding all other variables constant, there is a decrease of 0.004375 to 0.050183 in expected total rebounds^½. Body fat initially appeared to be a weak variable, but turned out to be a much stronger variable than we had anticipated when we created this final model. Most players had body fat percentages between 4 and 8, and the average player with a body fat of 4% gets approximately 2.5 total rebounds^½. This translates to 6.25 rebounds. The average player with an 8% body fat gets only 2.4 rebounds^½, converting to 5.76 rebounds.

Players who play SG

Equation- (rebounds per 36 minutes)^½ = Match(Players who can play SG, ("0" = 0.109194240007158, "1" = -0.109194240007158))

Moving on to the categorical variables, we are now looking at player positions. The first position we looked at was shooting guard. This was a binary variable, so if a player could play SG, he would be valued at 1; if he could not, he would be valued at 0. The coefficient for this variable was -0.10919. This means that if a player can play shooting guard, there is an expected decrease of 0.10919 rebounds^½. We are 95% confident that if a player can play shooting guard, holding all other variables constant, there is a decrease of 0.056179 to 0.1621166 in expected total rebounds^½. If the player cannot play SG, the opposite is true: there is a .10919 increase in expected rebounds^½. This means that SG players are expected to get about .21838 fewer rebounds^½ than players who cannot play SG.
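The Match formula for this variable assigns plus or minus the coefficient to the two levels, rather than 0 and the coefficient, so the gap between the two groups is twice the coefficient. A minimal Python sketch of that coding follows; the dictionary and function names are our own, and the values are copied from the equation above.

```python
# Contribution to rebounds^(1/2) for each level of "can play SG".
SG_EFFECT = {"0": 0.109194240007158, "1": -0.109194240007158}


def sg_contribution(can_play_sg):
    """Look up the contribution for a player who can (True) or cannot (False) play SG."""
    return SG_EFFECT["1" if can_play_sg else "0"]


# Gap between non-SG and SG players on the square-root scale: twice the coefficient.
gap = sg_contribution(False) - sg_contribution(True)
print(round(gap, 5))
```

The same reading applies to the PF and C position variables, which use the same plus-or-minus coding.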

Players who play PF

Equation- (rebounds per 36 minutes)^½ = Match(Players who can play PF, ("0" = -0.0997105472445443, "1" = 0.0997105472445443))

The second position we looked at was power forward. This was another binary term, so if a player could play power forward he would be valued at 1, and at 0 if he could not. The coefficient for this variable is .0997. This means that if a player can play power forward there is an increase of 0.0997 in expected rebounds^½. We are 95% confident that if a player can play power forward, holding all other variables constant, there is an increase of between 0.04098 and 0.15835 in expected total rebounds^½. If the player cannot play the PF position, then the opposite is true, and there is a .0997 decrease in expected rebounds^½. This means that PF players are expected to get about .1994 more rebounds^½ than non-PF players. This gap cannot be converted to rebounds by squaring .1994 (which gives only .0398), because the difference in rebounds depends on the player's other measurements. For example, for a player otherwise expected to get 2.4 rebounds^½ (5.76 rebounds), being able to play PF raises the expectation to about 2.6 rebounds^½, or roughly 6.76 rebounds.

Players who play C equation- (rebounds per 36 minutes)^½ = Match(Players who can play C, ("0" = -0.0993250331527991, "1" = 0.0993250331527991))

The third position we looked at was center (C). This was a binary term: a player who could play center was valued at 1, and a player who could not was valued at 0. The coefficient for this variable is 0.0993. This means that if a player can play center, there is an increase of 0.0993 in expected rebounds^½. We are 95% confident that if a player can play center, holding all other variables constant, there is an increase of between 0.0309 and 0.1677 in expected total rebounds^½. If the player cannot play the center position, the opposite is true, and there is a 0.0993 decrease in expected rebounds^½. This means that centers are expected to get about 0.19864 more rebounds^½ than non-center players. Squaring this difference gives 0.0394, but because the square root scale is not linear, the difference in actual rebounds depends on the player's other measurements and is not simply 0.0394.
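One caution applies when translating a difference in rebounds^½ back to rebounds: the same gap on the square root scale corresponds to a different number of rebounds depending on where the player starts. The short Python sketch below illustrates this with the 0.1994 PF gap. The baselines of 2.4 and 2.5 rebounds^½ are borrowed from the body fat example purely as illustrative assumptions, and the function name is our own.

```python
def rebound_gap(baseline_sqrt: float, gap_sqrt: float) -> float:
    """Difference in rebounds implied by a gap on the rebounds^(1/2) scale.

    (b + g)^2 - b^2 = 2*b*g + g^2, so the answer grows with the baseline b.
    """
    return (baseline_sqrt + gap_sqrt) ** 2 - baseline_sqrt ** 2


PF_GAP_SQRT = 0.1994  # PF versus non-PF gap in rebounds^(1/2), from the text

# Baselines are illustrative assumptions, not model output.
for baseline in (2.4, 2.5):
    print(baseline, round(rebound_gap(baseline, PF_GAP_SQRT), 3))
```

Under these assumed baselines the gap works out to roughly one rebound per 36 minutes. Only at a baseline of zero would it equal the squared difference of about 0.04.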

Intercept

Our intercept coefficient for this model is -3.38. This means that if all the explanatory variables were held at 0, we would predict a player to get -3.38 rebounds^½. Although this value appears concerning, that situation is not possible, since each player has a weight, a wingspan measurement, and a body fat percentage. The model was fit to a limited range of values: for example, the lowest weights are around 160 pounds, and the lowest wingspans are around 72 inches. The regression line was not meant for values lower than that, so the y-intercept does not tell us much. That being said, we get a p-value of <.0001, which means that the intercept is statistically significant. We are 95% confident that the number of rebounds^½ would be between -4.152 and -2.254 if all the other variables were held at zero.

Looking at Us

We decided that it would be a good idea to see how many rebounds we would be able to get if we entered our own measurements into the model. These explanatory variables are interesting because, as college students, we wanted to see if it was possible to predict how many rebounds we could get in an NBA game (and create our own little outlier values).

The confidence interval for a mean response value for our data would be:

Above- Prediction interval for David, Jon, Sierra

We are 95% confident that the mean response value of rebounds^½ would be between 1.184 and 2.260 for David Espinola. This means that David would grab anywhere between 1.184 and 2.260 rebounds^½. Translated back to rebounds, we are 95% confident that David would get between 1.413 and 5.387 rebounds if he played in the NBA.

We are 95% confident that the mean response value of rebounds^½ would be between 1.217 and 2.1958 for Jon Carter. This means that Jon would grab anywhere between 1.219 and 2.228 rebounds^½. Translated back to rebounds, we are 95% confident that Jon would get between 1.486 and 4.963984 rebounds if he played in the NBA.

We are 95% confident that the mean response value of rebounds^½ would be between 0.720 and 2.054 for Sierra Haddock. This means that Sierra would grab anywhere between 0.720 and 2.054 rebounds^½. Translated back to rebounds, we are 95% confident that Sierra would get between 0.5184 and 4.218916 rebounds if she played in the NBA (ignoring her gender).
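The translation from rebounds^½ back to rebounds is just squaring each endpoint, which keeps the endpoints in order as long as both are non-negative. The sketch below is a minimal Python illustration using Sierra's interval. The function name is our own and the check on the inputs is an added safeguard, not something the software reported.

```python
def back_transform(low_sqrt: float, high_sqrt: float) -> tuple[float, float]:
    """Convert an interval for rebounds^(1/2) into an interval for rebounds.

    Squaring preserves order only for non-negative endpoints, so negative
    or reversed endpoints are rejected.
    """
    if low_sqrt < 0 or high_sqrt < low_sqrt:
        raise ValueError("endpoints must satisfy 0 <= low <= high")
    return low_sqrt ** 2, high_sqrt ** 2


# Sierra's interval on the square root scale, from the text.
low, high = back_transform(0.720, 2.054)
print(round(low, 4), round(high, 6))
```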

For David Espinola, we entered 135 pounds, a 72-inch wingspan, and 9.5% body fat, and we listed him as a point guard. The prediction formula gave us 1.755 rebounds^½ for David in an NBA game. For Jon Carter, we entered 160 pounds, a 75-inch wingspan, and 13% body fat, and we also listed him as a point guard. The prediction formula gave us 1.724 rebounds^½. For Sierra Haddock, we entered 130 pounds and a 69-inch wingspan, playing the point guard position. The prediction formula gave us 1.387 rebounds^½.

Ho: B1 = B2 = B3 = B4 = B5 = B6 = B7 = 0

Ha: At least one of them differs from 0

The residual conditions that we satisfied in model 6 are almost identical to those in model 7. All LINE conditions are met. Dof = (7,96). The F-statistic is 90.6606, which gives us a p-value of <.0001. We reject the null hypothesis and conclude that at least one coefficient does not equal 0, and that this model is significant.
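As a rough consistency check, the overall F-statistic can be recovered from R-squared with the standard identity F = (R²/k) / ((1 − R²)/(n − k − 1)). The Python sketch below assumes the final model's reported R-squared of 87.70%, k = 7 model terms, and N = 97 players, so that 96 is read as the total degrees of freedom and the error degrees of freedom would be 89. That reading is our assumption, and the function name is our own.

```python
def f_from_r_squared(r_squared: float, df_model: int, df_error: int) -> float:
    """Overall F-statistic implied by R-squared and the degrees of freedom."""
    explained_per_df = r_squared / df_model
    unexplained_per_df = (1.0 - r_squared) / df_error
    return explained_per_df / unexplained_per_df


N_PLAYERS = 97      # final model sample size, from the report
DF_MODEL = 7        # seven model terms, B1 through B7
DF_ERROR = N_PLAYERS - DF_MODEL - 1  # assumed error degrees of freedom

f_value = f_from_r_squared(0.877, DF_MODEL, DF_ERROR)
print(DF_ERROR, round(f_value, 2))
```

Under these assumptions the implied F-statistic is close to the reported 90.6606, with the small difference attributable to rounding in R-squared.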

Conclusion and Additional Exploration:

We were a little uncertain about whether or not we should have excluded the outliers that fell outside of our normality box plot. We have included an abbreviated model to show the difference between the two models. The final model can be seen on the right (N=97) and the full model can be seen on the left (N=100). The final model appears stronger than the full model, but not by much. We conclude this from the lower RMSE (.207 versus .232) and from the larger R-squared value (87.7% versus 85.35%). There is a possibility that we did not need to remove these outliers, but they fell more than 1.5 times the IQR below Q1 or above Q3, which is an acceptable reason to exclude an observation.
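The 1.5 times IQR rule can be written out in a few lines. The Python sketch below is only an illustration: the data values are made up, the function name is our own, and the "inclusive" quartile method is an assumption, since statistical packages compute quartiles in slightly different ways and may flag slightly different points.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    # Quartiles via the standard library; the method choice is an assumption.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


# Hypothetical values, not taken from our data table.
example = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
print(iqr_outliers(example))
```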

We also decided to test a regression tree, partially because we were having fun exploring our data, but also because we wanted to see if similar variables would come up. We also wanted to see if the tree could produce a better model than the one that we had settled with.

Above- Regression tree, 14 splits and 15 groups.

Above- Predictions for the rebounds per 36 min

Above- This is the summary of fit including the AICs.

The regression tree split the data 14 separate times. We noticed that the RMSE is .1835, which is an improvement on model 7’s .207146. Finally, the R-squared value is 89.68%, which is higher than model 7’s 87.70% and higher than that of any other model we have run. Although the tree provides more flexibility, our final model is easier to interpret. The final model has 3 binary categorical variables, 3 quantitative variables, and one quadratic term. It was simple and easy to interpret for our level of regression comprehension.

Above- This is the leaf report of how the groups were broken up.

The lowest group consists of players with wingspans below 84 inches who do not play PF, weigh less than 220 pounds, cannot jump 139 inches, and cannot run the lane agility drill in 10.74 seconds. These players are expected to get an average of 1.74644 rebounds^½. Translated back to rebounds, this is an average of 3.0500 rebounds.

The highest group consists of players with wingspans larger than 84 inches who cannot play SF, who have a standing reach of 109.5 inches, and who can run ¾ of the court in less than 3.35 seconds. These players are expected to get an average of 3.4855 rebounds^½. Translated back to rebounds, this is an average of 12.1487 rebounds.

We were also curious about which players our model predicted would have, on average, the most and the fewest rebounds over their careers. Shane Larkin was predicted to have 1.68681 rebounds^½, which means the model predicts only 2.84532 rebounds per 36 minutes for him. This is even less than the predictions made for Jon and David, which was at first surprising to us, but we then remembered that we had made up unrealistic body fat percentages for each other. Shane Larkin weighs 170.8 pounds, has a wingspan of 70.75 inches and a body fat of 3.8%, and does not play any of our binary positions (SG, PF, or C).

Andre Drummond was predicted to have the most rebounds. He was predicted to have 3.71915 rebounds^½, which means that he is expected to get 13.83207 rebounds. He weighs 278.6 pounds, has a wingspan of 90.25 inches and a body fat of 7.5%, and plays the position of center. Andre Drummond actually leads the league this season in rebounds.

Overall, we believe that the project was a success and that the model is extremely significant, as you can see in the ANOVA test below and its p-value of <.001.

Overall, this data set was exciting for us to work with partially because we are NBA fanatics, but mostly because it gave us so much insight and predictive ability. Sierra had a blast examining different variables, while Jon enjoyed producing and interpreting complex charts, plots, and graphs. David took an entirely different approach and loved playing around with the idea of (and prediction interval for) being an NBA player. The most grueling and difficult segment of this project was collecting all the data and inputting it into our data table. It took hours and we restarted our data collection three separate times. This could have been avoided if we had organized our project and the variables more before diving into data collection. We were just too excited to begin! The data performed as we predicted, for the most part, and it was pretty cool examining the outlier players because they are more than just another data point in our minds. We were rather surprised that we only ended up excluding three players, but happy that there were not too many extremely influential players in our data set. There are probably so few outliers because the NBA is extremely competitive, so many of the players have similar measurements and percentages. The major question we have left, that our data can not tell us, is whether or not there is a variable that we failed to account for that could have a huge influence when predicting rebounds.

Appendix

Electronic Copy:

Sources:

http://www.nbadraft.net/

http://www.nbadraft.net/2010-nba-draft-combine-official-measurements

http://www.nbadraft.net/nba-draft-combine-measurements-0

http://www.nbadraft.net/2012-nba-combine-measurements

http://www.nbadraft.net/2013-nba-draft-combine-measurements

http://www.nbadraft.net/2014-nba-draft-combine-measurements

http://www.nbadraft.net/nba-draft-combine-athleticism-test-results

http://www.nbadraft.net/2012-nba-combine-athleticism-results

http://www.nbadraft.net/forum/2013-nba-combine-athletic-testing-results

http://www.nbadraft.net/nba-draft-combine-athleticism-test-results-1

http://www.nba.com/2011/news/05/23/combine-measurements/index.html
