12/8/2015
Turnovers' Effect on Wins and Losses: Topic #4
2014 NFL Teams Regular and Post Season Data
William Harris and Matthew McCormack
AUBURN UNIVERSITY



Turnovers' Effect on Wins and Losses: Topic #4

EXECUTIVE SUMMARY

The purpose of this document is to describe the analysis performed by William Harris and Matthew McCormack on NFL team data for the 2014 regular and post season, and more specifically how turnovers affect the outcome of a game. The analysis consisted of developing several predictive models and comparing them until the most accurate model was chosen. The models used variables such as turnover total, interceptions caught, fumbles returned for touchdowns, and interceptions returned for touchdowns, and the odds ratio table was examined to see how these variables affected the outcome of the game.

The resulting model comparison, with average squared error as its selection criterion, found that the neural network with the regression-selected variables was the best model. From this we found that, with the exception of outliers, turnovers almost always decrease a team's chances of winning. There was no contest on Kaggle.com for our data, so we used the data for this project, and we both hope to use it in the future to show potential employers our skills.

TABLE OF CONTENTS

Literature Review

Background and literature of the study

Introduction

Description of Business Problem

Methodology

Models and Rationale

Analysis Results

Conclusion

Appendix

LITERATURE AND BACKGROUND

The basis for this analysis comes from a community manager at IBM Watson Analytics who posted an article on the company website containing data on NFL teams and players from the 2014 regular and post season. The article discusses using these statistics to gain deeper analytic insight into the football world.

Football is a very complex sport, and analytics is finally being applied to dig deeper and extract more, and better, information from the statistics already gathered. The statistic we found most interesting was turnovers, so we decided to analyze how turnovers affect the outcome of a game.

INTRODUCTION

The analysis conducted is an attempt to see how turnovers affect the outcome of a football game. We downloaded the data from IBM Watson Analytics and deleted several variables that were not relevant to turnovers and their effect on games. After this manipulation, our data set contained a target variable of game outcome; the predictor variables are shown below.

· Conference (NFC or AFC)

· Fumbles recovered for touchdowns

· Fumbles lost

· High amount of turnovers (4 or more turnovers)

· Medium amount of turnovers (3 turnovers)

· Low amount of turnovers (1-2 turnovers)

· No turnovers (zero turnovers)

· Sacks

· Venue surface (artificial or turf)

· Interception caught

· Interception caught and returned for touchdown

Using this data set, we built several predictive models to estimate how much turnovers, and their specific types, increase or decrease a team's chances of winning. The methodology used, the results of the analysis, figures, tables, and conclusions are all presented below.

METHODOLOGY

The first thing we did once we downloaded the data was manipulate it and delete data that wasn't relevant to turnovers. We then grouped the total number of turnovers into four groups to better represent the data: high (4 or more turnovers), medium (3 turnovers), low (1-2 turnovers), and none (zero turnovers). We then imported the file and ran it through an impute node to adjust for any missing values in the data. The first model we ran was a regression model, where we focused on the odds ratio estimates. The next model was a decision tree, a supervised learning technique used to predict the binary outcome of a proposed question, which also reports variable importance. Once the nodes were split, we went back and changed the splitting rule for interceptions from .5 to 1, because you can't have half of an interception.
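The turnover grouping step can be sketched as a simple binning function. The original work was done in SAS Enterprise Miner, so this Python version is only an illustration of the rule described above:

```python
def turnover_group(total_turnovers):
    """Bin a game's total turnovers into the four groups used in the report."""
    if total_turnovers >= 4:
        return "high"
    if total_turnovers == 3:
        return "medium"
    if total_turnovers >= 1:
        return "low"
    return "none"

print(turnover_group(5))  # high
print(turnover_group(0))  # none
```

Binning a count like this trades some detail for interpretability: the regression then reports one odds ratio per group instead of a single slope for raw turnover counts.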

The next model was a neural network. Neural networks are capable of modeling almost any association, and we wanted to see if there was a difference between using the imputed variables and the regression-selected variables. So, for this model we skipped the regression node and went straight from the impute node to the neural network node. The next model was a second neural network that ran from the impute node through the regression node and then to the neural network node. We then connected all of the models at a control point node to centralize them. Once they were centralized, we ran them through an ensemble node to combine the model predictions, and finally through a model comparison node to compare the individual models.

ANALYSIS RESULTS

Before comparing the models, we stopped and looked at each individual model to see whether any information could be extracted from it. In the regression model, with validation error as the selection criterion, Figure 2 shows that teams are more than six times more likely to win if they don't have four or more turnovers. It also shows that teams are over three times more likely to win a game if they return a fumble for a touchdown.
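To see what a "six times more likely" odds ratio means in probability terms, the conversion can be sketched as below. The 50% baseline is hypothetical, chosen only to make the arithmetic easy to follow:

```python
def apply_odds_ratio(base_win_prob, odds_ratio):
    """Multiply the odds of winning by an odds ratio, then convert back to a probability."""
    base_odds = base_win_prob / (1 - base_win_prob)
    new_odds = base_odds * odds_ratio
    return new_odds / (1 + new_odds)

# Starting from even odds (a hypothetical 50% baseline), an odds ratio of 6
# raises the win probability to 6/7, about 0.857.
print(round(apply_odds_ratio(0.5, 6.0), 3))  # 0.857
```

Note that an odds ratio of 6 does not mean the win probability itself is multiplied by 6; it scales the odds, which is why logistic regression results are reported this way.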

The next model we tried was a decision tree. As Figure 4 shows, interceptions caught was the most influential variable. The original splitting rule for interceptions caught was .5, but we changed it to a whole number of 1 because you can't have half of an interception. The tree showed that teams have a 64% chance of winning with an interception, and only a 32% chance of winning without one.
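The tree's top split reduces to a single threshold rule, sketched here with the win rates reported above. Since interception counts are integers, thresholds of .5 and 1 classify every game identically; the change was purely for readability:

```python
def tree_win_probability(interceptions_caught):
    """Top split of the decision tree on the interceptions-caught count."""
    if interceptions_caught >= 1:
        return 0.64  # reported win rate for teams that caught an interception
    return 0.32      # reported win rate for teams without one

print(tree_win_probability(2))  # 0.64
print(tree_win_probability(0))  # 0.32
```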

A neural network was then created because neural networks are capable of modeling almost any association between input and target variables. Average error was chosen as the model selection criterion. The average squared error of the network was .1868, and the misclassification rate was .2689. The final weights plot shows the negative and positive relationships between the variables.

Another neural network model was then created, this time with the regression model selecting the variables. Average error was also used as the selection criterion for this model. The iteration plot shows which iteration performed best, and the final weights plot shows the positive and negative relationships between the variables.

The final model created was an ensemble model, which combined the predictions of the previous four models to derive an answer. The default SAS settings were kept for this model. The fit statistics show an average squared error of .1916 for the training data set and .2053 for the validation data set.
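Conceptually, the combination step can be sketched as averaging the component models' predicted win probabilities for each game; this assumes the default posterior-averaging behavior, and the per-model probabilities below are hypothetical:

```python
def ensemble_posterior(model_probs):
    """Average the component models' predicted win probabilities into one posterior."""
    return sum(model_probs) / len(model_probs)

# Hypothetical predictions from four models for a single game:
print(ensemble_posterior([0.70, 0.64, 0.58, 0.60]))
```

Averaging tends to smooth out the individual models' errors, which is why the ensemble's ASE can sit close to, but not necessarily below, the best single model's.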

A model comparison was run using average squared error as its selection criterion. The comparison found that the neural network with the regression-selected variables was the best model according to this criterion, and the ROC chart supports this finding.
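Average squared error itself is a simple statistic: the mean squared gap between each game's 0/1 outcome and the model's predicted win probability. A minimal sketch, with hypothetical outcomes and predictions:

```python
def average_squared_error(outcomes, predicted_probs):
    """ASE: mean squared difference between 0/1 outcomes and predicted probabilities."""
    squared_errors = [(y - p) ** 2 for y, p in zip(outcomes, predicted_probs)]
    return sum(squared_errors) / len(squared_errors)

# Hypothetical outcomes (1 = win) and predicted win probabilities:
print(average_squared_error([1, 0, 1, 0], [0.9, 0.2, 0.7, 0.4]))
```

A lower ASE is better, so the comparison node simply ranks the candidate models by this value on the validation data.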

CONCLUSION

The neural network with regression-selected variables was the best model according to the average squared error selection criterion. With the exception of outliers, turnovers almost always decreased a team's chances of winning. Interceptions proved more costly to teams than fumbles; however, fumble recovery touchdowns helped a team's chances of winning more than interceptions returned for touchdowns. If Auburn wants to get back to its winning ways, they have to stop turning the ball over!

APPENDIX

Figure 1: Variable Distribution

Figure 2: Odds Ratio Estimate from Regression

Figure 3: Analysis of Maximum Likelihood Estimates from Regression

Figure 4: Decision Tree

Figure 5: Fit Statistics of Decision Tree

Figure 6: Final Weights of Neural Network

Figure 7: Iteration Plot of Neural Network

Figure 8: Fit Statistics of Neural Network

Figure 9: Iteration Plot of Neural Network (Regression Selecting Variables)

Figure 10: Final Weights of Neural Network (Regression Selecting Variables)

Figure 11: Fit Statistics of Ensemble

Figure 12: Fit Statistics of Model Comparison

Figure 13: ROC Chart: Win/Loss of Model Comparison

Figure 14: Our Final Workspace
