+ Kernel Methods in Data Analytics: Predicting College Football Outcomes by Logistic Regression

Theodore Trafalis, School of Industrial and Systems Engineering, University of Oklahoma

Panel in Sports Analytics, International Conference in Sports Management

Athens, Greece, May 9, 2016

Predicting College Football Outcomes · 2016. 7. 7. · Background College Football Post Season 30-40 Bowls College Football Playoffs Huge Viewership 6 Million per game in 2014 33

+ Agenda

Background

Existing Approaches

Model Development

Parameters Used

Offensive

Defensive

Interactions

Other

Results and Interpretation

Validation

Conclusions and Recommendations

Works Cited

+ Background

College Football Post Season

30-40 Bowls

College Football Playoffs

Huge Viewership

6 Million per game in 2014

33 Million watched Championship Game

Millions bet on outcomes of Bowl Games

Uncommon matchups

Lots of Uncertainty

+ Existing Approaches

Las Vegas Oddsmakers

Massey Ratings

Conglomerate of Computer Polls

History of Computer Polls in BCS era

ELO Ratings

Developed by Arpad Elo to rank Chess players

Adapted to sports, videogames, programming, etc.

S&P+ Ratings

Derived from Play by Play data

Efficiency

Explosiveness

Field Position

Finishing Drives

Turnovers

Football Power Index

Developed by ESPN

Predictive and Simulation based

+ Model Development

Binomial Logistic Regression

Classifies the dependent variable into one of two mutually exclusive categories

Can use many different types of input variables

Nominal

Ordinal

Interval

Without Interactions

With Interactions
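For reference, the standard binomial logistic regression model can be written as below. The symbols are notation assumed here, not taken from the slides: $x_j$ are the input metrics, $\beta_j$ the fitted coefficients, and $p$ the modeled probability that a team wins. The second line sketches one common way to add interaction terms (pairwise products); the later slides suggest the study used ratio-style interactions instead, which would enter the linear predictor as additional inputs.

```latex
% Without interactions
p = P(y = 1 \mid x) = \frac{1}{1 + \exp\!\bigl(-(\beta_0 + \sum_{j=1}^{k} \beta_j x_j)\bigr)}

% With pairwise interactions (one common form, assumed here)
\operatorname{logit}(p) = \ln\frac{p}{1 - p}
  = \beta_0 + \sum_{j=1}^{k} \beta_j x_j + \sum_{j < l} \beta_{jl}\, x_j x_l
```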

+ Model Development

Parameter Fitting

36 Games from 2014 used as training data

Method of Least Squares

Program Used

Excel’s Solver GRG package
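As a rough illustration of least-squares parameter fitting for a logistic model, the sketch below minimizes the sum of squared errors between predicted probabilities and 0/1 outcomes. It uses plain gradient descent as a stand-in for Excel's GRG solver (a different optimizer), and the one-feature toy data, the function names, and the step settings are all hypothetical.

```python
import math

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(beta, x):
    # beta[0] is the intercept; beta[1:] pair with the features in x.
    return sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], x)))

def sse(beta, X, y):
    # Least-squares objective: squared gap between probability and outcome.
    return sum((predict(beta, x) - t) ** 2 for x, t in zip(X, y))

def fit_least_squares(X, y, lr=0.5, steps=5000):
    # Gradient descent on the mean squared error (stand-in for GRG).
    beta = [0.0] * (len(X[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for x, t in zip(X, y):
            p = predict(beta, x)
            g = 2.0 * (p - t) * p * (1.0 - p)  # d(squared error)/d(linear predictor)
            grad[0] += g
            for j, v in enumerate(x):
                grad[j + 1] += g * v
        beta = [b - lr * g / len(X) for b, g in zip(beta, grad)]
    return beta

# Hypothetical toy data: one scaled feature per game, 1 = win, 0 = loss.
X = [[-1.5], [-1.0], [-0.5], [-0.2], [0.3], [0.6], [1.0], [1.4]]
y = [0, 0, 0, 1, 0, 1, 1, 1]
beta = fit_least_squares(X, y)
```

Logistic regression is more commonly fitted by maximum likelihood; least squares is shown here only because it is the method the slide names.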

+ Parameters Used

Offensive Metrics

8 Team Statistics

Consider both teams

Defensive Metrics

8 Team Statistics

Consider both teams

Quarterback Rating

Conversion Metrics

Disruptive Metrics

Penalty Metrics

Outside Rankings

Conference

Interaction Effects

+ Offensive Metrics

Yardage Metrics

Total Yards

Total Yards per Game

Passing Yards

Passing Yards per Game

Rushing Yards

Rushing Yards per Game

Scoring Metrics

Total Points Scored

Points Scored per Game

+ Defensive Metrics

Yardage Metrics

Total Yards Allowed

Total Yards Allowed per Game

Passing Yards Allowed

Passing Yards Allowed per Game

Rushing Yards Allowed

Rushing Yards Allowed per Game

Scoring Metrics

Total Points Allowed

Points Allowed per Game

+ Quarterback Rating

Abbreviated QBR

Measure of Quarterback Quality

Completion %

Yards per Attempt

Touchdown %

Interception %
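The four components listed above match the NCAA passing efficiency formula, so a sketch of that formula is given here. This is an assumption: the slide abbreviates the metric as QBR, a name also used for ESPN's proprietary Total QBR, and the slides do not say which rating the study used. The function name and the sample stat line are hypothetical.

```python
def ncaa_passing_efficiency(attempts, completions, yards, touchdowns, interceptions):
    """NCAA passing efficiency (assumed to be the rating meant by the slide)."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return (8.4 * yards + 330 * touchdowns + 100 * completions
            - 200 * interceptions) / attempts

# Hypothetical stat line: 60 of 100 passing, 800 yards, 5 TD, 2 INT.
rating = ncaa_passing_efficiency(100, 60, 800, 5, 2)

# Same value from the four rate components named on the slide:
# 8.4 * yards/attempt + 3.3 * TD% + completion% - 2 * INT%  (percentages in percent units)
rating_from_rates = 8.4 * 8.0 + 3.3 * 5.0 + 60.0 - 2.0 * 2.0
```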

+ Conversion and Conference Metrics

Conversion Metrics

Measure of a team’s ability to maintain possession of the ball.

Number of First Downs

3rd Down Conversion Rate

4th Down Conversion Rate

Conference Metric

Power 5 vs. Group of 5

+ Disruptive Metrics

Defensive Disruption

Measure of Defense’s ability to disrupt Offense

Sacks

Interceptions

Fumbles

Offensive Disruption

Measure of team discipline and ability to keep offensive drives on track

Total Penalty Yards Assessed

+ Outside Rankings

Massey Ranking

Conglomeration of Computer Polls

Las Vegas Spread

Set by Las Vegas bookmakers

Goal is to ensure equal betting on both teams

Negative spread means team is favored
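A minimal sketch of the sign convention, assuming the spread is quoted from one team's point of view. The function and team names are hypothetical.

```python
def vegas_pick(team_a, team_b, spread_a):
    """Return the favored team given team_a's point spread.

    A negative spread means team_a is favored; a positive one means
    team_b is favored; zero (a "pick'em") yields no favorite.
    """
    if spread_a < 0:
        return team_a
    if spread_a > 0:
        return team_b
    return None

favorite = vegas_pick("Team A", "Team B", -7.5)
```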

+ Interaction Effects

Offense vs. Defense

Total Yards vs. Total Yards Allowed

Points Scored per Game vs. Points Allowed per Game

Rushing Yards per Game vs. Rushing Yards Allowed per Game

Passing Yards per Game vs. Passing Yards Allowed per Game

QBR vs. Defense

Disruptive

Sack Ratio

Interception Ratio

Fumble Ratio

Conversion

1st Down Ratio

3rd Down Conversion Ratio

4th Down Conversion Ratio

Penalties

Penalty Yard Ratio

Ranking

Massey Ranking Ratio

Vegas Spread
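The slides do not define how each "vs." or "Ratio" term is computed. The sketch below assumes a simple quotient of one team's statistic over the matching statistic of its opponent; the dictionary keys and the sample numbers are hypothetical.

```python
def ratio(a, b):
    # Guard against division by zero (e.g. an opponent with no sacks).
    return a / b if b else float("nan")

def interaction_features(team, opp):
    """Build a few ratio-style interaction inputs for one team in a matchup."""
    return {
        # Offense vs. defense: own production over what the opponent allows.
        "total_yards_vs_allowed": ratio(team["total_yards"], opp["total_yards_allowed"]),
        "ppg_vs_allowed": ratio(team["points_per_game"], opp["points_allowed_per_game"]),
        # Like-for-like ratios between the two teams.
        "sack_ratio": ratio(team["sacks"], opp["sacks"]),
        "third_down_ratio": ratio(team["third_down_rate"], opp["third_down_rate"]),
    }

# Hypothetical season totals for two bowl opponents.
team = {"total_yards": 6000, "points_per_game": 36.0, "sacks": 30, "third_down_rate": 0.45}
opp = {"total_yards_allowed": 5000, "points_allowed_per_game": 24.0, "sacks": 20,
       "third_down_rate": 0.40}
features = interaction_features(team, opp)
```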

+ Model Results

+ Model Results

+ Model Results—Unexpected

+ Model Fit

Able to correctly categorize 73/79 (92.4%) CFB bowl outcomes from 2014.

Vegas: 58.3%

Massey: 61.1%

ELO: 66.7%

S&P+: 63.9%

FPI: 58.3%
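The comparison percentages can be turned back into game counts. The sketch below assumes the existing methods were scored on the same 36 games named on the Parameter Fitting slide (the slides do not state this) and finds, for each reported percentage, the number of correct picks that rounds to it.

```python
N_GAMES = 36  # training games stated on the Parameter Fitting slide

reported = {"Vegas": 58.3, "Massey": 61.1, "ELO": 66.7, "S&P+": 63.9, "FPI": 58.3}

def implied_correct(pct, n):
    """Return every count k for which k/n rounds to pct (one decimal place)."""
    return [k for k in range(n + 1) if round(100 * k / n, 1) == pct]

implied = {name: implied_correct(pct, N_GAMES) for name, pct in reported.items()}
```

Under that assumption each percentage maps to exactly one count, between 21 and 24 correct picks out of 36.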

+ Cross Validation—2013

Ran model for 2013 bowl games

Model Accuracy dropped to 57.1%

Existing methods also dropped in accuracy

Vegas: 62.9%

Massey: 60%

ELO: 60%

S&P+: 54.3%

FPI: 51.4%

+ Cross Validation—2012

Ran model for 2012 bowl games

Model Accuracy dropped to 55.7%

Existing methods also dropped in accuracy

Vegas: 60%

Massey: 60%

ELO: 51%

S&P+: 48.2%

FPI: 45.2%

+ Conclusions and Recommendations

Conclusions

Model Works for Categorizing Games a posteriori

Not great for a priori predictions

Model is competitive with other predictive measures

Recommendations

Use ANOVA to filter out unwanted parameters

Incorporate other predictive measures

Path Forward

Compare Binomial Multiple Logistic Regression to SVM

+ Works Cited

[1] Rishe, P. (2015, January 15). Reviewing The 2014-15 Bowl Season: Highest Bowl Game Prices, Attendances, And TV Ratings. Retrieved November 30, 2015, from http://www.forbes.com/sites/prishe/2015/01/15/reviewing-the-2014-15-bowl-season-highest-bowl-game-prices-attendances-and-tv-ratings/

[2] Purdum, D. (2015, January 30). Wagers, bettor losses set record. Retrieved November 30, 2015, from http://espn.go.com/chalk/story/_/id/12253876/nevada-sports-bettors-wagered-lost-more-ever-2014

[3] World Football Elo Ratings: Rating System. (n.d.). Retrieved November 30, 2015, from http://www.eloratings.net/system.html

[4] Football Outsiders. (n.d.). Retrieved November 30, 2015, from http://www.footballoutsiders.com/stats/ncaa

[5] Hosmer, D., & Lemeshow, S. (1989). Introduction to the Logistic Regression Model. In Applied logistic regression (pp. 1-30). New York: Wiley.

[6] Diaz, A., Tomba, E., Lennarson, R., Richard, R., Bagajewicz, M., & Harrison, R. (2010). Prediction of protein solubility in Escherichia coli using logistic regression. Biotechnology and Bioengineering, 374-383.

+ Predicting Major League Baseball Championship Winners through SVMs

Attribute                                           Category
Total Runs Scored (R)                               Offensive
Stolen Bases (SB)                                   Offensive
Batting Average (AVG)                               Offensive
On Base Percentage (OBP)                            Offensive
Slugging Percentage (SLG)                           Offensive
Team Wins                                           Record
Team Losses                                         Record
Earned Run Average (ERA)                            Pitching
Save Percentage                                     Pitching
Strikeouts per nine innings (K/9)                   Pitching
Opponent Batting Average (AVG)                      Pitching
Walks plus hits per inning pitched (WHIP)           Pitching
Fielding Independent Pitching (FIP)                 Pitching
Double Plays turned (DP)                            Defensive
Fielding Percentage (FP)                            Defensive
Wins Above Replacement (WAR)                        Baserunning, Hitting, and Fielding

Table 1 Summary of attributes evaluated in this study

+ SVM

The SVM algorithm, developed by Vapnik (1998), is frequently applied in machine learning.

The SVM algorithm for binary classification problems constructs a hyperplane that separates a set of training vectors into two classes (e.g., tornadoes vs. non-tornadoes).

The objectives of SVMs for the primal problem are to maximize the margin of separation and to minimize the misclassification error.

We utilize the probabilistic outputs for SVMs proposed by Platt (1999).
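For reference, the two objectives above correspond to the standard soft-margin primal problem, and Platt's method maps the SVM decision value to a probability through a fitted sigmoid. The notation is assumed here rather than taken from the slides: $w$, $b$ define the hyperplane, $\phi$ is the feature map, $\xi_i$ are slack variables, $C$ is the misclassification penalty, $f(x) = w^\top \phi(x) + b$ is the decision value, and $A$, $B$ are the sigmoid parameters fitted on held-out decision values.

```latex
% Soft-margin primal: a small \|w\| gives a wide margin, the slack sum penalizes errors
\min_{w,\, b,\, \xi} \quad \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
\qquad \text{s.t.} \quad y_i \bigl(w^\top \phi(x_i) + b\bigr) \ge 1 - \xi_i, \quad \xi_i \ge 0

% Platt (1999) probabilistic output
P(y = 1 \mid x) = \frac{1}{1 + \exp\bigl(A\, f(x) + B\bigr)}
```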

+ Illustration of SVM

+ American League Pennant Model Selection

Standard SVM

Imbalanced data, bias towards L (majority class)

Confusion matrix values (count, % of true class): 63 (100%), 20 (100%), 0 (0%), 0 (0%)
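The sketch below shows why this result is a problem even though its raw accuracy looks acceptable. It reads the confusion matrix as 63 L samples and 20 W samples, all predicted L; that reading is an assumption based on the "bias towards L" note, and the variable names are hypothetical.

```python
n_L, n_W = 63, 20             # class sizes read from the confusion matrix
correct_L, correct_W = 63, 0  # every sample is predicted L

accuracy = (correct_L + correct_W) / (n_L + n_W)
recall_L = correct_L / n_L    # share of true L samples found
recall_W = correct_W / n_W    # share of true W samples found
balanced_accuracy = (recall_L + recall_W) / 2
```

Raw accuracy is about 76% simply because L dominates the data, while no W sample is ever identified, so balanced accuracy sits at the 50% chance level.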

+ Different RBF Classifiers

Panels: Gaussian, γ=50; Gaussian, γ=300; Gaussian, γ=30,000; Gaussian, γ=300,000
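To see what changing γ does, the sketch below evaluates a Gaussian RBF kernel for one pair of nearby points at the four γ values shown. It assumes the common parameterization K(x, z) = exp(-γ‖x − z‖²); the slides do not state which parameterization their software used, and the two sample points are hypothetical.

```python
import math

def gaussian_rbf(x, z, gamma):
    """Gaussian RBF kernel, K(x, z) = exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

# Two hypothetical teams with nearly identical two-attribute profiles.
x, z = (0.30, 0.45), (0.31, 0.44)

gammas = (50, 300, 30_000, 300_000)
similarities = [gaussian_rbf(x, z, g) for g in gammas]
```

With small γ the two points look almost identical to the kernel; with very large γ their similarity collapses toward zero, so each training point influences only its immediate neighborhood and the decision boundary can wrap tightly around individual samples.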

+ Comparison of the accuracy of different Gaussian RBF classifiers

Model    Cost for Majority (L)   Cost for Minority (W)   gamma (γ)   Accuracy (%)
Case 1   1                       3                       50          80.7
Case 2   1                       3                       300         71.8
Case 3   1                       3                       30,000      86.7
Case 4   1                       3                       300,000     98.8
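One common way for per-class costs to enter an SVM is through separate slack penalties for each class, sketched below. This is an assumption about how the costs of 1 (majority, L) and 3 (minority, W) were applied; the slides do not give the formulation. Labels are encoded as $y_i = -1$ for L and $y_i = +1$ for W, and $\phi$ is the feature map induced by the Gaussian kernel.

```latex
\min_{w,\, b,\, \xi} \quad \frac{1}{2}\|w\|^2
  + C_L \sum_{i:\, y_i = -1} \xi_i
  + C_W \sum_{i:\, y_i = +1} \xi_i
\qquad \text{s.t.} \quad y_i \bigl(w^\top \phi(x_i) + b\bigr) \ge 1 - \xi_i, \quad \xi_i \ge 0,
\qquad C_L = 1, \; C_W = 3
```

Penalizing errors on W three times as heavily as errors on L counteracts the tendency to predict the majority class for every sample.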

+

Sweet Spot

+

Confusion matrix values (count, % of true class): 47 (74.6%), 8 (40.0%), 12 (60.0%), 16 (25.4%)

Classifier selected to perform prediction on the American League pennant race

+ National League Pennant Model Selection

[Figure: four classifier panels, each annotated "Sweet Spot"]

+

SVM Model      Cost for Majority (L)   Cost for Minority (W)   gamma (γ)   Accuracy (%)
Gaussian RBF   1                       3                       30,000      77.1

Confusion matrix values (count, % of true class): 49 (77.8%), 5 (25.0%), 15 (75.0%), 14 (22.2%)

The classifier selected to perform prediction on the National League pennant race

+

Confusion matrix values (count, % of true class): 27 (64.3%), 21 (50.0%), 21 (50.0%), 15 (35.7%)

Figure 7 shows the best classifier acquired from the Linear Kernel SVM algorithm

Confusion matrix values (count, % of true class): 33 (78.6%), 24 (57.1%), 18 (42.9%), 9 (21.4%)

Figure 8 shows the best classifier acquired from the Quadratic Kernel SVM algorithm

Confusion matrix values (count, % of true class): 36 (85.7%), 21 (50.0%), 21 (50.0%), 6 (14.3%)

Figure 9 shows the best classifier acquired from the Cubic Kernel SVM algorithm

Confusion matrix values (count, % of true class): 28 (66.7%), 12 (28.6%), 30 (71.4%), 14 (33.3%)

Figure 10 shows the best classifier acquired from the Gaussian Kernel RBFSVM algorithm

Machine Learning Algorithm   Model Accuracy   Attribute (1)          Attribute (2)
Linear Kernel SVM            57.1%            Fielding Percentage    Batting Average
Quadratic Kernel SVM         60.7%            WHIP                   ERA
Cubic Kernel SVM             67.9%            Double Plays turned    Wins
Gaussian Kernel RBF          69%              SLG                    Double plays turned

Table 4 compares the accuracy values of the best classifier for each SVM algorithm.
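The accuracies in Table 4 can be cross-checked against the four counts shown with Figures 7 to 10. The sketch below assumes, for each figure, that the first and third counts are the correctly classified samples of the two classes and that the other two are misclassifications; that reading is inferred from the percentages, not stated in the slides.

```python
# Counts in the order they appear with Figures 7-10.
figure_counts = {
    "Linear Kernel SVM": (27, 21, 21, 15),
    "Quadratic Kernel SVM": (33, 24, 18, 9),
    "Cubic Kernel SVM": (36, 21, 21, 6),
    "Gaussian Kernel RBF": (28, 12, 30, 14),
}

def accuracy_pct(cells):
    """Accuracy in percent, taking cells 1 and 3 as the correct predictions."""
    first_correct, _, second_correct, _ = cells
    return round(100 * (first_correct + second_correct) / sum(cells), 1)

accuracies = {name: accuracy_pct(cells) for name, cells in figure_counts.items()}
```

Each set of counts sums to 84 samples, and the resulting accuracies agree with the Model Accuracy column.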

Figure 11 shows the playoff results of the 2015 Major League Baseball season

American League National League

World Series

Machine Learning Algorithm   Model Accuracy   World Series Winner (W)   World Series Loser (L)
Actual Result                -
Linear Kernel SVM            57.1%
Quadratic Kernel SVM         60.7%
Cubic Kernel SVM             67.9%
Gaussian RBF Kernel SVM      69%

Table 5 Prediction results of the 2015 World Series for several classifiers

+

End of Presentation

Contact: [email protected]