Simple Input: Sophisticated Suggestions for Major League Baseball
• Use what you already know
• Focus your efforts
• Find the best
December 10th 2015
Adharsh, Deepika, Kaushik, Richard, Srinivas, Yueying
Agenda
• Problem Definition
• Finding similar players
• Identifying patterns among batters/fielders/pitchers
• Next Steps
Major League Baseball Data
2014 Major League Baseball
1,320 players × 15+ metrics per player × 162+ games per season = Information overload!
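As a rough lower bound, taking each "+" figure on this slide at its stated minimum (an assumption, since not every player appears in every game), the product works out to:

```latex
1{,}320 \times 15 \times 162 = 3{,}207{,}600 \ \text{player-metric-game values}
```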
Gap Analysis
Current State
• The manager of a baseball team has difficulty in identifying similarity/trends among players
• The manager would like to have a solid technique to identify prospects for old-fashioned scouting
• 1,320 players made 2014 MLB appearances. No one can keep track of them all
Future State
• By leveraging analytics the team manager is able to evaluate 1300+ players
• For any player, a similar performer can be found and salaries can be compared
• Describe a hole in your team, and find a player to fill it in
Gap
• Lack of clarity on how to find similarity among players
• Identifying groups of players
Key Questions
• What your team needs
• What the available budget is
Use-case: Player Replacement
• Name a player you’d like to replace
• Find the top 5 players with similar outcomes
• Evaluate by age, salary, and summary performance metrics
Sophisticated Suggestions
Use What You Already Know
A simple set of questions you already know the answer to:
1. What position do you need?
2. What can you pay?
3. What sort of player are you interested in?
Gets you the information you need to focus on:
• Player name
• Player age
• Summary performance metric
• Salary (where available)
Use-case: Fill a hole
• Requirement: I need an outfielder
• How much can you pay? I can pay $14M
• What do you want to emphasize? Batting performance
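The three answers in this use case (position, budget, emphasis) map naturally onto a filter-then-rank query. The sketch below is only an illustration of that idea: the `Player` fields, the `batting_score` summary metric, and the sample roster are all hypothetical and not taken from the deck.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Player:
    name: str
    category: str         # position category, e.g. "Outfielder"
    salary: float         # annual salary in USD
    batting_score: float  # hypothetical summary metric, higher is better


def fill_hole(players: List[Player], category: str, budget: float,
              top_n: int = 5) -> List[Player]:
    """Keep affordable players in the requested category, best batters first."""
    candidates = [p for p in players
                  if p.category == category and p.salary <= budget]
    candidates.sort(key=lambda p: p.batting_score, reverse=True)
    return candidates[:top_n]


# Made-up roster, for illustration only.
roster = [
    Player("Player A", "Outfielder", 12_000_000, 0.31),
    Player("Player B", "Outfielder", 16_000_000, 0.35),  # over budget
    Player("Player C", "Infielder", 9_000_000, 0.33),    # wrong position
    Player("Player D", "Outfielder", 5_000_000, 0.27),
]

# "I need an outfielder", "I can pay $14M", emphasize batting performance.
suggestions = fill_hole(roster, "Outfielder", 14_000_000)
print([p.name for p in suggestions])
```

Swapping the sort key is enough to emphasize a different kind of performance.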
Under the hood
Grouped players into position categories
• Pitchers
• Outfielders
• Infielders
• Catchers
• Designated Hitters
Data Source
• Lahman’s Baseball Database
Found groups of players based on simple performance metrics
• Hits
• Errors
• Stolen Bases etc.
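The grouping into five position categories could be done with a simple lookup from position code to category. This is a sketch under assumptions: the position codes are the usual scorekeeping abbreviations, and assigning a multi-position player to the category with the most games is a rule chosen here for illustration, since the slides do not say how such players were handled.

```python
# Assumed mapping from position code to the deck's five categories.
POSITION_CATEGORY = {
    "P": "Pitcher",
    "C": "Catcher",
    "1B": "Infielder", "2B": "Infielder", "3B": "Infielder", "SS": "Infielder",
    "LF": "Outfielder", "CF": "Outfielder", "RF": "Outfielder", "OF": "Outfielder",
    "DH": "Designated Hitter",
}


def primary_category(games_by_position):
    """Return the category in which a player appeared in the most games.

    games_by_position maps a position code to games played at that position.
    """
    totals = {}
    for position, games in games_by_position.items():
        category = POSITION_CATEGORY[position]
        totals[category] = totals.get(category, 0) + games
    return max(totals, key=totals.get)


# Hypothetical player: 90 games in the outfield, 60 at first base.
example = primary_category({"LF": 40, "RF": 50, "1B": 60})
print(example)
```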
Next Steps
Improvement
• Refine performance inputs (combine, weight, or remove metrics)
In-depth analyses
• Implement more granular searches: Player availability?
Include more leagues
• Expand to include minor league players
Thank You
Appendix A: Simple Clustering Example
• Designated Hitter: A player that bats in place of the pitcher
• Only present on American League teams
• Only bats, so only batting statistics are present
• Express all stats as a ratio with At-Bats, and then normalize
Two or three clusters?
Prioritize Interpretability
cluster  R         H         X2B       X3B       HR        RBI       SB        CS        BB        SO        IBB       HBP       SH        SF        GIDP
1        0.116499  0.252142  0.051058  0.004807  0.021991  0.112137  0.014252  0.006324  0.089313  0.192674  0.006069  0.01034   0.005437  0.008403  0.024573
2        0.133313  0.249521  0.051245  0.002902  0.045083  0.152946  0.006154  0.003213  0.115275  0.250168  0.008893  0.010132  0.001179  0.008524  0.024769
cluster  R         H         X2B       X3B       HR        RBI       SB        CS        BB        SO        IBB       HBP       SH        SF        GIDP
1        0.125511  0.254096  0.050459  0.007473  0.019166  0.106171  0.022868  0.009847  0.09595   0.203093  0.005079  0.009853  0.007656  0.007595  0.019049
2        0.106703  0.250018  0.05171   0.001909  0.025062  0.118621  0.004885  0.002495  0.082099  0.181349  0.007145  0.010868  0.003024  0.009282  0.030577
3        0.133313  0.249521  0.051245  0.002902  0.045083  0.152946  0.006154  0.003213  0.115275  0.250168  0.008893  0.010132  0.001179  0.008524  0.024769
With two clusters, one group swings for power (Home Runs, Strikeouts), while the other works to score runs once on base, or with a teammate on base (Stolen Bases, Sacrifice Hits).
With three clusters, the large differences appear in rare metrics, such as Triples and Grounded Into Double Play.
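The preprocessing named in this appendix (express each stat as a ratio with At-Bats, then normalize) followed by clustering can be sketched in plain Python. Two assumptions are made here: the slides do not name the clustering algorithm, so k-means is used as a stand-in, and "normalize" is read as a per-column z-score. The six players and their counts are made up.

```python
import math


def to_ratios(stat_rows, at_bats):
    """Divide every counting stat by the player's At-Bats."""
    return [[value / ab for value in row] for row, ab in zip(stat_rows, at_bats)]


def zscore(rows):
    """Normalize each column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [sum(c) / len(c) for c in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(columns, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def kmeans(points, k, iterations=50):
    """Plain k-means with a deterministic farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(
            points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids


# Made-up season counts per player: HR, SO, SB, SH.
counts = [
    [30, 150, 1, 0], [28, 140, 2, 0], [35, 160, 0, 1],  # power hitters
    [5, 60, 25, 8], [4, 55, 30, 10], [6, 70, 20, 7],    # speed and small ball
]
at_bats = [500, 480, 520, 500, 490, 510]

labels, centroids = kmeans(zscore(to_ratios(counts, at_bats)), k=2)
print(labels)
```

Running the same call with `k=3` gives the alternative the slide weighs. Comparing the centroids by eye is one way to apply the "prioritize interpretability" rule.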
Appendix B: Complex Clustering Example
• Catcher
• Batting: same as all other players
• Fielding is different from other players, with several rare outcomes tracked
• Collect fielding data only from games played as catcher
• Measure as ratio with games played
• Join with all batting outcome ratios
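The catcher feature construction listed above (fielding counts per game caught, joined with the batting ratios) might look like the following. The player IDs, the stat names `PB` and `CS`, and the `_per_game` suffix are illustrative assumptions. The suffix is there to keep fielding columns from colliding with batting columns of the same name.

```python
def catcher_features(batting_ratios, fielding_counts, games_as_catcher):
    """Join per-At-Bat batting ratios with per-game catcher fielding ratios.

    All three arguments are dicts keyed by player ID. Players with no games
    at catcher are left out.
    """
    features = {}
    for player_id, batting in batting_ratios.items():
        games = games_as_catcher.get(player_id, 0)
        if games == 0:
            continue
        fielding = {
            f"{stat}_per_game": count / games
            for stat, count in fielding_counts.get(player_id, {}).items()
        }
        features[player_id] = {**batting, **fielding}
    return features


# Made-up inputs for two hypothetical players.
batting_ratios = {"catcher01": {"H": 0.25, "HR": 0.03},
                  "fielder02": {"H": 0.28, "HR": 0.02}}
fielding_counts = {"catcher01": {"PB": 10, "CS": 25}}
games_as_catcher = {"catcher01": 100}

features = catcher_features(batting_ratios, fielding_counts, games_as_catcher)
print(features)
```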
Two or four clusters?
Is the outlier cluster useful?
Filtering out inactive players, the four clusters have sizes: 27, 5, 15, and 29.
A 5-player cluster is not much use, so we use two clusters instead.
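The decision above can be written as a small size check: reject any clustering whose smallest group falls below a usefulness threshold. In this sketch the four-cluster sizes come from the slide, while the two-cluster split (40 and 36), the `min_size` threshold of 10, and the rule "prefer the largest acceptable k" are assumptions for illustration.

```python
from collections import Counter


def cluster_sizes(labels):
    """Return cluster sizes, largest first."""
    return sorted(Counter(labels).values(), reverse=True)


def pick_k(labelings, min_size):
    """Return the largest k whose smallest cluster has at least min_size members.

    labelings maps each candidate k to the list of cluster labels it produced.
    Returns None when no candidate qualifies.
    """
    usable = [k for k, labels in labelings.items()
              if min(Counter(labels).values()) >= min_size]
    return max(usable) if usable else None


# Four-cluster sizes from the slide: 27, 5, 15, and 29 (76 catchers in total).
four = [0] * 27 + [1] * 5 + [2] * 15 + [3] * 29
# Made-up two-cluster split of the same 76 catchers.
two = [0] * 40 + [1] * 36

chosen_k = pick_k({2: two, 4: four}, min_size=10)
print(cluster_sizes(four), chosen_k)
```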
Appendix C: Player similarity
• The performance metric for batting and pitching is calculated
• Batting:
• Pitching¹:
• Using this, the similarity matrix is computed using Euclidean distance
• For each player, the 3 most similar players are selected (who aren’t on the same team)
¹ The lower the ERA, the better
Baseball glossary: http://baseball.about.com/od/termstatglossar1/a/statsglossary.htm
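The selection step in this appendix (Euclidean distance, then the 3 closest players from other teams) can be sketched as below. The batting and pitching metric formulas are not reproduced in this transcript, so the metric vectors here are placeholder numbers, and the player and team names are hypothetical.

```python
import math


def most_similar(players, target, n=3):
    """Return the n players closest to target, skipping the target's teammates.

    players maps a player name to a (team, metric_vector) pair.
    """
    target_team, target_vector = players[target]
    distances = [
        (math.dist(target_vector, vector), name)
        for name, (team, vector) in players.items()
        if name != target and team != target_team
    ]
    return [name for _, name in sorted(distances)[:n]]


# Placeholder metric vectors, for illustration only.
players = {
    "Player A": ("Team 1", [0.30, 0.10]),
    "Player B": ("Team 1", [0.30, 0.10]),  # same team as A, so excluded
    "Player C": ("Team 2", [0.31, 0.11]),
    "Player D": ("Team 3", [0.25, 0.05]),
    "Player E": ("Team 4", [0.29, 0.10]),
    "Player F": ("Team 5", [0.40, 0.20]),
}

matches = most_similar(players, "Player A")
print(matches)
```

Passing `n=5` gives the top-5 list described in the Player Replacement use case.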
Appendix D: Age vs Batting Average
Appendix E: Age vs ERA