
Clustering of Players in Major League Baseball


Page 1: Clustering of Players in Major League Baseball

Simple Input: Sophisticated Suggestions for Major League Baseball

• Use what you already know
• Focus your efforts
• Find the best

December 10th 2015

Adharsh, Deepika, Kaushik, Richard, Srinivas, Yueying

Page 2: Clustering of Players in Major League Baseball

Agenda

• Problem Definition
• Finding similar players
• Identifying patterns among batters/fielders/pitchers
• Next Steps

Page 3: Clustering of Players in Major League Baseball

Major League Baseball Data

2014 Major League Baseball

Page 4: Clustering of Players in Major League Baseball

Major League Baseball Data

2014 Major League Baseball

1,320 players

Page 5: Clustering of Players in Major League Baseball

Major League Baseball Data

2014 Major League Baseball

1,320 players × 15+ metrics per player

Page 6: Clustering of Players in Major League Baseball

Major League Baseball Data

2014 Major League Baseball

1,320 players × 15+ metrics per player × 162+ games per season =

Page 7: Clustering of Players in Major League Baseball

Major League Baseball Data

2014 Major League Baseball

1,320 players × 15+ metrics per player × 162+ games per season =

Information overload!
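Taking the slide's figures at face value, and treating "15+" and "162+" as lower bounds, the product can be worked out as below. This is back-of-the-envelope arithmetic rather than a count taken from the data: not every player appears in every game, so it gives the size of the full player-by-metric-by-game grid.

```python
# Back-of-the-envelope size of the 2014 season data, using the slide's figures.
players = 1_320           # players with a 2014 MLB appearance
metrics_per_player = 15   # "15+" on the slide, so a lower bound
games_per_season = 162    # "162+" on the slide, so a lower bound

data_points = players * metrics_per_player * games_per_season
print(f"{data_points:,} player-metric-game values")  # 3,207,600
```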

Page 8: Clustering of Players in Major League Baseball

Gap Analysis

Current State
• The manager of a baseball team has difficulty identifying similarities and trends among players
• The manager would like a solid technique for identifying prospects for old-fashioned scouting
• 1,320 players made 2014 MLB appearances. No one can keep track of them all

Future State
• By leveraging analytics, the team manager is able to evaluate 1,300+ players
• For any player, a similar performer can be found and salaries can be compared
• Describe a hole in your team, and find a player to fill it

Gap
• Lack of clarity on how to find similarity among players
• Identifying groups of players

Key Questions
• What your team needs
• What the available budget is

Page 9: Clustering of Players in Major League Baseball

Use-case: Player Replacement

• Name a player you’d like to replace
• Find the top 5 players with similar outcomes
• Evaluate by age, salary, and summary performance metrics

Page 10: Clustering of Players in Major League Baseball


Sophisticated Suggestions

Page 11: Clustering of Players in Major League Baseball

Use What You Already Know

A simple set of questions you already know the answer to:

1. What position do you need?
2. What can you pay?
3. What sort of player are you interested in?

Gets you the information you need to focus on:

• Player name
• Player age
• Summary performance metric
• Salary (where available)

Page 12: Clustering of Players in Major League Baseball

Use-case: Fill a hole

• Requirement: I need an outfielder
• How much can you pay? I can pay $14M
• What do you want to emphasize? Batting performance
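The three inputs in this use-case (position, budget, emphasis) map naturally onto a filter followed by a sort. The sketch below is illustrative only: the `Candidate` record, its field names, the `fill_hole` helper, and the sample players are all hypothetical and are not taken from the slides.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    position: str         # e.g. "OF" for outfielder (hypothetical coding)
    salary: float         # US dollars
    batting_score: float  # hypothetical summary batting metric, higher is better

def fill_hole(candidates, position, budget, emphasis="batting_score", top_n=5):
    """Return up to top_n affordable players at a position, best metric first."""
    eligible = [c for c in candidates
                if c.position == position and c.salary <= budget]
    return sorted(eligible, key=lambda c: getattr(c, emphasis), reverse=True)[:top_n]

# Made-up players, for illustration only.
pool = [
    Candidate("Player A", "OF", 12_000_000, 0.291),
    Candidate("Player B", "OF", 16_500_000, 0.305),  # over the $14M budget
    Candidate("Player C", "OF", 9_000_000, 0.274),
    Candidate("Player D", "C", 7_000_000, 0.310),    # wrong position
]

shortlist = fill_hole(pool, position="OF", budget=14_000_000)
print([c.name for c in shortlist])
```

Passing a different attribute name as `emphasis` would rank the same shortlist on another metric.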

Page 13: Clustering of Players in Major League Baseball


Sophisticated Suggestions

Page 14: Clustering of Players in Major League Baseball

Under the hood

Grouped players into position categories
• Pitchers
• Outfielders
• Infielders
• Catchers
• Designated Hitters

Data Source
• Lahman’s Baseball Database

Found groups of players based on simple performance metrics
• Hits
• Errors
• Stolen Bases, etc.
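The slides do not say how a player who appears at several positions is assigned to a category. One plausible rule, assumed here, is to pick the category in which the player logged the most games. The position codes, the `CATEGORY` mapping, the `primary_category` helper, and the sample rows are assumptions for illustration.

```python
from collections import defaultdict

# Assumed mapping from position codes to the five categories on the slide.
# The codes themselves are an assumption about how positions are labelled.
CATEGORY = {
    "P": "Pitcher", "C": "Catcher", "DH": "Designated Hitter",
    "1B": "Infielder", "2B": "Infielder", "3B": "Infielder", "SS": "Infielder",
    "OF": "Outfielder", "LF": "Outfielder", "CF": "Outfielder", "RF": "Outfielder",
}

def primary_category(rows):
    """rows: iterable of (player_id, position_code, games).

    Returns {player_id: category with the most games played}.
    """
    games = defaultdict(lambda: defaultdict(int))
    for player_id, pos, g in rows:
        games[player_id][CATEGORY[pos]] += g
    return {pid: max(by_cat, key=by_cat.get) for pid, by_cat in games.items()}

# Made-up appearance counts, for illustration only.
rows = [
    ("p01", "SS", 100), ("p01", "2B", 30), ("p01", "OF", 5),
    ("p02", "C", 90), ("p02", "1B", 20),
    ("p03", "P", 33),
]
categories = primary_category(rows)
print(categories)
```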

Page 15: Clustering of Players in Major League Baseball

Next Steps

Improvement
• Refine performance inputs (combine, weight, or remove metrics)

In-depth analyses
• Implement more granular searches: Player availability?

Include more leagues
• Expand to include minor league players

Page 16: Clustering of Players in Major League Baseball


Thank You

Page 17: Clustering of Players in Major League Baseball

Appendix A: Simple Clustering Example

• Designated Hitter: A player who bats in place of the pitcher
• Only present on American League teams
• Only bats, so only batting statistics are present
• Express all stats as a ratio with At-Bats, and then normalize
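A minimal sketch of the "ratio with At-Bats, then normalize" step follows. The slide does not say which normalization was used; z-scoring each column is assumed here as one common choice. The function names and the sample season totals are hypothetical.

```python
from statistics import mean, pstdev

def per_at_bat(stats, at_bats_key="AB"):
    """Divide every counting stat by At-Bats, dropping the At-Bats column itself."""
    ab = stats[at_bats_key]
    return {k: v / ab for k, v in stats.items() if k != at_bats_key}

def z_normalize(rows):
    """Scale each column to zero mean and unit (population) standard deviation."""
    columns = list(rows[0].keys())
    mu = {c: mean([r[c] for r in rows]) for c in columns}
    sd = {c: pstdev([r[c] for r in rows]) for c in columns}
    # A constant column carries no information; map it to 0.0 rather than divide by zero.
    return [{c: (r[c] - mu[c]) / sd[c] if sd[c] else 0.0 for c in columns}
            for r in rows]

# Made-up season totals for three designated hitters, for illustration only.
season = [
    {"AB": 500, "H": 150, "HR": 30, "SO": 120},
    {"AB": 400, "H": 100, "HR": 8, "SO": 60},
    {"AB": 600, "H": 150, "HR": 12, "SO": 150},
]
ratios = [per_at_bat(s) for s in season]
normalized = z_normalize(ratios)
print(normalized[0])
```

Dividing by At-Bats first keeps a part-time hitter comparable with an everyday one; normalizing afterwards stops high-frequency stats such as Hits from dominating rare ones such as Triples in any distance calculation.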

Page 18: Clustering of Players in Major League Baseball

Two or three clusters?

Prioritize Interpretability

Two clusters:

cluster  R         H         X2B       X3B       HR        RBI       SB        CS        BB        SO        IBB       HBP       SH        SF        GIDP
1        0.116499  0.252142  0.051058  0.004807  0.021991  0.112137  0.014252  0.006324  0.089313  0.192674  0.006069  0.01034   0.005437  0.008403  0.024573
2        0.133313  0.249521  0.051245  0.002902  0.045083  0.152946  0.006154  0.003213  0.115275  0.250168  0.008893  0.010132  0.001179  0.008524  0.024769

Three clusters:

cluster  R         H         X2B       X3B       HR        RBI       SB        CS        BB        SO        IBB       HBP       SH        SF        GIDP
1        0.125511  0.254096  0.050459  0.007473  0.019166  0.106171  0.022868  0.009847  0.09595   0.203093  0.005079  0.009853  0.007656  0.007595  0.019049
2        0.106703  0.250018  0.05171   0.001909  0.025062  0.118621  0.004885  0.002495  0.082099  0.181349  0.007145  0.010868  0.003024  0.009282  0.030577
3        0.133313  0.249521  0.051245  0.002902  0.045083  0.152946  0.006154  0.003213  0.115275  0.250168  0.008893  0.010132  0.001179  0.008524  0.024769

With two clusters, one group is swinging at the ball (Home Runs, Strikeouts), while the other acts to score runs when on base, or with a teammate on base (Stolen Bases, Sacrifice Hits).

With three clusters, the split is driven by large differences in rare metrics, such as Triples and Grounded Into Double Play.
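The slides do not name the clustering algorithm. As a sketch only, the comparison between cluster counts could be run with plain k-means (Lloyd's algorithm), which is assumed here; the `kmeans` helper and the two-feature sample points (say, normalized home-run and stolen-base rates) are hypothetical.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain Lloyd's algorithm. points: list of equal-length tuples of floats."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k of the data points
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

# Made-up (HR rate, SB rate) pairs: three power hitters, then three base-stealers.
points = [
    (0.050, 0.005), (0.045, 0.006), (0.048, 0.004),
    (0.020, 0.020), (0.018, 0.025), (0.022, 0.022),
]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Printing the centroids gives one row of per-metric means for each cluster, the same shape as the cluster tables on this slide. Rerunning with `k=3` shows whether the extra cluster is still easy to describe.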

Page 19: Clustering of Players in Major League Baseball

Appendix B: Complex Clustering Example

• Catcher
• Batting: same as all other players
• Fielding is different from other players, with several rare outcomes tracked
• Collect fielding data only from games played as catcher
• Measure as ratio with games played
• Join with all batting outcome ratios
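The catcher feature set described above (fielding per game at catcher, joined to batting ratios) could be assembled roughly as follows. The record layout, the `fld_` prefix, the `catcher_features` helper, and the sample values are assumptions; the sketch also assumes one catcher row per player.

```python
def catcher_features(fielding_rows, batting_ratios):
    """Combine per-game catcher fielding rates with per-at-bat batting ratios.

    fielding_rows: list of dicts with 'player', 'pos', 'G' and counting stats.
    batting_ratios: {player: {stat: ratio}}.
    Returns {player: combined features} for players present in both inputs.
    """
    features = {}
    for row in fielding_rows:
        if row["pos"] != "C" or row["G"] == 0:
            continue  # keep only games actually played as catcher
        player = row["player"]
        if player not in batting_ratios:
            continue  # inner join: skip players without batting data
        per_game = {
            f"fld_{k}": v / row["G"]  # prefix avoids clashes with batting stat names
            for k, v in row.items()
            if k not in ("player", "pos", "G")
        }
        features[player] = {**batting_ratios[player], **per_game}
    return features

# Made-up rows, for illustration only (PB = passed balls, E = errors).
fielding = [
    {"player": "c01", "pos": "C", "G": 100, "PB": 5, "E": 4},
    {"player": "c01", "pos": "1B", "G": 10, "PB": 0, "E": 1},
    {"player": "x02", "pos": "OF", "G": 120, "PB": 0, "E": 3},
]
batting = {"c01": {"H": 0.25, "HR": 0.03}, "x02": {"H": 0.28, "HR": 0.02}}

combined = catcher_features(fielding, batting)
print(combined)
```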

Page 20: Clustering of Players in Major League Baseball

Two or four clusters?

Is the outlier cluster useful?

Filtering out inactive players, the four clusters have sizes: 27, 5, 15, and 29.

A 5-player cluster is not much use, so we use two clusters instead.
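The size check behind this decision is easy to automate. In the sketch below the labels are reconstructed only from the four sizes quoted above. The `min_size` threshold of 10 is an arbitrary illustrative value, not one stated in the slides, and `too_small` is a hypothetical helper.

```python
from collections import Counter

def too_small(labels, min_size):
    """Return the cluster ids whose membership falls below min_size."""
    sizes = Counter(labels)
    return sorted(c for c, n in sizes.items() if n < min_size)

# Cluster labels rebuilt from the sizes reported on the slide: 27, 5, 15, 29.
labels = [0] * 27 + [1] * 5 + [2] * 15 + [3] * 29

sizes = Counter(labels)
flagged = too_small(labels, min_size=10)  # illustrative threshold (assumption)
print(dict(sizes), flagged)
```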

Page 21: Clustering of Players in Major League Baseball

Appendix C: Player similarity

• The performance metric for batting and pitching is calculated
• Batting:
• Pitching¹:
• Using this, the similarity matrix is computed using Euclidean distance
• For each player, the 3 most similar players are selected (who aren’t on the same team)

¹ The lower the ERA, the better

Baseball glossary: http://baseball.about.com/od/termstatglossar1/a/statsglossary.htm
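The similarity step can be sketched as a pairwise Euclidean distance matrix followed by a per-player lookup that skips teammates. The feature vectors, team codes, and the `most_similar` helper below are hypothetical, since the slides do not preserve the exact inputs to the distance.

```python
import math

def most_similar(players, top_n=3):
    """players: {name: (team, feature_vector)}.

    Returns {name: up to top_n nearest players on other teams, closest first}.
    """
    names = list(players)
    # Full symmetric Euclidean distance matrix, stored as a dict of dicts.
    dist = {
        a: {b: math.dist(players[a][1], players[b][1]) for b in names}
        for a in names
    }
    result = {}
    for a in names:
        # Exclude the player themselves and anyone on the same team.
        others = [b for b in names if b != a and players[b][0] != players[a][0]]
        result[a] = sorted(others, key=lambda b: dist[a][b])[:top_n]
    return result

# Made-up players and feature vectors, for illustration only.
players = {
    "Player A": ("T1", (0.30, 0.10)),
    "Player B": ("T1", (0.31, 0.10)),  # teammate of A, so never suggested for A
    "Player C": ("T2", (0.29, 0.11)),
    "Player D": ("T3", (0.20, 0.30)),
    "Player E": ("T2", (0.25, 0.20)),
}
similar = most_similar(players)
print(similar["Player A"])
```

For a metric where lower is better, such as ERA, Euclidean distance needs no sign change, because it measures closeness rather than quality.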

Page 22: Clustering of Players in Major League Baseball


Appendix D: Age vs Batting Average

Page 23: Clustering of Players in Major League Baseball


Appendix E: Age vs ERA