18
1 Decoding MLB Pitch Sequencing Strategies via Directed Graph Embeddings Abstract This paper presents a novel analysis of pitch sequencing in Major League Baseball (MLB). By leveraging high-resolution pitch tracking data from ~3.6 million pitches across the 2015-2019 seasons, this work introduces directed graph (network) embeddings that successfully map short- and long-term patterns in pitch sequences. This quantitative approach to pitch sequencing captures the intuition that pitchers exhibit a sense of long-term memory when on the mound that is not adequately represented by individual pitch selection or simple pitch-to-pitch correlation. By interpreting the graph embeddings as forward dependencies, there is compelling evidence that pitchers construct their sequences via strategic components, denoted “setup” and “knockout” pitches. Model-based clustering on the graph embeddings suggest that MLB pitch sequences can be grouped into a finite collection of universal patterns with respect to both pitch-type and zone selection. Exploratory data analysis of these sequence clusters indicate that a pitcher’s sequencing strategies are distinct—though not inseparable—from their available pitch arsenal; that is, in addition to pitch selection, pitch ordering is a significant component of pitcher decision-making. Moreover, pitchers exhibit patterns or styles in how they sequence their pitches over the course of a single matchup, game, season, and career. Ultimately, this paper introduces an analytical framework to study and visualize MLB pitch sequences with potential applications in matchup preparation, player evaluation, and player development. 1. Introduction Each moment in a baseball game is an extension of the battle between pitcher and batter. This conflict perhaps most closely resembles that of a high-performance mixed martial arts fight, where there is immense pressure—and corresponding incentive—for pitchers to optimize their approach against each batter they face. In this matchup, a pitcher’s arsenal can be imagined as a set of techniques that each pitcher has at their disposal to defeat their opponent. However, a pitcher’s pitch arsenal alone (i.e., the moves they have available) is an incomplete representation of a pitcher’s aggregate style. By treating both the selection and ordering of baseball pitches, pitch sequencing is much more informative of pitcher decision-making by taking into account how pitchers learn, build upon, and order their pitches. A major analytics challenge in sabermetrics is capturing differences between pitchers with respect to not only their performance on the field but also the styles they exhibit and the strategies they deploy. Although each at-bat between a pitcher and batter is ultimately either won or lost using an approximately unique collection of pitches, patterns may emerge wherein sets of pitch sequences recur across various in-game settings. If pitchers share pitch arsenals or face similar batter- matchup situations, it is also expected that sets of pitch sequences manifest across many individuals and across seasons. Conversely, pitchers can also distinguish themselves from others in their role that share their pitch arsenal through differences in their pitch sequencing. This is advantageous to the pitcher since it separates them from their peers and is critical for teams to diversify talent in their roster.

Decoding MLB Pitch Sequencing Strategies via Directed

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Decoding MLB Pitch Sequencing Strategies via Directed

1

DecodingMLBPitchSequencingStrategiesviaDirectedGraphEmbeddings

AbstractThispaperpresentsanovelanalysisofpitchsequencinginMajorLeagueBaseball(MLB).Byleveraginghigh-resolutionpitchtrackingdatafrom~3.6millionpitchesacrossthe2015-2019seasons,thisworkintroducesdirectedgraph(network)embeddingsthatsuccessfullymapshort-andlong-termpatternsinpitchsequences.Thisquantitativeapproachtopitchsequencingcapturestheintuitionthatpitchersexhibitasenseoflong-termmemorywhenonthemoundthatisnotadequatelyrepresentedbyindividualpitchselectionorsimplepitch-to-pitchcorrelation.Byinterpretingthegraphembeddingsasforwarddependencies,thereiscompellingevidencethatpitchersconstructtheirsequencesviastrategiccomponents,denoted“setup”and“knockout”pitches.Model-basedclusteringonthegraphembeddingssuggestthatMLBpitchsequencescanbegroupedintoafinitecollectionofuniversalpatternswithrespecttobothpitch-typeandzoneselection.Exploratorydataanalysisofthesesequenceclustersindicatethatapitcher’ssequencingstrategiesaredistinct—thoughnotinseparable—fromtheiravailablepitcharsenal;thatis,inadditiontopitchselection,pitchorderingisasignificantcomponentofpitcherdecision-making.Moreover,pitchersexhibitpatternsorstylesinhowtheysequencetheirpitchesoverthecourseofasinglematchup,game,season,andcareer.Ultimately,thispaperintroducesananalyticalframeworktostudyandvisualizeMLBpitchsequenceswithpotentialapplicationsinmatchuppreparation,playerevaluation,andplayerdevelopment.1. IntroductionEachmomentinabaseballgameisanextensionofthebattlebetweenpitcherandbatter.Thisconflictperhapsmostcloselyresemblesthatofahigh-performancemixedmartialartsfight,wherethereisimmensepressure—andcorrespondingincentive—forpitcherstooptimizetheirapproachagainsteachbattertheyface.Inthismatchup,apitcher’sarsenalcanbeimaginedasasetoftechniquesthateachpitcherhasattheirdisposaltodefeattheiropponent.However,apitcher’spitcharsenalalone(i.e.,themovestheyhaveavailable)isanincompleterepresentationofapitcher’saggregatestyle.Bytreatingboththeselectionandorderingofbaseballpitches,pitchsequencingismuchmoreinformativeofpitcherdecision-makingbytakingintoaccounthowpitcherslearn,buildupon,andordertheirpitches.

Amajoranalyticschallengeinsabermetricsiscapturingdifferencesbetweenpitcherswithrespecttonotonlytheirperformanceonthefieldbutalsothestylestheyexhibitandthestrategiestheydeploy.Althougheachat-batbetweenapitcherandbatterisultimatelyeitherwonorlostusinganapproximatelyuniquecollectionofpitches,patternsmayemergewhereinsetsofpitchsequencesrecuracrossvariousin-gamesettings.Ifpitcherssharepitcharsenalsorfacesimilarbatter-matchupsituations,itisalsoexpectedthatsetsofpitchsequencesmanifestacrossmanyindividualsandacrossseasons.Conversely,pitcherscanalsodistinguishthemselvesfromothersintheirrolethatsharetheirpitcharsenalthroughdifferencesintheirpitchsequencing.Thisisadvantageoustothepitchersinceitseparatesthemfromtheirpeersandiscriticalforteamstodiversifytalentintheirroster.

Page 2: Decoding MLB Pitch Sequencing Strategies via Directed

2

However,preexistinganalyticalresearchinpitchsequencingcurrentlylimitsequenceanalysistopitchpairsviapitch-to-pitchcorrelation;thatis,apitchsequenceisdefinedexclusivelyintermsofconsecutivepitches[3].Thismemorylessapproachislikelynotreflectiveofhowpitcherspursuedecision-makingsinceitconstrainspitcherstoveryshort-termlearning.Theseearlymethodsglossoversomeofthecrucialqualitiesthatmakepitchsequencingascaptivatingasitisandgenerallyfailtoprovideinsightsastothestrategicmotivationsbehindpitchsequenceselection.

Thispaperintroducesagraph-basedsolutiontocapturethecomplexityofpatternsthatarelatentinpitchsequences.Unlikepriorstatisticalapproachesthatfocusprimarilyonpitch-levelpredictabilityorpitchpairings,thismethodsuccessfullyembedsshort-andlong-termpatternsinpitchsequencesinafinite-dimensionalfeaturespace.Eachgraphembedding,givenitsinterpretationasaforwarddependencybetweentwopitches,alsopreservesthedirected(i.e.,ordered)natureofbaseballpitchsequences.Fromthispropertyoftheembeddings,thispaperpresentsasuiteoftoolsthatcanbevaluableinastrategicsettingwithinateamfront-officeorplayerdevelopmentunit.Inordertomorecloselyanalyzehowpitchersconstructtheirsequences,theanalysisinthispaperseparatelyconsiderspitchsequencesbasedonpitch-typeconstitutionandzones.Throughmodel-basedclustering,thispaperidentifiesacollectionofpitchsequencingstrategiesthatarecommonthroughouttheMLBandvaryintheirconstitution,ordering,andentropy.Comparativeanalysisbetweensequenceclustersprovideinsightintocorestructuralelementsofpitchsequencingstrategy.Inparticular,thisworkfindsevidencefor“setup”and“knockout”pitchesbydrawingfromtheconceptofsinksingraphtheory.Withrespecttohowplayersselectpitchsequences,itisclearfromtheseclustersthatpitcherswhosharepitcharsenalscanalsodifferdrasticallyinhowtheyorderthosepitchesingamesituations.Applicationsofthisworktobaseballoperations,analytics,andscoutingaresuggestedandexploredthroughoutthepaper.

2. MethodologyThefollowingsectionswilldescribethedataandstatisticalmethodologythatunderliesthefindings.Section2.1providesanoverviewofthepaper’sprimarydataset(s)anddiscussesthemodel-specificimplicationsofdataquality.Section2.2.1introducesSequenceGraphTransform(SGT),thegraphembeddingalgorithmthatencodeseachpitchsequenceintoafeaturespace.Furthermore,Section2.2.2detailsthespecificmathematicalpropertiesofSGTthatmakeitattractiveforpracticalapplicationtothisproblem.Finally,Section2.3specifiestheclusteringtechniquethatisappliedtodetermineoptimalgroupingsofpitchsequenceswithoutrelyingonpre-existinglabelsorassumptions.

2.1 DataItisnosecretthatbaseballhasexperiencedatechnologicalrevolution,largelythankstoanexplosionofreliablehigh-resolutiondata.Allin-gamedata—includingallmeasurementsonpitches,idtracking,andcatcherdatathatareincludedinthispaper—arepubliclyavailablethroughtheMLBStatcastAPI[2].Thepitchingdataincludes3,595,944pitchesfromthe2015-2019MLBseasonsthatareprovidedalongside40featuresthatdescribepitch-levelcharacteristicsandinformationthatcontextualizeswheneachpitchwasthrown.Inaggregate,1523distinctpitchersand1894distinctbattersarerepresentedinthedata.Thisdataset—orashortenedformofit—hasbeenusedextensivelyinrecentlypublishedbaseballanalyticsresearch[4].

Page 3: Decoding MLB Pitch Sequencing Strategies via Directed

3

Sincethispaperisprimarilyconcernedwithsequentialdata,itisnecessarytorelyondiscretepitch-classificationlabels.Theuniquepitch-typeandzonelabelsintheMLBdatasetserveasthealphabetthattheembeddingalgorithmwilllearnfrom.Theaccuracyandprecisionofthepitch-classificationlabelsinthisdatasetisthuscriticaltothesuccessofthisresearch.Thismethodologyassumeshomogeneitywithinpitch-typeswithrespecttopitch-levelcharacteristics,thoughitisclearinFigure1thatthisassumptionhasshortcomings.Itisimportanttonotethatpitchclassificationitselfisanon-trivialproblem.Currently,in-houseMLBmodelsperformreal-timepitch-classificationfornearly750,000pitcheseachseasonusingpatentedneuralnetworks[2].Irrespectiveofthepitch-classificationlabelsthatareutilizedinthisanalysis,independentresearcherscaneasilyadaptthisworkusingtheirownpitchclassificationsystemsordatasets.

Figure1:PitchCharacteristicPairplot

Severaladjustmentstothedatasetaremadetoimprovethequalityofmodeling.At-batswithincompletemeasurement/labeldataarediscardedfromthedatasettominimizetheinadvertenteffectsofnoiseorcalibrationissues.Severalfeaturesarealsoaddedtotherawdataset.Toavoidlook-aheadbiasinpredictivemodeling,rollingstatisticsareusedtoreportpitcherandbatterperformancesandtrackchangesinplayeroutcomesovertime[4].Additionally,tocomparethesequencingtendenciesofpitchersinvariousrolesandcontexts,“starter”designationsforeachpitcherareinferredfromtheinformationavailableinthefirstat-batintopofthefirstinningandthefirstat-batinthebottomofthefirstinning[4].

Page 4: Decoding MLB Pitch Sequencing Strategies via Directed

4

2.2.1SequenceGraphTransform–OverviewandIntuitionThispaperappliesagraph-basedfeatureextractionmethodtoencodeforpatternsinMLBpitchsequences.Thismethod—SequenceGraphTransform—achievesthegoalofrepresentingthecomplexityofpitchsequencingintermsofbothpitchconstitutionandordering.

Asequencecanbedefinedasasetofdiscreteitemsthatarestructuredinanorderedseries[5].Sequencesareoneofthemostcommondatatypesinnatural,physical,andcomputationalsettings.Forexample,sequencingminingtechniqueshavefoundvaluebioinformaticsandcomputationalgenomics.Pitchsequencesnaturallyfollowthissequentialstructure;boththeorderandconstitutionofapitchsequencingareimportantindisruptingabatter’scalibrationtothepitcher’sstrategies.However,sincepitchsequencesrepresentunstructureddata—definedbyarbitrarilyplacedalphabetsofanarbitrarylength—theirmappingintoaEuclideanspaceisnon-trivial.Moreover,sincepitchsequencesgenerallyextendbeyondtwo-pitches,thisanalysisrequiresafeatureextractionmethodthatspecifieslong-termdependencies(i.e.,theeffectofdistantelementsinasequenceoneachother)inadditiontoshort-termdependencies(i.e.,theeffectofelementsinasequencethatoccurinshortsuccessionofeachother).

ThispaperutilizesSequenceGraphTransform(SGT),anembeddingfunctionthatextractspatternsinsequencesandmapsthemintoafinite-dimensionalEuclideanspace.TherearemultipleadvantagesthatareimpliedinthefunctionandmakeSGTparticularlyattractivetotheproblemofpitchsequencing.SGToperatesbyquantifyingtheobservablepatternsinasequencebyscanningthepositionsofeachitemrelativetootheritems.Theoutputtedembeddingscanbeinterpretedasadirectedgraph,whereeachalphabet(e.g.,pitch-type)becomesanodeandadirectedconnectionbetweentwonodesexplainstheirassociation.Inordertomaximizetheutilityofsequencingmininginbaseballpitchsequenceanalysis,itisimportanttohaveasequence-levelmeasureof(dis)similaritybetweenvarioussequences.Similarityisgivenexplicitlybythedot-productbetweentwoseparateembeddingswherehigherresultsindicatehighersimilarity.Additionally,SGTprovidesasolutiontoextractlength-insensitivepatternsinpitches.Asafeatureembeddingtool,SGTisparticularlyvaluablebecauseitprovidesamachine-interpretablerepresentationofacomplexdata-type.Thegraphinterpretationofeachembeddingallowsforpitch-typestoformnodesandthedirectedconnectionbetweennodesasasignalofthestrengthoftheirassociation.Graphembeddingsalsoenableunsupervisedlearningandclusteringforpatternanalysis.Moreover,eachgraphembeddingprovidedbySGTisdirected;thatis,boththeconstitutionandtheorderingdynamicsofeachpitchsequencearemaintained.[5]

Figure2:SGTApplicationtoPitchSequences

AlthoughthereareembeddingalternativestoSGT,theseexistingmethodseitherfailtoaccountforlong-termpatterns,producefalsepositives,orsubstantiallyincreasecomputation.Forexample,if

Page 5: Decoding MLB Pitch Sequencing Strategies via Directed

5

eachsequenceistransformedintoastring,thereareseveralmetricswhichdefinethedistancebetweentwouniquepatterns(e.g.,editdistance,hammingdistance).However,thesemetricsimproperlyaccountforthecomplexitiesofpitchsequencesanddonotoffertheadvantageofrepresentingpitchsequencesintermsofforwarddependencies.MeasuresdrawnfrominformationtheorysuchasShannonentropy,thoughinformativeofthelevelof“uncertainty”or“randomness”inasequence,alsodonotprovideenoughgranularitytoexplainsequencecharacteristics,namelypitchconstitutionandordering[6].

Ifpitchsequencesdonotcorrespondintermsofbothshortandlongpatterns,SGTembeddingswillnotproduceahighsimilaritymatch.WhilethepitchsequencesFF-CU-FF,FF-CU-FF-FF-CU-FF,andFF-SL-FF-CU-FF-SL-FF-SL-SLallshareasubsequenceinFF-CU-FF,SGTdistinguishesthefirsttwoexamplesfromthirdgiventheoverall,long-termdifferencesinthesequencepattern.Conversely,traditionalsubsequencematchingmethodswouldproduceasimilaritymatchbetweenallthreesub-strings[5].Giventheapplicationsetting,SGTispreferablesincetraditionalsubsequencingmatchingmethodswouldproduceafalsepositive.Finally,SGTiswidelyapplicableacrossmultipledomainsanddoesnotexhibitsettingbias.TheSGTalgorithmispubliclyavailable,anditslowcomputationalrequirementsallowittoberuninlocalenvironments[5].2.2.2SequenceGraphTransform–DefinitionSupposeasetofsequences𝑆iscontainedinadataset.Anyindividualsequenceinthedatasetcanberepresentedby𝑠 ∈ 𝑆,wheresisconstitutedbyanalphabet𝒱.Forthepurposesofpitchsequencing,alphabet𝒱eithercontainsacollectionofpitch-typelabelsordiscretezone-locationlabels.Eachsequence𝑠hasatleastoneinstanceofoneelementfromthealphabet𝒱,thoughitisnotnecessaryforeveryelementinalphabet𝒱toberepresentedineachset𝑠.Inbaseballterms,apitchsequenceisdefinedbyatleastonepitchthatisthrowninanat-bat;moreover,duetopitcharsenalconstraints,strategicdecision-makingmotivations,andin-gamefeasibility,itisnotexpectedthatapitcherthrowseverypitch-typeinasinglesequence.Thelengthofaspecificsequence𝑠isdenotedby𝐿(").Foreachsequence,𝑠$ willrepresenttheelementthatoccursatposition𝑙where𝑙 = 1,… , 𝐿(")and𝑠$ ∈ 𝒱.𝜑%(𝑑)isafunctionthattakesadistanceasinputand𝜅asatuninghyper-parameter.Thefeaturesforsequence𝑠areextractedintheformofassociationsbetweenalphabetinstances.Eachassociationisdenotedas𝜓&'

("),where𝑢, 𝑣 ∈ 𝒱correspondtospecificalphabetinstancesand𝜓correspondstoahelperfunctionof𝜑.

SinceSGTconsiderstherelativepositionsofinstancestoformeachfeature,𝜑3𝑑(𝑙,𝑚)5quantifiestheinformationthatisextractedfromtherelativepositionsoftwoinstances,where𝑙, 𝑚arethepositionsoftwodistinctinstancesand𝑑(𝑙,𝑚)isacorrespondingdistanceoutput.Specifically,𝜑3𝑑(𝑙,𝑚)5denotestheeffectofthefirstinstanceonthelaterinstance.

ToapplySGTonasetofsequences𝑆,thefollowingconditionsmustholdon𝜑:(1)Thefunctionoutputisstrictlygreaterthan0suchthat𝜑%(𝑑) > 0; ∀𝜅 > 0, 𝑑 > 0;(2)Thefunctionstrictlydecreaseswith𝑑suchthat (

()𝜑%(𝑑) < 0;and(3)thefunctionstrictlydecreaseswith𝜅suchthat

((*𝜑%(𝑑) < 0.ThefirstconditionisdesignedtomaintaintheinterpretabilityofeachSGT

embedding,whereasthesecondconditionmaintainsthatcloserneighborshavehighercorrespondingdependencies,andthelastconditionmaintainsthattheeffectofdistantneighborscanbeeffectivelytuned.

Page 6: Decoding MLB Pitch Sequencing Strategies via Directed

6

SGTusesanexponentialfunctionfor𝜑sinceitsatisfiesallthreeconditions.Usingthisapproach,thedistancebetweentwodistancescanbetakensimplyas𝑑(𝑙,𝑚) = |𝑚 − 𝑙|.

𝜑*3𝑑(𝑙,𝑚)5 = 𝑒+%(,+$), ∀𝜅 > 0, 𝑑 > 0 (1)

Sincepitchersthrowmultiplepitchesinasequence,therearelikelytobeseveralinstancesofalphabetpairs(𝑢, 𝑣).TheSGTalgorithmisinitiallyconcernedwithdetermininghowmanyofeachalphabetpairexistsinthesample.Eachalphabetpairisultimatelystoredina|𝒱| × |𝒱|asymmetricmatrixΛ.Λ&'willcontaineveryalphabetpair(𝑢, 𝑣)thatiscontainedaspecificsequence𝑠suchthatthe𝑣’spositioncanalwaysbeinferredtobeafter𝑢ineachinstancepair.

Aftercomputing𝜑foreach(𝑢, 𝑣)pairinstancethatiscontainedinasequence,theassociationfeature𝜓&'

(")isdefinedasthenormalizedaggregationofallinstances.Sinceallanalysisconductedinthispaperisconcernedwithlength-insensitiveproblems,theeffectoflengthiscontrolledforbynormalizing|Λ&'|,thesizeofthesetΛ&'orthenumberofunique(𝑢, 𝑣)pairs,withthesequencelength𝐿(")asshowninEq.2.

𝜑&'(𝑠) = ∑ .!"($!%)∀(%,$))*+(,)

|0*+(")|/2(,) (2)

TheSGTfeaturerepresentationofanentiresequencecanberepresentedastheaggregationofΨ(𝑠) = [𝜓&'(𝑠)], 𝑢, 𝑣 ∈ 𝒱.ThefeaturesΨ(")canbeinterpretedasadirectedgraphwithedgeweights𝜓andnodesin𝒱foreachsequence𝑠inthedataset.

SGTcontainsasingletuningparameter𝜅.𝜅modulatestheextenttowhichlong-termdependenciesarecapturedineachembedding.Asmallvalueof𝜅preserveslonger-termdependencies.Sincetheaveragepitchsequenceisapproximately5pitchesattheat-batlevel,asmall𝜅 = 1(defaultvalue)isselected.[5]

2.3Model-BasedClustering–GaussianMixtureModelsUnsupervisedlearningviamodel-basedclusteringcanbeusedtodiscoverpatternsinat-batpitchsequences.ThispaperreliesonGaussianMixtureModels(GMMs),amodel-basedclusteringtechniquethatusesanexpectation-maximization(EM)algorithm,tofindclustersthatcapturethecommonpatternsthatmanifestandrecurinpitchsequences.Giventhereisnosetofintuitionthatinformshownumberoftheseclustersorpatternsexists,thisanalysisrequiresanapproachthatself-determinesanoptimalnumberoftargetclustersKfromthegraphembeddings.

Otherpopularalgorithms,suchashierarchicalclustering,arealsoabletoproduceKwithoutuserlabels,usingmeasuresofin-clusterandout-of-clusterdistanceandvariance[6].However,currenthierarchicalclusteringalgorithmsscaleinO(𝑛3)timeandoftenproduceunstableresults[6].Thiscomputationalhurdleisespeciallyrelevantwhendealingwithahighvolumeofdataacrossmultipleseasons.

Toassessthepotentialinstabilityofanyresults,itisdesirabletoobtainaprobabilityestimatethatasequencebelongstotheirassignedcluster.Additionally,GMMscanframeK-selectionasamodel-selectionproblemusinglikelihood-basedmeasuressuchasBayesianInformationCriterion(BIC),whichbenefitsinterpretability[7].

Page 7: Decoding MLB Pitch Sequencing Strategies via Directed

7

AGaussianMixtureisafunctionthattreatsseveralGaussians,eachidentifiedby𝑘 ∈ {1,… , 𝐾}.EachGaussiankincludesamean𝜇thatdefinesitscenter,acovarianceΣthatdefinesitswidth,andamixingprobability𝜋thatdefinesthesizeoftheGaussianfunction,asshownbyFigure2[8].

Eq.3explicitlydefinestheGaussianMixtureModelLikelihoodFunctionusedinthisapproach.TheGaussianMixtureModelLikelihood-FunctioncomputesaMaximum-LikelihoodEstimate(MLE)tofindtheoptimaldistributionthatunderliesthedataset.

𝐿(𝜃|𝑋4, … , 𝑋5) = ∏ ∑ 𝜋%𝑁(𝑥6; 𝜇% , 𝜎7)8%94

5694 (3)

AllmodelingandassociateddataanalysiswereconductedinPythoninaJupyterNotebookenvironment.TheGaussianMixtureModelsandExpectationMaximizationalgorithmwerebuiltandimplementedfromscratch.3 Pitch-Type&ZoneSequencingStrategiesDuringMLBgameplay,pitchersengageinadvanceddecision-making.Pitchersareincentivizednotonlytopromoteunpredictabilityintheirsequencingdecisionsbutalsoadapttothestrengthsandweaknessesofthebattertheymatchupwith.Ineachofthesesubsequentanalyses,itisassumedthatapitcheriscapableof“callingtheirownpitches;”thatis,pitchersarelargelyresponsibleforwhatpitchestheythrow.Theintuitiontosupportthisassumptionisstrong—pitchersdonotcompletelyredesigntheirpitchingstyleswhenevertheircatcherorumpirechanges,forexample.Generally,pitchsequencingcanbedistilledintotwodistinctcomponents:Pitch-typeselectionandzone (location) selection. Inorder to fitMLBpitchingdata to the required input structureof theSequenceGraphTransformalgorithm,pitchsequencesarerepresentedasorderedlistsofpitch-typelabelsanddiscretezone-location labels, respectively (SeeSection2.2.1).Fromtheseembeddings,unsupervisedlearning(GaussianMixtureModels)isusedtoclusterpitch-typeandzonesequencesindependently (See Section 2.3). Then, each at-bat sequence is assigned with two cluster labels(pitch-typeandzone)andispairedwithrelevantin-gameinformation(e.g.,score,inning,resultofthesequence).This analytical design has multiple advantages: (1) It is possible to discern which sequencingstrategiesarecommoninavarietyofin-gamecontexts;(2)Thecovarianceofpitch-typeandzone-labelscanbeanalyzed,andthecomplexitybetweenpitch-typeselectionstrategiesandzoneselectionstrategies can be compared; and (3) Sequencing styles across players and time periods can besystematicallyevaluated.3.1.1Pitch-TypeSequenceClustersModeling results from pitch-type sequences show that there are many diverse groups of pitchsequencesat theat-bat level thatare thrownin theMLB.Sincetheoptimalnumberofclusters isselected on Bayesian Information Criterion (BIC), the number of pitch-type sequence clusters isinformativeoftheaggregatecomplexityofpitch-typesequencingstrategies.Eachsequenceclustervariesintheirpitch-typeconstitutionandordering(SeeTable1;Notethatonlythetopthreepitch-typesbyrelativefrequencyaredisplayedinTable1foreachclusterforthesakeofclarity).The fitted Gaussian Mixture Model supplies a probabilistic distribution for each sequence’sassignment to a pattern cluster. The overwhelming majority of at-bat sequences (99.87%) are

Page 8: Decoding MLB Pitch Sequencing Strategies via Directed

8

assignedtoclusterswithahighprobability(greaterthanorequalto99%assignmentprobability).The0.13%ofat-batsthatareassignedtoaclusterwithlessthananassociated99%probabilityallhave an assignment probability greater than 50% and have a mean assignment probability of86.37%.Withinthis0.13%ofat-bats,itisimportanttonotethereisnoexplicitpatternastowhichsequencesseemdifficulttoassigntoacluster.Giventhisdistribution,itisclearthatmostsequencesmapverystrongly toa singleclusterand that themodel-results canbe interpreted in termsofaprimaryclusterassignment.FromTable1, it isclearthatpitchersrelyonavarietyofpitchsequencesacrossmatchups.Sinceclusteringtakesintoaccountallpitchsequencesforpitchersthatmeetanat-batminimum,themodelresultsareexpectedlycorrelatedwithwhicharsenalsaremostpopularintheMLB(SeeFigure3,Table1).Forexample,cluster3containspitchsequencesthatincludethreeofthemostpopularpitch-typesintheMLB,soitisexpectedthatthisclusterisalsorelativelypopularamongpitchers.

Figure3:Pitch-TypeSequenceClusterbyRelativeFrequency/Density

Withineachpitchsequence,itisobservablethatthereisasplitbetweenfastballtypesandoff-speedpitchesthatarethrown;inotherwords,everysequencepatterninvolvesbothfastballpitchesandoff-speed pitches, though in varying frequencies. These results present significant evidence thatpitch-typemixingisacriticalaspectofallsequencingstrategiesacrossadiversesetofarsenals.Notethatsequenceclusters thatsharepitch-typesathigh frequenciesalsodiffer in theirorderingandentropy(SeeAppendix:Table2,Table3).While it is more difficult to summarize ordering in a table or visualization, Table 2 (Appendix)presentsaside-by-sideexample-basedcomparisonofcluster3,cluster7,andcluster10.Eachat-batexampleyieldsanestimated99%probabilityofmembershiporgreater,which impliesthatthesesequencesareexemplaryoftheidentifiedsequencecluster.Itisobservablethatsequenceclusterswithsimilarfrequenciesofpitch-typesalsodifferinordering.Theorderingcomponent ishighlightedfurtherbywhichpitch-typestendtobethrownearly inasequence versus later in the sequence (Table 2; colored by beginning, middle, and end of thesequence).Finally,itisevidentthateachclusterisapproximatelythesamelength,whichisexpectedsincethegraphembeddingslearnlength-insensitivepatternsinpitchsequences.

Page 9: Decoding MLB Pitch Sequencing Strategies via Directed

9

Table1:Pitch-TypeClusterMembership

3.1.2DefiningSetupandKnockoutPitcheswithGraphSinksBy interpreting the graph embeddings as forward dependencies, this analysis finds evidence for“setup”and“knockout”pitchesinvariouspitch-typesequenceclusters.Drawingfromtheconceptofa“sink”ingraphtheory,itispossibletodissecteachpitchsequenceclusterintermsofitsstrategiccomponents.Inoperationsresearchandgraphtheory,asinkisformallydefinedasanodewhichonlyhasincomingflow;thatis,multiplenodesdirecttothesinkbutthesinkdoesnotdirectitselftoanyothernodesinthesystem[9].Thisdefinitioncanbestrictlymaintainedforcertainapplicationsandcontexts,suchas when modeling fluids in a pipe system, currents in an electrical circuit, or even whenapproximatingwebpage importancesuchas inthePageRankalgorithm[10].However,giventhenatureofbaseballpitching,itisunlikelythatpitchersreservetheuseofacertainpitchesonlyfortheendorbeginningofasequence.Thisiscounterintuitivetotheincentiveeachpitcherhastominimizetheirpredictability[3].Instead,bylooseningthisformaldefinitionofgraphsinkstoallowforlimitednon-zerooutflowfromsinknodes,itispossibletoidentifysetupandknockoutpitchesinthecontextofpitchsequencing.Withoutlossofgenerality,eachgraphembedding(A,B),forapitch-typeAandpitch-typeB,providesinformationonthepreferredorderingbetweentwopitches.Ahighvalueinthefeature(A,B)indicatesthatAisfollowedbyasignificantnumberofBsinthesequence.However,itisnotguaranteedthatthisrelationshipholdsfor(B,A).ItispossiblethatBisnotfollowedbyasignificantnumberofAs,eveniftheconverseistrue.Specifically,thefrequencyofpitch-typeAmaynotbedependentonpitch-typeBtotheextentthatthefrequencyofpitch-typeBisonpitch-typeA.In particular, a setup pitch is defined within each sequence cluster as a pitch whose forwarddependencies are generally similar across all associated pitch-pairings and order permutations.Giventhisdefinition,setuppitchesarethusmorecommonrelativetoknockoutpitchesinsequenceconstructionbydesign(SeeTable4).Aknockoutpitchisdefinedwithineachsequenceclusterasapitchthatfollowsmanypitchesbuthaslimitedcorrespondingoutflow.Additionally,aknockoutpitch

Page 10: Decoding MLB Pitch Sequencing Strategies via Directed

10

couldfunctionasa localoruniversalsinkwithineachsequenceclustergraph;that is,aknockoutpitchcouldhavealargepositivedifferencebetweenitsinflowandoutflowforasinglespecificpitch(SeeAppendix:Table4–“GreatestDiff.”)orhavepositivedifferencesbetweeninflowandoutflowformultiplesetuppitches.Withineachcluster,boththesetupandknockoutpitchesneededtomeetafrequencythresholdtobeidentifiedassuch—thispreventspitchesthatrarelymanifestwithinasequencefrombeingdesignatedaseitherasetuporknockoutpitchforthesequence.Givengeneraldefinitionalconstraints,itisexpectedthattherearefarmoresetuppitchesandfarfewerknockoutpitches.Thenomenclaturehereisnotintendedtosuggestthatsetuppitchesarenotthrownwiththegoalofachievingafavorableoutcome(e.g.,strikingthebatterout,recordinganout)atthatspecificmoment.Likewise,aknockoutorterminalpitchisnotaninterpretationofaplayer’smosteffectivepitcheswithrespecttoachievingthosesimilarlyfavorableoutcomes.Table4displays theprimarysetupandknockoutpitches thatwere identifiedupon inspectionofforwarddependencies.Itisimportanttonotethatnotallclustershaveknockoutpitches.Specifically,using thedefinitionabove, for these fewclusters, therewerenopitches thathada largeenoughpositivedifferencebetweenitsinflowandoutflowforaspecificpitchorhadconsistentlypositivedifferencesbetween inflowandoutflowformultiplesetuppitches. Instead, forclusterswithoutaknockout pitch, the forward dependencies between pitch-types were approximately equal inmagnitudeirrespectiveoftheexactdirectionbetweenthetwocomparisonfeatures(A,B)and(B,A).

Table4:SetupandKnockoutPitches

Note:“Val.ofDiff.”reportsthedifferenceinthesumofforwarddependenciesfor(A,B)andthesumoftheforwarddependenciesfor(B,A)withineachcluster.PitchBisconsideredaknockoutpitch(i.e.,localsink)givenarelativelylargepositivedifferencebetweenthesevalues.Additionally,apitchBcanbeconsideredaknockoutpitchiftherearemultiplepitchesA,C,D…,nsuchthatthedifferencebetween∑ (𝑖, 𝐵)!

"#$,"&' ≫∑ (𝐵, 𝑖)!

"#$,"&' isalsotrue.

Themagnitudeofthedifferencebetweentheforwarddependenciesfromonedirectiontoanotheralso indicate that certain knockout pitches are “stronger” than other knockout pitches betweenclusters.Asshownincluster3,thereisaveryextremedistinctionbetweenthesumoftheforwarddependenciesin(FF,CH)versusthesumoftheforwarddependenciesin(CH,FF).Thisimpliesthatsequencesthatbelongtocluster3rarelyusechangeupsunlessthechangeupcomesattheendofthesequence.Incomparison,thegreatestdifferencebetweenforwarddependenciesforcertainclusterssuchascluster12isfarlowerthanthedifferencethatisdisplayedincluster3.Thisfindingsuggeststhat thechangeup, theknockoutpitch incluster12, isnotexclusivelyusedtowardstheendofa

Page 11: Decoding MLB Pitch Sequencing Strategies via Directed

11

sequence.Acrossallsequenceclusters,knockoutpitchestendtobebreaking-ballswhereassetuppitchesarelargelyfastballsorfastballvariations.Whilethisintuitionisassumedastrivialdomainknowledge for teams, coaches, and players, this quantitative assessment of setup and knockoutpitcheswithinthecontextofpitchsequencesislikelynovel.3.1.3Pitch-TypeSequencinginContextThereareseveralsequencingstrategiesandpatternsthatarepotentiallyobservablefromthemodel-results.Forexample,eachsequenceclusterinvolvesacombinationoffastballsandoff-speedpitches,ashighlightedabove.Itisperhapssignificanttonotethattherearemultiplesequencesthatsharepitch-typeconstitutionyetdifferinordering.Thissuggeststhatorderingisanimportantcomponentof pitch-type sequences which alternatives such as simple pitch frequencies and pitch-to-pitchcorrelationdonototherwisecapture.Sinceitisexpectedthatpitchersthrowpitchsequencesthatbelongtodifferentseveralclusters,itisfeasibletoanalyzehowsequenceusagevariesinmultiplein-gamecontextsandmatchups.Figure5&6(Appendix)displaystherelativefrequencyofeachclusterconditionalonpitcher-batterhandednessandstarter/relieverrole,asexamples.3.2.1ZoneSequenceClustersAs with pitch-type sequencing, there are several universal zone sequencing patterns that wereidentifiedbymodel-basedclustering.Pitcherstendtovaryinhowtheyset-upbattersviatheirzonesequences,buttheirusageoftheseclustersisnotuniformlydistributedacrossgameplaysituations(SeeFigure7,Appendix:Figure8&9).Sinceplayershaveallzonesavailabletothem(i.e.,theyarenotinhibitedbypitcharsenal),clusterpopularityrelatestowhichlocationstrategiesarepreferredbyMLBpitchersinsteadofphysicalorarsenalconstraints.Itissignificanttonotethatthemodel-basedclusteringidentifiesfewerclustersforzone-basedsequences thanpitch-typesequences.Additionally,whencomparedbetweeneachother,zone-basedclustersreflectthepitcher’sdesiretonotleavepitchesinthemiddleorhighinthezonetopreventsolid-contactandbatted-ballsthatarehitintheair(Table5).Instead,eachclusterhasahighrepresentationofzonesthatarelowerdowninthestrikezone.Thisultimatelysupportsacasethatpitch-typesequencingismorecomplexinaggregatethanzonesequencing.

Figure7:ZoneSequenceClusterbyRelativeFrequency/Density

Aswithpitch-typesequenceclustering,model-basedclusteringallowsustocalculateaprobabilisticdistributionforeachuniqueat-bat’sassignmenttoazonesequencecluster.Likepitch-typesequenceclusters, theoverwhelmingmajorityofat-batsequences(99.60%)areassignedtoclusterswithahighprobability(greaterthanorequalto99%assignmentprobability).The0.40%ofat-batsthatare

Page 12: Decoding MLB Pitch Sequencing Strategies via Directed

12

assigned to a zone clusterwith less than an associated 99% probability all have an assignmentprobabilitygreaterthan50%andhaveameanassignmentprobabilityof88.99%.Aswiththepitch-typeclusters,here isnoexplicitpatternas towhichzonesequences seemdifficult toassign toacluster.Theseresultsindicatethattheprimaryzoneclusterassignmentcanbeuniformlyusedgiventhehighassignmentprobabilities.

Table5:ZoneClusterMembership

3.2.2ZoneSequencinginContextIfzonesequencingstrategiesarerelativelynon-complex,thenitisexpectedthatthedistributionofzone sequence clusters to be relatively consistent between pitchers and across matchups. Inparticular, split summary statistics can easily describe how the usage of zone sequence clustersdiffersagainstvariousright-andleft-handedbattersandbetweenstartersandrelievers.Thetwozone-basedsequenceclusters(clusters4and6)thathaveahigherfrequencyofpitchesuporinthemiddleofthestrikezonetendtobethrowninsimilarcontexts,asshowninFigure8&9(Appendix).Bothcluster4andcluster6arethrownmorebyrelieversthanbystartersincomparisontotheotherzoneclusters,thoughtheabsolutedifferenceinusagebetweenstartersandrelieversisfarlessdrasticthandifferencesthatareobservedinusageofpitch-typeclustersbetweenstartersandrelievers.Additionally,bothclusters4and6are thrownpredominantlywhenpitcherssharehandednesswiththeirbatters.3.3RelationshipbetweenPitch-TypeandZoneSequencingTheChi-Squaretestofindependence(df=104)findsamplestatisticalevidencetosuggestthatthepitch-typeandzone-clustersarenotindependent;thatis,thevalueofoneclustervariablesimpliesachangeintheexpectedprobabilitydistributionoftheother(p<0.001).Giventhatcertainpitch-typescoincidewith location (e.g., breaking-balls aredesigned tobe in lowerzones), it is expected thatclusterswith high frequencies of these types of pitches tend to be associatedwith certain zonesequences.Thus,futureanalysisandworkcanattempttore-clusterpitchlabels(beforesequenceanalysis)by including locationasapitch-level characteristic. Inaggregate,playersdonotexhibitstrongdifferencesinzonesequencingstrategiesabovewhatisexpectedbydifferencesintheirpitch

Page 13: Decoding MLB Pitch Sequencing Strategies via Directed

13

arsenals.However, it is noteworthywhenpitchersdecide todeviate from their otherwise strongtendencies. This analytical framework allows teams to evaluate game- and matchup-levelcircumstanceswhentheymayanticipateshiftsinpitchsequencingbehavior.3.4ExamplePlayerSequencingComparisonsTheclusteringframeworkenablesplayercomparisononthebasisofpitchsequencing.Whileit isvaluabletoknowwhatsequencesaplayerisaccustomedtothrowing,itisincreasinglyimportanttohaveanunderstandingofhowaplayer’s sequencingbehavior changes invariousgamecontexts.Teamscanachieveanedgeinscouting,playerdevelopment,andmatchuppreparationbyusingthesetoolstodecryptpitchingpatternsandinvestigatemotivationsbehindthesedecisions.Figure 10 displays side-to-side comparison of the pitch-type sequencing strategies used by 4prominent MLB pitchers against left-handed and right-handed batters. There are several keytakeawaysthatareobservablefromthissmallsampleandarelargelyreflectiveofbaseballintuition.Each pitcher relies on various clusters or pitch-type sequencing patterns. Pitchers are largelyincentivized to remainunpredictable andvariance inpitch-type sequencing is one componentoftheirabilitytodeceiveabatter.Therearealsocommonalitiesamongpitcherswhoshowsimilaritiesinsequencepatterndistribution.Forexample,starters(e.g.,BlakeSnell)tendtoshowmorevarianceandmatchupversatilityintheirpitch-typesequencingstrategiesthanChapmanandHader,bothofwhomarehigh-leveragerelievers.Usingapitcher’sclusterdistributionasaproxytheoveralldiversityoftheirsequencingstrategies,astrong negative correlation between sequencing complexity and fastball velocity is detected.Intuitively,thissuggeststhatplayerswhothrowhardcanrelyontheir“stuff”more-sothanplayerswhodonothavethatadvantage.Conversely,pitchersthatcannotthrowinhighvelocitiesneedtotakemorediverseandcomplexapproachestotheirsequencingdecisions.Thisfindingisconsistentwithpriorresearchoneffectivevelocityandpitchsequencingstrategy[11].PlayersseemtopursueuniversalsequencingstrategieswithrespecttozoneselectionasshowninFigure11.Theirzonesequencedecision-makingishighlycorrelatedwiththeirpitcharsenal.Unlikepitch-typeselectionwhereplayersneedtodrawfrompitch-typesintheirpitch-arsenal,anyzoneisavailable toapitcherat agiven time, assuming thatapitcherhasageneral commandover theirpitches.Accordingly,eachpitcherinthissamplethrowsatleastonesequencethatbelongstoeachcluster.Aswithpitch-typesequencing,eachpitcherdisplaysauniqueassortmentofpatternsandbehaviorsintheirzonesequencingstrategies.BlakeSnell(LHP),forexample,takesdivergentapproacheswhenhepitchestoleft-handedandright-handedbatters;specifically,hereliesmuchmoreonsequencecluster6(characterizedbypitchesupinthezone)againstleft-handedbattersandmoreonsequencecluster 1 (mixed locations) and sequence cluster 8 (low locations). This shift is not particularlyobservable inotherpitchers and is visually apparent inFigure6.Otherpitchers, suchasAroldisChapman, Josh Hader, and Clayton Kershaw, depend on very similar zone sequencing strategiesagainstleft-handedandright-handedpitchers.So,whilezonesequencingstrategiesarelesscomplexthan pitch-type sequencing strategies in aggregate (See 3.2.1), players still exhibit significantdifferencesinhowtheyformstrategiesforlocatingtheirpitches.

Page 14: Decoding MLB Pitch Sequencing Strategies via Directed

14

Given that both pitch-type and zone sequences are clustered on an at-bat level, player-specificdevelopmentandevolutionacrossvarioustime-spans(e.g.,withinagameorseason)isobservable.Whilesomeplayers’sequencingbehaviorsarerelativelystableacrosstheseasonstheyparticipatedin,otherplayersshowevidenceofshifts inhowtheyapproachsequencingovertime.Thiscanbeattributedtochangesinmatchupstrategy,growthanddevelopmentofpitcharsenals,oreveninjurymitigation and recovery processes. Conversely, pitchers do not seem to exhibit significant time-dependenceintheirzonesequencingstrategies.

Figure10&11:Pitch-Type&ZoneSequencingTendencies–PlayerExamples

3.5AggregateSequence-BasedPlayerSimilarityAfteraggregatingpitch-leveldata,pitchsequencesareanalyzedattheat-batandplayer-careerlevel.Theat-batembeddingsareusedtoperformanalysisongame-varyingqualitiesofpitchsequencing;an example analysis that this paper explores is what pattern clusters are identifiable across allpitchers.Theaggregatepitchersequenceembeddingswillformthebasisforplayersimilaritysearch.Playersimilaritiesareimmenselyusefultoteams,scouts,andfans,andhavealonghistoryinbaseballanalytics dating back to Bill James’ similarity scores [11]. Since player similarity enhances ourcontextualunderstandingofaplayer’stendenciesandbehavior,itisvaluabletodetermineplayersimilarity as measured through sequencing decisions that players have made during in-gamesituations. InthesamewaythataDNAsequencedefinesthebiologicalmakeupofanorganism,apitchsequence—whenaggregatedacrossaplayer’scareer—summarizesthecumulativedecisions

Page 15: Decoding MLB Pitch Sequencing Strategies via Directed

15

that the pitcher has previously made. After embedding each player-aggregate sequence, playersimilaritiesarecomputedwiththedotproductofthequeryembeddingwithotherembeddingsinourplayersequencedatabase[5].Foreachplayersequencequery,theplayersequencingembeddingthatproduces the largestdotproductwith thequerywillbe takenas themostsimilarsequence.Whenusing rawembedding outputs, this similaritymethod takes into account diversity in pitcharsenals since pitch-types that are not seen in a specific players sequence will have a forwarddependencythatisinitializedwithazero-value.Usingthedotproduct,pitcherswithdiversepitcharsenals(i.e.,manypitch-typestheycanthrow)tendtoappearfrequentlyassimilaritymatchessincetheyhavethelastnumberofforwarddependenciesinitializedaszero.Thus,itisimportanttonotethat while these similarity matches are correlated with player pitch arsenals, sequence-basedsimilarityalsotakesintotheorderinwhichpitchesarethrown.Sequence-based player similarities are used to begin to explore the relationship between playervalueandsequencing.Drawingfromaplayersimilarityfunctionthatoutputstheclosestplayerforeachplayerinput,itisevidentthatsimilarityinpitch-typesequencingdoesnotcorrelatetoeithersimple(e.g.,K%,ERA,H/9)oradvancedmetricsofplayerperformance(e.g.,FIP,xFIP,tERA)abovewhatisachievedfromrandomassignment.Whiletherearemoreextensivestatisticalmodelsthatcantest thishypothesis further,playerswhomanifestsimilarsequencingstrategiesseemtovaryquitedrasticallyintheirperformance.Thissignalsthatsequencingalonedoesnotpredictaplayer’sperformanceorcompensateforalackofpitching“stuff.”

It is also important to acknowledge that the approach described in this paper should not beinterpretedasameanstoassignvaluetopitchsequencesorsequencingdecisions.Inordertodefinesequencingskillprecisely,researchersshouldstrivetoseparatetheeffectandvalueof individualpitches from the added benefit of a specific ordering. Instead, this research is instrumental inprovidingaframeworktodecodethesequencingstrategiesthatpitchersconsistentlyutilizeforthepurposes of in-game strategy. Instead of applying pitch sequencing similarities for player valuepredictions,playersequencingsimilaritiescanbeutilizedtoassistteamsinmatchuppreparationandpitchersintheirdevelopmentalgoals.

Fromthisgraphembeddingapproach,apitcher’saggregatesequencingdecisionscanberepresentedin total by a directed graph or network. Each graph representation can easily be presented as avisualization(SeeFigure12&13)thatcharacterizesapitcher’ssequencingpatternsacrossmultipleseasons (2016-2019). This visualization can capture the general complexity of each pitcher’spitchingstyle foravarietyofresearchanddevelopmentpurposes.Therelativesizeofeachnodecorrespondstotherelativefrequencyofeachpitch-typewhereastherelativethickness(i.e.,weight)ofeachedgecorrespondstotheforwarddependencybetweentwopitch-types.In order tomaintain visual consistency across all graphs, eachmajor pitch-type is included in apitcher’sgraphvisualizationirrespectiveofwhetherthatpitcherhasthatpitchintheirarsenal.Eachlinkiscoloredaccordingtothedirectoftheoriginatingnode.Thestrongerassociationislaidovertheweakerassociationdirectionally.Therelativesizeofeachnodedenotestheexistenceofthatpitchinaspecificpitcher’sarsenal.Thisconsistency inthevisualizationformatallowsforaqualitativeside-to-side comparison of player network graphs, though the design/aesthetic features arecustomizablebytheuser.Thedatathatunderlieseachnetworkvisualizationcanbelimitedtocertainin-gamescenariosormatchupstoillustratesituationaldecision-making.

Page 16: Decoding MLB Pitch Sequencing Strategies via Directed

16

Figure12&13:SequenceGraphNetworksforBlakeSnellandClaytonKershaw(2016-2019)

4 RelatedApplications&FutureWorkIn addition todeveloping anovel analytical framework to studypitch sequencing strategies, thispaperpresentsmultipletoolsthatMLBteamscanuseinmatchuppreparation,scouting,andplayerdevelopment.Sincepitchsequencingisotherwisedifficulttoencodeforusingtraditionalstatisticalmethods, teams can take immediate steps in applying this graph embedding approach to studypitcher styles and behavior. There are many questions to be asked about the motivationssurroundingusageofcertainpitchsequences.Thisframeworkenablesresearcherstoreproducethisapproachintheiranalysisofpitcherbehavior.Furthermore,graphnetworkvisualizations(asshownin Figure 12 & 13) that are based upon player-aggregate sequence data can be helpful incommunicatingapitcher’ssequencingprofileinaninterpretablemanner.Whenmappedinatime-series,teamscanobservehowsequencenetworkschangeovertimeandhowplayersaredevelopingand modifying their existing strategies. This tool can supplement advance scouting preparationagainst other teams, especially after conditionalizing player data based on certain in-game andmatchupscenarios.Additionally,teamscanconsiderhowapitcher’ssequencingchangesinresponsetotheirpreviousoutcomesinanat-batorgame.Thereisastrongstrategiccomponenttothisworkaswell—futureworkmayconsidertestingwhetherpitchsequenceclustersarepredictiveofplayerdecisions,whichwouldimplythatpitchersarevictimstoinformationleakageatthesequencelevel.Finally,stepsaretakentodevelopanovelopen-sourceaggregationdatabasethatallowsindependentresearchersandMLBteamstoperformpitchsequenceanalysisbasedonthisresearch.Functionalityofthistoolkitincludessearchingforspecificpitchsequencesthathaveoccurredoveramulti-yearspan,ability tovisualizenetworksanddownloadpitchsequencinggraphembeddings forspecificpitchers,andaccesstodrop-downcomparisonsandplayer-similaritymatching.

Page 17: Decoding MLB Pitch Sequencing Strategies via Directed

17

References[1]Bock,J.(2015).PitchSequenceComplexityandLong-TermPitcherPerformance.Sports,40-55.[2]Sharpe,S.(2020).MLBPitchClassification.https://technology.mlblogs.com/mlb-pitch-classification-64a1e32ee079.[3]Glenn,H.,&ZhaoS. (2017).UsingPitchf/x tomodel thedependenceof strikeout rateon thepredictabilityofpitchsequences.https://content.iospress.com/articles/journal-of-sports-analytics/jsa103.[4] Zhan, J., Gerstner, L., & Polimeni, J. (2020). Measuring the Impact of Robotic Umpires.https://global-uploads.webflow.com/5f1af76ed86d6771ad48324b/5f6a65851d1ac98081a707f0_Zhan_MeasurinM-the-impact-of-robotic-umpires.pdf.[5] Ranjan, C., Ebrahimi S., & Paynabar, K. (2016). Sequence Graph Transform (SGT): A FeatureEmbeddingFunctionforDataMining.arXiv:1608.03533v13.[6]Manning,C.,Raghavan,P.,&Schütze,H.(2008).IntroductiontoInformationRetrieval.ISBN:0521865719.[7]Srihari,S.(2019).MixturesofGaussians.https://cedar.buffalo.edu/~srihari/CSE574/Chap9/Ch9.2-MixturesofGaussians.pdf. [8]Carrasco,O.(2019).GaussianMixtureModelsExplained.TowardsDataScience.towardsdatascience.com/gaussian-mixture-models-explained-6986aaf5a95.[9]Brossard,E.(2010).GraphTheory:NetworkFlow.UniversityofWashington.https://sites.math.washington.edu/~morrow/336_10/papers/elliott.pdf.[10] Agarwal, B., & Khan, M. H. (2013). Analysis of Rank Sink Problem in PageRank Algorithm.InternationalJournalofScientific&EngineeringResearch.https://www.ijser.org/researchpaper/Analysis-of-Rank-Sink-Problem-in-PageRank-Algorithm.pdf.[11]DrivelineBaseball.(2019).CallingtheRightPitch:InvestigatingEffectiveVelocityattheMLBLevel.https://www.drivelinebaseball.com/2019/05/calling-right-pitch-investigating-effective-velocity-mlb-level/.[12]BaseballReference.(2020).SimilarityScores.https://www.baseball-reference.com/about/similarity.shtml.

Page 18: Decoding MLB Pitch Sequencing Strategies via Directed

18

AppendixTable2:OrderComparisonbyCluster

Table3:ClusterAverageEntropy

Figure5&6:Pitch-TypeSequenceClusterSplitsbyMatchupType&Role

Figure8&9:ZoneSequenceClusterbyRelativeFrequency/Density