Pitch Classification with Mclust

Today I’m going to address classifying pitch types using Pitch F/X data. There are a number of options for classifying data. Here, I’ll focus on unsupervised classification and, more specifically, cluster analysis. There are multiple ways to apply clustering to multidimensional data, including k-means clustering, hierarchical clustering, model-based clustering, and others. Today, I’m going to introduce model-based clustering using the package mclust, because I’ve seen plenty elsewhere on k-means and hierarchical clustering of pitch types. I also think it might be the best candidate for clustering pitch types, which I’ll explain later.

Let’s begin with: why cluster analysis? Well, we don’t get to ask the pitcher “what pitch was that supposed to be” each time it is thrown. So we can’t take a bunch of known characteristics of known pitches and see if the new ones we have fit into that classification. Rather, we have to come up with our own classes. This is the “unsupervised” portion of clustering, and it leaves out other useful classification methods, such as classification trees, that tend to require training data with known classes.

While MLBAM gives us pitch type in the Pitch F/X data, they use their own algorithm to classify pitches. That means whatever they are telling us is also an estimate or a guess as to what the intent of the pitch was. We can use this as a baseline for comparison in our own analysis, but taking it as the true pitch type might be a mistake.

Before moving on, I’d like to clarify something about cluster analysis and pitch classification that goes largely ignored or misunderstood in the sabermetrics community. There are two possible goals of the cluster analysis: 1) Identify exactly the intention of the pitcher when throwing the pitch (i.e. knowing exactly what number of fingers the catcher put down prior to the pitch) and 2) Identify all the pitches that can reasonably be classified as “different” from one another. These goals are philosophically different, and that difference is important. I generally suspect Goal #2 is more reasonable, and actually much more interesting. The reason is that we will never know if we got the “right” classification, so focusing on this is likely an exercise in futility. However, it can be interesting to know if there are two types of curveballs, even within a given pitcher, for both training and scouting reasons, even if the intent of that pitcher is the same.

Ok, so I noted I would focus on model-based clustering. There are a few characteristics of this method that I think make it a great candidate for clustering pitch types, and this is why I want to focus on it today. These include:

1) No need to choose an arbitrary number of clusters (or even an educated guess).

There are many cases when trying to identify pitch types where we may be unsure of the number of different pitches a pitcher throws. Additionally, while a pitcher may have intent to throw a curveball, inconsistency in that pitch could mean there are essentially two different types of curveball that are thrown on a regular basis. When trying to understand different pitches coming from a given pitcher, this might be useful information for training or for scouting. Therefore, we let the data talk.

2) Different sizes and shapes of variance of clusters.

As with the choice of the number of clusters, we again let the data talk here. Mclust chooses the best solution to the clustering problem using BIC, while allowing consideration of different variance for each cluster. Other clustering techniques such as hierarchical or k-means clustering don’t allow for this sort of flexibility. Thinking about pitching, it makes a lot of sense to allow for this, given that we would probably expect more variance in the movement and velocity of a curveball or cutter than we would in a four-seam fastball. This should help improve our ability to identify these different pitches. It may also help, again, with scouting and/or training. If a pitcher has a high variance in the movement of his curveball, this may not be what he wants. Understanding different covariance estimates across pitches might help us to consider the consistency of a pitcher’s pitches. Model-based clustering allows for this by using density estimation as its basis for cluster assignment.

3) No need to choose centroid location.

Clustering using k-means methods usually requires a choice of where your centroids begin in the variable space of your observations. This may be totally fine, and you often can arrive at a reasonable solution. The risk, however, is that you end up at a local optimum and never get to a globally optimal solution for your clustering. This could produce misleading results and clusters that aren’t wholly representative of the pitch type(s).
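
To see the starting-point issue in action, here is a minimal sketch in base R. The data are simulated (three made-up 2-D groups, not pitch data), and the seeds and group locations are arbitrary assumptions. It runs kmeans from 20 different single random starts and records the total within-cluster sum of squares that each run ends at.

```r
# Simulated stand-in data: three 2-D groups (NOT real pitch data)
set.seed(1)
toy <- rbind(
  matrix(rnorm(200, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(200, mean = 3, sd = 0.5), ncol = 2),
  cbind(rnorm(100, mean = 0, sd = 0.5), rnorm(100, mean = 3, sd = 0.5))
)

# Run k-means from 20 different single random starts and keep the
# total within-cluster sum of squares each run converges to
wss <- sapply(1:20, function(s) {
  set.seed(s)
  kmeans(toy, centers = 3, nstart = 1)$tot.withinss
})

# Spread of solutions across starts
range(wss)

# Common mitigation: let kmeans try many starts and keep the best one
set.seed(99)
best <- kmeans(toy, centers = 3, nstart = 25)
best$tot.withinss
```

If the values in wss are not all the same, some starts stopped at a worse solution than others. The nstart argument is the usual guard against that in k-means.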

Alright, onto the fun stuff. For the analysis today, I’m going to use Mark Buehrle, because he is known to throw a mix of pitches, with no single pitch overly dominating our sample. If you visit Buehrle’s Fangraphs Pitch F/X page, you can see that his pitch mix is rather eclectic when compared to someone like, say, Aroldis Chapman.

Let’s begin by grabbing our data for Buehrle’s 2013 season from my website here. I went ahead and pulled it from my own database, as I don’t have one set up like Carson and Jim that was initially developed from pitchRx. Just click the link and download the file as a .CSV into your working directory. Then we can load it into R as well as our necessary library.

###load in Mark Buehrle 2013 pitch data
setwd("c:/...")
pitches <- read.csv(file="MarkB2013.csv", header=TRUE)
head(pitches)
    p_id   b_id  ab_id pitch_id pitch_event_description type id_in_game      x      y start_speed end_speed sz_top sz_bot pfx_x pfx_z     px    pz    x0 y0    z0    vx0      vy0    vz0     ax     ay      az
1 279824 435063  81675   220107           Called Strike    S        194  89.27 164.92        84.8      78.1   3.19   1.49  8.58  7.35  0.325 1.609 1.381 50 5.831 -5.309 -124.066 -6.103 13.340 25.451 -20.676
2 279824 434567 110773   331257           Called Strike    S         84  95.28 163.19        85.6      78.6   3.38   1.55  3.79  9.57  0.084 1.635 1.259 50 5.845 -4.111 -125.216 -6.934  5.987 26.720 -17.002
3 279824 453056 133299   417033                    Ball    B         29 161.37 167.51        84.6      79.0   3.53   1.65  5.67  6.10 -1.793 1.679 0.898 50 5.896 -8.414 -123.665 -5.755  8.882 21.830 -22.540
4 279824 433898  54800   116462                    Foul    S        167  97.85 170.96        79.3      73.4   3.27   1.50  8.52  1.37  0.150 1.540 1.320 50 5.880 -5.200 -116.120 -3.330 11.600 22.270 -30.240
5 279824 429664  65243   157042           Called Strike    S        188  81.55 149.38        84.6      76.8   3.57   1.50 10.35  6.87  0.623 2.419 1.300 50 6.097 -4.894 -123.851 -4.463 15.767 29.200 -21.629
6 279824 474319 133331   417158          In play out(s)    X        293  91.85 143.33        83.6      77.4   3.76   1.62  8.13  5.06  0.211 2.517 1.027 50 5.911 -4.536 -122.464 -3.184 12.391 23.431 -24.383
  break_y break_angle break_length current_ball current_strike  on_1b  on_2b on_3b         sv_id pitch_type type_confidence my_pitch_type nasty
1    23.8       -29.4          6.6            0              1 120074     NA    NA 130511_152307         FT           0.897            NA    53
2    23.8       -15.6          4.9            3              0     NA 134181    NA 130608_132947         FF           0.900            NA    39
3    23.9       -17.7          6.4            0              2     NA     NA    NA 130630_134536         FF           0.900            NA    35
4    23.8       -19.1          9.7            0              0     NA     NA    NA 130415_194902         CH           0.900            NA    59
5    23.7       -32.9          7.2            0              0 458731 434624    NA 130425_200009         FT           0.894            NA    51
6    23.8       -24.9          7.3            3              1     NA     NA    NA 130630_145528         FT           0.900            NA    30
                                                                                                                                                                                            cc pitch_seq game_id
1 That was pitch number 45 for Mark Buehrle; his effectiveness may start slipping as he holds opposing hitters to a .295 average in the first 45 pitches but they hit .377 off him after that.         0    1084
2                                                                                                                                                                                                      0    1470
3                                                                                                                                                                                                      0    1768
4                                                                                                                                                                                                      0     725
5                                                                                                                                                                                                      0     867
6                                                                                                                                                                                                      0    1768
  inning half num end_ball end_strike end_outs stand                                                                                                     des     event  hit_x  hit_y hit_type bbtype pitcher_seq
1      3    2  27        0          3        3     R                                                                    Mike Napoli called out on strikes.   Strikeout     NA     NA              NA           0
2      2    1  12        3          1        3     R                                             Geovany Soto pops out to second baseman Emilio Bonifacio.     Pop Out 156.63 139.56        O     NA           0
3      1    2   4        1          2        1     L              Jacoby Ellsbury grounds out second baseman Munenori Kawasaki to first baseman Adam Lind.   Groundout 142.57 158.63        O     NA           0
4      3    1  22        1          2        1     R                   Jeff Keppinger grounds out pitcher Mark Buehrle to first baseman Edwin Encarnacion.   Groundout 125.50 183.73        O     NA           0
5      3    2  27        3          1        2     L Robinson Cano homers (7) on a fly ball to right field.    Jayson Nix scores.    Brett Gardner scores.    Home Run 206.83  64.26        H     NA           0
6      4    2  36        3          1        3     R                                              Brandon Snyder flies out to right fielder Jose Bautista.      Flyout 178.71 112.45        O     NA           0
  pitcher_ab_seq   def2   def3   def4   def5   def6   def7   def8   def9      date home away game wind  wind_dir temp game_type runs_home runs_away local_time first      last ump_id b_throws p_throws year
1              0 450317 452252 466988 543434 493128 466320 458675 430832 5/11/2013  bos  tor    1   19 Out to RF   63         1         2         3   13:35:00  Greg    Gibson     60        R        L 2013
2              0 450317 489365 466988 429665 430895 466320 458675 430832  6/8/2013  tor  tex    1   11 Out to RF   60         1         4         3   13:07:00 Bruce  Dreckman     80        R        L 2013
3              0 450317 452252 493128 430895 408314 434658 458675 430832 6/30/2013  bos  tor    1   12 Out to CF   83         1         5         4   13:35:00   Tim     Welke     14        L        L 2013
4              0 450317 429665 430895 136660 493128 466320 458675 466988 4/15/2013  tor  cha    1    0      None   68         1         4         3   19:07:00  Tony  Randazzo     36        R        L 2013
5              0 450317 429665 430895 543434 493128 466320 458675 430832 4/25/2013  nya  tor    1    6 Out to LF   65         1         5         3   19:05:00  Paul Schrieber     94        R        L 2013
6              0 450317 489365 493128 430895 408314 434658 458675 430832 6/30/2013  bos  tor    1   12 Out to CF   83         1         5         4   13:35:00   Tim     Welke     14        R        L 2013
    yr_gmID
1 2013_1084
2 2013_1470
3 2013_1768
4  2013_725
5  2013_867
6 2013_1768
nrow(pitches)
[1] 3532

###load in mclust library
library(mclust)

You should notice here that there are 3,532 pitches in the data set; however, some of these are from Spring Training. To ensure we don’t contaminate our data with some experimentation Buehrle may have done during that time, let’s restrict our data to only the regular season. You’ll see below that we remove a little less than 200 pitches, and are left with all 33 of Buehrle’s starts from 2013 in the regular season.

###subset to regular season only
pitches$date2 <- as.Date(pitches$date, "%m/%d/%Y")
pitches <- subset(pitches, date2 > as.Date("2013-04-01"))
length(unique(pitches$date2))
[1] 33
nrow(pitches)
[1] 3368

We also don’t need all of these variables cluttering up our data, so let’s restrict to the information from the Pitch F/X system that relates to location, velocity, and movement.

###reduce data to pitch f/x information
pitches <- pitches[,c(8:29, 36, 37)]
head(pitches)
       x      y start_speed end_speed sz_top sz_bot pfx_x pfx_z     px    pz    x0 y0    z0    vx0      vy0    vz0     ax     ay      az break_y break_angle
1  89.27 164.92        84.8      78.1   3.19   1.49  8.58  7.35  0.325 1.609 1.381 50 5.831 -5.309 -124.066 -6.103 13.340 25.451 -20.676    23.8       -29.4
2  95.28 163.19        85.6      78.6   3.38   1.55  3.79  9.57  0.084 1.635 1.259 50 5.845 -4.111 -125.216 -6.934  5.987 26.720 -17.002    23.8       -15.6
3 161.37 167.51        84.6      79.0   3.53   1.65  5.67  6.10 -1.793 1.679 0.898 50 5.896 -8.414 -123.665 -5.755  8.882 21.830 -22.540    23.9       -17.7
4  97.85 170.96        79.3      73.4   3.27   1.50  8.52  1.37  0.150 1.540 1.320 50 5.880 -5.200 -116.120 -3.330 11.600 22.270 -30.240    23.8       -19.1
5  81.55 149.38        84.6      76.8   3.57   1.50 10.35  6.87  0.623 2.419 1.300 50 6.097 -4.894 -123.851 -4.463 15.767 29.200 -21.629    23.7       -32.9
6  91.85 143.33        83.6      77.4   3.76   1.62  8.13  5.06  0.211 2.517 1.027 50 5.911 -4.536 -122.464 -3.184 12.391 23.431 -24.383    23.8       -24.9
  break_length pitch_type type_confidence
1          6.6         FT           0.897
2          4.9         FF           0.900
3          6.4         FF           0.900
4          9.7         CH           0.900
5          7.2         FT           0.894
6          7.3         FT           0.900

Notice that we also kept the pitch type and the confidence in that pitch type from the MLBAM classifications. Now that we have this information, let’s do our clustering very simply using start_speed, break_y, break_angle, and break_length. Given that I’m not a physicist, assistance from someone who understands the implications of each of these measures would help us to identify what this would look like in terms of pitch types. You can start here at Mike Fast’s blog with a glossary of what each of these variables means.

##perform model based clustering
clust1 <- Mclust(pitches[,c(3, 20, 21, 22)])
clust1
'Mclust' model object:
best model: ellipsoidal, varying volume, shape, and orientation (VVV) with 5 components

###plot BIC from EM algorithm
clust1BIC <- mclustBIC(pitches[,c(3, 20, 21, 22)])
png(filename="clusterBIC.png", height=700, width=900)
plot(clust1BIC)
dev.off()

[Figure: clusterBIC.png]

Here, you can see that our model-based clustering algorithm did lots of work for us. It chose 5 clusters (basically 5 different pitch types) and the VVV model, meaning that letting the covariance structure vary across clusters is appropriate (you can find a key for the different possibilities here). The data from MLBAM identify 6 different pitch types thrown by Buehrle; however, one of these is “IN”, which (I think!) means there was uncertainty over the pitch and no classification. So, it looks like we identify the same number of pitch types (5) as found by the MLBAM neural net (and this is what we see on Fangraphs as well for 2013; note that Buehrle had 6 pitches in some previous years).
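
If you want to poke at what the fitted object holds beyond the printed summary, the following is a small sketch. It uses R’s built-in faithful data as a stand-in (an assumption made so the example runs without the Buehrle file), so the model and number of components it picks will not match the pitch results above.

```r
library(mclust)

# Built-in Old Faithful data as a stand-in for the pitch variables
fit <- Mclust(faithful)

fit$modelName      # covariance structure chosen by BIC
fit$G              # number of components chosen by BIC
summary(fit$BIC)   # the best few model/component combinations by BIC

# One covariance matrix per component, stored as a d x d x G array
sigma <- fit$parameters$variance$sigma
dim(sigma)
```

The same fields should be available on clust1 from the post, where comparing the slices of the covariance array is one way to see how much tighter one pitch type is than another.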

One thing to note, however, is that clustering is sensitive to the scaling of your variables. We could rescale or normalize our variables to see if this changes the outcome. Depending on which are more or less important, we could choose to rescale some but not others. Additionally, you could play around with the different variables that you put into the clustering algorithm to see if all are necessary for the clustering to reach the given classifications. I encourage you to do so, but I won’t address this here for lack of room in this post.
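
As a rough illustration of the rescaling idea, the sketch below standardizes the four clustering variables with base R’s scale(). The six rows are copied from the head(pitches) output earlier in the post; six rows are far too few to cluster, so this only shows the transformation itself.

```r
# Six rows of the four clustering variables, copied from head(pitches)
X <- data.frame(
  start_speed  = c(84.8, 85.6, 84.6, 79.3, 84.6, 83.6),
  break_y      = c(23.8, 23.8, 23.9, 23.8, 23.7, 23.8),
  break_angle  = c(-29.4, -15.6, -17.7, -19.1, -32.9, -24.9),
  break_length = c(6.6, 4.9, 6.4, 9.7, 7.2, 7.3)
)

# Raw standard deviations: the variables sit on very different scales
sapply(X, sd)

# Center each column to mean 0 and scale it to standard deviation 1
X_scaled <- scale(X)
colMeans(X_scaled)
apply(X_scaled, 2, sd)

# With the full data, the analogous call would be (not run here):
# clust_scaled <- Mclust(scale(pitches[, c(3, 20, 21, 22)]))
```

Whether the scaled or unscaled fit is preferable is a judgment call; comparing the two classifications side by side is probably the most informative check.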

Now we can also attach the classification to our data so that after this, we can compare our clusters against the neural net used for the pitch type in the data already. Because we have a density model, we also have information on likelihood of cluster membership for each pitch. This is also appended to the data, and we can compare this to the type_confidence variable from the original data and neural net classification.

###attach cluster membership of each pitch to data set and membership probabilities for each
pitches$class1 <- clust1$classification
cProb <- data.frame(clust1$z)
pitches <- cbind(pitches, cProb)
head(pitches)
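
The membership probabilities attached above can also be used to flag the pitches the model is least sure about. Here is a minimal sketch, again using the built-in faithful data as a stand-in (an assumption, so that it runs without the Buehrle file); the names max_prob and least_sure are just illustrative.

```r
library(mclust)

# Stand-in data; with the post's data this would be clust1 instead of fit
fit <- Mclust(faithful)

z <- fit$z                      # one row per observation, one column per cluster
head(round(z, 3))

# Highest membership probability for each observation
max_prob <- apply(z, 1, max)

# The hard classification is simply the most probable cluster
hard_class <- apply(z, 1, which.max)

# Observations sorted from least to most certain
least_sure <- order(max_prob)
head(cbind(faithful[least_sure, ],
           class    = fit$classification[least_sure],
           max_prob = round(max_prob[least_sure], 3)))
```

For pitches, the rows at the top of that listing would be the ones sitting between two pitch types, which is where I would expect most of the disagreement with type_confidence to show up.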

Next, let’s simply take a look at the separation of our clusters visually. We can project both the clusters and their respective covariance estimates using the coordProj function. Note that you may have to try some different combinations of variables, particularly if you have higher dimensional data, to get the best picture of separation of the clusters. I’ve already pre-screened this, and present the three best projections of the clusters below (break_y seems to be somewhat discrete, so it does not help with seeing the clusters well in a plot):

###visualize clusters along variables used in analysis
png(file="clusterProj.png", height=450, width=1200)
par(mfrow=c(1,3))
coordProj(pitches[,c(3, 20, 21, 22)], dimens=c(1,3), what="classification", classification=clust1$classification, parameters=clust1$parameters)
coordProj(pitches[,c(3, 20, 21, 22)], dimens=c(1,4), what="classification", classification=clust1$classification, parameters=clust1$parameters)
coordProj(pitches[,c(3, 20, 21, 22)], dimens=c(3,4), what="classification", classification=clust1$classification, parameters=clust1$parameters)
dev.off()

[Figure: clusterProj.png]

Notice the separation of clusters, as well as their different sizes, shapes, and orientations. This is, again, the advantage to the model-based clustering method. Here, we see the orange cluster getting significant separation from the others, particularly when looking at starting speed and break_length. To see if this makes sense, let’s pair our classifications alongside the neural net pitch_type that came along with the original data.

###tabulate pitch_type and class from the cluster analysis
table(pitches$class1, pitches$pitch_type)
         CH  CU  FC  FF  FT  IN
  1   0   3   0   0  56 555   0
  2   0 667   1   1  12  31  12
  3   0 107   4 646 244   0   0
  4   0   0   0   4 744  42   0
  5   0   0 237   2   0   0   0

###look at confidence across pitch types
tapply(pitches$type_confidence, pitches$pitch_type, mean)

Based on this, we can likely identify Cluster 1 as Two-Seam Fastballs (FT), Cluster 2 as Change-ups (CH), Cluster 3 as Cutters (FC, though there is significant uncertainty for this class), Cluster 4 as Four-Seam Fastballs (FF), and Cluster 5 as Curveballs (CU). It looks like the Cutter category is uncertain, but this makes sense. We would expect Cutters to possibly lose a little velocity, but be similar to Four-Seam Fastballs (and we see overlap here). But Cutters might also drop slightly in velocity and move similarly to a Change-up (and we see some overlap here as well). Overall, I’m pretty happy with this given the very few dimensions we used to analyze the data. Interestingly enough, Cutters have the highest confidence level on average from the neural net from the original data. So there might be something we’re missing here by not including some of the other movement variables in the data.
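
To put rough numbers on that agreement, here is a small sketch that rebuilds the cross-tabulation above by hand. It assumes the leading, unlabeled column of zeros in the printed table is an empty pitch_type level and drops it. The “purity” figure (share of pitches that fall in their cluster’s dominant MLBAM type) is a simple summary added for illustration, not something mclust reports.

```r
# Counts copied from table(pitches$class1, pitches$pitch_type) above,
# without the all-zero unlabeled first column (assumed empty level)
tab <- matrix(
  c(  3,   0,   0,  56, 555,  0,
    667,   1,   1,  12,  31, 12,
    107,   4, 646, 244,   0,  0,
      0,   0,   4, 744,  42,  0,
      0, 237,   2,   0,   0,  0),
  nrow = 5, byrow = TRUE,
  dimnames = list(cluster = as.character(1:5),
                  pitch_type = c("CH", "CU", "FC", "FF", "FT", "IN"))
)

# Share of each cluster assigned to each MLBAM pitch type
round(prop.table(tab, margin = 1), 3)

# Dominant MLBAM type per cluster
dominant <- colnames(tab)[apply(tab, 1, which.max)]
dominant

# Share of all pitches that sit in their cluster's dominant type
purity <- sum(apply(tab, 1, max)) / sum(tab)
purity
```

With these counts, about 85% of pitches (2,849 of 3,368) land in their cluster’s dominant MLBAM type, and Cluster 3 is the weak spot at roughly 65% Cutters.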

So which cluster is the Orange one in the plots? Let’s take a look at the velocity averages of each cluster to find out.

###summarize each cluster with respect to start speed
tapply(pitches$start_speed, pitches$class1, mean)
       1        2        3        4        5
83.92036 78.62320 80.51109 84.73165 72.89456

We know that in our projection of the clusters, the orange cluster was the one with the lowest velocity. Given the summary above, that’s Cluster 5, which we noted was most likely to be Curveballs. This makes sense, given the velocity difference and lack of overlap as it relates to movement: Curveballs should move very differently from a Change-up and any of the fastball types thrown by Buehrle. You can also see the close velocity averages for Change-ups and Cutters from Buehrle, making it likely that either the neural net or our own clustering could be messing these up a bit. Note that based on velocity, things make sense among our other groups as well. Four-Seam Fastballs are thrown the hardest at an average of nearly 85 mph, and Two-Seamers are slightly behind.

All in all, I’m pretty happy with this quick and dirty clustering and identification of pitch types. It seems that cutters are the most difficult to deal with, but this isn’t particularly surprising. We otherwise got nearly perfect agreement with the neural net already included in the data outside of that group. I encourage you to continue with other players, and with other variables and transformations of those variables to see if you can get something even closer to the provided classifications. That doesn’t necessarily mean they’re better (maybe the neural net isn’t very good!), but it’s an interesting exercise to try and duplicate.

One thing to note here is that we only used data from a single pitcher to identify pitches. It is a much harder task to do this with multiple pitchers in the same data set, as Aroldis Chapman’s change-up is likely much faster than Buehrle’s. This likely would require some transformation of the data and possibly some prior clustering of pitchers and doing separate analyses on those different pitcher classifications. We’ll leave that for another day, or for someone with significantly more time on their hands.
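
One simple version of that kind of transformation is to standardize each variable within pitcher before pooling. The sketch below does this for velocity only, on simulated data: the two pitcher labels and their speed distributions are hypothetical, and this is just one possible approach rather than a tested method for multi-pitcher clustering.

```r
# Hypothetical two-pitcher data (simulated, NOT real Pitch F/X values)
set.seed(42)
toy <- data.frame(
  pitcher     = rep(c("soft_tosser", "flamethrower"), each = 50),
  start_speed = c(rnorm(50, mean = 82, sd = 3),
                  rnorm(50, mean = 97, sd = 3))
)

# z-score velocity within each pitcher so that "fast for this pitcher"
# is comparable across pitchers
toy$speed_z <- ave(toy$start_speed, toy$pitcher,
                   FUN = function(v) (v - mean(v)) / sd(v))

# Each pitcher now has mean 0 and standard deviation 1 on the new scale
tapply(toy$speed_z, toy$pitcher, mean)
tapply(toy$speed_z, toy$pitcher, sd)
```

This removes differences in overall arm strength, but it also assumes every pitcher’s mix is centered in a comparable way, which will not hold for pitchers with very different repertoires.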

Perhaps in my next post, I’ll go ahead and see what kinds of results we get when using other clustering techniques. Until then, I hope you’ve enjoyed today’s post!

11 responses

  1. Great post! Thanks! FYI….can’t get the link for covariance structures across clusters to work.

  2. Thanks! Should be updated with the correct link now. I appreciate you letting me know of the issue.

  3. IN is intentional ball.

  4. You are right. Brain fart on my part. Makes sense that they would look like changeups based on the velocity.

    Thanks for the comment!

  5. Hi bmmills,
    When I try to run mclustBIC(pitches[,c(3,20,21,22)]), I am given this message:
    “Error in if (loglik > signif(.Machine$double.xmax, 6) || any(!c(scale, :
    missing value where TRUE/FALSE needed”

    Is there supposed to be something after pitches() that I should include?

    Originally Mclust(pitches[,c(3,20,21,22)]) wasn’t working for me either, but then I added in modelNames = c("VVV") afterwards and it worked. Should I be doing something similar with mclustBIC? Thanks!

  6. Hmmmm. It sounds like it is having trouble calculating BIC in both cases. By using modelNames=c("VVV"), you ultimately did not allow it to use BIC to choose the optimal model. Normally, you would want to let it go through with that itself, unless you have a specific model in mind for the clustering.

    It sounds like there might be missing data somewhere. Did you fully clean the data with the code used in the post prior to applying the Mclust function? I just re-ran all the code on my computer and things seem to work out OK. I am running R 3.1.2, in case there might be some changes since that version (sometimes things get a little wonky).

    Let me know if you continue having trouble with the clean data, but it’s not clear to me what the specific error is.

  7. Thanks for the prompt reply! I reviewed the code but still can’t seem to figure out where the error is… Would be happy to send you my code for you to glance at. Now it’s really nagging me why it’s not working!

  8. Feel free to send it along and I’ll see if it works on my computer. bmmillsy at hhp dot ufl dot edu

  9. I’m getting the “Error in if (loglik > signif(.Machine$double.xmax, 6) || any(!c(scale, :
    missing value where TRUE/FALSE needed” from Mclust() as well. I don’t have missing data. Were you able to determine the cause?

    Any help is much appreciated

  10. Same problem but with other dataset. Maybe there is a bug in the most recent mclust?

  11. Hey Everyone,

    I ended up having the same problem on a different computer after an R update. Be sure that you have mclust version 5.0.2 installed in your R library. It seems that the authors did some bug fixes after 5.0.0. I think these took place in 5.0.1, but you might as well go for the newest version to be safe.

    I hope this helps, and sorry for the inconvenience on trying to get this to work.
