Today I’m going to address classifying pitch types using Pitch F/X data. There are a number of options for classifying data. Here, I’ll focus on unsupervised classification, and more specifically, cluster analysis. There are multiple ways to apply clustering to multidimensional data, including k-means clustering, hierarchical clustering, model-based clustering, and others. Today, I’m going to introduce model-based clustering–using the package mclust–because I’ve seen plenty written elsewhere on k-means and hierarchical clustering of pitch types. I also think it might be the best candidate for clustering pitch types, for reasons I’ll explain later.
Let’s begin with: why cluster analysis? Well, we don’t get to ask the pitcher “what pitch was that supposed to be?” each time one is thrown. So we can’t take a bunch of known characteristics of known pitches and see whether new pitches fit into that classification. Rather, we have to come up with our own classes. This is the “unsupervised” portion of clustering, and it rules out other useful classification methods that require training data with known classes, such as classification trees.
While MLBAM gives us pitch type in the Pitch F/X data, they use their own algorithm to classify pitches. That means whatever they tell us is also an estimate–a guess at the intent of the pitch. We can use it as a baseline for comparison in our own analysis, but taking it as a given for the pitch type might be a mistake.
Before moving on, I’d like to clarify something about cluster analysis and pitch classification that goes largely ignored or misunderstood in the sabermetrics community. There are two possible goals of these cluster analyses: 1) identify exactly the intention of the pitcher when throwing the pitch (i.e., knowing exactly what number of fingers the catcher put down prior to the pitch), and 2) identify all the pitches that can reasonably be classified as “different” from one another. These goals are philosophically different, and that difference is important. I generally suspect Goal #2 is more reasonable, and actually much more interesting. The reason is that we will never know if we got the “right” classification, so focusing on this is likely an exercise in futility. However, it can be interesting to know if there are two types of curveballs–even within a given pitcher–for both training and scouting reasons, even if the intent of that pitcher is the same.
Ok, so I noted I would focus on model based clustering. There are a few characteristics of this method that I think make it a great candidate for clustering pitch types, and this is why I want to focus on this method today. These include:
1) No need to choose an arbitrary number of clusters (or even make an educated guess).
There are many cases when trying to identify pitch types where we may be unsure of the number of different pitches a pitcher throws. Additionally, while a pitcher may intend to throw a curveball, inconsistency in that pitch could mean there are essentially two different types of curveball thrown on a regular basis. When trying to understand the different pitches coming from a given pitcher, this might be useful information for training or for scouting. Therefore, we let the data talk.
2) Different sizes and shapes of variance of clusters.
As with the choice of the number of clusters, we again let the data talk here.
mclust chooses the best solution to the clustering problem using BIC, while allowing the variance of each cluster to differ. Other clustering techniques such as hierarchical or k-means clustering don’t allow for this sort of flexibility. Thinking about pitching, it makes a lot of sense to allow for this, given that we would probably expect more variance in the movement and velocity of a curveball or cutter than in a four-seam fastball. This should help improve our ability to identify these different pitches. It may also help, again, with scouting and/or training. If a pitcher has high variance in the movement of his curveball, this may not be what he wants. Understanding the different covariance estimates across pitches might help in assessing the consistency of a pitcher’s pitches. Model-based clustering allows for this by using density estimation as its basis for cluster assignment.
3) No need to choose centroid location.
Clustering using k-means methods usually requires a choice of where your centroids begin in the variable space of your observations. This may be totally fine, and you often can arrive at a reasonable solution. The alternative, however, is that you end up at local optima and never reach a globally optimal solution for your clustering. This could yield misleading results and clusters that aren’t wholly representative of the pitch type(s).
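To make these three points concrete, here is a minimal sketch on synthetic data–fake velocities and break lengths of my own invention, not real pitches–showing mclust choosing both the number of clusters and the covariance structure on its own:

```r
###sketch: mclust picks the number of clusters itself via BIC
library(mclust)

set.seed(42)
# Two made-up groups: a tight "fastball" cloud and a looser "curveball" cloud
fastballs  <- cbind(rnorm(200, mean = 91, sd = 0.8),  # velocity (mph)
                    rnorm(200, mean = 4,  sd = 0.5))  # break length (in)
curveballs <- cbind(rnorm(100, mean = 77, sd = 2.0),
                    rnorm(100, mean = 10, sd = 1.5))
fake_pitches <- rbind(fastballs, curveballs)

# No cluster count, no starting centroids: Mclust tries G = 1:9 with every
# covariance structure and keeps the model with the best BIC
fit <- Mclust(fake_pitches)
fit$G          # number of clusters chosen by BIC (2 for this toy data)
fit$modelName  # chosen covariance structure
```

Notice that each cluster gets its own fitted covariance, so the tight group and the loose group are each modeled on their own terms.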
Alright, onto the fun stuff. For the analysis today, I’m going to use Mark Buehrle. The reason is that he is known to throw a mix of pitches, with no single pitch overly dominating our sample. If you visit Buehrle’s Fangraphs Pitch F/X page, you can see that his pitch mix is rather eclectic compared to someone like, say, Aroldis Chapman.
Let’s begin by grabbing our data for Buehrle’s 2013 season from my website here. I went ahead and pulled it from my own database, as I don’t have one set up like Carson and Jim’s, which was initially developed from pitchRx. Just click the link and download the file as a .CSV into your working directory. Then we can load it into R, along with our necessary library.
###load in Mark Buehrle 2013 pitch data
setwd("c:/...")
pitches <- read.csv(file="MarkB2013.csv", h=T)
head(pitches)

    p_id   b_id  ab_id pitch_id pitch_event_description type id_in_game      x      y start_speed end_speed sz_top sz_bot pfx_x pfx_z     px    pz    x0 y0    z0    vx0      vy0    vz0     ax     ay      az
1 279824 435063  81675   220107           Called Strike    S        194  89.27 164.92        84.8      78.1   3.19   1.49  8.58  7.35  0.325 1.609 1.381 50 5.831 -5.309 -124.066 -6.103 13.340 25.451 -20.676
2 279824 434567 110773   331257           Called Strike    S         84  95.28 163.19        85.6      78.6   3.38   1.55  3.79  9.57  0.084 1.635 1.259 50 5.845 -4.111 -125.216 -6.934  5.987 26.720 -17.002
3 279824 453056 133299   417033                    Ball    B         29 161.37 167.51        84.6      79.0   3.53   1.65  5.67  6.10 -1.793 1.679 0.898 50 5.896 -8.414 -123.665 -5.755  8.882 21.830 -22.540
4 279824 433898  54800   116462                    Foul    S        167  97.85 170.96        79.3      73.4   3.27   1.50  8.52  1.37  0.150 1.540 1.320 50 5.880 -5.200 -116.120 -3.330 11.600 22.270 -30.240
5 279824 429664  65243   157042           Called Strike    S        188  81.55 149.38        84.6      76.8   3.57   1.50 10.35  6.87  0.623 2.419 1.300 50 6.097 -4.894 -123.851 -4.463 15.767 29.200 -21.629
6 279824 474319 133331   417158          In play out(s)    X        293  91.85 143.33        83.6      77.4   3.76   1.62  8.13  5.06  0.211 2.517 1.027 50 5.911 -4.536 -122.464 -3.184 12.391 23.431 -24.383

  break_y break_angle break_length current_ball current_strike  on_1b  on_2b on_3b         sv_id pitch_type type_confidence my_pitch_type nasty
1    23.8       -29.4          6.6            0              1 120074     NA    NA 130511_152307         FT           0.897            NA    53
2    23.8       -15.6          4.9            3              0     NA 134181    NA 130608_132947         FF           0.900            NA    39
3    23.9       -17.7          6.4            0              2     NA     NA    NA 130630_134536         FF           0.900            NA    35
4    23.8       -19.1          9.7            0              0     NA     NA    NA 130415_194902         CH           0.900            NA    59
5    23.7       -32.9          7.2            0              0 458731 434624    NA 130425_200009         FT           0.894            NA    51
6    23.8       -24.9          7.3            3              1     NA     NA    NA 130630_145528         FT           0.900            NA    30

[... further columns (cc, pitch_seq, game_id, inning, at-bat results, defense, date, park, umpire, etc.) omitted for space ...]

nrow(pitches)
[1] 3532

###load in mclust library
library(mclust)
You should notice here that there are 3,532 pitches in the data set; however, some of these are from Spring Training. To ensure we don’t contaminate our data with some experimentation Buehrle may have done during that time, let’s restrict our data to only the regular season. You’ll see below that we remove a little less than 200 pitches, and are left with all 33 of Buehrle’s starts from 2013 in the regular season.
###subset to regular season only
pitches$date2 <- as.Date(pitches$date, "%m/%d/%Y")
pitches <- subset(pitches, pitches$date2 > "2013-04-01")
length(unique(pitches$date2))
[1] 33
nrow(pitches)
[1] 3368
We also don’t need all of these variables cluttering up our data, so let’s restrict to the information from the Pitch F/X system that relates to location, velocity, and movement.
###reduce data to pitch f/x information
pitches <- pitches[,c(8:29, 36, 37)]
head(pitches)

       x      y start_speed end_speed sz_top sz_bot pfx_x pfx_z     px    pz    x0 y0    z0    vx0      vy0    vz0     ax     ay      az break_y break_angle
1  89.27 164.92        84.8      78.1   3.19   1.49  8.58  7.35  0.325 1.609 1.381 50 5.831 -5.309 -124.066 -6.103 13.340 25.451 -20.676    23.8       -29.4
2  95.28 163.19        85.6      78.6   3.38   1.55  3.79  9.57  0.084 1.635 1.259 50 5.845 -4.111 -125.216 -6.934  5.987 26.720 -17.002    23.8       -15.6
3 161.37 167.51        84.6      79.0   3.53   1.65  5.67  6.10 -1.793 1.679 0.898 50 5.896 -8.414 -123.665 -5.755  8.882 21.830 -22.540    23.9       -17.7
4  97.85 170.96        79.3      73.4   3.27   1.50  8.52  1.37  0.150 1.540 1.320 50 5.880 -5.200 -116.120 -3.330 11.600 22.270 -30.240    23.8       -19.1
5  81.55 149.38        84.6      76.8   3.57   1.50 10.35  6.87  0.623 2.419 1.300 50 6.097 -4.894 -123.851 -4.463 15.767 29.200 -21.629    23.7       -32.9
6  91.85 143.33        83.6      77.4   3.76   1.62  8.13  5.06  0.211 2.517 1.027 50 5.911 -4.536 -122.464 -3.184 12.391 23.431 -24.383    23.8       -24.9

  break_length pitch_type type_confidence
1          6.6         FT           0.897
2          4.9         FF           0.900
3          6.4         FF           0.900
4          9.7         CH           0.900
5          7.2         FT           0.894
6          7.3         FT           0.900
Notice that we also kept the pitch type, and MLBAM’s confidence in that pitch type, from the original classifications. Now that we have this information, let’s do our clustering very simply using start_speed, break_y, break_angle, and break_length. Note that, since I’m not a physicist, assistance from someone who understands the implications of each of these measures would help us identify what the clusters look like in terms of pitch types. You can start here at Mike Fast’s blog with a glossary of what each of these variables means.
##perform model based clustering
clust1 <- Mclust(pitches[,c(3, 20, 21, 22)])
clust1

'Mclust' model object:
 best model: ellipsoidal, varying volume, shape, and orientation (VVV) with 5 components

###plot BIC from EM algorithm
clust1BIC <- mclustBIC(pitches[,c(3, 20, 21, 22)])
png(file="clusterBIC.png", height=700, width=900)
plot(clust1BIC)
dev.off()
Here, you can see that our model-based clustering algorithm did lots of work for us and chose 5 clusters–basically 5 different pitch types–and also determined that allowing the covariance structure to vary across clusters is appropriate (you can find a key for the different possibilities here). The data from MLBAM identifies 6 different pitch types thrown by Buehrle; however, one of these is “IN”, which (I think!) means there was uncertainty over the pitch and no classification. So, it looks like we identify the same number of pitch types (5) as found by the MLBAM neural net (and this is what we see on Fangraphs as well for 2013–note that Buehrle had 6 pitches in some previous years).
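If you don’t want to chase down that key, mclust can decode the three-letter model codes for you. The letters describe the volume, shape, and orientation of the cluster covariances (E = equal, V = varying, I = identity/spherical):

```r
library(mclust)

###decode the covariance-structure codes from the BIC plot
mclustModelNames("VVV")$type  # ellipsoidal, varying volume, shape, and orientation
mclustModelNames("EII")$type  # spherical, equal volume
mclustModelNames("EEE")$type  # ellipsoidal, equal volume, shape, and orientation
```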
One thing to note, however, is that clustering is sensitive to the scaling of your variables. Alternatively, we could rescale or normalize our variables to see if this changes the outcome. Depending on which are more/less important, we could choose to rescale some but not others. Additionally, you could play around with the different variables that you put into the clustering algorithm to see if all are necessary for the clustering to reach the given classifications. I encourage you to do so, but I won’t address this here for lack of room in this post.
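As a sketch of what that rescaling might look like–shown on stand-in data I made up so the snippet runs on its own–scale() centers each column to mean 0 and standard deviation 1 before clustering:

```r
###sketch: compare clustering on raw vs standardized variables
library(mclust)

set.seed(3)
# Fabricated two-group data on very different scales (mph vs inches of break)
fake <- data.frame(start_speed  = c(rnorm(150, 85, 3), rnorm(150, 73, 2)),
                   break_length = c(rnorm(150, 6, 1),  rnorm(150, 11, 1)))
fake_scaled <- scale(fake)            # mean 0, sd 1 per column

fit_raw    <- Mclust(fake)
fit_scaled <- Mclust(fake_scaled)
# Cross-tabulate the two partitions to see whether rescaling moved any points
table(raw = fit_raw$classification, scaled = fit_scaled$classification)
```

On the real data, the analogue would be Mclust(scale(pitches[,c(3, 20, 21, 22)])).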
Now we can also attach the classification to our data so that after this, we can compare our clusters against the neural net used for the pitch type in the data already. Because we have a density model, we also have information on likelihood of cluster membership for each pitch. This is also appended to the data, and we can compare this to the type_confidence variable from the original data and neural net classification.
###attach cluster membership of each pitch to data set and membership probabilities for each
pitches$class1 <- clust1$classification
cProb <- data.frame(clust1$z)
pitches <- data.frame(cbind(pitches, cProb))
head(pitches)
Next, let’s simply take a look at the separation of our clusters visually. We can project both the clusters and their respective covariance estimates using the coordProj function. Note that you may have to try some different combinations of variables–particularly if you have higher-dimensional data–to get the best picture of separation of the clusters. I’ve already pre-screened this, and present the three best projections of the clusters below (break_y seems to be somewhat discrete, so it does not help with seeing the clusters well in a plot):
###visualize clusters along variables used in analysis
png(file="clusterProj.png", height=450, width=1200)
par(mfrow=c(1,3))
coordProj(pitches[,c(3, 20, 21, 22)], dimens=c(1,3), what="classification",
    classification=clust1$classification, parameters=clust1$parameters)
coordProj(pitches[,c(3, 20, 21, 22)], dimens=c(1,4), what="classification",
    classification=clust1$classification, parameters=clust1$parameters)
coordProj(pitches[,c(3, 20, 21, 22)], dimens=c(3,4), what="classification",
    classification=clust1$classification, parameters=clust1$parameters)
dev.off()
Notice the separation of the clusters, as well as their different sizes, shapes, and orientations. This is, again, the advantage of the model-based clustering method. Here, we see the orange cluster getting significant separation from the others, particularly when looking at starting speed and break_length. To see if this makes sense, let’s pair our classifications alongside the neural net pitch_type that came with the original data.
###tabulate pitch_type and class from the cluster analysis
table(pitches$class1, pitches$pitch_type)

            CH  CU  FC  FF  FT  IN
  1    0    3   0   0  56 555   0
  2    0  667   1   1  12  31  12
  3    0  107   4 646 244   0   0
  4    0    0   0   4 744  42   0
  5    0    0 237   2   0   0   0

###look at confidence across pitch types
tapply(pitches$type_confidence, pitches$pitch_type, mean)
Based on this, we can likely identify Cluster 1 as Two-Seam Fastballs (FT), Cluster 2 as Change-ups (CH), Cluster 3 as Cutters (FC–though there is significant uncertainty for this class), Cluster 4 as Four-Seam Fastballs (FF), and Cluster 5 as Curveballs (CU). It looks like the Cutter category is uncertain, but this makes sense. We would expect Cutters to possibly lose a little velocity, but be similar to Four-Seam Fastballs (and we see overlap here). But Cutters might also drop slightly in velocity and move similar to a Change-up (and we see some overlap here as well). Overall, I’m pretty happy with this given the very few dimensions we used to analyze the data. Interestingly enough, Cutters have the highest confidence level on average from the neural net from the original data. So there might be something we’re missing here by not including some of the other movement variables in the data.
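We can make that reading mechanical by assigning each cluster the MLBAM type it overlaps most. The matrix below simply hard-codes the cross-tabulation printed above (the unnamed first column of that table appears to be a blank/unclassified type):

```r
###label each cluster by its modal MLBAM pitch type
tab <- matrix(c(
  0,   3,   0,   0,  56, 555,   0,
  0, 667,   1,   1,  12,  31,  12,
  0, 107,   4, 646, 244,   0,   0,
  0,   0,   0,   4, 744,  42,   0,
  0,   0, 237,   2,   0,   0,   0),
  nrow = 5, byrow = TRUE,
  dimnames = list(cluster = 1:5,
                  pitch_type = c("", "CH", "CU", "FC", "FF", "FT", "IN")))

labels    <- colnames(tab)[apply(tab, 1, which.max)]  # modal type per cluster
agreement <- apply(tab, 1, max) / rowSums(tab)        # share agreeing with it
data.frame(cluster = 1:5, label = labels, agreement = round(agreement, 3))
```

Cluster 3 (the cutters) comes out with the lowest agreement share, about 0.65, echoing the uncertainty discussed above.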
So which cluster is the Orange one in the plots? Let’s take a look at the velocity averages of each cluster to find out.
###summarize each cluster with respect to start speed
tapply(pitches$start_speed, pitches$class1, mean)

       1        2        3        4        5 
83.92036 78.62320 80.51109 84.73165 72.89456 
We know that in our projection of the clusters, the orange cluster was the one with the lowest velocity. Given the summary above, that’s Cluster 5, which we noted was most likely to be Curveballs. This makes sense, given the velocity difference and the lack of overlap in movement: Curveballs should move very differently from a Change-up and from any of the fastball types thrown by Buehrle. You can also see the close velocity averages for Buehrle’s Change-ups and Cutters, making it likely that either the neural net or our own clustering could be mixing these up a bit. Note that based on velocity, things make sense among our other groups as well: Four-Seam Fastballs are thrown the hardest, at nearly 85 mph on average, with Two-Seamers slightly behind.
All in all, I’m pretty happy with this quick-and-dirty clustering and identification of pitch types. It seems that cutters are the most difficult to deal with, but this isn’t particularly surprising. Outside of that group, we got nearly perfect agreement with the neural net classifications already included in the data. I encourage you to continue with other players, and with other variables and transformations of those variables, to see if you can get something even closer to the provided classifications. That doesn’t necessarily mean they’re better (maybe the neural net isn’t very good!), but it’s an interesting exercise to try to duplicate.
One thing to note here is that we only used data from a single pitcher to identify pitches. It is a much harder task to do this with multiple pitchers in the same data set, as Aroldis Chapman’s change-up is likely much faster than Buehrle’s. This likely would require some transformation of the data and possibly some prior clustering of pitchers and doing separate analyses on those different pitcher classifications. We’ll leave that for another day, or for someone with significantly more time on their hands.
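As a hedged sketch of the sort of transformation I have in mind–standardizing each variable within pitcher, so every pitch is measured relative to that pitcher’s own arsenal–here is a toy example with fabricated velocities:

```r
###standardize within pitcher before pooling arms in one data set
set.seed(1)
df <- data.frame(
  pitcher     = rep(c("Buehrle", "Chapman"), each = 100),
  start_speed = c(rnorm(100, 84, 3), rnorm(100, 99, 2)))

# z-score within each pitcher, so "fast for him" is comparable across arms
df$speed_z <- ave(df$start_speed, df$pitcher,
                  FUN = function(x) (x - mean(x)) / sd(x))
tapply(df$speed_z, df$pitcher, mean)  # both essentially 0 after centering
```

The standardized columns could then go into Mclust in place of the raw measurements, though whether that is the right transformation is an open question.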
Perhaps in my next post, I’ll go ahead and see what kinds of results we get when using other clustering techniques. Until then, I hope you’ve enjoyed today’s post!
Great post! Thanks! FYI….can’t get the link for covariance structures across clusters to work.
Thanks! Should be updated with the correct link now. I appreciate you letting me know of the issue.
IN is intentional ball.
You are right. Brain fart on my part. Makes sense that they would look like change-ups based on velocity.
Thanks for the comment!
When I try to run mclustBIC(pitches[,c(3,20,21,22)]), I am given this message:
“Error in if (loglik > signif(.Machine$double.xmax, 6) || any(!c(scale, :
missing value where TRUE/FALSE needed”
Is there supposed to be something after pitches() that I should include?
Originally mclust(pitches[,c(3,20,21,22)]) wasn’t working for me either, but then I added in modelNames = c(“VVV”) afterwards and it worked. Should I be doing something similar with mclustBIC? Thanks!
Hmmmm. It sounds like it is having trouble calculating BIC in both cases. By using the modelNames=c(“VVV”), you ultimately did not allow it to use BIC to choose the optimal model. Normally, you would want to let it go through with that itself, unless you have a specific model in mind for the clustering.
It sounds like there might be missing data somewhere. Did you fully clean the data with the code used in the post prior to applying the Mclust function? I just re-ran all the code on my computer and things seem to work out OK. I am running R 3.1.2, in case there might be some changes since that version (sometimes things get a little wonky).
Let me know if you continue having trouble with the clean data, but it’s not clear to me what the specific error is.
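In the meantime, a quick way to check for the NAs that usually trigger that error is below–shown on a toy data frame I made up, but the same two lines work on the real pitches[,c(3, 20, 21, 22)] columns:

```r
###toy stand-in for the pitches data, with one NA planted in start_speed
toy <- data.frame(start_speed  = c(84.8, NA, 84.6),
                  break_length = c(6.6, 4.9, 6.4))
colSums(is.na(toy))               # any non-zero count flags a problem column
toy_clean <- toy[complete.cases(toy), ]
nrow(toy_clean)                   # 2 rows survive
```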
Thanks for the prompt reply! I reviewed the code but still can’t seem to figure out where the error is… Would be happy to send you my code for you to glance at. Now it’s really nagging me why it’s not working!
Feel free to send it along and I’ll see if it works on my computer. bmmillsy at hhp dot ufl dot edu
I’m getting the “Error in if (loglik > signif(.Machine$double.xmax, 6) || any(!c(scale, :
missing value where TRUE/FALSE needed” from Mclust() as well. I don’t have missing data. Were you able to determine the cause?
Any help is much appreciated
Same problem but with other dataset. Maybe there is a bug in the most recent mclust?
I ended up having the same problem on a different computer after an R update. Be sure that you have mclust version 5.0.2 installed in your R library. It seems the authors did some bug fixes after 5.0.0. I think these took place in 5.0.1, but you might as well go for the newest version to be safe.
I hope this helps, and sorry for the inconvenience on trying to get this to work.