
Pitch Classification with k-means Clustering

Last time I posted, we went through how to classify pitch types using model-based clustering in R with the mclust package. This week, I want to address the same problem with another method. While I stick by the statement that model-based clustering is a better choice–particularly when we don’t necessarily know the number of different pitches a pitcher is throwing–using the same data for the same type of problem, but with a different method, allows for some fun comparison. Here, we’ll use k-means clustering to evaluate pitch types. There are a lot of clustering options in R, and they largely come with female names such as pam (Partitioning Around Medoids), agnes (Agglomerative Nesting), diana (Divisive Hierarchical Clustering), fanny (Fuzzy Clustering), and all can use daisy (for calculating a dissimilarity matrix for clustering inputs). You can vary the distance metric used (Euclidean, Mahalanobis, Manhattan, etc.), and there are a lot of other adjustments, rules, criteria, etc. to make if you so please. We’ll keep things simple here.
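Just to give a flavor of those alternatives, here is a minimal sketch of how daisy and pam might be called together; the data frame df here is a placeholder (a set of numeric clustering variables), not the Buehrle data we'll use below:

###minimal sketch of one alternative from the cluster package ('df' is a placeholder data frame of numeric variables)
library(cluster)

###daisy builds a dissimilarity matrix; the metric can be "euclidean", "manhattan", or "gower"
d <- daisy(df, metric="euclidean")

###pam partitions around medoids; k=5 asks for five clusters
fit <- pam(d, k=5, diss=TRUE)
fit$clustering ###cluster assignment for each row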

To start, go ahead and download the same Mark Buehrle data set from last time here. Just click the link and download the file as a .CSV into your working directory. Then we can load it into R and make sure to only use regular season data:

###set working directory and load data
setwd("c:/Users/...")
pitches <- read.csv(file="MarkB2013.csv", h=T)

###subset to regular season only
pitches$date2 <- as.Date(pitches$date, "%m/%d/%Y")
pitches <- subset(pitches, pitches$date2 > "2013-04-01")

I’m not going to paste the data frame view again. If you want to see that, check out the last clustering post. Let’s just get right to the good stuff. First, let’s get rid of intentional balls, since these didn’t do much for us in our last attempt at pitch classification, and reduce the data down to the information we care about for clustering.

###subset data to exclude intentional balls
pitches <- subset(pitches, pitches$pitch_type!="IN")

###reduce data to pitch f/x information
pitches <- pitches[,c(8:29, 36, 37)]
head(pitches)

Because k-means clustering requires us to choose the number of clusters, it might be worthwhile to look at what MLBAM has classified, at least as a starting point. We’ll do that below with a simple summary of the different pitch types and how often they were classified as such by the MLBAM neural network.

###check MLBAM classifications
summary(pitches$pitch_type)

CH   CU   FC   FF   FT   IN
777  242  653 1056  628    0

Notice that R still remembers there were pitches classified as “IN”, but they are no longer in the data we’re analyzing. So we see that there are 5 different pitch types as classified by MLBAM. We can start with that for our number of groups. This is cheating a bit, as normally we tend to go into clustering a bit more “unsupervised” than we are here. But that’s OK. We’re not using some out-of-sample data to create rules that classify this data.

When we choose the number of groups, or k, we are telling R that we want 5 centroids placed into our data. From there, each pitch is assigned to the closest centroid, using some distance metric to do so. R then recomputes each centroid as the center of the pitches assigned to it, reassigns pitches to whichever centroid is now closest, and checks whether the chosen fit criterion (for kmeans, the total within-cluster sum of squares) has improved. This continues until no further improvement is possible.
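To make that assign-and-update loop concrete, here is a bare-bones sketch of the idea on made-up data. It is purely illustrative (it doesn't even handle empty clusters), and the base kmeans function does all of this for us far more efficiently:

###toy sketch of the k-means loop, on made-up data (illustrative only)
set.seed(1)
X <- matrix(rnorm(200), ncol=2)            ###100 fake observations, 2 variables
k <- 3
centroids <- X[sample(nrow(X), k), ]       ###random initial centroid locations

for(i in 1:100){
  ###assignment step: each point goes to its nearest centroid (Euclidean distance)
  d <- as.matrix(dist(rbind(centroids, X)))[-(1:k), 1:k]
  groups <- apply(d, 1, which.min)
  ###update step: recompute each centroid as the mean of the points assigned to it
  newcent <- t(sapply(1:k, function(j) colMeans(X[groups==j, , drop=FALSE])))
  if(all(abs(newcent - centroids) < 1e-8)) break   ###stop when the centroids stop moving
  centroids <- newcent
}
table(groups)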

Note that the choice of the initial location of the centroids can be random, or we can use some other method, such as hierarchical clustering, to do so. One drawback, as noted in my last post, is that the final centroids can end up depending on where they started, reaching a local optimum rather than a global optimum. This can mean we end up with empty or artificially small clusters. Additionally, there are some issues in handling data where the clusters have different shapes and densities. Given what we saw in the model-based outcome, I have suspicions that k-means might not perform as well as we'd like with pitch type data.
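As a quick check of how much those starting points can matter, you can compare the total within-cluster sum of squares from a single random start against many restarts, using the same four variables we'll cluster on below (the seed here is arbitrary, and your numbers may differ):

###quick check: sensitivity to starting centroids (same four variables as below)
set.seed(99)
fit1 <- kmeans(pitches[,c(3, 20, 21, 22)], 5, nstart=1)
fit25 <- kmeans(pitches[,c(3, 20, 21, 22)], 5, nstart=25)

###lower total within-cluster sum of squares indicates a tighter solution
fit1$tot.withinss
fit25$tot.withinss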

For today’s post, we’ll use the base kmeans function, as it is very easy to implement. First, set the random seed in your R instance so that you get the same results that I do. Then, we’ll use the function on the variables start_speed, break_y, break_angle, and break_length. These are the same variables that we used for the model-based clustering post, so we can compare them if we want. Note that we want 5 clusters, with 25 different random initial centroid assignments to try (with the nstart option). This latter option should help us avoid local optimum issues.

###set the seed to ensure same outcome
set.seed(12345)

###do k-means with 5 clusters and 25 different randomly placed initial centroid locations to try
pclust <- kmeans(pitches[,c(3, 20, 21, 22)], 5, nstart=25)
head(pclust)

###add cluster assignment to data
pitches$cluster <- pclust$cluster
head(pitches)

       x      y start_speed end_speed sz_top sz_bot pfx_x pfx_z     px    pz    x0 y0    z0    vx0      vy0    vz0     ax
1  89.27 164.92        84.8      78.1   3.19   1.49  8.58  7.35  0.325 1.609 1.381 50 5.831 -5.309 -124.066 -6.103 13.340
2  95.28 163.19        85.6      78.6   3.38   1.55  3.79  9.57  0.084 1.635 1.259 50 5.845 -4.111 -125.216 -6.934  5.987
3 161.37 167.51        84.6      79.0   3.53   1.65  5.67  6.10 -1.793 1.679 0.898 50 5.896 -8.414 -123.665 -5.755  8.882
4  97.85 170.96        79.3      73.4   3.27   1.50  8.52  1.37  0.150 1.540 1.320 50 5.880 -5.200 -116.120 -3.330 11.600
5  81.55 149.38        84.6      76.8   3.57   1.50 10.35  6.87  0.623 2.419 1.300 50 6.097 -4.894 -123.851 -4.463 15.767
6  91.85 143.33        83.6      77.4   3.76   1.62  8.13  5.06  0.211 2.517 1.027 50 5.911 -4.536 -122.464 -3.184 12.391
      ay      az break_y break_angle break_length pitch_type type_confidence cluster
1 25.451 -20.676    23.8       -29.4          6.6         FT           0.897       1
2 26.720 -17.002    23.8       -15.6          4.9         FF           0.900       2
3 21.830 -22.540    23.9       -17.7          6.4         FF           0.900       2
4 22.270 -30.240    23.8       -19.1          9.7         CH           0.900       2
5 29.200 -21.629    23.7       -32.9          7.2         FT           0.894       1
6 23.431 -24.383    23.8       -24.9          7.3         FT           0.900       1

Here, you can see that FT seems to be our Cluster 1, with FF as Cluster 2. But there is a CH here classified in the latter group, so clearly either our clustering or the MLBAM method is doing something wrong (based on the speed of the pitch, it looks like we might be off here). Let’s now take a look at the characteristics of each cluster, as well as the number of pitches within each.

###look at characteristics by cluster and cluster size
aggregate(pitches[,c(3,20,21,22)], by=list(cluster=pitches$cluster), mean)
  cluster start_speed  break_y break_angle break_length
1       1    83.14363 23.76597 -28.7589878     7.491274
2       2    80.74479 23.79267 -19.1932417     7.849485
3       3    77.37422 23.83401   7.8667702    10.002019
4       4    83.50673 23.80986  -9.2485133     6.058685
5       5    81.63923 23.82105   0.6539075     7.252153
summary(factor(pitches$cluster))
  1   2   3   4   5
573 873 644 639 627 

table(pitches$cluster, pitches$pitch_type)
   
         CH  CU  FC  FF  FT  IN
  1   0 102   0   0  21 450   0
  2   0 506   1   0 190 176   0
  3   0   0 236 408   0   0   0
  4   0  95   0   0 542   2   0
  5   0  74   5 245 303   0   0

Notice that the clusters are largely equal sizes, unlike what we saw before in our model-based solution. There seems to be some significant disagreement this time. One solution could be to scale our variables, given the very different ranges of values that we have. This can often help a kmeans analysis (it actually can also help model-based clustering in many situations). This is easy to do in R, with the following command:

###scale variables
pitchSC <- scale(pitches[,-23])

###run clustering on scaled data
pclustSC <- kmeans(pitchSC[,c(3,20,21,22)], 5, nstart=25)

###attach these to data
pitches$clusterSC <- pclustSC$cluster

###look at characteristics by cluster and cluster size
aggregate(pitches[,c(3,20,21,22)], by=list(cluster=pitches$clusterSC), mean)
  cluster start_speed  break_y break_angle break_length
1       1    82.50039 23.69538  -20.268786     7.518304
2       2    72.87815 23.83277    6.177311    12.944958
3       3    80.36312 23.83835    4.186652     7.939480
4       4    84.37787 23.81589  -14.379798     6.032507
5       5    78.94233 23.81725  -20.224121     8.680831
summary(factor(pitches$clusterSC))
   1    2    3    4    5
 519  238  884 1089  626 

###cross-tabulate pitch_type and class from the cluster analysis
table(pitches$clusterSC, pitches$pitch_type)
   
         CH  CU  FC  FF  FT  IN
  1   0 133   0  28 106 252   0
  2   0   0 237   1   0   0   0
  3   0  97   4 623 160   0   0
  4   0   0   0   1 784 304   0
  5   0 547   1   0   6  72   0

The summary of the cluster sizes makes much more sense here, relative to what we saw in the MLBAM data. You should also see a bit more agreement across the clusters now, but still some issues with Cluster 1. Let’s visualize the clusters using a new function, clusplot, that is housed within the cluster package. This function projects the clusters onto two principal components in the plot.

###visualize clusters
library(cluster)
png(file="kmeansPlot.png", height=500, width=650)
clusplot(pitches[,c(3,20,21,22)], pclustSC$cluster, color=T, lwd=2)
dev.off()

[Figure: kmeansPlot.png, the clusplot output showing the scaled k-means clusters projected onto the first two principal components]

Honestly, I’m not thrilled with these plots that the function produces, and I encourage you to play with the options to make them a bit more legible; one possible set of tweaks is sketched below. After that, we can also compare this solution to our solution from the model-based clustering (the code is slightly different than last time, as I’ve removed “IN” from the data).
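For example, a version with shaded ellipses, ellipse-only labels, and no connecting lines might be a bit easier to read; the option values and the output file name here are just one possibility, not a recommendation:

###one possible set of clusplot tweaks (values and file name are just an example)
png(file="kmeansPlot2.png", height=500, width=650)
clusplot(pitches[,c(3,20,21,22)], pclustSC$cluster, color=T, shade=T,
         labels=4, lines=0, lwd=2, main="k-means clusters (scaled solution)")
dev.off()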

###now look at model-based solution and compare
library(mclust)
clust1 <- Mclust(pitches[,c(3, 20, 21, 22)])
pitches$MCluster <- clust1$class

table(pitches$clusterSC, pitches$MCluster)##note that MCluster class is along the top row
   
      1   2   3   4   5
  1 260  69 133  57   0
  2   0   0   0   0 238
  3   0   9   2 872   1
  4 345 673  10  61   0
  5  59   0 561   6   0

table(pitches$clusterSC, pitches$pitch_type)
   
         CH  CU  FC  FF  FT  IN
  1   0 133   0  28 106 252   0
  2   0   0 237   1   0   0   0
  3   0  97   4 623 160   0   0
  4   0   0   0   1 784 304   0
  5   0 547   1   0   6  72   0

table(pitches$MCluster, pitches$pitch_type)
   
         CH  CU  FC  FF  FT  IN
  1   0   4   0   0  73 587   0
  2   0   0   0   4 733  14   0
  3   0 666   1   1  11  27   0
  4   0 107   4 646 239   0   0
  5   0   0 237   2   0   0   0

As you can see, there is some disagreement between these two clustering methods. Given the consistency we see with the model-based clustering relative to the MLBAM neural network, I’m (again) leaning toward model-based being a better choice. It largely only has issues with some Cutter and Four-Seam differentiation relative to the neural network, which shouldn’t be too surprising (and maybe our solution is better than theirs). But from the cross-tabulation, the kmeans and model-based solutions came up with essentially the same classification of Curveballs. Overall, both turned out relatively reasonable, and determining the “best” solution will require a bit more investigation.
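If you want a single number summarizing the agreement between any two of these partitions, the adjusted Rand index available in mclust (already loaded above) is one common choice; values near 1 indicate nearly identical partitions, while values near 0 indicate agreement at roughly chance level:

###adjusted Rand index: one way to quantify agreement between partitions
adjustedRandIndex(pitches$clusterSC, pitches$MCluster)   ###k-means vs. model-based
adjustedRandIndex(pitches$clusterSC, pitches$pitch_type) ###k-means vs. MLBAM labels
adjustedRandIndex(pitches$MCluster, pitches$pitch_type)  ###model-based vs. MLBAM labels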

In that vein, note that within the kmeans clustering output, there are measures of within- and between-cluster sum of squares that you can use to compare different clustering choices, such as the number of centroids to be placed in the data. I won’t get into that in depth here, but it might be helpful when we have less of an idea regarding the number of pitches a player might be throwing. We could compare this with the MLBAM neural network output as well, and see if we have a “winner.” But I’m here mostly to give you the tools to do so, not do everything we could possibly do. If you are interested in doing this, get after it.
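As a rough sketch of how you might use those measures, one common approach is to run kmeans over a range of k values on the scaled data and plot the total within-cluster sum of squares, looking for an “elbow” where adding clusters stops helping much; the range of k and the output file name here are arbitrary:

###rough "elbow" sketch: total within-cluster sum of squares for k = 1 to 8
set.seed(12345)
wss <- sapply(1:8, function(k) kmeans(pitchSC[,c(3,20,21,22)], k, nstart=25)$tot.withinss)

png(file="elbowPlot.png", height=500, width=650)
plot(1:8, wss, type="b", xlab="Number of clusters (k)",
     ylab="Total within-cluster sum of squares")
dev.off()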

And that’s that for now. Next time, I’ll either cover hierarchical clustering and silhouette plots, or maybe shift gears to something completely new. Thanks for stopping by, and I hope this was helpful for your baseball analysis needs.