Houston Astros Whiffs and Exit Velocity

In my last couple of posts, I’ve used some functions that tells us about density in two dimensions, as well as some that fit non-parametric models to identify umpire strike zones. This week, I’m going to merge the use of color and the use of generalized additive models in order to visualize batted ball velocity and whiff rate around the strike zone. My team of interest will be the Houston Astros in 2015. In the Astros, we have a team that strikes out the most, and hits the most home runs in the league. And this year, the Pitch f/x data now includes exit velocity, so we can expand on our use of these models.

The first thing to do is download this file. This data is from Baseball Savant, and includes every pitch seen by Houston Astros batters for 2015. I’ve narrowed the data down from what Baseball Savant provides, just for ease of taking a look through for our own purposes. But this is a great tool for getting your own data and doing your own analyses.

Once you have the data, load it into R. We’ll subset the data into two data set: one that includes only pitches at which batters have swung, and one that includes only pitches that have exit velocity information (i.e. largely balls in play). We then define a dummy variable that indicates whether or not the batter whiffed when he swung at the given pitch. This will allow us to make two models: 1) A model of the expected exit velocity based on location, and 2) A probabilistic model of whiff likelihood based on location. One thing to note here is that I have not separated LHB and RHB. This will make our visuals look a little weird, but I encourage you to further break things down with your own data pulled from Baseball Savant.

setwd("c:/...")

astros <- read.csv(file="AstrosFX.csv", h=T)
nrow(astros)

library(mgcv)
library(parallel)
library(RColorBrewer)

###subset for pitches with exit velocity information
astrosExit <- subset(astros, is.na(astros\$batted_ball_velocity)==F)
astrosExit <- subset(astrosExit, astrosExit\$px > -2 & astrosExit\$px < 2 & astrosExit\$pz > 0 & astrosExit\$pz < 5)
nrow(astrosExit)

###look at pitch result categories and subset data only for swings
unique(astros\$pitch_result)
astrosSwing <- subset(astros, astros\$pitch_result=="Swinging Strike" |
astros\$pitch_result=="In play, run(s)" |
astros\$pitch_result=="In play, out(s)" |
astros\$pitch_result=="In play, no out" |
astros\$pitch_result=="Foul" |
astros\$pitch_result=="Swinging Strike (Blocked)" |
astros\$pitch_result=="Foul Tip" |
astros\$pitch_result=="Foul (Runner Going)")
nrow(astrosSwing)

###define whiff
astrosSwing\$whiff <- ifelse(astrosSwing\$pitch_result=="Swinging Strike" | astrosSwing\$pitch_result=="Swinging Strike (Blocked)", 1, 0)

Next, we use the parallel package to spread out our generalized additive model estimation over multiple cores on our computer. I have used this in past work, so I’ll leave aside the explanation of the code. I follow by defining a matrix of locational data, onto which we will project our models of predicted exit velocity and predicted whiff probability. This latter portion of the code follows from this post, where we drew probability contours for umpire strike zones. In this post, instead of drawing line contours, we’ll fill them with color using the filled.contour function. This will scale our predicted values to color, similar to what we saw with the density estimation using smoothScatter in my last post.

###indicate number of cores for parallel processing
if (detectCores()>1) { ## no point otherwise
cl <- makeCluster(detectCores()-1)
} else cl <- NULL

###generate new data sequence for visualization
myx <- matrix(data=seq(from=-2, to=2, length=100), nrow=100, ncol=100)
myz <- t(matrix(data=seq(from=0,to=5, length=100), nrow=100, ncol=100))

From here, we can fit our model and project it onto our data, then create a matrix used for plotting the data using filled.contour. Let’s start with the exit velocity model. Note that this data is not binary (like umpire strike calls, or whiffs), and therefore we’ll use the Gaussian link function, rather than the logit.

###fit exit velocity
set.seed(12345)
fit.exit <- bam(batted_ball_velocity ~ s(px, pz), method="GCV.Cp", data=astrosExit, family=gaussian, cluster=cl)
summary(fit.exit)
fitdata <- data.frame(px=as.vector(myx), pz=as.vector(myz))
mypredict.exit <- predict(fit.exit, fitdata, type="response")
mypredict.exit <- matrix(mypredict.exit, nrow=c(100,100))
density(mypredict.exit)

png(file="AstrosExitVelo.png", width=600, height=675)
filled.contour(x=seq(from=-2, to=2, length=100), y=seq(from=0, to=5, length=100), z=mypredict.exit, axes=T, zlim=c(85,92), nlevels=15,
color=colorRampPalette(rev(brewer.pal(11, "RdYlBu"))),
main="Houston Astros 2015 Exit Velocity", cex.main=2, xlab="Horizontal Location (ft., Umpire's View)", ylab="Vertical Location (ft.)",

plot.axes={
axis(1, at=c(-2,-1,0,1,2), pos=0, labels=c(-2,-1,0,1,2), las=0, col="black")
axis(2, at=c(0,1,2,3,4,5), pos=-2, labels=c(0,1,2,3,4,5), las=0, col="black")
rect(-0.83, 1.75, 0.83, 3.45, border="black", lty="dashed", lwd=2)
},
key.axes={
ylim=c(0,1.0)
axis(4, at=c(85,86,87,88,89,90,91,92), labels=c(85,86,87,88,89,90,91,92), pos=1, las=0, col="black")
})
text(1.4, 2.5, "Exit Velocity", cex=1.1, srt=90)
dev.off() As you can see from the code, one frustrating issue with the filled.contour function is that any additions to the plot have to be done within the function, rather than afterward. We draw our axes with labels to match the data on which the model is predicted, and add values to our color scale on the right hand side, basically drawing many of the features from scratch. Much of this code is modified from Dave Allen‘s work at Baseball Analysts, Fangraphs, and elsewhere. Also note that I use the density function to get an idea of the range of predicted values. This allows me to scale our color scale correctly, and ensure we have a range of colors.

Turning to the visualization, you can see that location and exit velocity are a bit all over the place. We haven’t really broken down our exit velocities by fly balls, ground balls, line drives, etc. I suspect if we did, we might find a bit more systematic differences across the pitch locations. And a RHB/LHB split is likely useful here. As it is, this seems to tell me that exit velocity isn’t all that useful on its own, and that the other information provides important context for hitting.

But we can find some interesting things. Since the lineup is more RHB heavy, it seems to show that pitches up and in or down and away are hit softer than those down the middle. This makes some sense. The high predicted exit velocity for very high pitches at the top is likely a result of one or two pitches being hit very hard and messing with our models a bit (Evan Gattis, anyone?). There are other ways we might want to subset our data to get a more interesting picture, including removing some outliers in the data. But since this is just a exhibition of how R works for these models and visuals, I’ll leave that up to you.

Now moving on to whiff rates, we’ll use a logistic GAM to estimate whiff probability conditional on location. The code is almost identical, as you can see below, but we use a logit link function due to our binary dependent variable. Therefore, the predicted values we get from our model will be probabilities, and be in the range from 0 to 1.

###fit swinging strike
fit.strike <- bam(whiff ~ s(px, pz), method="GCV.Cp", data=astrosSwing, family=binomial(link="logit"), cluster=cl)
summary(fit.strike)
fitdata <- data.frame(px=as.vector(myx), pz=as.vector(myz))
mypredict.strike <- predict(fit.strike, fitdata, type="response")
mypredict.strike <- matrix(mypredict.strike, nrow=c(100,100))

png(file="AstrosStrikeSwing.png", width=600, height=675)
filled.contour(x=seq(from=-2, to=2, length=100), y=seq(from=0, to=5, length=100), z=mypredict.strike, axes=T, zlim=c(0,1), nlevels=50,
color=colorRampPalette(rev(brewer.pal(11, "RdYlBu"))),
main="Whiff Probability", cex.main=2, xlab="Horizontal Location (ft., Umpire's View)", ylab="Vertical Location (ft.)",
plot.axes={
axis(1, at=c(-2,-1,0,1,2), pos=0, labels=c(-2,-1,0,1,2), las=0, col="black")
axis(2, at=c(0,1,2,3,4,5), pos=-2, labels=c(0,1,2,3,4,5), las=0, col="black")
rect(-0.83, 1.75, 0.83, 3.45, border="black", lty="dashed", lwd=2)
},
key.axes={
ylim=c(0,1.0)
axis(4, at=c(0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0), labels=c(0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0), pos=1, las=0, col="black")
})
text(1.4, 2.5, "Swinging Strike Probability", cex=1.1, srt=90)
dev.off() Here, we can see much more of a systematic relationship between location and whiff probability (conditional on swinging in the first place). Again, this might be more enlightening if we split between LHB and RHB, as I suspect the heart shape of blue we see is a result of merging these into one model. But things tend to make sense: if you swing at a pitch in the dirt, you’re more likely to miss. If you decide to swing at a pitch down the middle, you’re least likely to miss.

These models can be used for a number of types of investigation, including fielding (assuming we have data on where balls land). I’ve used them extensively for umpire strike zone evaluation. And I still think GAMs are underused both in sports analytics and in academic work, where we’re unsure of the functional form of our data, or there’s no reason to believe our data are linear. Note that because of the way that the the mgcv package works, with cross-validation of the smoothing degrees of freedom, these models tend to require larger amounts of data (for strike zones of umpires, anything less than 5,000 called pitches seems to be too small). But if you have the data, and the will to explore, they can be a really interesting way to address important issues in baseball analysis.