Recently I have been exploring hitter swing rates as reported in the “Plate Discipline” section of the FanGraphs Batting Leaders; for example, here is the page for 2016 leaders. A reader also recently asked for clarification on the mixed models approach taken, for example, in this Baseball Prospectus article by Judge et al. I see this as an opportunity to explain these models using a simple example. I’ll illustrate a binomial/beta model, and then discuss an alternative method based on logits that leads to the more sophisticated modeling described in the Judge et al. post.
Why Look at Swing Rates?
I think the point of the “plate discipline” section of FanGraphs is to dig deeper into the strikeout (SO) and walk (BB) outcomes. Why are batters striking out at such a high rate? Why are some hitters, such as Joey Votto, good at drawing walks? FanGraphs reports swing rates separately for pitches thrown inside and outside of the zone, which suggests one explanation: some hitters may be chasing too many pitches outside of the zone.
If you glance at the swing rates in FanGraphs, you will see a lot of variation between the hitters: some batters tend to be free-swingers and other batters are reluctant to swing. But there are several problems in making sense of these swing rates. First, these are observed rates based on sample sizes that are not given. What conclusions can we draw about the “true” swing rates or swing probabilities? Second, since pitchers have a lot to do with batters’ swings, how much of the variability in swing rates is due to the variation between pitchers compared to the variation between hitters? Mixed models or random-effects models are helpful for addressing these questions.
A Simple Random Effects Model
Here is a binomial/beta mixed model that I have talked about before in the context of Efron and Morris’ example of batting averages. Suppose we collect the number of swings and the number of pitches for each batter in the 2016 season. (I collect this data using the PitchFX system, since the FanGraphs data does not provide the sample sizes.) We assume that y_i, the number of swings of the ith batter, is binomial with sample size n_i and probability of success p_i. We believe the swing probabilities p_1, …, p_N come from a beta curve with shape parameters a and b. This mixed model is simple to fit using my LearnBayes package: I write a short wrapper function that takes the vector of swing counts and the vector of sample sizes as input and returns the estimates of the beta shape parameters. I apply this fit to the 1103 players who came to bat during the 2016 season. It turns out that the estimated beta shape parameters are a = 46.5 and b = 51.1. Here is the fitted beta curve; it represents the distribution of the swing probabilities, which we can call the swing talents.
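As a rough sketch of what such a fit involves, the code below estimates a and b by maximizing the beta-binomial marginal likelihood in base R. It is not the LearnBayes wrapper described above: the data are simulated stand-ins for the PitchFX counts, and the names neg_loglik, fit_bb, and talent_est are hypothetical. The fit here is by maximum likelihood, which may differ slightly from a Bayesian fit.

```r
# Simulated stand-in for the swing counts (hypothetical data, not the PitchFX data)
set.seed(2016)
N <- 300                                  # number of batters
n <- sample(50:2500, N, replace = TRUE)   # pitches seen by each batter
p <- rbeta(N, 46.5, 51.1)                 # "true" swing talents
y <- rbinom(N, size = n, prob = p)        # observed swing counts

# Negative log marginal likelihood of the binomial/beta model.
# theta = (log a, log b) keeps both shape parameters positive.
# The binomial coefficient is dropped since it does not depend on a or b.
neg_loglik <- function(theta, y, n) {
  a <- exp(theta[1])
  b <- exp(theta[2])
  -sum(lbeta(a + y, b + n - y) - lbeta(a, b))
}

fit_bb <- optim(c(log(10), log(10)), neg_loglik, y = y, n = n)
a_hat <- exp(fit_bb$par[1])
b_hat <- exp(fit_bb$par[2])

# Estimated swing talent for each batter: the observed rate y / n
# is pulled toward the overall mean a / (a + b).
talent_est <- (y + a_hat) / (n + a_hat + b_hat)
```

Batters with few pitches are pulled strongly toward the overall mean, while batters with many pitches keep estimates close to their observed rates.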
An Alternative Mixed Model Based on Logits
There is an alternative mixed model approach that leads to several interesting generalizations. Think of our raw data as consisting of two variables, Swing (either 1 or 0) and the identity of the batter, batter_name. In my PitchFX dataset, I have data on 718,292 pitches where I collect these two variables. As above, I let p denote the probability of a swing for a particular observation, and define the logit, logit p = log(p) - log(1 - p). The model says that
logit p = b0 + u_i
where u_i is a coefficient for the ith batter and b0 is an unknown constant. We assume that the u_i follow a normal distribution with mean 0 and standard deviation sigma. This model is similar to the binomial/beta (bb) model: in the bb model we assigned the swing probabilities an unknown beta curve, and here we assign the u_i (on the logit scale) a normal curve with unknown standard deviation.
This logit random effects model is also easily fit using the glmer function from the lme4 package. Here is the basic syntax:
library(lme4)
fit <- glmer(Swing ~ (1 | batter_name), family = binomial, data = All_Data)
This takes a bit more time to fit (remember we have 718,292 pitches). When you look at the output, you get an intercept estimate of -0.9403 and sigma = 0.2025.
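To show where the intercept and sigma live in the fitted object, here is a small sketch on simulated data. The data frame reuses the names All_Data, Swing, and batter_name from the syntax above, but the values are made up: the true intercept (-0.1), the true sigma (0.2), and the numbers of batters and pitches are arbitrary choices for illustration.

```r
library(lme4)

# Simulated pitch-level data (hypothetical values, not the PitchFX data)
set.seed(1)
n_batters <- 60
pitches_each <- 300
b0_true <- -0.1
u <- rnorm(n_batters, mean = 0, sd = 0.2)     # batter effects on the logit scale
batter_id <- rep(seq_len(n_batters), each = pitches_each)
All_Data <- data.frame(
  batter_name = paste0("batter_", batter_id),
  Swing = rbinom(length(batter_id), 1, plogis(b0_true + u[batter_id]))
)

fit <- glmer(Swing ~ (1 | batter_name), family = binomial, data = All_Data)

# Estimate of b0 (fixed intercept) and of sigma (sd of the batter effects)
b0_hat <- unname(fixef(fit)["(Intercept)"])
sigma_hat <- as.data.frame(VarCorr(fit))$sdcor[1]

# Swing probability of a typical batter (u_i = 0)
typical_prob <- plogis(b0_hat)

# Estimated swing probability for each batter
batter_est <- plogis(b0_hat + ranef(fit)$batter_name[, "(Intercept)"])
```

The function plogis() is the inverse logit, so it converts the logit-scale estimates back to swing probabilities.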
On the surface, it looks like we get very different fits, but remember that the bb model is fitting on the probability scale, and here we are fitting on the logit scale. It turns out that both methods can be shown to give essentially the same answer — so the curve above represents the fitted talent curve for swing rates using either model.
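One way to put the two fits on the same scale uses a standard property of the beta distribution: if p follows a beta(a, b) curve, then logit p has mean digamma(a) - digamma(b) and variance trigamma(a) + trigamma(b). The sketch below applies this to the fitted shape parameters and checks the formulas by simulation; the variable names are hypothetical.

```r
# Fitted beta shape parameters from the binomial/beta model
a <- 46.5
b <- 51.1

# Implied mean and standard deviation of the talents on the logit scale
logit_mean <- digamma(a) - digamma(b)
logit_sd <- sqrt(trigamma(a) + trigamma(b))

# Simulation check: draw talents from the beta curve and transform to logits
set.seed(42)
sim_logits <- qlogis(rbeta(1e5, a, b))
sim_summary <- c(mean = mean(sim_logits), sd = sd(sim_logits))

print(c(logit_mean = logit_mean, logit_sd = logit_sd))
print(sim_summary)
```

For these shape parameters the implied standard deviation is close to 0.20, in line with the sigma reported by glmer. The implied center can be compared with the glmer intercept in the same way.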
Extending the Logit Model
One attractive feature of this second approach is that it allows for easy generalizations. As noted earlier, the probability of a swing depends both on the identity of the batter and the identity of the pitcher. That raises an interesting question: how much of the variability in the swing rates is due to the batters and how much is due to the pitchers? This logit random effects model (and the use of the glmer function) can be generalized to add random effects for both the batters and the pitchers, and we can answer this question by fitting the model.
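A minimal sketch of this generalization adds a second random intercept to the glmer formula. The column pitcher_name is an assumed name (only batter_name and Swing appear in the syntax above), and the data are simulated with arbitrary standard deviations (0.30 for batters, 0.15 for pitchers), so the resulting split says nothing about the real 2016 data.

```r
library(lme4)

# Simulated data in which every batter faces every pitcher (hypothetical values)
set.seed(7)
n_batters <- 40
n_pitchers <- 40
grid <- expand.grid(b = seq_len(n_batters),
                    p = seq_len(n_pitchers),
                    rep = seq_len(10))
u <- rnorm(n_batters, 0, 0.30)    # batter effects on the logit scale
v <- rnorm(n_pitchers, 0, 0.15)   # pitcher effects on the logit scale
All_Data <- data.frame(
  batter_name = paste0("batter_", grid$b),
  pitcher_name = paste0("pitcher_", grid$p),
  Swing = rbinom(nrow(grid), 1, plogis(-0.1 + u[grid$b] + v[grid$p]))
)

# Crossed random intercepts for batters and pitchers
fit2 <- glmer(Swing ~ (1 | batter_name) + (1 | pitcher_name),
              family = binomial, data = All_Data)

# Variance components on the logit scale
vc <- as.data.frame(VarCorr(fit2))
var_batter <- vc$vcov[vc$grp == "batter_name"]
var_pitcher <- vc$vcov[vc$grp == "pitcher_name"]

# Share of the talent variation attributed to batters
share_batter <- var_batter / (var_batter + var_pitcher)
```

The ratio share_batter compares only the two talent components; the pitch-to-pitch binomial variation is separate and is not part of this split.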
In a future post, I’ll focus more on the coding and show the code for fitting both types of random effects models. These are important models since they really address the “who is better” questions among hitters, and allow us to separate variation into the part that is due to chance and the part that is due to differences in player talents. Without some type of analysis like this, one really can’t make sense of the raw percentages reported by the FanGraphs site.