I’m planning on attending the SABR Analytics meeting soon, where I’ll be giving a talk about predicting batting averages. One point in my talk is that a batting average is essentially a composite of two rates, a strikeout rate and the rate of hits on balls in play, and one can learn more about batting averages by studying these component rates.
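The composite idea can be checked with a little arithmetic. The season totals below are made-up numbers rather than any player's data, and `bip_rate` is just a name chosen here for hits per at-bat that did not end in a strikeout (so it counts home runs as well, which is a simplification).

```r
# Hypothetical season totals (illustrative only)
AB <- 500   # at-bats
SO <- 100   # strikeouts
H  <- 140   # hits

so_rate  <- SO / AB          # strikeout rate per at-bat
bip_rate <- H / (AB - SO)    # hit rate on at-bats not ending in a strikeout

avg_direct    <- H / AB                     # usual batting average
avg_composite <- (1 - so_rate) * bip_rate   # same value from the two rates

c(direct = avg_direct, composite = avg_composite)
```

Because the two expressions are algebraically identical, any change in a batting average can be traced to a change in one or both component rates.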
Since batters currently tend to strike out a lot, it’s interesting to understand which variables are predictive of strikeouts. On the Fangraphs site, you’ll see the usual count stats (AB, H, SO, etc.), but you’ll also see some modern statistics that measure “plate discipline”. Here are those “discipline” statistics:
- O-Swing% (Outside the Zone Swing Percentage): Swings at pitches outside the zone divided by pitches outside the zone.
- Z-Swing% (Inside the Zone Swing Percentage): Swings at pitches inside the zone divided by pitches inside the zone.
- Swing% (Swing Percentage): Swings divided by total pitches.
- O-Contact% (Outside the Zone Contact Percentage): Contact made outside the zone divided by swings outside the zone.
- Z-Contact% (Inside the Zone Contact Percentage): Contact made inside the zone divided by swings inside the zone.
- Contact% (Contact Percentage): Contact made divided by swings.
- Zone% (Zone Percentage): Pitches inside the zone divided by total pitches.
- F-Strike% (First Pitch Strike Percentage): Percentage of PA that begin with a strike.
- SwStr% (Swinging Strike Percentage): Swinging strikes divided by total pitches.
To rephrase my question, which of these nine discipline variables are predictive of strikeout rates?
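Several of the nine measures in the list above are tied together by their definitions, which is worth keeping in mind before putting them all into one model. The sketch below uses invented pitch counts for a single hitter and assumes that every swing either makes contact or is a swinging strike.

```r
# Hypothetical pitch counts for one hitter (illustrative only)
pitches_in  <- 1000; pitches_out <- 1200   # pitches inside / outside the zone
swings_in   <- 650;  swings_out  <- 360    # swings at each kind of pitch
contact_in  <- 570;  contact_out <- 230    # swings that made contact

pitches <- pitches_in + pitches_out
swings  <- swings_in + swings_out
contact <- contact_in + contact_out

rates <- c(
  O.Swing   = swings_out / pitches_out,
  Z.Swing   = swings_in / pitches_in,
  Swing     = swings / pitches,
  O.Contact = contact_out / swings_out,
  Z.Contact = contact_in / swings_in,
  Contact   = contact / swings,
  Zone      = pitches_in / pitches,
  SwStr     = (swings - contact) / pitches   # assumes a swing is contact or a miss
)
round(rates, 3)

# Two identities that follow from the definitions
swing_from_parts <- rates[["Zone"]] * rates[["Z.Swing"]] +
  (1 - rates[["Zone"]]) * rates[["O.Swing"]]
swstr_from_parts <- rates[["Swing"]] * (1 - rates[["Contact"]])
```

Under that assumption, SwStr% is determined by Swing% and Contact%, so it is reasonable to expect it to add little once those two are already in a model.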
Get the Data
The Fangraphs site gives tables of standard batting statistics and plate discipline statistics for qualified players. Since the data are spread across separate pages, it seemed reasonable to use the Fangraphs download feature to save two csv files (for the default qualifying players): one containing the standard batting stats and one containing the modern discipline stats. I read these two csv files into R and use the inner_join function from the dplyr package to merge them:
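    # Read the two Fangraphs downloads
    discipline <- read.csv("fangraphsdiscipline.csv")
    standard <- read.csv("fangraphsstandard.csv")

    # Keep only the players who appear in both files, matched on Name
    library(dplyr)
    batting <- inner_join(discipline, standard, by="Name")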
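The model code further down uses column names such as Contact and F.Strike and appears to treat them as proportions, so some cleaning presumably happened between the download and the fit. The post does not show that step. What follows is only one possible sketch, assuming the export labels columns like `Contact%` and stores values as text like `80.1 %`; the toy rows and the helper name `pct_to_prop` are hypothetical.

```r
# Toy stand-in for a downloaded file; header and value formats are assumptions
csv_text <- "Name,Contact%,F-Strike%
Player A,80.1 %,60.2 %
Player B,75.0 %,58.0 %"

toy <- read.csv(text = csv_text, stringsAsFactors = FALSE)
names(toy)   # read.csv replaces '%' and '-' with '.'

# Drop the trailing '.' left behind by the '%' sign
names(toy) <- sub("\\.+$", "", names(toy))

# Hypothetical helper: "80.1 %" -> 0.801
pct_to_prop <- function(x) {
  as.numeric(sub("[[:space:]]*%$", "", as.character(x))) / 100
}

pct_cols <- c("Contact", "F.Strike")
toy[pct_cols] <- lapply(toy[pct_cols], pct_to_prop)
toy
```

If the real files already hold numeric proportions, none of this is needed; checking `str(batting)` after the join would show which case applies.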
Fitting Some Logistic Models
We will fit a series of logistic models and use the “drop in deviance” statistic as our guide in choosing among them. The glm function fits logistic models. After some experimentation, I fit a model with the variables Contact, Swing, F.Strike, SwStr, and Zone, and use the anova function to show the deviance for a sequence of models, adding one variable at a time.
    regdata2 <- batting[, c(names(batting)[3:11], "SO", "AB")]
    fit <- glm(cbind(SO, AB - SO) ~ Contact + Swing + F.Strike + SwStr + Zone,
               family=binomial, data=regdata2)
    anova(fit)
    ## Analysis of Deviance Table
    ##
    ## Model: binomial, link: logit
    ##
    ## Response: cbind(SO, AB - SO)
    ##
    ## Terms added sequentially (first to last)
    ##
    ##
    ##          Df Deviance Resid. Df Resid. Dev
    ## NULL                       140    1781.92
    ## Contact   1  1353.44       139     428.48
    ## Swing     1   144.24       138     284.24
    ## F.Strike  1    14.63       137     269.61
    ## SwStr     1     6.36       136     263.25
    ## Zone      1     0.06       135     263.19
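One way to read the Deviance column is to compare each drop with a chi-square distribution with one degree of freedom. The sketch below does this by hand, using the drops copied from the table above rather than the fitted object.

```r
# Drops in deviance copied from the anova table above
drops <- c(Contact = 1353.44, Swing = 144.24, F.Strike = 14.63,
           SwStr = 6.36, Zone = 0.06)

# Upper-tail chi-square probability (1 degree of freedom) for each drop
p_values <- pchisq(drops, df = 1, lower.tail = FALSE)
signif(p_values, 3)

# 5% critical value for a chi-square with 1 degree of freedom (about 3.84)
crit <- qchisq(0.95, df = 1)
crit
```

With the fitted model available, `anova(fit, test = "Chisq")` adds these p-values to the table directly. Because the terms are added sequentially, each drop is conditional on the variables entered before it, so a different ordering would give different drops.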
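Looking at the anova output, it is clear that the variables Contact, Swing, and F.Strike are helpful in predicting the strikeout rate. (I am focusing on the drop in deviance and comparing each drop with a chi-square distribution with one degree of freedom.) The drop for SwStr, 6.36, is far smaller than the first three, even though it exceeds the 5% critical value of 3.84, and the drop for Zone is negligible. The final model, using the first three variables, is displayed below: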
    final.fit <- glm(cbind(SO, AB - SO) ~ Contact + Swing + F.Strike,
                     family=binomial, data=regdata2)
    library(arm)
    display(final.fit)
    ## glm(formula = cbind(SO, AB - SO) ~ Contact + Swing + F.Strike,
    ##     family = binomial, data = regdata2)
    ##             coef.est coef.se
    ## (Intercept)  4.60     0.20
    ## Contact     -6.63     0.17
    ## Swing       -2.83     0.25
    ## F.Strike     1.14     0.30
    ## ---
    ##   n = 141, k = 4
    ##   residual deviance = 269.6, null deviance = 1781.9 (difference = 1512.3)
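To see what these coefficients imply, the fitted equation can be evaluated by hand. This is a sketch that copies the rounded estimates from the output above; the helper `predict_so` and the hitter's values are made up for illustration, and the inputs are assumed to be proportions between 0 and 1.

```r
# Rounded coefficients copied from the display(final.fit) output above
b <- c(intercept = 4.60, Contact = -6.63, Swing = -2.83, F.Strike = 1.14)

# Hypothetical helper: predicted strikeout probability for given proportions
predict_so <- function(contact, swing, f_strike) {
  eta <- b[["intercept"]] + b[["Contact"]] * contact +
    b[["Swing"]] * swing + b[["F.Strike"]] * f_strike
  plogis(eta)   # inverse logit
}

# Made-up hitter: 80% contact, 46% swing, 60% first-pitch strikes
p_base <- predict_so(contact = 0.80, swing = 0.46, f_strike = 0.60)

# Same hitter with contact lowered to 75%
p_low_contact <- predict_so(contact = 0.75, swing = 0.46, f_strike = 0.60)

c(base = p_base, low_contact = p_low_contact)
```

With the actual fitted object, `predict(final.fit, newdata = ..., type = "response")` would do the same thing without rounding the coefficients.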
Making Sense of My Final Model
To make this fitted model easier to interpret, I will standardize each input by subtracting its mean and dividing by its standard deviation. Here’s the fitted model using the standardized predictors:
    final.fit2 <- glm(cbind(SO, AB - SO) ~ scale(Contact) + scale(Swing) +
                        scale(F.Strike), family=binomial, data=regdata2)
    display(final.fit2)
    ## glm(formula = cbind(SO, AB - SO) ~ scale(Contact) + scale(Swing) +
    ##     scale(F.Strike), family = binomial, data = regdata2)
    ##                 coef.est coef.se
    ## (Intercept)     -1.41     0.01
    ## scale(Contact)  -0.37     0.01
    ## scale(Swing)    -0.15     0.01
    ## scale(F.Strike)  0.05     0.01
    ## ---
    ##   n = 141, k = 4
    ##   residual deviance = 269.6, null deviance = 1781.9 (difference = 1512.3)
Now the intercept makes sense: it is the logit of the strikeout probability for an average hitter (one with mean values of Contact, Swing, and F.Strike). That logit is -1.41, which is 0.196 on the probability scale. If a hitter’s Contact proportion is one standard deviation above the mean, his predicted strikeout logit is -1.41 - 0.37 = -1.78, which corresponds to a strikeout probability of 0.144.
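The conversions in the previous paragraph can be reproduced with `plogis`, the inverse logit function in base R. The coefficients are the rounded values from the standardized fit.

```r
# Rounded estimates from the standardized fit
intercept <- -1.41
b_contact <- -0.37

# Average hitter: all standardized predictors equal 0
p_average <- plogis(intercept)

# Contact one standard deviation above / below the mean, others at the mean
p_contact_plus1  <- plogis(intercept + b_contact * 1)
p_contact_minus1 <- plogis(intercept + b_contact * (-1))

round(c(average = p_average,
        contact_plus1 = p_contact_plus1,
        contact_minus1 = p_contact_minus1), 3)
```

The effect is not symmetric on the probability scale: a one standard deviation drop in Contact raises the predicted probability by more than a one standard deviation increase lowers it.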
Graphs help in understanding these effects. Below we graph the predicted probability of striking out as a function of Contact, with Swing and F.Strike fixed at their average values. Likewise, we show how the fitted probability changes as a function of Swing and of F.Strike, with the other variables fixed at their averages.
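A rough version of these curves can be drawn from the standardized coefficients alone. This is a sketch, not the code behind the original graphs: since the predictor means and standard deviations are not listed in the post, the horizontal axis here is in standard deviation units instead of the original percentages.

```r
# Rounded estimates copied from the display(final.fit2) output
intercept <- -1.41
slopes <- c(Contact = -0.37, Swing = -0.15, F.Strike = 0.05)

# z = standard deviations from the mean of the predictor being varied;
# the other two predictors sit at their means, so they contribute 0
z <- seq(-2, 2, by = 0.1)
curves <- sapply(slopes, function(b) plogis(intercept + b * z))

# One line per predictor
matplot(z, curves, type = "l", lty = 1:3, col = 1:3,
        xlab = "Standard deviations from the mean",
        ylab = "Predicted strikeout probability")
legend("topright", legend = colnames(curves), lty = 1:3, col = 1:3)
```

Putting the three predictors on a common scale makes the slopes directly comparable: the Contact curve is the steepest, the Swing curve is flatter, and the F.Strike curve rises only slightly.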
There is more to say about this regression example, but among all of the plate discipline variables, the Contact percentage seems to be the input that drives the strikeout rate, although the Swing percentage and the first-pitch strike rate are also helpful inputs.