A Prediction Problem
At this point in the season, folks are interested in extreme stats and want to predict final-season measures. On the morning of Saturday, May 20, here are the leading batting averages:
Justin Turner .379
Ryan Zimmerman .375
Buster Posey .369
At the end of this season, who among these three will have the highest average?
Regression to the Mean
Of course these batting averages are based on a small number of at-bats (between 120 and 144) and one expects all of these extreme averages to move towards the mean as the season progresses. One might think that Turner will win the batting crown, but certainly not with a batting average of .379. Anyway, the message in this post is that one can do better than simple shrinkage of the batting averages towards the mean.
Breaking Down a Batting Average
Recently I wrote a paper (eventually published in JQAS) on a better way of estimating a collection of batting averages. One can write
BA = (1 – SO.Rate) (HR.Rate + (1 – HR.Rate) BABIP)
where
- SO.Rate = SO / AB
- HR.Rate = HR / (AB – SO) (the proportion of home runs among all at-bats that are not strikeouts)
- BABIP = (H – HR) / (AB – SO – HR) (the batting average on balls hit in play)
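As a quick check on the decomposition, here is a minimal Python sketch that splits a hitter's counts into the three component rates and recombines them. The function names `component_rates` and `batting_average` are made up for this illustration; the counts are Keon Broxton's May 20 numbers, which appear later in this post.

```python
def component_rates(ab, h, so, hr):
    """Split hitting counts into the three component rates."""
    so_rate = so / ab                    # strikeouts per at-bat
    hr_rate = hr / (ab - so)             # home runs per non-strikeout at-bat
    babip = (h - hr) / (ab - so - hr)    # hits per ball put in play
    return so_rate, hr_rate, babip


def batting_average(so_rate, hr_rate, babip):
    """Recombine the component rates into a batting average."""
    return (1 - so_rate) * (hr_rate + (1 - hr_rate) * babip)


# Keon Broxton's counts on May 20 (taken from later in this post).
ab, h, so, hr = 132, 35, 55, 4
so_rate, hr_rate, babip = component_rates(ab, h, so, hr)
ba = batting_average(so_rate, hr_rate, babip)

print("SO rate:", round(so_rate, 3))
print("HR rate:", round(hr_rate, 3))
print("BABIP:  ", round(babip, 3))
print("BA from components:", round(ba, 3), " H / AB:", round(h / ab, 3))
```

The recombined value agrees with H / AB, so the formula is an exact identity and not an approximation. That is what makes it safe to estimate each rate separately and then multiply the pieces back together.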
A Better Estimator of BA
The idea in my paper is to separately estimate the SO rates for all players, the HR rates, and the BABIP rates. Each estimation is done with a multilevel (random effects) model. The estimates will shrink the rates towards the average, and the degree of shrinkage depends on the “luckiness” of the rate. If much of the variation in the rates is due to chance, then the degree of shrinkage will be high. On the other hand, if the rates are more talent-driven, then the degree of shrinkage will be low. What is interesting here is that the SO rates, the HR rates, and the BABIP are different from a luck/talent perspective, and the degree of shrinkage will vary among these three rates.
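To give a feel for how "luckiness" controls the shrinkage, here is a rough Python sketch. It is not the multilevel model from the paper; it is a simplified stand-in that treats the league mean as `k` pseudo-trials added to the observed data. The function name `shrink`, the league mean of .300, and both values of `k` are assumptions chosen only for illustration. In a real multilevel fit, the amount of shrinkage would be estimated from the data for each rate.

```python
def shrink(successes, trials, league_mean, k):
    """Pull an observed rate toward league_mean.

    k acts like a number of pseudo-trials at the league mean: the larger
    k is relative to trials, the stronger the shrinkage.
    """
    return (successes + k * league_mean) / (trials + k)


# Hypothetical inputs: 31 successes in 73 trials, league mean .300.
successes, trials, league_mean = 31, 73, 0.300
observed = successes / trials

# Small k: a talent-driven rate, so the observed value is mostly trusted.
weak = shrink(successes, trials, league_mean, k=20)
# Large k: a luck-driven rate, so the estimate lands near the league mean.
strong = shrink(successes, trials, league_mean, k=400)

print("observed:        ", round(observed, 3))
print("mild shrinkage:  ", round(weak, 3))
print("strong shrinkage:", round(strong, 3))
```

Both estimates sit between the observed rate and the league mean. Only the distance moved changes with `k`, which is the behavior described above for talent-driven versus luck-driven rates.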
I collected the standard batting stats for all qualifying hitters from FanGraphs on the morning of May 20. Using my method, I found estimates of final-season BA by separately estimating the SO rates, the HR rates, and the BABIP rates, and then combining the component rates (using the formula above) to get estimates of the batting averages. I constructed a scatterplot of the May batting average against my final-season prediction for all players, adding the line y = x and a horizontal line at the median batting average. Note that the May batting averages fall between 0.150 and 0.379; in contrast, my predictions fall between 0.210 and 0.310. It is possible to fall below the Mendoza line in May, but unlikely in October.
The Three Hitters
I’ve labeled the points corresponding to the top three hitters. Note that currently (in May) Turner’s AVG is higher than Zimmerman’s, which is higher than Posey’s. But my final-season predictions are in a different order: Posey is predicted higher than Zimmerman, who is higher than Turner. How does this happen?
The explanation is that these three hitters are getting a high BA in different ways. To help understand this, I have graphed the observed and predicted component rates for these three players below. Let’s look at each component rate:
- (Strikeout rates) Posey has the smallest strikeout rate followed by Turner and Zimmerman. Both Posey and Turner have below-average SO rates and so we expect them to move a bit towards the average at the end of the season. But SO rates are more talent-driven, so the size of the shrinkage is small.
- (Home run rates) Zimmerman has the highest HR rate followed by Posey and Turner. We expect Zimmerman’s high HR rate to drop and Turner’s low HR rate to increase. The degree of shrinkage again is modest since HR rates (like SO rates) are talent-driven.
- (BABIP rates) Turner has the highest BABIP rate followed by Zimmerman and Posey. All of these BABIP rates are high and since these rates are more luck-driven we shrink them strongly towards the mean. (Actually it is more accurate to say that the estimates from the model perform this severe shrinkage.)
So I predict Posey to finish with a .311 average, followed by Zimmerman at .305 and Turner at .297. As we saw above, I have a stronger belief in Posey’s small strikeout rate than in Turner’s high BABIP rate, since SO rates are more reflective of a batter’s talent than BABIP rates.
Two Interesting Players
Generally one predicts a final-season batting average to move towards the average. But using this method, that is not always the case. Here’s an example. Keon Broxton (labeled in the scatterplot) currently has an AVG of 35 / 132 = 0.265, but I predict his final-season AVG to be 0.228, which is below the mean. Why? Let’s break down his 0.265 average into the component rates:
- SO Rate is 55 / 132 = .417; predict this to be .366 at season end
- HR Rate is 4 / 77 = .052; predict this to be .052 at season end
- BABIP Rate is 31 / 73 = .425; predict this to be .325 at season end
So I predict Broxton’s final season batting average (using my formula) by
AVG = (1 – .366) (.052 + (1 – .052) .325) = 0.228
Broxton is a player with a high SO rate who is maintaining this .265 AVG by being very lucky in getting hits on balls in play. I believe more in his high SO rate than his high BABIP rate — this explains why I expect his batting average to drop to 0.228.
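The arithmetic for Broxton can be reproduced with a short Python sketch. The helper name `batting_average` is made up for this illustration; the observed counts and the predicted rates are the ones quoted above. The last two lines swap in one predicted rate at a time, which is my own way of showing which component drives the drop and is not part of the model itself.

```python
def batting_average(so_rate, hr_rate, babip):
    """Combine the three component rates into a batting average."""
    return (1 - so_rate) * (hr_rate + (1 - hr_rate) * babip)


# Observed rates from Broxton's May 20 counts.
obs_so, obs_hr, obs_babip = 55 / 132, 4 / 77, 31 / 73
# Predicted season-end rates quoted in this post.
pred_so, pred_hr, pred_babip = 0.366, 0.052, 0.325

observed = batting_average(obs_so, obs_hr, obs_babip)
predicted = batting_average(pred_so, pred_hr, pred_babip)

# Change one component at a time, holding the others at observed values.
only_so = batting_average(pred_so, obs_hr, obs_babip)
only_babip = batting_average(obs_so, obs_hr, pred_babip)

print("observed AVG:       ", round(observed, 3))
print("predicted AVG:      ", round(predicted, 3))
print("only SO rate moves: ", round(only_so, 3))
print("only BABIP moves:   ", round(only_babip, 3))
```

Moving only the strikeout rate towards the average would actually raise the batting average, while moving only BABIP lowers it by more than the full prediction does. The net drop to 0.228 is therefore driven by the BABIP shrinkage, partly offset by the expected improvement in strikeouts.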
I have also labeled Miguel Sano in the scatterplot — he also has an insanely high BABIP rate of 28 / 64 = 0.4375 which he won’t sustain during the season. His current batting average is .299, but I predict it to drop to .246 by the end of the season.
Other Relevant Posts
I’ve written about shrinking batting averages and component rates on this blog in the past — here are six posts that take different slants on these issues. Some of these posts share R code to perform the calculations and graphs.