When we think of home run hitting, we tend to focus on the great home run leaders in MLB history, such as Babe Ruth, Roger Maris, Barry Bonds, and Mark McGwire. But the story of the 2019 season is not great home run leaders but rather a general increase in home run hitting among all players. So it is better to focus on the distribution of home runs across all players. Actually, I want to focus on the distribution of players’ home run talents instead of their current home run performances. Here we will spend some time on the distinction between home run talent and home run performance, describe a reasonable model for estimating home run talents, and then show how the distribution of home run talents has changed over recent seasons.
Distinction Between Home Run Talent and Home Run Performance — Comparing Two Hitters
Suppose we measure home run performance by the fraction of in-play balls that are home runs. Who is currently better (through games of June 10): Miguel Sano or Christian Yelich? Looking at the data, Sano hit 6 home runs in 38 balls put in play for a home run rate of 6 / 38 = .158, and Yelich has a home run rate of 24 / 172 = .140. So Sano is better, right?
Well, maybe not, since there is a sample size issue in this comparison. Although Sano has a higher HR rate, it is based on a smaller sample size. We are more confident in Yelich’s HR rate since it is based on more balls in play. So although Sano has a better home run performance than Yelich, it is very possible that Sano has a smaller home run talent than Yelich.
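To make the sample size issue concrete, here is a small Python sketch (the analysis in this post is done in R, so this is only an illustration). It computes each hitter's observed rate from the counts quoted above, together with the usual binomial standard error, sqrt(rate * (1 - rate) / n). The standard error formula is a rough large-sample approximation that I am bringing in here, not part of the model described later.

```python
import math

# Counts quoted in the text (through games of June 10, 2019):
# (home runs, balls in play)
hitters = {"Sano": (6, 38), "Yelich": (24, 172)}

def rate_and_se(hr, bip):
    """Observed HR rate and its approximate binomial standard error."""
    rate = hr / bip
    se = math.sqrt(rate * (1 - rate) / bip)
    return rate, se

summary = {name: rate_and_se(hr, bip) for name, (hr, bip) in hitters.items()}
for name, (rate, se) in summary.items():
    print(f"{name}: rate = {rate:.3f}, approx. SE = {se:.3f}")
```

Sano's standard error comes out at more than twice Yelich's, which is one way of saying that his .158 tells us much less about his underlying talent.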
The baseball fan struggles with this sample size issue. One way of handling it is to compare only hitters with sufficient opportunities. So we might decline to compare the home run rates of Sano and Yelich, limiting our search to players with at least 100 balls in play. This helps, but it is unclear where to set the lower limit for the sample size.
Estimating Home Run Probabilities
One can construct a graph that shows the actual home run rates for all players in a particular season, but this is hard to interpret since these rates are based on different sample sizes. It is better to focus on looking at the distribution of estimated home run probabilities. A home run probability is better than a home run rate in terms of predicting future performance.
Before I show you a statistical model, let’s think about what these home run probabilities will look like for our two players. Christian Yelich is a regular player who is clearly one of the best hitters in baseball. So we are pretty confident that Yelich’s home run probability is close to his current rate of .140, although we might want to adjust the estimate a little towards the average home run rate since we know that historically .140 is very high. In contrast, we believe less in Miguel Sano’s rate of .158 since it is based on a small number of balls in play. Our informal estimate of Sano’s home run probability will likely be adjusted more strongly towards the average rate for all players.
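This informal adjustment can be sketched in a few lines of Python. Note that this is a simpler stand-in for the model described in this post: it uses a conjugate Beta prior, which behaves like adding a fixed number of "prior" balls in play hit at the average rate. The values `avg_rate = 0.05` and `prior_bip = 100` are hypothetical choices for illustration, not estimates from the data.

```python
def shrink(hr, bip, avg_rate=0.05, prior_bip=100):
    """Posterior mean of a HR probability under a Beta prior.

    avg_rate and prior_bip are hypothetical: the prior acts like
    prior_bip extra balls in play hit at the average rate.
    """
    a = avg_rate * prior_bip          # prior "home runs"
    b = (1 - avg_rate) * prior_bip    # prior "non-home runs"
    return (hr + a) / (bip + a + b)

sano = shrink(6, 38)       # observed rate 6 / 38
yelich = shrink(24, 172)   # observed rate 24 / 172
print(f"Sano: {sano:.3f}, Yelich: {yelich:.3f}")
```

Under these assumed settings both estimates move towards the average, but Sano's moves much further because his 38 balls in play carry less weight against the prior, and the ordering of the two hitters flips.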
A Multilevel Model
Here is the model we’ll use to estimate the home run talents (probabilities) for all players in a particular season. We assume that the number of home runs y_j for a player is binomial with sample size n_j and probability p_j. We reexpress p_j by the logit theta_j = log(p_j / (1 - p_j)). This is a good thing to do since theta_j is real-valued, which makes it convenient for regression modeling.
We assume the logits theta_1, …, theta_N of the home run probabilities follow a symmetric distribution with location M and scale S. The typical assumption is to let this distribution be Normal with mean M and standard deviation S. An alternative assumption is to let the distribution be Cauchy with location M and scale S. In either case, we assume that M and S are unknown and assign each parameter a weakly informative prior distribution.
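Written out in symbols, the model reads roughly as follows. I am assuming here that the second-stage distribution is placed on the logit scale, and writing Normal(M, S) with S as a standard deviation. The weakly informative priors on M and S are not spelled out in this post, so they are left unspecified.

```latex
\begin{align*}
y_j \mid p_j &\sim \text{Binomial}(n_j,\, p_j), \quad j = 1, \dots, N, \\
\theta_j &= \log\frac{p_j}{1 - p_j}, \\
\theta_j \mid M, S &\sim \text{Normal}(M,\, S)
  \quad \text{or} \quad
  \theta_j \mid M, S \sim \text{Cauchy}(M,\, S).
\end{align*}
```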
Comparing Normal and Cauchy Models
There is an advantage to using the Cauchy model assumption instead of the Normal. We are aware of extreme home run performances (Ruth’s 60, Maris’ 61, and Bonds’ 73) and these two models treat such outliers differently. Let me illustrate with data from the 2001 season. I’m estimating the home run rates using this multilevel model for all players with at least 25 balls in play.
I’ve labelled four outlying estimated home run probabilities by B (Barry Bonds), M (Mark McGwire), S (Sammy Sosa), and T (Jim Thome). Below I display the estimated home run probabilities using Cauchy (top) and Normal (bottom) models. Looking at the figure closely, you’ll see that the Normal model tends to pull these outlying home run rates towards the overall average rate. The Cauchy model tends to leave these unusual rates alone. I prefer this Cauchy modeling behavior since it better sets apart these unusual performances.
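The different treatment of outliers can be seen in a simplified, single-player Python sketch. It is not the full multilevel fit: M and S are held fixed at hypothetical values (average rate 0.04, scale 0.5 on the logit scale), the slugger's counts (70 home runs in 400 balls in play) are made up, and the posterior mean is found by a simple grid approximation rather than MCMC.

```python
import math

def posterior_mean_p(hr, bip, log_prior, lo=-8.0, hi=2.0, steps=10000):
    """Posterior mean of p = 1 / (1 + exp(-theta)) via a grid over theta."""
    width = (hi - lo) / steps
    thetas = [lo + (i + 0.5) * width for i in range(steps)]
    # Binomial log-likelihood written in terms of the logit theta
    # (constant dropped): hr * theta - bip * log(1 + exp(theta)).
    log_post = [hr * t - bip * math.log1p(math.exp(t)) + log_prior(t)
                for t in thetas]
    peak = max(log_post)  # subtract the peak to avoid underflow
    weights = [math.exp(lp - peak) for lp in log_post]
    probs = [1.0 / (1.0 + math.exp(-t)) for t in thetas]
    return sum(w * p for w, p in zip(weights, probs)) / sum(weights)

# Hypothetical fixed hyperparameters on the logit scale.
M = math.log(0.04 / 0.96)
S = 0.5

def normal_log_prior(t):
    return -0.5 * ((t - M) / S) ** 2

def cauchy_log_prior(t):
    return -math.log(1.0 + ((t - M) / S) ** 2)

hr, bip = 70, 400  # made-up outlying slugger
normal_est = posterior_mean_p(hr, bip, normal_log_prior)
cauchy_est = posterior_mean_p(hr, bip, cauchy_log_prior)
print(f"observed {hr / bip:.3f}, Normal {normal_est:.3f}, Cauchy {cauchy_est:.3f}")
```

With these assumed inputs the Normal prior pulls the slugger's estimate noticeably below his observed rate, while the heavy-tailed Cauchy prior leaves it close to where the data put it.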
Comparing 2001 and 2019
Since we are talking about one of those “steroids” seasons, let’s compare home run hitting for 2001 and 2019. For 2001, I’m considering the home run rates for all players with at least 25 balls in play. For 2019, I consider all players with at least 8 balls in play. (I do this so we have a comparable number of players in each group.) Although there were some great home run hitters in 2001, note that the majority of hitters in 2019 hit for a higher rate than in 2001. In 2001, we see a sizeable number of hitters with small true rates of 0.02 or smaller. In contrast, few 2019 hitters have true rates smaller than 0.02.
Comparing Five Recent Seasons
Since there is much interest in the pattern of home run hitting for recent seasons, I use my Cauchy multilevel model to estimate the home run probabilities for each of the seasons 2015 through the current 2019 season. I collect hitters with a minimum number of balls in play (25 for seasons 2015 through 2018 and 8 for 2019) so we are looking at roughly 550 players each season. There are two main takeaways.
- First, each season has some home run sluggers with home run rates in the 10-15% range. In other words, these seasons look similar at the extreme high end.
- But there is a general change in the location of the home run probabilities for the majority of players. In 2015, most batters had home run probabilities in the 0.03 to 0.04 range and it was unusual to have a home run probability over 0.05. A quick calculation gives that only 13% of the 2015 players had home run probabilities exceeding 0.05. In contrast, the distribution of the home run probabilities for 2019 is actually centered about 0.05; I compute that 35% have HR probabilities exceeding 0.05. Most current players have a good chance of hitting a HR.
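The "quick calculation" in the second takeaway is just the share of players whose estimated probability exceeds a cutoff. A minimal Python sketch, using a made-up list of eight estimates (these are not the actual 2015 or 2019 values):

```python
def share_above(probabilities, cutoff=0.05):
    """Fraction of estimated HR probabilities strictly above the cutoff."""
    return sum(p > cutoff for p in probabilities) / len(probabilities)

# Hypothetical estimates for eight players, for illustration only.
example_estimates = [0.021, 0.034, 0.038, 0.047, 0.052, 0.061, 0.044, 0.118]
example_share = share_above(example_estimates)
print(f"{example_share:.1%} of these players exceed 0.05")
```

Applying the same function to each season's vector of estimated probabilities would give the kind of season-by-season percentages quoted above.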
Here are some main points and I’ll mention the R tools I use.
- Let’s return to our Yelich/Sano comparison. Remember that Sano’s 2019 home run rate of 6 / 38 = .158 exceeds Yelich’s rate of 24 / 172 = .140, which on the surface suggests that Sano is the better home run hitter. But when we fit our multilevel model, we find that Sano’s estimated home run probability is .083, which is smaller than Yelich’s estimate of .125. We believe less in Sano’s home run rate since it is based on a small sample size, and the model shrinks Sano’s observed rate strongly towards the average rate.
- By using this method, we don’t have to limit our leaderboard to batters with a minimum sample size. This method provides reasonable probability estimates for all hitters. For players with a small number of balls in play, the probability estimates will be close to the average rate across all players.
- Is baseball talent normally distributed? Our graphs indicate that the middle of the distribution of home run probabilities is bell-shaped but we have some outliers that would not be predicted by a normal curve.
- These Bayesian models are easy to fit using the MCMC software JAGS and the runjags R package. One writes a script defining the model and one simulates from the posterior probability distribution of all unknown parameters which we summarize to get these graphs.