This has been an exciting season for the New York Yankees, who are having one of their best seasons. Fans are following the hitting of Aaron Judge, who is on a record-setting pace after hitting his 42nd home run on July 30. That raises the obvious question: how many home runs will Judge hit in 2022, and will he beat Roger Maris’ 61 home runs hit during the 1961 season?
To address this question, I illustrate Bayesian prediction of Judge’s home run hitting for the remainder of this season. I construct a prior for Judge’s 2022 home run probability using his performance during earlier seasons and update this information with the current 2022 data. I also construct a prior for the number of plate appearances in the remainder of the season. Using these two distributions and a binomial sampling model, I simulate home run results for the remainder of the season. Using an R function and an associated Shiny app, I obtain a point prediction for the number of 2022 home runs and compute the probability that Judge hits at least 62 home runs this season.
The Method — Figuring out Priors
We are interested in learning about $y$, the number of home runs that Judge will hit in the remainder of the season. Suppose we let $p$ denote Judge’s probability of a home run in a single plate appearance in the 2022 season. If $n$ denotes the number of remaining plate appearances, then a common model says that the number of future home runs $y$ has a binomial distribution with sample size $n$ and probability of success $p$. The problem in using this distribution is that we have two unknowns: we don’t know Judge’s home run probability $p$ and we don’t know the exact number of plate appearances $n$ in the remainder of the season. Our first task is to assign reasonable prior distributions to $p$ and $n$, and then we can simulate the predictive distribution for $y$.
Prior for the home run probability. We’ve watched Aaron Judge over the seasons 2015 through 2021 and have some belief about the location of Judge’s 2022 home run probability $p$ on a single PA. Suppose we can construct an interval $(p_L, p_U)$ such that we believe that $p$ is equally likely to be inside or outside of the interval. (These two numbers are the quartiles of the prior.) This information can be matched with a beta curve with shape parameters $a$ and $b$.
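One rough way to check that a candidate beta curve matches a stated pair of quartiles is by simulation. The Python sketch below is illustrative only and is not the code behind the app; the function name `simulated_quartiles` is hypothetical, and the shape values (5.20, 66.97) and interval (0.05, 0.09) are the ones reported later in this post.

```python
import random

def simulated_quartiles(a, b, draws=50_000, seed=2022):
    """Estimate the 25th and 75th percentiles of a beta(a, b) curve by simulation."""
    rng = random.Random(seed)
    sample = sorted(rng.betavariate(a, b) for _ in range(draws))
    return sample[draws // 4], sample[(3 * draws) // 4]

# Shape parameters and 50% interval as reported later in the post.
q1, q3 = simulated_quartiles(5.20, 66.97)
print(f"simulated quartiles: {q1:.3f}, {q3:.3f}  (stated interval: 0.05, 0.09)")
```

If the two simulated values land close to the stated interval, the beta curve is a reasonable match to the belief that the probability is equally likely to be inside or outside that interval.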
Update this prior with data. Next we look at Judge’s home run performance for the 2022 season. At this time (through July 31), he has hit 42 home runs in 441 plate appearances. Assuming that the home run occurrences follow a binomial model and the prior is a beta curve with shape parameters $a$ and $b$, our updated or new beliefs about Judge’s home run probability follow a beta curve with shape parameters $a + 42$ and $b + 399$ (42 home runs and $441 - 42 = 399$ plate appearances without a home run).
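The update is simple arithmetic, which a short Python sketch can make concrete. The helper name `update_beta` is hypothetical, and the prior shape values 5.20 and 66.97 are the ones reported later in this post.

```python
def update_beta(a, b, home_runs, plate_appearances):
    """Conjugate update: a beta(a, b) prior plus binomial data gives a beta posterior."""
    return a + home_runs, b + (plate_appearances - home_runs)

prior_a, prior_b = 5.20, 66.97          # prior shapes reported later in the post
post_a, post_b = update_beta(prior_a, prior_b, home_runs=42, plate_appearances=441)

prior_mean = prior_a / (prior_a + prior_b)
observed_rate = 42 / 441
post_mean = post_a / (post_a + post_b)

print(f"posterior shapes: {post_a:.2f}, {post_b:.2f}")
print(f"prior mean {prior_mean:.4f} < posterior mean {post_mean:.4f} "
      f"< observed rate {observed_rate:.4f}")
```

The posterior mean sits between the prior mean and the observed 2022 rate, which is the sense in which the observed rate is pulled back toward an average value.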
Prior for the number of future PA. We also need a prior for $n$, the number of plate appearances in the remainder of the season. We can make an intelligent guess at $n$; call this guess $n_0$, and let $s$ be the standard deviation of this guess. We assume $n$ is normal with mean $n_0$ and standard deviation $s$. This means that we’re 68% confident that the number of future PA is between $n_0 - s$ and $n_0 + s$.
Predicting Future Home Runs
We simulate from the predictive distribution of the future number of home runs $y$. We do this simulation in two steps:
- We simulate values of the hitting probability $p$ and the future number of PA $n$: $p$ is simulated from the updated beta($a + 42$, $b + 399$) curve, where $a$ and $b$ are the prior shape parameters, and $n$ is simulated from a normal curve with mean $n_0$ and standard deviation $s$.
- Then we simulate $y$ from a binomial distribution with sample size $n$ and probability of success $p$.
By repeating these two steps many times, we obtain a large number of simulated draws from the predictive distribution. Given this sample, we compute any predictive summary of interest. For example, we can compute the mean of this distribution — this will be a prediction of the number of home runs that Judge will hit in the remainder of the season. Or we can compute predictive probabilities of interest. One probability of interest is the probability that Judge will hit at least 62 home runs and eclipse Maris’ Yankee record.
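The two-step simulation can be sketched in a few lines of Python using only the standard library. This is an illustration under stated assumptions, not the author's predict_hr() function: the name `simulate_future_hr` is hypothetical, the simulated number of PA is rounded to a whole number and floored at zero, and the inputs (prior shapes 5.20 and 66.97, 42 home runs in 441 PA, 254 future PA with standard deviation 15) are the ones reported later in this post.

```python
import random

def simulate_future_hr(post_a, post_b, pa_mean, pa_sd, sims=4000, seed=2022):
    """Draw from the predictive distribution of future home runs."""
    rng = random.Random(seed)
    draws = []
    for _ in range(sims):
        # Step 1: simulate the home run probability and the number of future PA.
        p = rng.betavariate(post_a, post_b)
        n = max(0, round(rng.gauss(pa_mean, pa_sd)))  # assumption: whole, non-negative PA
        # Step 2: simulate a binomial count as a sum of n Bernoulli(p) trials.
        y = sum(rng.random() < p for _ in range(n))
        draws.append(y)
    return draws

current_hr = 42
draws = simulate_future_hr(5.20 + 42, 66.97 + 399, pa_mean=254, pa_sd=15)

mean_future = sum(draws) / len(draws)
prob_62 = sum(current_hr + y >= 62 for y in draws) / len(draws)

print(f"predicted future HR (mean): {mean_future:.1f}")
print(f"P(at least 62 HR in 2022):  {prob_62:.3f}")
```

Because this is a simulation, the summaries will vary a little with the seed and the number of draws, and they need not match the app's output exactly.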
The Shiny App
I initially wrote an R function predict_hr() to implement this prediction exercise and then a Shiny app to run this function. Below is a snapshot of using this function.
- After some thought, I believe Judge’s true 2022 home run probability (based on his performance before the 2022 season) is 50% likely to be in the interval (0.05, 0.09) — I enter these values using the slider in the Shiny app. The output from the app tells me that these beliefs are consistent with a beta curve with shape parameters 5.20 and 66.97.
- After the games of July 31, Judge has hit 42 home runs in 441 plate appearances — I enter those values in the Observed PA and Observed HR boxes.
- Last, since the Yankees have 59 remaining games, I guess that Judge will have 254 future PA (averaging 4.3 PA per game for the remaining 59 games) and the standard deviation of this estimate is 15. I enter these values in the Future PA and Standard Deviation of Future PA boxes.
Based on these inputs, the app displays a plot of the predictive distribution of the number of home runs Judge will hit in the remainder of the season. The mean of this distribution is 23.7 — adding this number to Judge’s current HR count (42), I predict he will hit about 66 home runs in 2022. The chance he will break Maris’ record is 0.764. (I am adding up the predictive probabilities shaded in red.)
I have talked about Bayesian prediction a number of times on this blog. Five years ago in this post, I described predicting Aaron Judge’s home run count in the 2017 season using a different Bayesian model. There I used a multilevel model where I simultaneously estimated home run rates for all regular players. There the shape parameters of the beta prior were estimated by using data for all players. In both that situation and this situation, I estimate Judge’s true home run probability by adjusting the observed home run rate towards an average value. (Both models reflect the belief that Judge’s home run rate will cool off over these last 59 games.)
- This Shiny app is currently live at https://bayesball.shinyapps.io/Aaron_Judge_HR_Prediction/ and the interested reader is encouraged to try it out. The predictions depend on the inputs, so one can check the sensitivity of the predictions to different priors on the home run probability $p$ and the number of future plate appearances $n$.
- Note the substantial variation in the predictive distribution. This variation is driven primarily by the binomial sampling variation and not by the prior on the hitting probability.
- The app can be used during the concluding weeks of the 2022 season. One just needs to enter in the current PA and HR values and make an intelligent guess at the number of remaining PA and the standard deviation of this guess.
- The Shiny app is currently part of my ShinyBaseball package. The single app.R file found here contains all of the code and can be run assuming all of the prerequisite packages are installed. The amount of code is relatively short and might be helpful for those who are learning Shiny.
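The earlier comment that the predictive spread comes mainly from binomial sampling can be checked roughly with the law of total variance. The Python sketch below is an illustration, not part of the app: the function name `predictive_variance_parts` is hypothetical, the number of future PA is treated as an exact normal variable (ignoring rounding), the two inputs are assumed independent, and the values plugged in are the ones used in this post.

```python
def predictive_variance_parts(a, b, pa_mean, pa_sd):
    """Split Var(y) into E[Var(y | p, n)] and Var(E[y | p, n]) for independent p and n."""
    mean_p = a / (a + b)                               # E[p] for a beta(a, b)
    mean_p2 = a * (a + 1) / ((a + b) * (a + b + 1))    # E[p^2] for a beta(a, b)
    mean_n2 = pa_mean ** 2 + pa_sd ** 2                # E[n^2] for a normal
    sampling = pa_mean * (mean_p - mean_p2)            # E[n p (1 - p)]: binomial part
    inputs = mean_n2 * mean_p2 - (pa_mean * mean_p) ** 2   # Var(n p): uncertainty in p and n
    return sampling, inputs

# Updated beta shapes (5.20 + 42, 66.97 + 399) and the future-PA guess from the post.
sampling, inputs = predictive_variance_parts(47.20, 465.97, pa_mean=254, pa_sd=15)
share = sampling / (sampling + inputs)

print(f"binomial sampling variance:   {sampling:.1f}")
print(f"variance from p and n inputs: {inputs:.1f}")
print(f"share due to sampling:        {share:.0%}")
```

Under these assumptions the binomial term is the larger of the two pieces, which is consistent with the comment above.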