Ryan Pollack recently posted on his prediction for Aaron Judge’s home run in the 2017 season and asked me on Twitter about my predictions. I’m happy to oblige — I’ve written about reasonable Bayesian predictions of home run counts in the past (for example, look at Chapter 11 in Curve Ball), but it gives me an opportunity to showcase the use of a R function
predict_hr in my
predict_HR will provide the predictions — the inputs to the function are
- the data frame of current (HR, AB) stats for a group of players in the current season
- the last name of the player that you are interested in predicting hr counts
- the number of remaining games for the particular team in the season
- the number of iterations in the simulation (the default is 10,000)
Here is my prediction method incorporated in the function
- First I extract the current HR, AB data from the data frame d for the player of interest.
- Currently Judge has 24 HR in 245 AB for a 10% home rate, but since this is a large value, it is unlikely that he will maintain this rate for the remainder of the season. So using a beta-binomial model, I simultaneously estimate the true rates of all of the players including Judge. When I fit this model, I get estimates at eta, the average true rate, and K which determines the spread of the home run talents. By the way, here eta = 0.042 and K = 147.
- My beliefs about Judge’s true home run rates are expressed by a beta curve with shape parameters (24 + K * eta, 245 – 24 + K * (1 – eta)) where I substitute the estimates from above. This curve tells me about Judge’s true HR talent, not the number of home runs he will hit.
- I estimate the number of future AB using the number of future games and the average number of AB per game for Judge’s current games.
- I’m ready to simulate my predictions for Judge’s future games — I do this in two steps. I first simulate a value of Judge’s HR probability p (from the beta posterior distribution), and then simulate from a binomial distribution using the number of future AB and the probability equal to the simulated value. I repeat this process many times and I get a sample of future home runs from what is called the Bayesian predictive distribution.
Here’s my R script to do this prediction. The commands read in the current data from SI. All you need for predict_hr is the name of the player and the number of games in the remainder of the sseason.
library(BApredict) d <- collect_hitting_data() out <- predict_hr(d, "Judge", 93, ITER=10000)
Here is a histogram of the simulations from the predictive distribution. I’m plotting the final 2017 home run total for Judge.
ggplot(data.frame(HR = out$future_HR + out$current_HR), aes(HR)) + geom_histogram(binwidth=1, color="red", fill="white") + ggtitle("Aaron Judge's Predicted HR in 2017 Season")
Looking at the graph, it appears that it likely that Judge will hit between 40 and 60 home runs in the 2017 season. Of course, fans are interested in point predictions (looking at the graph, it appears that 49 is my best guess), but this guess ignores the large uncertainty in prediction and my method gives reasonable estimates for the spread in the prediction.
If you are interested in trying this method out for another player or just interested in looking at the code in the function
predict_hr, here’s a simple way of installing the package from github.
Addendum: June 23
Reflecting on my
predict_hr function, it seemed awkward to have to input the number of remaining games in the season. So I revised the function so it does not need this input. (I found the number of remaining games using the Standings table in the SI website.) If you are interested in predicting the final season home run count for Joey Votto, then use this code:
library(BApredict) d <- collect_hitting_data() out <- predict_hr(d, "Votto")
The output variable
out is a list with components