I’m excited — saberseminar is coming up this Saturday and Sunday. I’m not sure I’ll be able to listen to all 50+ presentations, but there is a great lineup of talks and I’m happy to have the opportunity to present. There will be five talks on Sunday that talk about home runs and the bat & ball collusion. (We might learn at saberseminar what is happening with the baseball that has contributed to the increase in home run hitting.)
Since there likely will be a lot of discussion about home runs this week, I thought I’d devote this post to predicting the 2019 home run rate. This is an interesting statistical problem and I thought it would be instructive to show how a modern Bayesian might approach this prediction problem. This approach is a good foundation for more sophisticated models as I will explain in the closing comments.
In an earlier post, I illustrated an empirical approach for doing this home run prediction. Bob Carpenter gave a good illustration of Bayesian thinking for this problem in his comments to this earlier post. Here I am implementing a Bayesian model-based approach which has the advantage that it can be generalized in different ways.
Home Runs per Game are Not Poisson
To start, a basic statistical model says that the home run count for a particular game is Poisson with a single unknown rate parameter . Honestly, I don’t believe this model is reasonable since I think that the rate of home runs per game is not constant — the rate would depend on the ballpark, teams playing, etc. Let’s illustrate a simple graphical method, called a Poisonness Plot, for checking if the observed home run counts over many games is Poisson distributed.
- (Get data) Starting with the Statcast data I collect the count of home runs in each game played in the 2019 season.
- (Bin the data) I have the home run counts in all games — numbers like 4, 3, 2, 3, etc. We bin these counts — get the number of games with a home run count of 0, number of games with a home run count of 1, etc.
- (Construct the plot) We have the binned data (y, Ny) where Ny is the number of games where there are exactly y home runs hit in a game. We compute the proportions p where y / Ny. A Poissonness plot graphs the values of y (horizontal) against the values PPy = log(p) + log(y!) (vertical). If these points follow a straight line, then the home runs are Poisson distributed where the slope is . If not, there is some evidence that a Poisson distribution with a single rate is not suitable.
Here’s this Poisonness plot for the 2019 home run counts. There is a clear nonlinear pattern here — there are more low home run games and more high home run games than one would predict on the basis of a Poisson model with a constant rate.
A Random Effects Model
Okay, the Poisson model with a single rate parameter doesn’t work for home run counts per game. A simple way to generalize this model is by the following random effects model.
- Assume that the count of home runs for the jth game is Poisson() — we are assuming that the true rate of home runs is different for each game.
- One assumes that the logarithms of the rates follow a normal distribution with mean M and standard deviation S.
- To complete the Bayesian model, one places weakly informative priors on M and S.
Fitting the Model using Stan and the brms Package
Currently, we can fit these random effect models using simulation-based Bayesian software. One of the most attractive MCMC algorithms is Hamiltonian sampling implemented by the Stan software and the R package brms provides an attractive interface for this MCMC engine for many popular regression models.
It is remarkably easy to fit this random effects model. Here I have a data frame S with two variables game_pk and Hr. There is a simple formula to write the random effects model and by default (by use of the family = Poisson option) there will be a linear model on the log rates.
When you run this program, you get estimates for M and S, the parameters of the normal random effects density. Here the posterior means of M and S are 0.99 and 0.24, respectively. I’ve graphed the N(0.99, 0.24) density below. The fitted model says that the log Poisson means (for the different games) are distributed from this density.
Actually this function provides simulated draws of the joint posterior for all parameters — in particular, one obtains a simulated sample of values of (M, S) — this represents a sample from the marginal posterior of (M, S).
How Many 2019 Home Runs?
We want to predict the number of home runs in the remaining games in the 2019 season. Using Bayesian talk, we are essentially simulating from the posterior predictive distribution. One can simulate the count of home runs by a three step process:
- (Step 1) First simulate a pair (M, S) from the posterior — this represents a single (normal) random effects distribution.
- (Step 2) In our illustration we have 799 games remaining. We simulate 799 values from the normal(M, S) distribution — these represent the true rates for the remaining games.
- (Step 3) Finally we simulate home run game counts from independent Poisson distributions where — these numbers represent draws from the predictive distribution.
- (Prediction) We sum these simulated home run counts and add the current home run count — this total is a prediction of the final 2019 home run count.
I repeated this simulation 1000 times and collected the simulated final 2019 HR counts. Here’s a histogram of these simulated predictions. Quickly, I can see that my best guess for the 2019 HR total is about 6700 and I am pretty certain that this total will fall in the interval (6600, 6800).
By the way, when I did this exercise (Friday, August 2), we had played 1631 MLB games and there were 799 remaining games to play — at this time there were 4505 home runs hit in these 1631 games. So I expect to see about 2100 to 2300 additional home runs this season.
Improving the Model and Other Things
- I like this approach since it illustrates the application of a complete Bayesian multilevel model and how one predicts future data from the model.
- Also this model can be generalized in various ways. Earlier I mentioned that the home run count should depend on the ballpark and the teams. One can explicitly model this dependence by placing a regression model on the true rates . This would be a good exercise for the interested reader.
- All of the R code for this exercise can be found here on my Github Gist site. If you have the Statcast data, then you’ll see that the coding is pretty compact.
- In my next post, I’ll focus on some of the interesting talks from the 2019 saberseminar.
Neat. I didn’t know brms made it that easy to fit these kinds of hierarchical models! I mainly work down at the C++ infrastructure level on Stan.
That default brms model provides much more flexibility than just assuming a negative binomial to handle overdispersion. You can easily add random effects like ballparks as you suggest and also fixed effects like temperature and humidity.
P.S. I didn’t initially see you on the program for the sabermetrics conference—I had to click on the Sunday button to get the link for Sunday talks.