As many of you know, the number of home runs hit in 2016 has increased dramatically from recent seasons. In fact, the current total is 5420, which is getting close to the record of 5693 from the 2000 season. That raises the obvious question: with 7 days of baseball remaining, how likely is it that 2016 will tie or break the 2000 record? This is a good opportunity to illustrate Bayesian thinking. I’ll propose a simple model for the number of home runs hit by a team in a game, fit this model to the team/game data for the 2016 season, and then use the fitted model to predict the number of home runs hit in this last week of the baseball season.
A model for home runs
Let y denote the number of home runs hit by a team in an individual game; this is typically a small number like 0, 1, 2, or 3. I will assume that the home run counts for all of the 2016 games follow a negative binomial distribution with parameters mu and size (I am using the R syntax for the dnbinom function). The parameter mu is the mean number of home runs and size is a dispersion parameter. (By the way, this is a better-fitting model than the Poisson distribution, which has only one free parameter.) To complete this model, I assign a vague, noninformative prior distribution to the parameters (mu, size).
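To make the parameterization concrete, here is a minimal Python sketch of the negative binomial probability function in the (mu, size) form that R's dnbinom accepts, next to the Poisson with the same mean. The function names nb_pmf and poisson_pmf are my own, and the values 1.17 and 9.95 are the estimates reported in this post, used here only for illustration. In this form the variance is mu + mu^2 / size, so a finite size allows more spread than the Poisson, whose variance equals its mean.

```python
import math

def nb_pmf(y, mu, size):
    """P(Y = y) for a negative binomial with mean mu and dispersion size."""
    # Work on the log scale so the gamma functions do not overflow.
    log_p = (
        math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
        + size * math.log(size / (size + mu))
        + y * math.log(mu / (size + mu))
    )
    return math.exp(log_p)

def poisson_pmf(y, mu):
    """P(Y = y) for a Poisson with mean mu."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

# Illustrative values: the estimates of (mu, size) reported in this post.
mu, size = 1.17, 9.95

# Compare the two models on the home run counts 0 through 4.
for y in range(5):
    print(y, round(nb_pmf(y, mu, size), 4), round(poisson_pmf(y, mu), 4))

# Variance implied by the negative binomial; the Poisson variance would be mu.
nb_variance = mu + mu**2 / size
print("NB variance:", round(nb_variance, 3), "Poisson variance:", mu)
```

The negative binomial puts a little more probability on both 0 and the larger counts than the Poisson does, which is what the extra dispersion parameter buys.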
Fit the model
From Baseball Reference I am able to collect the number of home runs hit by each team in each game; there were over 4500 sampled values of y. Fitting the Bayesian model, I got estimates of 1.17 and 9.95 for mu and size, respectively. I can approximate the posterior distribution of (mu, size) by a bivariate normal distribution with mean vector (1.17, 9.95) and an associated variance-covariance matrix.
Looking at the 2016 baseball schedule, it appears that we have 98 games remaining (including the rescheduled Miami/Atlanta game), so we have 98 x 2 = 196 team/games remaining. To simulate the number of home runs hit in these 196 opportunities, we first simulate (mu, size) from the posterior distribution, then simulate values of (y1, ..., y196) from the negative binomial distribution and compute the total number of home runs hit. If I repeat this process 1000 times, I get 1000 draws from the predictive distribution for the number of home runs hit in the remainder of the season.
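The simulation steps above can be sketched in Python using only the standard library. This is a rough sketch under stated assumptions, not the code behind the numbers in this post: the posterior standard deviations (0.017 and 1.5) and the zero correlation between mu and size are hypothetical placeholders, since the variance-covariance matrix is not given here, and all function names are my own. Negative binomial draws are generated as a gamma-Poisson mixture, which gives the same mean mu and variance mu + mu^2 / size.

```python
import math
import random

def draw_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def draw_neg_binomial(mu, size, rng):
    """One negative binomial draw via a gamma-Poisson mixture."""
    # rate ~ Gamma(shape=size, scale=mu/size) has mean mu and variance mu^2/size.
    rate = rng.gammavariate(size, mu / size)
    return draw_poisson(rate, rng)

def simulate_totals(n_draws=1000, n_team_games=196, seed=2016):
    """Draws from the predictive distribution of the remaining home run total."""
    rng = random.Random(seed)
    post_mean = (1.17, 9.95)   # posterior means reported in the post
    post_sd = (0.017, 1.5)     # HYPOTHETICAL standard deviations, not from the post
    totals = []
    for _ in range(n_draws):
        # Step 1: draw (mu, size) from the normal approximation.
        # ASSUMPTION: independent components (zero posterior correlation).
        mu = rng.gauss(post_mean[0], post_sd[0])
        size = rng.gauss(post_mean[1], post_sd[1])
        while mu <= 0 or size <= 0:  # guard against invalid parameter draws
            mu = rng.gauss(post_mean[0], post_sd[0])
            size = rng.gauss(post_mean[1], post_sd[1])
        # Step 2: draw y1, ..., y196 and add them up.
        totals.append(sum(draw_neg_binomial(mu, size, rng)
                          for _ in range(n_team_games)))
    return totals

totals = simulate_totals()
print("median predicted total:", sorted(totals)[len(totals) // 2])
print("share of draws at or above 273:", sum(t >= 273 for t in totals) / len(totals))
```

Because the posterior spread is assumed rather than taken from the fitted model, the printed summaries will only roughly track the ones quoted in this post.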
How many home runs will be hit?
I have graphed the predictive distribution of the number of home runs hit in the remainder of this season below and overlaid the target of 273 (that is, 5693 - 5420) needed to reach the 2000 season total.
What have we learned?
- There is a good amount of uncertainty about the actual number of home runs hit in the remaining 98 games. My best prediction is about 225, but it could easily be any value between 200 and 260.
- It is very unlikely that we will meet or exceed the 2000 season total. My calculation (based on my model) is that the predictive probability of meeting or exceeding the target is only about .002, or 0.2%.
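As a rough cross-check on that tail probability, one can hold (mu, size) fixed at the estimates (1.17, 9.95) and use the fact that a sum of n independent negative binomial counts with common (mu, size) is again negative binomial, with mean n * mu and dispersion n * size. The sketch below is a plug-in calculation of my own, not the simulation described above: it ignores posterior uncertainty in the parameters, so it need not match the simulated figure. Keep in mind too that with 1000 draws, a probability of about .002 corresponds to roughly two simulated seasons reaching the target, so the simulated figure is itself noisy.

```python
import math

def nb_log_pmf(y, mu, size):
    """Log of P(Y = y) for a negative binomial with mean mu and dispersion size."""
    return (
        math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
        + size * math.log(size / (size + mu))
        + y * math.log(mu / (size + mu))
    )

n_team_games = 196               # 98 remaining games x 2 teams
mu_hat, size_hat = 1.17, 9.95    # point estimates reported in the post
target = 5693 - 5420             # home runs needed to tie the 2000 total

# The sum of the 196 counts is negative binomial with scaled mean and size.
total_mu = n_team_games * mu_hat
total_size = n_team_games * size_hat

# P(total >= target) = 1 - P(total <= target - 1)
p_below = sum(math.exp(nb_log_pmf(t, total_mu, total_size)) for t in range(target))
p_reach = 1.0 - p_below

print("target:", target)
print("plug-in probability of reaching the target:", round(p_reach, 4))
```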
Although this has been an interesting exercise, maybe we asked the wrong question. What matters more may be not the number of home runs hit, but the rate at which they are hit. Looking quickly at Baseball Reference, we see:
- In the 2016 season (with 7 days remaining in the season), we have 5420 home runs in 159,144 AB
- In contrast, in 2000, there were 5693 home runs in 167,290 AB
Computing rates, we see that the 2016 home run rate is 3.406 percent, which actually exceeds the 2000 rate of 3.403 percent. So really 2016 might be the greatest home run hitting season of all time.
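The rate comparison is simple arithmetic on the home run and at-bat totals listed above; a short Python check (the variable names are my own):

```python
# (home runs, at-bats) for each season, as quoted from Baseball Reference above
seasons = {"2016": (5420, 159144), "2000": (5693, 167290)}

# Home runs per 100 at-bats
rates = {year: 100 * hr / ab for year, (hr, ab) in seasons.items()}

for year, rate in rates.items():
    print(f"{year}: {rate:.3f} percent")
# 2016: 3.406 percent
# 2000: 3.403 percent
```

The gap is only in the third decimal place, and the 2016 figure still excludes the final week of the season.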
Another question is why — what is causing the increase in home run hitting in 2016? I have not done a careful investigation, but I think part of the answer is the players’ approach to hitting — a large fraction of the scoring in baseball is due to home runs.