#### Home Run Prediction

We are essentially at the halfway point in the 2019 baseball season (1211 out of 2430 games played) and so it is time for baseball writers to talk about the highlights and make some predictions. Jayson Stark just posted an article which states that the projected home run count for the 2019 season is 6644. Okay, this is a reasonable estimate. One assumes that home runs will continue to be hit at a 2.73 per game clip that we’ve seen in the first half of the 2019 season. If you multiply this average by the number of remaining games and add the current home runs, you get (2430 – 1211) * 2.73 + 3311 which is approximately equal to 6644.

But I am pretty sure that **this estimate will be wrong** — the number of home runs for 2019 is almost certain to not be 6644. What’s missing is some indication of the size of the prediction error. So here’s a quick attempt to produce a 95% prediction interval for the number of home runs hit this season.

#### Distribution for the Number of Home Runs in a Game

We start by collecting the number of home runs hit in each of the 1211 games this season. I’ve graphed the probability distribution below — we see between 0 and 13 home runs hit and the most likely numbers are 2 and 3. (I already wrote a short post on the remarkable game between the Phillies and Diamondbacks where 13 home runs were hit.)

#### Prediction

When one predicts, there are two types of error involved. First, one is uncertain about the underlying** population of home runs hit per game.** I’m estimating this based on what we observed for the first 1211 games. Now there are reasons that the population of home run hitting for the 2nd half is different from the first half. For example, the weather tends to be a little warmer which might produce more home runs. Or there might be other reasons. But since we have a good amount of data and I would suspect that the population of home run hitting would not change much during one season, I think the histogram provides a good approximation to the underlying home run population for the 2nd half of the 2019 season. In other words, I am thinking that the inferential source of the variability is minimal.

The second source for prediction error is **the sampling or game to game variability**. Baseball is exciting due to this variability — one game there will be a slugfest and another game in the series will be a pitcher’s dual. One can simulate the number of home runs for a game by simply choosing a home run count at random from the home run counts for the 2019 first half games.

I can actually do this simulation 10,000 times using a single line of R code. I have a data frame S that contains the number of home runs for each of the 1211 games in the first half of 2019. I use the sample() function to take a sample of size 1219 (remember 1219 games remaining) from this data frame — I sum these simulated home runs to get a simulated prediction for the home run count for the remainder of the season. I repeat this 10,000 times to get many predictions — by adding 3311 to each prediction I get predictions for the entire season.

#### My Prediction

Here I graph the prediction distribution that I have simulated. By picking up the 2.5 and 97.5 percentiles, I get a 95% prediction interval. The probability that the 2019 home run falls between 6519 and 6770 is 0.95.

Some of you might be surprised about the size of the prediction interval. But it is very certain that the 2019 home run total will eclipse the current 2017 home run count of 6105. Also, assuming that the home run process doesn’t change much in the 2nd half of the season, I feel pretty confident that the home run count will fall in (6519, 6770).

#### R Code

For those who are interested, here is my R code for this prediction exercise.

My first instinct if I really trusted that first-half home-run rate would to be to predict total number of home runs is going to be Y = 3311 + Poisson(3311), which provides a rough 95% interval of 6622 +/- 2 * sd[Poisson(3311)] = 6622 +/- 2 * sqrt(3311) = (6507, 6737), which is narrower than the bootstrap estimate in the post of (6519, 6770) [and would be narrower still if I approximated with 1.96 * sd rather than 2].

If I really cared about this prediction, I’d want to model the uncertainty due to the first half of the season only being a sample (despite it being a large-ish sample). A simple-minded approach might be to say OK, the real rate is itself probably something like normal(3311, sqrt(3311)) so we might model estimation uncertainty as Z ~ normal(3311, sqrt(3311)), and set Y = 3311 + Poisson(Z). If I simulate that I get central 95% posterior intervals of (6466, 6781).

Then if I really cared, I’d fit a proper Bayesian model and do full Bayesian posterior predictive inference. I can write a simple Stan model that hard-codes the data

and fit it in R:

Nobody expects those last digits to accurately describe the 95% intervals, so we might as well report as the bootstrap from the article (6520, 6770), fixed rate Poisson (6510, 6740), normal-Poisson (6470, 6780), and finally, the Bayesian Poisson predictive model, (6460, 6790).

Oops. That ate all my < punctuation. The simple simulation was

Stan declaration needs