What Do You Learn from 45 Home Runs?

Introduction

Here is a simple question. Suppose you are told that an MLB hitter slugged exactly 45 home runs in a season. What have you learned about his number of balls in play and his rate of hitting home runs? This is a variation on the standard setup in which one observes binomial trials with a known sample size N and wants to learn about the hitting probability p; here both the sample size and the probability are unknown. This situation is actually common when we want to predict a player’s home run count in the following season: we don’t know how many opportunities he will have next season, and we don’t know the true rate at which he will hit home runs.

I’ll illustrate a Bayesian approach to this problem. Our model is that y, the number of home runs, is binomial(N, p), where both N and p are unknown. I’ll construct a reasonable prior for (N, p), update this prior with the observed count of y = 45 home runs, and we’ll see what we have learned from these data.

Constructing a Reasonable Prior

I decide to focus on non-pitchers, looking at all of the players in the 2017 season with at least 100 balls in play (BIP). I don’t see much of a relationship between balls in play and home run rates, so I assume that my beliefs about N and p are independent. This lets me represent my prior as g(N, p) = g(N) g(p) and construct a prior for each unknown separately.

Below I show a histogram of the balls in play for all players with at least 100 BIP in the 2017 season. The smooth curve g(N) = 4.5 + 0.1898 N + 0.00033 N^2 seems to be a reasonable fit to these BIP counts, so I’ll use it (suitably normalized) as my prior.
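As a quick sketch of how this prior can be set up on a grid in R (the grid range 100 to 700 is my own assumption, not taken from the original analysis):

```r
# Sketch: discrete prior on N evaluated over a grid (grid range assumed)
N_grid <- seq(100, 700, by = 1)
g_N <- 4.5 + 0.1898 * N_grid + 0.00033 * N_grid^2
g_N <- g_N / sum(g_N)   # normalize so the prior sums to one
```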

The prior on the batter’s probability p is more difficult to construct, since we don’t observe p directly; we observe only the HR rates HR / BIP for hitters with at least 100 BIP. We use the random effects model logit(p_i) = \theta_i, where the \theta_i are assumed to be normal with mean \mu and standard deviation \sigma. Fitting this model with JAGS gives the posterior estimates \hat\mu = -3.07 and \hat\sigma = 0.542. Assuming that our hitter is a representative hitter among those with at least 100 BIP, I take logit(p) to be N(-3.07, 0.542).
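For readers curious what such a random effects model looks like in JAGS, here is a minimal sketch; the hyperpriors on mu and sigma are my own assumptions and may differ from the actual fit:

```r
# Sketch of the random effects model in JAGS (hyperpriors are assumed)
model_string <- "
model {
  for (i in 1:n) {
    hr[i] ~ dbin(p[i], bip[i])   # HR count out of balls in play
    logit(p[i]) <- theta[i]
    theta[i] ~ dnorm(mu, tau)    # player effects on the logit scale
  }
  mu ~ dnorm(0, 0.0001)          # vague hyperpriors (assumed)
  sigma ~ dunif(0, 2)
  tau <- 1 / (sigma * sigma)     # JAGS parameterizes by precision, not sd
}"
```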

Below I display a contour graph of my prior on N and logit(p). This indicates that I’m pretty ignorant about the BIP, but I have some idea about the location of the home run probability.
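Under the independence assumption, the joint prior on the grid is just the outer product of the two marginal priors. Continuing the sketch above (the grid for logit(p) is again an assumption):

```r
# Sketch: joint prior on a (N, logit p) grid, using independence
logitp_grid <- seq(-5, -1, length.out = 200)   # grid range assumed
g_logitp <- dnorm(logitp_grid, mean = -3.07, sd = 0.542)
prior <- outer(g_N, g_logitp)                  # g(N, p) = g(N) g(p)
prior <- prior / sum(prior)
```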

Predictive Checks on My Prior

Before settling on my prior, I should check whether it predicts home run counts similar to what is actually observed. I used the actual BIP counts and my normal prior on the logits to simulate a set of home run counts, and by summing these counts I get an estimate of the total HR count for all hitters with at least 100 BIP. I repeated this simulation many times, and the actual 2017 home run total was consistent with the values simulated from this predictive distribution. So I am satisfied that I have a reasonable prior, and I can now update it with data.
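Here is a sketch of one way to run this predictive check, assuming a vector bip holding the observed 2017 BIP counts:

```r
# Sketch: simulate total HR counts from the prior predictive distribution
# 'bip' is assumed to hold the observed 2017 BIP counts
sim_total_hr <- function(bip) {
  theta <- rnorm(length(bip), mean = -3.07, sd = 0.542)
  p <- exp(theta) / (1 + exp(theta))   # inverse logit
  sum(rbinom(length(bip), size = bip, prob = p))
}
totals <- replicate(1000, sim_total_hr(bip))
# compare the distribution of 'totals' with the actual 2017 HR total
```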

Updating My Beliefs with Data

It is a relatively simple process to obtain the posterior of (N, p). Remember I observed exactly 45 home runs, so the likelihood is given by L(N, p) = {N \choose 45} p^{45} (1 - p)^{N - 45}. For each point in my grid, I multiply the value of the prior by the value of the likelihood, obtaining the posterior contour plot below. Note that p and N are negatively correlated: 45 home runs reflects a large p for a part-time player but a smaller p for a full-time player. I’m pretty confident that our hitter had at least 350 BIP in this season.
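Continuing the grid sketch, the posterior is the normalized product of the prior and the likelihood:

```r
# Sketch: posterior = prior * likelihood over the (N, logit p) grid
y <- 45
p_grid <- exp(logitp_grid) / (1 + exp(logitp_grid))
like <- outer(N_grid, p_grid,
              function(N, p) dbinom(y, size = N, prob = p))
post <- prior * like
post <- post / sum(post)
```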

We are most interested in the player’s HR probability p. To obtain it, I first simulate draws of (N, logit p) from the grid of posterior values; here I place the simulated draws on top of the contour graph.
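One simple way to simulate such draws from the discrete grid posterior, continuing the sketch:

```r
# Sketch: sample (N, logit p) pairs from the discrete posterior grid
n_draws <- 5000
cells <- sample(length(post), size = n_draws, replace = TRUE,
                prob = as.vector(post))
idx <- arrayInd(cells, dim(post))        # map cell index -> (row, col)
N_draws      <- N_grid[idx[, 1]]
logitp_draws <- logitp_grid[idx[, 2]]
```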

I obtain the marginal posterior of p by constructing a density estimate of the simulated draws of p. Based on observing 45 home runs, I am pretty sure that his home run rate is between 0.07 and 0.13.
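With the simulated draws from the sketch above, the density estimate and a posterior interval for p take only a few lines:

```r
# Sketch: marginal posterior of p via a kernel density estimate
p_draws <- exp(logitp_draws) / (1 + exp(logitp_draws))
plot(density(p_draws), main = "Marginal posterior of p")
quantile(p_draws, c(0.05, 0.95))   # a 90% posterior interval for p
```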

Wrap-Up

Here are some interesting features of this analysis.

  • The sample size is unknown. We probably spend too much time talking about the binomial model where we know the sample size in advance. In reality, we often don’t know N, so we should incorporate that uncertainty into the Bayesian analysis.
  • Using an informative prior. Assuming that the sample size is unknown complicates matters since, for example, there is no obvious choice of prior. But that is good, since we need more practice thinking about informative priors. If we can’t think of relevant priors in the baseball context, then good luck constructing priors in other applications.
  • Applying predictive checks. To see if a prior is reasonable, it is a good exercise to check whether data predicted from the prior make sense. It is often easier to think about predictions of future data than about abstract parameters.
  • Computation is easy. This grid/simulation approach for summarizing posterior distributions is easy to apply and is a good starting point for learning about more sophisticated Bayesian simulation methods.
