# A Conversation with Herman Rubin

Our department recently hosted a one-day meeting on Statistical Learning, and we were fortunate to have one of my Purdue professors, Herman Rubin, in attendance.  Herman and I have had many conversations over the years, and interestingly, one of our conversations that day was about baseball.  Here’s my recollection of the conversation:

Herman:  Batting averages are not beta distributed.
Me:  Okay, but the beta is a useful model.
Herman:  Is it?  Have you ever seen a .500 batting average?

Anyway, this conversation motivated me to look at my use of a beta curve to model hitting probabilities.

My beta/binomial model says that the true batting probabilities $p_1, \ldots, p_N$ follow a beta curve with shape parameters $a$ and $b$.  The number of hits $y_i$ for the $i$th player is assumed to be binomial with sample size $AB_i$ and hitting probability $p_i$.
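Written out, the model and its shrinkage property look like this. The second display conditions on fixed values of $a$ and $b$, which is a simplifying assumption on my part here; a full fit may treat them as unknown.

$$
p_i \sim \text{Beta}(a, b), \qquad y_i \mid p_i \sim \text{Binomial}(AB_i, p_i), \qquad i = 1, \ldots, N.
$$

$$
p_i \mid y_i \sim \text{Beta}(a + y_i,\; b + AB_i - y_i), \qquad
E(p_i \mid y_i) = \frac{a + y_i}{a + b + AB_i}
= \frac{AB_i}{a + b + AB_i} \cdot \frac{y_i}{AB_i} + \frac{a + b}{a + b + AB_i} \cdot \frac{a}{a + b}.
$$

The posterior mean is a weighted average of the observed batting average $y_i / AB_i$ and the prior mean $a / (a + b)$, so players with fewer at-bats are pulled more strongly towards the overall average.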

Here’s an illustration of using the LearnBayes package to fit this beta/binomial model to Efron and Morris’s baseball data.

Using 2015 Retrosheet play-by-play data,

• I fit this model to (H, AB) data for all players with at least 100 AB in the first half of the season (the cutoff excludes pitchers).  The resulting estimates of the hitting probabilities shrink the observed batting averages towards the overall average.
• I use this model to predict the number of hits of these players in the 2nd half of the season, taking the 2nd-half AB as known.
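The two steps above can be sketched in a few lines of Python. This is only a rough illustration, not the LearnBayes fit used for the results in this post: the shape parameters `a` and `b` are hypothetical plug-in values rather than fitted ones, the (H, AB) pairs are made up, and the function names are my own. Given `a` and `b`, a player's posterior is Beta(a + H, b + AB − H), and a 2nd-half prediction is a binomial draw with a probability sampled from that posterior.

```python
import random


def posterior_mean(hits, at_bats, a, b):
    """Posterior mean of p under a Beta(a, b) prior and binomial data."""
    return (a + hits) / (a + b + at_bats)


def simulate_second_half(hits, at_bats, at_bats_2nd, a, b, rng):
    """Draw one predicted 2nd-half batting average for a single player."""
    # Draw a hitting probability from the posterior Beta(a + H, b + AB - H).
    p = rng.betavariate(a + hits, b + at_bats - hits)
    # Draw 2nd-half hits as Binomial(at_bats_2nd, p) via a sum of Bernoulli trials.
    new_hits = sum(rng.random() < p for _ in range(at_bats_2nd))
    return new_hits / at_bats_2nd


# Hypothetical values, for illustration only (not fitted to the 2015 data).
a, b = 80.0, 220.0
first_half = [(45, 150), (30, 120), (70, 200)]  # made-up (H, AB) pairs
second_half_ab = [160, 110, 210]                # made-up 2nd-half AB

rng = random.Random(2015)
for (h, ab), ab2 in zip(first_half, second_half_ab):
    observed = h / ab
    shrunk = posterior_mean(h, ab, a, b)
    predicted = simulate_second_half(h, ab, ab2, a, b, rng)
    print(f"observed {observed:.3f}  shrunk {shrunk:.3f}  predicted {predicted:.3f}")
```

Each shrunken estimate falls between the player's observed average and the prior mean `a / (a + b)`, which is the "shrink towards the average" behavior described in the first bullet.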

If this model is reasonable, then the predicted batting averages for the 2nd half should resemble the actual 2015 2nd-half batting averages.

Here’s a graph of boxplots.  Twenty of the boxplots show simulated predictions of all 2nd-half batting averages from the model, and we compare them with the boxplot of the actual 2nd-half batting averages.  Looking carefully, we see that the center and spread of the predicted averages are similar to those of the actual batting averages.  But I see one discrepancy: the predicted averages tend to have outliers on the high end, and there are no high outliers in the actual 2015 2nd-half averages.

To look at this more closely, we need a statistic that helps distinguish the model predictions from the actual data.  Since we are concerned about outliers, I will compute the number of predictions that are .350 or higher.

For each of 500 simulated model predictions of the 2nd-half averages, I compute the number of .350+ predictions.  I construct a histogram of these numbers and overlay the actual number of 2nd-half .350+ averages, which is equal to 1.  It is pretty obvious that the beta model is questionable, in that it predicts more .350+ hitters than we actually see in the data.
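Here is a minimal sketch of this kind of predictive check in Python. Everything in it is an assumption for illustration: the players are synthetic, `a` and `b` are hypothetical plug-in values, and the helper name `count_high_averages` is my own. It only shows the mechanics of simulating 500 sets of 2nd-half averages and counting the .350+ values in each, so its output says nothing about the 2015 data.

```python
import random


def count_high_averages(posteriors, at_bats_2nd, cutoff, rng):
    """Simulate one set of 2nd-half averages; count those at or above cutoff."""
    count = 0
    for (alpha, beta), n in zip(posteriors, at_bats_2nd):
        p = rng.betavariate(alpha, beta)                # draw p from its posterior
        hits = sum(rng.random() < p for _ in range(n))  # Binomial(n, p) draw
        if hits / n >= cutoff:
            count += 1
    return count


rng = random.Random(350)
a, b = 80.0, 220.0  # hypothetical shape parameters, not fitted values

# Synthetic first-half (H, AB) data for 50 hypothetical players.
players = []
for _ in range(50):
    ab = rng.randint(100, 300)
    p_true = rng.betavariate(a, b)
    h = sum(rng.random() < p_true for _ in range(ab))
    players.append((h, ab))

# Posterior for each player given a and b: Beta(a + H, b + AB - H).
posteriors = [(a + h, b + ab - h) for h, ab in players]
at_bats_2nd = [rng.randint(100, 300) for _ in players]  # synthetic 2nd-half AB

# The check statistic for each of 500 simulated sets of predictions.
counts = [count_high_averages(posteriors, at_bats_2nd, 0.350, rng)
          for _ in range(500)]

# Placeholder observed count. With real data this would be the actual number
# of 2nd-half .350+ averages; here the players are synthetic.
observed_count = 1
tail_fraction = sum(c <= observed_count for c in counts) / len(counts)
print(f"mean simulated count: {sum(counts) / len(counts):.2f}")
print(f"fraction of simulations with count <= {observed_count}: {tail_fraction:.3f}")
```

With real data, a small value of `tail_fraction` would indicate that the model predicts more .350+ hitters than were observed, which is the pattern the histogram shows.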

So Herman is right in the sense that the beta/binomial model does not seem that great at predicting batting averages in the right tail.  An interesting question is how to modify this model to get better predictions.