One of the famous small datasets in statistics is one contained in a paper in the Journal of the American Statistical Association by Brad Efron and Carl Morris where they use baseball data to illustrate the benefits of shrinkage estimation. In April of 1970, Brad and Carl looked at the Sunday paper where tables of baseball statistics were published and chose 18 players who had exactly 45 at-bats. The goal was to predict the batting average of these 18 players in the remainder of the 1970 season. I got introduced to this study by my advisor at Purdue and Efron and Morris’s research had a significant impact on my early research in statistics. Anyway, I thought it would be interesting to revisit this example by downloading Retrosheet data. This is a nice toy example to illustrate Bayesian random effects modeling and Bayesian prediction. Recently I interviewed Carl Morris in a Chance article where he gives some insight on the collection of this baseball dataset.
By the way, this dataset still is used in a number of Bayesian books to demonstrate exchangeable modeling. Specifically, Parent and Rivot’s book has an appendix on “The baseball players historical example” that discusses hierarchical analyses for this particular 1970 dataset.
Getting the Data
In a previous post, I presented a function for downloading Retrosheet play-by-play for a particular season. I read in this
parse.retrosheet2.pbp from my gist site and use it to download the data for the 1970 season.
To find the players from Efron and Morris’s study, I collect the number of AB and H for all players through April 24 (I had found that Roberto Clemente had exactly 45 AB through April 24) and then collected all hitters who had between 44 and 46 AB. This gave me 17 hitters — they substantially overlap with the list of players used in Efron and Morris’s paper. (I was happy to see that Johnny Callison, a popular Phillies hitter, was included in my list.)
Fitting a Random Effects Model
Here is a relatively simple random effects model to learn about the batting abilities for these 17 players. The number of hits y_i for the ith batter is binomial with hit probability p_i (i goes from 1 to 18) and then the collection of hit probabilities p_1, …, p_17 come from a beta “talent” curve with shape parameters K eta and K ( 1 – eta). (I place weakly informative priors on the unknown parameters K and eta.) Here’s a picture of this model (the notation is a bit off, but it demonstrates the basic idea.)
Anyway, I have a function
fit.model that will estimate K and eta from this data. The estimates are K = 68 and eta = .290. We use these to estimate the hitting probability for any player. For example, Max Alvis had 7 hits in 44 AB for a batting average of 0.159. To estimate Alvis’ hitting probability we essentially combine Alvis’ hitting data with an “average” player with 68 AB and batting average of .290.
Here’s a graph showing the observed batting averages and the estimates of the hitting probabilities that shrink or move these batting averages towards the overall average.
The objective of this modeling is to predict the batting averages of these 17 players for the remainder of the season. I use the Retrosheet data to collect the number of AB for all players in the remainder of the season. There are two types of uncertainty in this prediction exercise. There is uncertainty in the true hitting probabilities for the players, and once we have values of the hitting probabilities, there is additional (binomial) uncertainty in the future number of hits for these players. For a specific player, we simulate values of his hitting probability (based on the posterior distribution of p_i) and then using these simulated values of the probability, we simulate the number of hits by binomial distributions. When we are done, we have 1000 values of the number of hits from the predictive distribution. I use density curves to display these predictive distribution of the players’ batting averages in the remainder of the 1970 season. These intervals might seem pretty wide, but then again we don’t have much data (only 45 AB) to learn about the batting abilities and there is a large chunk of the season remaining to be played.
This general approach can be used to predict the batting averages of a group of hitters at any point during the season. One nice feature of the Bayesian approach is that the posteriors for the hitting probabilities can be conveniently updated during the season with new data.
If you want to play with this method for other players and other seasons, all of the R code can be found on my github gist site.