Shrinking Batting Averages

In an earlier post, I made the comment that extreme batting measures from one season tend to regress or move towards the average. Actually, it is straightforward to use a statistical model to obtain good estimates of player abilities using statistics for one season — these estimates are good predictions of how the players will perform the following season. I’ll first briefly describe the statistical model, and then I’ll use a function from my LearnBayes package to fit the model and obtain these improved estimates.

The following picture will be used to describe the following random effects model. I’ll talk in terms of batting averages, although the model can be used to estimate other types of rate measures.


  • Suppose the talents of a group of players \theta_1, ..., \theta_k come from a bell-shaped distribution with unknown mean \mu and spread \tau.
  • We don’t observe the talents, but the observed batting averages AVG_1, ..., AVG_k have distributions (analogous to flipping biased coins) with respective means \theta_1, ..., \theta_k.

When we fit this model, we get the following estimate for the ith player’s true batting average:

ESTIMATE = \frac{AB_i}{K + AB_i} AVG_i + \frac{K}{K + AB_i} P.ALL

In this estimate, P.ALL is the overall AVG for the group of players, and K is a so-called smoothing constant which controls the amount of shrinkage of an individual batting average towards the overall average.

If you install the R packages Lahman, plyr, and LearnBayes, then you can use the function shrink (displayed at the end of the post) to compute these estimates and produce a nice graph for any “rate statistic” for players in a particular season.

This function has four inputs — year (what season of data do we want to look at), n.var (this is the numerator of the rate), d.var (this is the denominator of the rate), and Lower.Bound (we’ll only use data for players who have at least Lower.Bound opportunities). In this example, we’re looking at 2012 data, we are looking at batting average (the numerator is “H” and the denominator is “AB”), and we’ll focus on players with at least 200 AB. The following graph is produced — we plot the observed AVG against AB as black circles and the improved estimates are shown as red solid dots.

year <- 2012
n.var <- "H"; d.var <- "AB"
Lower.Bound <- 200
s.out <- shrink(year, n.var, d.var, Lower.Bound)


The title of the figure shows that the fitted model has p.ALL = 0.263 and K = 344. So the improved estimates are given by

ESTIMATE = \frac{AB_i}{344 + AB_i} AVG_i + \frac{344}{344 + AB_i} 0.263

Here the improved estimates “shrink” or adjust the observed batting averages substantially towards the overall value. For example, in 2012 Melky Cabrera had a .346 batting average in 459 AB — his average is represented by the black dot at the top of the graph. Using this formula, we would predict his batting average for the 2013 season to be

ESTIMATE = \frac{459}{344 + 459} 0.346 + \frac{344}{344 + 459} 0.263 = 0.310

In other words, we believe a better estimate of Melky’s true average is 0.310. We expect Melky to regress substantially in the 2013 season.

The function shrink returns a list s.out with all relevant output, the observed rates and improved estimates for all players and the values of the fitted values p.ALL and K. We can display some of this output by use of the str function.

List of 3
 $ data :'data.frame':	319 obs. of  5 variables:
  ..$ playerID: chr [1:319] "abreubo01" "ackledu01" "alonsyo01" "altuvjo01" ...
  ..$ H       : int [1:319] 53 137 150 167 128 66 81 180 150 81 ...
  ..$ AB      : int [1:319] 219 607 549 576 525 275 384 629 520 347 ...
  ..$ observed: num [1:319] 0.242 0.226 0.273 0.29 0.244 ...
  ..$ estimate: num [1:319] 0.255 0.239 0.269 0.28 0.251 ...
 $ p.ALL: num 0.263
 $ K    : num 344

In a future post, I’ll explain how the R function shrink works and illustrate the use of ggplot2 graphics to better illustrate the shrinkage estimates.

shrink <- function(year, n.var, d.var, Lower.Bound=100){
  # load required packages (each should be installed first)
  # data handling part
  Batting.year <- subset(Batting, yearID==year)
  data <- Batting.year[, c("playerID", "stint", n.var, d.var)]
  data <- data[complete.cases(data), ]
  # collapse over stint variable
  sum.function <- function(d){
    apply(d[, c(n.var, d.var)], 2, sum)}
  data <- ddply(data, .(playerID), sum.function) 
  # collect (n.var, d.var) data for all players with Lower.Bound opportunities
  data.LB <- data[data[, d.var] >= Lower.Bound, ]
  # fit the multilevel model
  fit <- laplace(betabinexch, c(0, 2), data.LB[, -1])
  p.all <- exp(fit$mode[1]) / (1 + exp(fit$mode[1]))
  K <- exp(fit$mode[2])
  # compute the multilevel model estimate
  data.LB$observed <- data.LB[, 2] / data.LB[, 3]
  data.LB$estimate <- (data.LB[, 2] + K * p.all) / (data.LB[, 3] + K)
  # graph the original and improved estimates against opportunities
  plot(data.LB[, 3], data.LB$observed,
     xlab="OPPORTUNITY", ylab="ESTIMATE",
     main=paste("(", n.var, ",", d.var,") Data in", year, "Season:",
                "pALL =", round(p.all, 3),",K =", round(K)))
  points(data.LB[, 3], data.LB$estimate, pch=19, col="red")
  abline(h=p.all, lwd=2, col="blue")
  legend("topleft", legend=c("ACTUAL", "IMPROVED"), 
         pch=c(1, 19), col=c("black", "red"))
  list(data=data.LB, p.ALL=p.all, K=K)

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: