In an earlier post, I made the comment that extreme batting measures from one season tend to regress or move towards the average. Actually, it is straightforward to use a statistical model to obtain good estimates of player abilities using statistics for one season — these estimates are good predictions of how the players will perform the following season. I’ll first briefly describe the statistical model, and then I’ll use a function from my LearnBayes
package to fit the model and obtain these improved estimates.
The following picture will be used to describe the following random effects model. I’ll talk in terms of batting averages, although the model can be used to estimate other types of rate measures.
- Suppose the talents of a group of players
come from a bell-shaped distribution with unknown mean
and spread
.
- We don’t observe the talents, but the observed batting averages
have distributions (analogous to flipping biased coins) with respective means
.
When we fit this model, we get the following estimate for the ith player’s true batting average:
In this estimate, P.ALL is the overall AVG for the group of players, and K is a so-called smoothing constant which controls the amount of shrinkage of an individual batting average towards the overall average.
If you install the R packages Lahman
, plyr
, and LearnBayes
, then you can use the function shrink
(displayed at the end of the post) to compute these estimates and produce a nice graph for any “rate statistic” for players in a particular season.
This function has four inputs — year
(what season of data do we want to look at), n.var
(this is the numerator of the rate), d.var
(this is the denominator of the rate), and Lower.Bound
(we’ll only use data for players who have at least Lower.Bound opportunities). In this example, we’re looking at 2012 data, we are looking at batting average (the numerator is “H” and the denominator is “AB”), and we’ll focus on players with at least 200 AB. The following graph is produced — we plot the observed AVG against AB as black circles and the improved estimates are shown as red solid dots.
year <- 2012 n.var <- "H"; d.var <- "AB" Lower.Bound <- 200 s.out <- shrink(year, n.var, d.var, Lower.Bound)
The title of the figure shows that the fitted model has p.ALL = 0.263 and K = 344. So the improved estimates are given by
Here the improved estimates “shrink” or adjust the observed batting averages substantially towards the overall value. For example, in 2012 Melky Cabrera had a .346 batting average in 459 AB — his average is represented by the black dot at the top of the graph. Using this formula, we would predict his batting average for the 2013 season to be
In other words, we believe a better estimate of Melky’s true average is 0.310. We expect Melky to regress substantially in the 2013 season.
The function shrink
returns a list s.out
with all relevant output, the observed rates and improved estimates for all players and the values of the fitted values p.ALL and K. We can display some of this output by use of the str
function.
str(s.out) List of 3 $ data :'data.frame': 319 obs. of 5 variables: ..$ playerID: chr [1:319] "abreubo01" "ackledu01" "alonsyo01" "altuvjo01" ... ..$ H : int [1:319] 53 137 150 167 128 66 81 180 150 81 ... ..$ AB : int [1:319] 219 607 549 576 525 275 384 629 520 347 ... ..$ observed: num [1:319] 0.242 0.226 0.273 0.29 0.244 ... ..$ estimate: num [1:319] 0.255 0.239 0.269 0.28 0.251 ... $ p.ALL: num 0.263 $ K : num 344
In a future post, I’ll explain how the R function shrink
works and illustrate the use of ggplot2 graphics to better illustrate the shrinkage estimates.
shrink <- function(year, n.var, d.var, Lower.Bound=100){ # load required packages (each should be installed first) require(Lahman) require(LearnBayes) require(plyr) # data handling part Batting.year <- subset(Batting, yearID==year) data <- Batting.year[, c("playerID", "stint", n.var, d.var)] data <- data[complete.cases(data), ] # collapse over stint variable sum.function <- function(d){ apply(d[, c(n.var, d.var)], 2, sum)} data <- ddply(data, .(playerID), sum.function) # collect (n.var, d.var) data for all players with Lower.Bound opportunities data.LB <- data[data[, d.var] >= Lower.Bound, ] # fit the multilevel model fit <- laplace(betabinexch, c(0, 2), data.LB[, -1]) p.all <- exp(fit$mode[1]) / (1 + exp(fit$mode[1])) K <- exp(fit$mode[2]) # compute the multilevel model estimate data.LB$observed <- data.LB[, 2] / data.LB[, 3] data.LB$estimate <- (data.LB[, 2] + K * p.all) / (data.LB[, 3] + K) # graph the original and improved estimates against opportunities plot(data.LB[, 3], data.LB$observed, xlab="OPPORTUNITY", ylab="ESTIMATE", main=paste("(", n.var, ",", d.var,") Data in", year, "Season:", "pALL =", round(p.all, 3),",K =", round(K))) points(data.LB[, 3], data.LB$estimate, pch=19, col="red") abline(h=p.all, lwd=2, col="blue") legend("topleft", legend=c("ACTUAL", "IMPROVED"), pch=c(1, 19), col=c("black", "red")) list(data=data.LB, p.ALL=p.all, K=K) }