In last week’s post, I illustrated using a function shrink
to compute and graph improved estimates of baseball rates. Here’s a quick summary of the R work that I did to create this graph.
- From the
Batting
data frame in theLahman
package, I collected the season batting statistics for all players in a particular season with at least a given number of at-bats. - I used the
laplace
function in theLearnBayes
package to fit the random effects model described in the earlier post. From the model estimates, I compute the improved estimates. - Using traditional graphics, I use the
plot
function to plot the basic estimates, thepoints
function to overlay the improved estimate points in red, and use thelegend
function to add a legend.
Here’s another example of using of using shrink
to estimate the true batting averages from all players in the famous 1941 season with at least 100 AB. Although Ted Williams hit .406 that particular season, we estimate his true batting average to be closer to .350.
data.1941 <- shrink(1941, "H", "AB")
One problem with this graph is that it does not clearly display the shrinkage or movement of the observed batting averages towards the overall average. This motivates the use of a ggplot2 display that is easy to construct using the data frame data.1941
returned from the previous function.
Here’s the new plotting function ggplot.shrink
that requires the ggplot2
package. It illustrates the use of the geometric object function geom_segment
that will connect a player’s AVG with his improved estimate using a separate line segment.
ggplot.shrink <- function(stuff){ require(ggplot2) require(grid) names(stuff$data)[2:3] <- c("y", "n") ggplot(stuff$data, aes(x=n, xend=n, y=observed, yend=estimate)) + geom_segment(arrow = arrow(length = unit(0.3, "cm"))) + geom_hline(yintercept=stuff$p.ALL, color="red") + labs(title = paste("Shrinkage of Multilevel Estimates", ", K =", round(stuff$K))) + theme(plot.title = element_text(size = rel(2))) + xlab("OPPORTUNITIES") + ylab("ESTIMATE") }
Here is an example of this new function on the 1941 hit data.
ggplot.shrink(data.1941)
I think this plot is better for illustrating the dramatic movement of the observed batting averages towards the overall average. In particular, note that the movement is largest for the players with a small number of at-bats. One has less information about a player’s true average based on only 100 AB, and so you want to adjust this average more towards the overall average.
By the way, the size of the shrinkage depends on what rate you are estimating. Suppose instead that we’re interested in estimating the true home run rates for all hitters in the 1941 season. A home run rate reflects a batter’s ability more than a batting average, so we see smaller shrinkage of the raw rates towards the average.
data.1941.HR <- shrink(1941, "HR", "AB") ggplot.shrink(data.1941.HR)
This is really interesting. Would you mind extending this idea and demonstrate how you determine K=1274 in Chapter 9 of the book?
In a future post, I’ll describe how to find optimal shrinkage of multinomial rates such as described in Chapter 9 of the book.