Shrinking Batting Averages: Part II

In last week’s post, I illustrated using a function shrink to compute and graph improved estimates of baseball rates. Here’s a quick summary of the R work that I did to create this graph.

  • From the Batting data frame in the Lahman package, I collected the season batting statistics for all players in a particular season with at least a given number of at-bats.
  • I used the laplace function in the LearnBayes package to fit the random effects model described in the earlier post. From the model estimates, I compute the improved estimates.
  • Using traditional graphics, I use the plot function to plot the basic estimates, the points function to overlay the improved estimate points in red, and use the legend function to add a legend.

Here’s another example of using of using shrink to estimate the true batting averages from all players in the famous 1941 season with at least 100 AB. Although Ted Williams hit .406 that particular season, we estimate his true batting average to be closer to .350.

data.1941 <- shrink(1941, "H", "AB")

1941plot1

One problem with this graph is that it does not clearly display the shrinkage or movement of the observed batting averages towards the overall average. This motivates the use of a ggplot2 display that is easy to construct using the data frame data.1941 returned from the previous function.

Here’s the new plotting function ggplot.shrink that requires the ggplot2 package. It illustrates the use of the geometric object function geom_segment that will connect a player’s AVG with his improved estimate using a separate line segment.

ggplot.shrink <- function(stuff){
  require(ggplot2)
  require(grid)
  names(stuff$data)[2:3] <- c("y", "n")
  ggplot(stuff$data, aes(x=n, xend=n, 
                y=observed, yend=estimate)) + 
    geom_segment(arrow = arrow(length = unit(0.3, "cm"))) +
    geom_hline(yintercept=stuff$p.ALL, color="red") +
    labs(title = paste("Shrinkage of Multilevel Estimates", 
                       ", K =", round(stuff$K))) +
    theme(plot.title = element_text(size = rel(2))) +
    xlab("OPPORTUNITIES") + ylab("ESTIMATE") 
}

Here is an example of this new function on the 1941 hit data.

ggplot.shrink(data.1941)

1941plot2

I think this plot is better for illustrating the dramatic movement of the observed batting averages towards the overall average. In particular, note that the movement is largest for the players with a small number of at-bats. One has less information about a player’s true average based on only 100 AB, and so you want to adjust this average more towards the overall average.

By the way, the size of the shrinkage depends on what rate you are estimating. Suppose instead that we’re interested in estimating the true home run rates for all hitters in the 1941 season. A home run rate reflects a batter’s ability more than a batting average, so we see smaller shrinkage of the raw rates towards the average.

data.1941.HR <- shrink(1941, "HR", "AB")
ggplot.shrink(data.1941.HR)

1941plot31941plot4

Advertisements

2 responses

  1. This is really interesting. Would you mind extending this idea and demonstrate how you determine K=1274 in Chapter 9 of the book?

  2. In a future post, I’ll describe how to find optimal shrinkage of multinomial rates such as described in Chapter 9 of the book.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: