Author Archive: Jim Albert

Was Ty Cobb Ever a True .400 Hitter? (A Tribute to Carl Morris)


I was saddened to learn about Professor Carl Morris’ recent passing. Carl was a great statistician famous for his groundbreaking work in empirical/hierarchical Bayesian modeling. On a personal note, Carl and I worked in similar research areas. When I was working on my dissertation, I read the famous series of papers by Brad Efron and Carl Morris from the 1970s. I was fortunate to meet Carl at one of the first conferences I attended, and he was always very supportive of my Bayesian work.

Besides statistics, Carl Morris and I shared a passion for sports and the application of statistical thinking to sports problems. Carl played tennis — one of his earliest sports papers was on the most important point in a game of tennis. He also was an avid baseball fan and he used baseball examples to illustrate statistical concepts.

One of Carl Morris’ most influential papers was “Parametric Empirical Bayes Inference: Theory and Applications” published in the Journal of the American Statistical Association in 1983. This paper provided an overview of multilevel modeling with the use of empirical Bayes methods in parameter estimation.

In Section 5 of this empirical Bayes paper, Carl uses the season-to-season batting averages of Ty Cobb to illustrate the application of multilevel models. Specifically, he wishes to address the question of whether Cobb was ever a true .400 hitter during his career. As a tribute to Carl’s research, I thought it would be interesting to revisit Carl’s example using simulation-based computations from a similar multilevel model.

The Data and Quadratic Fit

Here is a scatterplot of Ty Cobb’s batting averages for all of his seasons from 1905 through 1928. We add a quadratic smoothing curve that seems to be a reasonable fit to these data.

We see from the graph that Cobb had an average exceeding .400 in the 1911, 1912 and 1922 seasons. But these averages don’t directly display Cobb’s talent. For example, Cobb’s .419 average in the 1911 season is influenced both by his hitting talent and the sampling variation due to other factors such as the fielding, pitching and ballpark characteristics. An interesting question is: “Did Cobb’s batting ability exceed .400 sometime during his career?” We address this question as Morris did by use of a statistical model.

A Multilevel Model

Let y_j denote the number of hits of Cobb in n_j at-bats in the jth season. We assume that y_j is binomial with sample size n_j and probability p_j. We can think of p_j as representing Cobb’s true ability of getting a hit in that particular season.

Since a quadratic model seems to be a reasonable fit to these data, we let p_j have a beta distribution with mean \eta_j and precision K where the means satisfy the logistic model

\log \left(\frac{\eta_j}{1 - \eta_j}\right) = \beta_0 + \beta_1 j + \beta_2 j^2.

At the final stage of the model, we assign the regression parameters \beta_0, \beta_1, \beta_2 and K weakly informative priors.
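To make the two stages of the model concrete, here is a minimal Python sketch that simulates a career from it. It assumes the usual mean/precision parameterization of the beta distribution, p_j ~ Beta(K \eta_j, K (1 - \eta_j)). The parameter values, the 20-season career and the 500 at-bats per season are hypothetical choices for illustration only; they are not estimates for Cobb.

```python
import math
import random

def inv_logit(x):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def mean_trajectory(beta0, beta1, beta2, n_seasons):
    """Return eta_j for j = 1..n_seasons under the quadratic logistic model."""
    return [inv_logit(beta0 + beta1 * j + beta2 * j ** 2)
            for j in range(1, n_seasons + 1)]

def simulate_career(etas, K, at_bats, rng):
    """Draw p_j ~ Beta(K*eta_j, K*(1-eta_j)), then y_j ~ Binomial(n_j, p_j)."""
    career = []
    for eta, n in zip(etas, at_bats):
        p = rng.betavariate(K * eta, K * (1.0 - eta))
        # Binomial draw built from n independent Bernoulli(p) trials.
        y = sum(rng.random() < p for _ in range(n))
        career.append((p, y, n))
    return career

# Hypothetical values, chosen only to give a curve that rises and then falls.
etas = mean_trajectory(beta0=-0.75, beta1=0.04, beta2=-0.002, n_seasons=20)
career = simulate_career(etas, K=800, at_bats=[500] * 20, rng=random.Random(1))
```

Each element of `career` holds the true probability p_j, the simulated hit count y_j and the at-bats n_j, so one can compare the raw averages y_j/n_j with the talents p_j that generated them.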

Fitting the Model

We fit this model by constructing an MCMC algorithm to sample from the joint posterior distribution. This was conveniently done using the JAGS software and the associated runjags R package.

As in Carl’s paper, we focus on Bayesian estimates of the hit probabilities p_j. One can write the estimate of p_j as

\tilde p_j = \frac{y_j + \hat K \hat p_j}{n_j + \hat K},

where \hat p_j is an estimate of p_j based on the quadratic fit and \hat K is an estimate of the precision parameter. Essentially one is moving the raw batting average y_j/n_j towards the quadratic estimate, and the size of the movement is governed by \hat K: the estimate moves a fraction \hat K / (n_j + \hat K) of the way. I plot the observed, quadratic fit and multilevel AVG estimates below. Here the estimate of K is 812; since this estimate exceeds the AB counts for Cobb, the multilevel estimate adjusts the raw AVG over 50% of the way towards the quadratic estimate.
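The estimate above can be rewritten as a weighted average, \tilde p_j = (1 - B_j) y_j/n_j + B_j \hat p_j with B_j = \hat K / (n_j + \hat K), which makes the "over 50%" statement easy to check. The short Python sketch below does this for a single season. Only K = 812 comes from the fit described here; the hit count, at-bats and quadratic estimate are hypothetical numbers chosen for illustration.

```python
def shrinkage_fraction(n, K):
    """Fraction of the way the raw average moves towards the quadratic estimate."""
    return K / (n + K)

def multilevel_estimate(y, n, p_hat, K):
    """Multilevel estimate of the hit probability: (y + K * p_hat) / (n + K)."""
    return (y + K * p_hat) / (n + K)

# Hypothetical season: 200 hits in 500 at-bats, quadratic estimate .370.
y, n, p_hat, K = 200, 500, 0.370, 812

raw = y / n                                # observed AVG
B = shrinkage_fraction(n, K)               # exceeds 0.5 because K > n
est = multilevel_estimate(y, n, p_hat, K)  # lies between p_hat and raw

# The same estimate written as a weighted average of raw AVG and quadratic fit.
weighted = (1 - B) * raw + B * p_hat
```

Whenever K is larger than n, the fraction B is above one half, so the multilevel estimate sits closer to the quadratic fit than to the raw average.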

Was Ty Cobb Ever a True .400 Hitter?

We know that Ty Cobb did hit over .400 for several seasons during his career. But we are asking a different question: did his true batting average p_j exceed .400 for any season?

We can easily address this question from our Bayesian fitting. We define the maximum batting probability

p = \max \{p_1, \ldots, p_{23}\}

and examine the posterior density of the maximum probability p. This posterior density is displayed below. The posterior probability that p exceeds .400 is computed to be 0.761, so we conclude that it is likely that Cobb was a true .400 hitter.
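Given simulated draws from the joint posterior of the p_j, this probability is just the fraction of draws in which the largest p_j exceeds .400. Here is a minimal Python sketch of that calculation. The function name and the tiny matrix of draws are made up for illustration; a real MCMC run would supply thousands of draws, one row per iteration.

```python
def prob_max_exceeds(draws, threshold=0.400):
    """Estimate P(max_j p_j > threshold) from posterior draws.

    draws is a list of simulated draws, each a list of season
    probabilities (p_1, ..., p_J) taken from the same iteration.
    """
    maxima = [max(draw) for draw in draws]
    return sum(m > threshold for m in maxima) / len(maxima)

# Toy stand-in for MCMC output: 4 draws for a 3-season career.
draws = [
    [0.370, 0.410, 0.390],   # max .410, exceeds .400
    [0.360, 0.380, 0.390],   # max .390
    [0.400, 0.395, 0.420],   # max .420, exceeds .400
    [0.350, 0.390, 0.399],   # max .399
]
prob = prob_max_exceeds(draws)
```

Taking the maximum within each draw, rather than across draws, is what preserves the posterior dependence among the seasons.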

Relation With Morris’ Work

In Carl Morris’ 1983 JASA paper, a similar multilevel model was fit based on normal distributions. This paper predated the fitting of Bayesian models by Markov chain Monte Carlo, so Morris relied on approximate empirical Bayes methods to do the fitting. Figure 2 from his Parametric Empirical Bayes paper is shown below; it resembles my graph.

Carl Morris gets similar results. For example, he comments that the shrinkage (movement) of the Bayesian estimates is typically 62% of the way towards the quadratic curve, which agrees with our work. By using an independence argument, he computes the posterior probability that Cobb was a .400 hitter (for at least one season) to be 88 percent, which again is similar to our computed probability.
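One natural reading of an independence argument is that, if the p_j are treated as independent a posteriori, the chance that at least one exceeds .400 is one minus the product of the chances that each falls at or below .400. The Python sketch below illustrates that calculation with normal posteriors. It is an assumption about the general form of the computation, not a reproduction of Morris' calculation, and the means and standard deviations are hypothetical.

```python
import math

def normal_cdf(x, mean, sd):
    """P(X <= x) for X ~ Normal(mean, sd)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def prob_any_exceeds(means, sds, threshold=0.400):
    """P(at least one p_j > threshold), treating the p_j as independent normals."""
    prob_none = 1.0
    for mean, sd in zip(means, sds):
        prob_none *= normal_cdf(threshold, mean, sd)
    return 1.0 - prob_none

# Hypothetical posterior means and standard deviations for three seasons.
means = [0.385, 0.395, 0.380]
sds = [0.015, 0.015, 0.015]
prob = prob_any_exceeds(means, sds)
```

Because this treats the seasons as independent, it need not agree exactly with a simulation-based answer that keeps the posterior dependence among the p_j.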

The player_function_lb() Function

I thought it would be worthwhile to write an R function that would perform these calculations for a player of interest. The function player_function_lb() accepts as input the Lahman player ID of the player. The function will fit this multilevel model to the (H, AB) data for the player where the underlying probabilities are believed to follow a quadratic model. The two main outputs are …

  • a plot displaying the observed AVG, the quadratic fit, and the multilevel estimates plotted as a function of the season
  • a plot displaying the posterior density of the maximum hitting probability

In addition, all of the data used in the plots are included as part of the output. I’ve posted the player_function_lb() function on my GitHub Gist site.

To make the function self-contained, the multilevel model fitting is implemented through a two-step approach using functions from the LearnBayes package. The marginal posterior of the hyperparameters (\beta_0, \beta_1, \beta_2, \log K) is estimated by a normal approximation. For the Ty Cobb example, the probability estimates were approximately the same as the ones found using the JAGS/MCMC fitting.

Further Comments

  • In 2014, I had the opportunity to interview Carl Morris for Chance magazine. This interview focuses on Morris’ interests in sports and statistics, and his comments about the current and future use of statistical thinking in sports.
  • I have written several posts about the famous baseball example in a paper by Brad Efron and Carl Morris to illustrate these shrinkage estimates. In this post, I illustrate fitting a random effects model using 1970 season data found from Retrosheet. In this second post, given batting data for the beginning of a season, I describe the use of a Shiny app to predict batting rates for the remainder of the season.
  • A modern criticism of this study is that AVG (batting average) is a poor measure of batting ability. In a future post, I will illustrate how this type of modeling can be used in exploring career trajectories of Hall of Famers using a modern measure of batting performance such as wOBA.