Monthly Archives: August, 2015

Redrawing Stephen Jay Gould’s Graph

The late Harvard paleontologist and baseball fan Stephen Jay Gould wrote a famous study on the disappearance of the .400 batting average in baseball in his book Triumph and Tragedy in Mudville: A Lifelong Passion for Baseball. Essentially his argument was that (1) the variation in batting averages among “regular” players has shown a steady decrease over time, (2) with this smaller variation, great hitters sit closer to the average, and so (3) it is harder to reach a .400 average.
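Gould's argument can be made concrete with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that the batting averages of regulars in a season are roughly normally distributed around .260, and it compares two hypothetical standard deviations. None of these numbers come from Gould or from the Lahman data.

```r
# Illustrative assumption: batting averages of regulars ~ Normal(mean, sd).
# The mean and both SD values are hypothetical, chosen only to show the effect.
league.mean <- 0.260
sd.early <- 0.040   # hypothetical "large spread" season
sd.late  <- 0.030   # hypothetical "small spread" season

# Probability that a single regular hits .400 or better under each spread
p.early <- pnorm(0.400, mean = league.mean, sd = sd.early, lower.tail = FALSE)
p.late  <- pnorm(0.400, mean = league.mean, sd = sd.late,  lower.tail = FALSE)

# Compare the two tail probabilities
c(p.early = p.early, p.late = p.late, ratio = p.early / p.late)
```

Under these assumed values a .400 average is 3.5 standard deviations above the mean in the first case and about 4.67 in the second, so the tail probability shrinks sharply even though the mean has not moved.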

Gould used a graph to show the decrease in the standard deviation of batting averages over time. Michael Friendly has discussed the problems in Gould’s graph and shown an improved graph. Since I’m currently teaching principles of statistical graphics, I thought it would be helpful to show how one can use the Lahman package together with the dplyr and ggplot2 packages to replicate Gould’s display and an improved one.

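We start by collecting the number of hits (H), at-bats (AB), walks (BB), hit-by-pitches (HBP), and sacrifice flies (SF) for each player and season from the Lahman Batting data frame. Summing within each player and season combines the rows of a player who appeared for more than one team in a year.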

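library(Lahman)    # Batting and Teams data frames
library(dplyr)     # group_by, summarize, filter
library(ggplot2)   # graphics

# One row per player and season; HBP and SF are missing in early seasons
S <- summarize(group_by(Batting, yearID, playerID),
               H=sum(H), AB=sum(AB),
               BB=sum(BB), HBP=sum(HBP, na.rm=TRUE),
               SF=sum(SF, na.rm=TRUE))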
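These counts feed the two rate statistics studied in this post: the batting average, H / AB, and the on-base percentage, (H + BB + HBP) / (AB + BB + HBP + SF). As a quick check of the formulas, here is a small worked example with a made-up season line (the counts are hypothetical, not taken from Lahman).

```r
# Hypothetical season line for one player (made-up counts, not from Lahman)
H <- 180; AB <- 600; BB <- 60; HBP <- 5; SF <- 5

AVG <- H / AB                                   # batting average
OBP <- (H + BB + HBP) / (AB + BB + HBP + SF)    # on-base percentage

round(c(AVG = AVG, OBP = OBP), 3)
```

For this made-up line the batting average is .300 and the on-base percentage is 245 / 670, or about .366.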

We say that a player is “regular” if his number of at-bats is at least 3 times the average number of games played by a team that season. We find this min.AB for all seasons from the Teams dataset, merge this data frame with the player/season totals, and use the filter function (from the dplyr package) to restrict attention to player/seasons that reach the minimum number of AB.

ST <- summarize(group_by(Teams, yearID),
               Games = round(mean(W + L)),
               min.AB = 3 * Games)
S2 <- merge(S, ST, by="yearID")
S.regular <- filter(S2, AB >= min.AB) 
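To see what the threshold does, here is a minimal sketch of the same merge-and-filter step on a toy data set. All names ending in .toy and all values are hypothetical. It uses base R subsetting instead of dplyr's filter so that it runs without any extra packages.

```r
# Hypothetical player-season totals (made-up values)
S.toy <- data.frame(yearID   = c(2014, 2014, 2014, 1900, 1900),
                    playerID = c("a", "b", "c", "d", "e"),
                    AB       = c(600, 486, 300, 450, 400))

# Hypothetical games-per-team figures for the two seasons
ST.toy <- data.frame(yearID = c(1900, 2014), Games = c(140, 162))
ST.toy$min.AB <- 3 * ST.toy$Games      # thresholds of 420 and 486

# Attach each season's threshold, then keep the rows that meet it
S2.toy <- merge(S.toy, ST.toy, by = "yearID")
S.regular.toy <- S2.toy[S2.toy$AB >= S2.toy$min.AB, ]
S.regular.toy[, c("yearID", "playerID", "AB", "min.AB")]
```

Players a, b, and d survive the filter. Player b sits exactly on the 486 at-bat threshold and is kept because the comparison is >=, while player e falls short of the lower 1900 threshold of 420.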

We use the summarize function again to find the standard deviation of the batting averages and the standard deviation of the on-base percentages for each season.

# Season-level standard deviations, stored in a data frame (called SD.data here)
SD.data <- summarize(group_by(S.regular, yearID),
                     SD.AVG = sd(H / AB),
                     SD.OBP = sd((H + BB + HBP) / (AB + BB + HBP + SF)))

We replicate Gould’s graph from his book using ggplot2 graphics. With the standard deviation on the horizontal axis and the season on the vertical axis, this graph is hard to read, and it is ineffective in communicating Gould’s main point that the standard deviations are decreasing over time.

ggplot(SD.data, aes(SD.AVG, yearID)) + geom_point() +
  xlim(.005, .075) +
  ggtitle("Standard Deviations of AVG of Regulars (Gould's Display)")

We draw this in a more standard way where the horizontal variable is season and the vertical is standard deviation. To help see the pattern, I add a smoothing curve. The message from this graph is that the standard deviation of the AVGs has stayed roughly constant in recent seasons.

ggplot(SD.data, aes(yearID, SD.AVG)) + geom_point() +
  geom_smooth(method="loess", se=FALSE, span=0.35) +
  ggtitle("Standard Deviations of AVG of Regulars")

What if we considered a different batting measure? Specifically, how has the spread in on-base percentages changed over time? In contrast to the AVG, the standard deviations of the OBPs appear low in the 1980s, higher in seasons around 2000, and low again in recent seasons.

ggplot(SD.data, aes(yearID, SD.OBP)) +
  geom_point() + geom_smooth(method="loess", se=FALSE, span=0.35) +
  ggtitle("Standard Deviations of OBP of Regulars")


This interesting pattern of OBP standard deviations probably deserves more study. For example, how has the spread of the OBPs of the first batter in the lineup changed over time? Are there other variables (such as strikeout rates) that might affect the spread of the OBPs? Generally, it would be interesting to better understand the reasons for the changes in variation over time.