Graphing HR Totals

Recently, the New York Times had an interesting article about baseball records that presented some nice graphs showing how different measures of performance (HR, RBI, AVG) of the top players have changed over time. I thought it would be helpful to demonstrate how easy it is to create these types of graphs using the ggplot2 package in R. We’ll focus on the home run counts.

First, we load the dplyr and ggplot2 packages and read in the relevant data, the Batting table downloaded from the Lahman database that I have stored on my computer.

library(dplyr); library(ggplot2)
Batting <- read.csv("~/OneDriveBusiness/lahman-csv_2015-01-24/Batting.csv")

I use the summarize function to collapse the HR and AB counts for each player and season over the stint variable (a player who changes teams during a season has one row per stint), and I only consider seasons from 1901 on (when there were two leagues).

HR.Data <- summarize(group_by(Batting, yearID, playerID),
                     AB=sum(AB), HR=sum(HR))
HR.Data <- filter(HR.Data, yearID >= 1901)
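To see what collapsing over stint does, here is a minimal sketch on made-up data. The toy data frame and its values are hypothetical and are not taken from the Lahman files; only the summarize and group_by calls mirror the code above.

```r
library(dplyr)

# Hypothetical data: player "a" has two stints in 1950, player "b" has one
toy <- data.frame(
  playerID = c("a", "a", "b"),
  yearID   = c(1950, 1950, 1950),
  stint    = c(1, 2, 1),
  AB       = c(200, 150, 500),
  HR       = c(10, 8, 30)
)

# One row per player and season, with AB and HR summed over the stints
toy.totals <- summarize(group_by(toy, yearID, playerID),
                        AB = sum(AB), HR = sum(HR))
toy.totals <- as.data.frame(toy.totals)
toy.totals
```

Player "a" ends up with a single 1950 row whose HR is the sum of the two stints.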

Looking at the NY Times graph, it appears that they graphed only the top HR totals. With some trial and error, I found that this approximately corresponds to the 97th percentile of the HR totals. I use the summarize function again to find the 97th percentile of HR for each season, and I merge these values with the HR data. I then use the filter function to create a data frame with only the HR counts that met or exceeded that percentile.

Percentiles <- summarize(group_by(HR.Data, yearID),
                         P = quantile(HR, .97, na.rm=TRUE))
HR.Data <- merge(HR.Data, Percentiles)
HR.Data.Top <- filter(HR.Data, HR >= P)
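The summarize-then-merge step can also be written as a grouped mutate, which attaches each season's percentile to every row without a separate merge. This is only a sketch of an alternative on simulated data (the toy data frame and the Poisson counts are assumptions for illustration), not a change to the method above.

```r
library(dplyr)

# Simulated season totals for two hypothetical seasons
set.seed(1)
toy <- data.frame(
  yearID = rep(c(2000, 2001), each = 200),
  HR     = rpois(400, lambda = 8)
)

# Approach used in the post: per-season percentile, then merge and filter
P.toy     <- summarize(group_by(toy, yearID),
                       P = quantile(HR, .97, na.rm = TRUE))
top.merge <- filter(merge(toy, P.toy), HR >= P)

# Alternative: compute the percentile inside each season with mutate
top.mutate <- filter(mutate(group_by(toy, yearID),
                            P = quantile(HR, .97, na.rm = TRUE)),
                     HR >= P)
top.mutate <- ungroup(top.mutate)

# Both approaches should keep the same number of rows
c(nrow(top.merge), nrow(top.mutate))
```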

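As in the NY Times graph, I want to identify the home run values that set the single-season record. I do this in three steps. First, I use summarize again to find the HR leader for each season. Second, I use the mutate function to find the cumulative maximum of these leading values. Third, I use the filter function to create a data frame of the home run record breakers, the seasons where the leader's total equals the cumulative maximum. Note that this comparison also keeps the first season and any season in which the leader tied the existing record.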

HR.Best <- summarize(group_by(HR.Data, yearID),
                     Leader = max(HR, na.rm=TRUE))
HR.Best <- mutate(HR.Best, Hist.Leader=cummax(Leader))
Career.Best <- filter(HR.Best, Leader==Hist.Leader)
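To make the cummax step concrete, here is a small sketch on made-up season leaders (the values are hypothetical, not actual league totals). It also shows one possible stricter filter, using dplyr's lag function, that keeps only seasons where the leader actually exceeded the previous record; that variant is my own illustration and is not what the code above does.

```r
library(dplyr)

# Hypothetical season-leading HR totals
toy <- data.frame(
  yearID = 1901:1907,
  Leader = c(16, 12, 16, 21, 19, 29, 29)
)

# Running maximum of the leading totals
toy <- mutate(toy, Hist.Leader = cummax(Leader))

# Same rule as above: leader equals the running maximum (ties are kept)
matched <- filter(toy, Leader == Hist.Leader)

# Stricter variant: leader must exceed the running maximum of earlier seasons
strict <- filter(toy, Leader > dplyr::lag(Hist.Leader, default = -Inf))

matched$yearID
strict$yearID
```

In this toy example the equality rule keeps 1901, 1903, 1904, 1906 and 1907, while the stricter rule keeps only 1901, 1904 and 1906.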

Now I am ready to graph. Since the NY Times graph is relatively wide, I create a plotting frame (using quartz on my Mac; Windows users can use windows) with width 10 inches and height 5 inches. I construct a scatterplot of the HR values of the “top” players against season and add a smoothing curve. From the Career.Best data frame, I add red dots (drawn slightly larger for emphasis) for the HR record breakers.

quartz(width=10, height=5)
ggplot(HR.Data.Top, aes(yearID, HR)) +
  geom_point() + geom_smooth(color="red") +
  geom_point(data=Career.Best, aes(yearID, Hist.Leader),
             color="red", size=3)
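If you would rather not open a platform-specific graphics window, one option is to write the plot straight to a file at the same 10 by 5 inch size with ggplot2's ggsave function. The sketch below uses made-up data and a hypothetical file name in a temporary directory.

```r
library(ggplot2)

# Made-up data standing in for HR.Data.Top
toy <- data.frame(yearID = 1901:1910,
                  HR = c(16, 16, 13, 10, 9, 12, 10, 12, 9, 10))

p <- ggplot(toy, aes(yearID, HR)) + geom_point()

# Hypothetical output path; width and height are in inches by default
out.file <- file.path(tempdir(), "hr_toy.png")
ggsave(out.file, plot = p, width = 10, height = 5)
```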


From a statistical perspective, I am most interested in the NY Times prediction that there is an even (50%) chance that the single-season HR record will last 49 years. I’m not sure of the details of the extreme value model that led to this prediction, but I suspect that this model makes some assumptions about the distribution of home run hitting in the future, and the prediction is likely sensitive to these model assumptions.

