Recently, the New York Times had an interesting article about baseball records that presents some nice graphs showing the pattern of different types of performance (HR, RBI, AVG) of the top players over time. I thought it would be helpful to demonstrate the ease of creating these types of graphs using the ggplot2 package in R. We’ll focus on the home run counts.
First, we read in the relevant data, the Batting data downloaded from the Lahman database that I have stored on my computer.
Batting <- read.csv("~/OneDriveBusiness/lahman-csv_2015-01-24/Batting.csv")
I use the summarize function to sum the HR and AB counts over the stint variable for each player and season, and I only consider seasons from 1901 on (when there were two leagues).
library(dplyr)
HR.Data <- summarize(group_by(Batting, yearID, playerID),
                     AB=sum(AB), HR=sum(HR))
HR.Data <- filter(HR.Data, yearID >= 1901)
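To see what collapsing over stint does, here is a minimal sketch on made-up data. The toy data frame and its values are hypothetical, not taken from the Lahman files; only the summarize/group_by pattern is the one used above.

```r
library(dplyr)

# Hypothetical toy data: player "a" has two stints in 1950, player "b" has one
toy <- data.frame(
  playerID = c("a", "a", "b"),
  yearID   = c(1950, 1950, 1950),
  stint    = c(1, 2, 1),
  AB       = c(200, 150, 500),
  HR       = c(10, 5, 30)
)

# Same pattern as in the post: one row per player-season, summing over stints
toy.totals <- summarize(group_by(toy, yearID, playerID),
                        AB = sum(AB), HR = sum(HR))
toy.totals <- as.data.frame(toy.totals)

# Player "a" now has a single row with AB = 350 and HR = 15
print(toy.totals)
```

Recent versions of dplyr may print a message about the grouping of the result; it can be ignored here.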
Looking at the NY Times graph, it appears that they graphed only the top HR totals. With some trial and error, it seemed that this approximately corresponded to the 97th percentile of the HR totals. I use the summarize function again to find the 97th percentile of HR for each season and I merge these values with the HR data. I use the filter function to create a data frame with only the HR counts that met or exceeded that percentile.
Percentiles <- summarize(group_by(HR.Data, yearID),
                         P = quantile(HR, .97, na.rm=TRUE))
HR.Data <- merge(HR.Data, Percentiles)
HR.Data.Top <- filter(HR.Data, HR >= P)
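The percentile step can also be written as a single grouped filter. The sketch below uses hypothetical toy data (two seasons with made-up HR totals) to check that the two-step version and a one-step alternative keep the same player-seasons; the one-step form is my own variation, not something the NY Times graph relies on.

```r
library(dplyr)

# Hypothetical toy data: 50 players in 2000 (HR 0..49), 100 in 2001 (HR 0..99)
toy <- data.frame(
  yearID   = rep(c(2000, 2001), times = c(50, 100)),
  playerID = paste0("p", 1:150),
  HR       = c(0:49, 0:99)
)

# Two-step version, as in the post: compute percentiles, merge, then filter
pct <- summarize(group_by(toy, yearID),
                 P = quantile(HR, .97, na.rm = TRUE))
top.merge <- filter(merge(toy, pct), HR >= P)

# One-step alternative: a grouped filter evaluates quantile() within each season
top.grouped <- ungroup(filter(group_by(toy, yearID),
                              HR >= quantile(HR, .97, na.rm = TRUE)))

# Both approaches should keep the same player-seasons
same.players <- setequal(top.merge$playerID, top.grouped$playerID)
print(same.players)
```

The merge version has the advantage of keeping the cutoff P as a column, which is handy if you want to inspect how the threshold changes by season.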
As in the NY Times graph, I want to identify the home run totals that broke the single-season record. I do this in three steps. First, I use summarize again to find the HR leader for each season. Next, I use the mutate function with cummax to find the cumulative maximum of those leaders. Finally, I use the filter function to create a data frame of the home run record breakers.
HR.Best <- summarize(group_by(HR.Data, yearID),
                     Leader = max(HR, na.rm=TRUE))
HR.Best <- mutate(HR.Best, Hist.Leader=cummax(Leader))
Career.Best <- filter(HR.Best, Leader==Hist.Leader)
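A small worked example may help show what cummax is doing here. The leader totals below are hypothetical, chosen only to include a tie. Note that the equality test keeps a season that merely ties the standing record; the stricter variant at the end is one possible alternative, not part of the original analysis.

```r
# Hypothetical season-leading HR totals, in chronological order
Leader <- c(16, 16, 13, 29, 54, 59, 41, 60, 58, 61)

# Running record: the largest leader total seen up to and including each season
Hist.Leader <- cummax(Leader)

# Seasons where the leader equals the running record (the condition used in
# the filter above); this also flags a season that only ties the record
equals.record <- Leader == Hist.Leader

# Stricter variant (an assumption on my part): the leader must exceed the
# record that stood before that season
prev.record <- c(-Inf, head(Hist.Leader, -1))
breaks.record <- Leader > prev.record

print(which(equals.record))
print(which(breaks.record))
```

In this toy sequence the second season ties the first at 16, so it is flagged by the equality test but not by the stricter one.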
Now I am ready to graph. Since the NY Times graph is relatively wide, I create a plotting frame (using quartz on my Mac; Windows users can use windows) with width 10 inches and height 5 inches. I construct a scatterplot of the HR values of the “top” players against season and add a smoothing curve. From the Career.Best data frame, I add red dots (drawn slightly larger for emphasis) for the HR record breakers.
library(ggplot2)
quartz(width=10, height=5)
ggplot(HR.Data.Top, aes(yearID, HR)) +
  geom_point() +
  geom_smooth(color="red") +
  geom_point(data=Career.Best, aes(yearID, Hist.Leader),
             color="red", size=3)
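If you would rather not depend on a platform-specific graphics window, one option is to write the plot straight to a file with ggsave, which takes the width and height in inches by default. The sketch below uses a hypothetical stand-in data frame and output file name; in practice you would pass the plot built from HR.Data.Top.

```r
library(ggplot2)

# Hypothetical stand-in data; in the post this would be HR.Data.Top
toy <- data.frame(yearID = 1901:1950, HR = 10 + (0:49) %% 7)

p <- ggplot(toy, aes(yearID, HR)) + geom_point()

# Write a 10 x 5 inch PDF without opening quartz() or windows()
out.file <- file.path(tempdir(), "hr-plot.pdf")
ggsave(out.file, plot = p, width = 10, height = 5)
```

This keeps the same wide aspect ratio as the on-screen version and works the same way on Mac, Windows, and Linux.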
From a statistical perspective, I am most interested in the NY Times prediction that there is an even (50%) chance that the single-season HR record will last 49 years. I’m not sure of the details of the extreme value model that led to this prediction, but I suspect that this model makes some assumptions about the distribution of home run hitting in the future, and the prediction is likely sensitive to these model assumptions.