# Graphing HR Totals

Recently, the New York Times had an interesting article about baseball records that presents some nice graphs showing the pattern of different types of performance (HR, RBI, AVG) of the top players over time. I thought it would be helpful to demonstrate how easy it is to create these types of graphs using the `ggplot2` package in R. We’ll focus on the home run counts.

First, we read in the relevant data, the Batting data downloaded from the Lahman database that I have stored on my computer.

```
Batting <- read.csv("~/OneDriveBusiness/lahman-csv_2015-01-24/Batting.csv")
```

I use the `summarize` function to sum the HR and AB counts over each player's stints within a season (the `stint` variable), and I keep only seasons from 1901 on (when there were two leagues).

```
library(dplyr)
HR.Data <- summarize(group_by(Batting, yearID, playerID),
                     AB=sum(AB), HR=sum(HR))
HR.Data <- filter(HR.Data, yearID >= 1901)
```
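To make the collapsing step concrete, here is a minimal sketch on a small made-up data frame (the `toy` table and its numbers are assumptions for illustration, not Lahman data). Player "a" has two stints in 1901, and they are summed into a single row for that season.

```r
library(dplyr)

# Made-up batting rows: player "a" has two stints in 1901
toy <- data.frame(
  playerID = c("a", "a", "b", "a"),
  yearID   = c(1901, 1901, 1901, 1902),
  stint    = c(1, 2, 1, 1),
  AB       = c(200, 150, 400, 500),
  HR       = c(5, 3, 10, 12)
)

# One row per player-season: AB and HR are summed across stints
season_totals <- toy %>%
  group_by(yearID, playerID) %>%
  summarize(AB = sum(AB), HR = sum(HR), .groups = "drop")

print(season_totals)
```

The result has three rows instead of four, since the two 1901 stints for player "a" are combined.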

Looking at the NY Times graph, it appears that they plotted only the top HR totals. With some trial and error, this seemed to correspond approximately to the 97th percentile of the HR totals. I use the `summarize` function again to find the 97th percentile of HR for each season, and I merge these values with the HR data. I then use the `filter` function to create a data frame with only the HR counts at or above that percentile.

```
Percentiles <- summarize(group_by(HR.Data, yearID),
                         P = quantile(HR, .97, na.rm=TRUE))
HR.Data <- merge(HR.Data, Percentiles)
HR.Data.Top <- filter(HR.Data, HR >= P)
```
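The same percentile-then-filter pattern can be tried on a tiny made-up table to see what survives the cutoff. This is only a sketch: the `toy` data and the names `cutoffs` and `top` are assumptions for illustration.

```r
library(dplyr)

# Made-up season totals: five players in each of two seasons
toy <- data.frame(
  yearID   = rep(c(1901, 1902), each = 5),
  playerID = rep(c("a", "b", "c", "d", "e"), times = 2),
  HR       = c(1, 2, 3, 4, 16,  2, 4, 6, 8, 20)
)

# 97th percentile of HR within each season
cutoffs <- toy %>%
  group_by(yearID) %>%
  summarize(P = quantile(HR, 0.97, na.rm = TRUE))

# Attach each season's cutoff to its rows, then keep rows at or above it
top <- merge(toy, cutoffs, by = "yearID")
top <- filter(top, HR >= P)

print(top)
```

With only five players per season, the 97th percentile falls between the two largest totals, so only the leader of each season is kept. With the full data, many more players clear the cutoff each year.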

As in the NY Times graph, I want to identify the home run totals that broke the single-season record. I do this in three steps: I use `summarize` again to find the HR leader for each season, the `mutate` function to find the cumulative maximums, and the `filter` function to create a data frame of the home run record breakers.

```
HR.Best <- summarize(group_by(HR.Data, yearID),
```
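A minimal sketch of the three steps described above, run on a small made-up data frame. The data and the names `leaders`, `Leader`, `Running.Max`, and `record_breakers` are assumptions for illustration and are not the original code. In particular, this sketch counts a season as a record breaker only if its leader strictly exceeds the previous record, which is one possible reading.

```r
library(dplyr)

# Made-up season totals for a handful of players
toy <- data.frame(
  yearID   = c(1901, 1901, 1902, 1902, 1903, 1903, 1904),
  playerID = c("a", "b", "a", "c", "b", "c", "a"),
  HR       = c(10, 16, 12, 14, 16, 20, 25)
)

# Step 1: the leading HR total in each season
leaders <- toy %>%
  group_by(yearID) %>%
  summarize(Leader = max(HR)) %>%
  arrange(yearID)

# Step 2: running (cumulative) maximum of the season leaders
leaders <- mutate(leaders, Running.Max = cummax(Leader))

# Step 3: keep seasons whose leader beat the record standing before that season
record_breakers <- filter(
  leaders,
  Leader > dplyr::lag(Running.Max, default = -Inf)
)

print(record_breakers)
```

Here 1902 drops out because its leader (14) did not pass the record of 16 set in 1901. Note that under this rule the first season always counts as a record.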

Now I am ready to graph. Since the NY Times graph is relatively wide, I create a plotting frame (using `quartz` on my Mac; Windows users can use `windows`) with width 10 inches and height 5 inches. I construct a scatterplot of the HR values of the “top” players against season and add a smoothing curve. From the `Career.Best` data frame, I add red dots (drawn slightly larger for emphasis) for the HR record breakers.

```
library(ggplot2)
quartz(width=10, height=5)
ggplot(HR.Data.Top, aes(yearID, HR)) +
  geom_point() + geom_smooth(color="red") +