(This is an early post due to the 4th of July holiday weekend.) I recently discovered the relatively new
broom package. In my world, one is often fitting a number of models to data, and it can be awkward to collect the model output in a convenient way. The objective of the
broom package is to address this issue — it collects model fitting output (say regression coefficients and standard errors from a fitting function such as
lm ) into a data frame. Then one can use other packages such as
ggplot2 to summarize or graph this data frame output.
I’m going to briefly illustrate this package for one of my popular topics — patterns in career trajectories of batters. I am going to focus on the players in the 500 home run club (those who have hit at least 500 career home runs) and see how their home run rates (HR / AB) vary as a function of their age. By the way, one of the chapters in Analyzing Baseball Data with R talks about estimating career trajectories using R. The Baseball-Reference site shows pictures and the leaders in career home runs.
I first have to create a data frame with the appropriate data — here’s the process.
- I start with the
Masterdata frames from the Lahman database. I collect the names of all of the players with at least 500 career home runs through the 2015 season.
I collect all of the player-season data (season, home runs, and at-bats) for all players in the 500 club.
Battingdata frame does not contain the age information. I pick up the birthdates of the players from the
Masterdata frame and merge this information with the player season data frame. From the birthdates, I can compute the player ages for all seasons. (There is a quibble about the “official” age which depends on the birth month and day, and I take in account this issue.)
After this work, I have a single data frame
S500 with variables
HR , and
AB . If I wanted to fit a single quadratic curve to the season home run rate data for all players, I would use the code
lm(I(HR / AB) ~ Age + I(Age ^ 2), data=S500)
Since I believe players have different “true” trajectories, I am interested in running this regression separately for each player in the 500 club. Suppose that I’d like to store the estimated regression coefficients for all the player fits. This is accomplished by the function
tidy in the
library(broom) regressions <- S500 %>% group_by(Name) %>% do(tidy(lm(I(HR / AB) ~ Age + I(Age ^ 2), data=.)))
regressions is a data frame containing the estimates, standard errors, etc. for each of the player fits:
head(regressions) Name term estimate std.error statistic <chr> <chr> <dbl> <dbl> <dbl> 1 Albert Pujols (Intercept) -0.1564145033 1.440626e-01 -1.085740 2 Albert Pujols Age 0.0175182704 1.046082e-02 1.674655 3 Albert Pujols I(Age^2) -0.0003347728 1.863632e-04 -1.796347 4 Alex Rodriguez (Intercept) -0.2894062182 7.684082e-02 -3.766308 5 Alex Rodriguez Age 0.0255148619 5.624347e-03 4.536502 6 Alex Rodriguez I(Age^2) -0.0004430719 9.927357e-05 -4.463140 Variables not shown: p.value <dbl>.
Using this data frame, I can estimate, for each model fit, the age at which the player achieves peak performance. I use the
summarize function in the
dplyr package to compute these estimates and then use
ggplot2 to display these peak ages.
S <- summarize(group_by(regressions, Name), peak_age=- estimate / 2 / estimate) ggplot(S, aes(Name, peak_age)) + geom_point() + coord_flip() + ylim(20, 40) + ggtitle("Age of Peak Performance of Home Run Rate")
Most players peak in home run rate in their late 20’s or early 30’s. Note that Ernie Banks and Albert Pujols was relatively young when he hit his (estimated) home run peak, in contrast to Barry Bonds who peaked (with respect to home run rate) in his late 30’s.
By use of the
augment function, I can obtain the raw data and the predicted response for all the cases in all of the regressions — this is saved in the data frame
individual <- S500 %>% group_by(Name) %>% do(augment(lm(I(HR / AB) ~ Age + I(Age ^ 2), data=.)))
head(individual) Name I.HR.AB. Age I.Age.2. .fitted .se.fit <chr> <dbl> <dbl> <dbl> <dbl> <dbl> 1 Albert Pujols 0.06271186 21 441 0.06383435 0.008159784 2 Albert Pujols 0.05762712 22 484 0.06695739 0.006197427 3 Albert Pujols 0.07275804 23 529 0.06941088 0.004872128 4 Albert Pujols 0.07770270 24 576 0.07119483 0.004241007 5 Albert Pujols 0.06937394 25 625 0.07230923 0.004171649 6 Albert Pujols 0.09158879 26 676 0.07275408 0.004366893 Variables not shown: .resid <dbl>, .hat <dbl>, .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>.
With the basic data frame
S500 and the predicted values for all of the regressions stored in the data frame
individual , it is straightforward to use the
ggplot2 package to graph the home run rates and the fitted trajectories.
ggplot(individual, aes(Age, .fitted)) + geom_line(color="red", size=1.5) + facet_wrap(~ Name) + geom_point(data=S500, aes(Age, HR / AB), color="blue") + ggtitle("Career Trajectories of the 500 Home Run Club") + theme_fivethirtyeight()
This graph shows clearly that there are different career paths to 500 or more home runs.
- A typical trajectory (like A-Rod) is to have a gradual increase in the home run rate, peak in one’s late 20’s or early 30’s, and then decline until retirement. The rate of increase and decline can vary between players.
Another path to 500 is to have a long career and consistently hit home runs — players like Eddie Murray or Hank Aaron are of this type. I don’t know if one would call Eddie Murray a great home run hitter, but he hit a reasonable number of home runs each season for many seasons.
Another path is the “steriods” path — there is an indication from these home run trajectories that particular players around the 2000 season used some type of enhancement to dramatically increase their home run rate in the late part of their career. Players like Barry Bonds and Mark McGwire displayed unusual trajectories of home run rates.
For more information about the
broom package, I’d encourage you to read David Robinson’s paper.
All of the R code to generate this post and graphs can be found on my gist site.