(This is an early post due to the 4th of July holiday weekend.) I recently discovered the relatively new broom
package. In my world, one is often fitting a number of models to data, and it can be awkward to collect the model output in a convenient way. The objective of the broom
package is to address this issue — it collects model fitting output (say regression coefficients and standard errors from a fitting function such as lm
) into a data frame. Then one can use other packages such as dplyr
or ggplot2
to summarize or graph this data frame output.
I’m going to briefly illustrate this package for one of my popular topics — patterns in career trajectories of batters. I am going to focus on the players in the 500 home run club (those who have hit at least 500 career home runs) and see how their home run rates (HR / AB) vary as a function of their age. By the way, one of the chapters in Analyzing Baseball Data with R talks about estimating career trajectories using R. The Baseball-Reference site shows pictures and the leaders in career home runs.
I first have to create a data frame with the appropriate data — here’s the process.
- I start with the
Batting
andMaster
data frames from the Lahman database. I collect the names of all of the players with at least 500 career home runs through the 2015 season. -
I collect all of the player-season data (season, home runs, and at-bats) for all players in the 500 club.
-
The
Batting
data frame does not contain the age information. I pick up the birthdates of the players from theMaster
data frame and merge this information with the player season data frame. From the birthdates, I can compute the player ages for all seasons. (There is a quibble about the “official” age which depends on the birth month and day, and I take in account this issue.)
After this work, I have a single data frame S500
with variables Name
, yearID
, HR
, and AB
. If I wanted to fit a single quadratic curve to the season home run rate data for all players, I would use the code
lm(I(HR / AB) ~ Age + I(Age ^ 2), data=S500)
The tidy
function
Since I believe players have different “true” trajectories, I am interested in running this regression separately for each player in the 500 club. Suppose that I’d like to store the estimated regression coefficients for all the player fits. This is accomplished by the function tidy
in the broom
package:
library(broom) regressions <- S500 %>% group_by(Name) %>% do(tidy(lm(I(HR / AB) ~ Age + I(Age ^ 2), data=.)))
The output regressions
is a data frame containing the estimates, standard errors, etc. for each of the player fits:
head(regressions) Name term estimate std.error statistic <chr> <chr> <dbl> <dbl> <dbl> 1 Albert Pujols (Intercept) -0.1564145033 1.440626e-01 -1.085740 2 Albert Pujols Age 0.0175182704 1.046082e-02 1.674655 3 Albert Pujols I(Age^2) -0.0003347728 1.863632e-04 -1.796347 4 Alex Rodriguez (Intercept) -0.2894062182 7.684082e-02 -3.766308 5 Alex Rodriguez Age 0.0255148619 5.624347e-03 4.536502 6 Alex Rodriguez I(Age^2) -0.0004430719 9.927357e-05 -4.463140 Variables not shown: p.value <dbl>.
Using this data frame, I can estimate, for each model fit, the age at which the player achieves peak performance. I use the summarize
function in the dplyr
package to compute these estimates and then use ggplot2
to display these peak ages.
S <- summarize(group_by(regressions, Name), peak_age=- estimate[2] / 2 / estimate[3]) ggplot(S, aes(Name, peak_age)) + geom_point() + coord_flip() + ylim(20, 40) + ggtitle("Age of Peak Performance of Home Run Rate")
Most players peak in home run rate in their late 20’s or early 30’s. Note that Ernie Banks and Albert Pujols was relatively young when he hit his (estimated) home run peak, in contrast to Barry Bonds who peaked (with respect to home run rate) in his late 30’s.
The augment
function
By use of the augment
function, I can obtain the raw data and the predicted response for all the cases in all of the regressions — this is saved in the data frame individual
.
individual <- S500 %>% group_by(Name) %>% do(augment(lm(I(HR / AB) ~ Age + I(Age ^ 2), data=.)))
head(individual) Name I.HR.AB. Age I.Age.2. .fitted .se.fit <chr> <dbl> <dbl> <dbl> <dbl> <dbl> 1 Albert Pujols 0.06271186 21 441 0.06383435 0.008159784 2 Albert Pujols 0.05762712 22 484 0.06695739 0.006197427 3 Albert Pujols 0.07275804 23 529 0.06941088 0.004872128 4 Albert Pujols 0.07770270 24 576 0.07119483 0.004241007 5 Albert Pujols 0.06937394 25 625 0.07230923 0.004171649 6 Albert Pujols 0.09158879 26 676 0.07275408 0.004366893 Variables not shown: .resid <dbl>, .hat <dbl>, .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>.
With the basic data frame S500
and the predicted values for all of the regressions stored in the data frame individual
, it is straightforward to use the ggplot2
package to graph the home run rates and the fitted trajectories.
ggplot(individual, aes(Age, .fitted)) + geom_line(color="red", size=1.5) + facet_wrap(~ Name) + geom_point(data=S500, aes(Age, HR / AB), color="blue") + ggtitle("Career Trajectories of the 500 Home Run Club") + theme_fivethirtyeight()
This graph shows clearly that there are different career paths to 500 or more home runs.
- A typical trajectory (like A-Rod) is to have a gradual increase in the home run rate, peak in one’s late 20’s or early 30’s, and then decline until retirement. The rate of increase and decline can vary between players.
-
Another path to 500 is to have a long career and consistently hit home runs — players like Eddie Murray or Hank Aaron are of this type. I don’t know if one would call Eddie Murray a great home run hitter, but he hit a reasonable number of home runs each season for many seasons.
-
Another path is the “steriods” path — there is an indication from these home run trajectories that particular players around the 2000 season used some type of enhancement to dramatically increase their home run rate in the late part of their career. Players like Barry Bonds and Mark McGwire displayed unusual trajectories of home run rates.
For more information about the broom
package, I’d encourage you to read David Robinson’s paper.
All of the R code to generate this post and graphs can be found on my gist site.
Great graphics. Interesting in regards to the steroids issue. Thanks. (McGwire’s, Robinson’s, and Matthews’s dots are missing on the first chart.)
Steve: Thanks. For some hitters, the peak age estimates from the quadratic fit were not reasonable (outside of the 20 to 40) interval, so I didn’t show them in the graph. This motivates the need for multilevel modeling in this situation so one would get better estimates of peak age.