The broom Package and Home Run Trajectories

(This is an early post due to the 4th of July holiday weekend.) I recently discovered the relatively new broom package. In my world, one is often fitting a number of models to data, and it can be awkward to collect the model output in a convenient way. The objective of the broom package is to address this issue — it collects model fitting output (say regression coefficients and standard errors from a fitting function such as lm ) into a data frame. Then one can use other packages such as dplyr or ggplot2 to summarize or graph this data frame output.

I’m going to briefly illustrate this package for one of my popular topics — patterns in career trajectories of batters. I am going to focus on the players in the 500 home run club (those who have hit at least 500 career home runs) and see how their home run rates (HR / AB) vary as a function of their age. By the way, one of the chapters in Analyzing Baseball Data with R talks about estimating career trajectories using R. The Baseball-Reference site shows pictures and the leaders in career home runs.

I first have to create a data frame with the appropriate data — here’s the process.

  1. I start with the Batting and Master data frames from the Lahman database. I collect the names of all of the players with at least 500 career home runs through the 2015 season.

  2. I collect all of the player-season data (season, home runs, and at-bats) for all players in the 500 club.

  3. The Batting data frame does not contain the age information. I pick up the birthdates of the players from the Master data frame and merge this information with the player season data frame. From the birthdates, I can compute the player ages for all seasons. (There is a quibble about the “official” age which depends on the birth month and day, and I take in account this issue.)

After this work, I have a single data frame S500 with variables Name , yearID , HR , and AB . If I wanted to fit a single quadratic curve to the season home run rate data for all players, I would use the code

lm(I(HR / AB) ~ Age + I(Age ^ 2), data=S500)

The tidy function

Since I believe players have different “true” trajectories, I am interested in running this regression separately for each player in the 500 club. Suppose that I’d like to store the estimated regression coefficients for all the player fits. This is accomplished by the function tidy in the broom package:

regressions <- S500 %>% group_by(Name) %>%
  do(tidy(lm(I(HR / AB) ~ Age + I(Age ^ 2), data=.)))

The output regressions is a data frame containing the estimates, standard errors, etc. for each of the player fits:

            Name        term      estimate    std.error statistic
           <chr>       <chr>         <dbl>        <dbl>     <dbl>
1  Albert Pujols (Intercept) -0.1564145033 1.440626e-01 -1.085740
2  Albert Pujols         Age  0.0175182704 1.046082e-02  1.674655
3  Albert Pujols    I(Age^2) -0.0003347728 1.863632e-04 -1.796347
4 Alex Rodriguez (Intercept) -0.2894062182 7.684082e-02 -3.766308
5 Alex Rodriguez         Age  0.0255148619 5.624347e-03  4.536502
6 Alex Rodriguez    I(Age^2) -0.0004430719 9.927357e-05 -4.463140
Variables not shown: p.value <dbl>.

Using this data frame, I can estimate, for each model fit, the age at which the player achieves peak performance. I use the summarize function in the dplyr package to compute these estimates and then use ggplot2 to display these peak ages.

S <- summarize(group_by(regressions, Name),
     peak_age=- estimate[2] / 2 / estimate[3])
ggplot(S, aes(Name, peak_age)) + geom_point() +
  coord_flip() + ylim(20, 40) +
  ggtitle("Age of Peak Performance of Home Run Rate")

Most players peak in home run rate in their late 20’s or early 30’s. Note that Ernie Banks and Albert Pujols was relatively young when he hit his (estimated) home run peak, in contrast to Barry Bonds who peaked (with respect to home run rate) in his late 30’s.

The augment function

By use of the augment function, I can obtain the raw data and the predicted response for all the cases in all of the regressions — this is saved in the data frame individual .

individual <- S500 %>% group_by(Name) %>%
  do(augment(lm(I(HR / AB) ~ Age + I(Age ^ 2), data=.)))
           Name   I.HR.AB.   Age I.Age.2.    .fitted
          <chr>      <dbl> <dbl>    <dbl>      <dbl>       <dbl>
1 Albert Pujols 0.06271186    21      441 0.06383435 0.008159784
2 Albert Pujols 0.05762712    22      484 0.06695739 0.006197427
3 Albert Pujols 0.07275804    23      529 0.06941088 0.004872128
4 Albert Pujols 0.07770270    24      576 0.07119483 0.004241007
5 Albert Pujols 0.06937394    25      625 0.07230923 0.004171649
6 Albert Pujols 0.09158879    26      676 0.07275408 0.004366893
Variables not shown: .resid <dbl>, .hat <dbl>, .sigma <dbl>, .cooksd
  <dbl>, .std.resid <dbl>.

With the basic data frame S500 and the predicted values for all of the regressions stored in the data frame individual , it is straightforward to use the ggplot2 package to graph the home run rates and the fitted trajectories.

ggplot(individual, aes(Age, .fitted)) + 
  geom_line(color="red", size=1.5) +
  facet_wrap(~ Name) +
  geom_point(data=S500, aes(Age, HR / AB), color="blue") +
  ggtitle("Career Trajectories of the 500 Home Run Club") +


This graph shows clearly that there are different career paths to 500 or more home runs.

  1. A typical trajectory (like A-Rod) is to have a gradual increase in the home run rate, peak in one’s late 20’s or early 30’s, and then decline until retirement. The rate of increase and decline can vary between players.

  2. Another path to 500 is to have a long career and consistently hit home runs — players like Eddie Murray or Hank Aaron are of this type. I don’t know if one would call Eddie Murray a great home run hitter, but he hit a reasonable number of home runs each season for many seasons.

  3. Another path is the “steriods” path — there is an indication from these home run trajectories that particular players around the 2000 season used some type of enhancement to dramatically increase their home run rate in the late part of their career. Players like Barry Bonds and Mark McGwire displayed unusual trajectories of home run rates.

For more information about the broom package, I’d encourage you to read David Robinson’s paper.

All of the R code to generate this post and graphs can be found on my gist site.


2 responses

  1. Great graphics. Interesting in regards to the steroids issue. Thanks. (McGwire’s, Robinson’s, and Matthews’s dots are missing on the first chart.)

    1. Steve: Thanks. For some hitters, the peak age estimates from the quadratic fit were not reasonable (outside of the 20 to 40) interval, so I didn’t show them in the graph. This motivates the need for multilevel modeling in this situation so one would get better estimates of peak age.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: