Monthly Archives: July, 2018

Response to comments and home run update


I’ve been away at an overseas meeting, so I haven’t been active on this blog recently.  I’ll use this post to respond to several questions from readers and then give an update on 2018 home run hitting at the halfway point of the season.

Adding an Age Variable

Zach writes:

I’m trying to add an “Age” column in the Lahman batting.csv file. My idea is that I can use a combination of getinfo and the sapply function. I’m comfortable using the getinfo function for individual players. I’ve attempted to adapt the function to do this but I’m struggling. Any suggestions?

I do talk about adding an Age variable to the Batting data frame in our baseball/R book.  Here’s an R function, get.stats, written with the tidyverse collection of packages, that does this for a single player.

A player’s age for a season is defined to be his age on July 1. The Master data frame in the Lahman package has a variable birthYear. I define a new variable birthyear that is equal to birthYear + 1 if the player’s birthMonth is 7 or later; otherwise birthyear is equal to birthYear. The player’s age is then the difference between yearID and birthyear. The function also computes some traditional measures of performance (SLG, OBP, and OPS) for each age.

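library(tidyverse)
library(Lahman)

# Age, SLG, OBP, and OPS for one player, one row per Batting record
get.stats <- function(player.id){
  Batting %>%
     filter(playerID == player.id) %>%
     # Master holds birth information (newer Lahman releases call it People)
     inner_join(select(Master, playerID,
                       birthMonth, birthYear),
                       by="playerID") %>%
     # July or later births are counted as born in the following year
     mutate(birthyear = ifelse(birthMonth >= 7,
                            birthYear + 1, birthYear),
            Age = yearID - birthyear,
            SLG = (H - X2B - X3B - HR +
                     2 * X2B + 3 * X3B + 4 * HR) / AB,
            OBP = (H + BB + HBP) / (AB + BB + HBP + SF),
            OPS = SLG + OBP) %>%
     select(Age, SLG, OBP, OPS)
}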

I illustrate this function for Tony Gwynn, whose playerID is “gwynnto01”. The first six rows of the result are shown.

TG <- get.stats("gwynnto01")
head(TG)
  Age       SLG       OBP       OPS
1  22 0.3894737 0.3365854 0.7260591
2  23 0.3717105 0.3545455 0.7262560
3  24 0.4438944 0.4095665 0.8534609
4  25 0.4083601 0.3641791 0.7725392
5  26 0.4672897 0.3805436 0.8478334
6  27 0.5110357 0.4469027 0.9579383
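
For Zach’s original goal of an Age column on the whole batting table, the same July 1 rule can be applied to every row at once, without sapply. Here is a minimal sketch in base R. The function name add_age, the two small data frames, and the player “example01” are made up for illustration; the column names are the ones used above. With the Lahman package loaded, passing Batting and Master in place of the toy tables should work the same way, though I have not timed it on the full data.

```r
# Add a season Age column to a batting table using the July 1 rule.
# `batting` needs playerID and yearID; `people` needs playerID,
# birthYear, and birthMonth.
add_age <- function(batting, people) {
  bio <- people[, c("playerID", "birthYear", "birthMonth")]
  out <- merge(batting, bio, by = "playerID")
  # Players born in July or later are treated as born the following year
  adj_year <- ifelse(out$birthMonth >= 7, out$birthYear + 1, out$birthYear)
  out$Age <- out$yearID - adj_year
  # Drop the helper columns so only Age is added
  out$birthYear <- NULL
  out$birthMonth <- NULL
  out
}

# Toy stand-ins for the Lahman tables ("example01" is hypothetical)
batting <- data.frame(
  playerID = c("gwynnto01", "gwynnto01", "example01"),
  yearID   = c(1982, 1983, 1982),
  stringsAsFactors = FALSE
)
people <- data.frame(
  playerID   = c("gwynnto01", "example01"),
  birthYear  = c(1960, 1960),
  birthMonth = c(5, 8),
  stringsAsFactors = FALSE
)

with_age <- add_age(batting, people)
print(with_age)
```

Gwynn was born in May 1960, so his 1982 row gets Age 22, matching the first row of the output above. The hypothetical August-born player from the same birth year gets Age 21 for that season.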

Many seasons of Retrosheet data

Another reader asks: “what happens if I want to download more than one season at a time?”

I have described how to download Retrosheet play-by-play data for a single season in a previous post.  Although that post was written four years ago, the code still seems to work fine.  Retrosheet does allow you to download multiple seasons at once.  I’d suggest looking at the code of the function parse.retrosheet2.pbp.R; I think a straightforward modification of this function will work for multiple seasons.
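
One simple way to handle several seasons is to leave the one-season code alone and loop over the seasons you want. The sketch below shows only that pattern. The function parse_one_season is a hypothetical stand-in that returns a tiny fake data frame; it is not parse.retrosheet2.pbp, and it assumes a one-season parser that returns its plays as a data frame, so check what the real function returns or writes to disk before adapting it.

```r
# Hypothetical stand-in for a one-season parser: returns that season's
# plays as a data frame. Replace the body with real parsing code.
parse_one_season <- function(season) {
  data.frame(
    season  = season,
    game_id = paste0("GAME", season, c("A", "B")),
    stringsAsFactors = FALSE
  )
}

# Apply the one-season parser to each season and stack the results
parse_seasons <- function(seasons, parse_fun = parse_one_season) {
  pieces <- lapply(seasons, parse_fun)
  do.call(rbind, pieces)
}

pbp <- parse_seasons(2015:2017)
print(table(pbp$season))
```

Keeping a season column in the combined data frame makes it easy to split the plays back out by year later.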

Home Run Update Through Games of June 30

2018 home run hitting is still trailing the home run pattern of 2017.   Here is a graph of the cumulative in-play home run rate for games played through June 30.  It is interesting that the 2017 HR rate climbed during June; in 2018 the HR rate appears to have leveled off in recent weeks around 4.4 percent.


Below I plot the actual in-play home run rate for each week of the 2017 and 2018 seasons.  In the last three weeks (week numbers 20, 21, 22), the 2018 rate has been significantly lower than the 2017 rate.
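
To make the difference between the two graphs concrete, here is a small sketch of how a weekly rate and a cumulative rate can be computed from weekly counts. The numbers are made up, and it assumes the in-play home run rate is home runs divided by batted balls in play; the variable names hr and bip are only for illustration.

```r
# Made-up weekly counts, not actual 2017 or 2018 data
hr  <- c(40, 50, 45, 55)          # home runs hit in each week
bip <- c(1000, 1000, 1000, 1000)  # batted balls in play in each week

# Weekly rate: each week on its own
weekly_rate <- hr / bip

# Cumulative rate: all home runs to date over all balls in play to date
cumulative_rate <- cumsum(hr) / cumsum(bip)

print(round(100 * weekly_rate, 2))
print(round(100 * cumulative_rate, 2))
```

The cumulative rate moves slowly because each new week is averaged with everything before it, which is why a few low weeks show up clearly in the weekly graph but only flatten the cumulative one.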


Further work will be needed to find the characteristics that are leading to this drop in home run hitting this season.  In some ways, the variability of home run hitting in recent seasons has been a mystery.