I’ve been away at an overseas meeting, so I haven’t been active on this blog recently. I’ll use this post to respond to several questions from readers and then give an update on 2018 home run hitting at the halfway point of the season.
Adding an Age Variable
A reader asks: “I’m trying to add an ‘Age’ column to the Lahman batting.csv file. My idea is that I can use a combination of getinfo and the sapply function. I’m comfortable using the getinfo function for individual players. I’ve attempted to adapt the function to do this but I’m struggling. Any suggestions?”
I do talk about adding an Age variable to the Batting data frame in our baseball/R book. Here’s an R function get.stats that does this using the tidyverse collection of packages.
A player’s age for a season is defined to be his age on July 1 of that season. The Master data frame in the Lahman package has a variable birthYear. I define a new variable birthyear that is equal to birthYear + 1 if the player’s birthMonth is 7 or later; otherwise birthyear is equal to birthYear. A player’s Age is then the difference between the season (yearID) and birthyear. The function also computes some traditional measures of performance (SLG, OBP, OPS) for each age.
library(tidyverse)
library(Lahman)

# Age, SLG, OBP, and OPS for each season of one player
get.stats <- function(player.id) {
  Batting %>%
    filter(playerID == player.id) %>%
    inner_join(select(Master, playerID, birthMonth, birthYear),
               by = "playerID") %>%
    mutate(birthyear = ifelse(birthMonth >= 7, birthYear + 1, birthYear),
           Age = yearID - birthyear,
           SLG = (H - X2B - X3B - HR + 2 * X2B + 3 * X3B + 4 * HR) / AB,
           OBP = (H + BB + HBP) / (AB + BB + HBP + SF),
           OPS = SLG + OBP) %>%
    select(Age, SLG, OBP, OPS)
}
I illustrate this function for Tony Gwynn, whose playerID is “gwynnto01”.
TG <- get.stats("gwynnto01")
head(TG)
  Age       SLG       OBP       OPS
1  22 0.3894737 0.3365854 0.7260591
2  23 0.3717105 0.3545455 0.7262560
3  24 0.4438944 0.4095665 0.8534609
4  25 0.4083601 0.3641791 0.7725392
5  26 0.4672897 0.3805436 0.8478334
6  27 0.5110357 0.4469027 0.9579383
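The reader’s question was about adding Age to every row of the batting table rather than to one player’s seasons. Here is a minimal sketch of the July 1 rule written as a standalone vectorized function, so it can be checked without downloading anything. The function name season_age and the small toy data frame are made up for illustration; only the column names yearID, birthYear, and birthMonth come from Lahman.

```r
# Age on July 1 of a season, following the rule described in this post:
# a player born in July or later is treated as born the following year.
season_age <- function(yearID, birthYear, birthMonth) {
  birthyear <- ifelse(birthMonth >= 7, birthYear + 1, birthYear)
  yearID - birthyear
}

# Hypothetical rows standing in for Batting joined with Master
toy <- data.frame(
  playerID   = c("aaa01", "aaa01", "bbb01"),
  yearID     = c(1982, 1983, 1982),
  birthYear  = c(1960, 1960, 1960),
  birthMonth = c(5, 5, 8)
)

# Add the Age column to every row at once
toy$Age <- season_age(toy$yearID, toy$birthYear, toy$birthMonth)
print(toy)
```

With Lahman loaded, the same function should apply to the full table once Batting has been joined with the birthMonth and birthYear columns of Master on playerID, which is the join get.stats performs for a single player.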
Many Seasons of Retrosheet Data
Another reader asks: “what happens if I want to download more than one season at a time?”
I have described how to download Retrosheet play-by-play data for a single season in a previous post. Although that post was written four years ago, the code still seems to work fine. Retrosheet does allow you to download multiple seasons at once. I’d suggest looking at the code of the function parse.retrosheet2.pbp.R; I think a straightforward modification of this function will work for multiple seasons.
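One simple alternative to modifying the function is to call the single-season function once per season. The sketch below is only that idea in general form: parse_seasons is a hypothetical wrapper, and it assumes your single-season function takes the season year as its one argument. A stand-in parser is used here so the example runs without Retrosheet files or the Chadwick tools.

```r
# Run a single-season parser once per season and record what happened.
# `parse_one` is whatever single-season function you already use
# (assumed here to take the season year as its only argument).
parse_seasons <- function(seasons, parse_one) {
  results <- lapply(seasons, function(s) {
    tryCatch({
      parse_one(s)
      "ok"
    }, error = function(e) conditionMessage(e))
  })
  names(results) <- seasons
  results
}

# Stand-in parser for illustration only; it fails for one season
fake_parser <- function(season) {
  if (season == 2016) stop("download failed")
  invisible(season)
}

status <- parse_seasons(2015:2017, fake_parser)
print(status)
```

Wrapping each call in tryCatch means one bad download does not stop the remaining seasons, and the returned list shows which seasons need to be retried.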
Home Run Update Through Games of June 30
2018 home run hitting is still trailing the home run pattern of 2017. Here is a graph of the cumulative in-play home run rate for games played through June 30. It is interesting that the 2017 HR rate climbed during June; in 2018 the HR rate appears to have leveled off in recent weeks around 4.4 percent.
Below I plot the actual in-play home run rate for each week of the 2017 and 2018 seasons. In the last three weeks (week numbers 20, 21, 22), the 2018 rate has been significantly lower than the 2017 rate.
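For readers who want to build similar weekly and cumulative rates, here is a rough sketch. It assumes the in-play home run rate is home runs divided by batted balls in play (with home runs counted as in play), which may not match the exact definition behind the graphs. The weekly counts are made-up numbers, not 2017 or 2018 data.

```r
# Made-up weekly counts, for illustration only
weekly <- data.frame(
  week = 1:3,
  HR   = c(40, 50, 45),        # home runs hit that week
  BIP  = c(1000, 1000, 1000)   # batted balls in play, HR included (assumed)
)

# Rate for each week on its own
weekly$rate <- weekly$HR / weekly$BIP

# Cumulative rate through the end of each week
weekly$cum_rate <- cumsum(weekly$HR) / cumsum(weekly$BIP)

print(weekly)
```

The cumulative rate smooths out week-to-week noise, which is why a few low weeks show up clearly in the weekly plot while the cumulative curve only flattens.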
Further work will be needed to find the characteristics that are leading to this drop in home run hitting this season. In some ways, the variability of home run hitting in recent seasons has been a mystery.