I was rereading Russell Carleton’s excellent book The Shift and he makes an interesting observation in Chapter 6 . He states that over the course of a season, plate discipline (that is, the ability of a hitter to make good decisions within the strike zone) erodes. To demonstrate this, he looks at data from 2013 through 2017 and looks at whether each pitch produces a strike (bad decision) or a ball (good decision). During this period, Carleton found that hitters made good decisions on 61.7 of all pitches. Using logistic regression he finds that batters make good decisions on 61.96% of the pitches early in the season but it declines to 61.7% of all pitches at the end of the season. Certainly, this is a small change, but small changes can accumulate over many pitchers and perhaps some managers do better in handling this erosion of plate discipline.
Since this is an interesting pattern, I thought I could try to do a similar exploration using 2018 season data. This provides a good opportunity to demonstrate sample code from the tidyverse packages that we illustrate in the 2nd edition of Analyzing Baseball Data with R. I’m interested in exploring the global change in plate discipline during a season but also interested in how the changes in plate discipline vary among players.
I start by reading in the Statcast data that provides information on each pitch in the 2018 season.
library(tidyverse) sc <- read_csv("../StatcastData/statcast2018new.csv")
I have to figure out what Carleton meant by good and bad decisions on pitches. I am going to focus on called pitches — in the Statcast data, the description variable can be “ball”, “blocked_ball” or “called_strike” on these called pitches and I am considering a success getting a “ball”. In this data, 63.7% of the called pitches were balls (I’m ignoring blocked balls) which is similar in value to Carleton’s 61.7% statistic. Of course, you are welcome to use alternative measures of good batter decisions.
Below I focus on only the called pitches. I define a new variable Month that is the number of days since opening data divided by 30. The lubridate package is handy for working with dates.
library(lubridate) sc %>% filter(description %in% c("ball", "blocked_ball", "called_strike")) %>% mutate(Months = (as.numeric(ymd(game_date) - ymd("2018-03-29"))) / 30) %>% select(description, player_name, Months) -> sc_called
Following Carleton’s suggestion, I fit a logistic regression of the form
where is the probability of a good decision (ball) and Month is the number of days since opening day (in month units). I fit this model using the glm function — the fitted model is
fit <- glm(description == "ball" ~ Months, data = sc_called, family = binomial) fit$coef (Intercept) Months 0.549352270 0.003747663
Here the intercept 0.549 is an estimate of the probability of a called ball (on the logit scale) at the beginning of the season and 0.0037 is the increase in this logit for each month. Two quick observations: (1) in contrast to Carleton’s finding, there is actually an overall increase in the success probability, and (2) the overall size of the effect is small.
Plate Discipline for Individual Players
A next exploration step is to see how the change in plate discipline varies among players. I decide on limiting my analysis to hitters who had at least 1000 called pitches in the 2018 season. The data frame containing the pitches for these regulars is called sc_called_1000.
sc_called %>% group_by(player_name) %>% summarize(Called = n()) %>% inner_join(sc_called) %>% filter(Called >= 1000) -> sc_called_1000
For each of these players, I want to run a logistic regression model of the same form. Here is where the broom package is especially helpful. In the following tidyverse code, I split the data frame by player name, use the map function to run these multiple logistic regressions, and then use the map_df function to put the regression output in a tidy format.
For each player and each regression term, I display the estimate, standard error, etc.
regressions % split(pull(., player_name)) %>% map(~ glm(description == "ball" ~ Months, data = ., family = binomial)) %>% map_df(tidy, .id = "Name") %>% as_tibble() head(regressions) # A tibble: 6 x 6 Name term estimate std.error statistic p.value 1 Aaron Hicks (Interc… 0.702 0.125 5.62 1.95e-8 2 Aaron Hicks Months -0.0117 0.0330 -0.355 7.23e-1 3 Aaron Judge (Interc… 0.520 0.100 5.20 1.98e-7 4 Aaron Judge Months 0.0582 0.0370 1.57 1.16e-1 5 Adrian Bel… (Interc… 0.517 0.126 4.11 3.93e-5 6 Adrian Bel… Months -0.0308 0.0343 -0.898 3.69e-1
Last. for producing the plot that will be drawn, I’d like the intercept and slope estimates to be variables for each player — the spread function will put the estimates in this format.
regressions %>% select(Name, term, estimate) %>% spread(term, estimate) -> reg names(reg) <- "Intercept" head(reg) # A tibble: 6 x 3 Name Intercept Months 1 Aaron Hicks 0.702 -0.0117 2 Aaron Judge 0.520 0.0582 3 Adrian Beltre 0.517 -0.0308 4 Alex Bregman 0.551 -0.0215 5 Alex Gordon 0.386 0.0582 6 Andrelton Simmons 0.558 0.00566
Below I construct a scatterplot of the intercept and slope estimates for all of these hitters who faced at least 1000 called pitches in 2018. I add a red line corresponding to no erosion in plate discipline and I label some unusual points.
library(ggrepel) ggplot(reg_18, aes(Intercept, Months, label = Name)) + geom_point() + geom_label_repel(data = filter(reg_18, Intercept > 1.0 | Intercept 0.10)) + geom_hline(yintercept = 0, color = "red") + ggtitle("Intercept and Slope Estimates for the Individual Logistic Regressions")
As one might think, there are sizable differences in plate discipline that we are measuring by the intercept estimates. Freddie Freeman, for example, has great discipline, while Logan Forsythe appears to have poor discipline. We also see variation in the slope estimates. Rougned Odor and Kole Calhoun have large slope estimates which means that the chance of a ball increased during the season. In contrast, Mike Moustakas has a slope estimate about – 0.07 indicating that his plate discipline deteriorated during the season.
Making Sense of these Estimates
Before we get excited about the variation in player estimates, several comments should be made. First, the standard errors of the intercepts are much smaller (relative to the sizes of the estimates) than the standard errors of the slopes. This means that we are much less confident that the differences in the slope estimates correspond to real differences between players.
To make this point clear, I did the same exploration for the pitch data in the 2017 estimates, collecting the slope and intercept estimates for all players facing 1000 called pitches. I merged the estimates for the 2017 and 2018 seasons. Below I have graphed scatterplots of the intercept estimates (top) and slope estimates (bottom). We see that there is a reasonable positive association (correlation = 0.513) in the intercept estimates for the two seasons — this indicates that plate discipline (not swinging at pitches outside of the strike zone) is truly an ability of a player. In contrast, there is little association (correlation = 0.056) in the slope estimates — this means that erosion of plate discipline overall the season is not really a skill of a player.
I have written a function that computes the individual player logistic regression estimates using Statcast data for a specific season. All of the R code is available on my GithubGist site.