Last week, I received the following comment on my blog:
In chapter 1, you guys state that “In 2011, hitters compiled a .253 batting average on plate appearances where they fell behind 0-2. Conversely they hit .479 after going ahead 2-0.” I’m trying to replicate those numbers and even using the pbp11rc.csv file, I can’t even come close. Instead of batter average, did you mean OBP?
First, this must be a typo: I can’t imagine that hitters have an AVG of .479 after going ahead 2-0 (this must have been written by my coauthor Max!). All kidding aside, this made me reflect on how we computed the AVGs for plate appearances that pass through different counts. In our book we used Retrosheet play-by-play data, where the pitch sequence is recorded as a character string. Given the new Statcast data, these types of count effects are more easily found, since each observation in the Statcast data is a pitch rather than a plate appearance. Anyway, I thought it would be worthwhile to explain the nuts and bolts of finding count effects using the 2017 Statcast data. Then I’ll show you some graphs and insights on count effects.
Load some packages and the Statcast data
We begin by loading the tidyverse and stringr packages and the Statcast data from the 2017 season.
library(tidyverse)
library(stringr)
sc <- read_csv("statcast2017.csv")
Define some new variables
I create a unique identifier pa_id for each plate appearance. I create a variable count that is the balls-strikes count, and create indicator variables H and O that indicate if a hit or out occurred during the PA. I sort the rows by plate appearance and pitch number and store the new data frame in the variable sc.
sc %>%
  mutate(pa_id = paste(game_date, away_team, home_team, at_bat_number),
         count = paste(balls, strikes, sep = "-"),
         # H = 1 on the pitch that ends the PA with a hit
         H = ifelse(events %in% c("single", "double", "triple", "home_run"), 1, 0),
         # O = 1 on the pitch that ends the PA with an at-bat that is not a hit
         O = ifelse(events %in% c("double_play", "field_error", "field_out",
                                  "fielders_choice_out", "force_out",
                                  "grounded_into_double_play", "other_out",
                                  "strikeout", "strikeout_double_play",
                                  "triple_play"), 1, 0)) %>%
  arrange(pa_id, pitch_number) -> sc
Create specific count indicators
Next I use the group_by function to divide the data into individual PAs and use the variables c02, c20, etc. to indicate if the 0-2, 2-0, etc. counts occurred during the plate appearance. The variables Hit, Out, HR, and SO indicate if a hit, out, home run, or strikeout occurred during the PA.
sc %>%
  group_by(pa_id) %>%
  summarize(c02 = "0-2" %in% count,   # TRUE if the PA passed through 0-2
            c20 = "2-0" %in% count,
            c01 = "0-1" %in% count,
            c10 = "1-0" %in% count,
            c21 = "2-1" %in% count,
            c11 = "1-1" %in% count,
            c12 = "1-2" %in% count,
            c22 = "2-2" %in% count,
            c31 = "3-1" %in% count,
            c32 = "3-2" %in% count,
            c30 = "3-0" %in% count,
            Hit = ifelse(1 %in% H, 1, 0),
            Out = ifelse(1 %in% O, 1, 0),
            HR = ifelse("home_run" %in% events, 1, 0),
            SO = ifelse("strikeout" %in% events, 1, 0)) -> S
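To see what these per-PA flags look like, here is a minimal sketch on a made-up data frame with two plate appearances. The toy rows and the names toy and toy_pa are hypothetical and not part of the Statcast file. The sketch assumes, as in the Statcast data, that balls and strikes give the count before each pitch.

```r
library(dplyr)

# Hypothetical toy data: one row per pitch, two plate appearances (A and B).
toy <- data.frame(
  pa_id   = c("A", "A", "A", "B", "B"),
  balls   = c(0, 0, 0, 0, 1),
  strikes = c(0, 1, 2, 0, 0),
  events  = c(NA, NA, "strikeout", NA, "single"),
  stringsAsFactors = FALSE
)

toy_pa <- toy %>%
  mutate(count = paste(balls, strikes, sep = "-")) %>%
  group_by(pa_id) %>%
  summarize(
    c02 = "0-2" %in% count,                     # did the PA pass through 0-2?
    c10 = "1-0" %in% count,                     # did the PA pass through 1-0?
    Hit = as.integer("single" %in% events),     # only singles occur in this toy data
    SO  = as.integer("strikeout" %in% events)
  )

toy_pa
```

PA A passes through 0-2 and ends in a strikeout, so it gets c02 = TRUE and SO = 1. PA B passes through 1-0 and ends in a single, so it gets c10 = TRUE and Hit = 1.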
Compute the count effects
Next I write a short function count_effect that takes as input a specific count (like 1-2) and finds the home run rate, AVG, BABIP, home run rate on balls in play, and strikeout rate for all PAs that pass through that count.
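count_effect <- function(count) {
  # Assumed lookup: the flag column for count "1-2" is named c12,
  # matching the c02, c20, ... variables defined in the summarize step
  flag <- paste0("c", str_replace(count, "-", ""))
  S %>%
    filter(.data[[flag]]) %>%
    summarize(N = n(),
              HR_Rate = sum(HR) / (sum(Hit) + sum(Out)),
              AVG = sum(Hit) / (sum(Hit) + sum(Out)),
              BABIP = sum(Hit) / (sum(Hit) + sum(Out) - sum(SO)),
              HR_BIP = sum(HR) / (sum(Hit) + sum(Out) - sum(SO)),
              SO_Rate = sum(SO) / (sum(Hit) + sum(Out)))
}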
Map the function over all counts
I use the map_df function to apply this count_effect function over all of the possible counts. The output (displayed) is a data frame with all of the count effects.
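all_counts <- c("1-0", "0-1", "2-0", "1-1", "0-2",
                "2-1", "1-2", "3-1", "2-2", "3-2")
map_df(all_counts, count_effect) %>%
  mutate(Count = all_counts,
         N.pitches = c(1, 1, 2, 2, 2, 3, 3, 4, 4, 5)) -> df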
df
       N HR_Rate   AVG BABIP HR_BIP SO_Rate Count N.pitches
 1 74207  0.0441 0.272 0.345 0.0558   0.210 1-0          1.
 2 92998  0.0291 0.224 0.330 0.0427   0.319 0-1          1.
 3 26247  0.0517 0.289 0.355 0.0634   0.185 2-0          2.
 4 74940  0.0342 0.238 0.337 0.0486   0.295 1-1          2.
 5 37921  0.0188 0.166 0.322 0.0364   0.484 0-2          2.
 6 39196  0.0410 0.256 0.347 0.0556   0.262 2-1          3.
 7 54394  0.0224 0.176 0.324 0.0411   0.457 1-2          3.
 8 16533  0.0483 0.277 0.354 0.0617   0.217 3-1          4.
 9 44910  0.0270 0.195 0.338 0.0467   0.422 2-2          4.
10 26081  0.0339 0.219 0.346 0.0535   0.367 3-2          5.
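As a quick consistency check on these definitions: BABIP here is hits divided by at-bats minus strikeouts, so it should equal AVG / (1 - SO_Rate) up to rounding. The sketch below retypes three columns of the table by hand (the vector names are my own). Note that this version of BABIP keeps home runs in both the numerator and the denominator, unlike the more common definition that removes them.

```r
# Columns retyped from the table, in the same row order (1-0, 0-1, ..., 3-2).
avg     <- c(0.272, 0.224, 0.289, 0.238, 0.166, 0.256, 0.176, 0.277, 0.195, 0.219)
so_rate <- c(0.210, 0.319, 0.185, 0.295, 0.484, 0.262, 0.457, 0.217, 0.422, 0.367)
babip   <- c(0.345, 0.330, 0.355, 0.337, 0.322, 0.347, 0.324, 0.354, 0.338, 0.346)

# BABIP implied by AVG and SO_Rate: H / (AB - SO) = (H / AB) / (1 - SO / AB)
implied <- avg / (1 - so_rate)

# Largest disagreement with the tabled BABIP; small gaps are rounding error
max_gap <- max(abs(implied - babip))
round(implied, 3)
max_gap
```

The implied values agree with the BABIP column to within rounding of the three-digit inputs.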
Batting averages passing through all counts
Based on this table, I’ll show several plots. Here I plot the mean AVG of PAs that pass through each count with at least one ball or strike recorded. The red line corresponds to the mean AVG for the 2017 season.
If I group the points in the graph by the number of strikes, I see three lines with about the same slope (see below). A quick calculation shows that, for a fixed number of strikes, adding a ball to the count increases the AVG by 17-18 points.
In-play home run rates
Here I graph the home run rate on balls in play for PAs that pass through the different counts. Again, I see three roughly parallel lines, and I find the slope of each line. For a fixed number of strikes, adding a ball to the count raises the in-play home run rate by about 0.007.
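The slope calculations for AVG and for the in-play home run rate can be reproduced from the table of count effects. This is a sketch of one way to do it, not necessarily the calculation behind the graphs: it retypes the AVG and HR_BIP columns by hand and fits a separate least-squares line for each number of strikes. The names tab and slope_by_strikes are my own.

```r
# Count effects retyped from the table (rounded values).
tab <- data.frame(
  Count  = c("1-0", "0-1", "2-0", "1-1", "0-2", "2-1", "1-2", "3-1", "2-2", "3-2"),
  AVG    = c(0.272, 0.224, 0.289, 0.238, 0.166, 0.256, 0.176, 0.277, 0.195, 0.219),
  HR_BIP = c(0.0558, 0.0427, 0.0634, 0.0486, 0.0364,
             0.0556, 0.0411, 0.0617, 0.0467, 0.0535),
  stringsAsFactors = FALSE
)

# Split the "balls-strikes" label into its two numbers.
tab$balls   <- as.integer(substr(tab$Count, 1, 1))
tab$strikes <- as.integer(substr(tab$Count, 3, 3))

# For each number of strikes, regress the measure on balls and keep the slope.
slope_by_strikes <- function(measure) {
  sapply(split(tab, tab$strikes), function(d) {
    unname(coef(lm(d[[measure]] ~ d$balls))[2])
  })
}

avg_slopes <- slope_by_strikes("AVG")      # change in AVG per extra ball
hr_slopes  <- slope_by_strikes("HR_BIP")   # change in HR_BIP per extra ball

round(avg_slopes, 4)
round(hr_slopes, 4)
```

With the rounded table values, the three AVG slopes fall between about 0.017 and 0.018 per ball, and the three in-play home run slopes fall between about 0.006 and 0.008, averaging a little under 0.007. The zero-strike line is fit through only two points (1-0 and 2-0), since 3-0 is not in the table.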
If the R code on this page gets garbled, then you can find it on my GitHub Gist page.
I include the code I used for constructing the graphs. Also I provide a link where you can download the Statcast data from my website.
- Most batting measures are strongly affected by the count — this provides a concrete illustration of what it means to be a pitcher’s count or a batter’s count.
- Here I am focusing on PAs that pass through a given count. In other studies, you may be interested in the specific count on the pitch that is put into play. For example, if you were exploring count effects in exit velocities, you would look at the count on the pitches put in play. Batters tend to hit 3-0 pitches much harder than 0-2 pitches.
- Next week, I’ll explore simple ways of answering “What is the optimal launch angle to get a base hit?”
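The second point above, looking at the count on the pitch that is put in play, might be sketched as follows. This is a minimal illustration on made-up numbers, not a result: the toy data frame and the name ev_by_count are hypothetical. It assumes the Statcast columns type (with "X" marking a ball in play) and launch_speed (exit velocity).

```r
library(dplyr)

# Hypothetical toy pitches; only rows with type == "X" are balls in play.
toy <- data.frame(
  balls        = c(3, 3, 0, 0, 0, 1),
  strikes      = c(0, 0, 2, 2, 2, 1),
  type         = c("X", "X", "X", "X", "S", "B"),
  launch_speed = c(101, 97, 84, 88, NA, NA),
  stringsAsFactors = FALSE
)

ev_by_count <- toy %>%
  filter(type == "X") %>%                              # keep balls in play only
  mutate(count = paste(balls, strikes, sep = "-")) %>% # count on that pitch
  group_by(count) %>%
  summarize(N = n(),
            mean_EV = mean(launch_speed, na.rm = TRUE))

ev_by_count
```

Unlike the S data frame, each ball in play contributes to exactly one count here: the count on the pitch that was hit.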
This is great. Do you have the statcast 2017 file stored somewhere?
Yes, I just added the link to the R script that I posted on my github gist site.
Last spring a team from the Booth School of Business put together a similar analysis which evaluated the effect of count on the expected value of a plate appearance. We performed regression analysis to control for several factors. In this paper, we used a metric defined as slugging + walks. Future iterations/abstracts used wOBA which is a better metric.