Monthly Archives: March, 2018

Count Effects from Statcast Data


Last week, I received the following comment on my blog:

In chapter 1, you guys state that “In 2011, hitters compiled a .253 batting average on plate appearances where they fell behind 0-2. Conversely they hit .479 after going ahead 2-0.” I’m trying to replicate those numbers and even using the pbp11rc.csv file, I can’t even come close. Instead of batter average, did you mean OBP?

First, this must be a typo: I can’t imagine that hitters have an AVG of .479 after going ahead 2-0 (this must have been written by my coauthor Max!). All kidding aside, this made me reflect on how we computed these AVGs for plate appearances that pass through different counts. In our book we used Retrosheet play-by-play data, where the pitch sequence is recorded as a string. Given the new Statcast data, these types of count effects are more easily found since each observation in the Statcast data is a pitch rather than a plate appearance. Anyway, I thought it would be worthwhile to explain the nuts and bolts of finding count effects using the 2017 Statcast data. Then I’ll show you some graphs and insights on count effects.

Load some packages and the Statcast data

We begin by loading in the tidyverse and stringr packages and the Statcast data from the 2017 season.

library(tidyverse)
library(stringr)
sc <- read_csv("statcast2017.csv")

Define some new variables

I create a unique identifier pa_id for each plate appearance. I create a variable count that is the balls-strikes count, and create indicator variables H and O that indicate if a hit or out occurred during the PA. I sort the PAs and pitch values and store the new data frame into the variable sc.

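sc %>% 
  mutate(pa_id = paste(game_date, away_team,
                       home_team, at_bat_number),
         # balls-strikes count when the pitch is thrown, e.g. "0-2"
         count = paste(balls, strikes, sep = "-"),
         # H = 1 on the pitch where the PA ends in a hit
         H = ifelse(events %in%
                c("single", "double", "triple",
                  "home_run"), 1, 0),
         # O = 1 on the pitch where the PA ends in an out
         # (or an error, which still counts as an at-bat)
         O = ifelse(events %in%
                c("double_play", "field_error",
                  "field_out", "fielders_choice_out",
                  "force_out", "grounded_into_double_play",
                  "other_out", "strikeout", 
                  "strikeout_double_play", "triple_play"),
                1, 0)) %>% 
  arrange(pa_id, pitch_number) -> sc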
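To see what these new variables look like, here is a minimal sketch on a toy data frame. The two plate appearances, the teams, and the date are invented for illustration, and the list of out events is shortened; only the column names follow the Statcast file.

```r
library(dplyr)

# Toy pitch-level data: two PAs from one hypothetical game (values invented).
toy <- tibble(
  game_date     = "2017-04-02",
  away_team     = "NYY",
  home_team     = "TB",
  at_bat_number = c(1, 1, 1, 2, 2),
  pitch_number  = c(1, 2, 3, 1, 2),
  balls         = c(0, 0, 0, 0, 1),
  strikes       = c(0, 1, 2, 0, 0),
  events        = c(NA, NA, "strikeout", NA, "single")
)

toy_sc <- toy %>%
  mutate(pa_id = paste(game_date, away_team, home_team, at_bat_number),
         count = paste(balls, strikes, sep = "-"),
         # events is NA on non-terminal pitches, and NA %in% ... is FALSE
         H = ifelse(events %in% c("single", "double", "triple", "home_run"), 1, 0),
         O = ifelse(events %in% c("field_out", "strikeout"), 1, 0)) %>%
  arrange(pa_id, pitch_number)

toy_sc %>% select(pa_id, pitch_number, count, H, O)
```

Each row is still a pitch; H and O are nonzero only on the last pitch of a PA, which is what lets the next step collapse the pitches into one row per PA.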

Create specific count indicators

Next I use the group_by function to divide the data into individual PAs and use the variables c02, c20, etc. to indicate if 0-2, 2-0, etc. counts occurred during the plate appearance. Also the variables Hit, Out, HR, SO indicate if a hit, out, home run, or strikeout occurred during the PA.

sc %>% 
  group_by(pa_id) %>% 
  summarize(c02 = "0-2" %in% count,
            c20 = "2-0" %in% count,
            c01 = "0-1" %in% count,
            c10 = "1-0" %in% count,
            c21 = "2-1" %in% count,
            c11 = "1-1" %in% count,
            c12 = "1-2" %in% count,
            c22 = "2-2" %in% count,
            c31 = "3-1" %in% count,
            c32 = "3-2" %in% count,
            c30 = "3-0" %in% count,
            Hit = ifelse(1 %in% H, 1, 0),
            Out = ifelse(1 %in% O, 1, 0),
            HR = ifelse("home_run" %in% events,
                        1, 0),
            SO = ifelse("strikeout" %in% events,
                        1, 0)) -> S
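The per-PA summary can be checked on a small example. This is only a sketch: the two plate appearances below are invented, and just two of the count indicators are computed.

```r
library(dplyr)

# Toy pitch-level data (invented): PA "A" ends in a strikeout at 0-2,
# PA "B" ends in a home run at 2-0.
toy_sc <- tibble(
  pa_id  = c("A", "A", "A", "B", "B", "B"),
  count  = c("0-0", "0-1", "0-2", "0-0", "1-0", "2-0"),
  events = c(NA, NA, "strikeout", NA, NA, "home_run"),
  H      = c(0, 0, 0, 0, 0, 1),
  O      = c(0, 0, 1, 0, 0, 0)
)

toy_S <- toy_sc %>%
  group_by(pa_id) %>%
  summarize(c02 = "0-2" %in% count,   # TRUE if any pitch was thrown at 0-2
            c20 = "2-0" %in% count,   # TRUE if any pitch was thrown at 2-0
            Hit = ifelse(1 %in% H, 1, 0),
            Out = ifelse(1 %in% O, 1, 0),
            HR  = ifelse("home_run" %in% events, 1, 0),
            SO  = ifelse("strikeout" %in% events, 1, 0))

toy_S
```

Inside summarize, count is the vector of all counts seen in that PA, so the %in% test asks whether the PA ever reached the given count, not whether it ended there.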

Compute the count effects

Next I write a short function count_effect that takes as input the specific count (like 1-2) and finds the home run rate, AVG, BABIP, home run rate on balls in play, and strikeout rate for all PAs that pass through that specific count.

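count_effect <- function(count) {
  # name of the indicator column, e.g. "1-2" -> "c12"
  var <- paste0("c", str_remove(count, "-"))
  S %>% 
    filter(.data[[var]]) %>%    # keep PAs that pass through this count
    summarize(N = n(),
           HR_Rate = sum(HR) / (sum(Hit) + sum(Out)),
           AVG = sum(Hit) / (sum(Hit) + sum(Out)),
           BABIP = sum(Hit) / (sum(Hit) + sum(Out) - sum(SO)),
           HR_BIP = sum(HR) / (sum(Hit) + sum(Out) - sum(SO)),
           SO_Rate = sum(SO) / (sum(Hit) + sum(Out)))
}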

Map the function over all counts

I use the map_df function to apply count_effect over each of the counts. The output (displayed) is a data frame with all of the count effects.

all_counts <- c("1-0", "0-1", "2-0", "1-1", "0-2",
                "2-1", "1-2", "3-1", "2-2", "3-2")
map_df(all_counts, count_effect) %>% 
  mutate(Count = all_counts,
         N.pitches = c(1, 1, 2, 2, 2, 3, 3, 4, 4, 5)) -> df
df

       N HR_Rate   AVG BABIP HR_BIP SO_Rate Count N.pitches
 1 74207  0.0441 0.272 0.345 0.0558   0.210 1-0          1.
 2 92998  0.0291 0.224 0.330 0.0427   0.319 0-1          1.
 3 26247  0.0517 0.289 0.355 0.0634   0.185 2-0          2.
 4 74940  0.0342 0.238 0.337 0.0486   0.295 1-1          2.
 5 37921  0.0188 0.166 0.322 0.0364   0.484 0-2          2.
 6 39196  0.0410 0.256 0.347 0.0556   0.262 2-1          3.
 7 54394  0.0224 0.176 0.324 0.0411   0.457 1-2          3.
 8 16533  0.0483 0.277 0.354 0.0617   0.217 3-1          4.
 9 44910  0.0270 0.195 0.338 0.0467   0.422 2-2          4.
10 26081  0.0339 0.219 0.346 0.0535   0.367 3-2          5.
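As a quick sanity check on the table, assume that strikeouts are what is removed from the denominator for the two in-play rates. Then BABIP equals AVG / (1 - SO_Rate) and HR_BIP equals HR_Rate / (1 - SO_Rate), and any row can be verified by hand. A sketch using the 0-2 row:

```r
# Values copied from the 0-2 row of the table above.
avg     <- 0.166
hr_rate <- 0.0188
so_rate <- 0.484

# Dividing by (1 - SO_Rate) removes strikeouts from the denominator.
babip  <- avg / (1 - so_rate)
hr_bip <- hr_rate / (1 - so_rate)

round(babip, 3)
round(hr_bip, 4)
```

Up to rounding, both values agree with the BABIP and HR_BIP columns for that row.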

Batting averages passing through all counts

Based on this table, I’ll show several plots. Here I plot the AVG of PAs that pass through each count with at least one ball or strike recorded. The red line corresponds to the mean AVG for the 2017 season.


If I group the points in the graph by the number of strikes, I see three lines with about the same slope (see below). A quick calculation shows that, for a fixed number of strikes, adding a ball to the count increases the AVG by 17-18 points.

[Figure: AVG by count, with points grouped by number of strikes]

In-play home run rates

Here I graph the home run rate on balls in play for PAs that pass through the different counts. Again, I see three parallel lines and I find the slope of each line. Here for a fixed number of strikes, adding a ball to the count raises the HR_BIP rate by about 0.007.
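The slopes quoted for AVG and for the in-play home run rate can be reproduced from the table. The sketch below is one way to do it, not necessarily the calculation behind the graphs: it fits a least-squares line of each rate on the number of balls, separately for each number of strikes. The helper slope_by_strikes is a hypothetical name introduced here.

```r
# Values copied from the count-effects table.
df <- data.frame(
  Count  = c("1-0", "0-1", "2-0", "1-1", "0-2",
             "2-1", "1-2", "3-1", "2-2", "3-2"),
  AVG    = c(0.272, 0.224, 0.289, 0.238, 0.166,
             0.256, 0.176, 0.277, 0.195, 0.219),
  HR_BIP = c(0.0558, 0.0427, 0.0634, 0.0486, 0.0364,
             0.0556, 0.0411, 0.0617, 0.0467, 0.0535)
)
df$balls   <- as.numeric(substr(df$Count, 1, 1))
df$strikes <- as.numeric(substr(df$Count, 3, 3))

# Hypothetical helper: slope of `stat` on balls, one slope per strike count.
slope_by_strikes <- function(data, stat) {
  sapply(split(data, data$strikes), function(d) {
    unname(coef(lm(d[[stat]] ~ d$balls))[2])
  })
}

avg_slopes <- slope_by_strikes(df, "AVG")
hr_slopes  <- slope_by_strikes(df, "HR_BIP")

round(avg_slopes, 4)   # change in AVG per additional ball
round(hr_slopes, 4)    # change in HR_BIP per additional ball
mean(hr_slopes)
```

The three AVG slopes come out at roughly 17 to 18 points per ball, and the HR_BIP slopes average a little under 0.007. Note that the zero-strike line is fit to only two points (1-0 and 2-0), so its slope is just the difference between them.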



If the R code on this page gets garbled, then you can find it on my GitHub Gist page.
I include the code I used for constructing the graphs. I also provide a link where you can download the Statcast data from my website.

  1. Most batting measures are strongly affected by the count. This provides a concrete illustration of what it means to be in a pitcher’s count or a batter’s count.
  2. Here I am focusing on PAs that pass through a given count. In other studies, you may be interested in the specific count on the pitch that is put into play. For example, if you were exploring count effects in exit velocities, you would look at the count on the pitches put in play. Batters tend to hit 3-0 pitches much harder than 0-2 pitches.
  3. Next week, I’ll explore simple ways of answering “What is the optimal launch angle to get a base hit?”
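For the second point above, the grouping is by the count on the pitch itself rather than by the counts a PA passes through. Here is a minimal sketch on invented data, assuming the Statcast columns type (where "X" marks a ball put in play) and launch_speed (exit velocity); the numbers are made up and carry no empirical meaning.

```r
library(dplyr)

# Toy pitch-level data (values invented for illustration).
toy <- tibble(
  balls        = c(3, 3, 0, 0, 1, 0),
  strikes      = c(0, 0, 2, 2, 1, 2),
  type         = c("X", "X", "X", "X", "B", "S"),
  launch_speed = c(101.2, 97.4, 84.0, 88.6, NA, NA)
)

ev_by_count <- toy %>%
  filter(type == "X") %>%                          # only pitches put in play
  mutate(count = paste(balls, strikes, sep = "-")) %>%
  group_by(count) %>%
  summarize(N = n(),
            mean_EV = mean(launch_speed, na.rm = TRUE))

ev_by_count
```

Because each in-play pitch belongs to exactly one count, the groups here do not overlap, unlike the pass-through groups used in the table.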