I recently gave a talk on catcher framing at a local university. I presented contour plots over the zone, where each contour line traced the locations at which the fitted probability of a called strike was equal to 0.5. I called the region inside this line the actual strike zone, contrasting it with the rectangular strike zone region. During the talk I was asked how one constructs these graphs, and that question motivated this post (I hope that particular person is reading it!).
A second motivation for writing this post is to correct a small error in our Catcher Framing chapter in the 2nd edition of ABWR. We focused on the pitches that were called (no swings) and modeled the probability that the called pitch was a strike. We incorrectly used the Statcast “type” variable in that chapter. In this post, I correctly identify the called pitches using the Statcast “description” variable. A current list of errata in the 2nd edition of ABWR can be found here, and we would appreciate hearing about any other errors or confusing passages you find in the book.
By the way, there have been similar posts on this subject in this blog. In one of those earlier posts, I explored count and umpire effects on the “true” strike zone.
Why Does One Write an R Package?
I’ve written close to 200 posts in this blog and each post corresponds to some R exploration of baseball data. Each post has a dedicated folder on my computer where I store the data, R scripts or Markdown files, and graphic image files.
My method of organizing my R work is okay, but if you find yourself reusing particular pieces of code, I recommend putting your common functions into an R package. It is easy to construct an R package and maintain it in a GitHub repository. Also, I find that it is easier to reuse and improve my earlier work when the functions are contained in a package. Here I am going to illustrate several functions in a new package called CalledStrike. (This is a very new package and I hope to add more functions in the coming months.)
A Contour Graph of the Probability of a Called Strike
Suppose I am interested in learning about the pattern of called strikes for a specific pitcher, say Aaron Nola in the 2018 season. Here is the step-by-step process:
- (Scrape) I use the baseballr package to scrape all of the Statcast data for Aaron Nola from Baseball Savant — my data frame is called Aaron.
- (Filter) I consider only the pitches where the batter doesn’t swing.
- (Model) I am interested in seeing how the probability of a called strike depends on the location of the pitch, so I fit the generalized additive model log(p / (1 - p)) = s(plate_x, plate_z), where p is the probability of a called strike and s() is a smooth function of the location.
- (Graph) Over a grid of values, I find the fitted probability of a strike. Using these values, I construct a contour plot. I will use level values of 0.5 and 0.9 — the border of the actual zone is where the probability of a strike is 0.50 and the border of the “likely is a strike” zone is where the probability of a strike is 0.90.
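The filter, model, and graph steps can be sketched in R on simulated data. This is a minimal illustration under stated assumptions, not the CalledStrike implementation: the pitches and outcomes are simulated rather than scraped, the object names (pitches, taken, zones) are hypothetical, and the set of “description” values treated as called pitches is an assumption about how one might filter. The model is fit with gam() from the mgcv package, which is one possible choice, and the contour coordinates come from contourLines() in base R.

```r
library(mgcv)  # provides gam(); ships with standard R installations

set.seed(1)

# Simulated stand-in for a Statcast data frame (NOT real data).
n <- 3000
pitches <- data.frame(
  plate_x = runif(n, -2, 2),     # horizontal location (feet)
  plate_z = runif(n, 0.5, 4.5)   # vertical location (feet)
)

# Assumed "true" zone: strike probability decays away from (0, 2.5).
p_true <- plogis(6 - 6 * (pitches$plate_x / 0.95)^2 -
                   6 * ((pitches$plate_z - 2.5) / 1.1)^2)
swing  <- rbinom(n, 1, 0.45) == 1
called <- ifelse(rbinom(n, 1, p_true) == 1, "called_strike", "ball")
swung  <- sample(c("swinging_strike", "foul", "hit_into_play"),
                 n, replace = TRUE)
pitches$description <- ifelse(swing, swung, called)

# (Filter) keep only pitches where the batter did not swing.
# Assumption: these description values identify the called pitches.
called_values <- c("ball", "blocked_ball", "called_strike")
taken <- subset(pitches, description %in% called_values)
taken$Strike <- as.integer(taken$description == "called_strike")

# (Model) logit of the strike probability is a smooth function of location.
fit <- gam(Strike ~ s(plate_x, plate_z), family = binomial, data = taken)

# (Graph) fitted probabilities over a grid of locations.
gx <- seq(-1.5, 1.5, length.out = 50)
gz <- seq(1, 4, length.out = 50)
grid <- expand.grid(plate_x = gx, plate_z = gz)
grid$prob <- as.numeric(predict(fit, newdata = grid, type = "response"))

# expand.grid() varies plate_x fastest, so rows index gx and columns index gz.
prob_mat <- matrix(grid$prob, nrow = length(gx), ncol = length(gz))

# Coordinates of the 0.5 ("actual") and 0.9 ("likely is a strike") borders.
zones <- contourLines(gx, gz, prob_mat, levels = c(0.5, 0.9))
```

Each element of zones holds a level together with the x and y coordinates of one contour line, so the borders can be drawn with any plotting system. Passing gx, gz, and prob_mat to base R's contour() with the same levels draws them directly.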
In my CalledStrike package, I write small functions (specifically the functions setup_called(), gam_fit(), plot()) that implement each of these steps, and I apply them in sequence using the pipe operator. Notice below that I add a title to my graph and the function centertitle() adds the ggplot2 code to center my title with a specific font size and color.
Below I use these commands to compare the actual and likely called strike zones for Aaron Nola, Clayton Kershaw, Lucas Giolito, and Dylan Bundy. I deliberately chose two pitchers with excellent 2018 seasons and two pitchers with subpar seasons. It is interesting that, among these four, the better pitchers tend to get a larger called strike region. For example, both Nola and Kershaw tend to get more called strikes on low pitches.
Both Kershaw and Nola are well-known for their curve balls. It is easy to revise my code to focus on a particular pitch type. One thing that stands out in the graph below is that Kershaw appears to have an unusually wide strike zone for his curve balls.
Misses on Swung Pitches
Above I focused on called strikes and balls. If the batter swings at the pitch, then I am interested in modeling the probability that he misses on his swing. To give you a sense of the data, I have plotted all of the pitch locations on Nola curve balls where the batter swings, and I have colored each point by the outcome (a lighter color corresponds to a miss).
Following my earlier approach, I can use a generalized additive model where the logit of the probability of a miss is represented as a smooth function of the location coordinates. Below I show the 0.1, 0.3, and 0.5 contours for the probability of a miss on a curve ball for Aaron Nola (left) and all right-handed pitchers (right). Notice that the contour lines for Nola are straight lines. We have limited data for Nola, and in that case the fitting procedure effectively falls back to a model that is linear in the location coordinates on the logit scale. We get better-looking contour lines for the dataset consisting of all swung curve balls for right-handers. The basic message is that hitters will tend to miss curve balls (thrown by right-handers) that are in the lower-right section of the zone. Nola’s curve ball is effective, so note that his 0.5 miss-probability contour sits higher in the zone than the one for the typical right-hander's curve ball.
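The miss model and the effect of sample size can be sketched in R as follows. This is a hedged illustration on simulated swings, not the code behind the graphs: the helper names sim_swings() and fit_miss() are hypothetical, the assumed "true" miss pattern is invented for the simulation, and mgcv's gam() is again one possible fitting function. For a two-dimensional smooth in mgcv, an effective degrees of freedom value close to 2 indicates a fit that is essentially a plane on the logit scale, which produces straight contour lines.

```r
library(mgcv)  # provides gam(); ships with standard R installations

set.seed(2)

# Hypothetical helper: simulate n swung-at pitches (NOT real data).
# Assumption: misses are more likely low in the zone and at larger plate_x.
sim_swings <- function(n) {
  d <- data.frame(plate_x = rnorm(n, 0, 0.8),
                  plate_z = rnorm(n, 2.2, 0.8))
  eta <- -1 + 0.8 * d$plate_x - 1.2 * (d$plate_z - 2.2) - 0.5 * d$plate_x^2
  d$Miss <- rbinom(n, 1, plogis(eta))
  d
}

# Hypothetical helper: logit of the miss probability is a smooth in location.
fit_miss <- function(d) {
  gam(Miss ~ s(plate_x, plate_z), family = binomial, data = d)
}

small <- fit_miss(sim_swings(150))    # one pitcher's worth of swings
large <- fit_miss(sim_swings(5000))   # a pooled group of pitchers

# Effective degrees of freedom of the location smooth in each fit.
edf_small <- as.numeric(summary(small)$edf)
edf_large <- as.numeric(summary(large)$edf)

# Fitted miss probabilities over a grid, using the larger sample.
gx <- seq(-1.5, 1.5, length.out = 40)
gz <- seq(1, 4, length.out = 40)
grid <- expand.grid(plate_x = gx, plate_z = gz)
prob <- matrix(as.numeric(predict(large, newdata = grid, type = "response")),
               nrow = length(gx), ncol = length(gz))

# Coordinates of the 0.1, 0.3, and 0.5 miss-probability contours.
miss_lines <- contourLines(gx, gz, prob, levels = c(0.1, 0.3, 0.5))
```

Comparing edf_small with edf_large gives a rough sense of how much curvature each sample can support. With few swings the smoothing penalty tends to pull the surface toward its linear part.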
All of the R code for this particular study, including the commands to collect the Statcast data, is available on my GitHub Gist site.
Hopefully I got some readers interested in creating their own graphs. What is fascinating about called balls and strikes is that the pattern depends on a number of factors such as …
- the pitcher (we saw this in our comparison of the four pitchers)
- the side of the batter
- the individual batter
- the umpire (each umpire has his own strike zone)
- the catcher (this is the basis for the whole discussion of framing)
- the count (the actual strike zone shrinks when the pitcher has the advantage)
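One way to let the zone vary with a factor from this list is to fit a separate location smooth for each level of that factor. The sketch below does this for the side of the batter, using simulated data. It rests on several assumptions: the data are invented, the 0.2-foot shift of the zone by batter side exists only to give the simulation something to detect, the column name stand is used for the batter's side, and the by= argument of mgcv's s() is one of several ways to model such an effect.

```r
library(mgcv)  # provides gam(); ships with standard R installations

set.seed(3)

# Simulated called pitches (NOT real data) with a batter-side factor.
n <- 4000
d <- data.frame(
  plate_x = runif(n, -2, 2),
  plate_z = runif(n, 0.5, 4.5),
  stand   = factor(sample(c("L", "R"), n, replace = TRUE))
)

# Assumption for the simulation only: the zone centre shifts by batter side.
shift <- ifelse(d$stand == "L", 0.2, -0.2)
p_true <- plogis(6 - 6 * ((d$plate_x - shift) / 0.95)^2 -
                   6 * ((d$plate_z - 2.5) / 1.1)^2)
d$Strike <- rbinom(n, 1, p_true)

# A separate smooth surface for each batter side, plus a side main effect.
fit <- gam(Strike ~ stand + s(plate_x, plate_z, by = stand),
           family = binomial, data = d)

# Fitted probabilities along a horizontal slice at plate_z = 2.5.
slice <- expand.grid(
  plate_x = seq(-1.5, 1.5, length.out = 61),
  plate_z = 2.5,
  stand   = factor(c("L", "R"), levels = levels(d$stand))
)
slice$prob <- as.numeric(predict(fit, newdata = slice, type = "response"))

# Midpoint of the stretch where a strike is more likely than not, by side.
inside <- slice$prob > 0.5
zone_centre <- tapply(slice$plate_x[inside], slice$stand[inside], mean)
```

The same pattern applies to any categorical factor with enough pitches per level, although factors with many levels (individual batters or umpires) leave little data for each surface.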
I encourage the reader to use the CalledStrike package to do his or her own exploration using the Statcast data.