Many measures of batting performance in baseball are expressed as rates, such as batting average, on-base percentage, home run rate, and swing rate. We also know that these rates depend strongly on the location of the pitch around the zone. To explore how any baseball rate measure depends on the zone location, a first step is to divide the zone area into bins and then find the counts and rates in each bin. This post discusses a generic R function that computes and graphs an arbitrary baseball rate over bins that cover a rectangle of pitch locations. (By the way, this post was inspired by an email from an interested reader, Josh, who asked about this binning process.)
The bin_rate_plot() function
I wrote a general function bin_rate_plot() with the following inputs:
- the Statcast data frame bh
- the binary variable (0/1) Outcome: this could be an indicator of a Hit, a Home Run, a Called Strike, or a Swinging Strike, depending on the rate of interest
- xpars, zpars: these define the rectangle and the number of bins in each direction. By default, xpars = c(-1.3, 1.3, 25), which means we are focusing on 25 bins in the horizontal direction between -1.3 and 1.3, and zpars = c(1.15, 3.95, 25), which means we are looking at 25 bins in the vertical direction between 1.15 and 3.95
- a title for the graph
The output of this function is (1) a data frame giving the bins, the count of occurrences of the outcome, and the sample size in each bin, and (2) a graph of the rates over the zone.
Looking inside bin_rate_plot()
Essentially there are two operations in the function. I use two applications of the cut_interval() function (from the ggplot2 package) to create the bins in the x and z directions. Then, using group_by() and summarize() (from the dplyr package), I find the rate of the outcome in each bin. Once I have this summary data frame, I use the ggplot2 package to construct a scatterplot of the midpoints of the bins, where the color of the plotting symbol corresponds to the value of the rate.
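The two operations can be sketched roughly as follows. This is not the actual bin_rate_plot() code. It is a minimal illustration that assumes the pitch locations sit in columns named plate_x and plate_z, and the helper name bin_rates_sketch() and the simulated pitches are made up for the example. I also swap in base R's cut() with explicit breaks in place of cut_interval(), so that the bin edges come from xpars and zpars rather than from the range of the data.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper (not the author's function): bin the locations,
# then find the count, sample size, and rate of a 0/1 outcome in each bin.
bin_rates_sketch <- function(d, outcome,
                             xpars = c(-1.3, 1.3, 25),
                             zpars = c(1.15, 3.95, 25)) {
  x_breaks <- seq(xpars[1], xpars[2], length.out = xpars[3] + 1)
  z_breaks <- seq(zpars[1], zpars[2], length.out = zpars[3] + 1)
  # Midpoint of each bin, used as the plotting location
  x_mid <- (head(x_breaks, -1) + tail(x_breaks, -1)) / 2
  z_mid <- (head(z_breaks, -1) + tail(z_breaks, -1)) / 2

  d %>%
    mutate(Outcome = outcome,
           x_bin = cut(plate_x, x_breaks, labels = FALSE, include.lowest = TRUE),
           z_bin = cut(plate_z, z_breaks, labels = FALSE, include.lowest = TRUE)) %>%
    filter(!is.na(x_bin), !is.na(z_bin)) %>%   # drop pitches outside the rectangle
    group_by(x_bin, z_bin) %>%
    summarize(N = n(), Count = sum(Outcome), .groups = "drop") %>%
    mutate(Rate = Count / N, x = x_mid[x_bin], z = z_mid[z_bin])
}

# Simulated pitches (assumption: real data would come from Statcast)
set.seed(1)
pitches <- data.frame(plate_x = runif(5000, -2, 2),
                      plate_z = runif(5000, 0.5, 4.5))
pitches$HR <- rbinom(5000, 1, 0.03)

S <- bin_rates_sketch(pitches, pitches$HR)

# Scatterplot of bin midpoints, colored by the binned rate
p <- ggplot(S, aes(x, z, color = Rate)) +
  geom_point(size = 4, shape = 15) +
  coord_fixed() +
  ggtitle("Simulated home run rates")
```

Printing p draws the graph, and S holds the bin-level counts and rates.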
If you have some Statcast data and the function bin_rate_plot(), you can play with this to get some interesting graphs. Here are some examples:
Home Run Rates
First I load in the 2019 Statcast data frame containing information on 732,473 pitches. I define three new variables Called_Strike, Foul, and HR that are indicators of those respective pitch outcomes.
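Here is a sketch of how such indicator variables might be defined with dplyr. The column names description and events, and the values "called_strike", "foul", and "home_run", follow the usual Statcast coding, but treat them as assumptions; the small data frame sc_toy is made up for illustration.

```r
library(dplyr)

# Toy stand-in for the Statcast data frame (made-up rows)
sc_toy <- data.frame(
  description = c("called_strike", "foul", "hit_into_play", "ball"),
  events      = c(NA, NA, "home_run", NA)
)

sc_toy <- sc_toy %>%
  mutate(
    Called_Strike = as.numeric(description == "called_strike"),
    Foul          = as.numeric(description == "foul"),
    # events is missing on most pitches, so a missing value counts as "no home run"
    HR            = as.numeric(!is.na(events) & events == "home_run")
  )
```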
Then to construct a plot of these binned home run rates, I type
We see (as expected) that the home run rate is about 4% on a pitch thrown in the middle of the zone.
Maybe it is more relevant to consider the rate of a home run on a ball that is put into play. I filter out the pitches that are not put in play, and here is a graph of the new home run rate over bins. It is harder to see the prime home run location, but note that the in-play HR rates can be as high as 20%.
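The filtering step changes only the denominator of the rate. A small sketch, assuming (as in the usual Statcast coding) that balls put in play are flagged by type == "X"; the six rows below are made up:

```r
library(dplyr)

# Made-up pitches: type "X" = in play, "S" = strike, "B" = ball
sc_toy <- data.frame(
  type = c("X", "S", "B", "X", "X", "S"),
  HR   = c(1, 0, 0, 0, 1, 0)
)

in_play <- filter(sc_toy, type == "X")

rate_all     <- mean(sc_toy$HR)    # home runs per pitch
rate_in_play <- mean(in_play$HR)   # home runs per ball in play
```

The same home runs are divided by fewer pitches, so the in-play rate is never smaller than the per-pitch rate.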
Where are Called Strikes?
Suppose a pitcher wants a called strike on a four-seam fastball — where should he throw it? Here is a plot of the called strike rates to left-handed hitters on four-seamers. We see that lefties tend to lay off the balls that are low and outside.
I’ve modified this graph to focus on called strike rates on four-seamers to right-handed hitters. Again, as one might suspect, the rates are highest for pitches thrown outside and low in the zone.
Here I display the called strike rates on curveballs. The called strikes tend to happen high in the zone.
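Each of these called-strike plots uses a different subset of pitches. A sketch of the subsetting, assuming the usual Statcast columns stand ("L" or "R" for the batter's side) and pitch_type ("FF" for four-seamers, "CU" for curveballs); the rows are made up:

```r
library(dplyr)

# Made-up pitches with batter side, pitch type, and a called-strike indicator
sc_toy <- data.frame(
  stand         = c("L", "R", "R", "L", "R"),
  pitch_type    = c("FF", "FF", "CU", "SL", "FF"),
  Called_Strike = c(1, 0, 1, 0, 1)
)

ff_left  <- filter(sc_toy, stand == "L", pitch_type == "FF")  # four-seamers to lefties
ff_right <- filter(sc_toy, stand == "R", pitch_type == "FF")  # four-seamers to righties
curves   <- filter(sc_toy, pitch_type == "CU")                # curveballs

# Overall called-strike rate within one subset
ff_right_rate <- mean(ff_right$Called_Strike)
```

Each subset, together with its Called_Strike column, would then be passed to the binning function.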
Where are Foul Balls?
Here I look at the rates of foul balls by right-handed hitters. It seems that these batters tend to foul off pitches that are inside and high in the zone.
- The bin_rate_plot() function and the examples presented here can be found on my GitHub Gist site.
- The bin_rate_plot() function can be used to explore a variety of batting rates such as swing rates, contact rates, hit rates, home run rates, etc. It can also be used to explore pitching rates such as swinging strike rates and umpire rates such as called-strike rates.
- To get a reasonable plot, you need to have a sufficient number of bins with counts. For example, if you focus on a single player for one season, you may get a graph that doesn’t look very smooth.
- One way to get a smoother representation of these rate patterns is to fit a model. One model that I like is a generalized additive model (GAM) where the logit of the rate is a smooth function of the x and z variables. In a previous post, I illustrate using my CalledStrike package to produce attractive GAM plots such as this one below. This shows the in-play home run rate for right-handed batters. (Compare this graph of the in-play home run rates with my binned graph above.)
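To give a feel for the GAM idea in the last comment, here is a minimal sketch that fits the model directly with the mgcv package rather than through the CalledStrike package. The data are simulated, and the column names plate_x, plate_z, and HR are assumptions for the example.

```r
library(mgcv)

# Simulated balls in play: home runs are more likely near the middle of the zone
set.seed(2)
n <- 4000
d <- data.frame(plate_x = runif(n, -1.3, 1.3),
                plate_z = runif(n, 1.15, 3.95))
p_true <- plogis(-1.5 - 2 * d$plate_x^2 - 2 * (d$plate_z - 2.5)^2)
d$HR <- rbinom(n, 1, p_true)

# Logit of the home run rate as a smooth function of the pitch location
fit <- gam(HR ~ s(plate_x, plate_z), family = binomial, data = d)

# Predicted rates on a 25 x 25 grid over the same rectangle
grid <- expand.grid(plate_x = seq(-1.3, 1.3, length.out = 25),
                    plate_z = seq(1.15, 3.95, length.out = 25))
grid$Rate <- as.numeric(predict(fit, newdata = grid, type = "response"))
```

The fitted rates in grid vary smoothly across the zone, unlike raw binned rates, which jump from bin to bin when the counts are small.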