Williams and Underwood’s famous book The Science of Hitting (originally published in 1970) contains the following remarkable graph that shows Ted Williams’ batting average in different areas of the strike zone. (I wonder how this data was collected over 45 years ago?)
In this blog post, I’ll provide a general overview of how to construct a similar graph using the ggplot2 package with pitchFX data, and use that as a springboard to present alternative graphical views of areas of “hot” and “cold” hitting.
The Data
Using the pitchRx package, I downloaded pitch data for five months of the 2016 season. I created a data frame with the variables gameday_link, num, Batter, X, Z, and Event. For my purposes, all I need is the name of the batter, the pitch location (X and Z), and the Event. The data frame contains only the last pitch of each plate appearance, since I’m interested in the relationship between the location of the final pitch and the PA outcome.
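As a minimal sketch of the "last pitch only" step, the base R snippet below keeps the final row of each plate appearance. The toy data frame and its values are made up for illustration, and it assumes the rows are already in pitch order and that gameday_link together with num identifies a plate appearance.

```r
# Toy pitch-level data (hypothetical values): two plate appearances in one game.
pitches <- data.frame(
  gameday_link = "gid_example_game",
  num    = c(1, 1, 1, 2, 2),          # plate-appearance number within the game
  Batter = c("A", "A", "A", "B", "B"),
  X      = c(-0.5, 0.2, 0.1, 1.1, -0.3),
  Z      = c(2.0, 3.1, 2.4, 1.5, 2.8),
  Event  = c("Single", "Single", "Single", "Strikeout", "Strikeout"),
  stringsAsFactors = FALSE
)

# Assumption: rows are in pitch order within each plate appearance.
pa_id <- paste(pitches$gameday_link, pitches$num)

# duplicated(..., fromLast = TRUE) flags every row except the last one per id,
# so negating it keeps exactly the final pitch of each plate appearance.
last_pitch <- pitches[!duplicated(pa_id, fromLast = TRUE), ]
print(last_pitch)
```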
Mike Trout
Here are the steps to produce a “Williams/Underwood” style of graph for Mike Trout’s batting average in the 2016 season:
- Restrict the events to official at-bats (removing walks, etc.) and define a Hit variable that is 1 for a hit and 0 for an out.
- Since we have a limited number of at-bats for Trout in the 2016 season, some smoothing of the probabilities is desirable. I fit a generalized additive model with a logistic link (using the gam function) to the 1/0 data, using the (X, Z) location as covariates.
- I define a grid of X, Z values similar to what was used in Williams and Underwood’s display. For each grid point, I estimate the hit probability and use ggplot2 to display the fitted averages on top of the points.
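The three steps above can be sketched roughly as follows. This is not the code behind the post's graph: it uses simulated plate appearances rather than Trout's data, it assumes gam comes from the mgcv package (the post does not say which package), and the event labels, grid limits, and 6 by 6 grid size are illustrative guesses.

```r
library(mgcv)      # assumption: gam() from mgcv
library(ggplot2)

set.seed(2016)
n <- 600
# Simulated plate appearances (not real data): hits are more likely mid-zone.
pa <- data.frame(X = runif(n, -1.5, 1.5), Z = runif(n, 1, 4))
p_hit <- plogis(-0.3 - 0.8 * pa$X^2 - 0.8 * (pa$Z - 2.5)^2)
pa$Event <- ifelse(runif(n) < 0.1, "Walk",
                   ifelse(runif(n) < p_hit, "Single", "Groundout"))

# Step 1: keep official at-bats and define the 1/0 Hit variable.
# The event labels below are assumed; check them against your own data.
non_ab <- c("Walk", "Intent Walk", "Hit By Pitch", "Sac Fly", "Sac Bunt")
ab <- pa[!(pa$Event %in% non_ab), ]
ab$Hit <- as.integer(ab$Event %in% c("Single", "Double", "Triple", "Home Run"))

# Step 2: smooth the hit probability over (X, Z); binomial uses a logit link.
fit <- gam(Hit ~ s(X, Z), family = binomial, data = ab)

# Step 3: predict on a coarse grid and print the fitted averages above the points.
grid <- expand.grid(X = seq(-1.25, 1.25, length.out = 6),
                    Z = seq(1.25, 3.75, length.out = 6))
grid$AVG <- as.numeric(predict(fit, newdata = grid, type = "response"))

p <- ggplot(grid, aes(X, Z)) +
  geom_point() +
  geom_text(aes(label = sprintf("%.3f", AVG)), vjust = -0.8) +
  coord_fixed()
print(p)
```

With only a few hundred at-bats, the smooth s(X, Z) borrows strength from neighboring locations, which is why the fitted values change gradually across the grid instead of jumping from cell to cell.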
A Contour Plot
Although the above graph is nice, it is hard to detect the sweet spot (actually Williams and Underwood used different colors to help the reader find the hot zone). But if one uses a finer grid of points, say 50 by 50, we can create alternative displays that better communicate hot and cold zones.
One possibility is a contour plot. Here I construct a contour plot (using the geom_contour function) specifying contour lines at .2, .3, and .4. For a pretty large area in the middle of the strike zone, Trout is a .400+ hitter.
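A minimal sketch of such a contour plot is shown below. To stay self-contained it uses a made-up smooth surface on a 50 by 50 grid in place of the fitted probabilities, so the shape of the contours is illustrative only; in practice the AVG column would come from the model predictions.

```r
library(ggplot2)

# 50 x 50 grid over an assumed plotting region (feet, catcher's view).
grid <- expand.grid(X = seq(-1.5, 1.5, length.out = 50),
                    Z = seq(1, 4, length.out = 50))

# Stand-in surface (not real data): peaks in the middle of the zone.
grid$AVG <- 0.45 * exp(-0.6 * grid$X^2 - 0.6 * (grid$Z - 2.5)^2)

# Draw contour lines only at the chosen batting-average levels.
p <- ggplot(grid, aes(X, Z, z = AVG)) +
  geom_contour(breaks = c(0.2, 0.3, 0.4)) +
  coord_fixed()
print(p)
```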
Categorizing AVG
If we are really interested in ranges of batting averages, then a reasonable thing to do is to create a new variable that categorizes the AVG into intervals such as (.000, .100), (.100, .200), etc., and then plot the categorized values using different colors.
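One way to do the categorizing is with base R's cut function, as in the sketch below. The break points, labels, and the stand-in surface are assumptions chosen for illustration; the real AVG values would be the fitted probabilities on the fine grid.

```r
library(ggplot2)

grid <- expand.grid(X = seq(-1.5, 1.5, length.out = 50),
                    Z = seq(1, 4, length.out = 50))

# Stand-in surface (not real data).
grid$AVG <- 0.45 * exp(-0.6 * grid$X^2 - 0.6 * (grid$Z - 2.5)^2)

# Bin the averages into 100-point ranges; everything above .400 shares one bin.
grid$AVG_cat <- cut(grid$AVG,
                    breaks = c(0, 0.1, 0.2, 0.3, 0.4, 1),
                    labels = c("under .100", ".100-.200", ".200-.300",
                               ".300-.400", ".400+"),
                    include.lowest = TRUE)

# Color each grid point by its category.
p <- ggplot(grid, aes(X, Z, color = AVG_cat)) +
  geom_point() +
  coord_fixed()
print(p)
```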
Celebrating the Cubs with Heat Maps
One of my colleagues mentioned heat maps and R in the same sentence (he prefers MATLAB so it was notable that he mentioned R). It is straightforward to create heat maps of AVG over the strike zone using the geom_tile function. In honor of the Cubs clinching a playoff spot, I show heat maps for two right-handed hitters, Kris Bryant and Addison Russell, and two left-handed hitters, Anthony Rizzo and Chris Coghlan. Remember that these graphs are from the catcher’s perspective. It is interesting that all of the hitters seem to like balls in the middle and outside of the strike zone (for the righties, outside is on the right, and for lefties, outside is on the left).
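A heat map along these lines can be sketched with geom_tile as below. Again the surface is a made-up stand-in rather than any Cubs hitter's fitted averages, and the strike zone rectangle uses rough, assumed boundaries that would need adjusting for a particular batter.

```r
library(ggplot2)

grid <- expand.grid(X = seq(-1.5, 1.5, length.out = 50),
                    Z = seq(1, 4, length.out = 50))

# Stand-in surface (not real data).
grid$AVG <- 0.45 * exp(-0.6 * grid$X^2 - 0.6 * (grid$Z - 2.5)^2)

# One tile per grid point, filled by the (stand-in) batting average.
p <- ggplot(grid, aes(X, Z, fill = AVG)) +
  geom_tile() +
  scale_fill_gradient(low = "blue", high = "red") +
  # Approximate strike zone outline; these limits are assumptions.
  annotate("rect", xmin = -0.85, xmax = 0.85, ymin = 1.6, ymax = 3.5,
           fill = NA, color = "black") +
  coord_fixed()
print(p)
```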
Things to Try
Hopefully this post has gotten folks interested in trying these displays. Here are some directions for future work.
- Instead of hit/out, one could define a success as hitting a home run, and see the sweet spot for hitting a home run for different batters.
- Or one could measure the value of a PA by its run value and look at contours of run value by the location of the last pitch.
- Since Statcast provides the exit velocity for each ball, it would be interesting to look at contours of exit velocity of balls-in-play by pitch location.
- How does a batter’s sweet spot vary with the pitcher’s throwing arm (left or right)?
R Code
A function heat_plot that will produce this type of heat graph can be found at my GitHub gist site. The inputs are the name of the batter, the data frame containing the pitchRx data, and an indicator of whether you want to see the probability of a hit or the probability of a home run.