Heat Maps for Batting Rates

Introduction

Back in September 2016, I wrote a post describing how to construct a heat map for batting averages over the zone. That post was inspired by the famous graph of Ted Williams’ batting averages over the zone published in the book The Science of Hitting. Currently, I am illustrating heat maps in the graphics chapter for a revision of R by Example. I thought some readers of this blog would be interested in the process of constructing these types of graphs from Statcast data.

The Data

I collected Statcast data for six seasons (2018 through 2023), keeping the variables batter, plate_x, plate_z and event for all balls put into play. The pair (plate_x, plate_z) gives the location of the pitch relative to the zone and the event variable gives the outcome; we focus on the hit and home run outcomes. We consider only the batters who had at least 1500 balls in play (BIP) over this period. The goal is to study how hit or home run rates vary across the zone for a specific player.

Binning Balls in Play

I wrote a function bin_rates() that divides the strike zone region into subregions and counts BIP, hits and home runs in each subregion. The inputs to the function are the Statcast data frame, a vector of breakpoints for plate_x, and a vector of breakpoints for plate_z.
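To make the binning idea concrete, here is a minimal base R sketch. It is not the actual bin_rates() code: the function name bin_counts(), the simulated data, and the set of event labels counted as hits are all assumptions made for illustration.

```r
# Hypothetical sketch of zone binning; NOT the author's bin_rates().
# Simulated stand-in for Statcast balls in play (values are made up).
set.seed(1)
n <- 500
bip <- data.frame(
  plate_x = runif(n, -1, 1),
  plate_z = runif(n, 1.5, 3.5),
  event   = sample(c("out", "single", "home_run"), n,
                   replace = TRUE, prob = c(0.65, 0.28, 0.07))
)

bin_counts <- function(d, x_breaks, z_breaks) {
  # Assign each ball in play to a subregion; pitches falling outside
  # the breakpoints get NA and are dropped.
  d$x_bin <- cut(d$plate_x, breaks = x_breaks, include.lowest = TRUE)
  d$z_bin <- cut(d$plate_z, breaks = z_breaks, include.lowest = TRUE)
  d <- d[!is.na(d$x_bin) & !is.na(d$z_bin), ]
  # Indicator columns so that summing gives counts.
  # The hit labels below are an assumption about how events are coded.
  d$one <- 1L
  d$hit <- as.integer(d$event %in% c("single", "double", "triple", "home_run"))
  d$hr  <- as.integer(d$event == "home_run")
  out <- aggregate(cbind(BIP = one, H = hit, HR = hr) ~ x_bin + z_bin,
                   data = d, FUN = sum)
  # Rates in percentages, as in the displays.
  out$H_rate  <- 100 * out$H / out$BIP
  out$HR_rate <- 100 * out$HR / out$BIP
  out
}

# A 4 x 4 grid of subregions over the simulated zone.
counts <- bin_counts(bip,
                     x_breaks = seq(-1, 1, length.out = 5),
                     z_breaks = seq(1.5, 3.5, length.out = 5))
```

Each row of counts is one subregion, which is the shape of data the displays that follow need.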

Displaying Counts over Bins

The function bin_plot() displays the different counts over the subregions. Here is an illustration of counts of balls in play for Bryce Harper.

Here is the corresponding display of home run rates (in percentages).

A Useful Stat

On the surface, it appears that Harper’s favored region for hitting home runs is the “high-outside” region (remember Harper is a left-handed batter) since 14.3 is the highest rate in the figure. But this statement is misleading since that percentage of 14.3 is based on only 14 balls in play. (That raises an interesting question: which rate is more “significant”, 2 out of 14 or 35 out of 247?) One simple proposal for a more meaningful measure is to consider the standardized rate

Z = \frac{Rate - p}{\sqrt{p (1 - p) / BIP}}

where Rate is the fraction of balls in play in the subregion that are home runs, BIP is the count of balls in play in that subregion, and p is the overall home run rate. (For the statistical reader, Z is just the traditional test statistic for the hypothesis that the true rate is equal to p.) If we compute these Z scores, we get the following display. Positive values in the display correspond to locations with above-average rates and negative values correspond to locations where home runs are relatively rare. By the way, note that the large 14.3 value in the upper left corner (based on 14 BIP) corresponds to a Z score of only 0.7.
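As a quick check of the formula above, the following sketch compares the two rates mentioned earlier, 2 out of 14 and 35 out of 247. The overall rate p = 0.09 is a hypothetical value chosen for illustration; the post does not state the value of p that was used.

```r
# Standardized rate: (Rate - p) / sqrt(p * (1 - p) / BIP)
z_score <- function(hr, bip, p) {
  rate <- hr / bip
  (rate - p) / sqrt(p * (1 - p) / bip)
}

p_assumed <- 0.09  # hypothetical overall home run rate, not from the post

# Two subregions with nearly the same rate but very different BIP.
z <- z_score(hr = c(2, 35), bip = c(14, 247), p = p_assumed)
round(z, 1)
```

Under this assumed p, the 2-for-14 subregion has a Z score of about 0.7 while the 35-for-247 subregion has a Z score of about 2.8, so the second rate is much stronger evidence of an above-average region even though the two raw rates are nearly equal.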

Heat Map

Once we have a reasonable measure of performance like the Z score over the subregions, it is straightforward (using the geom_tile() geometric object in ggplot2) to construct a heat map. The function bin_plot_hm() contains the complete ggplot2 code. Hot regions are red and cold regions are green or blue. This display clearly shows that Harper’s hot zone for home run hitting is the left-central middle area of the zone. It is rare for Harper to hit a home run in the low outside region of the zone.
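For readers who want to see the basic geom_tile() pattern, here is a small sketch. It is not the bin_plot_hm() code: the grid of subregion centers and the Z scores are simulated, and the blue-to-red diverging color scale is one possible choice.

```r
library(ggplot2)

# Hypothetical 4 x 4 grid of subregion centers with simulated Z scores.
set.seed(2)
grid <- expand.grid(x = seq(-0.75, 0.75, by = 0.5),
                    z = seq(1.75, 3.25, by = 0.5))
grid$Z <- round(rnorm(nrow(grid)), 1)

heat <- ggplot(grid, aes(x = x, y = z, fill = Z)) +
  geom_tile(color = "white") +          # one tile per subregion
  geom_text(aes(label = Z)) +           # print the Z score on each tile
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0) +  # hot = red, cold = blue
  coord_fixed() +
  labs(x = "plate_x", y = "plate_z", fill = "Z score")

print(heat)
```

Centering the color scale at zero means a white tile is a subregion where the rate matches the overall rate.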

The Shiny App

I wrote a Shiny app that is helpful for seeing these in-play home run or hit rates for any batter of interest. In this app, one selects a player and the number of bins along each dimension (options are 4, 5, 6, 8, 10). One can display the balls in play, hits, hit rates, home runs, home run rates and Z scores, and show the heat maps for the Z scores.

  • If you are interested in seeing the Shiny app code with the associated work functions, go here.
  • If you are just interested in trying the app out, you can run the Shiny app by typing in the RStudio console window
shiny::runGitHub("bayesball/InPlayBatterRates")

Comments

  • (How many subregions?) Trying this out for a number of players, it seems best to use this approach with a small number of bins along each dimension. If you use too many subregions, the BIP counts can be small and you’ll see a lot of noise in the rates.
  • (Compare players.) These heat maps are useful for comparing the hot and cold regions of different players. The Shiny app (through the Download Data button) allows one to download the counts and rates across subregions for a player. By downloading these data frames for several players, one can use ggplot2 with the facet feature (such as facet_wrap()) to compare heat maps.
  • (Other methods?) Another way to visualize these rates is by use of a statistical model. For example, one could use a generalized additive model to predict the probability of a home run as a smooth function of the plate coordinates. Then one could display the fitted probabilities over a grid by use of a contour or heat map graph. I used this approach in this earlier post.
  • (Look at our book.) We illustrate this same general approach in detail in the Home Run Hitting chapter of our new 3rd edition of Analyzing Baseball Data with R. In that chapter, we consider balls in play and home run counts over regions of the space defined by launch angle and exit velocity. We also illustrate the use of models to predict the probability of a home run given values of the launch variables.
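The generalized additive model idea mentioned in the comments can be sketched as follows, using the mgcv package. This is not the code from the earlier post: the data are simulated, and the assumed "true" home run probability surface is made up so that the example runs on its own.

```r
library(mgcv)

# Simulated balls in play (hypothetical; not Statcast data).
set.seed(3)
n <- 3000
sim <- data.frame(plate_x = runif(n, -1, 1),
                  plate_z = runif(n, 1.5, 3.5))
# Assumed home run probability, highest near the middle of the zone.
true_p <- plogis(-2 - 2 * sim$plate_x^2 - 2 * (sim$plate_z - 2.5)^2)
sim$HR <- rbinom(n, 1, true_p)

# Logistic GAM: home run probability as a smooth function of location.
fit <- gam(HR ~ s(plate_x, plate_z), family = binomial, data = sim)

# Fitted probabilities over a grid covering the zone.
grid <- expand.grid(plate_x = seq(-1, 1, length.out = 25),
                    plate_z = seq(1.5, 3.5, length.out = 25))
grid$prob <- as.numeric(predict(fit, newdata = grid, type = "response"))
```

The grid data frame can then be passed to ggplot2 (for example with geom_tile() or geom_contour()) to draw the fitted surface.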
