Monthly Archives: November, 2020

Baseball Density Estimates Using ggplot2

Bivariate Density Estimates

In baseball, we are interested in learning about pitch locations and the locations of balls put into play. Basic graphical methods like scatterplots don’t work very well because the large number of data points leads to overplotting. An attractive alternative is to construct a bivariate density estimate of the location values and then display the density estimate as a contour graph. The kde2d() function in the MASS package is a good general-purpose function for constructing a bivariate kernel density estimate, and the ggplot2 package uses it in its geom_density2d() and geom_density2d_filled() geometric objects. Personally, I prefer the filled contour plots since they make it easier to pick out regions of high probability density.
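As a minimal sketch of the idea, the code below computes a bivariate kernel density estimate directly with kde2d() and then draws the same kind of estimate as filled contours with ggplot2. The data are simulated stand-ins for pitch locations, not real Statcast values, and the object names pitches, fit and p are my own choices for illustration.

```r
library(MASS)     # provides kde2d()
library(ggplot2)

set.seed(1)
# Simulated stand-in for pitch locations (assumption: not real Statcast data)
pitches <- data.frame(
  plate_x = rnorm(5000, mean = 0, sd = 0.8),
  plate_z = rnorm(5000, mean = 2.5, sd = 0.9)
)

# Direct call: kde2d() returns grid coordinates x and y
# and a matrix z of estimated density values on that grid
fit <- kde2d(pitches$plate_x, pitches$plate_z, n = 50)

# The same kind of estimate displayed as a filled contour graph
p <- ggplot(pitches, aes(plate_x, plate_z)) +
  geom_density2d_filled() +
  coord_fixed()
```

Printing p draws the graph. The n = 50 grid size is arbitrary here; a finer grid gives smoother contour edges at the cost of more computation.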

Need Large Samples

I don’t use these graphical methods much in my exploratory work since one needs a good amount of data to get smooth density estimates. Fortunately, this is not a problem with Statcast data: for example, my dataset for the 2019 season illustrated here has over 732,000 pitches and over 125,000 balls put into play. The purpose of this post is to illustrate the use of these density estimates for several interesting explorations of pitch location and ball-in-play (BIP) location data.

Locations of Pitches of Different Types

Pitchers use a variety of pitch types, and these pitch types tend to be thrown to different zone locations. (The relevant location variables in the Statcast dataset are plate_x and plate_z.) I illustrate this below by showing density estimates of the locations for nine different pitch types for right-handed pitchers. Fastballs tend to be thrown in the middle of the zone, while off-speed pitches (changeups, curveballs, sliders) tend to be thrown low in the zone. More specifically, sliders are thrown low to the right (from the catcher’s perspective), while changeups tend to be thrown low to the left.
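A display of this kind can be sketched by filtering to right-handed pitchers and drawing one panel per pitch type. The data below are simulated, with made-up cluster centers for three pitch codes, so the resulting picture only shows the mechanics. The column names plate_x, plate_z, pitch_type and p_throws follow the Statcast variables; everything else (sc, centers, rhp) is hypothetical.

```r
library(ggplot2)

set.seed(2)
# Made-up cluster centers for three pitch codes (illustrative only)
centers <- data.frame(
  pitch_type = c("FF", "SL", "CH"),
  mx = c(0, 0.5, -0.5),
  mz = c(2.7, 1.8, 1.8)
)
n <- 2000
idx <- sample(nrow(centers), n, replace = TRUE)

# Simulated stand-in for a Statcast data frame
sc <- data.frame(
  pitch_type = centers$pitch_type[idx],
  p_throws = sample(c("R", "L"), n, replace = TRUE),
  plate_x = rnorm(n, centers$mx[idx], 0.7),
  plate_z = rnorm(n, centers$mz[idx], 0.7)
)

# Keep right-handed pitchers only
rhp <- subset(sc, p_throws == "R")

# One density estimate per pitch type, one panel each
p <- ggplot(rhp, aes(plate_x, plate_z)) +
  geom_density2d_filled(contour_var = "ndensity") +
  facet_wrap(~ pitch_type) +
  coord_fixed()
```

Changing the filter to p_throws == "L" would give the corresponding display for left-handed pitchers.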

The pitch location story for southpaws is a mirror image of the locations for right-handers.

Locations of Called Strikes

Next let’s consider the locations of called strikes. These locations depend on both the throwing arm of the pitcher (Statcast variable p_throws) and the batter side (Statcast variable stand). When a pitcher throws to a batter of the same side (either R against R or L against L), the called strikes tend to be low and away. In contrast, when a pitcher faces a batter of the opposite side (either R against L or L against R), the called strikes tend to be outside (both low and in the middle of the zone). If you look carefully, there are some interesting patterns in the locations away from the most likely location, and these subtle patterns are mirrored in the other same-side or opposite-side matchup.
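One way to lay out the four matchups is a grid of panels indexed by p_throws and stand. The sketch below uses simulated data, and it assumes that the pitch outcome is stored in a description column with the value "called_strike"; check the values in your own dataset before relying on that filter.

```r
library(ggplot2)

set.seed(3)
n <- 4000
# Simulated stand-in for a Statcast data frame (locations are random noise)
sc <- data.frame(
  p_throws = sample(c("R", "L"), n, replace = TRUE),
  stand = sample(c("R", "L"), n, replace = TRUE),
  description = sample(c("called_strike", "ball", "swinging_strike"),
                       n, replace = TRUE),
  plate_x = rnorm(n, 0, 0.8),
  plate_z = rnorm(n, 2.5, 0.8)
)

# Keep called strikes only (assumed outcome label)
called <- subset(sc, description == "called_strike")

# Rows: pitcher arm, columns: batter side
p <- ggplot(called, aes(plate_x, plate_z)) +
  geom_density2d_filled(contour_var = "ndensity") +
  facet_grid(p_throws ~ stand, labeller = label_both) +
  coord_fixed()
```

The same layout works for other outcomes by changing the value used in the filter.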

Locations of Swinging Strikes

Where do swinging strikes tend to be located? It depends on the matchup. If a pitcher is throwing against a batter of the same side, then the swinging strikes tend to be in the outside lower corner of the zone, and many of them are outside of the zone. In contrast, if a pitcher is throwing against a batter of the opposite side, then there are two likely regions for swinging strikes: low middle and high outside. Again, one sees mirror images of these locations. For example, compare R against L with L against R, or compare L against L with R against R.

Field Locations of Batted Balls

In the Statcast dataset, there are two variables, hc_x and hc_y, that provide the field locations of balls put into play. To get reasonable-looking graphs, I apply a simple linear transformation to these variables. Here is a density estimate graph of the field locations of all balls put into play for the 2020 season.
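The exact transformation is not spelled out here, so the following is only a sketch of one linear transformation of this type. It shifts the coordinates so that home plate sits near the origin, flips the vertical axis so the outfield points up, and rescales. The constants 125.42, 198.27 and 2.5 are assumptions based on values commonly used with these variables, and the function name transform_hc is hypothetical.

```r
# Assumed linear transformation of hc_x and hc_y (not necessarily
# the one used for the graphs in this post)
transform_hc <- function(hc_x, hc_y) {
  data.frame(
    location_x = 2.5 * (hc_x - 125.42),  # center horizontally, then rescale
    location_y = 2.5 * (198.27 - hc_y)   # flip so the outfield points up
  )
}

out <- transform_hc(hc_x = c(125.42, 100), hc_y = c(198.27, 100))
```

Under these assumed constants, the first point maps to (0, 0) and the second to (-63.55, 245.675). The transformed columns can then be passed to geom_density2d_filled() in the same way as plate_x and plate_z.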

Since I didn’t find this graph that informative, let’s divide the batted balls by launch angle. I use the Statcast definition to divide the batted balls into ground balls, fly balls, line drives and popups. Since the location of these batted balls also depends on the batter side, I create two displays showing the locations for each of the four batted ball types for each batter side. What do we see? Left-handed batters tend to pull ground balls and line drives, but they tend to hit popups and fly balls to the opposite side. The graph for right-handed batters, as expected, is a mirror image of the graph for left-handed hitters.
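The division by launch angle can be sketched with cut(). The cutoffs of 10, 25 and 50 degrees are my reading of the usual Statcast definition and should be treated as assumptions, as should the handling of angles that fall exactly on a boundary. The function name classify_bb and the labels are hypothetical.

```r
# Classify batted balls by launch angle in degrees
# (assumed cutoffs: below 10, 10 to 25, 25 to 50, 50 and above)
classify_bb <- function(launch_angle) {
  cut(launch_angle,
      breaks = c(-Inf, 10, 25, 50, Inf),
      labels = c("ground_ball", "line_drive", "fly_ball", "popup"),
      right = FALSE)  # intervals are closed on the left: [10, 25), etc.
}

types <- as.character(classify_bb(c(-5, 12, 30, 65)))
# types is "ground_ball" "line_drive" "fly_ball" "popup"
```

With this factor added as a column, one panel per batted ball type can be drawn for each batter side by faceting on the new column.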

Some Comments

  • R code? If you have some Statcast data, then these graphs are easy to create using the ggplot2 package. You can find the code for these graphs on my GitHub Gist site.
  • Conditional estimates. When one uses facets as above, the default contour levels are defined on a density scale shared by all panels, so it may be hard to see the patterns in particular panels. By specifying the contour_var = "ndensity" argument in the geom_density2d_filled() function, the density values are rescaled within each panel to have a maximum of 1. With this adjustment one is visualizing a density estimate conditional on the value of the variable in that facet. That is what I did in my plots above since I think the objective is to see the most likely locations within each panel.
  • Density Estimates at the Individual Level. Teams really want to look at these types of pitch location graphs at the individual level. For example, where does Bryce Harper tend to have a swinging strike? Here is Harper’s graph using 2019 data. In this case, Harper’s probable swinging strike locations do appear similar to our graph for all left-handed hitters.
  • What To Do About These Individual Fits? If one doesn’t have much data, then one will tend to get poor-looking contour displays at the individual level that are hard to interpret. What one would like to do is improve these individual displays by shrinking or adjusting them towards the display for all players that we see here. One can achieve this by use of a Bayesian multilevel model. Once I figure out the details of doing this in R, I’ll illustrate how this is done.
  • Revisions to the CalledStrike package. I have recently revised my CalledStrike package, which is useful for visualizing smooth fits of specific metrics (say launch speed or batting average) over the zone. In a future post, I’ll illustrate the use of functions in the CalledStrike package to produce these smoothed fits.
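To make the comment on conditional estimates concrete, here is a small sketch on simulated data with one tightly clustered panel and one diffuse panel. The peak of the density estimate is much lower in the diffuse panel, so contour levels shared across panels tend to wash it out, while contour_var = "ndensity" rescales each panel's estimate to a maximum of 1. The data and the names d, peaks, p_joint and p_norm are made up for illustration.

```r
library(MASS)
library(ggplot2)

set.seed(4)
# Simulated data: panel "A" is tightly clustered, panel "B" is diffuse
d <- rbind(
  data.frame(panel = "A", x = rnorm(1500, 0, 0.3), y = rnorm(1500, 0, 0.3)),
  data.frame(panel = "B", x = rnorm(1500, 0, 1.0), y = rnorm(1500, 0, 1.0))
)

# Peak of each panel's kernel density estimate
peaks <- sapply(split(d, d$panel),
                function(g) max(kde2d(g$x, g$y, n = 50)$z))

# Default: contour levels on a shared density scale
p_joint <- ggplot(d, aes(x, y)) +
  geom_density2d_filled() +
  facet_wrap(~ panel)

# Normalized: each panel's density rescaled to a maximum of 1
p_norm <- ggplot(d, aes(x, y)) +
  geom_density2d_filled(contour_var = "ndensity") +
  facet_wrap(~ panel)
```

Printing p_joint and p_norm side by side shows the difference: in the first, panel "B" uses only the lowest contour bands, while in the second both panels span the full range of bands.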