# Baseball Density Estimates Using ggplot2

#### Bivariate Density Estimates

In baseball, we are interested in learning about pitch locations and the locations of balls put into play. Basic graphical methods like scatterplots don’t work very well because the large number of data points leads to overplotting. An attractive alternative is to construct a bivariate density estimate of the location values and then display the density estimate as a contour graph. The `kde2d()` function in the `MASS` package is a good general-purpose function for constructing a bivariate kernel density estimate, and the `ggplot2` package uses it in its `geom_density2d()` and `geom_density2d_filled()` geometric objects. Personally, I prefer the filled contour plots since they make it easier to pick out regions of high probability density.
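
As a minimal sketch of the idea, the code below simulates pitch-like locations, computes the estimate directly with `MASS::kde2d()`, and then draws the filled contour version with `ggplot2`. The data and the column names `x` and `z` are made up for illustration; they are not Statcast values.

```r
library(MASS)     # kde2d()
library(ggplot2)

set.seed(2019)
# Simulated locations; these are NOT real Statcast data
n <- 5000
sim <- data.frame(
  x = rnorm(n, mean = 0, sd = 0.8),
  z = rnorm(n, mean = 2.5, sd = 0.9)
)

# Bivariate kernel density estimate evaluated on a 50 x 50 grid
fit <- kde2d(sim$x, sim$z, n = 50)

# The same kind of estimate displayed as filled contours.
# geom_density_2d_filled() is the underscore spelling of geom_density2d_filled().
p <- ggplot(sim, aes(x, z)) +
  geom_density_2d_filled() +
  coord_fixed()
p
```

The object `fit` holds the grid points (`fit$x`, `fit$y`) and the matrix of density values (`fit$z`), which is handy if you want to inspect the estimate rather than just plot it.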

#### Need Large Samples

I don’t use these graphical methods that much in my exploratory work since one needs a good amount of data to get smooth density estimates. Fortunately, this is not a problem with Statcast data: for example, my dataset for the 2019 season illustrated here has over 732,000 pitches and over 125,000 balls put into play. The purpose of this post is to illustrate the use of these density estimates for several interesting explorations of pitch location and ball-in-play (BIP) location data.

#### Locations of Pitches of Different Types

Pitchers use a variety of different pitch types and these pitch types tend to be thrown to different zone locations. (The relevant location variables in the Statcast dataset are `plate_x` and `plate_z`.) I illustrate this below by showing density estimates of the locations for nine different pitch types for right-arm pitchers. Fastballs tend to be thrown in the middle of the zone, while off-speed pitches (changeups, curveballs, sliders) tend to be thrown low in the zone. More specifically, sliders are thrown low to the right (from the catcher’s perspective), while changeups tend to be thrown low to the left.

The pitch location story for southpaws is a mirror image of the locations for right-handers.
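
A display of this kind can be sketched with a faceted filled density plot. The example below is only an illustration: the data frame `sc` is simulated, just three pitch types are used, and the centers of the simulated locations are hypothetical values picked to mimic the pattern described above. Only the column names (`pitch_type`, `p_throws`, `plate_x`, `plate_z`) follow the Statcast dataset.

```r
library(ggplot2)

set.seed(1)
# Simulated stand-in for a Statcast data frame (NOT real data)
n <- 6000
sc <- data.frame(
  pitch_type = sample(c("FF", "SL", "CH"), n, replace = TRUE),
  p_throws   = sample(c("R", "L"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
# Hypothetical centers, chosen only to mimic the pattern in the text
center_x <- c(FF = 0,   SL = 0.5, CH = -0.5)
center_z <- c(FF = 2.6, SL = 1.9, CH = 1.9)
sc$plate_x <- rnorm(n, center_x[sc$pitch_type], 0.7)
sc$plate_z <- rnorm(n, center_z[sc$pitch_type], 0.7)

# Keep right-handed pitchers and draw one panel per pitch type
right <- subset(sc, p_throws == "R")

p_types <- ggplot(right, aes(plate_x, plate_z)) +
  geom_density_2d_filled(contour_var = "ndensity") +
  facet_wrap(~ pitch_type) +
  coord_fixed() +
  theme(legend.position = "none")
p_types
```

Changing the filter to `p_throws == "L"` gives the corresponding display for left-handed pitchers.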

#### Locations of Called Strikes

Next let’s consider the locations of called strikes. These locations depend on both the arm of the pitcher (Statcast variable `p_throws`) and the batter side (Statcast variable `stand`). When a pitcher throws a pitch to a batter of the same side (either R against R or L against L), the locations of the called strikes tend to be low and away. In contrast, when a pitcher is facing a batter of the opposite side (either R against L or L against R), the called strike locations tend to be outside (both low and middle of the zone). If you look carefully, there are some interesting patterns in the locations away from the most likely location, and these subtle patterns are mirrored in the other same-side or opposite-side matchup.
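
One way to lay out the four matchups is a grid of panels with pitcher arm in the rows and batter side in the columns. The sketch below uses simulated data, so it will not reproduce the patterns described here; it assumes the pitch outcome is stored in a `description` column with the value `"called_strike"`, as in the Statcast data I am familiar with.

```r
library(ggplot2)

set.seed(2)
# Simulated stand-in for a Statcast data frame (NOT real data)
n <- 8000
sc <- data.frame(
  description = sample(c("called_strike", "ball", "swinging_strike"),
                       n, replace = TRUE),
  p_throws = sample(c("R", "L"), n, replace = TRUE),
  stand    = sample(c("R", "L"), n, replace = TRUE),
  plate_x  = rnorm(n, 0, 0.8),
  plate_z  = rnorm(n, 2.5, 0.8),
  stringsAsFactors = FALSE
)

# Keep only the called strikes
called <- subset(sc, description == "called_strike")

# Rows: pitcher arm; columns: batter side
p_called <- ggplot(called, aes(plate_x, plate_z)) +
  geom_density_2d_filled(contour_var = "ndensity") +
  facet_grid(p_throws ~ stand, labeller = label_both) +
  coord_fixed() +
  theme(legend.position = "none")
p_called
```

Filtering on `"swinging_strike"` instead gives the same layout for swinging strikes.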

#### Locations of Swinging Strikes

Where are the locations of the swinging strikes? It depends on the matchup. If a pitcher is throwing against a batter on the same side, then the swinging strikes tend to be in the outside lower corner of the zone — many of these swinging strikes are outside of the zone. In contrast, if a pitcher is throwing against a batter of the opposite side, then there are two likely regions for swinging strikes — low middle and high outside. Again, one sees mirror images of these locations — for example, compare R against L with L against R, or compare L against L with R against R.

#### Field Locations of Batted Balls

In the Statcast dataset, there are two variables, `hc_x` and `hc_y`, that provide the field locations of balls put into play. To get reasonable-looking graphs, I apply a simple linear transformation to these variables. Here is a density estimate graph of the field locations of all balls put into play for the 2020 season.
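
The exact transformation is not spelled out here, so the sketch below is an assumption: it shifts the coordinates so that home plate sits near the origin and flips the vertical axis so the outfield points up. The constants are a commonly used convention, not necessarily the ones behind the graphs in this post, and the helper name `transform_hc()` and the output columns `location_x` and `location_y` are hypothetical.

```r
library(ggplot2)

# Assumed linear transformation (constants are a common convention,
# NOT necessarily the ones used for the graphs in this post)
transform_hc <- function(hc_x, hc_y) {
  data.frame(
    location_x = 2.5 * (hc_x - 125.42),
    location_y = 2.5 * (198.27 - hc_y)
  )
}

set.seed(3)
# Simulated hit coordinates on the raw hc_x / hc_y scale (NOT real data)
bip <- data.frame(
  hc_x = runif(3000, 40, 210),
  hc_y = runif(3000, 30, 190)
)
bip <- cbind(bip, transform_hc(bip$hc_x, bip$hc_y))

# Filled density estimate of the transformed field locations
p_field <- ggplot(bip, aes(location_x, location_y)) +
  geom_density_2d_filled() +
  coord_fixed()
p_field
```

Using `coord_fixed()` keeps one unit the same length on both axes, so the field is not stretched.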

Since I didn’t find this graph that informative, let’s divide the batted balls by launch angle. I use the Statcast definition to divide the batted balls into ground balls, fly balls, line drives and popups. Since the location of these batted balls also depends on the batter side, I create two displays, one for each batter side, showing the locations for each of the four batted ball types. What do we see? Left-handed batters tend to pull ground balls and line drives, but they tend to hit popups and fly balls to the opposite side. The graph for right-handed batters, as expected, is a mirror image of the graph for left-handed hitters.
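
Here is a rough sketch of the classification step. The cutpoints (10, 25 and 50 degrees) are the launch-angle ranges I understand the Statcast definition to use, and the handling of values exactly at a cutpoint is my own choice, so treat both as assumptions. The helper `classify_bb()`, the column `bb_class`, and the simulated data are hypothetical.

```r
library(ggplot2)

# Assumed cutpoints: < 10 ground ball, 10-25 line drive,
# 25-50 fly ball, >= 50 popup (boundary handling is an assumption)
classify_bb <- function(launch_angle) {
  cut(launch_angle,
      breaks = c(-Inf, 10, 25, 50, Inf),
      labels = c("ground_ball", "line_drive", "fly_ball", "popup"),
      right = FALSE)
}

set.seed(4)
# Simulated balls in play with already transformed locations (NOT real data)
n <- 4000
bip <- data.frame(
  stand        = sample(c("L", "R"), n, replace = TRUE),
  launch_angle = runif(n, -30, 70),
  location_x   = rnorm(n, 0, 80),
  location_y   = abs(rnorm(n, 150, 80)),
  stringsAsFactors = FALSE
)
bip$bb_class <- classify_bb(bip$launch_angle)

# One display per batter side; here, left-handed batters
left <- subset(bip, stand == "L")

p_bb <- ggplot(left, aes(location_x, location_y)) +
  geom_density_2d_filled(contour_var = "ndensity") +
  facet_wrap(~ bb_class) +
  coord_fixed() +
  theme(legend.position = "none")
p_bb
```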

• R code? If you have some Statcast data, then these graphs are easy to create using the `ggplot2` package. You can find the code for these graphs on my GitHub Gist site.
• Conditional estimates. When one uses facets as above, the default contours display values of the joint density over all panels, so it may be hard to see the patterns in particular panels. By specifying the `contour_var = "ndensity"` argument in the `geom_density2d_filled()` function, the density values are normalized within each panel. With this adjustment, one is visualizing a density estimate conditional on the value of the variable in that facet. That is what I did in my plots above since I think the objective is to see the most likely locations within each panel.
• Revisions to the CalledStrike package. Recently I have revised my `CalledStrike` package, which is useful for visualizing smooth fits of specific metrics (say launch speed or batting average) over the zone. In a future post, I’ll illustrate the use of functions in the `CalledStrike` package to produce these smoothed fits.
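
To see what the `contour_var` setting changes, here is a small sketch on simulated data with two groups of very different spread. The data frame `d` and its columns are made up for illustration.

```r
library(ggplot2)

set.seed(5)
# Two simulated groups with very different spreads (NOT real data)
d <- rbind(
  data.frame(group = "tight", x = rnorm(2000, 0, 0.3), y = rnorm(2000, 0, 0.3)),
  data.frame(group = "wide",  x = rnorm(2000, 0, 1.5), y = rnorm(2000, 0, 1.5))
)

base <- ggplot(d, aes(x, y)) +
  facet_wrap(~ group) +
  coord_fixed()

# Default: contours are drawn on the raw density values, so the
# concentrated group dominates the contour levels
p_joint <- base + geom_density_2d_filled()

# Normalized: the density in each group (here, each panel) is rescaled
# to a maximum of 1 before the contours are drawn
p_cond <- base + geom_density_2d_filled(contour_var = "ndensity")

p_joint
p_cond
```

In the default plot the spread-out group should show few of the high bands, while in the normalized plot both panels use the full range of bands.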