For this post, I thought I’d do something a little different. I’d start with a baseball question and describe the process for getting the relevant data and constructing a reasonable graph to address the question.
There is a lot of talk nowadays about a hitter’s launch speed. Thinking about this, I’m interested in how the launch speed of a batted ball depends on the location of the pitch. Certainly, one would anticipate that the batted ball’s launch speed is greatest for pitches in the middle of the strike zone, but I’m interested in how the launch speed varies for other pitch locations.
Currently, the best source of data is the Statcast system, and Bill Petti has written a package, baseballr, that makes it easy to read in the relevant data. Using the function scrape_statcast_savant_batter_all(), I collected the pitch-by-pitch data for all games played over the weekend of August 4 through August 6.
To get the data in a reasonable form, some preliminary work has to be done.
- I am only interested in batted balls, so I use the filter() function to keep only the pitches where the ball is put into play.
- For some reason, the x and z pitch location variables were in character format, so I convert both to numeric type.
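The two preparation steps above might look roughly like the sketch below. It runs on a tiny made-up data frame instead of the scraped data, and the column names (type, plate_x, plate_z, launch_speed) and the "X" code for balls in play are assumptions based on the Statcast CSV export, so check them against what the scraping function actually returns.

```r
library(dplyr)

# Toy stand-in for the scraped pitch-by-pitch data; all values are invented.
# Column names are assumed to match the Statcast export.
sc <- tibble(
  type         = c("X", "S", "B", "X", "X"),
  plate_x      = c("0.12", "-0.85", "1.40", "-0.33", "0.51"),
  plate_z      = c("2.45", "1.90", "3.80", "2.10", "3.02"),
  launch_speed = c(98.3, NA, NA, 84.1, 91.7)
)

in_play <- sc %>%
  filter(type == "X") %>%                    # assumed code for a ball put into play
  mutate(plate_x = as.numeric(plate_x),      # character -> numeric
         plate_z = as.numeric(plate_z))
```

The result, in_play, keeps only the batted balls and has numeric location columns ready for plotting.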
Now I can construct some initial plots. To make sure I have reasonable data, I graph the pitch locations for all batted balls. As one would expect, practically all of the batted balls are on pitches within the strike zone.
Next, I redraw this graph, mapping the color of each point to the launch speed variable. (I have categorized the launch speed into four groups, where the cutpoints are given by the quartiles.)
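As a rough illustration of this colored scatterplot, here is a sketch using ggplot2. The data are simulated (not real Statcast values), and the names bb, speed_group, plate_x, plate_z, and launch_speed are placeholders chosen for this example.

```r
library(ggplot2)

# Simulated batted balls; a stand-in for the real data, not actual measurements.
set.seed(1)
n <- 500
bb <- data.frame(plate_x = rnorm(n, 0, 0.6),
                 plate_z = rnorm(n, 2.5, 0.6))
bb$launch_speed <- 92 - 8 * bb$plate_x^2 - 8 * (bb$plate_z - 2.3)^2 +
  rnorm(n, 0, 8)

# Cut launch speed into four groups at the quartiles.
cuts <- quantile(bb$launch_speed, probs = seq(0, 1, 0.25))
bb$speed_group <- cut(bb$launch_speed, breaks = cuts, include.lowest = TRUE)

# Pitch locations, colored by launch speed group.
p <- ggplot(bb, aes(plate_x, plate_z, color = speed_group)) +
  geom_point() +
  coord_equal() +
  labs(x = "Horizontal location", y = "Vertical location",
       color = "Launch speed (mph)")
print(p)
```

Dropping the color aesthetic from aes() gives the plain plot of pitch locations described earlier.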
A better graph
I don’t think this colorized graph is very helpful, so I consider alternative displays. A statistical model can help in understanding the relationship between launch speed and pitch location. Since this relationship is likely nonlinear, I use a generalized additive model. Essentially, we are saying that launch speed is an arbitrary smooth function s(px, pz) (plus a random error), where px and pz are the horizontal and vertical coordinates of the pitch. Once this model is fit, I predict the launch speed over a grid of values of (px, pz) and use a contour graph to show the predicted launch speeds at specific locations.
Here’s my graph. It shows the predicted launch speed, with contour lines drawn at launch speeds of 90, 85, 80, and 75 mph.
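One way to carry out the fit, the grid prediction, and the contour graph is sketched below with the mgcv and ggplot2 packages. This is not necessarily the exact code behind the graph: the data are simulated, the grid limits are arbitrary, and the column names are assumptions.

```r
library(mgcv)
library(ggplot2)

# Simulated batted balls; a stand-in for the real data, not actual measurements.
set.seed(1)
n <- 2000
bb <- data.frame(plate_x = rnorm(n, 0, 0.7),
                 plate_z = rnorm(n, 2.5, 0.7))
bb$launch_speed <- 92 - 6 * bb$plate_x^2 - 6 * (bb$plate_z - 2.3)^2 +
  rnorm(n, 0, 10)

# Launch speed as a smooth function of the two pitch coordinates.
fit <- gam(launch_speed ~ s(plate_x, plate_z), data = bb)

# Predict over a regular grid of pitch locations.
grid <- expand.grid(plate_x = seq(-1.5, 1.5, length.out = 50),
                    plate_z = seq(1, 4, length.out = 50))
grid$pred <- as.numeric(predict(fit, newdata = grid))

# Contour lines at the four launch speeds of interest.
p <- ggplot(grid, aes(plate_x, plate_z, z = pred)) +
  geom_contour(breaks = c(75, 80, 85, 90)) +
  coord_equal()
print(p)
```

Predicting on a grid rather than at the observed pitch locations is what allows contour lines to be drawn, since geom_contour() expects values on a regular lattice.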
The pattern of this contour graph is what we expected, but there are some interesting take-aways:
- The sweet spot (the area with the greatest launch speeds) is in the low-middle section of the strike zone.
- The edge of the strike zone corresponds, roughly, with a launch speed of 85 mph.
- There is some asymmetry in the contour lines, which likely reflects the larger number of hitters who bat right-handed. (This could easily be checked by fitting two models: one to right-handed hitters and one to lefties.)
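The handedness check mentioned in the last point could be sketched as follows. The data are simulated, and the stand column (batter side, coded "R" or "L") is an assumption about how the Statcast data records handedness.

```r
library(mgcv)

# Simulated batted balls with a batter-side column; values are invented.
set.seed(2)
n <- 1500
bb <- data.frame(
  stand   = sample(c("R", "L"), n, replace = TRUE, prob = c(0.6, 0.4)),
  plate_x = rnorm(n, 0, 0.7),
  plate_z = rnorm(n, 2.5, 0.7)
)
bb$launch_speed <- 92 - 6 * bb$plate_x^2 - 6 * (bb$plate_z - 2.3)^2 +
  rnorm(n, 0, 10)

# Fit the same smooth model separately to each batter side.
fits <- lapply(split(bb, bb$stand), function(d) {
  gam(launch_speed ~ s(plate_x, plate_z), data = d)
})
names(fits)
```

Each element of fits can then be passed to predict() over the same grid, giving one contour graph per batter side to compare.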
Code and final remarks
I have posted my R code here. Given the ease of obtaining this data through the baseballr package, I hope I have encouraged interested readers to do their own analyses. Pitch location is one of the most important aspects of pitching and hitting, and we can easily associate pitch location with batted ball variables using this data.