In most sports, a particular accomplishment is easy to interpret since there is a precise statement of achievement. If an NBA player makes a three-point shot, we imagine a player making a shot from a distance more than 23 feet and 9 inches from the basket. A successful pole vault at 6 meters? This brings up an image of a vaulter clearing a height of 6 meters on the attempt. In both cases, there is a clear target that must be reached for the attempt to count as a success.
In contrast, there are different types of home runs in baseball. One might think about a powerful home run by Bryce Harper that clears the center-field fence. Or one might think about a line drive hit by Didi Gregorius that barely reaches the right-field stands. Due to the differing distances to the left, center and right fences in Major League ballparks, not all home runs are the same, although they count the same in terms of producing runs.
How can we sort out the different types of home run hitters in baseball? Is there a way to distinguish the big sluggers like Mike Trout, Aaron Judge and Bryce Harper from the other hitters who hit home runs to the short sides of the field?
This brief discussion raises several questions that I will address.
- We know that home runs are primarily a function of two launch variables — the exit velocity and the launch angle. How good is the “two launch variable” model for predicting home runs?
- What is a reasonable way to measure a player’s ability to get more or fewer home runs than one would expect from this two-variable model?
- Is this extra ability a skill? In other words, are there players who tend to hit more home runs than predicted from the two-variable model?
- Can we distinguish the “extra ability” home run hitters from the well-known home run sluggers?
The Two-Variable Model
Consider the probability of hitting a home run on a ball put in play. For the balls put in play in a particular season, we model the home run probabilities by the GAM model
$$\log \frac{P(HR)}{1 - P(HR)} = s(LA, LS),$$
where LA and LS are respectively the launch angle and exit velocity (launch speed) measurements, and s() is a smooth function.
One can use a cross-validation scheme to assess whether this two-variable model provides good predictions of the total home run count. For a particular Statcast season, we randomly split the balls in play into two halves that we call the “train” and “test” datasets. We fit the GAM model using the “train” dataset and use the fitted model to predict the total home run count in the “test” dataset. For each of the five Statcast seasons 2015 through 2019, we performed 10 random splits, and for each prediction exercise we constructed a 90% prediction interval. In the 50 cross-validation exercises, the actual total home run count was inside the prediction interval 38 times, for a coverage rate of 76%. This is below the nominal 90%, but the takeaway is that this two-variable model appears to provide a reasonable prediction of the total home run count for a specific season.
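The coverage check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the post does not say how the 90% prediction intervals were built, so the sketch uses a normal approximation to a sum of independent Bernoulli outcomes and ignores uncertainty in the fitted model. The function names `prediction_interval` and `coverage_rate` and the simulated probabilities are hypothetical, not taken from the original analysis.

```python
import random
from statistics import NormalDist


def prediction_interval(probs, level=0.90):
    """Normal-approximation interval for the total HR count.

    Assumption: each ball in play is an independent Bernoulli trial with
    the given predicted HR probability (model uncertainty is ignored).
    """
    mean = sum(probs)
    sd = sum(p * (1 - p) for p in probs) ** 0.5
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return mean - z * sd, mean + z * sd


def coverage_rate(n_covered, n_trials):
    """Fraction of exercises whose interval contained the actual count."""
    return n_covered / n_trials


# Toy data (hypothetical): predicted HR probabilities for 5000 "test" balls in play.
rng = random.Random(2019)
probs = [rng.betavariate(0.3, 6) for _ in range(5000)]

n_trials, n_covered = 50, 0
for _ in range(n_trials):
    # Simulate the "actual" home run total from the same probabilities.
    actual = sum(rng.random() < p for p in probs)
    low, high = prediction_interval(probs)
    n_covered += low <= actual <= high

print("simulated coverage:", coverage_rate(n_covered, n_trials))
print("coverage reported in the post:", coverage_rate(38, 50))  # 38 of 50 = 0.76
```

In this toy setup the simulated outcomes come from the same probabilities used to build the interval, so coverage should sit near 90%. A real train/test exercise can fall short of that when the model misses structure in the data.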
Deviations from Model Fit
To measure deviations from the fit, we apply a standardized residual that the reader has seen in recent posts of this blog. For each player, we compute the residual
$$Z = \frac{HR - E}{\sqrt{E}},$$
where HR is the observed home run count and E denotes the “expected” home run count, found by summing the predicted home run probabilities over all of the player’s balls in play.
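As a rough illustration of the residual calculation, here is a small Python sketch. It assumes the Pearson form Z = (HR − E) / √E, which is an assumption about the standardization rather than something confirmed here, and the helper names `expected_hr` and `z_residual` are hypothetical. The inputs are the observed and expected counts quoted later in this post for Kepler, Gurriel, and Judge.

```python
import math


def expected_hr(probs):
    """Expected HR count: sum of predicted HR probabilities over balls in play."""
    return sum(probs)


def z_residual(hr, expected):
    """Standardized residual, assuming the Pearson form (HR - E) / sqrt(E)."""
    return (hr - expected) / math.sqrt(expected)


# Three hypothetical balls in play with predicted HR probabilities.
print(expected_hr([0.9, 0.05, 0.4]))

# Observed and expected counts quoted in the post.
print(round(z_residual(36, 22.6), 2))   # Kepler 2019: about 2.82
print(round(z_residual(31, 13.5), 2))   # Gurriel 2019
print(round(z_residual(52, 60.6), 2))   # Judge 2017 (negative: fewer HR than expected)
```

Under this assumed form, a residual above 2 means the player hit noticeably more home runs than the launch variables alone would suggest.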
What did I find when I computed these residuals for all player/seasons in the Statcast era (2015 through 2019)?
- Most of the Z residuals fell between −2 and 2, which indicates the model is satisfactory. But there were 36 values of Z (player-seasons) larger than 2, and only seven smaller than −2.
- A few players had more than one season with Z > 2. This suggests that these particular players had an ability to hit home runs beyond what is predicted based on their launch angle and exit velocity measurements.
- If one graphs the Z-scores for consecutive seasons, the scores are positively correlated indicating there is generally some home run hitting ability beyond what is explained by the two launch variables.
To demonstrate this last point, here is a scatterplot of the expected HR rate against the Z-scores for the 2019 hitters who had at least 300 balls in play. We see a negative trend: batters with modest expected rates tend to have positive residuals, and big home run hitters with high expected rates tend to have small residuals. I’ve labeled five points where the player/season Z-score exceeds 2.5.
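The season-to-season comparison of Z-scores mentioned above could be checked along the following lines. This is a minimal sketch: the player labels and Z-scores are made-up placeholders, not values from the analysis, and `paired_z` and `pearson_r` are hypothetical helper names.

```python
def paired_z(z_prev, z_next):
    """Keep only players who appear in both seasons, in a common order."""
    common = sorted(set(z_prev) & set(z_next))
    return [z_prev[p] for p in common], [z_next[p] for p in common]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Placeholder Z-scores for two consecutive seasons (hypothetical values).
z_2018 = {"Player A": 2.1, "Player B": -0.4, "Player C": 0.9, "Player D": -1.2}
z_2019 = {"Player A": 1.6, "Player B": 0.1, "Player C": 1.3, "Player E": 0.5}

xs, ys = paired_z(z_2018, z_2019)
print(len(xs), "players in both seasons; r =", round(pearson_r(xs, ys), 3))
```

A clearly positive correlation across many players would support reading the residual as a persistent skill rather than single-season noise.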
Graphical display (Shiny app)
It would seem likely that these “high Z-score” hitters are talented in hitting home runs to right or left field, where the distance from home plate is relatively short. To see this effect for individual hitters, I constructed a Shiny app that displays, for a player of interest, the field locations of two types of balls in play.
- The home runs where the color of the point corresponds to the probability of a HR from the GAM model. This display would identify the “cheap” home runs hit to left or right field.
- The balls in play that were not home runs where the probability of a HR was at least 0.5. These are likely the fly balls that land on the warning track.
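The selection of the two groups of points can be sketched as follows. This is not the Shiny app itself, only an illustration in Python of the filtering step it would need. The `BallInPlay` record, its field names, and the sample values are hypothetical; the 0.5 threshold is the one stated above.

```python
from dataclasses import dataclass


@dataclass
class BallInPlay:
    x: float        # field location, horizontal coordinate (hypothetical units)
    y: float        # field location, vertical coordinate (hypothetical units)
    is_hr: bool     # True if the ball was a home run
    p_hr: float     # predicted HR probability from the two-variable model


def split_for_display(balls, threshold=0.5):
    """Return (home runs, non-home-runs with predicted HR probability >= threshold)."""
    home_runs = [b for b in balls if b.is_hr]
    near_misses = [b for b in balls if not b.is_hr and b.p_hr >= threshold]
    return home_runs, near_misses


# Made-up example records for one player.
balls = [
    BallInPlay(120.0, 310.0, True, 0.15),   # low-probability ("cheap") home run
    BallInPlay(5.0, 400.0, True, 0.92),     # high-probability home run
    BallInPlay(-10.0, 390.0, False, 0.71),  # likely home run that stayed in the park
    BallInPlay(40.0, 200.0, False, 0.02),   # ordinary ball in play, not displayed
]

home_runs, near_misses = split_for_display(balls)
print(len(home_runs), "home runs;", len(near_misses), "high-probability non-home-runs")
```

Coloring the first group by `p_hr` is what would make the low-probability home runs stand out on the field plot.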
Several Interesting Hitters
Here is a snapshot of the Shiny app for Max Kepler. We see that practically all of Kepler’s home runs are hit to right field. Note that many of the points are colored yellow, indicating that these balls had small predicted home run probabilities from the two-variable model. Looking at the table in the bottom left, we see that in the 2016, 2017 and 2019 seasons, Kepler hit 17, 19, and 36 home runs, which exceed the respective expected home run counts of 6.5, 12, and 22.6 from the model.
Another home run hitter with multiple high Z-scores was Yuli Gurriel. Most of Gurriel’s home runs are hit to left field, although he hit four home runs to right in the 2019 season. In that season the two-variable GAM model provided a poor prediction: based on his launch variables, he was predicted to hit 13.5 home runs, far fewer than his actual count of 31.
The displays for the well-known home run sluggers have a different pattern. For example, this Shiny snapshot shows the home run hitting for Aaron Judge for the 2016 through 2019 seasons. Most of Judge’s home runs had high HR probabilities using the GAM model. Moreover, there were a number of “not home runs”, especially during the 2017 season, where the probability of a HR was high. Judge’s home run count of 52 in the 2017 season was actually smaller than the expected home run count of 60.6. For the 2018 and 2019 seasons, the observed HR counts were similar to what would be predicted based on the exit velocity and launch angle measurements.
Here is the display for Nelson Cruz. Like Judge, Cruz had a number of “not home runs” where the probability of a home run was large. In Cruz’s case, for four of the five seasons, the observed home run count was smaller than the predicted count.
- A Famous Home Run Quest. I recently rewatched the HBO movie 61*, which describes Mickey Mantle and Roger Maris’ quest during the 1961 season to break Babe Ruth’s single-season record of 60 home runs. Some of Maris’ home runs benefited from the “short right-field porch” in Yankee Stadium. I do wonder about the spray angle measurements of Maris’ home runs in the 1961 season: how many of his home runs were pulled to right?
- Is the Two-Variable GAM Model Useful for Predicting Home Run Counts? Players can get “cheap” home runs that are not hit particularly hard, and “tough luck” fly ball outs that die on the warning track. This work suggests that one can accurately predict total home run counts for “sluggers” by using the exit velocities and launch angles. The GAM model appears to be less effective in predicting home runs for hitters like Kepler and Gurriel who tend to hit home runs to the extreme locations in the outfield.
- Another Use of Shiny. Here I am using Shiny not to demonstrate statistical calculations to others, but rather to facilitate my exploration of model fits and residuals. Once one has written a function with various inputs, it is a relatively easy exercise to convert the function to a Shiny app.