I had a great time at Saberseminar last weekend in Boston. There were many interesting talks, and it was a wonderful opportunity for students to connect with MLB teams (I understand that 25 teams were represented at the meeting). There was a series of talks on Sunday afternoon (including mine) that focused on the change in the baseball and the increase in home run hitting. My talk was given towards the end of Sunday afternoon and, based on some Twitter comments, it seems that there was some misunderstanding about its message. I hope this post will help to clarify the statistical approach I used in my presentation and explain the point I was trying to make. (Actually, writing this post is useful for me: it helps me improve the exposition of my work.)
By the way, here are the slides to my presentation for the interested reader: bit.ly/hrreport2019. A paper based on the material from the slides will be available soon.
What’s Causing the 2019 Home Run Increase?
It should be obvious to all baseball fans that 2019 is a crazy season for home run hitting. The interesting question is why. Specifically, what is driving the home run increase? Some talks at Saberseminar focused on the change in the baseball. I don’t dispute the role of the composition of the baseball. In fact, in the work of the MLB Home Run Committee, we found that there was clearly a reduction in the drag coefficient that was contributing to an increase in home runs from the second half of the 2015 season through the 2017 season.
But this doesn’t mean that the change in the baseball is the sole contributor to the home run increase. The point I was trying to make in my Saberseminar talk is that there are several plausible causes for the home run explosion and these other causes (that is, the ones not due to the baseball) may also be playing a big role in the home run phenomenon. Let’s focus here on non-ball reasons for the increase. This post will also be useful in explaining my statistical method for predicting home runs from one season to the next season.
My Prediction Method
The chance of a home run clearly is strongly related to the off-the-bat measurements of launch angle and launch speed. That motivates the consideration of the generalized additive model (GAM)
logit(P(HR)) = s(LA, LS)
where P(HR) is the probability of a home run, LA and LS are the launch angle and launch speed measurements, and s() is a smooth function. By the way, this model really is a helpful way of understanding the ball effect: given values of launch angle and launch speed, you get a reasonable estimate of the chance of a home run. (Since there appears to be a different ball being used from one season to the next, one will get a different fitted GAM for each season. If there is more drag one season, the probability of a HR for a given (LA, LS) will be smaller.)
Here is a description of how one uses this GAM in fitting and prediction. The method has a training step and a testing step.
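Inverting the logit makes the link between the smooth function and the probability explicit. This is just the standard inverse-logit identity written in the notation above, not an additional modeling assumption:

```latex
P(\mathrm{HR} \mid \mathrm{LA}, \mathrm{LS})
  = \frac{1}{1 + \exp\{-s(\mathrm{LA}, \mathrm{LS})\}}
```

So larger values of s(LA, LS) correspond to a higher chance of a home run, and the probability always stays between 0 and 1.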
- (Divide Data into 2 Parts) I divide my data into two parts — the training dataset and the test dataset. In the example, suppose I randomly divide the 2018 batted ball data into two parts of equal size.
- (Fit on Training Data) I fit my GAM model to the training dataset — this will let me estimate the chance of a home run for any values of (LA, LS).
- (Make Predictions on Test Data) I use this fitted model to predict the probability of a home run on the test dataset for all batted balls in that dataset. (Note that I am using the (LA, LS) values from the test data.) From these predictions, I can predict the total number of home runs in the test dataset. Actually, by simulation, I can produce a simulated sample from the predictive distribution and find an interval which contains the unknown HR count total with high probability.
- (How Good are My Predictions?) I actually know the number of home runs in the test data, so I compare the observed count with my predictions. If the observed count is near the middle of my predictive distribution, my model has done a pretty good job.
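The four steps above can be sketched in Python as follows. This is only an illustration under loud assumptions, not the R code behind this post: the batted balls are synthetic (not Statcast data), the probability surface that generates them is made up, and a coarse grid of observed home run rates stands in for the GAM smooth s(LA, LS). All function names are hypothetical.

```python
import math
import random

random.seed(2018)

def true_hr_prob(la, ls):
    """Made-up HR probability surface, used only to generate synthetic data."""
    z = 0.25 * (ls - 100.0) - ((la - 28.0) / 8.0) ** 2
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
    return 1.0 / (1.0 + math.exp(-z))

def make_batted_balls(n):
    """Synthetic (LA, LS, is_hr) triples; NOT real Statcast data."""
    balls = []
    for _ in range(n):
        la = random.gauss(12.0, 25.0)   # launch angle, degrees
        ls = random.gauss(88.0, 13.0)   # launch speed, mph
        hr = 1 if random.random() < true_hr_prob(la, ls) else 0
        balls.append((la, ls, hr))
    return balls

def cell(la, ls):
    """Grid cell (4 degrees by 3 mph) containing a batted ball."""
    return (math.floor(la / 4.0), math.floor(ls / 3.0))

def fit_binned(train):
    """Crude stand-in for the GAM: observed HR rate in each grid cell."""
    counts = {}
    for la, ls, hr in train:
        n, k = counts.get(cell(la, ls), (0, 0))
        counts[cell(la, ls)] = (n + 1, k + hr)
    return {c: k / n for c, (n, k) in counts.items()}

def predict(model, la, ls):
    """Predicted P(HR); cells never seen in training get probability 0."""
    return model.get(cell(la, ls), 0.0)

def simulate_totals(probs, n_sims):
    """Draws from the predictive distribution of the total HR count."""
    return [sum(1 for p in probs if random.random() < p) for _ in range(n_sims)]

# Step 1: divide the data randomly into two halves.
balls = make_batted_balls(20000)
random.shuffle(balls)
half = len(balls) // 2
train, test = balls[:half], balls[half:]

# Step 2: fit on the training half.
model = fit_binned(train)

# Step 3: predict on the test half, using the test (LA, LS) values.
probs = [predict(model, la, ls) for la, ls, _ in test]
totals = sorted(simulate_totals(probs, 200))
lo, hi = totals[int(0.05 * len(totals))], totals[int(0.95 * len(totals)) - 1]

# Step 4: compare with the observed count in the test half.
observed = sum(hr for _, _, hr in test)
print(f"90% predictive interval: ({lo}, {hi}); observed: {observed}")
```

The binned estimator is much rougher than a smooth fit, so it is there only to keep the sketch dependency-free; the split, predict, simulate, and compare logic is the part that mirrors the description.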
Trying the Method Out on 2018 Statcast Data
I implemented this prediction method using 2018 Statcast data: I divided the dataset randomly into two groups, fit the model on one group, and used the model to predict the home run count in the second group. Here’s the resulting picture. The tan histogram represents my predictions of the total HR count and the observed HR count (red line) is in the middle of the predictive distribution. My model is working well.
Making a Small Adjustment to Off-the-Bat Measures
I am going to repeat what I did above but with one change. Suppose that the hitters in the test data are more home run savvy than the hitters in the training data. In particular, for each batted ball in the test data, I add 0.3 mph to the launch speed and 0.4 degrees to the launch angle. These changes obviously have no effect on my fitted model since I am not changing the launch angles and launch speeds for the training data. Also, there is not a ball effect here since the 2018 balls are used in both the test and training datasets. The question is: what effect do these LA, LS adjustments to the test data have on our predictions? The new picture is below.
Things have changed dramatically. My new home run predictions range from 3000 to 3100 (remember, we are only predicting the total for half of the 2018 data), but we observed only 2774 home runs in the test dataset. This is a big error: my predictions are about 300 more than the observed home run count. My model isn’t working very well, because the actual 2018 hitters in the test half produced the original off-the-bat measurements, not the ones I artificially created.
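To see the mechanics of this adjustment in isolation, here is a minimal Python sketch. It assumes a made-up probability surface in place of the fitted 2018 model and synthetic test inputs, so the printed totals will not match the numbers in this post. Only the shifts of 0.3 mph and 0.4 degrees come from the text; the function `hr_prob` and its coefficients are hypothetical.

```python
import math
import random

random.seed(7)

def hr_prob(la, ls):
    """Hypothetical stand-in for a fitted surface; made up, not estimated."""
    z = 0.25 * (ls - 100.0) - ((la - 28.0) / 8.0) ** 2
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
    return 1.0 / (1.0 + math.exp(-z))

# Synthetic test-set inputs: (launch angle in degrees, launch speed in mph).
test = [(random.gauss(12.0, 25.0), random.gauss(88.0, 13.0))
        for _ in range(10000)]

LS_SHIFT = 0.3  # mph added to every test launch speed (value from the post)
LA_SHIFT = 0.4  # degrees added to every test launch angle (value from the post)

# The model is held fixed; only the test inputs change.
base_total = sum(hr_prob(la, ls) for la, ls in test)
shifted_total = sum(hr_prob(la + LA_SHIFT, ls + LS_SHIFT) for la, ls in test)

print(f"expected HR, original inputs: {base_total:.1f}")
print(f"expected HR, shifted inputs:  {shifted_total:.1f}")
print(f"relative change: {100 * (shifted_total / base_total - 1):.1f}%")
```

Summing the predicted probabilities gives the expected home run total, so comparing the two sums shows how a small uniform shift in the inputs moves the prediction while the surface itself stays the same.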
How Did I Decide on the Adjustments?
Okay, if you have read this far, you are probably thinking … this is not surprising since you added a “large” value of 0.3 mph to each launch speed and a “large” value of 0.4 degrees to each launch angle. How did you come up with the values of 0.3 and 0.4?
Well, I just looked at the mean values of launch angle and launch speed for the 2018 and 2019 hitters. If one restricts the data to batted balls with launch angles > 10 degrees and launch speeds > 80 mph, one gets these stats. The 2019 hitters average 0.3 mph higher in launch speed, and average 0.4 degrees higher in launch angle. The values I used in my adjustment were just the same values distinguishing the 2018 and 2019 hitters.
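The restricted comparison described here can be sketched in Python as below. The two tiny datasets are invented purely to show the filtering and averaging; they are not 2018 or 2019 Statcast values, and the helper name `restricted_means` is hypothetical.

```python
from statistics import mean

def restricted_means(balls, min_la=10.0, min_ls=80.0):
    """Mean (LA, LS) over batted balls with LA > min_la and LS > min_ls."""
    kept = [(la, ls) for la, ls in balls if la > min_la and ls > min_ls]
    return mean(la for la, _ in kept), mean(ls for _, ls in kept)

# Invented (launch angle, launch speed) pairs, for illustration only.
season_a = [(5.0, 95.0), (15.0, 90.0), (25.0, 100.0), (30.0, 75.0)]
season_b = [(12.0, 92.0), (26.0, 101.0), (8.0, 99.0), (31.0, 95.0)]

la_a, ls_a = restricted_means(season_a)
la_b, ls_b = restricted_means(season_b)

# Differences in means between the two (invented) seasons.
print(f"LA difference: {la_b - la_a:+.1f} degrees")
print(f"LS difference: {ls_b - ls_a:+.1f} mph")
```

With real data, the two differences printed at the end are the quantities that would be used as the launch angle and launch speed adjustments.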
Predicting 2019 Home Runs from a 2018 Model Fit
For another example, suppose I fit my GAM to the 2018 data to estimate the HR probability and then use this model to predict the HR count in the first half of 2019. What I am essentially doing in this exercise is using the 2018 balls to predict what will happen in 2019. As the graph shows, my predictions miss the mark: I understate the actual home run count by about 250 home runs. Why? There are two possible explanations for the 2019 increase in home runs: the ball and the increase in LA and LS of the hitters. Since my HR probability predictions use the 2019 (LA, LS) inputs, my prediction of the total HR count already adjusts for the change in off-the-bat measurements. So the difference between the prediction and the observed count is due to the ball effect. In other words, the 2019 balls account for an increase of about 250 home runs in the 2019 home run count.
The Point is that the HR Explosion is Due to More than the Ball
I did two prediction exercises. In the first exercise, I showed that changes in launch angles and launch speeds can contribute to a big increase in home runs even when one is using the same balls. In the second exercise, I showed that the change in balls can have a big effect even when one adjusts for changes in launch angles and launch speeds.
The point I am trying to make is that there are other factors besides the ball composition that can contribute to the home run explosion. Interestingly, I was the only one to talk about this at Saberseminar. I have demonstrated above in this modeling exercise that subtle changes in the launch angles and launch speeds of the batted balls can result in big changes in the predicted total home run count. These changes happen even if there is no change in the composition of the ball. And actually we are currently observing changes in LA and LS of this magnitude from the 2018 to the 2019 season.
Yes, the ball likely is a contributing factor to the home run increase. But don’t get so preoccupied with tearing up or measuring baseballs without looking at the whole story. The complete story about the home run increase will include some discussion of the changing qualities of the baseball hitters.
All of the R code for this exercise can be found on my GitHub Gist site.
Jim — very simple, yet logically advanced analysis. Your results clearly make the point that something has dramatically changed in 2019 with respect to the HR totals and rates we are seeing this season.
Great job Jim.
One thing that just doesn’t make sense to me is this: every time I see an opposite-field HR on an inside-out swing, I go check Exit Velocity and most of the time they are around 100 mph.
Under your estimation, those HR are not particularly aided by the ball, as the EV is pretty high. However, they don’t seem to match with the visuals of the swings and contact.
Do you think there’s a possibility that the new balls are also resulting in higher Exit Velocities? If that’s the case, your model would miss that contribution by the balls, giving all the credit for the higher velocities to the batter.
Again, really enjoyed the post. Good stuff.
Hi Jim. I really enjoyed your article. I’m curious about something in your code. You simulate the number of home runs by counting those estimates that have a higher probability than a random uniform number. My assumption is that this practice includes random variation in the simulation. If this is the case, why do you use the uniform distribution? Sorry if this is too complicated to answer easily. Thanks
Suppose the predicted probability of a home run is 0.3. To simulate 10 PAs with this home run probability, you simulate 10 uniforms from 0 to 1 and predict a home run for each PA whose draw is smaller than 0.3. Basically, each draw is the same as a simulated coin flip where the probability of heads is 0.3.
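Here is a small Python sketch of that idea (the code for the post itself is in R; the function name `simulate_hr` is just for illustration). Each uniform draw below 0.3 counts as a home run, and over many draws the fraction of home runs settles near 0.3.

```python
import random

random.seed(1)

def simulate_hr(p, n):
    """n Bernoulli(p) outcomes via uniform draws: 1 (home run) when U < p."""
    return [1 if random.random() < p else 0 for _ in range(n)]

# Ten plate appearances, each with home run probability 0.3.
ten_pas = simulate_hr(0.3, 10)
print(ten_pas, "->", sum(ten_pas), "home runs")

# With many draws, the observed fraction is close to 0.3.
many = simulate_hr(0.3, 100000)
print(sum(many) / len(many))
```

Applying the same comparison to every batted ball, each with its own predicted probability, and adding up the results gives one simulated home run total.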
Hope this helps explain this.