In the last two blog posts (part I and part II), we have been gaining some insight into the drop in home run hitting in 2018 by looking at distances traveled on balls in play. We want to see if the distances traveled on balls in play in 2018 are consistent with distances traveled in 2017 adjusting for changes in launch speed and launch angles between the two seasons. Here we illustrate a modeling/predictive approach to this problem. First, we use a fitted generalized additive model to explain the distance traveled knowing values of the launch angle and exit velocity. Next, we explain how this fitted model can be used to predict distances traveled of balls in play given 2018 values of launch angles and exit velocities. Last, we compare these predictions to the actual distances traveled in the 2018 season. Since game time temperature may be a factor in distances traveled, we perform this comparison for each month of the 2018 season.
Modeling Distance of BIP from 2017 Data
We collect all of the balls in play for the 2017 season where the launch angle is positive. We fit the generalized additive model (GAM)
distance = s(launch_angle, launch_velocity) + error
where s() is a smooth function and the errors are assumed normal with mean 0 and standard deviation σ.
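To make the model concrete, here is a minimal sketch of fitting a smooth surface to (launch angle, launch speed, distance) data. The data below are synthetic stand-ins (the real fit would use 2017 Statcast balls in play), and a bivariate smoothing spline stands in for the GAM smooth s(); in practice one would fit the GAM itself, e.g. with mgcv in R.

```python
import numpy as np
from scipy.interpolate import SmoothBivariateSpline

rng = np.random.default_rng(2017)

# Synthetic stand-in for 2017 balls in play with positive launch angle:
# distance rises with speed and peaks at a moderate launch angle.
n = 5000
angle = rng.uniform(1, 50, n)      # launch angle (degrees)
speed = rng.uniform(60, 115, n)    # launch speed (mph)
true_mean = 2.5 * speed + 4.0 * angle - 0.08 * angle ** 2 - 50
distance = true_mean + rng.normal(0, 20.5, n)  # residual sd assumed ~20.5 ft

# Smooth surface standing in for the GAM smooth s(angle, speed);
# s= controls the smoothing (target residual sum of squares).
s_hat = SmoothBivariateSpline(angle, speed, distance, s=n * 20.5 ** 2)

# Estimate the residual standard deviation of the fitted model
resid = distance - s_hat.ev(angle, speed)
sigma_hat = resid.std(ddof=1)
```

With data simulated this way, `sigma_hat` recovers a value close to the residual standard deviation used to generate the distances.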
In Part I of this three-part series on distances of balls in play, I display a contour plot of the fitted GAM distances as a function of launch angle and launch velocity.
Predicting Distances for a Month in the 2018 Season
Since we are studying the home run issue, we focus on predicting distances traveled for values of launch angle and launch velocity where home runs are likely. Practically all of the home runs are hit when the launch angle falls between 15 and 40 degrees and the launch speed exceeds 90 mph, so we collect the (launch angle, launch speed) values in this region for a particular month in 2018, say April. We use the 2017 data model to predict the distance traveled for these balls in play.
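Selecting the home-run region is a simple filter on the two measured variables. The arrays below are hypothetical (angle, speed) pairs, just to show the logic:

```python
import numpy as np

# Hypothetical April 2018 balls in play: angle (degrees), speed (mph)
angle = np.array([22.0, 35.0, 10.0, 28.0, 41.0])
speed = np.array([95.0, 88.0, 101.0, 104.0, 97.0])

# Keep balls in the home-run region:
# launch angle between 15 and 40 degrees, launch speed above 90 mph
in_region = (angle >= 15) & (angle <= 40) & (speed > 90)

angle_hr, speed_hr = angle[in_region], speed[in_region]
```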
Now I am not interested in getting a “best” prediction for the distance traveled, but rather I want to simulate a value from the predictive distribution of the distance traveled. I will do this in two steps — first, I will simulate a value of the expected distance traveled, and then I will use this expected distance value to simulate an actual distance that the ball will travel. (In statistical jargon, the first step allows for inferential uncertainty and the second step incorporates the predictive uncertainty in the process. For Bayesian readers, I am simulating a draw from the posterior predictive distribution.)
I simulate a single value of the predictive distribution from the 2017 data model using each of the (launch angle, launch speed) values from April of 2018. So for each of these values, we have both a simulated predicted distance and the actual 2018 distance traveled.
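The two-step simulation can be sketched as follows. The fitted mean, its standard error, and the residual standard deviation below are hypothetical numbers standing in for output of the fitted GAM at one (angle, speed) pair:

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_predictive(mu_hat, se_mu, sigma_hat, rng):
    """Two-step draw from the (approximate) posterior predictive distribution.

    mu_hat    -- fitted mean distance at one (angle, speed) pair
    se_mu     -- standard error of that fitted mean (inferential uncertainty)
    sigma_hat -- residual standard deviation (predictive uncertainty)
    """
    # Step 1: simulate a plausible value of the *expected* distance
    mu_draw = rng.normal(mu_hat, se_mu)
    # Step 2: simulate an actual distance around that expected value
    return rng.normal(mu_draw, sigma_hat)

# Example: fitted mean 390 ft, standard error 2 ft, residual sd 20.5 ft
draws = np.array([simulate_predictive(390.0, 2.0, 20.5, rng)
                  for _ in range(10_000)])
# Draws center near 390 ft with sd near sqrt(2^2 + 20.5^2), about 20.6 ft
```

In the actual analysis one such draw is made for every 2018 ball in play in the home-run region, using the fitted mean and standard error at its own (angle, speed) value.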
Comparing 2018 Distances with the 2017 Model Predictions
We focus on the differences between observed and predicted distances where a difference is defined as:
Difference = 2018 Distance MINUS Prediction from 2017 Model
Remember we are focusing on hard-hit balls (exit velocity exceeding 90 mph) and launch angles between 15 and 40 degrees — there are approximately 4200-4800 balls in play a month in this group. Below we show the differences (observed minus predicted) for all these balls in play for each month of the 2018 season. A red line is drawn at the value 0 to make it easy to read the locations of the boxplots.
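The monthly comparison reduces to a grouped summary of the differences. A small sketch with hypothetical per-ball values (the real data would have thousands of rows per month):

```python
import pandas as pd

# Hypothetical per-ball results: month, actual 2018 distance,
# and the simulated prediction from the 2017 data model (all in feet)
df = pd.DataFrame({
    "month": ["Apr", "Apr", "May", "May", "Jun"],
    "distance_2018": [385.0, 402.0, 391.0, 378.0, 399.0],
    "prediction_2017": [396.0, 405.0, 399.0, 384.0, 403.0],
})

# Difference = 2018 distance MINUS prediction from 2017 model
df["difference"] = df["distance_2018"] - df["prediction_2017"]

# Median difference per month; negative values mean the 2017
# model overpredicts the 2018 distances
monthly_median = df.groupby("month", sort=False)["difference"].median()
```

These grouped differences are exactly what the monthly boxplots display; the medians correspond to the per-month values reported below.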
Here are some takeaways from this graph:
- There is a sizeable error in the predictions of distance. We estimate the standard deviation σ in our GAM model to be 20.5 feet, so there are other factors besides launch angle and launch speed (such as spin, fielding, and measurement error) that affect the distance of a batted ball.
- For each month of the 2018 season, the average difference between actual and predicted distances is negative, which means that the 2017 data model systematically overestimates the distances traveled of the 2018 balls in play.
- How big are these effects? Below I show the median difference (in feet) for each month. Distances are down the most in the cold spring months of April and May, but even in the summer months of 2018 they are 5 to 7 feet lower, on average, than predicted from the 2017 data model.
- We see some interesting outliers where the difference values are smaller than -100 or larger than 100 feet. It would be interesting to explore these outliers to see what might cause these large prediction errors.
- Based on my recent work, it is pretty obvious to me that there has (again) been a change in the characteristics of the baseball, one that has reduced the distances traveled of balls in play in 2018 and, in turn, led to a drop in home runs. Remember that the recent MLB committee that explored the increase in home run hitting concluded there were changes to the composition of the ball that contributed to the sudden increase in home run hitting from 2015 to 2017.
- This work indicates the size of the effect. Yes, the cold weather in 2018 is a confounding factor, but the average distance of potential home run balls dropped by 5-7 feet in 2018 (compared with 2017) even in the warmer months of the season.
- Although I didn’t explain it thoroughly, this illustrates how one simulates a predictive distribution from a fitted model. This approach is flexible and can be used in a variety of ways to check if predictions from a fitted model are consistent with another dataset.