Introduction to xBA
As defined by Major League Baseball, expected batting average (xBA) is an estimate of the probability that a batted ball is a hit based on exit velocity and launch angle. Since the xBA values are provided with the Statcast data, I had several things to explore:
- Is it possible to replicate the Statcast hit probabilities using a statistical model? (I am thinking of a generalized additive model (GAM), which is a good general-purpose regression method.)
- If there are differences between the Statcast probability estimates and the GAM probability estimates, how are they different?
- How can one compute an expected batting average on all at-bats from the xBA estimates?
- How does the actual number of hits compare with the expected number of hits for each player after one month of baseball in the 2018 season?
- Can one improve on the hit probability estimates using the spray angle measurement?
GAM Model for Predicting Hits from Launch Angle and Exit Velocity
Using all of the in-play data from the 2017 season, I fit a generalized additive model (GAM) of the form
logit(prob(H)) = s(Exit Velocity, Launch Angle)
where s() is a smooth function of the two variables. (In a later blog post, I’ll explain how a GAM works.)
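To make the logit scale concrete, here is a minimal Python sketch (not the code used for this post) of how a value of the smooth s() maps back to a hit probability. The values of s() in the loop are made up for illustration; a fitted GAM would supply them.

```python
import math


def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inv_logit(eta):
    """Map a value on the logit scale back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical values of s(Exit Velocity, Launch Angle) for three batted balls.
for eta in (-2.0, 0.0, 1.0):
    print(f"s = {eta:+.1f}  ->  prob(H) = {inv_logit(eta):.3f}")
```

A value of s() equal to zero corresponds to a hit probability of one half, and negative values correspond to batted balls that are more likely to be outs.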
Comparing GAM Predictions with Statcast Hit Probabilities
Using this GAM model, I predict the probability of a hit for all batted balls in the first month of the 2018 season. I don’t know how MLBAM computes their probability estimates, but we can compare the two sets of estimates. Here is a scatterplot of the two estimates. On average they seem similar, but there is considerable variation in my GAM estimates for a fixed value of the Statcast estimate.
Looking further, here are parallel histograms of the predicted hit probabilities using the two methods. The Statcast estimates seem a bit more discrete and there are clusters of the GAM estimates about the values 0.30 and 0.75.
Here I explore where the GAM estimates differ from the Statcast estimates as a function of the two variables (blue means the GAM estimate understates the probability of a hit relative to Statcast and red means that the GAM estimate overstates it). It is pretty clear where in the (launch angle, exit velocity) space the two methods are different.
Expected Unconditional Batting Averages
For clarification purposes, I think the term “expected batting average” is a little confusing since the Statcast BA estimates are conditional on balls in play. An unconditional batting average is actually
BA = (1 – SO_rate) x BABIP
where SO_rate is the rate of strikeouts (among AB) and BABIP is the batting average on balls in play. The estimates shown above estimate the true BABIP rate, but we need to also estimate the strikeout probabilities. One reasonable approach is to estimate the SO probabilities using 2017 data (by adjusting them slightly towards the average SO rate) and merge these estimates with the 2018 estimates of the BABIP rates to get estimates of the (unconditional) expected BA’s.
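The calculation can be sketched in a few lines of Python. This is only an illustration under assumptions: the shrinkage formula, the constant prior_ab, and the league rate of 0.22 are hypothetical choices, not the adjustment actually used for this post.

```python
def shrink_so_rate(so, ab, league_rate, prior_ab=100.0):
    """Pull an observed strikeout rate towards the league average.

    prior_ab is a hypothetical tuning constant: the number of
    'league average' at-bats mixed in with the player's own at-bats.
    """
    return (so + prior_ab * league_rate) / (ab + prior_ab)


def unconditional_ba(so_rate, babip):
    """BA = (1 - SO_rate) x BABIP."""
    return (1.0 - so_rate) * babip


# Made-up player: 120 strikeouts in 500 AB in 2017,
# and an expected BABIP of 0.320 from the 2018 batted balls.
so_rate = shrink_so_rate(so=120, ab=500, league_rate=0.22)
print(round(so_rate, 3), round(unconditional_ba(so_rate, 0.320), 3))
```

The larger prior_ab is, the more strongly a player's strikeout rate is pulled towards the league rate.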
Hits Minus Expected Hits (Luck) After a Month of Baseball
Using the GAM estimates of the in-play hit probabilities, we are interested in whether they provide reasonable estimates of the number of hits for players in the first month of the 2018 season. By summing the estimated hit probabilities over all of a player’s batted balls, we get an expected number of hits. We plot the residual
Residual = Hits – Expected Hits
against the number of batted balls below. We see a fan-shaped pattern: the residuals tend to be larger for players with more batted balls. This is a sample size effect: the count of hits with more opportunities tends to be more variable than the count of hits with fewer opportunities.
To adjust for this sample size effect, it is desirable to compute a standardized residual such as
Z = (Hits – Expected Hits) / sqrt(Expected Hits)
that we plot below.
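As a minimal Python sketch of this calculation (the hit indicators and hit probabilities below are made up, and the function name is my own), the expected number of hits is the sum of the hit probabilities and Z follows directly:

```python
import math


def standardized_residual(hit_flags, hit_probs):
    """Z = (Hits - Expected Hits) / sqrt(Expected Hits) for one player."""
    hits = sum(hit_flags)          # observed hits (1 = hit, 0 = out)
    expected = sum(hit_probs)      # sum of estimated hit probabilities
    return (hits - expected) / math.sqrt(expected)


# Made-up player with eight batted balls.
flags = [1, 0, 0, 1, 0, 0, 0, 1]
probs = [0.75, 0.25, 0.5, 0.5, 0.25, 0.75, 0.5, 0.5]
z = standardized_residual(flags, probs)
print(z)
```

Here the player has 3 hits against 4 expected hits, so Z is negative: fewer hits than the model expects.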
There are several takeaways from this residual graph. First, practically all of the standardized residuals are between -2 and +2 which is what one typically sees with a reasonable model explaining much of the variation in hit probabilities. Second, 65% of the residuals are smaller than zero, indicating that our GAM model is overstating the probability of a hit — there are fewer hits than expected. Last, there are three “interesting players” with residuals exceeding 2 in absolute value. Carlos Santana (Z = -2.18) and Randal Grichuk (Z = -2.17) have been unlucky in terms of getting hits (fewer hits than expected), and Joey Wendle (Z = 2.04) has been unusually lucky this season (more hits than expected).
Our model only uses the launch angle and exit velocity, and it would seem that the spray angle would also be helpful in predicting hits. I define an adjusted spray angle so that a negative value corresponds to a ball that is pulled and a positive value corresponds to a ball hit to the opposite side. I divide the batted ball data into bins by adjusted spray angle and this graph displays the average standardized residual in each bin. We see a clear pattern. Balls hit in the middle of the field tend to have negative Z scores, meaning there are fewer hits than expected based on launch angle and exit velocity. In contrast, balls hit at extreme spray angles (smaller than -35 degrees or larger than 35 degrees) tend to have positive Z scores, meaning there are more hits than expected. This pattern makes some sense given the traditional fielding positions. The largest residuals correspond to batted balls hit at extreme angles on the opposite side. Are these batted balls beating the shift and causing more hits than expected?
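The adjustment and the binning can be sketched in Python as follows. This is an illustration under assumptions, not the code behind the graph: I assume the raw spray angle is negative towards left field and positive towards right field, the 10-degree bin width is arbitrary, and the residual values are made up.

```python
import math
from collections import defaultdict


def adjusted_spray_angle(spray_angle, stand):
    """Flip the sign for left-handed batters so negative always means pulled.

    Assumes the raw angle is negative towards left field; stand is "L" or "R".
    """
    return -spray_angle if stand == "L" else spray_angle


def mean_residual_by_bin(angles, residuals, width=10.0):
    """Average the residuals within bins of adjusted spray angle.

    Keys of the returned dict are the left edges of the bins.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for angle, resid in zip(angles, residuals):
        left_edge = math.floor(angle / width) * width
        sums[left_edge] += resid
        counts[left_edge] += 1
    return {edge: sums[edge] / counts[edge] for edge in sorted(sums)}


# Made-up batted balls: adjusted spray angles and their residuals.
bins = mean_residual_by_bin([-40.0, -38.0, -5.0, 3.0], [0.5, 0.25, -0.25, -0.5])
print(bins)
```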
A Better Model
The above residual graph indicates that Adjusted Spray Angle is a relevant input for predicting hits and that motivates trying the GAM of the form
logit(prob(H)) = s(Exit Velocity, Launch Angle, Adjusted Spray Angle)
I tried fitting this model on all of the 2017 batted ball data. My work on this is in its early stages, but it appears that if one uses the simple criterion
sum(abs(Hit – Expected BA))
this three-variable model does significantly better (on 2018 batted ball data) than both the Statcast and two-variable GAM models in predicting hits. Honestly, though, the improvement is not that great: 62.6% of the residuals are still negative (as opposed to 65% in the two-variable model above). Using this new model, Carlos Santana still has 10.2 fewer hits than expected (with the two-variable model, it was 11.1 fewer hits than expected).
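For concreteness, the criterion can be computed as in the following Python sketch. The hit indicators and the two sets of probabilities are made up for illustration; they are not output from the fitted models.

```python
def abs_error(hit_flags, hit_probs):
    """sum(abs(Hit - Expected BA)) over all batted balls; smaller is better."""
    return sum(abs(h - p) for h, p in zip(hit_flags, hit_probs))


# Made-up outcomes (1 = hit, 0 = out) and predictions from two models.
hits = [1, 0, 0, 1]
two_var_probs = [0.5, 0.5, 0.25, 0.75]
three_var_probs = [0.75, 0.25, 0.25, 0.75]

print(abs_error(hits, two_var_probs), abs_error(hits, three_var_probs))
```

The model with the smaller total is the one whose probabilities sit closer to the observed outcomes on the held-out batted balls.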
- The GAM model for predicting hits based on launch angle and exit velocity performs similarly to the Statcast estimation method, but there are notable differences. (I don’t know what method Statcast is using, but I suspect the method is more empirical, say by dividing the launch angle/exit velocity space into small bins and computing the proportion of hits in each bin. I believe the GAM method provides smoother estimates.)
- Although the standardized residuals from the model fit look okay, about 65% of the residuals are negative, which means that the probability estimates tend to be too high.
- There is a clear pattern when you plot the residuals against the spray angle indicating that there is a spray angle effect in getting hits. But simply adding spray angle to the model does not provide a big improvement.
- There is much more that can be said about expected batting average. One direction of study is to explore how these expected hit probabilities vary among teams or between pitchers. Perhaps one could measure team defense by use of expected batting averages.
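The binning approach that I guessed at in the first point above can be sketched in Python. To be clear, this illustrates my guess, not the actual Statcast method; the 2-unit bin widths and the three batted balls are made up.

```python
import math
from collections import defaultdict


def binned_hit_rates(exit_velos, launch_angles, hit_flags,
                     ev_width=2.0, la_width=2.0):
    """Proportion of hits in each (exit velocity, launch angle) bin.

    Keys are (exit velocity bin index, launch angle bin index) pairs.
    """
    hits = defaultdict(int)
    counts = defaultdict(int)
    for ev, la, hit in zip(exit_velos, launch_angles, hit_flags):
        key = (math.floor(ev / ev_width), math.floor(la / la_width))
        counts[key] += 1
        hits[key] += hit
    return {key: hits[key] / counts[key] for key in counts}


# Made-up batted balls: exit velocity (mph), launch angle (degrees), hit flag.
rates = binned_hit_rates([100.5, 101.0, 80.0], [12.0, 13.5, 40.0], [1, 0, 0])
print(rates)
```

Estimates of this kind jump from bin to bin and are noisy in sparsely populated bins, which is one reason a smooth fit such as a GAM may behave differently.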
I’ve been writing about expected batting averages in recent posts. In this post, I demonstrate there is more to hitting than just launch angle and exit velocities by plotting residuals (such as the Z score above) for two consecutive seasons. In a follow-up post, I focused on the error rates by using these expected BA predictions.