# Expected Batting Average and Hits Added

#### Introduction

Baseball Savant has a collection of Expected Statistics measures based on the Statcast data that has been available since the 2015 season. These measures are estimated values of expected BA, SLG, and wOBA given the launch angle and exit velocity measurements. I’ve talked about Expected Batting Average in several past posts. Since I think there may be some confusion about how xBA is computed, I thought I would describe the computation with an example. But the main purpose of this post is to advocate a new measure, which I call Hits Added, that may be useful in describing a player’s hitting performance beyond what is predicted by the launch variables. I describe some of the best and worst player seasons and careers with respect to the Hits Added measure.

#### Expected Batting Average

As described on the Statcast Expected Leaderboard, expected outcomes are helpful in measuring batted-ball contact skill with the effects of defense and ballpark removed. Given the launch angle and exit velocity of a batted ball, one estimates the probability of a hit. Summing these hit probabilities over all balls in play gives the expected count of hits xH, and the Expected Batting Average then follows from the formula

xBA = xH / AB

Let’s illustrate this computation for Juan Soto and Bryce Harper for the 2020 season. Here are the basic count statistics.

For each of Harper’s 150 balls in play, we estimate the probability of a hit. Accumulating these probabilities, one obtains xH = 57.9 expected hits. Dividing this by his at-bats (AB = 150 + 43 – 2 – 1 = 190), we get an expected BA of xBA = 0.305. For Soto, we obtain xH = 50.8 and an xBA value of 50.8 / (126 + 28) = 0.330. Note from the table that the xBA values are much closer than the BA values for the two players.
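The arithmetic above can be sketched in a few lines of Python. The function name `expected_ba` and the short list of per-ball probabilities are illustrative assumptions, not Statcast output; only the season totals (xH, balls in play, and the at-bat adjustments) come from the example above.

```python
def expected_ba(hit_probs, at_bats):
    """Return (xH, xBA): the summed hit probabilities and that sum divided by AB."""
    x_hits = sum(hit_probs)
    return x_hits, x_hits / at_bats

# Hypothetical probabilities for four balls in play (invented for illustration),
# with one extra at-bat that did not produce a ball in play.
toy_xh, toy_xba = expected_ba([0.90, 0.10, 0.45, 0.05], at_bats=5)

# Season totals quoted in the text: xH is already accumulated, so just divide by AB.
harper_xba = 57.9 / (150 + 43 - 2 - 1)   # AB = 190
soto_xba = 50.8 / (126 + 28)             # AB = 154
print(f"{harper_xba:.3f} {soto_xba:.3f}")
```

Because at-bats without a ball in play add nothing to xH but still count in AB, a hitter who strikes out often will have a lower xBA for the same quality of contact.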

#### How Do We Estimate the Hit Probabilities?

If you have been reading my blog, you know that I like to use generalized additive modeling (GAM) as a general method for fitting nonlinear patterns to data. Specifically, I have fit a GAM of the form logit(p(H)) = s(LA, LS), where s() is a smooth function of the launch angle (LA) and the exit velocity, or launch speed (LS), as implemented in the `gam()` function in the `mgcv` package. Statcast uses a different method to compute its estimated hit probabilities. I have compared the two estimation methods (for example, see this post), and it appears that the GAM method (using the default number of basis functions in `gam()`) gives probability estimates that are somewhat inaccurate, particularly in the region of (LA, LS) values where hits are likely. So I will use the Statcast estimated probabilities here, which are conveniently available as a variable in the Statcast dataset. I think the GAM method would work fine with a larger number of basis functions.

#### Hits Added

In baseball standings, one computes the runs scored and runs allowed for each team and, using the Pythagorean relationship, a Pythagorean expected number of wins. The number of team wins above the Pythagorean expectation is commonly called “Pythagorean Luck”. In a similar fashion, we are interested in learning about the performance of a hitter beyond what is explained by the launch variable measurements. We have the expected hits calculation xH. One is interested in the number of hits the hitter obtains over what is expected; we will call this the “Hits Added” measure:

HA = H – xH.

Hits Added is attractive in that it has a clear interpretation in terms of hits: it represents the additional number of hits a batter gets over what would be expected based on his launch angle and exit velocity measurements. In our example, Harper had a hits added value of HA = 51 – 57.9 = -6.9 and Soto had HA = 54 – 50.8 = 3.2. Perhaps Harper had a negative value due to his spray angle (he tends to pull the ball and is easily defended) while Soto had a positive value due to his speed? (This conjecture invites further study.)
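As a quick check of the two values above, here is a minimal Python sketch. The helper name `hits_added` is my own; the hit and expected-hit totals are the 2020 figures quoted in the text.

```python
def hits_added(hits, expected_hits):
    """Hits Added: observed hits minus expected hits (HA = H - xH)."""
    return hits - expected_hits

# 2020 values quoted in the text, stored as (H, xH).
seasons = {"Harper": (51, 57.9), "Soto": (54, 50.8)}

# Round to one decimal to match the precision of the reported xH values.
ha = {name: round(hits_added(h, xh), 1) for name, (h, xh) in seasons.items()}
print(ha)
```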

#### Standardized Scores

There is a sample size effect here: HA values based on many balls in play will tend to be more variable than HA values based on few. A standard way to adjust for this sample size effect is to compute a standardized score, or Z-score, which divides HA by an estimate of its standard deviation.

Most of the Z-scores will tend to fall between -2 and 2. So extreme values of HA correspond to values of Z that are either larger than 2 or smaller than -2. I have demonstrated in an earlier post that Z is meaningful in that it appears to measure a hitter’s ability (through spray angle or speed or something else) to obtain hits. If one constructs a scatterplot of the Z-scores for regular players in two successive seasons, say 2018 and 2019, the correlation will be about 0.35.
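One way to compute such a standardized score is sketched below. It assumes the balls in play are independent Bernoulli trials with the estimated hit probabilities p, so that the variance of the hit count is the sum of p(1 − p). That variance formula is an assumption of this sketch, and the hitter and probabilities are invented for illustration.

```python
import math

def z_score(hits, hit_probs):
    """Standardize a hit count against its expectation, assuming independent outcomes."""
    expected = sum(hit_probs)                        # xH
    variance = sum(p * (1 - p) for p in hit_probs)   # assumed: Var(H) = sum of p(1 - p)
    return (hits - expected) / math.sqrt(variance)

# Hypothetical hitter: 200 balls in play, each with hit probability 0.3, and 70 hits.
probs = [0.3] * 200
z = z_score(70, probs)
print(round(z, 2))
```

Here ten hits added on 200 balls in play falls short of the Z = 2 cutoff, which illustrates why raw HA values need to be judged against the number of balls in play.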

Here’s a graph of the Z-scores against the balls in play for hitters in the 2019 season where a smoother has been added to see the pattern. Interestingly, there is a positive trend in the smooth which indicates that the regular players with more BIP tend to have positive Z-scores. (Their hit counts exceed what would be expected given their launch variables.)

#### Best and Worst Hitters in the Statcast Era

Now that we have six seasons of Statcast data (2015 through 2020), I thought it would be interesting to find the best and worst player-seasons with respect to hits added. In addition, it would be interesting to find the hitters who had the highest and lowest cumulative hits added during the Statcast era.

Focusing on hitters with at least 200 BIP in a season, I found 51 player/seasons where the standardized score Z exceeded 2. The number of hits added for these seasons ranged from 15 to 53 with a median of 25. Dee Gordon had three extreme seasons and nine hitters (Xander Bogaerts, Jose Altuve, Jean Segura, Ender Inciarte, Jonathan Villar, Odubel Herrera, Kris Bryant, Eddie Rosario) each had two extreme seasons. At the low end, there was only one player/season (Ryan Howard in 2015) where the standardized score was Z < −2.
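The screening step described above might look like the following Python sketch. The records are invented placeholders, not actual player-seasons; only the 200 BIP threshold and the cutoffs at 2 and −2 come from the text.

```python
# Hypothetical player-season records: (player, season, BIP, Z). Values are invented.
records = [
    ("Player A", 2016, 450, 2.6),
    ("Player B", 2017, 180, 2.9),    # excluded below: fewer than 200 BIP
    ("Player C", 2015, 320, -2.3),
    ("Player D", 2019, 410, 0.4),
]

MIN_BIP = 200   # qualification threshold used in the text
Z_CUTOFF = 2    # Z beyond 2 in either direction marks an extreme season

qualified = [r for r in records if r[2] >= MIN_BIP]
extreme_high = [r for r in qualified if r[3] > Z_CUTOFF]
extreme_low = [r for r in qualified if r[3] < -Z_CUTOFF]
print(len(extreme_high), len(extreme_low))
```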

It would also be interesting to see which players gained and lost the most hits over the entire Statcast era (2015 – 2020). Here is the top 10 list; note that Xander Bogaerts, Dee Gordon and Jose Altuve stand out.

Here is the bottom 10 list — Albert Pujols lost the most hits (beyond what would be expected in terms of launch variables). Googling “Albert Pujols slow”, I found this article that claims that Pujols was the slowest player in MLB in 2017.
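Cumulative hits added is just the season HA values summed by player. A minimal Python sketch, using invented rows rather than the actual leaderboard values:

```python
from collections import defaultdict

# Hypothetical (player, season, HA) rows; the numbers are invented for illustration.
season_ha = [
    ("Player A", 2015, 12.5), ("Player A", 2016, 8.0),
    ("Player B", 2015, -6.0), ("Player B", 2016, -9.5),
    ("Player C", 2016, 3.0),
]

career = defaultdict(float)
for player, _season, ha_value in season_ha:
    career[player] += ha_value    # accumulate hits added across seasons

# Sort from most hits gained to most hits lost.
ranked = sorted(career.items(), key=lambda kv: kv[1], reverse=True)
top, bottom = ranked[0], ranked[-1]
print(top, bottom)
```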

#### Pitcher Perspective?

The entire exercise above was done from a hitter’s perspective. The obvious question is whether the Z-score is also useful for pitchers. So I repeated the above exploration for pitchers during the Statcast era. Here are some findings.

• When one constructs a scatterplot of the Z-scores for pitchers for successive seasons, one sees little association (r = 0.04). This graph suggests there is little evidence that pitchers have the ability to control hits beyond what is predicted based on the launch variables.
• Looking for extreme pitcher-seasons, I found only 8 extreme high and 2 extreme low Z-scores. From a pitcher’s perspective, low is good; the two low seasons belonged to Daniel Mengden and Michael Wacha.
• I also looked at cumulative hits added among pitchers in the Statcast era. By the HA criterion, the best pitcher was Julio Teheran, who saved 44 hits over the six-season period. The worst was Matthew Boyd, who allowed 73.4 more hits than expected over the same period.
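The season-to-season comparison in the first finding is a Pearson correlation between each player's Z-scores in two successive seasons. Here is a minimal sketch in plain Python. The Z-scores are invented, so the resulting value is not meant to reproduce the r = 0.04 reported above.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical Z-scores for the same five players in two successive seasons.
z_year1 = [1.2, -0.4, 0.3, -1.1, 0.8]
z_year2 = [0.5, 0.2, -0.6, -0.3, 0.9]
r = pearson_r(z_year1, z_year2)
print(round(r, 2))
```

A value near zero says that a player's Z-score in one season tells us little about the next, which is the pattern seen for pitchers. A clearly positive value, as seen for hitters, suggests a persistent skill.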

In summary, the hits added or subtracted values for pitchers were much smaller than the corresponding values for hitters. Of course, pitchers have some control over launch angle, but beyond the launch variables, pitchers do not appear to have much control over the Hit/Out outcome.

#### Some Takeaways

• A Better Measure? Looking at the Statcast Leaderboard, I don’t think the definition of xBA is that meaningful since one is dividing the expected hits by AB. (I think it would make more sense to divide xH by BIP.) I think the Hits Added measure is easier to understand since we are focusing on the count of hits.
• How to Estimate the Hit Probabilities. It was surprising that GAM provided poor estimates of the hit probabilities, but perhaps (as Tom Tango suggested) it is due to the nature of the (LA, LS) region that is likely to result in a hit.
• HA is a Batter Characteristic. I showed in an earlier post that the variability in launch speeds or hit rates is due more to differences between batters than to differences between pitchers. Similarly, hits added appears to be more a measure of hitting ability than of pitching ability. Hitters who are able to direct their balls in play or have good speed will likely have positive HA values.
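To see what the alternative in the first takeaway would look like, this short Python sketch computes both xH / AB and xH / BIP from the 2020 totals quoted earlier. The dictionary layout is just an illustration.

```python
# 2020 season totals quoted earlier in the post, stored as (xH, BIP, AB).
totals = {"Harper": (57.9, 150, 190), "Soto": (50.8, 126, 154)}

# Expected hits per ball in play: the alternative denominator suggested above.
per_bip = {name: xh / bip for name, (xh, bip, ab) in totals.items()}

for name, (xh, bip, ab) in totals.items():
    print(f"{name}: xH/AB = {xh / ab:.3f}, xH/BIP = {per_bip[name]:.3f}")
```

Dividing by BIP removes the at-bats without a ball in play from the denominator, so the rate describes only the quality of contact.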