Making Sense of a Batting Average

Introduction: A Twitter Post

Last week, Tom Tango posted an observation on Twitter from watching the first game of the World Series.

In this post Tango was making a sarcastic comment about the batting average (AVG), which sabermetricians widely regard as a poor measure of the quality of a hitter. But I suspect that the media will continue to show a player’s AVG. Instead of just criticizing the AVG, why don’t we try to get a better understanding of this traditional batting measure? Here I’ll review my perspective on the batting average, which decomposes AVG into two parts. Although I agree that the batting average by itself isn’t currently that informative or interpretable, we’ll see that the components of AVG are helpful in understanding the batter’s abilities. Note that I am using the word ability: an estimate of a batter’s ability can be very different from the batter’s actual performance. (I’ll contrast measures of performance with estimates of ability below.)

A Batting Average is the Product of Two Rates

Write the batting average as the product

AVG = \frac{H}{AB} = \left(1 - \frac{SO}{AB}\right) \frac{H}{AB - SO}

  • The first term (1 – SO / AB) is the rate of putting the ball in play (that is, not striking out).
  • The second term H / (AB – SO) is the batting average on balls in play.

If we decompose the batting average this way, we learn about a batter’s strikeout rate and also about his success on balls put into play.
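As a quick numerical check of this identity, here is a minimal Python sketch. The function name `decompose_avg` and the season line (200 AB, 60 H, 40 SO) are made up for illustration and do not describe a real player.

```python
def decompose_avg(h, ab, so):
    """Split AVG = H / AB into an in-play rate and a hit rate on balls in play."""
    in_play_rate = 1 - so / ab   # share of at-bats that do not end in a strikeout
    hit_rate = h / (ab - so)     # hits per at-bat that does not end in a strikeout
    return in_play_rate, hit_rate, in_play_rate * hit_rate


# Hypothetical season line (an assumption, not real data): 200 AB, 60 H, 40 SO.
in_play, hit, avg = decompose_avg(h=60, ab=200, so=40)
print(f"{in_play:.3f} x {hit:.3f} = {avg:.3f}")  # prints 0.800 x 0.375 = 0.300
```

The product of the two component rates reproduces H / AB exactly, because the (AB – SO) terms cancel.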

For example, let’s compare the top two AVG hitters during the 2020 season:

  Name        In_Play_Rate H_Rate   AVG
1 DJ LeMahieu        0.892  0.408 0.364
2 Juan Soto          0.818  0.429 0.351

DJ LeMahieu was better than Juan Soto at putting balls into play, but Soto was better at getting hits on balls in play. If you present LeMahieu’s AVG as, say, 0.364 (= 0.892 × 0.408), this provides some insight into LeMahieu’s batting performance. He seems to be effective at putting the ball into play since his in-play rate is close to 90%. This is especially notable in the 2020 baseball environment where strikeouts are pretty common.

Scatterplot of Component Rates

Here’s a scatterplot of the component rates for the 2020 regular players. I have added lines corresponding to specific batting averages and I have represented LeMahieu’s and Soto’s performances by red points. Note the wide spread in both the in-play and hit-on-balls-in-play rates and the weak relationship between the two rates.

Scatterplot of Component Talents

One has to be careful in making sense of the above scatterplot since we had a shortened 2020 season and chance variation can play havoc with batting measures. One way to screen out the large binomial variation is to fit a random effects model to the collection of in-play rates and fit the same model to the collection of BABIP rates. When you fit these models, the observed rates will be adjusted towards an average, and the degree of adjustment depends on the batting measure. The model estimates provide estimates of the batters’ talents that are better predictors of future performance than the observed rates. Here is an older post that describes the use of this random effects model to shrink batting averages.
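The fitting code is not shown in this post, so the following is only a rough sketch of the shrinkage idea and not the random effects model itself. It swaps in a crude beta-binomial method-of-moments estimate. The function name `shrink_rates` and the counts are hypothetical.

```python
def shrink_rates(successes, trials):
    """Shrink observed rates toward the pooled rate (beta-binomial, method of moments).

    Returns the shrunken estimates and K, the prior "sample size" of the fitted beta.
    This is a stand-in for a proper random effects fit, not the post's actual model.
    """
    rates = [y / n for y, n in zip(successes, trials)]
    pooled = sum(successes) / sum(trials)
    # Observed spread of the individual rates around the pooled rate.
    observed_var = sum((r - pooled) ** 2 for r in rates) / len(rates)
    # Average spread expected from binomial chance variation alone.
    chance_var = sum(pooled * (1 - pooled) / n for n in trials) / len(trials)
    talent_var = observed_var - chance_var
    if talent_var <= 0:
        # No detectable talent spread: every estimate collapses to the pooled rate.
        return [pooled] * len(rates), float("inf")
    # For a beta distribution, Var(p) = mean * (1 - mean) / (K + 1).
    k = max(pooled * (1 - pooled) / talent_var - 1, 0.0)
    estimates = [(y + k * pooled) / (n + k) for y, n in zip(successes, trials)]
    return estimates, k


# Hypothetical counts (assumed for illustration): balls in play out of at-bats.
in_play = [180, 126, 150, 190, 110]
at_bats = [200, 180, 190, 210, 170]
estimates, k = shrink_rates(in_play, at_bats)
for y, n, est in zip(in_play, at_bats, estimates):
    print(f"observed {y / n:.3f} -> estimate {est:.3f}")
print(f"K = {k:.1f}")
```

Each estimate is a weighted average of the player’s observed rate and the pooled rate, with weight n / (n + K) on the observed rate. A small K relative to the number of at-bats means little shrinkage, and a large K means the observed rates are pulled strongly towards the average.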

I’ve graphed a scatterplot of the estimated component talents using the same scale so we can easily compare the two scatterplots. (Again, the red points correspond to the talents of LeMahieu and Soto.) The in-play rates are strongly driven by talent, so these model estimates adjust the observed in-play rates a small amount towards the average. In contrast, the BABIP rates are more “chancy” and the observed rates are shrunk strongly towards the average for these estimates. Although a few players had batting averages exceeding .350, the highest BA talent (DJ LeMahieu) was only .317.

Let’s revisit our two hitters by comparing their component talent estimates. Here we get a different message. The two players have very similar BABIP talents, but LeMahieu clearly is better in putting balls in play. LeMahieu has a 317 – 294 = 23 point advantage in batting average talent (actually the hitting probability estimate) which exceeds his 364 – 351 = 13 point advantage in observed batting average.

          Name InPlay_Talent BAIP_Talent Talent
1  DJ LeMahieu         0.869       0.364  0.317
2    Juan Soto         0.805       0.365  0.294
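To make the comparison of observed and talent gaps concrete, this small Python sketch recomputes both from the component rates printed in the two tables of this post. The helper names `avg_points` and `gap_points` are hypothetical. Because the table entries are rounded to three decimals, the recomputed gaps can differ from the 13 and 23 points quoted above by a fraction of a point.

```python
# Component rates copied from the two tables in this post: (in-play rate, hit rate).
observed = {"DJ LeMahieu": (0.892, 0.408), "Juan Soto": (0.818, 0.429)}
talent = {"DJ LeMahieu": (0.869, 0.364), "Juan Soto": (0.805, 0.365)}


def avg_points(in_play_rate, hit_rate):
    """Batting average in points (thousandths) implied by two component rates."""
    return 1000 * in_play_rate * hit_rate


def gap_points(table):
    """LeMahieu's advantage over Soto, in points of batting average."""
    return avg_points(*table["DJ LeMahieu"]) - avg_points(*table["Juan Soto"])


for label, table in (("observed", observed), ("talent", talent)):
    print(f"{label}: LeMahieu minus Soto = {gap_points(table):.1f} points")
```

The talent gap is larger than the observed gap because shrinkage removes most of Soto’s advantage on balls in play while leaving most of LeMahieu’s advantage in in-play rate.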


  • Can we make sense of a batting average? There was a time in the past when a statement of a batting average was meaningful. For example, since Tony Gwynn rarely struck out, his .394 AVG in the 1994 season was interpretable. Gwynn’s .394 AVG reflected his remarkable success in getting hits on balls in play. (If Gwynn were playing today, I doubt that defenses would shift against him since Gwynn had the ability to hit to all fields.) In 2020 baseball, AVG confounds two batting abilities (the ability to not strike out and the ability to get a hit on a BIP), so I don’t know how to interpret, say, a .300 AVG. Since it is hard to make sense of the batting average, one might question why the media like to report this measure. But as explained here, the components of BA are interpretable.
  • Will Fox Sports replace AVG with something else? I guess the first step for Fox Sports or other networks is to replace AVG with a more meaningful or interpretable measure of batting performance. Probably the most obvious replacement would be the pair of measures OBP and SLG, which can be combined by addition into the OPS measure. Although my component measures for AVG are somewhat intuitive, they are not that familiar to fans.
  • What does FanGraphs report? These component measures for AVG are similar to, but not the same as, the ones that FanGraphs reports on its hitting leaderboard. FanGraphs has a K% (K percentage) that is the percentage of strikeouts in all plate appearances, while in my definition I divided SO by AB. The standard definition of BABIP is (H – HR)/(AB – K – HR + SF), which removes home runs from both the numerator and the denominator and adds SF to the denominator.
  • Actually There Are Three Component Rates in a Batting Average. In my earlier work such as in this post and this paper, I wrote a batting average as a function of three rates:

AVG = (1 - SO.Rate) \times \left(HR.Rate + (1 - HR.Rate) \times BABIP\right)

where SO.Rate = SO / AB, HR.Rate = HR / (AB – SO) and BABIP = (H – HR) / (AB – SO – HR). Since home runs are so prevalent, this decomposition can provide additional information about a batter’s home run ability and his ability to get a hit on a ball in play.
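The three rates defined here can be checked numerically. The sketch below assumes they combine as AVG = (1 – SO.Rate) × (HR.Rate + (1 – HR.Rate) × BABIP), which is the combination that follows algebraically from the three definitions. The function name `three_rate_avg` and the counts are hypothetical.

```python
def three_rate_avg(h, hr, ab, so):
    """Rebuild AVG from a strikeout rate, a home run rate, and BABIP.

    Uses this post's definitions, which differ from the FanGraphs versions.
    """
    so_rate = so / ab                    # strikeouts per at-bat
    hr_rate = hr / (ab - so)             # home runs per non-strikeout at-bat
    babip = (h - hr) / (ab - so - hr)    # non-HR hits per non-SO, non-HR at-bat
    avg = (1 - so_rate) * (hr_rate + (1 - hr_rate) * babip)
    return so_rate, hr_rate, babip, avg


# Hypothetical season line (an assumption, not real data): 200 AB, 60 H, 10 HR, 40 SO.
so_rate, hr_rate, babip, avg = three_rate_avg(h=60, hr=10, ab=200, so=40)
print(f"SO.Rate={so_rate:.3f} HR.Rate={hr_rate:.4f} BABIP={babip:.3f} AVG={avg:.3f}")
```

The rebuilt AVG matches H / AB for any counts, since HR.Rate + (1 – HR.Rate) × BABIP simplifies to H / (AB – SO).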


2 responses

  1. Good stuff. You can continue this process and do it for OBP and thereby include BB+HBP in the expansion. Though the nonsensical handling of SF in BA is an annoying hurdle.

    1. Tom, thanks. I did this type of thing for OBP in that paper that is available on arXiv. Actually, there is a better way to estimate several rates at once that I’ll illustrate in a future post.
