Monthly Archives: September, 2021

Selection-Distortion Effect in Baseball


I’ve recently been reading some material on causal modeling in statistics, a relatively modern approach to understanding causal relationships between variables. This material reminded me of some general statistical paradoxes, or misunderstandings about statistical relationships. One important paradox is often called Berkson’s Paradox, but as McElreath says in Chapter 6 of Statistical Rethinking, it may be better remembered as the Selection-Distortion Effect. Two variables may show one pattern of association in the full data, but when we select a portion of the data, the selected data may show a different pattern. In other words, the selection mechanism distorts the pattern of association.

This paradox can be described in simple settings. First, I will give an example of the idea for a common scenario and then show how this paradox applies in baseball. In particular, it creates confusion about the association between components of the popular batting average.

Simple Illustrations of the Effect

Here’s an example from the Statistical Rethinking text. Do you believe that there is an association between a restaurant’s location and the quality of its food? I think most people would expect little association between the two. But one might notice that restaurants in good locations tend to have poor food, and restaurants in poor locations tend to have good food, suggesting a negative association. What is going on? There is a selection mechanism at work. Restaurants with bad food can survive if they are in good locations. Similarly, restaurants with good food can survive in bad locations. Implicitly we are only observing restaurants that are surviving (making money), and that selection process leads to a negative association pattern.
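To make the restaurant story concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the location and food scores are independent standard normal draws, and a restaurant "survives" when the sum of its two scores exceeds an arbitrary cutoff of 1.0. None of this comes from real restaurant data.

```python
import random

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(2021)

# ASSUMPTION: location and food quality are independent N(0, 1) scores.
location = [rng.gauss(0, 1) for _ in range(5000)]
food = [rng.gauss(0, 1) for _ in range(5000)]

# ASSUMPTION: a restaurant survives when location + food exceeds 1.0.
survivors = [(l, f) for l, f in zip(location, food) if l + f > 1.0]

r_all = pearson(location, food)
r_surv = pearson([l for l, _ in survivors], [f for _, f in survivors])

print(f"all restaurants:       r = {r_all:.2f}")
print(f"surviving restaurants: r = {r_surv:.2f}")
```

The two scores are unrelated by construction, so the correlation over all restaurants should sit near zero. Among the survivors it should turn clearly negative, because a low score on one variable is only observed alongside a high score on the other.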

The Berkson’s paradox Wikipedia page gives more examples to illustrate the concept. Once one is aware of this phenomenon, it is easy to think of other situations where it applies.

Association of Two Components of a Batting Average

A batting average is hits divided by at-bats, that is, BA = H / AB. With a little algebra, we can write a batting average as the product

BA = (In-Play Rate) x BACON,

where In-Play Rate is the rate of not striking out

In-Play Rate = (1 – SO / AB)

and BACON is the batting average on contact (including home runs)

BACON = H / (AB – SO).
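The algebra is just multiplying and dividing by (AB – SO): H / AB = ((AB – SO) / AB) x (H / (AB – SO)). As a quick numerical check, here is a small sketch using a made-up stat line (the numbers and the function name batting_components are hypothetical, chosen only for illustration).

```python
def batting_components(h, ab, so):
    """Return (In-Play Rate, BACON) for hits, at-bats, and strikeouts."""
    in_play_rate = 1 - so / ab   # share of at-bats that do not end in a strikeout
    bacon = h / (ab - so)        # batting average on contact, home runs included
    return in_play_rate, bacon

# Hypothetical stat line, not a real player.
h, ab, so = 150, 500, 100

in_play_rate, bacon = batting_components(h, ab, so)
ba = h / ab

print(f"In-Play Rate = {in_play_rate:.3f}")
print(f"BACON        = {bacon:.3f}")
print(f"product      = {in_play_rate * bacon:.3f}")
print(f"BA           = {ba:.3f}")
```

The product of the two components reproduces the batting average, up to floating-point rounding.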

Do you believe that a player’s In-Play Rate is associated with his BACON? I think most people would expect little association between these two rates. Why would one’s ability to avoid striking out be related to the quality of the balls one puts in play?

Here’s some data. I’ve selected all hitters from the 2019 season with at least 100 at-bats. Here is a scatterplot of In-Play Rate and BACON. We see a slight negative trend: the correlation is -0.15. Maybe you are surprised that the correlation is negative, but clearly the degree of association is small.
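For readers who want to build this kind of dataset themselves, here is a rough sketch of the bookkeeping. It assumes rows shaped like the Lahman Batting table (columns playerID, yearID, stint, AB, H, SO), where a player traded mid-season has one row per stint. The rows below are invented placeholders, and summing over stints is my assumption about a reasonable way to get season totals, not necessarily what was done for the plot.

```python
from collections import defaultdict

# Hypothetical rows in the shape of the Lahman Batting table.
rows = [
    {"playerID": "playerA", "yearID": 2019, "stint": 1, "AB": 250, "H": 70, "SO": 50},
    {"playerID": "playerA", "yearID": 2019, "stint": 2, "AB": 150, "H": 38, "SO": 30},
    {"playerID": "playerB", "yearID": 2019, "stint": 1, "AB": 80, "H": 20, "SO": 25},
    {"playerID": "playerC", "yearID": 2019, "stint": 1, "AB": 500, "H": 125, "SO": 150},
    {"playerID": "playerC", "yearID": 2018, "stint": 1, "AB": 450, "H": 120, "SO": 120},
]

def season_rates(rows, season, min_ab):
    """Sum AB, H, SO over stints, then compute BA, In-Play Rate, and BACON."""
    totals = defaultdict(lambda: {"AB": 0, "H": 0, "SO": 0})
    for row in rows:
        if row["yearID"] == season:
            for key in ("AB", "H", "SO"):
                totals[row["playerID"]][key] += row[key]

    rates = {}
    for player, t in totals.items():
        # Keep players meeting the AB minimum; skip the degenerate case
        # where every at-bat was a strikeout (BACON would be undefined).
        if t["AB"] >= min_ab and t["AB"] > t["SO"]:
            rates[player] = {
                "BA": t["H"] / t["AB"],
                "InPlayRate": 1 - t["SO"] / t["AB"],
                "BACON": t["H"] / (t["AB"] - t["SO"]),
            }
    return rates

rates_2019 = season_rates(rows, season=2019, min_ab=100)
for player, r in sorted(rates_2019.items()):
    print(player, {k: round(v, 3) for k, v in r.items()})
```

Here playerB drops out because he falls short of 100 at-bats, and playerC’s 2018 row is ignored because it belongs to a different season.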

Selecting the Best Hitters

Instead of looking at all hitters, let’s select only the hitters with a BA of 0.270 or higher. The dotted line in the graph represents the region where the batting average is equal to 0.270 and the red dots are the selected values of (In-Play Rate, BACON) where the BA is at least 0.270.

What do we see? Now there is a relatively strong negative relationship between In-Play Rate and BACON, with a correlation of -0.77. So by the selection mechanism, we have changed a weak association pattern into a strong one.

What does this mean? There are two causes for a high batting average in baseball: In-Play Rate and BACON. A good BA player can have a high In-Play Rate (low strikeout rate) and an average BACON, OR this player can have a low In-Play Rate compensated by a high BACON. The collection of hitters of these two types creates the distorted impression that In-Play Rate is strongly associated with BACON. These two variables aren’t really strongly associated, but they appear to be since we are looking only at “high BA” data.
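The same mechanism can be reproduced with purely synthetic hitters. The sketch below is an illustration under stated assumptions, not a model of the 2019 data: In-Play Rate and BACON are drawn independently from normal distributions whose means and spreads I picked by eye (0.77 and 0.06, 0.33 and 0.04), and the helper name correlation_after_selection is hypothetical.

```python
import random

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def correlation_after_selection(points, min_ba):
    """Correlation of (In-Play Rate, BACON) among hitters with BA >= min_ba."""
    kept = [(p, b) for p, b in points if p * b >= min_ba]  # BA = rate x BACON
    r = pearson([p for p, _ in kept], [b for _, b in kept])
    return r, len(kept)

rng = random.Random(2019)

# ASSUMPTION: independent draws with illustrative (not fitted) parameters;
# the in-play rate is capped at 1 since it is a proportion.
points = [
    (min(1.0, rng.gauss(0.77, 0.06)), rng.gauss(0.33, 0.04))
    for _ in range(3000)
]

r_all, n_all = correlation_after_selection(points, min_ba=0.0)
r_high, n_high = correlation_after_selection(points, min_ba=0.270)

print(f"all hitters (n = {n_all}):   r = {r_all:.2f}")
print(f"BA >= 0.270 (n = {n_high}): r = {r_high:.2f}")
```

Because the two rates are generated independently, any negative correlation among the selected hitters is produced entirely by the BA cutoff. Raising min_ba further should make the selected correlation more strongly negative.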

Selecting the Players with More At-Bats

A similar thing happens when you select players on the basis of at-bats (AB). Return to the first example where we were looking at players with at least 100 AB. What if we look instead at players with at least 300 AB?

Interestingly, the correlation strengthens from -0.15 to -0.34. Why is this happening? Well, the number of AB is positively associated with BA, so again we are selecting better hitters when we raise the minimum number of at-bats. It is a weaker version of the same effect that we see when raising the minimum BA.

A Shiny App

As readers of this blog know, I am a big fan of creating interactive Shiny apps to illustrate different concepts. Here is a snapshot of the app below. One chooses the season of interest, the minimum number of AB and the minimum batting average. The graph displays all (In-Play Rate, BACON) points, colors the ones that are selected, and displays a best-fitting line and the correlation value.

If you click on the Minimum Batting Average slider, you can use the right-arrow key to step through larger minimum BA values and see the impact on the correlation value.

Some Remarks

  • Be Careful About Selecting Data. One takeaway from this exercise is that one should be careful about the impact of selecting baseball data based on some criterion. Here we see that the association between In-Play Rate and BACON changes when we select players with high BA or high AB. Looking only at the selected players, we might conclude that there is a strong negative association, when in reality the association pattern is not that strong. Since we commonly fit models to data on “Leaders”, it is possible that this selection mechanism is distorting the conclusions from our regressions.
  • Got Code? I’ve added this Shiny app to my ShinyBaseball package. Actually, this app is a single file app.R and the data is from the Lahman package, so one can try out this app by running this single R file.
  • Berkson’s Paradox. You can read this Wikipedia page to learn more about Berkson’s paradox. I didn’t find the description in the first paragraph that helpful, but it does contain several of the popular examples used to explain the idea.
  • Throw Away the Batting Average? I know Tom Tango wants to eliminate the BA from the baseball statistics toolkit, but at least it is useful for illustrating statistical concepts like the Selection-Distortion effect.

Added September 21

Tom Tango commented that I was using the wrong notation. BABIP usually means the batting average on balls in play, excluding home runs. For contacted balls (including home runs), the batting average is called BACON (BA on contact). I have made corrections in this post using the BACON notation. (Thanks, Tom, for this clarification.)