Batting Statistics in a 2020 Half-Season
Introduction – A Half-Season of Baseball?
Negotiations are currently under way between MLB and the players on the logistics of a 2020 season. One proposal being considered is to have a half-season of baseball including an expanded playoff series. We don’t know yet if this half-season of MLB will actually happen, but it is interesting to think about the impact of an 81-game season on the measures that we use to evaluate players. Let’s focus on some of the basic measures we use to evaluate hitters. In the 2019 (complete) season, Tim Anderson had the highest AVG of 0.335, Mike Trout had the highest OBP at 0.438, and Pete Alonso won the home run crown with 53. What would these leading batting statistics look like if there were only a half-season of games? We’ll use a simulation approach with the 2019 season data to explore the impact of a half-season on these statistics.
I am not going to do any fancy modeling here. Let’s start with the Retrosheet play-by-play dataset which contains information on each play for the 2019 season. If we consider only the batting plays, then we have a data frame with 186,517 rows. To see what a half-season looks like, I will create a new play-by-play dataset (we can call it a half-season dataset) where I randomly select half of the rows from the complete-season dataset without replacement. By repeating this simulation exercise many times, I can produce distributions of leading measures for many simulated seasons. For example, I can get a distribution of the leading batting average from a half-season of data.
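To make the sampling step concrete, here is a minimal Python sketch of a single half-season draw. It is not the code behind the results in this post: the toy data, the batter labels, the talent levels, and the helper names (`make_toy_plays`, `half_season`, `batting_averages`) are all made up for illustration, and each row stands for one at-bat rather than a full Retrosheet play record.

```python
import random
from collections import Counter

def make_toy_plays(rng, talents, at_bats=600):
    """Toy stand-in for play-by-play data: one (batter, is_hit) row per at-bat.
    The batters and their hit probabilities are hypothetical."""
    return [(batter, rng.random() < p)
            for batter, p in talents.items()
            for _ in range(at_bats)]

def half_season(plays, rng):
    """Randomly keep half of the rows, sampling without replacement."""
    return rng.sample(plays, k=len(plays) // 2)

def batting_averages(plays):
    """Return {batter: hits / at-bats} for the rows supplied."""
    at_bats = Counter(batter for batter, _ in plays)
    hits = Counter(batter for batter, is_hit in plays if is_hit)
    return {batter: hits[batter] / n for batter, n in at_bats.items()}

rng = random.Random(2019)  # fixed seed so the sketch is repeatable
full = make_toy_plays(rng, {"A": 0.300, "B": 0.270, "C": 0.240})
half = half_season(full, rng)

print("full season:", batting_averages(full))
print("half season:", batting_averages(half))
```

Because the rows are drawn at random rather than by date, each batter ends up with roughly, but not exactly, half of his full-season at-bats.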
Comments about Sample Size
We always focus on season accomplishments like home runs, AVG, OBP, etc., but these measures should be viewed within the context of a full season of baseball. Remember the famous asterisk added to Roger Maris’ home run record of 61? Maris’ record was not considered legitimate by MLB since Maris hit his 61 home runs in a season of 162 games while Babe Ruth hit his 60 home runs in 154 games. (By the way, although MLB added an asterisk to Maris’ accomplishment, they later removed this asterisk from the record book.) Although this was a relatively small difference in season length, it showed that MLB recognized that opportunities (number of games) play a role in the evaluation of baseball measures like the home run count.
One important idea from Statistics is that the variation of a measure like a sample mean decreases as the sample size grows. Applied to baseball, batting measures based on a smaller number of plate appearances show more variability. This is especially true for measures like batting average that are largely driven by luck (that is, binomial) variation. We can see the effect of sample size by contrasting measures from a half-season with measures from a full season of baseball.
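As a rough illustration of the binomial variation mentioned above, the standard error of a batting average for a hitter with true hit probability p over n at-bats is sqrt(p(1 - p) / n). The sketch below plugs in a hypothetical .270 hitter with 500 at-bats in a full season and 250 in a half-season; those numbers are assumptions chosen for illustration, not values taken from the 2019 data.

```python
import math

def avg_standard_error(p, at_bats):
    """Binomial standard error of a batting average: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / at_bats)

# Hypothetical .270 hitter; the at-bat totals are illustrative assumptions.
full = avg_standard_error(0.270, 500)
half = avg_standard_error(0.270, 250)

print(f"full season SE: {full:.4f}")
print(f"half season SE: {half:.4f}")
print(f"ratio: {half / full:.3f}")
```

Halving the number of at-bats inflates the standard error by a factor of sqrt(2), or about 41%: here from roughly 20 points of batting average to roughly 28.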
How do the batting averages of qualified hitters (that is, those hitters with at least 3.1 plate appearances per game) compare between the 2019 full season and one simulated 2019 half-season? Here is a graph. The biggest observable difference is in the extremes. In the full season, all of the AVGs fall between 0.200 and 0.350; in the half-season, we see several AVGs below the Mendoza line of 0.200 and one hitter over 0.350.
What would the leading AVG look like in a half-season? I repeated my half-season simulation 100 times, collecting the highest AVG among qualifying hitters. A histogram of these half-season leading AVGs is shown below. The 2019 full-season leading AVG is represented by a red vertical line. We see that the highest AVG from a 2019 half-season will tend to be in the 0.350-0.370 range, which is about 15-35 points higher than Anderson’s AVG of 0.335 for the complete 2019 season.
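The repeated simulation can be sketched along the following lines. This is a toy version under stated assumptions, not the analysis that produced the histogram: the league of 150 hitters, their talent distribution, the 550 at-bats per hitter, and the at-bat qualification thresholds (500 for the full season, 250 for a half-season) are all invented, whereas the real rule uses 3.1 plate appearances per game.

```python
import random
import statistics
from collections import Counter

def make_toy_season(rng, n_players=150, at_bats=550):
    """Build one row per at-bat: (player_id, is_hit). All values are made up."""
    plays = []
    for player in range(n_players):
        talent = rng.gauss(0.260, 0.020)  # hypothetical true batting average
        plays.extend((player, rng.random() < talent) for _ in range(at_bats))
    return plays

def leading_avg(plays, min_at_bats):
    """Highest AVG among players with at least min_at_bats at-bats."""
    at_bats = Counter(player for player, _ in plays)
    hits = Counter(player for player, is_hit in plays if is_hit)
    return max(hits[p] / n for p, n in at_bats.items() if n >= min_at_bats)

rng = random.Random(2020)  # fixed seed so the sketch is repeatable
season = make_toy_season(rng)
full_leader = leading_avg(season, min_at_bats=500)

# Draw 100 half-seasons and record the leading AVG among qualifiers in each.
leaders = []
for _ in range(100):
    half = rng.sample(season, k=len(season) // 2)
    leaders.append(leading_avg(half, min_at_bats=250))

print(f"full-season leader: {full_leader:.3f}")
print(f"half-season leaders: mean {statistics.mean(leaders):.3f}, "
      f"min {min(leaders):.3f}, max {max(leaders):.3f}")
```

The list `leaders` plays the role of the values summarized in the histogram, with `full_leader` corresponding to the vertical reference line.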
In baseball, .300 is thought to represent a benchmark for a good hitting season. In 2019, 13.7% of the qualifying hitters had over a .300 AVG. How would that change in a half-season? In our simulations, the graph below indicates that the percentage of .300 hitters would jump to 17-23%. I suppose that some baseball contracts would give bonuses to players who achieve this .300 distinction. Would these contracts be adjusted if the 2019 season had only 81 games?
I did a similar study using on-base percentage. Now, OBP is more of a skill measure than AVG, so I would expect OBP to be less affected than AVG by the sample size. What would the leading OBP look like in a simulated 2019 half-season? Actually, it appears that the leading OBP would only be (on average) 0.450, which is not much higher than Trout’s leading OBP of 0.438. (By the way, Trout had a shortened 2019 season due to injury, and one would think that his OBP would drop some if he had played a complete season.)
One sees a similar sample size effect for home runs. Here are the leading home run counts for 100 simulated half-seasons; the vertical line represents one half of Alonso’s home run count of 53. On average, the home run leader in a half-season had about 30 home runs. This player would be predicted to have 30 x 2 = 60 home runs in a full season, but this naive prediction ignores the regression to the mean phenomenon. (Extreme performances in one half of the season tend to move toward the average in the second half.)
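Regression to the mean is easy to see in a small simulation. The sketch below is a toy model only: the 100 hitters, their home run rates per plate appearance, and the 300 plate appearances per half are assumptions invented for illustration. In each simulated season it finds the first-half home run leader and then simulates that same hitter's second half from his underlying rate.

```python
import random
import statistics

def simulate_hr(rng, rate, plate_appearances=300):
    """Home runs in one half-season: one Bernoulli trial per plate appearance."""
    return sum(rng.random() < rate for _ in range(plate_appearances))

def leader_both_halves(rng, n_players=100):
    """Return (first-half HR, second-half HR) for the first-half leader."""
    # Hypothetical home run talents (HR per plate appearance).
    rates = [rng.uniform(0.02, 0.07) for _ in range(n_players)]
    first = [simulate_hr(rng, r) for r in rates]
    leader = max(range(n_players), key=first.__getitem__)
    # The second half depends only on the leader's talent, not his hot first half.
    second = simulate_hr(rng, rates[leader])
    return first[leader], second

rng = random.Random(53)  # fixed seed so the sketch is repeatable
results = [leader_both_halves(rng) for _ in range(50)]
first_mean = statistics.mean(f for f, _ in results)
second_mean = statistics.mean(s for _, s in results)

print(f"first-half leader, average HR in first half:  {first_mean:.1f}")
print(f"same hitters, average HR in second half:      {second_mean:.1f}")
```

The first-half leader is usually a strong hitter who also had some good luck, so on average his second-half total falls short of his first-half total, and doubling the first half overstates the full-season count.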
- A shortened season will affect measures of baseball performance. We see that batting averages will be more spread out and extreme AVGs like 0.350 are more likely to be observed. A shortened season will result in more .300 hitters.
- Short seasons have the most impact on luck-driven measures of performance such as a batting average. On the other hand, the smaller sample size would have less of an impact on a measure that is more skill-driven such as a strikeout rate or an on-base percentage.
- Actually, the more interesting study would be to explore the impact of a shortened season on the final standings and the outcome of the World Series. Chapter 9 of Analyzing Baseball with R used a simulation to find the probability that the best team, that is, the team with the best talent, wins the World Series. But that simulation was performed for the 1968 season, which had a full 162-game schedule with two leagues and a limited playoff structure. It would be interesting to revisit that simulation study to find the chance that an average team (like the Phillies) wins the World Series with the proposed 2020 half-season expanded playoff format.