Albert Pujols retired after the 2022 season, completing a remarkable career in which he hit a total of 703 home runs, placing him fourth on the career HR list behind Barry Bonds (762), Hank Aaron (755) and Babe Ruth (714). Pujols was a very popular ballplayer who will likely be elected to the Hall of Fame on the first ballot when he is eligible. Pujols’ career has a lot in common with Hank Aaron’s:
- They played about the same number of seasons — Aaron (23) and Pujols (22).
- They both debuted in MLB at an early age and played until their early 40s.
- They had a similar number of career at-bats, home runs, and rates of hitting home runs:
  Player           HR    AB HR_Rate
  <chr>         <dbl> <dbl>   <dbl>
1 Albert Pujols   720 11696    6.16
2 Henry Aaron     755 12364    6.11
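The HR_Rate column appears to be home runs per 100 at-bats. As a quick check, here is a small Python sketch (Python rather than the R that produced the table; the names `careers` and `hr_rate` are illustrative) that recomputes the rates from the HR and AB columns.

```python
# Career totals copied from the HR and AB columns of the table.
careers = {
    "Albert Pujols": {"HR": 720, "AB": 11696},
    "Henry Aaron": {"HR": 755, "AB": 12364},
}


def hr_rate(hr, ab):
    """Home runs per 100 at-bats."""
    return 100 * hr / ab


# Round to two decimals to match the precision shown in the table.
rates = {name: round(hr_rate(c["HR"], c["AB"]), 2) for name, c in careers.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.2f} HR per 100 AB")
```

The rounded values agree with the 6.16 and 6.11 shown in the table.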
We see that both Aaron and Pujols hit home runs at a 6.1-6.2 percent rate. That raises the obvious question — were there differences in the patterns of home run hitting between the two sluggers? We’ll explore the answer to this question in several ways.
- We look at the season by season HR rates of the two hitters.
- We look at the streakiness patterns of home run hitting for Aaron and Pujols. In particular, we focus on the spacings, the numbers of non-HR at-bats between successive HRs. A pattern of unusually short and unusually long spacings indicates that the player is “HR streaky”.
We will see some interesting differences between the two home run sluggers.
One obtains season-to-season home run data (HR and AB) for the two players from Baseball Reference. To look at spacings, one needs the outcome of every official at-bat in each player’s career, which is obtainable through the Retrosheet play-by-play files. A complete listing of the outcomes of Pujols’ 11,696 at-bats is available through Retrosheet. Unfortunately, some of the Retrosheet data for games in the 1950s is missing, so I don’t have complete data for Aaron. Currently I have the outcomes of 12,069 of Aaron’s at-bats, which represents 97.6% of his 12,364 AB. I don’t believe my work on spacings below would change much if I had access to all of Aaron’s career at-bats.
Season to Season Performance
Using the Baseball Reference data, I begin by graphing the home run rates (HR / AB) for the two players as a function of their ages. Although their career HR rates are similar, the two players have remarkably different career trajectories.
- Aaron starts off with the usual increasing HR rate until 27, shows a small decrease until age 30, and then a second increase in HR rate in his mid-30s. Aaron’s highest HR rates occurred at ages 37 and 39.
- Pujols has a different career trajectory shape. His HR rate increased until about 27, and then steadily decreased until age 38. In his last few seasons, he showed an increased HR rate. In contrast to Aaron, Pujols’ highest HR rate occurred at age 26.
Streaky Performance – A Reference Distribution
Next, using the Retrosheet data, we focus on the spacings, the lengths of the gaps (in AB) between successive home runs. The question is whether one player exhibits a pattern of home run hitting that is more streaky than the other’s. It is difficult to judge the size of these spacings, since people generally don’t have a good feel for the spacings that one observes from “random” data. So it is helpful to compare the observed spacings with the “random” spacings that one would see from a standard probability distribution.
Let’s suppose that a hitter is truly consistent in the sense that the chance of hitting a home run on a single at-bat is a number $p$ and outcomes of different at-bats are independent. Let $Y$ denote the length of the spacing (number of failures) between a success (a HR) and the next success. With these assumptions, the distribution of $Y$ is Geometric. The properties of the Geometric distribution are well-known and so it is straightforward to compare the lengths of the spacings of a player with those of a Geometric distribution where $p$ is estimated to be the player’s home run rate.
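To make the "truly consistent hitter" model concrete, here is a minimal Python sketch. It is not the code behind this post; the function name `spacings`, the value of `p`, the seed, and the number of simulated at-bats are all assumptions chosen for illustration. It extracts spacings from a 0/1 sequence (1 = home run) and simulates independent at-bats with a constant home run probability, so the average spacing can be compared with the Geometric mean (1 - p) / p.

```python
import random


def spacings(outcomes):
    """Number of non-HR at-bats between each HR and the next HR.

    `outcomes` is a sequence of 0/1 values (1 = home run). At-bats before
    the first HR and after the last HR are not counted as spacings.
    """
    gaps = []
    last_hr = None
    for i, x in enumerate(outcomes):
        if x == 1:
            if last_hr is not None:
                gaps.append(i - last_hr - 1)
            last_hr = i
    return gaps


# Tiny example: HRs at positions 1, 2, and 6.
print(spacings([0, 1, 1, 0, 0, 0, 1, 0]))  # [0, 3]

# Simulate a consistent hitter: independent at-bats, constant p.
random.seed(1)
p = 0.06  # assumed HR probability, near the career rates quoted earlier
at_bats = [1 if random.random() < p else 0 for _ in range(12000)]
gaps = spacings(at_bats)
mean_gap = sum(gaps) / len(gaps)

# Compare the simulated mean spacing with the Geometric mean (1 - p) / p.
print(len(gaps), round(mean_gap, 1), round((1 - p) / p, 1))
```

With this many at-bats the simulated mean spacing should land close to (1 - p) / p, which is the benchmark used for a real hitter's spacings.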
For each of our two hitters, suppose we look at all of the at-bats in a career and compute all of the spacings, the lengths of the gaps between successive home runs. A graphical way to compare these spacings with a Geometric distribution is a Geometric Plot. To construct a Geometric Plot, we construct a table of the values of the spacings ($y$) and the corresponding frequencies ($N_y$) and then construct a scatterplot of the values of ($y$, $\log N_y$). If the points follow a line, then spacings have (approximately) a Geometric distribution.
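Here is a rough Python sketch of the plot coordinates, assuming the plot is of log frequency against spacing length. That is the choice under which Geometric data fall on a line, since log P(Y = y) = log p + y log(1 - p). The helper names are hypothetical, and the "data" are idealized counts built directly from the Geometric formula, not real spacings.

```python
import math
from collections import Counter


def geometric_plot_points(gaps):
    """Return sorted (y, log N_y) pairs, where N_y counts spacings equal to y."""
    counts = Counter(gaps)
    return [(y, math.log(n)) for y, n in sorted(counts.items())]


def least_squares_line(points):
    """Ordinary least-squares (intercept, slope) through (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Idealized data: counts proportional to the Geometric pmf p * (1 - p)^y.
p = 0.2  # illustrative value only
gaps = []
for y in range(6):
    gaps += [y] * round(10000 * p * (1 - p) ** y)

points = geometric_plot_points(gaps)
intercept, slope = least_squares_line(points)

# For Geometric data the slope should be close to log(1 - p).
print(round(slope, 3), round(math.log(1 - p), 3))
```

For real spacings the high-$y$ counts are small and noisy, which is one reason a smoother is a better guide to curvature than any single point.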
Here is a reference to a family of plots to see if data follow Poisson, binomial, or geometric distributions. These plots were promoted by John Tukey in his Exploratory Data Analysis work.
Here we construct a Geometric Plot of the spacings between Aaron’s home runs. I add a smoothing curve to show the general pattern of the points. If the smoothing curve hugs the line, then we have approximately Geometric data. It takes some experience to interpret these plots, but I think Aaron’s points do follow a line pretty well. The conclusion is that Aaron’s home run spacings are approximately Geometric. By the way, note that several of Aaron’s largest spacings exceeded 80.
Let’s now look at the Geometric Plot of Pujols’ home run spacings. Comparing the two plots, I think Pujols’ spacings don’t follow a Geometric distribution as well as Aaron’s spacings. There seems to be stronger curvature in the Pujols plot: the concave up pattern corresponds to a hitter who is more streaky than one would anticipate from Geometric data. Pujols had several long spacings exceeding 100.
Streakiness in Each Season
In the previous section, we focused on the entire string of HR’s and non-HR’s in each player’s career at-bats. Now we want to do a streakiness study individually for each season of the players’ careers. Given the limited amount of data, Geometric plots are not as helpful in studying streakiness for a single season. But the Geometric distribution leads to a simple way to define streakiness. The mean of a Geometric distribution is $\mu = (1 - p)/p$ and the variance is $\sigma^2 = (1 - p)/p^2$. One can show that
$$\frac{\sigma^2}{\mu(\mu + 1)} = 1.$$
So this motivates a streaky measure:
- for a single season, collect the spacings
- estimate the mean $M$ and variance $V$ of the spacings
- compute the measure $S = V / (M(M + 1))$. If $S > 1$, then the spacings are streakier than one would expect from a Geometric. If $S < 1$, then the spacings are more consistent than Geometric.
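For readers who want the step behind this measure, here is a short derivation. It assumes the usual parameterization in which the Geometric variable counts the failures before a success with success probability $p$. The symbols $\mu$ and $\sigma^2$ denote the Geometric mean and variance.

```latex
\mu = \frac{1-p}{p}, \qquad
\mu + 1 = \frac{1}{p}, \qquad
\mu(\mu + 1) = \frac{1-p}{p^{2}} = \sigma^{2}
\quad\Longrightarrow\quad
\frac{\sigma^{2}}{\mu(\mu + 1)} = 1 .
```

Replacing $\mu$ and $\sigma^2$ by the sample mean and variance of the spacings gives a ratio that should be near 1 for Geometric data, above 1 when the spacings are more variable than Geometric (streaky), and below 1 when they are more regular.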
Here is an illustration of the computation of the measure $S$ using the spacings for the entire career string of HR. For each player, the table gives the number of spacings, the mean spacing length, the variance of the spacing lengths, and the measure $S$. Aaron’s average spacing length and value of $S$ are smaller than those for Pujols, indicating Aaron was more consistent than Pujols in his pattern of home run hitting.
  Player            N     M   Var     S
  <chr>         <int> <dbl> <dbl> <dbl>
1 Albert Pujols   643  16.7  276. 0.937
2 Henry Aaron     698  16.2  237. 0.849
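The S column can be approximately reproduced from the M and Var columns if one assumes the measure is the variance divided by mean × (mean + 1), the ratio that equals 1 for a Geometric distribution. The Python sketch below uses hypothetical function names and assumes the sample variance (n - 1 denominator); it is not the code that produced the table.

```python
from statistics import mean, variance


def streaky_measure(gaps):
    """Variance of the spacings divided by mean * (mean + 1).

    Assumes the sample variance (n - 1 denominator); the ratio is 1 on
    average for Geometric spacings.
    """
    m = mean(gaps)
    v = variance(gaps)
    return v / (m * (m + 1))


def measure_from_summary(m, v):
    """Same ratio, computed from an already summarized mean and variance."""
    return v / (m * (m + 1))


# Rounded M and Var values copied from the table.
summary = [("Albert Pujols", 16.7, 276.0), ("Henry Aaron", 16.2, 237.0)]
for player, m, v in summary:
    print(player, round(measure_from_summary(m, v), 2))

# The function also works directly on a small, made-up list of spacings.
print(round(streaky_measure([0, 3, 1, 2]), 3))
```

The values recomputed from the table come out near 0.93 and 0.85, close to the printed 0.937 and 0.849. The small differences are consistent with M and Var being rounded in the display.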
I repeat this computation using the spacings for the individual seasons. Below I plot the streaky measure $S$ against age for each of the two hitters. Many of the values are smaller than 1, indicating that both Aaron and Pujols were pretty consistent in the pattern of their home run hitting. But Pujols exhibits more streakiness than Aaron: Pujols’ streaky seasons occurred at ages 27, 29, 31, 32 and at the end of his career.
- (Differences between Aaron and Pujols) Although Aaron and Pujols both had remarkable home run hitting careers, their career trajectories were quite different. Aaron had some of his best home run seasons towards the end of his career. Pujols, in contrast, excelled in home run hitting early in his career, although he had a brief surge at the end of his career.
- (Streakiness?) I have written about Hank Aaron’s pattern of home run hitting in the past. He really exhibited a consistent pattern of home run hitting and had few slumps of notable length. Pujols also was a consistent home run hitter, but this exploration shows that Pujols tended to be streakier than Aaron in the sense that he had more long spacings between consecutive home runs. We saw that Pujols had several stretches of 100+ at-bats where he didn’t hit a home run (we call these “ofers”, as in 0 for 100).
- (R Package) I have an R package, BayesTestStreak, containing functions for exploring streaky patterns in sequences of 0/1 data. I am currently revising the functions to make them more data frame friendly.
- (More Reading on Streakiness) If you are familiar with my work, you know that I have written a bit on the general topic of streakiness in sports data. I recently collected many of my posts from this Exploring Baseball Data with R blog. I have an article Streakiness Patterns in Baseball that describes much of my blog work on streaky patterns of individuals and teams in baseball.