Monthly Archives: April, 2022

Exploring the 3000 Hit Club

Introduction

Miguel Cabrera recently joined the 3000 hit club, the players who have collected at least 3000 career hits. There are currently 33 hitters in this group. What does it mean for a player to have 3000 hits? Here are several things that come to mind:

  • A player in this club must have had a long career. 3000 hits translates to an average of 200 hits per season over 15 seasons, which is pretty impressive. It is a mark of sustained excellence in hitting.
  • The players in this club likely had high career batting averages. But I suppose there are also hitters with merely above-average BAs who reached 3000 hits by accumulating a large number of career at-bats (AB).
  • Membership in the club depends only on the hit total, not on career at-bats (AB), so I suppose that there would be a lot of variation in the career AB for the players in this 3000 hit group.
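
To make the trade-off between batting average and at-bats concrete, here is a small R sketch of the arithmetic. The three career batting averages are hypothetical values chosen for illustration, not figures for any particular player.

```r
# Hits = AB x BA, so the AB needed for 3000 hits is 3000 / BA.
target_hits <- 3000

# 3000 hits spread evenly over 15 seasons
hits_per_season <- target_hits / 15

# Hypothetical career batting averages (illustrative only)
career_ba <- c(0.270, 0.300, 0.330)

# Career at-bats required to reach 3000 hits at each average
ab_needed <- round(target_hits / career_ba)

print(data.frame(career_ba, ab_needed))
```

Under these assumed averages, a .270 hitter needs roughly 2000 more career at-bats than a .330 hitter to reach the same hit total.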

These comments motivated me to perform an exploration of the hitting career trajectories of the players in the 3000 hit club. Using data from the Lahman database, I will explore the batting averages of these players as a function of their age. The goal here is to get some insight into the best hitters for average in this group and learn about the shapes of the career trajectories of their season-to-season batting averages. We will see some interesting variation in the trajectory shapes.

Some Helpful Categorizations

It is not easy to compare these 33 hitters using a single graphical display. I will simplify this comparison by dividing these 33 hitters into six groups, and then we will graphically compare the trajectories of the hitters in each group.

  • First we divide the hitters by era. The “vintage” players are the ones whose midcareer is 1976 or earlier. The “modern” players have midcareers later than 1976. (1976 was the median of the midcareers of the hitters in this group.)
  • After inspecting the career trajectories for all 33 hitters, I see some differences in the trajectory shapes. The “early bloomers” tend to peak (in BA) early in their career, in their mid 20s, and then decline towards the end of their career. The “late bloomers”, in contrast, tend to have their best BAs in the last half of their career. The “common trajectory” players tend to have a familiar trajectory shape — BA increases until about age 30 when it peaks, and then declines until retirement. I looked individually at each “smoothed” trajectory to make a judgement on the trajectory type for each player.
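
The era split can be sketched in a few lines of R. I have not said above exactly how midcareer is computed, so this sketch assumes it is the midpoint of a player's first and last seasons. The five careers are made-up values, and the column names are hypothetical.

```r
# Hypothetical careers (first and last seasons are made up for illustration)
careers <- data.frame(
  Player = c("P1", "P2", "P3", "P4", "P5"),
  First  = c(1905, 1941, 1963, 1982, 2003),
  Last   = c(1928, 1963, 1986, 2001, 2021)
)

# Assumption: midcareer is the midpoint of the first and last seasons
careers$MidCareer <- (careers$First + careers$Last) / 2

# Split at the median midcareer: at or below is "vintage", above is "modern"
cutoff <- median(careers$MidCareer)
careers$Era <- ifelse(careers$MidCareer <= cutoff, "vintage", "modern")

print(careers)
```

Splitting at the median guarantees that the two era groups are about the same size, which is why the cutoff for the actual 33 hitters lands at 1976.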

Since there are two era groups and three trajectory groups, I divide the 33 players into 2 x 3 = 6 groups. Below I present the smoothed BA trajectories for the players in each of the six groups and make some remarks comparing the hitters in each group. I am displaying smoothed trajectories using the loess algorithm on the graph of BAs against Age. By displaying only the smooths, we obtain cleaner displays and can focus on the patterns of the hitters’ BA trajectories.
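
For readers who have not used loess, here is a minimal base R sketch of smoothing one player's BA against Age. The data are simulated (a made-up quadratic trajectory plus noise), not any real player's record, and the span and degree shown are simply the defaults of loess().

```r
set.seed(2022)

# Simulated career: BA peaks near age 29, with season-to-season noise
age     <- 21:40
true_ba <- 0.300 - 0.0006 * (age - 29)^2
ba      <- true_ba + rnorm(length(age), sd = 0.015)

# Local quadratic fit; span = 0.75 and degree = 2 are the loess() defaults
fit       <- loess(ba ~ age, span = 0.75, degree = 2)
smooth_ba <- predict(fit)

# Raw BAs as points, the smooth as a line
plot(age, ba, xlab = "Age", ylab = "BA")
lines(age, smooth_ba)
```

The smooth removes much of the season-to-season noise, which is what lets the overall shape of a trajectory stand out when many players share one graph.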

Vintage Early Bloomers

This is one of the most common groups: the vintage players (midcareer 1976 or earlier) who all peaked early in their careers. Note that Ty Cobb appears to be the dominant hitter in this group, but it is perhaps unfair to say he was the best since BAs tended to be higher in Cobb’s era. Carl Yastrzemski, in contrast, had a relatively modest BA in his career, but played during a tough era for hitting. Yastrzemski won the AL batting crown with a .301 BA during the pitching-dominated 1968 season.

Modern Early Bloomers

These more recent (midcareer after 1976) early bloomers are familiar to baseball fans. The group includes Albert Pujols, who is completing his career with the Cardinals, and Alex Rodriguez, who currently works as an announcer for ESPN baseball. From this display, we see that Wade Boggs had a remarkably high BA during the early part of his career.

Vintage Late Bloomers

It is less common for these 3000 hit players to be late bloomers, but here is a graph for three players in this group. Tris Speaker had a remarkable BA in his mid 30s and Roberto Clemente had a high BA during the final seasons of his career.

Modern Late Bloomers

The modern era late bloomers include Tony Gwynn, who had great BAs in his mid 30s, and Adrian Beltre, who retired in 2018.

Vintage Common Trajectories

Although we are calling the “rise to 30 and decline” shape common, there were only three vintage players who fit this category. Although Pete Rose is the leader in career hits, his BAs were significantly lower than those of Rod Carew and Honus Wagner.

Modern Common Trajectories

Five modern players fit the common trajectory shape. Derek Jeter actually had a high BA for much of his career. Cabrera had a remarkably high BA at age 30 but declined sharply in recent seasons.

A Few Remarks

  • Comparing players from different eras? One criticism of these comparison graphs is that I am comparing smoothed batting averages of players from different eras, and this may not be reasonable since a typical BA has changed over the history of baseball. One way to make a fairer comparison is to standardize each BA within its season, that is, to compare a player’s BA with the BAs of the other hitters who played that same season.
  • Career trajectory shapes? This exercise demonstrates that it is unreasonable to assume that player career trajectories have a standard shape such as the common “rise, hit a peak at age 30, fall” shape. Certainly MLB teams need to have a good understanding of a free agent’s career trajectory when they offer a long-term contract.
  • Why write another blog post on batting average? I agree that batting average is a poor measure of hitting performance. But if we care about career hits, that leads us down the batting average path.
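
The standardization idea in the first remark could be sketched as a within-season z-score. This is one possible way to standardize, not something I did for the graphs above, and the small data frame below is entirely made up.

```r
# Hypothetical season BAs for a few hitters in two seasons (made-up values)
d <- data.frame(
  Season = c(1911, 1911, 1911, 1968, 1968, 1968),
  Player = c("A", "B", "C", "D", "E", "F"),
  BA     = c(0.420, 0.300, 0.270, 0.301, 0.250, 0.220)
)

# z-score: how many standard deviations a BA sits above its season mean
z_score <- function(x) (x - mean(x)) / sd(x)

# ave() applies the function separately within each season
d$zBA <- ave(d$BA, d$Season, FUN = z_score)

print(d)
```

In practice the season mean and standard deviation would be computed over all regular hitters in that season, not just over the members of the 3000 hit club.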

R Script

The R script for this particular exploration can be found on my GitHub Gist site. By running this code, one can replicate all of these graphs. Here are the main steps in this work.

  1. First one needs to collect the relevant data from the Lahman database. The Batting data frame contains the season-to-season batting statistics and the People data frame contains information such as first name, last name and birth year for all MLB players. (The Lahman package contains these data frames for players through the 2020 season.)
  2. One needs to identify the players with at least 3000 hits. Then one creates a data frame containing the Season, H, and AB for all seasons for the players in this club.
  3. One creates the classification variables Early/Late and Type of Trajectory. Also one needs to create an Age variable using the MLB definition of age.
  4. In the ggplot2 code, the key element is geom_smooth() that implements the loess smoothing. I focused on plotting smoothed trajectories of BA for only the seasons where the player had at least 300 AB.
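
The steps above can be sketched roughly as follows. This is not the Gist script itself, only a minimal outline under some assumptions: club_seasons() is a hypothetical helper name, and I take the MLB definition of age to mean a player's age on June 30 of the season. The plotting part runs only if the Lahman and ggplot2 packages are installed.

```r
# Sketch of steps 1-3 in base R. It works on any data frames shaped like
# Lahman's Batting (playerID, yearID, AB, H) and People (playerID,
# birthYear, birthMonth). club_seasons() is a hypothetical helper name.
club_seasons <- function(batting, people, min_hits = 3000) {
  # Step 2: career hit totals, then the players at or above the threshold
  career <- aggregate(H ~ playerID, data = batting, FUN = sum)
  ids <- career$playerID[career$H >= min_hits]

  # One row per player-season (a traded player has several stints per year)
  d <- batting[batting$playerID %in% ids, ]
  d <- aggregate(cbind(AB, H) ~ playerID + yearID, data = d, FUN = sum)

  # Step 3: Age, under the assumed convention of age as of June 30
  d <- merge(d, people[, c("playerID", "birthYear", "birthMonth")],
             by = "playerID")
  d$Age <- d$yearID - ifelse(d$birthMonth >= 7, d$birthYear + 1, d$birthYear)
  d$BA <- d$H / d$AB
  d
}

# Steps 1 and 4 need the Lahman and ggplot2 packages, so run only if installed
if (requireNamespace("Lahman", quietly = TRUE) &&
    requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  d <- club_seasons(Lahman::Batting, Lahman::People)
  # Smooth only the seasons with at least 300 AB, one curve per player
  p <- ggplot(subset(d, AB >= 300), aes(Age, BA, group = playerID)) +
    geom_smooth(method = "loess", formula = y ~ x, se = FALSE)
  print(p)
}
```

The era and trajectory-type labels from step 3 are left out here, since the trajectory type was a judgement made by eye for each player; once added as columns they could be passed to facet_grid() to get the six panels.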