Exploring Hall of Fame Voting


Since we’re still over a month away from the beginning of baseball season, it seemed appropriate to study an off-season problem: how players get elected to the Baseball Hall of Fame (HOF).

There are different committees that can elect players to the HOF, but we’ll focus on election by the Baseball Writers’ Association of America (BBWAA). The rules for election are stated on the HOF site. Essentially, each voter can vote for up to 10 players, and a player is inducted by receiving votes on 75% or more of the ballots cast. (Players can remain on the ballot for multiple years.) Here we will focus on the pattern of votes for specific players. We’ll look at recent players who received at least 20% of the vote in their initial year and look at the rate of increase in votes as players stay on the HOF ballot.

I was motivated by this post, which provides a visualization of Hall of Fame voting trajectories. The graph is interesting, but it is hard to get a sense of general trajectory patterns from it.

Data Preparation

The voting data is available (through 2017) in the data frame HallOfFame from the Lahman package. We’ll also use the Master data frame to get the player names. We limit our search to players who were on the HOF ballot in 1960 or later. As mentioned above, we focus on the votes of the BBWAA and only consider players who were on the ballot for 5 or more years and who received at least 20% of the vote in their first year. There are 35 players in our group.
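
The selection rules above can be sketched in base R roughly as follows. This is a minimal sketch under stated assumptions, not the script behind this post: `select_trajectories` and the derived columns `prop`, `n_years`, `first_prop`, and `year_index` are hypothetical names, the input is assumed to carry the `playerID`, `yearID`, `votedBy`, `ballots`, and `votes` columns of the Lahman voting table, and "first year" is assumed to mean a player's first BBWAA ballot in 1960 or later. The toy input uses made-up numbers.

```r
# Hypothetical helper: keep BBWAA voting rows for players who meet the
# criteria in the text (1960 or later, 5+ ballots, 20%+ in the first year).
select_trajectories <- function(hof, min_year = 1960,
                                min_years = 5, min_first = 0.20) {
  keep <- which(hof$votedBy == "BBWAA" & hof$yearID >= min_year &
                  !is.na(hof$votes) & !is.na(hof$ballots))
  d <- hof[keep, ]
  d$playerID <- as.character(d$playerID)
  d <- d[order(d$playerID, d$yearID), ]
  d$prop <- d$votes / d$ballots                           # vote proportion
  d$n_years <- ave(d$yearID, d$playerID, FUN = length)    # ballots per player
  d$first_prop <- ave(d$prop, d$playerID, FUN = function(p) p[1])
  d$year_index <- d$yearID - ave(d$yearID, d$playerID, FUN = min)
  d[d$n_years >= min_years & d$first_prop >= min_first, ]
}

# Toy input (made-up numbers, not real voting results)
toy <- data.frame(
  playerID = rep(c("aaa01", "bbb01", "ccc01"), times = c(5, 5, 2)),
  yearID   = c(2001:2005, 2001:2005, 2001:2002),
  votedBy  = "BBWAA",
  ballots  = 500,
  votes    = c(150, 175, 200, 240, 260,   # starts at 30%: kept
               50, 60, 70, 80, 90,        # starts at 10%: dropped
               200, 210)                  # only 2 ballots: dropped
)
traj <- select_trajectories(toy)
print(unique(traj$playerID))
```

With the Lahman package installed, the corresponding call would presumably be `select_trajectories(Lahman::HallOfFame)`. Whether that reproduces exactly the 35 players depends on the assumptions listed above.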

Initial Plot

We start by plotting the trajectories of vote percentages of our 35 players. We’ve added a horizontal line at 0.75, the threshold for election to the HOF. It is hard to compare players because the trajectories are not smooth. But most players’ vote percentages tend to increase, and the rates of increase look similar across players.
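
A plot of this kind could be drawn along the following lines. This is only a sketch in base R graphics (the original script may well use a different plotting package). `plot_trajectories` is a hypothetical helper, the columns `playerID`, `year_index`, and `prop` are assumed names, and the toy trajectories are made up.

```r
# Hypothetical helper: one line per player plus the 75% election threshold.
plot_trajectories <- function(traj) {
  plot(traj$year_index, traj$prop, type = "n", ylim = c(0, 1),
       xlab = "Years since first ballot", ylab = "Vote proportion")
  players <- split(traj, traj$playerID)
  for (d in players) {
    d <- d[order(d$year_index), ]
    lines(d$year_index, d$prop, col = "grey40")
  }
  abline(h = 0.75, lty = 2, col = "red")   # election threshold
  invisible(length(players))               # number of trajectories drawn
}

# Toy trajectories (made-up numbers)
traj <- data.frame(
  playerID   = rep(c("aaa01", "bbb01"), each = 5),
  year_index = rep(0:4, times = 2),
  prop       = c(0.30, 0.35, 0.33, 0.45, 0.52,
                 0.25, 0.24, 0.31, 0.30, 0.38)
)
n_drawn <- plot_trajectories(traj)
```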


Plot Best-Fitting Lines

To make this graph easier to read, we summarize each trajectory by a best-fitting line; the graph of these fits is displayed below. This graph supports the impression that the slopes are similar in size across players.
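
One way to produce such a display is to fit a least-squares line to each player's trajectory with `lm()` and draw only the fitted segment. The sketch below assumes the same hypothetical columns (`playerID`, `year_index`, `prop`) and made-up data; `plot_fitted_lines` is an illustrative name, not a function from the original script.

```r
# Hypothetical helper: replace each trajectory by its least-squares line.
plot_fitted_lines <- function(traj) {
  plot(traj$year_index, traj$prop, type = "n", ylim = c(0, 1),
       xlab = "Years since first ballot", ylab = "Fitted vote proportion")
  fits <- lapply(split(traj, traj$playerID),
                 function(d) lm(prop ~ year_index, data = d))
  for (id in names(fits)) {
    # Draw the fitted line only over the years the player was on the ballot
    x <- range(traj$year_index[traj$playerID == id])
    y <- predict(fits[[id]], newdata = data.frame(year_index = x))
    lines(x, y)
  }
  abline(h = 0.75, lty = 2, col = "red")
  invisible(fits)
}

# Toy trajectories (made-up numbers)
traj <- data.frame(
  playerID   = rep(c("aaa01", "bbb01"), each = 5),
  year_index = rep(0:4, times = 2),
  prop       = c(0.30, 0.35, 0.33, 0.45, 0.52,
                 0.25, 0.24, 0.31, 0.30, 0.38)
)
fits <- plot_fitted_lines(traj)
```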


Collect the Slopes and Intercepts

Suppose that for each player we fit a line of the form

Vote Proportion = Intercept + Slope × (yearID − initialYear)

where Intercept estimates the vote proportion in the first year and Slope estimates the yearly increase in the vote proportion. Below I construct a scatterplot of the intercepts and the slopes. The labels are colored by whether the player eventually got into the HOF (green) or not (red).
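
Collecting the coefficients could look roughly like this in base R. It is a sketch under assumptions: `fit_player_lines`, the column names, and the `inducted` lookup are hypothetical, and the toy data are exactly linear so that the fitted values are easy to check by eye.

```r
# Hypothetical helper: one row per player with fitted intercept and slope.
fit_player_lines <- function(traj) {
  rows <- lapply(split(traj, traj$playerID), function(d) {
    b <- coef(lm(prop ~ year_index, data = d))
    data.frame(playerID  = as.character(d$playerID[1]),
               intercept = unname(b[1]),   # fitted first-year proportion
               slope     = unname(b[2]))   # fitted yearly change
  })
  out <- do.call(rbind, rows)
  rownames(out) <- NULL
  out
}

# Toy data (made up): exact lines 0.30 + 0.05 t and 0.20 + 0.02 t
traj <- data.frame(
  playerID   = rep(c("aaa01", "bbb01"), each = 5),
  year_index = rep(0:4, times = 2),
  prop       = c(0.30 + 0.05 * (0:4), 0.20 + 0.02 * (0:4))
)
coefs <- fit_player_lines(traj)
print(coefs)

# Labeled scatterplot; the induction outcomes here are hypothetical
inducted <- c(aaa01 = TRUE, bbb01 = FALSE)
plot(coefs$intercept, coefs$slope, type = "n",
     xlab = "Intercept", ylab = "Slope")
text(coefs$intercept, coefs$slope, labels = coefs$playerID,
     col = ifelse(inducted[coefs$playerID], "darkgreen", "red"))
```

Because `year_index` is measured from the first ballot, the intercept is the fitted proportion in the first year, which need not equal the observed first-year proportion.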


What do we see?

  • In the first year, these players got between 20 and 60 percent of the vote, and the yearly increases range from 0 to 12 percentage points.
  • The players with high yearly increases, such as Williams, Aparicio, and Mathews, tend to be elected. The players with high initial percentages (Sutton, Campanella, Perez, and Niekro) were also elected.
  • There appear to be two hurdles to getting elected: a small initial voting percentage and a small yearly increase in votes. The red labels (corresponding to not getting into the Hall) cluster in the lower left of the graph.
  • It is interesting that Barry Bonds and Roger Clemens sit in the middle of this graph. On the basis of their on-field accomplishments, both players are very deserving of the Hall of Fame. If each continues to gain about 4 percentage points a year, each would reach the 75% threshold in about four years.
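
The projection in the last bullet is a simple linear extrapolation, which can be sketched as below. The post does not state the players' current vote shares, so the value 0.59 is an illustrative assumption, and `years_to_threshold` is a hypothetical helper. It also assumes the yearly gain stays constant, which real trajectories need not do.

```r
# Hypothetical helper: whole years needed to reach the election threshold,
# assuming a constant yearly gain (slope) in the vote proportion.
years_to_threshold <- function(current, slope, threshold = 0.75) {
  if (current >= threshold) return(0)
  if (slope <= 0) return(Inf)    # never reached under a flat or falling trend
  # Small tolerance guards against floating-point noise in the division
  ceiling((threshold - current) / slope - 1e-9)
}

# Assumed current share of 0.59 with a gain of 0.04 per year
yrs_a <- years_to_threshold(0.59, 0.04)
# A slightly higher assumed share gives the same whole-year answer
yrs_b <- years_to_threshold(0.60, 0.04)
print(c(yrs_a, yrs_b))
```

Under these assumptions both calls return 4, in line with the four-year figure above.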


Obviously, this was not a complete analysis (a couple of hours of R work), but this is an interesting issue and deserves further exploration.

  • From a statistical perspective, I would be interested in prediction issues. Given that a player has an initial vote of, say, 30%, what is the probability that he will get inducted into the HOF?
  • Does the player’s fielding position make a difference in the voting pattern?
  • One hears about the importance of reaching milestones such as hitting 500 home runs. How important are these milestones for HOF election?

R code

One can completely replicate this work by following the R script on my GitHub Gist site.