Jack Morris and the Hall of Fame

In less than a month we will know the names of the players (if any) who will join the trio of managers Joe Torre, Bobby Cox and Tony La Russa for next year's induction into baseball's Hall of Fame.

It will be Jack Morris’ 15th and final year on the ballot, so this will be his last chance to get in via the BBWAA vote.

Let’s visualize how many votes he got on the previous ballots, with the help of R of course.

We begin by setting a couple of options: we don’t want strings to be automatically converted to factors, and we set the working directory to the folder containing the Lahman database files (Jim briefly explained how to get the Lahman baseball database in the previous post).

options(stringsAsFactors=FALSE)
#set working dir
setwd("your/directory/containing/Lahman/DB")

Then we load the master and HallOfFame tables in R. The former contains biographical data of baseball people, the latter has information on Hall of Fame voting results.

master = read.csv("Master.csv")
HoF = read.csv("HallOfFame.csv")

Next we choose our pitcher, look up his hofID (i.e. the code that identifies him in the HallOfFame table) in the master table, extract his voting data, and compute the percentage of votes he obtained in each of his appearances on the ballot.

#select pitcher
pitcher = "Jack Morris"
#get HOF voting data
pitHoFid = subset(master, paste(nameFirst, nameLast)==pitcher)$hofID
pitHOF = subset(HoF, hofID==pitHoFid)
pitHOF$pct = 100 * pitHOF$votes / pitHOF$ballots
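
As a quick sanity check of the percentage calculation, here is a minimal sketch on a made-up data frame. The vote counts are invented for illustration and are not real ballot results; only the column names (yearid, votes, ballots) follow the HallOfFame table used above.

```r
# Toy data: the numbers are invented, not real voting results
toyHOF = data.frame(
  yearid  = c(2001, 2002, 2003),
  votes   = c(100, 250, 390),
  ballots = c(500, 500, 520)
)
# percentage of ballots naming the player
toyHOF$pct = 100 * toyHOF$votes / toyHOF$ballots
# TRUE when the 75% induction threshold is reached
toyHOF$elected = toyHOF$pct >= 75
print(toyHOF)
```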

Let’s now make a plot of the vote percentage by year.

library(ggplot2)
ggplot(data=pitHOF, aes(x=yearid, y=pct)) +
  geom_line(size=1.2) +
  ylim(c(0,100)) +
  geom_hline(yintercept=75, linetype=2) +
  geom_text(x=min(pitHOF$yearid), y=77, label="induction threshold", hjust=0) +
  xlab("year of ballot") +
  ylab("percentage of favorable ballots") +
  ggtitle(pitcher) +
  scale_x_continuous(breaks=min(pitHOF$yearid):max(pitHOF$yearid))

Inside the geom_text call, there’s hjust=0: that tells R we want the text to be left-aligned (hjust=1 would make it right-aligned), relative to the position specified by x.
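
One caveat worth knowing: when x, y and label are given as constants, geom_text draws the label once for every row of the data, so the text ends up overplotted on itself. A possible alternative, sketched here on invented data, is annotate(), which draws a single label. The toyHOF data frame and its values are made up for illustration.

```r
library(ggplot2)

# Toy data: invented percentages, not real voting results
toyHOF = data.frame(yearid = 2001:2005, pct = c(20, 28, 35, 41, 52))

p = ggplot(data=toyHOF, aes(x=yearid, y=pct)) +
  geom_line() +
  ylim(c(0, 100)) +
  geom_hline(yintercept=75, linetype=2) +
  # annotate() adds one text label, independent of the number of rows
  annotate("text", x=min(toyHOF$yearid), y=77,
           label="induction threshold", hjust=0)
print(p)
```

The hjust=0 argument keeps the same meaning here: the label is left-aligned at the x position.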

The last line has been added in order to have tick marks at every year on the x-axis. If you run the code without it, you should see sparser tick marks.
There’s a big family of scale_... functions: for example, in the post about ggplot2 tips and tricks we encountered scale_shape_manual and scale_color_manual which were used to affect the shape and the color of data points respectively.
Here we want to modify the appearance of the x-axis, to which a continuous variable is mapped, hence the call to scale_x_continuous. The breaks argument accepts a vector of the x-positions where we want tick marks to be shown.

Here’s the resulting plot.

Morris has received increasing support as the years went by, but will he get the needed push for entering Cooperstown in his final year on the ballot?

A couple of years ago, another pitcher whose time on the ballot was running out finally made it.
If you change the line where the pitcher is specified as below and re-run the rest of the code, you’ll see the trajectory of BBWAA voting for Bert Blyleven, inducted in 2011.

pitcher = "Bert Blyleven"
#re-run all the rest

Before wrapping up this post, let’s have a look at some numbers recorded by Morris during his career.
The pitching table contains regular season pitching data going back to the early days of MLB.
Here we load the data in R, then we aggregate numbers over a season (since pitchers having played for multiple teams in a season have multiple entries), and recalculate ERAs.

# get pitching data
pitching = read.csv("Pitching.csv")
# season totals by pitcher
library(doBy)
pitching = summaryBy(ER + IPouts + SO + BB ~ playerID + yearID
                     , data=pitching, FUN=sum, keep.names=TRUE)
pitching$ERA = pitching$ER * 27 / pitching$IPouts
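
The factor 27 comes from the usual definition ERA = 9 × ER / IP, with IP = IPouts / 3. Here is a small sketch of the same aggregation using base R's aggregate() in place of summaryBy, applied to invented stint lines for a hypothetical pitcher who changed teams mid-season. The player ID and all numbers are made up.

```r
# Toy data: two stints in one season for a hypothetical pitcher
toyPitching = data.frame(
  playerID = c("doejo01", "doejo01"),
  yearID   = c(1990, 1990),
  ER       = c(30, 20),
  IPouts   = c(300, 150)
)
# season totals by pitcher (a base R take on the summaryBy step)
toyTotals = aggregate(cbind(ER, IPouts) ~ playerID + yearID,
                      data=toyPitching, FUN=sum)
# ERA = 9 * ER / innings, with innings = IPouts / 3
toyTotals$ERA = toyTotals$ER * 27 / toyTotals$IPouts
print(toyTotals)
```

Note that the ERA must be recomputed from the summed earned runs and outs; averaging the per-stint ERAs would weight a short stint as much as a long one.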

Now we get the data for the selected pitcher. Remember that if you have re-run the code for Blyleven and now want to go back to Jack Morris, you need to assign the value "Jack Morris" to the object pitcher again.

pitID = subset(master, paste(nameFirst, nameLast)==pitcher)$playerID
pitData = subset(pitching, playerID==pitID)
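
Since the lookup relies on an exact match of first and last name, it can return no ID at all, or several if two players share a name. One possible guard is sketched below with a hypothetical helper, get_player_id, which is not part of the original code, and an invented stand-in for the master table.

```r
# Hypothetical helper: returns the playerID for a full name,
# stopping if the name matches no row or more than one
get_player_id = function(master, fullName) {
  ids = master$playerID[paste(master$nameFirst, master$nameLast) == fullName]
  if (length(ids) != 1) {
    stop("expected exactly one match for '", fullName, "', found ", length(ids))
  }
  ids
}

# Toy stand-in for the master table (invented rows)
toyMaster = data.frame(
  playerID  = c("doejo01", "doeja01", "roeri01"),
  nameFirst = c("John", "Jane", "Richard"),
  nameLast  = c("Doe", "Doe", "Roe")
)
print(get_player_id(toyMaster, "Jane Doe"))
```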

Then we get the numbers of Jack Morris’ contemporaries, in other words from the pitching table we select pitchers/seasons that occur during Jack’s career. We also leave out pitchers who do not qualify for the ERA title (those who have not reached the 162 innings pitched threshold).

contemporaries = subset(pitching
                         , yearID >= min(pitData$yearID) 
                         & yearID <= max(pitData$yearID)
                         & IPouts >= 162*3)

Finally, here’s a plot for comparing Morris’ ERA with the rest of MLB pitchers year by year.

ggplot(data=pitData, aes(x=factor(yearID), y=ERA)) +
  geom_boxplot(data=contemporaries, aes(x=factor(yearID), y=ERA)) +
  geom_point(data=contemporaries, aes(x=factor(yearID), y=ERA), position=position_jitter(width = 0.15), alpha=.6) +
  geom_point(col="blue", size=5) +
  xlab("season") +
  ggtitle(paste(pitcher, "(ERA)"))

Note the position argument in the geom_point call. It adds some noise to the x-axis position of the data points in order to reduce their overlap. Some degree of transparency (alpha) has also been added to make cluttered data points easier to read.

Looking at the chart above, Morris does not appear to have posted outstanding ERA numbers: his ERA (blue dots) cracked the top 25% only in three seasons, and twice barely.
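
The "top 25%" reading can also be checked numerically rather than by eye. Here is a minimal sketch, assuming data frames shaped like pitData and contemporaries (with columns yearID and ERA); the toy values below are invented.

```r
# Toy data: invented ERAs for qualifying pitchers in two seasons
toyContemp = data.frame(
  yearID = rep(c(1990, 1991), each=5),
  ERA    = c(2.5, 3.0, 3.5, 4.0, 4.5,  2.0, 3.0, 4.0, 5.0, 6.0)
)
# Toy seasons for the pitcher of interest
toyPit = data.frame(yearID = c(1990, 1991), ERA = c(2.8, 4.2))

# first quartile of ERA in each season (lower ERA is better)
q1 = aggregate(ERA ~ yearID, data=toyContemp,
               FUN=function(x) unname(quantile(x, 0.25)))
names(q1)[2] = "q1ERA"
# flag the seasons where the pitcher is at or below the first quartile
toyPit = merge(toyPit, q1, by="yearID")
toyPit$topQuartile = toyPit$ERA <= toyPit$q1ERA
print(toyPit)
```

Summing the topQuartile column then gives the number of seasons in the best 25%.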

Note that this is a post about writing R code, and I don’t want anybody to believe a simple chart like the one above is conclusive against Jack’s candidacy. ERA is just one part of measuring pitching ability, and I have not even tried to make any basic adjustment. In fact, Morris played in the American League throughout his career, and I have compared him against both AL and NL pitchers: remember that NL pitchers get to face batting pitchers, so their ERAs are expected to be lower for that reason alone. I have also not considered park factors (see Chapter 11 in our book), among many other things.
If you are looking for compelling cases for and against Morris in the Hall of Fame, you can find as many as you want around the web.

On the other hand, if you prefer writing some R code to reading debates, you may try to build similar plots for other pitchers on the ballot (see the full ballot at Baseball-Reference.com).
Also, with the help of the ggplot2 tips and tricks shown earlier on this blog, you could try to improve the plot above by adding a legend to it.
