# Building a Hall of Fame Classifier

It’s that time of year again when debate over election to the Baseball Hall of Fame picks up. This year’s ballot involves a few new wrinkles, but I thought it might be instructive to examine previous ballots. While I certainly haven’t put as much thought into this as Jay Jaffe has with JAWS, or Bill James has with his Standards and Monitor, I was curious to see how well a simple decision tree would perform as a classifier, and if there was anything we could learn from it.

Luckily, all of the data we will need is in the `Lahman` database. As usual, I’ll also be using the `mosaic` package, which loads `dplyr` along with it.

```
require(mosaic)
require(Lahman)
```

#### Collecting Data

A classification tree is a model that recursively partitions a data set into increasingly “pure” subsets, relative to some categorical (here binary) response variable. Our binary response variable is: “have you been elected to the Hall of Fame through the BBWAA (or Special Election)?” We can gather this information from the `HallOfFame` table.

```
inductees <-
  HallOfFame %>%
  group_by(playerID) %>%
  filter(votedBy %in% c("BBWAA", "Special Election") & category == "Player") %>%
  summarise(yearsOnBallot = n(), inducted = sum(inducted == "Y"), best = max(votes/ballots)) %>%
  arrange(desc(best))
```

In general, we are only interested in players elected by the BBWAA, but we should also include two players (Roberto Clemente and Lou Gehrig) who were elected via “Special Election”, because they clearly had Hall of Fame stats, but simply bypassed the process due to untimely circumstances.

In order to train our model, we need a set of explanatory variables. In the interest of simplicity, we’re going to include just two batting statistics: hits (`H`) and home runs (`HR`).

```
batting <-
  Batting %>%
  group_by(playerID) %>%
  summarise(numSeasons = length(unique(yearID)), lastSeason = max(yearID), tH = sum(H), tHR = sum(HR)) %>%
  arrange(desc(tH))
```

For pitchers, we’ll again only include the simplest stats: wins (`W`), strikeouts (`SO`), and saves (`SV`).

```
pitching <-
  Pitching %>%
  group_by(playerID) %>%
  summarise(numSeasons = length(unique(yearID)), lastSeason = max(yearID), tW = sum(W), tSO = sum(SO), tSV = sum(SV)) %>%
  arrange(desc(tW))
```

Finally, we’ll include the main awards: MVP, Cy Young, and Gold Glove.

```
awards <-
  AwardsPlayers %>%
  group_by(playerID) %>%
  summarise(mvp = sum(awardID == "Most Valuable Player"), gg = sum(awardID == "Gold Glove"), cy = sum(awardID == "Cy Young Award"))
```

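Merging these four data sets gives us a single data frame containing all of this information. Each row in the data frame corresponds to the career totals for a single player. The first `merge` is a full outer join, so that both hitters and pitchers are kept, and the second is a left join, so that players without any awards are kept. We only want to include players who actually appeared on a Hall of Fame ballot, so it is important that the last `merge` is an inner join (the default behavior of `merge`).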

```
candidates = merge(x=batting, y=pitching, by="playerID", all=TRUE)
candidates = merge(x=candidates, y=awards, by="playerID", all.x=TRUE)
candidates = merge(x=candidates, y=inductees, by="playerID")
```
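
To make the difference between the join types concrete, here is a minimal sketch on three made-up players. The data frames `batting_toy`, `pitching_toy`, and `ballot_toy` and their `playerID` values are hypothetical and exist only for illustration; only the `merge` arguments mirror the code above.

```r
# Hypothetical toy tables: one hitter, one two-way player, one pitcher
batting_toy  <- data.frame(playerID = c("a01", "b01"), tH = c(3000, 10))
pitching_toy <- data.frame(playerID = c("b01", "c01"), tW = c(250, 5))
# Only two of the three players ever appeared on a ballot
ballot_toy   <- data.frame(playerID = c("a01", "b01"), inducted = c(1, 0))

# all = TRUE is a full outer join: every player from either table is kept,
# and statistics missing from one side are filled with NA
full <- merge(x = batting_toy, y = pitching_toy, by = "playerID", all = TRUE)

# The default merge is an inner join: players who never appeared on a
# ballot ("c01" here) are dropped
on_ballot <- merge(x = full, y = ballot_toy, by = "playerID")

full
on_ballot
```

The full outer join is also where the `NA` values come from: a pure hitter has no pitching totals, and vice versa.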

We now have a lot of `NA`s in our data frame (a hitter who never pitched has no pitching totals, for example), but in this case it is safe to overwrite those with zeros. This will enable our classifier to work properly.

```
candidates[is.na(candidates)] <- 0
```

#### Building the classifier

Classification trees are provided in R through the `rpart` package (for *r*ecursive *part*itioning). The `rpart` function will build the decision tree using the `formula` interface.

```
require(rpart)
mod = rpart(as.factor(inducted) ~ tH + tHR + mvp + gg + tW + tSO + tSV + cy, data=candidates)
```
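
To give a feel for what “pure” means when the tree picks a split, here is a rough sketch of Gini impurity, which, as far as I know, is the default splitting criterion `rpart` uses for classification. The helper functions `gini` and `split_impurity` and the six toy players are my own illustration, not part of `rpart`.

```r
# Gini impurity of a binary 0/1 vector: 0 when the subset is perfectly pure,
# 0.5 when it is an even mix of both classes
gini <- function(y) {
  p <- mean(y)
  2 * p * (1 - p)
}

# Size-weighted impurity of the two subsets produced by splitting x at a threshold
split_impurity <- function(x, y, threshold) {
  left  <- y[x <  threshold]
  right <- y[x >= threshold]
  (length(left) * gini(left) + length(right) * gini(right)) / length(y)
}

# Six hypothetical players: career hits and whether they were inducted
hits     <- c(500, 1200, 1800, 2400, 2700, 3100)
inducted <- c(0, 0, 0, 0, 1, 1)

gini(inducted)                        # impurity before any split
split_impurity(hits, inducted, 2500)  # separates the two classes perfectly
split_impurity(hits, inducted, 1500)  # a worse split: right side is still mixed
```

A tree-building algorithm searches over variables and thresholds for the split that reduces impurity the most, and then repeats the search inside each resulting subset.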

Let’s examine our tree. Here is what it looks like:

```
require(maptree)
draw.tree(mod)
```

The first question is: “did you accumulate more than 2578 hits in your career?” If the answer is “yes”, then the follow-up question is: “did you accumulate more than 2993 hits in your career?” If the answer to that is “yes”, then you probably made the Hall of Fame. If the answer is “no”, then “did you hit at least 405 home runs?” If yes, then you probably made the Hall of Fame. If not, then you probably didn’t. This is interesting in that it suggests that roughly 3000 hits and roughly 400 home runs act as decision boundaries.

On the other side, if you didn’t have 2578 hits, then “did you have at least 249 wins?” The only players who answer “yes” to this question are pitchers. The follow-up question is: “did you strike out at least 2490 batters?” If yes, then you probably made the Hall of Fame, but if not, then you probably didn’t.

On the other hand, if you had fewer than 2578 hits and fewer than 249 wins, then the only way you are making the Hall of Fame is if you won at least two MVPs or won one MVP and struck out at least 1146 batters (i.e. you won that MVP as a pitcher).
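
Thresholds like these can also be read off a fitted tree programmatically rather than from the picture. The sketch below does this on a synthetic data set (`toy`, entirely made up, with induction determined by a single cutoff of 2500 hits) so that it runs without the `Lahman` data; the same fields should be available on `mod`, though I have only illustrated them here on the toy model.

```r
library(rpart)

# Synthetic data: 200 hypothetical players, inducted exactly when tH >= 2500
toy <- data.frame(tH = seq(0, 3980, by = 20))
toy$inducted <- as.integer(toy$tH >= 2500)

toy_mod <- rpart(as.factor(inducted) ~ tH, data = toy)

# The first row of the frame describes the root node; `var` names the split variable
root_var <- as.character(toy_mod$frame$var[1])
# The splits matrix stores the cut point in the "index" column
root_cut <- toy_mod$splits[1, "index"]

# In-sample accuracy of the toy tree
toy_pred <- predict(toy_mod, type = "class")
toy_acc  <- mean(as.character(toy_pred) == as.character(toy$inducted))

root_var
root_cut
toy_acc
```

Calling `print(toy_mod)` gives a text rendering of the whole tree, which is a handy fallback when a plot is not available.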

So what does this tell us? First, notice that saves, Cy Young, and Gold Gloves don’t appear in the tree at all. For hitters, the paths to induction are:

1. at least 2993 hits
2. at least 2578 hits and at least 405 home runs
3. at least 2 MVPs

That strikes me as a pretty reasonable depiction of how the voters act! Basically, this means that either you had 3000 hits, or you had at least 400 home runs and a lot of hits, or you won two MVPs, which might mean that you were a dominant player in your prime, but did not have the longevity.
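
The three hitter paths can be written down as a single rule. This is only a simplified sketch: `hitter_path` is a hypothetical helper of mine, it ignores the pitching branches of the tree, and it treats every threshold as “at least”. The first two rows use career totals from the model output shown later in this post; the third player is invented.

```r
# Simplified hitter rule read off the tree (thresholds as quoted in the text)
hitter_path <- function(tH, tHR, mvp) {
  tH >= 2993 | (tH >= 2578 & tHR >= 405) | mvp >= 2
}

examples <- data.frame(
  playerID = c("biggicr01", "marisro01", "hypothetical01"),
  tH  = c(3060, 1325, 2200),
  tHR = c(291, 275, 150),
  mvp = c(0, 2, 0)
)

# TRUE means the rule predicts induction
examples$predicted <- with(examples, hitter_path(tH, tHR, mvp))
examples
```

Biggio qualifies on hits alone and Maris on his two MVPs, while the invented player satisfies none of the three conditions.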

For pitchers, there are only two possible paths:

1. at least 249 wins and at least 2490 strikeouts
2. at least one MVP and at least 1146 strikeouts

This seems overly simplistic to me, and I suspect that as time goes on, saves will start to appear in the tree as more relievers are elected. It’s also surprising that Cy Young awards don’t factor in.

#### Evaluation

Even though our model was very simple, it performed pretty well. In fact, its predictions were correct 94.3% of the time. That sounds very high, but keep in mind that about 90% of the players in our sample didn’t make the Hall of Fame, so we’ve only pushed the needle up so far.

```
candidates <- mutate(candidates, y.hat = predict(mod, type="class"), induct.prob = predict(mod)[,2])
confusion = tally(y.hat ~ inducted, data=candidates, format="count")
confusion
```
```
##      inducted
## y.hat   0   1
##     0 955  49
##     1  12  63
```
```
sum(diag(confusion)) / sum(confusion)
```
```
## [1] 0.9434662
```
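
To put the accuracy in context, it helps to compare it against the trivial model that predicts “not inducted” for everyone. The sketch below re-enters the four counts from the confusion matrix above by hand (the object `confusion_toy` is just my copy of that table) and computes a few standard summaries.

```r
# Counts copied from the confusion matrix above: rows are predictions,
# columns are the actual outcome
confusion_toy <- matrix(
  c(955, 12, 49, 63), nrow = 2,
  dimnames = list(y.hat = c("0", "1"), inducted = c("0", "1"))
)

# Share of all predictions that were correct
accuracy <- sum(diag(confusion_toy)) / sum(confusion_toy)
# Accuracy of always predicting "not inducted"
baseline <- sum(confusion_toy[, "0"]) / sum(confusion_toy)
# Share of actual Hall of Famers that the tree identified
sensitivity <- confusion_toy["1", "1"] / sum(confusion_toy[, "1"])
# Share of predicted Hall of Famers who actually were inducted
precision <- confusion_toy["1", "1"] / sum(confusion_toy["1", ])

c(accuracy = accuracy, baseline = baseline,
  sensitivity = sensitivity, precision = precision)
```

The baseline works out to about 89.6%, so the tree gains roughly five percentage points over guessing “no” every time, and it recovers 63 of the 112 inductees in the sample.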

The confusion matrix above shows that we identified 12 false positives. These are players that our model predicted should be in the Hall of Fame, but they aren’t.

```
candidates %>%
  filter(y.hat == 1 & inducted == 0) %>%
  select(playerID, tH, tHR, mvp, gg, tW, tSO, tSV, cy, inducted, yearsOnBallot, best, induct.prob) %>%
  arrange(desc(induct.prob))
```
```
##     playerID   tH tHR mvp gg  tW  tSO tSV cy inducted yearsOnBallot
## 1  bondsba01 2935 762   7  8   0    0   0  0        0             2
## 2  biggicr01 3060 291   0  4   0    0   0  0        0             2
## 3  palmera01 3020 569   0  3   0    0   0  0        0             4
## 4   rosepe01 4256 160   1  2   0    0   0  0        0             3
## 5  clemero02    0   0   1  0 354 4672   0  7        0             2
## 6  mussimi01    0   0   0  7 270 2813   0  0        0             1
## 7   bluevi01    0   0   1  0 209 2175   2  1        0             4
## 8  gonzaju03 1936 434   2  0   0    0   0  0        0             2
## 9  marisro01 1325 275   2  1   0    0   0  0        0            15
## 10 mclaide01   82   1   1  0 131 1282   2  2        0             3
## 11 murphda05 2111 398   2  5   0    0   0  0        0            15
## 12 newhoha01  201   2   2  0 207 1796  26  0        0            12
##           best induct.prob
## 1  0.362038664   0.9090909
## 2  0.747810858   0.8846154
## 3  0.125654450   0.8846154
## 4  0.095348837   0.8846154
## 5  0.376098418   0.8823529
## 6  0.203152364   0.8823529
## 7  0.087470449   0.7142857
## 8  0.051635112   0.7142857
## 9  0.430913349   0.7142857
## 10 0.006944444   0.7142857
## 11 0.232464930   0.7142857
## 12 0.428176796   0.7142857
```

For most of these players, the explanation as to why they are not in the Hall of Fame is clear. Biggio is going to get in. Maris’s career wasn’t long enough, even though he won back-to-back MVPs. Rose is banned for life. Bonds, Palmeiro, Clemens, and Gonzalez are all tainted by steroids. Denny McLain went to prison. Vida Blue admitted cocaine use. Dale Murphy and Mike Mussina are the more interesting cases, and both have many supporters.

What about the false negatives? These are players who are in the Hall of Fame, but our model predicted that they wouldn’t be.

```
candidates %>%
  filter(y.hat == 0 & inducted == 1) %>%
  select(playerID, tH, tHR, mvp, gg, tW, tSO, tSV, cy, inducted, yearsOnBallot, best, induct.prob) %>%
  arrange(desc(induct.prob))
```

Most of these are middle infielders or catchers. Clearly, we didn’t have a variable in our model to account for this.

Also, there is one typo: Tom Glavine’s `playerID` in the `HallOfFame` table is wrong! It is “glavito01”, but it should be “glavito02”. Otherwise, Tommy Glaviano should be popping champagne!

```
HallOfFame[grep("glavito", HallOfFame$playerID),]
```
```
##       playerID yearID votedBy ballots needed votes inducted category
## 4020 glavito01   2014   BBWAA     571    429   525        Y   Player
##      needed_note
## 4020        <NA>
```
```
filter(candidates, playerID == "glavito01")
```
```
##    playerID numSeasons.x lastSeason.x  tH tHR numSeasons.y lastSeason.y tW
## 1 glavito01            5         1953 259  24            0            0  0
##   tSO tSV mvp gg cy yearsOnBallot inducted      best y.hat induct.prob
## 1   0   0   0  0  0             1        1 0.9194396     0  0.02351624
```
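
One way to work around the typo is to recode the ID before building `inductees`. The sketch below shows the idea on a small stand-in table (`hof_toy` is hypothetical, as is the second player in it) so that it runs on its own; applying the same assignment to the real `HallOfFame` table assumes that `playerID` is stored as a character vector rather than a factor.

```r
# Stand-in for the HallOfFame table: the Glavine row from above plus one
# invented row that must be left untouched
hof_toy <- data.frame(
  playerID = c("glavito01", "otherpl01"),
  yearID   = c(2014, 2014),
  votedBy  = c("BBWAA", "BBWAA"),
  inducted = c("Y", "N"),
  stringsAsFactors = FALSE
)

# Restrict the fix to the 2014 ballot so that no other row with this ID changes
is_glavine <- hof_toy$playerID == "glavito01" & hof_toy$yearID == 2014
hof_toy$playerID[is_glavine] <- "glavito02"

hof_toy
```

After the recode, the merge would pick up Tom Glavine’s pitching totals instead of Tommy Glaviano’s batting line.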