
Hall of Fame Classification Using randomForest

Last week, Ben showed how to put together a classifier for the Hall of Fame using rpart. Interestingly enough, I had planned on doing the same thing without knowledge of Ben’s plan for that post, but with a different package: randomForest. In fact, a co-author and I wrote a paper on this in the Journal of Quantitative Analysis in Sports back in 2011.

I’m going to use some of the basic code from that paper here, but with the data that Ben used last week for comparison’s sake (and so you can get it yourself through the Lahman package). One thing to note here is that I’m not trying to predict the best players, but to identify patterns in the voting process. These are different goals, particularly if we assume the BBWAA doesn’t do a very good job at identifying the “best” players as measured by baseball talent.

First, let’s talk a bit about the relationship between rpart and randomForest. The Random Forest method was developed by Leo Breiman. It is closely related to the classification trees that Ben used last week, but adds some additional machinery (bagging and random variable selection) to help ensure that we don’t over-fit the data. Rather than building a single tree to classify Hall of Fame players, this method builds numerous trees. Each tree is grown on a bootstrap sample of the training data, and at each split only a small random subset of the input variables is considered, with the best split among those variables chosen for that node. The individual trees are grown all the way down to pure nodes in their bootstrap sample. For each tree, the players left out of its bootstrap sample (called the out-of-bag sample) are run down the tree: if a player lands in a pure HOF node, that tree votes Yes; if he lands in a pure non-HOF node, it votes No. The final classification for each player is a “majority vote” across the trees for which he was out-of-bag.
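If you want to see the bagging and out-of-bag ideas in miniature, here is a hand-rolled sketch using rpart and the built-in iris data purely as a stand-in. It skips the per-split variable subsampling that randomForest adds, so it is only an illustration of the resampling and voting logic, not the method used below:

##illustration only: a tiny "forest" of bagged rpart trees with out-of-bag voting
library(rpart)
set.seed(1)
n <- nrow(iris)
oobVotes <- matrix(NA, nrow=n, ncol=25)
for(b in 1:25){
  inBag <- sample(n, n, replace=TRUE)          ##bootstrap sample (bagging)
  tree <- rpart(Species ~ ., data=iris[inBag,])
  oob <- setdiff(1:n, inBag)                   ##out-of-bag rows for this tree
  oobVotes[oob, b] <- as.character(predict(tree, iris[oob,], type="class"))
}
##majority vote across the trees for which each row was out-of-bag
oobPred <- apply(oobVotes, 1, function(v) names(which.max(table(na.omit(v)))))
mean(oobPred != as.character(iris$Species))    ##rough out-of-bag error rate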

The first thing I’ll do is get the data prepared just as Ben did last week. I have re-posted the modified code below, with some changes to include additional variables in our data set. I also include only players that retired in 1950 or later, since All-Star games started in 1933 and the All-Star variable would be biased downward for anyone retiring earlier. I also separate batters from pitchers, since otherwise some trees would only include pitching statistics while others would only include batting statistics. Separating the two groups isn’t as trivial as I expected, and I do so by imposing a minimum IP for pitchers and a minimum AB for batters.

##set working directory
setwd("c:/...")

##load necessary packages
library(mosaic)
library(Lahman)
library(randomForest)

##get statistics, inductee information, awards information, and all-star games and merge for pitchers and batters separately
inductees <-
  HallOfFame %>%
  group_by(playerID) %>%
  filter(votedBy %in% c("BBWAA", "Special Election") & category == "Player") %>%
  summarise(yearsOnBallot = n(), inducted = sum(inducted == "Y"), best = max(votes/ballots)) %>%
  arrange(desc(best))

batting <-
  Batting %>%
  group_by(playerID) %>%
  summarise(numSeasons = length(unique(yearID)), lastSeason = max(yearID), tAB = sum(AB), tH = sum(H), tHR = sum(HR), tR = sum(R), tSB = sum(SB), tRBI = sum(RBI), tBA = round(sum(H)/sum(AB), 3)) %>%
  arrange(desc(tH))

pitching <-
  Pitching %>%
  group_by(playerID) %>%
  summarise(numSeasons = length(unique(yearID)), lastSeason = max(yearID), tIP = round(sum(IPouts)/3, 2), tW = sum(W), tSO = sum(SO), tSV = sum(SV), tERA = round((9*sum(ER))/(sum(IPouts)/3), 3), tWHIP = round((sum(BB)+sum(H))/(sum(IPouts)/3), 3)) %>%
  arrange(desc(tW))

awards <-
  AwardsPlayers %>%
  group_by(playerID) %>%
  summarise(mvp = sum(awardID == "Most Valuable Player"), gg = sum(awardID == "Gold Glove"), cy = sum(awardID == "Cy Young Award"))

allstar <-
  AllstarFull %>%
  group_by(playerID) %>%
  summarise(ASgame = sum(GP))

candidatesBat <- merge(batting, awards, by="playerID", all.x=T)
candidatesBat <- merge(candidatesBat, allstar, by="playerID", all.x=T)
candidatesBat <- merge(candidatesBat, inductees, by="playerID", all.x=T)

candidatesPitch <- merge(pitching, awards, by="playerID", all.x=T)
candidatesPitch <- merge(candidatesPitch, allstar, by="playerID", all.x=T)
candidatesPitch <- merge(candidatesPitch, inductees, by="playerID", all.x=T)

candidatesBat[is.na(candidatesBat)] <- 0
candidatesPitch[is.na(candidatesPitch)] <- 0

##get player names
Pnames <- Master[,c("playerID", "nameLast", "nameFirst")]
candidatesBat <- merge(candidatesBat, Pnames, by="playerID", all.x=T)
candidatesPitch <- merge(candidatesPitch, Pnames, by="playerID", all.x=T)

##subset data into training and test data separately for batters and pitchers using minimums for AB and IP
trainBat <- subset(candidatesBat, candidatesBat$numSeasons >= 10 & candidatesBat$lastSeason < 2009 & candidatesBat$lastSeason > 1949 & candidatesBat$tAB > 2500)
testBat <- subset(candidatesBat, candidatesBat$lastSeason >= 2009 & candidatesBat$tAB > 500)

trainPitch <- subset(candidatesPitch, candidatesPitch$numSeasons >= 10 &  candidatesPitch$lastSeason < 2009 & candidatesPitch$lastSeason > 1949 & candidatesPitch$tIP > 200)
testPitch <- subset(candidatesPitch, candidatesPitch$lastSeason >= 2009 & candidatesPitch$tIP > 50)

head(trainBat)

     playerID numSeasons lastSeason   tAB   tH tHR   tR tSB tRBI   tBA mvp gg cy ASgame yearsOnBallot inducted        best nameLast nameFirst
2   aaronha01         23       1976 12364 3771 755 2174 240 2297 0.305   1  3  0     24             1        1 0.978313253    Aaron      Hank
54  adairje01         13       1970  4019 1022  57  378  29  366 0.254   0  0  0      0             0        0 0.000000000    Adair     Jerry
61  adamsbo03         14       1959  4019 1082  37  591  67  303 0.269   0  0  0      0             1        0 0.003311258    Adams     Bobby
89  adcocjo01         17       1966  6606 1832 336  823  20 1122 0.277   0  0  0      2             0        0 0.000000000   Adcock       Joe
109  ageeto01         12       1973  3912  999 130  558 167  433 0.255   0  2  0      2             1        0 0.000000000     Agee    Tommie
168 alfoned01         12       2006  5385 1532 146  777  53  744 0.284   0  0  0      1             0        0 0.000000000  Alfonzo   Edgardo

head(trainPitch)

    playerID numSeasons lastSeason     tIP tW tSO tSV  tERA tWHIP mvp gg cy ASgame yearsOnBallot inducted      best  nameLast nameFirst
2   aasedo01         13       1990 1109.33 66 641  82 3.797 1.390   0  0  0      1             0        0 0.0000000      Aase       Don
7  abbotgl01         11       1984 1286.00 62 484   0 4.388 1.366   0  0  0      0             0        0 0.0000000    Abbott     Glenn
8  abbotji01         10       1999 1674.00 87 888   0 4.253 1.433   0  0  0      0             1        0 0.0251938    Abbott       Jim
10 abbotpa01         11       2004  720.67 43 496   0 4.920 1.492   0  0  0      0             0        0 0.0000000    Abbott      Paul
14 abernte02         14       1972 1147.67 63 765 148 3.458 1.396   0  0  0      0             0        0 0.0000000 Abernathy       Ted
25 ackerji01         10       1992  904.33 33 482  30 3.971 1.379   0  0  0      0             0        0 0.0000000     Acker       Jim

Now that the data are ready to go, we’ll have to decide a few things. First, we need to identify which variables are worth using to build our random forest. Next, we need to decide how many variables each tree in the forest will consider at each split. Luckily, the randomForest package includes a tuning function that tells us the optimal number of variables. The code below chooses the number of variables that minimizes our out-of-bag error: the rate of incorrect classifications for the observations left out of each tree. I use 1,000 trees in the forest, but that is an arbitrary choice on my part. The tuneRF function can also automatically plot the out-of-bag error for each choice of the number of variables. I don’t post the plots here, but they should pop up in your R window with the following code:

##tune the random forest
set.seed(12345)
mtryBat <- tuneRF(trainBat[,c("tH", "tHR", "tR", "tSB", "tRBI", "tBA", "mvp", "gg", "ASgame")], factor(trainBat[,c("inducted")]), ntreeTry=1000, data=trainBat, plot=T)
mtryBat

      mtry   OOBError
2.OOB    2 0.02963776
3.OOB    3 0.03073546
6.OOB    6 0.03512623

mtryPitch <- tuneRF(trainPitch[,c("tW", "tSO", "tSV", "tERA", "tWHIP", "gg", "cy", "ASgame")], factor(trainPitch[,c("inducted")]), ntreeTry=1000, data=trainPitch, plot=T)
mtryPitch

      mtry   OOBError
1.OOB    1 0.02119205
2.OOB    2 0.01324503
4.OOB    4 0.01854305

As you can see, using two variables at each split (chosen at random by randomForest) is the best choice for both batters and pitchers. Therefore, we pass this as mtry when fitting each forest. Additionally, we want our sampling stratified by Hall of Fame status when building each tree, because the large discrepancy in sample size across the two groups (HOF and non-HOF) could result in some trees including no Hall of Fame players at all. The rest of the options allow us to use the forest to predict future Hall of Fame induction (keep.forest=TRUE), exhibit the importance of each variable in our classification (importance=TRUE), exhibit the importance of each variable in the classification of specific individuals (localImp=TRUE), and visualize the proximity of players based on how often they end up in the same node throughout the entire forest using a multi-dimensional scaling approach (proximity=TRUE).


##fit separate random forests for pitchers and batters
forestBat <- randomForest(factor(inducted) ~ tH + tHR + tR + tSB + tRBI + tBA + mvp + gg + ASgame, data=trainBat,
  ntree=1000, replace=TRUE, strata=factor(trainBat$inducted), keep.forest=TRUE, importance=TRUE, localImp=TRUE, proximity=TRUE, mtry=2)
print(forestBat)

Call:
 randomForest(formula = factor(inducted) ~ tH + tHR + tR + tSB +      tRBI + tBA + mvp + gg + ASgame, data = trainBat, ntree = 1000,      replace = TRUE, strata = factor(trainBat$inducted), keep.forest = TRUE,      importance = TRUE, localImp = TRUE, proximity = TRUE, mtry = 2)
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 2

        OOB estimate of  error rate: 3.29%
Confusion matrix:
    0  1 class.error
0 851 10   0.0116144
1  20 30   0.4000000

forestPitch <- randomForest(factor(inducted) ~ tW + tSO + tSV + tERA + tWHIP + cy + gg + ASgame, data=trainPitch,
  ntree=1000, replace=TRUE, strata=factor(trainPitch$inducted), keep.forest=TRUE, importance=TRUE, localImp=TRUE, proximity=TRUE, mtry=2)
print(forestPitch)

Call:
 randomForest(formula = factor(inducted) ~ tW + tSO + tSV + tERA +      tWHIP + cy + gg + ASgame, data = trainPitch, ntree = 1000,      replace = TRUE, strata = factor(trainPitch$inducted), keep.forest = TRUE,      importance = TRUE, localImp = TRUE, proximity = TRUE, mtry = 2)
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 2

        OOB estimate of  error rate: 1.46%
Confusion matrix:
    0  1 class.error
0 726  3 0.004115226
1   8 18 0.307692308

##get predicted classes and votes for individual players and attach to training data set
trainBat$predict <- forestBat$predicted
trainBat$votes <- forestBat$votes
head(trainBat)

     playerID numSeasons lastSeason   tAB   tH tHR   tR tSB tRBI   tBA mvp gg cy ASgame yearsOnBallot inducted        best nameLast nameFirst predict     votes.0     votes.1
2   aaronha01         23       1976 12364 3771 755 2174 240 2297 0.305   1  3  0     24             1        1 0.978313253    Aaron      Hank       1 0.424242424 0.575757576
54  adairje01         13       1970  4019 1022  57  378  29  366 0.254   0  0  0      0             0        0 0.000000000    Adair     Jerry       0 1.000000000 0.000000000
61  adamsbo03         14       1959  4019 1082  37  591  67  303 0.269   0  0  0      0             1        0 0.003311258    Adams     Bobby       0 1.000000000 0.000000000
89  adcocjo01         17       1966  6606 1832 336  823  20 1122 0.277   0  0  0      2             0        0 0.000000000   Adcock       Joe       0 0.995024876 0.004975124
109  ageeto01         12       1973  3912  999 130  558 167  433 0.255   0  2  0      2             1        0 0.000000000     Agee    Tommie       0 1.000000000 0.000000000
168 alfoned01         12       2006  5385 1532 146  777  53  744 0.284   0  0  0      1             0        0 0.000000000  Alfonzo   Edgardo       0 1.000000000 0.000000000

trainPitch$predict <- forestPitch$predicted
trainPitch$votes <- forestPitch$votes
head(trainPitch)

    playerID numSeasons lastSeason     tIP tW tSO tSV  tERA tWHIP mvp gg cy ASgame yearsOnBallot inducted      best  nameLast nameFirst predict votes.0 votes.1
2   aasedo01         13       1990 1109.33 66 641  82 3.797 1.390   0  0  0      1             0        0 0.0000000      Aase       Don       0       1       0
7  abbotgl01         11       1984 1286.00 62 484   0 4.388 1.366   0  0  0      0             0        0 0.0000000    Abbott     Glenn       0       1       0
8  abbotji01         10       1999 1674.00 87 888   0 4.253 1.433   0  0  0      0             1        0 0.0251938    Abbott       Jim       0       1       0
10 abbotpa01         11       2004  720.67 43 496   0 4.920 1.492   0  0  0      0             0        0 0.0000000    Abbott      Paul       0       1       0
14 abernte02         14       1972 1147.67 63 765 148 3.458 1.396   0  0  0      0             0        0 0.0000000 Abernathy       Ted       0       1       0
25 ackerji01         10       1992  904.33 33 482  30 3.971 1.379   0  0  0      0             0        0 0.0000000     Acker       Jim       0       1       0

Here we have our out-of-bag error rates, as well as our class errors in the confusion matrices for each forest. Note that the error rate for the non-HOF class is much lower than for the HOF class. This is because there are many more observations in the non-HOF group, and many of them are sure non-Hall of Famers. I encourage you to look around within the data to see which players in the Hall of Fame were predicted not to be inducted, and which players outside the Hall of Fame were predicted to be inducted. Also note that some of the vote counts are surprising, which could be a data issue (I’ll get to that later with our predictions of future Hall of Fame players). Notably, we see that Hank Aaron was classified as a HOF player in only about 58% of the trees in our forest.
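For example, here is one quick way to pull out the misclassified players on both sides, using the predicted classes we just attached (the columns in select are just a convenient subset):

##Hall of Famers the forest votes out, and non-members it votes in
subset(trainBat, inducted == 1 & predict == "0", select=c(nameFirst, nameLast, tH, tHR, ASgame))
subset(trainBat, inducted == 0 & predict == "1", select=c(nameFirst, nameLast, tH, tHR, ASgame))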

Now that we’ve built our forests, we can use R to tell us which variables were the most important in classifying players as Hall of Famers. This is useful in shedding some light on the “black box” nature of the randomForest function, and it is one of the nice advantages of this method over some other machine learning techniques. While I won’t get into the calculations of this measure, I again encourage you to look at the background information on the method to understand how things like the Mean Decrease in Accuracy are calculated. Note that in the plots below, dots further to the right indicate variables that are more important for deciding whether or not players should be inducted.
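If you prefer the raw numbers to the plots, the importance() function returns them directly; type=1 gives the Mean Decrease in Accuracy, which is available because we fit with importance=TRUE:

##permutation-based importance (Mean Decrease in Accuracy) for each variable
importance(forestBat, type=1)
importance(forestPitch, type=1)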

##variable importance plot
png(file="varImp.png", height=800, width=600)
par(mfrow=c(2,1))
varImpPlot(forestBat, pch=19, type=1, main="Variable Importance for Batters")
varImpPlot(forestPitch, pch=19, type=1, main="Variable Importance for Pitchers")
dev.off()

[varImp.png: variable importance plots for batters (top) and pitchers (bottom)]

As you can see, All-Star appearances pick up a large portion of the variation in Hall of Fame induction for hitters, while Stolen Bases don’t have much predictive ability. For pitchers, Wins and Strikeouts dominate, whereas Gold Gloves are essentially useless in predicting Hall of Fame induction. Saves aren’t particularly useful either, but we know that racking up lots of Saves is a relatively new phenomenon. This approach might be improved by splitting starting pitchers and relief pitchers when training the forest.

So why are Runs and RBI so important, while Home Runs lag so far behind? In our JQAS paper, my co-author and I posited that this is because a large share of each player’s R and RBI depend on the HR he hit. To remedy this, we created “HR-Independent Runs” and “HR-Independent RBI” (imprecise language, but it gets the point across) by simply subtracting HR from total R and total RBI. I won’t rerun the analysis here (a quick sketch of the adjustment is below), but I can tell you that it seemed to improve the variable importance plot, without affecting the error rate much, making the importance of Home Runs much closer to what we would expect.
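For reference, the adjustment itself is just a couple of lines. The refit below is a sketch of the idea from the paper, not something run for the results in this post, and the variable names tR_noHR and tRBI_noHR are my own:

##HR-independent runs and RBI, then refit with the adjusted variables swapped in
trainBat$tR_noHR <- trainBat$tR - trainBat$tHR
trainBat$tRBI_noHR <- trainBat$tRBI - trainBat$tHR
forestBatAdj <- randomForest(factor(inducted) ~ tH + tHR + tR_noHR + tSB + tRBI_noHR + tBA + mvp + gg + ASgame,
  data=trainBat, ntree=1000, replace=TRUE, strata=factor(trainBat$inducted), importance=TRUE, mtry=2)
varImpPlot(forestBatAdj, pch=19, type=1, main="Variable Importance with HR-Independent R and RBI")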

While it’s nice to see the importance of variables for the classification as a whole, we can also see which variables were the most important factors in any individual player’s classification using the “local” importance matrix. This can be done for the training model or the test data set, but I’m not going to walk through that portion of the code, since this post is already inordinately long. I’ll just note that you can find the local importance information under the following objects.

forestBat$localImp
forestPitch$localImp
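As a quick illustration, each column of the local importance matrix corresponds to a player (in the order of the training data, with one row per variable), so you can pull out a single player’s values like this:

##which variables mattered most for Hank Aaron's classification
aaron <- which(trainBat$playerID == "aaronha01")
sort(forestBat$localImp[, aaron], decreasing=TRUE)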

Finally, there is a nice multidimensional scaling (MDS) approach that we can apply to the random forest proximity output, which visualizes a distance measure between individual players in two dimensions. We can visualize both the current inductees and our predictions for the future. While there is a direct plotting approach using the MDSplot function, it makes it a bit difficult to see which player is which. Therefore, I’ve written some of my own code, which uses the players’ last names in the plot.
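For reference, the built-in version is a one-liner; it colors points by class but doesn’t label individual players, which is why I roll my own plot below:

##built-in proximity plot (requires proximity=TRUE in the forest fit)
MDSplot(forestBat, factor(trainBat$inducted), k=2)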

##create HOF variable for color coding in MDS plot
trainBat$newhof <- trainBat$inducted + 1
trainPitch$newhof <- trainPitch$inducted + 1

##run 3-dimension MDS and plot using player classes
cmdB <- cmdscale(1 - forestBat$proximity, k=3)
trainBat$x <- cmdB[,1]
trainBat$y <- cmdB[,2]
trainBat$z <- cmdB[,3]
cmdP <- cmdscale(1 - forestPitch$proximity, k=3)
trainPitch$x <- cmdP[,1]
trainPitch$y <- cmdP[,2]
trainPitch$z <- cmdP[,3]

png(file="MDSviz.png", height=1000, width=800)
par(mfrow=c(2,1))
plot(trainBat$x, trainBat$y, type="n", xlab="Dim 1", ylab="Dim 2", main="Batters Proximity Visualization")
text(trainBat$x, trainBat$y, trainBat$nameLast, cex=.75, col=trainBat$newhof)
legend(-0.18, 0.4, c("Member", "Non-Member"), fill=c("red", "black"), bty="n", cex=1.2)

plot(trainPitch$x, trainPitch$y, type="n", xlab="Dim 1", ylab="Dim 2", main="Pitchers Proximity Visualization")
text(trainPitch$x, trainPitch$y, trainPitch$nameLast, cex=.75, col=trainPitch$newhof)
legend(0.65, 0.4, c("Member", "Non-Member"), fill=c("red", "black"), bty="n", cex=1.2)
dev.off()

[MDSviz.png: proximity visualizations for batters (top) and pitchers (bottom); Hall of Fame members in red]

Here, you can see the closeness of Jackie Robinson and Roy Campanella, which makes sense given that they both had shorter MLB careers, were pioneers of integration, and were excellent players in their shorter stints. Similarly, we see Mark McGwire and Tim Raines right on the edge of the cluster where most Hall of Fame hitters lie in the visual, alongside some Veterans Committee inductees like Orlando Cepeda. Also note that we see the same Tom Glavine mistake as in Ben’s post (go there to see what happened). And of course, we know why Roger Clemens is in that elite cluster on the right of the pitcher visualization but isn’t in red font.

Finally, perhaps the most fun part of this exercise is predicting future Hall of Fame induction for current or recently retired players. However, remember that this is a naïve model: we don’t predict the career trajectories of current players. We simply ask: if they retired now, and we relax the 10-year minimum requirement, would their statistical output qualify them for the Hall of Fame based on what we’ve seen voters do in the past?

##Predict Future Hall of Fame Induction
##get predicted classes for test data
futureBat <- predict(forestBat, testBat, type="response")
futurePitch <- predict(forestPitch, testPitch, type="response")

##get vote percentages for test data
futureBat.vote <- predict(forestBat, testBat, type="vote", norm.votes=TRUE)
futurePitch.vote <- predict(forestPitch, testPitch, type="vote", norm.votes=TRUE)

##attach predicted class and vote percentages from MODEL 1
testBat$HOF <- as.numeric(futureBat) - 1
testBat$votes <- futureBat.vote[,2]

testPitch$HOF <- as.numeric(futurePitch) - 1
testPitch$votes <- futurePitch.vote[,2]

batterHOF <- subset(testBat[,c(1,18,19,20,21)], testBat$votes > 0.1)
pitcherHOF <- subset(testPitch[,c(1,17,18,19,20)], testPitch$votes > 0.1)

batterHOF[order(-batterHOF$votes),]

       playerID    nameLast nameFirst HOF votes
14896 sheffga01   Sheffield      Gary   1 0.823
13206 pujolal01      Pujols    Albert   1 0.797
13934 rodriiv01   Rodriguez      Ivan   1 0.797
6334  griffke02     Griffey       Ken   1 0.790
13353 ramirma02     Ramirez     Manny   1 0.775
15956 suzukic01      Suzuki    Ichiro   1 0.741
6424  guerrvl01    Guerrero  Vladimir   1 0.729
13914 rodrial01   Rodriguez      Alex   1 0.592
8195  jonesch06       Jones   Chipper   1 0.577
8018  jeterde01       Jeter     Derek   1 0.541
16278 thomeji01       Thome       Jim   0 0.445
16879 vizquom01     Vizquel      Omar   0 0.381
10348 mauerjo01       Mauer       Joe   0 0.369
3971  delgaca01     Delgado    Carlos   0 0.361
2268  cabremi01     Cabrera    Miguel   0 0.354
7034  heltoto01      Helton      Todd   0 0.350
12277 ortizda01       Ortiz     David   0 0.333
3712  damonjo01       Damon    Johnny   0 0.299
5828  giambja01      Giambi     Jason   0 0.277
1055  beltrca01     Beltran    Carlos   0 0.210
7465  hollima01    Holliday      Matt   0 0.199
8847  konerpa01     Konerko      Paul   0 0.187
1718  braunry02       Braun      Ryan   0 0.181
6623  hamiljo03    Hamilton      Josh   0 0.170
8176  jonesan01       Jones    Andruw   0 0.149
5649  garcino01 Garciaparra     Nomar   0 0.148
16899 vottojo01       Votto      Joey   0 0.142
34    abreubo01       Abreu     Bobby   0 0.108
13082 poseybu01       Posey    Buster   0 0.105

pitcherHOF[order(-pitcherHOF$votes),]

      playerID  nameLast nameFirst HOF votes
3951 johnsra05   Johnson     Randy   1 0.733
8204 wagnebi02    Wagner     Billy   1 0.611
4962 martipe02  Martinez     Pedro   1 0.610
5699 nathajo01    Nathan       Joe   1 0.601
3596 hoffmtr01   Hoffman    Trevor   1 0.599
7412 smoltjo01    Smoltz      John   1 0.506
6022 papeljo01  Papelbon  Jonathan   0 0.475
3173 hallaro01  Halladay       Roy   0 0.400
6680 riverma01    Rivera   Mariano   0 0.359
4156 kershcl01   Kershaw   Clayton   0 0.332
7222 shellst01     Shell    Steven   0 0.273
1306 chapmar01   Chapman   Aroldis   0 0.268
6739 rodrifr03 Rodriguez Francisco   0 0.268
4188 kimbrcr01   Kimbrel     Craig   0 0.266
3613 hollagr01   Holland      Greg   0 0.260
42   adamsmi03     Adams      Mike   0 0.255
4706  loupaa01      Loup     Aaron   0 0.254
6750 rodrist02 Rodriguez      Paco   0 0.254
8165 vinceni01   Vincent      Nick   0 0.253
231  avilalu01    Avilan      Luis   0 0.252
2418 fernajo02 Fernandez      Jose   0 0.252
3322 harvema01    Harvey      Matt   0 0.252
6691 roarkta01     Roark    Tanner   0 0.252
7960 torreal01    Torres      Alex   0 0.252
6976 santajo01   Santana     Johan   0 0.242
8036 ueharko01    Uehara      Koji   0 0.240
6793  romose01      Romo    Sergio   0 0.235
3863 janseke01    Jansen    Kenley   0 0.234
6924 saitota01     Saito   Takashi   0 0.234
7442 soriajo01     Soria    Joakim   0 0.232
8570 wilsoju10    Wilson    Justin   0 0.200
1377 cishest01    Cishek     Steve   0 0.190
1552  cookry01      Cook      Ryan   0 0.184
5965 oteroda01     Otero       Dan   0 0.177
1607 cosarja01    Cosart    Jarred   0 0.175
8136 ventejo01   Venters     Jonny   0 0.173
4877 manesse01    Maness      Seth   0 0.172
8795 zieglbr01   Ziegler      Brad   0 0.169
3018  grayso01      Gray     Sonny   0 0.120
3653 hoovejj01    Hoover     J. J.   0 0.120
5857  odayda01     O'Day    Darren   0 0.120
6816 rosentr01 Rosenthal    Trevor   0 0.102

Here we can see, for each player, the proportion of trees that vote for induction. Sheffield, Pujols, Pudge, Griffey, Manny, Ichiro, and Vlad are all easily predicted to be Hall of Famers, with A-Rod, Jones, and Jeter also predicted as inductees. But this approach doesn’t account for steroid suspicions, which would likely be an important omitted variable in predicting BBWAA voting behavior.

A quick note: it’s interesting to see A-Rod so low on this list, and Mariano Rivera listed as a non-inductee. As it turns out, both Alex Rodriguez and Mariano Rivera are recorded as having zero All-Star appearances. So it looks like it will be important to clean the data before drawing any final conclusions here, though I don’t have the time to go back through all the possible omissions.
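You can verify this quickly in the test data (the playerIDs below come straight from the tables above):

##check the All-Star counts for Alex Rodriguez and Mariano Rivera
subset(testBat, playerID == "rodrial01", select=c(nameLast, nameFirst, ASgame))
subset(testPitch, playerID == "riverma01", select=c(nameLast, nameFirst, ASgame))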

Clearly, there are also a lot of names on this list that shouldn’t be getting any votes, like Nick Vincent and other relievers. Our forest is likely producing false positives because of the low ERA and WHIP numbers some relievers put up. Maybe splitting up starters and relievers, and requiring a larger minimum number of innings pitched, would help our predictions here (a rough sketch of the first step is below). But our top six (plus Rivera) all make sense as pitchers above the 50% forest-vote threshold for Hall of Fame classification.
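A first step toward that split could use career games started from the Pitching table. This is only a sketch you could fold into the data prep above; the 50% cutoff and the starter flag are my own choices, not something used in this post:

##flag starters vs. relievers using the share of appearances that were starts
pitchRole <-
  Pitching %>%
  group_by(playerID) %>%
  summarise(tG = sum(G, na.rm=TRUE), tGS = sum(GS, na.rm=TRUE))
pitchRole$starter <- ifelse(pitchRole$tGS >= 0.5*pitchRole$tG, 1, 0)
candidatesPitch <- merge(candidatesPitch, pitchRole, by="playerID", all.x=TRUE)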

Well, that’s it for today. While I’ve shown a number of things that can be done with the randomForest package, there is plenty more we can investigate with these models. For example, we can look at the average tree size in each forest, track the error rates as the forest is built, and look for outliers in our data; a few one-liners for these are below. Lastly, there is the possibility of smoothing the classification choice by modeling the vote percentage with a logistic regression. However, I simply want to highlight the use of the randomForest package here, so I’ll leave further investigation up to you to see if you can improve on the rates presented here.
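For the curious, a few of those follow-ups are quick calls on the objects we already have:

##error rates as trees are added, average tree size, and proximity-based outlier scores
plot(forestBat)
mean(treesize(forestBat))
head(sort(outlier(forestBat), decreasing=TRUE))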