Hall of Fame Classification Using randomForest
Last week, Ben showed how to put together a Hall of Fame classifier using rpart. Interestingly enough, I had planned on doing the same thing, without knowing about Ben's plan for that post, but with a different package: randomForest. In fact, a co-author and I wrote a paper on this for the Journal of Quantitative Analysis in Sports back in 2011.
I'm going to use some of the basic code from that paper here, but with the data Ben used last week for comparison's sake (and so you can get it yourself through the Lahman package). One thing to note: I'm not trying to predict the best players, but to identify the patterns within the voting process. These are different goals, particularly if we assume the BBWAA doesn't do a very good job of identifying the "best" players as measured by baseball talent.
First, let's talk a bit about the relationship between rpart and randomForest. The Random Forest method was developed by Leo Breiman. It is closely related to the classification trees Ben used last week, but adds some extra machinery (bootstrap aggregation, or bagging) to guard against over-fitting the data. Rather than build a single tree to classify Hall of Fame players, the method builds many trees, each grown on a bootstrap sample of the training data, and at each split only a random subset of the input variables is considered when choosing how to divide the node. The individual trees are grown fully, so their terminal nodes are pure in the training data. If a player left out of a given tree's bootstrap sample (the out-of-bag sample for that tree) lands in a pure HOF node, that tree votes Yes for him; if he lands in a pure non-HOF node, the tree votes No. The final classification is a majority vote across all the trees for which the player was out-of-bag.
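To make the voting idea concrete, here is a minimal bagging-style sketch built from rpart trees on a toy two-class data set. It leaves out the per-split variable subsetting that randomForest adds, and the data set and tree count are arbitrary choices for illustration only:

##a hand-rolled bagging sketch (not the randomForest internals)
library(rpart)

set.seed(1)
dat <- iris[iris$Species != "setosa", ]   ##toy two-class problem
dat$Species <- droplevels(dat$Species)

nTrees <- 25
votes <- matrix(NA_character_, nrow = nrow(dat), ncol = nTrees)

for (b in seq_len(nTrees)) {
  inBag  <- sample(nrow(dat), replace = TRUE)     ##bootstrap sample for this tree
  outBag <- setdiff(seq_len(nrow(dat)), inBag)    ##this tree's out-of-bag rows
  fit <- rpart(Species ~ ., data = dat[inBag, ])
  votes[outBag, b] <- as.character(predict(fit, dat[outBag, ], type = "class"))
}

##each row's out-of-bag classification is the majority vote across its trees
oobClass <- apply(votes, 1, function(v) {
  v <- v[!is.na(v)]
  if (length(v) == 0) return(NA_character_)
  names(sort(table(v), decreasing = TRUE))[1]
})
mean(oobClass != as.character(dat$Species), na.rm = TRUE)   ##rough out-of-bag error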
The first thing I'll do is get the data prepared just as Ben did last week. I have re-posted the modified code below, with some changes to add a few extra variables to our data set. I also include only players who retired in 1950 or later, since the All-Star Game started in 1933 and the All-Star variable would be biased downward for anyone retiring earlier. I also separate batters and pitchers, so that the pitcher trees use only pitching statistics and the batter trees use only batting statistics. This isn't as trivial as I expected, and I do it by imposing a minimum IP for pitchers and a minimum AB for batters.
##set working directory
setwd("c:/...")

##load necessary packages
library(mosaic)
library(Lahman)
library(randomForest)

##get statistics, inductee information, awards information, and all-star games
##and merge for pitchers and batters separately
inductees <- HallOfFame %>%
  group_by(playerID) %>%
  filter(votedBy %in% c("BBWAA", "Special Election") & category == "Player") %>%
  summarise(yearsOnBallot = n(), inducted = sum(inducted == "Y"), best = max(votes/ballots)) %>%
  arrange(desc(best))

batting <- Batting %>%
  group_by(playerID) %>%
  summarise(numSeasons = length(unique(yearID)), lastSeason = max(yearID),
            tAB = sum(AB), tH = sum(H), tHR = sum(HR), tR = sum(R), tSB = sum(SB),
            tRBI = sum(RBI), tBA = round(sum(H)/sum(AB), 3)) %>%
  arrange(desc(tH))

pitching <- Pitching %>%
  group_by(playerID) %>%
  summarise(numSeasons = length(unique(yearID)), lastSeason = max(yearID),
            tIP = round(sum(IPouts)/3, 2), tW = sum(W), tSO = sum(SO), tSV = sum(SV),
            tERA = round((9*sum(ER))/(sum(IPouts)/3), 3),
            tWHIP = round((sum(BB)+sum(H))/(sum(IPouts)/3), 3)) %>%
  arrange(desc(tW))

awards <- AwardsPlayers %>%
  group_by(playerID) %>%
  summarise(mvp = sum(awardID == "Most Valuable Player"),
            gg = sum(awardID == "Gold Glove"),
            cy = sum(awardID == "Cy Young Award"))

allstar <- AllstarFull %>%
  group_by(playerID) %>%
  summarise(ASgame = sum(GP))

candidatesBat <- merge(batting, awards, by="playerID", all.x=T)
candidatesBat <- merge(candidatesBat, allstar, by="playerID", all.x=T)
candidatesBat <- merge(candidatesBat, inductees, by="playerID", all.x=T)

candidatesPitch <- merge(pitching, awards, by="playerID", all.x=T)
candidatesPitch <- merge(candidatesPitch, allstar, by="playerID", all.x=T)
candidatesPitch <- merge(candidatesPitch, inductees, by="playerID", all.x=T)

candidatesBat[is.na(candidatesBat)] <- 0
candidatesPitch[is.na(candidatesPitch)] <- 0

##get player names
Pnames <- Master[,c("playerID", "nameLast", "nameFirst")]
candidatesBat <- merge(candidatesBat, Pnames, by="playerID", all.x=T)
candidatesPitch <- merge(candidatesPitch, Pnames, by="playerID", all.x=T)

##subset data into training and test data separately for batters and pitchers
##using minimums for AB and IP
trainBat <- subset(candidatesBat, candidatesBat$numSeasons >= 10 & candidatesBat$lastSeason < 2009 &
                     candidatesBat$lastSeason > 1949 & candidatesBat$tAB > 2500)
testBat <- subset(candidatesBat, candidatesBat$lastSeason >= 2009 & candidatesBat$tAB > 500)

trainPitch <- subset(candidatesPitch, candidatesPitch$numSeasons >= 10 & candidatesPitch$lastSeason < 2009 &
                       candidatesPitch$lastSeason > 1949 & candidatesPitch$tIP > 200)
testPitch <- subset(candidatesPitch, candidatesPitch$lastSeason >= 2009 & candidatesPitch$tIP > 50)

head(trainBat)
     playerID  numSeasons lastSeason   tAB   tH  tHR   tR  tSB tRBI   tBA mvp gg cy ASgame yearsOnBallot inducted        best nameLast  nameFirst
2    aaronha01         23       1976 12364 3771  755 2174  240 2297 0.305   1  3  0     24             1        1 0.978313253 Aaron    Hank
54   adairje01         13       1970  4019 1022   57  378   29  366 0.254   0  0  0      0             0        0 0.000000000 Adair    Jerry
61   adamsbo03         14       1959  4019 1082   37  591   67  303 0.269   0  0  0      0             1        0 0.003311258 Adams    Bobby
89   adcocjo01         17       1966  6606 1832  336  823   20 1122 0.277   0  0  0      2             0        0 0.000000000 Adcock   Joe
109  ageeto01          12       1973  3912  999  130  558  167  433 0.255   0  2  0      2             1        0 0.000000000 Agee     Tommie
168  alfoned01         12       2006  5385 1532  146  777   53  744 0.284   0  0  0      1             0        0 0.000000000 Alfonzo  Edgardo

head(trainPitch)
    playerID   numSeasons lastSeason     tIP  tW tSO tSV  tERA tWHIP mvp gg cy ASgame yearsOnBallot inducted      best nameLast   nameFirst
2   aasedo01           13       1990 1109.33  66 641  82 3.797 1.390   0  0  0      1             0        0 0.0000000 Aase       Don
7   abbotgl01          11       1984 1286.00  62 484   0 4.388 1.366   0  0  0      0             0        0 0.0000000 Abbott     Glenn
8   abbotji01          10       1999 1674.00  87 888   0 4.253 1.433   0  0  0      0             1        0 0.0251938 Abbott     Jim
10  abbotpa01          11       2004  720.67  43 496   0 4.920 1.492   0  0  0      0             0        0 0.0000000 Abbott     Paul
14  abernte02          14       1972 1147.67  63 765 148 3.458 1.396   0  0  0      0             0        0 0.0000000 Abernathy  Ted
25  ackerji01          10       1992  904.33  33 482  30 3.971 1.379   0  0  0      0             0        0 0.0000000 Acker      Jim
Now that the data are ready to go, we have to decide a few things. First, we need to identify which variables are worth using to build our random forest. Next, we need to decide how many candidate variables each split within each tree will consider. Luckily, the randomForest package includes a tuning function, tuneRF, that reports the optimal number of variables to try. The code below chooses the number of variables that minimizes our out-of-bag error: the rate of incorrect classifications for the observations left out of each tree. I use 1,000 trees in each forest, but that is an arbitrary choice on my part. The tuneRF function can also automatically plot the out-of-bag error for each choice of the number of variables. I don't post the plots here, but they should pop up in your R window with the following code:
##tune the random forest
set.seed(12345)

mtryBat <- tuneRF(trainBat[,c("tH", "tHR", "tR", "tSB", "tRBI", "tBA", "mvp", "gg", "ASgame")],
                  factor(trainBat[,c("inducted")]), ntreeTry=1000, data=trainBat, plot=T)
mtryBat
      mtry   OOBError
2.OOB    2 0.02963776
3.OOB    3 0.03073546
6.OOB    6 0.03512623

mtryPitch <- tuneRF(trainPitch[,c("tW", "tSO", "tSV", "tERA", "tWHIP", "gg", "cy", "ASgame")],
                    factor(trainPitch[,c("inducted")]), ntreeTry=1000, data=trainPitch, plot=T)
mtryPitch
      mtry   OOBError
1.OOB    1 0.02119205
2.OOB    2 0.01324503
4.OOB    4 0.01854305
As you can see, trying two variables at each split, chosen at random by randomForest, is the best choice for both batters and pitchers, so we pass this as mtry when fitting each forest. Additionally, we stratify the sampling by Hall of Fame status when building each tree, because the large discrepancy in sample size between the two groups (HOF and non-HOF) could otherwise leave some trees with no Hall of Fame players at all. The remaining arguments let us use the forest to predict future Hall of Fame induction (keep.forest=TRUE), report the importance of each variable in the overall classification (importance=TRUE), report the importance of each variable in the classification of specific individuals (localImp=TRUE), and visualize the proximity of players, based on how often they end up in the same terminal node across the forest, with a multidimensional scaling approach (proximity=TRUE).
##fit separate random forests for pitchers and batters
forestBat <- randomForest(factor(inducted) ~ tH + tHR + tR + tSB + tRBI + tBA + mvp + gg + ASgame,
                          data=trainBat, ntree=1000, replace=TRUE, strata=factor(trainBat$inducted),
                          keep.forest=TRUE, importance=TRUE, localImp=TRUE, proximity=TRUE, mtry=2)
print(forestBat)

Call:
 randomForest(formula = factor(inducted) ~ tH + tHR + tR + tSB + tRBI + tBA + mvp + gg + ASgame,
     data = trainBat, ntree = 1000, replace = TRUE, strata = factor(trainBat$inducted),
     keep.forest = TRUE, importance = TRUE, localImp = TRUE, proximity = TRUE, mtry = 2)
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 2

        OOB estimate of  error rate: 3.29%
Confusion matrix:
    0  1 class.error
0 851 10   0.0116144
1  20 30   0.4000000

forestPitch <- randomForest(factor(inducted) ~ tW + tSO + tSV + tERA + tWHIP + cy + gg + ASgame,
                            data=trainPitch, ntree=1000, replace=TRUE, strata=factor(trainPitch$inducted),
                            keep.forest=TRUE, importance=TRUE, localImp=TRUE, proximity=TRUE, mtry=2)
print(forestPitch)

Call:
 randomForest(formula = factor(inducted) ~ tW + tSO + tSV + tERA + tWHIP + cy + gg + ASgame,
     data = trainPitch, ntree = 1000, replace = TRUE, strata = factor(trainPitch$inducted),
     keep.forest = TRUE, importance = TRUE, localImp = TRUE, proximity = TRUE, mtry = 2)
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 2

        OOB estimate of  error rate: 1.46%
Confusion matrix:
    0  1 class.error
0 726  3 0.004115226
1   8 18 0.307692308

##get predicted classes and votes for individual players and attach to training data set
trainBat$predict <- forestBat$predicted
trainBat$votes <- forestBat$vote
head(trainBat)
     playerID  numSeasons lastSeason   tAB   tH  tHR   tR  tSB tRBI   tBA mvp gg cy ASgame yearsOnBallot inducted        best nameLast nameFirst predict     votes.0     votes.1
2    aaronha01         23       1976 12364 3771  755 2174  240 2297 0.305   1  3  0     24             1        1 0.978313253 Aaron    Hank            1 0.424242424 0.575757576
54   adairje01         13       1970  4019 1022   57  378   29  366 0.254   0  0  0      0             0        0 0.000000000 Adair    Jerry           0 1.000000000 0.000000000
61   adamsbo03         14       1959  4019 1082   37  591   67  303 0.269   0  0  0      0             1        0 0.003311258 Adams    Bobby           0 1.000000000 0.000000000
89   adcocjo01         17       1966  6606 1832  336  823   20 1122 0.277   0  0  0      2             0        0 0.000000000 Adcock   Joe             0 0.995024876 0.004975124
109  ageeto01          12       1973  3912  999  130  558  167  433 0.255   0  2  0      2             1        0 0.000000000 Agee     Tommie          0 1.000000000 0.000000000
168  alfoned01         12       2006  5385 1532  146  777   53  744 0.284   0  0  0      1             0        0 0.000000000 Alfonzo  Edgardo         0 1.000000000 0.000000000

trainPitch$predict <- forestPitch$predicted
trainPitch$votes <- forestPitch$vote
head(trainPitch)
    playerID   numSeasons lastSeason     tIP  tW tSO tSV  tERA tWHIP mvp gg cy ASgame yearsOnBallot inducted      best nameLast   nameFirst predict votes.0 votes.1
2   aasedo01           13       1990 1109.33  66 641  82 3.797 1.390   0  0  0      1             0        0 0.0000000 Aase       Don             0       1       0
7   abbotgl01          11       1984 1286.00  62 484   0 4.388 1.366   0  0  0      0             0        0 0.0000000 Abbott     Glenn           0       1       0
8   abbotji01          10       1999 1674.00  87 888   0 4.253 1.433   0  0  0      0             1        0 0.0251938 Abbott     Jim             0       1       0
10  abbotpa01          11       2004  720.67  43 496   0 4.920 1.492   0  0  0      0             0        0 0.0000000 Abbott     Paul            0       1       0
14  abernte02          14       1972 1147.67  63 765 148 3.458 1.396   0  0  0      0             0        0 0.0000000 Abernathy  Ted             0       1       0
25  ackerji01          10       1992  904.33  33 482  30 3.971 1.379   0  0  0      0             0        0 0.0000000 Acker      Jim             0       1       0
Here we have our out-of-bag error rates, as well as the class errors in the confusion matrix for each forest. Note that the error for the non-HOF class is much lower than the error for the HOF class. This is because there are so many more non-HOF observations, and so many of them are sure non-Hall of Famers. I encourage you to look around within the data to see which players in the Hall of Fame were predicted not to be inducted, and which players outside the Hall were predicted to be inducted. Also note that some of the vote counts are surprising, which could be a data issue (I'll get to that later with our predictions of future Hall of Fame players). Notably, we see that Hank Aaron was classified as a HOF player in only about 58% of the trees in our forest.
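As one way to poke around (a quick sketch, assuming the predict and votes columns attached above), you can pull the disagreements directly:

##Hall of Famers the forest voted against, and non-members it voted for
subset(trainBat, inducted == 1 & predict == 0, select = c(nameFirst, nameLast, votes))
subset(trainBat, inducted == 0 & predict == 1, select = c(nameFirst, nameLast, votes))

The same two lines with trainPitch pull the pitcher-side disagreements.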
Now that we've built our forests, we can ask R which variables were most important in classifying players as Hall of Famers. This is useful for shedding some light on the "black box" nature of the randomForest function, and it's one of the nice advantages of this method over some other machine learning techniques. While I won't get into the calculation of the measure, I again encourage you to read the background material on the method to understand how things like the Mean Decrease in Accuracy are computed. In the plots below, dots further to the right indicate variables that are more important for deciding whether or not players are inducted.
##variable importance plot
png(file="varImp.png", height=800, width=600)
par(mfrow=c(2,1))
varImpPlot(forestBat, pch=19, type=1, main="Variable Importance for Batters")
varImpPlot(forestPitch, pch=19, type=1, main="Variable Importance for Pitchers")
dev.off()
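If you prefer the numbers behind those plots, the importance() extractor from the randomForest package returns the same measures as a matrix (type = 1 corresponds to the Mean Decrease in Accuracy plotted above):

##numeric importance measures behind the plots
importance(forestBat, type=1)
importance(forestPitch, type=1)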
As you can see, All-Star appearances pick up a large portion of the variation in Hall of Fame induction for hitters, while Stolen Bases have little predictive ability. For pitchers, Wins and Strikeouts dominate, whereas Gold Gloves are essentially useless in predicting induction. Saves aren't particularly useful either, but racking up lots of Saves is a relatively new phenomenon; this approach might work better if we split starting pitchers and relief pitchers when training the forest.
So why are Runs and RBI so important, while Home Runs lag so far behind? In our JQAS paper, my co-author and I posited that this is because a large share of each player's R and RBI depends on the HR he hit. To remedy this, we created "HR-Independent Runs" and "HR-Independent RBI" (imprecise language, but it gets the point across) by simply subtracting HR from total R and total RBI. I won't do that here, but I can tell you that it seemed to improve the variable importance plot, without affecting the error rate much, making the importance of Home Runs much closer to what we would expect.
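For anyone who wants to try it, the adjustment is a couple of one-liners on the training and test data; the column names below (tRnoHR, tRBInoHR) are just my own labels, and you would then swap them in for tR and tRBI in the randomForest formula:

##HR-independent versions of Runs and RBI (a sketch; not used in the results above)
trainBat$tRnoHR   <- trainBat$tR   - trainBat$tHR
trainBat$tRBInoHR <- trainBat$tRBI - trainBat$tHR
testBat$tRnoHR    <- testBat$tR    - testBat$tHR
testBat$tRBInoHR  <- testBat$tRBI  - testBat$tHR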
While it's nice to see the importance of variables for the classification as a whole, we can also see which variables were the most important factors for any individual player with the "local" importance matrix. This can be done for the training model or the test data set, but I'm not going to walk through that portion of the code, since this is already an inordinately long post. I'll just note that you can find the local importance information in the following objects.
forestBat$localImp
forestPitch$localImp
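For example, here is a quick sketch of pulling one player's breakdown; this assumes the local importance matrix stores variables in rows and training cases in columns, in the same order as trainBat:

##which variables mattered most for Hank Aaron's classification
aaronCol <- which(trainBat$playerID == "aaronha01")
sort(forestBat$localImp[, aaronCol], decreasing=TRUE)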
Finally, there is a nice multidimensional scaling (MDS) approach we can use with the random forest output, which visualizes a distance measure between individual players in a two-dimensional representation. We can visualize both the current inductees and our predictions for the future. While there is a direct plotting route through the MDSplot function, that makes it a bit difficult to see which player is which, so I've used some of my own code, which labels the plot with the players' last names.
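For reference, the direct version is a one-liner that plots colored points by class rather than names:

##built-in MDS plot of the proximity matrix
MDSplot(forestBat, factor(trainBat$inducted), k=2)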
##create HOF variable for color coding in MDS plot
trainBat$newhof <- trainBat$inducted + 1
trainPitch$newhof <- trainPitch$inducted + 1

##run 3-dimension MDS and plot using player classes
cmdB <- cmdscale(1 - forestBat$proximity, k=3)
trainBat$x <- cmdB[,1]
trainBat$y <- cmdB[,2]
trainBat$z <- cmdB[,3]

cmdP <- cmdscale(1 - forestPitch$proximity, k=3)
trainPitch$x <- cmdP[,1]
trainPitch$y <- cmdP[,2]
trainPitch$z <- cmdP[,3]

png(file="MDSviz.png", height=1000, width=800)
par(mfrow=c(2,1))
plot(trainBat$x, trainBat$y, type="n", xlab="Dim 1", ylab="Dim 2", main="Batters Proximity Visualization")
text(trainBat$x, trainBat$y, trainBat$nameLast, cex=.75, col=trainBat$newhof)
legend(-0.18, 0.4, c("Member", "Non-Member"), fill=c("red", "black"), bty="n", cex=1.2)
plot(trainPitch$x, trainPitch$y, type="n", xlab="Dim 1", ylab="Dim 2", main="Pitchers Proximity Visualization")
text(trainPitch$x, trainPitch$y, trainPitch$nameLast, cex=.75, col=trainPitch$newhof)
legend(0.65, 0.4, c("Member", "Non-Member"), fill=c("red", "black"), bty="n", cex=1.2)
dev.off()
Here you can see the closeness of Jackie Robinson and Roy Campanella, which makes sense given that both had shorter MLB careers, were pioneers of integration, and were excellent players in their shorter stints. Similarly, Mark McGwire and Tim Raines sit right on the edge of the cluster where most Hall of Fame hitters lie, alongside some Veterans Committee inductees like Orlando Cepeda. Also note that we see the same Tom Glavine mistake as in Ben's post (go there to see what happened). And of course, we know why Roger Clemens is in that elite cluster on the right of the pitcher visualization but isn't in red font.
Finally, perhaps the most fun part of this exercise is predicting future Hall of Fame induction for current or recently retired players. Remember, though, that this is a naïve model: we don't predict the career trajectories of current players. We simply ask: if they retired now, and we relax the 10-year minimum requirement, would their statistical output qualify them for the Hall of Fame based on what we've seen voters do in the past?
##Predict Future Hall of Fame Induction
##get predicted classes for test data
futureBat <- predict(forestBat, testBat, type="response")
futurePitch <- predict(forestPitch, testPitch, type="response")

##get vote percentages for test data
futureBat.vote <- predict(forestBat, testBat, type="vote", norm.votes=TRUE)
futurePitch.vote <- predict(forestPitch, testPitch, type="vote", norm.votes=TRUE)

##attach predicted class and vote percentages from MODEL 1
testBat$HOF <- as.numeric(futureBat) - 1
testBat$votes <- futureBat.vote[,2]
testPitch$HOF <- as.numeric(futurePitch) - 1
testPitch$votes <- futurePitch.vote[,2]

batterHOF <- subset(testBat[,c(1,18,19,20,21)], testBat$votes > 0.1)
pitcherHOF <- subset(testPitch[,c(1,17,18,19,20)], testPitch$votes > 0.1)

batterHOF[order(-batterHOF$votes),]
       playerID    nameLast   nameFirst HOF votes
14896 sheffga01   Sheffield        Gary   1 0.823
13206 pujolal01      Pujols      Albert   1 0.797
13934 rodriiv01   Rodriguez        Ivan   1 0.797
6334  griffke02     Griffey         Ken   1 0.790
13353 ramirma02     Ramirez       Manny   1 0.775
15956 suzukic01      Suzuki      Ichiro   1 0.741
6424  guerrvl01    Guerrero    Vladimir   1 0.729
13914 rodrial01   Rodriguez        Alex   1 0.592
8195  jonesch06       Jones     Chipper   1 0.577
8018  jeterde01       Jeter       Derek   1 0.541
16278 thomeji01       Thome         Jim   0 0.445
16879 vizquom01     Vizquel        Omar   0 0.381
10348 mauerjo01       Mauer         Joe   0 0.369
3971  delgaca01     Delgado      Carlos   0 0.361
2268  cabremi01     Cabrera      Miguel   0 0.354
7034  heltoto01      Helton        Todd   0 0.350
12277 ortizda01       Ortiz       David   0 0.333
3712  damonjo01       Damon      Johnny   0 0.299
5828  giambja01      Giambi       Jason   0 0.277
1055  beltrca01     Beltran      Carlos   0 0.210
7465  hollima01    Holliday        Matt   0 0.199
8847  konerpa01     Konerko        Paul   0 0.187
1718  braunry02       Braun        Ryan   0 0.181
6623  hamiljo03    Hamilton        Josh   0 0.170
8176  jonesan01       Jones      Andruw   0 0.149
5649  garcino01 Garciaparra       Nomar   0 0.148
16899 vottojo01       Votto        Joey   0 0.142
34    abreubo01       Abreu       Bobby   0 0.108
13082 poseybu01       Posey      Buster   0 0.105

pitcherHOF[order(-pitcherHOF$votes),]
      playerID   nameLast  nameFirst HOF votes
3951 johnsra05    Johnson      Randy   1 0.733
8204 wagnebi02     Wagner      Billy   1 0.611
4962 martipe02   Martinez      Pedro   1 0.610
5699 nathajo01     Nathan        Joe   1 0.601
3596 hoffmtr01    Hoffman     Trevor   1 0.599
7412 smoltjo01     Smoltz       John   1 0.506
6022 papeljo01   Papelbon   Jonathan   0 0.475
3173 hallaro01   Halladay        Roy   0 0.400
6680 riverma01     Rivera    Mariano   0 0.359
4156 kershcl01    Kershaw    Clayton   0 0.332
7222 shellst01      Shell     Steven   0 0.273
1306 chapmar01    Chapman    Aroldis   0 0.268
6739 rodrifr03  Rodriguez  Francisco   0 0.268
4188 kimbrcr01    Kimbrel      Craig   0 0.266
3613 hollagr01    Holland       Greg   0 0.260
42   adamsmi03      Adams       Mike   0 0.255
4706 loupaa01        Loup      Aaron   0 0.254
6750 rodrist02  Rodriguez       Paco   0 0.254
8165 vinceni01    Vincent       Nick   0 0.253
231  avilalu01     Avilan       Luis   0 0.252
2418 fernajo02  Fernandez       Jose   0 0.252
3322 harvema01     Harvey       Matt   0 0.252
6691 roarkta01      Roark     Tanner   0 0.252
7960 torreal01     Torres       Alex   0 0.252
6976 santajo01    Santana      Johan   0 0.242
8036 ueharko01     Uehara       Koji   0 0.240
6793 romose01        Romo     Sergio   0 0.235
3863 janseke01     Jansen     Kenley   0 0.234
6924 saitota01      Saito    Takashi   0 0.234
7442 soriajo01      Soria     Joakim   0 0.232
8570 wilsoju10     Wilson     Justin   0 0.200
1377 cishest01     Cishek      Steve   0 0.190
1552 cookry01        Cook       Ryan   0 0.184
5965 oteroda01      Otero        Dan   0 0.177
1607 cosarja01     Cosart     Jarred   0 0.175
8136 ventejo01    Venters      Jonny   0 0.173
4877 manesse01     Maness       Seth   0 0.172
8795 zieglbr01    Ziegler       Brad   0 0.169
3018 grayso01        Gray      Sonny   0 0.120
3653 hoovejj01     Hoover      J. J.   0 0.120
5857 odayda01       O'Day     Darren   0 0.120
6816 rosentr01  Rosenthal     Trevor   0 0.102
Here we can see the proportion of trees in which each player is classified as an inductee. Sheffield, Pujols, Pudge, Griffey, Manny, Ichiro, and Vlad are all easily predicted to be Hall of Famers, with A-Rod, Chipper Jones, and Jeter also predicted as inductees. But this approach doesn't account for steroid suspicions, which is likely an important omitted variable when predicting BBWAA voting behavior.
A quick note: it's interesting to see A-Rod so low on this list, and Mariano Rivera listed as a non-inductee. As it turns out, both Alex Rodriguez and Mariano Rivera are recorded as having zero All-Star appearances. So it looks like it will be important to clean the data before drawing any final conclusions here, though I don't have the time to go back through all the possible omissions.
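A hedged sanity check: my guess is that missing GP values in AllstarFull turn sum(GP) into NA, which the NA cleanup above then zeroes out. The na.rm version below is just a suggested fix, and you would need to re-run the merge and subset steps after it:

##inspect the raw All-Star rows for the two suspicious players
subset(AllstarFull, playerID %in% c("rodrial01", "riverma01"))

##if GP is NA for some rows, this version of the summary avoids collapsing the count to 0
allstar <- AllstarFull %>%
  group_by(playerID) %>%
  summarise(ASgame = sum(GP, na.rm = TRUE))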
Clearly, there are also a lot of names on this list that shouldn't be getting any votes, like Nick Vincent and other relievers. The forest is likely producing false positives because of the low ERA and WHIP numbers some relievers put up. Splitting up starters and relievers, and requiring a larger minimum number of innings pitched, would probably help the predictions here. But our top six (plus Rivera) all make sense as being above the 50% forest-vote threshold for classification as Hall of Fame pitchers.
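A rough way to act on the simpler half of that suggestion, as a sketch with arbitrary cutoffs of my own choosing (not thresholds from the analysis above):

##restrict predictions to starter-like workloads or established closers
testPitchSP <- subset(testPitch, tIP >= 1000)
testPitchRP <- subset(testPitch, tIP < 1000 & tSV >= 100)
futureSP.vote <- predict(forestPitch, testPitchSP, type="vote", norm.votes=TRUE)
futureRP.vote <- predict(forestPitch, testPitchRP, type="vote", norm.votes=TRUE)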
Well, that's it for today. While I've shown a number of things that can be done with the randomForest package, there is plenty more to investigate with these models. For example, we can look at the average tree size in each forest, track the error rate as the forest is built, and look for outliers in our data. There is also the possibility of smoothing the classification choice by modeling the vote percentage with a logistic regression. But my goal here was simply to highlight the use of the randomForest package; I'll leave further investigation up to you, to see if you can improve on the rates presented here.
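A few of those follow-ups are one-liners with functions from the randomForest package (a quick sketch; the outlier measure relies on the proximity matrix we computed above):

mean(treesize(forestBat))                         ##average number of terminal nodes per tree
plot(forestBat)                                   ##OOB error rate as trees are added
head(sort(outlier(forestBat), decreasing=TRUE))   ##proximity-based outlyingness scores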