Winning Players and Derek Jeter

MLB is currently giving a season-long tribute to Derek Jeter, who is concluding his career with the Yankees. Among the accolades is the comment that Jeter is a “winning player”.

In every game, one pitcher is credited with a win and another pitcher is given a loss. Suppose we do the same for every position player who starts a game: we say that he “wins” if his team wins the game; otherwise he “loses” the game. Given this criterion, who were the most successful position players during the Jeter seasons? Is Derek Jeter high on this list? Are there any surprising names in the top 10 winning players?

To answer these questions, we use the Retrosheet game log data files and also the Lahman database. I am assuming these files are stored on your computer and I'm letting “gamelog.folder” and “Lahman.folder” be the locations of these folders. (You can substitute your folder names in the following R code.)

The following get.data function will combine the game log files for a vector of seasons. This function creates a variable HomeWin which is equal to 1 if the home team wins and 0 otherwise.

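get.data <- function(Seasons){
  gamelog.folder <- "/Users/albert/Desktop/gamelogs/gamelogs/"
  # The column names live in a separate header file; read it once
  headers <- read.csv(paste(gamelog.folder,
                     "game_log_header.csv", sep=""))
  data <- NULL
  for (year in Seasons){
     # Read one season's game log, e.g. gl1995.txt
     filename <- paste(gamelog.folder, "gl",
                       year, ".txt", sep="")
     d <- read.csv(filename, header=FALSE)
     names(d) <- names(headers)
     data <- rbind(data, d)}
  # Season is the first four characters of the yyyymmdd date
  data$Season <- substr(data$Date, 1, 4)
  # HomeWin is 1 if the home team outscored the visitors, 0 otherwise
  data$HomeWin <- with(data,
                  ifelse(HomeRunsScore > VisitorRunsScored, 1, 0))
  data
}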
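To see what the last two steps of get.data do, here is a small sketch on a made-up data frame. The two rows and their values are invented for illustration; only the column names are taken from the function above. Note that, coded this way, a tied game (if any appear in the logs) would get HomeWin equal to 0.

```r
# Made-up two-game log using the same column names as get.data
toy.log <- data.frame(Date = c(19950426, 20130929),
                      VisitorRunsScored = c(3, 5),
                      HomeRunsScore = c(4, 1))

# Season: first four characters of the yyyymmdd date
toy.log$Season <- substr(toy.log$Date, 1, 4)

# HomeWin: 1 if the home team scored more runs, 0 otherwise
toy.log$HomeWin <- with(toy.log,
                   ifelse(HomeRunsScore > VisitorRunsScored, 1, 0))
print(toy.log)
```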

I use this function to get the game logs for the Jeter years, 1995 to 2013:

D <- get.data(1995:2013)

For a given player (specified by his index in a vector of player ids), the win.loss.record function returns the player's number of wins and losses. The function relies on the data frame D created above and on the vector retroid.all that is defined below. Note that I use the variables VisitorBatting1PlayerID, …, and HomeBatting1PlayerID, …, which give the player codes for the players starting in batting positions 1 through 9.

win.loss.record <- function(jj){
  require(dplyr)
  # Retrosheet id of the jj-th player
  pid <- retroid.all[jj]
  # Games he started for the visiting team: he wins when the home team loses
  win.visitor <- 1 - filter(D, VisitorBatting1PlayerID==pid |
                            VisitorBatting2PlayerID==pid |
                            VisitorBatting3PlayerID==pid |
                            VisitorBatting4PlayerID==pid |
                            VisitorBatting5PlayerID==pid |
                            VisitorBatting6PlayerID==pid |
                            VisitorBatting7PlayerID==pid |
                            VisitorBatting8PlayerID==pid |
                            VisitorBatting9PlayerID==pid)$HomeWin
  # Games he started for the home team: he wins when the home team wins
  win.home <- filter(D, HomeBatting1PlayerID==pid |
                     HomeBatting2PlayerID==pid |
                     HomeBatting3PlayerID==pid |
                     HomeBatting4PlayerID==pid |
                     HomeBatting5PlayerID==pid |
                     HomeBatting6PlayerID==pid |
                     HomeBatting7PlayerID==pid |
                     HomeBatting8PlayerID==pid |
                     HomeBatting9PlayerID==pid)$HomeWin
  # Return c(wins, losses)
  W <- c(win.visitor, win.home)
  c(sum(W==1), sum(W==0))
}
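The function above scans the whole game log once per player. As a rough sketch of an alternative (not the method used in this post), one could stack the lineup columns and tally all players in a single pass. The toy game log below is invented and has only two batting slots per side; the column names follow the pattern used above, and the object names toy, starters, and toy.record are my own for illustration.

```r
# Invented three-game log with two lineup slots per side
toy <- data.frame(
  VisitorBatting1PlayerID = c("a", "c", "a"),
  VisitorBatting2PlayerID = c("b", "d", "d"),
  HomeBatting1PlayerID    = c("c", "a", "b"),
  HomeBatting2PlayerID    = c("d", "b", "c"),
  HomeWin                 = c(1, 0, 1),
  stringsAsFactors = FALSE
)

# Find the lineup columns by name pattern
visitor.cols <- grep("^VisitorBatting[1-9]PlayerID$", names(toy), value = TRUE)
home.cols    <- grep("^HomeBatting[1-9]PlayerID$", names(toy), value = TRUE)

# One row per starter: player id and whether that starter's team won.
# unlist() stacks the columns one after another, so the game outcome
# vector is repeated once per lineup column.
starters <- rbind(
  data.frame(pid = unlist(toy[visitor.cols], use.names = FALSE),
             win = rep(1 - toy$HomeWin, times = length(visitor.cols))),
  data.frame(pid = unlist(toy[home.cols], use.names = FALSE),
             win = rep(toy$HomeWin, times = length(home.cols)))
)

# Tally wins and losses for every player at once
W <- tapply(starters$win, starters$pid, sum)
L <- tapply(1 - starters$win, starters$pid, sum)
toy.record <- data.frame(Player = names(W),
                         W = as.vector(W), L = as.vector(L))
print(toy.record)
```

In this toy log, player "c" starts on the winning side in all three games and player "a" on the losing side in all three, which gives a quick check that the visitor and home outcomes are being assigned the right way around.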

From the Lahman files, I create a list of the Retrosheet playerids for all batters from 1995 to 2013 (the Jeter years). This vector of playerids is stored in the variable retroid.all.

Lahman.folder <- "/Users/albert/Desktop/lahman-csv_2013-12-10/"
Batting <- read.csv(paste(Lahman.folder, "Batting.csv", sep=""))
Master <- read.csv(paste(Lahman.folder, "Master.csv", sep=""))
library(dplyr)
b <- filter(Batting, yearID >= 1995)
playerid.all <- sort(unique(as.character(b$playerID)))
retroid.all <- as.character(filter(Master, playerID %in% 
                                     playerid.all)$retroID)

By use of the sapply function, I find the win/loss records for all players in the list and store the output in the file win.results.Rdata. (This will take several minutes to execute.)

S <- sapply(1:length(retroid.all), win.loss.record)
save(S, file="win.results.Rdata")
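Since each call returns a vector of length two, sapply collects the results into a matrix with two rows (wins, then losses) and one column per player. Here is a small sketch of that shape using a stand-in function; toy.record and S.toy are made-up names and the numbers are not real records.

```r
# Stand-in for win.loss.record: returns c(wins, losses) for index j
toy.record <- function(j) c(j, 10 - j)

# sapply binds the length-2 results column by column
S.toy <- sapply(1:3, toy.record)
print(dim(S.toy))   # rows = (wins, losses), columns = players
print(S.toy[1, ])   # wins for each player
print(S.toy[2, ])   # losses for each player
```

Because the results are saved, a later session can call load("win.results.Rdata") to restore S instead of repeating the several-minute computation.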

I create a data frame win.results, adding Player, W, L, and N (sample size) variables.

win.results <- data.frame(Player=retroid.all, W=S[1, ], L=S[2, ])
win.results$N <- with(win.results, W + L)

We focus on players with at least one win or loss, and for these players, we compute a winning percentage Win.Pct.

win.results2 <- filter(win.results, N > 0)
win.results2$Win.Pct <- with(win.results2, 100 * W / N)

Limiting our search to players with at least 500 W/L decisions removes the pitchers from the data frame. We then merge the result with columns from the Master data frame so we can see first and last names.

win.results3 <- filter(win.results2, N >= 500)

win.results3 <- merge(Master[, c("retroID", "nameFirst", "nameLast")],
                     win.results3, by.x="retroID", by.y="Player")
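For readers who have not used merge with differently named key columns, here is a small sketch with invented ids, names, and records (master.toy and results.toy are hypothetical stand-ins for Master and win.results3). By default merge keeps only the ids found in both data frames and sorts the result by the key.

```r
# Invented stand-ins for the Master and win.results3 data frames
master.toy <- data.frame(retroID   = c("x001", "y001", "z001"),
                         nameFirst = c("Ann", "Bob", "Cy"),
                         nameLast  = c("Xu", "Yo", "Zed"))
results.toy <- data.frame(Player = c("z001", "x001"),
                          W = c(300, 280),
                          L = c(250, 270))

# The key is called retroID on the left and Player on the right
merged.toy <- merge(master.toy, results.toy,
                    by.x = "retroID", by.y = "Player")
print(merged.toy)
```

The id "y001" has no W/L record, so it drops out of the merged data frame.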

We construct a histogram of the player winning percentages. As expected, the distribution is centered near 50%, with the extremes falling close to 40% and 60%.

library(MASS)
truehist(win.results3$Win.Pct,
         xlab="Winning Percentage",
         main="Winning Pct for All Players in Jeter Years with 500+ Games")

[Figure: histogram of winning percentages for all players in the Jeter years with 500+ games]

By use of the arrange function in the dplyr package, we sort the data frame by Winning Percentage. We first list the top 10 players in the Jeter era.

win.results3 <- arrange(win.results3,  desc(Win.Pct))
win.results3[1:10, -1]
##    nameFirst nameLast    W    L    N Win.Pct
## 1      Julio   Franco  316  205  521   60.65
## 2     Bernie Williams  980  636 1616   60.64
## 3      Jorge   Posada  988  654 1642   60.17
## 4       Paul  O'Neill  592  392  984   60.16
## 5      Jason  Varitek  829  550 1379   60.12
## 6      Derek    Jeter 1552 1038 2590   59.92
## 7      Brett  Gardner  299  204  503   59.44
## 8      David  Justice  534  375  909   58.75
## 9    Chipper    Jones 1403  996 2399   58.48
## 10  Robinson     Cano  788  565 1353   58.24

Jeter appears as number 6 on the top 10 list, together with his Yankee teammates Bernie Williams, Jorge Posada, Paul O'Neill, Brett Gardner, David Justice, and Robinson Cano. The interesting players are the non-Yankees: Julio Franco (1), Jason Varitek (5), and Chipper Jones (9).

Some may be interested in the bottom 10 position players during the Jeter years.

win.results3[548:557, -1]
##     nameFirst  nameLast   W   L    N Win.Pct
## 548     Ronny    Cedeno 291 401  692   42.05
## 549     Gregg Jefferies 235 324  559   42.04
## 550      Mike   Sweeney 566 782 1348   41.99
## 551     Angel    Berroa 289 400  689   41.94
## 552   Starlin    Castro 250 349  599   41.74
## 553     Bobby Higginson 535 760 1295   41.31
## 554      Emil     Brown 224 336  560   40.00
## 555      Ryan    Doumit 298 448  746   39.95
## 556     David   DeJesus 472 711 1183   39.90
## 557     Shane    Halter 193 309  502   38.45

By the way, Shane Halter played for the Royals, Tigers, Mets, and Angels. I don’t think we see any Yankees on this bottom 10 list.
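The row numbers 548 to 557 used above depend on how many players happen to pass the 500-decision cutoff. As a sketch of a way to avoid hard-coding them, the tail function returns the last rows of a sorted data frame whatever its length. The data frame below is made up for illustration.

```r
# Made-up results, sorted from highest to lowest winning percentage
toy <- data.frame(nameLast = c("A", "B", "C", "D", "E"),
                  Win.Pct  = c(60, 55, 50, 45, 40))
toy <- toy[order(toy$Win.Pct, decreasing = TRUE), ]

# Last two rows, regardless of how many rows there are
bottom2 <- tail(toy, 2)
print(bottom2)
```

With the real data, tail(win.results3, 10) would presumably list the same ten players as the indexed call.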

I guess the message of this post is that “winning” players play on winning teams. One thing that we don’t understand very well about baseball and other sports is the role of teamwork (team chemistry?) in winning. If we pose the right questions, I believe we have the data to draw some interesting conclusions about teamwork.
