MLB is currently giving a season-long tribute to Derek Jeter, who is concluding his career with the Yankees. Among Jeter's accolades is the claim that he is a “winning player”.

In every game, one pitcher is credited with a win and another pitcher is given a loss. Suppose we do the same for every position player who starts a game. We say that he “wins” if his team wins the game; otherwise he “loses” the game. Given this criterion, who were the most successful position players during the Jeter seasons? Is Derek Jeter high on this list? Are there any surprising names in the top 10 winning players?

To answer these questions, we use the Retrosheet game log data files and also the Lahman database. I am assuming these files are stored on your computer, with `gamelog.folder` and `Lahman.folder` giving the locations of the two folders. (You can substitute your folder names in the following R code.)

The following `get.data` function will combine the game log files for a vector of seasons. This function creates a variable `HomeWin` which is equal to 1 if the home team wins and 0 otherwise.

    get.data <- function(Seasons){
      data <- NULL
      # Folder holding the Retrosheet game logs (substitute your own path)
      gamelog.folder <- "/Users/albert/Desktop/gamelogs/gamelogs/"
      for (year in Seasons){
        filename <- paste(gamelog.folder, "gl", year, ".txt", sep="")
        d <- read.csv(filename, header=FALSE)
        # Column names come from the separate header file
        headers <- read.csv(paste(gamelog.folder, "game_log_header.csv", sep=""))
        names(d) <- names(headers)
        data <- rbind(data, d)
      }
      data$Season <- substr(data$Date, 1, 4)
      # HomeWin is 1 if the home team outscored the visitors, 0 otherwise
      data$HomeWin <- with(data, ifelse(HomeRunsScore > VisitorRunsScored, 1, 0))
      data
    }

I use this function to get the game logs for the Jeter years, 1995 to 2013.

    D <- get.data(1995:2013)

For a given player (index of a vector of player ids), the `win.loss.record` function will record the number of wins and losses for the player. Note that I use variables `VisitorBatting1PlayerID`, …, and `HomeBatting1PlayerID`, … which give the player codes for the players starting in batting positions 1 through 9.

    win.loss.record <- function(jj){
      pid <- retroid.all[jj]
      require(dplyr)
      # Games started for the visiting team: a win when the home team does not win
      win.visitor <- 1 - filter(D,
        VisitorBatting1PlayerID==pid | VisitorBatting2PlayerID==pid |
        VisitorBatting3PlayerID==pid | VisitorBatting4PlayerID==pid |
        VisitorBatting5PlayerID==pid | VisitorBatting6PlayerID==pid |
        VisitorBatting7PlayerID==pid | VisitorBatting8PlayerID==pid |
        VisitorBatting9PlayerID==pid)$HomeWin
      # Games started for the home team: a win when the home team wins
      win.home <- filter(D,
        HomeBatting1PlayerID==pid | HomeBatting2PlayerID==pid |
        HomeBatting3PlayerID==pid | HomeBatting4PlayerID==pid |
        HomeBatting5PlayerID==pid | HomeBatting6PlayerID==pid |
        HomeBatting7PlayerID==pid | HomeBatting8PlayerID==pid |
        HomeBatting9PlayerID==pid)$HomeWin
      W <- c(win.visitor, win.home)
      # Return (number of wins, number of losses)
      c(sum(W==1), sum(W==0))
    }
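To see the bookkeeping on a small scale, here is a sketch that applies the same rule to a made-up game log. Everything in it is hypothetical: the data frame `toy`, the helper `toy.record`, the player ids, and the fact that each side has only two lineup slots (real game logs have nine). It uses base R only, so it is not the function above, just an illustration of the same idea.

```r
# Toy game log: three games, two lineup slots per side (an assumption
# made to keep the example short; Retrosheet logs have nine slots).
toy <- data.frame(
  VisitorBatting1PlayerID = c("aaa001", "ccc001", "aaa001"),
  VisitorBatting2PlayerID = c("bbb001", "ddd001", "bbb001"),
  HomeBatting1PlayerID    = c("ccc001", "aaa001", "ddd001"),
  HomeBatting2PlayerID    = c("ddd001", "bbb001", "ccc001"),
  HomeWin                 = c(1, 1, 0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: wins and losses for one player id.
toy.record <- function(pid, games) {
  visitor.cols <- grep("^VisitorBatting[1-9]PlayerID$", names(games), value = TRUE)
  home.cols    <- grep("^HomeBatting[1-9]PlayerID$", names(games), value = TRUE)
  # TRUE for each game in which the player started on that side
  started.visitor <- rowSums(games[, visitor.cols, drop = FALSE] == pid) > 0
  started.home    <- rowSums(games[, home.cols, drop = FALSE] == pid) > 0
  # A visitor is credited with a win when HomeWin is 0; a home starter when it is 1
  W <- c(1 - games$HomeWin[started.visitor], games$HomeWin[started.home])
  c(W = sum(W == 1), L = sum(W == 0))
}

toy.record("aaa001", toy)
```

Player `aaa001` starts as a visitor in games 1 and 3 and at home in game 2, so he is credited with two wins (games 2 and 3) and one loss (game 1).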

From the Lahman files, I create a list of the Retrosheet playerids for all batters from 1995 to 2013 (the Jeter years). This vector of playerids is stored in the variable `retroid.all`.

    Lahman.folder <- "/Users/albert/Desktop/lahman-csv_2013-12-10/"
    Batting <- read.csv(paste(Lahman.folder, "Batting.csv", sep=""))
    Master <- read.csv(paste(Lahman.folder, "Master.csv", sep=""))
    library(dplyr)
    b <- filter(Batting, yearID >= 1995)
    playerid.all <- sort(unique(as.character(b$playerID)))
    retroid.all <- as.character(filter(Master, playerID %in% playerid.all)$retroID)

By use of the `sapply` function, I find win/loss records for all players in the list and store the output in the file `win.results.Rdata`. (This will take several minutes to execute.)

    S <- sapply(1:length(retroid.all), win.loss.record)
    save(S, file="win.results.Rdata")
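The player-by-player loop is slow because it scans the whole game log once for every player. If the run time is a nuisance, one possible alternative is to stack the 18 lineup columns into a single long vector and tally wins per player in one pass. The sketch below is my own assumption about how that could look, not the method used in this post: `all.records` and the small `demo` data frame are hypothetical names, and the function reports every id that appears in a starting lineup rather than only the ids in `retroid.all`.

```r
# Hypothetical helper: win/loss records for every starter in one pass.
# Assumes `games` has the 18 lineup id columns and a 0/1 HomeWin column.
all.records <- function(games) {
  visitor.cols <- paste0("VisitorBatting", 1:9, "PlayerID")
  home.cols    <- paste0("HomeBatting", 1:9, "PlayerID")
  # Stack the columns: slot 1 for all games, then slot 2, and so on
  visitor.ids <- unlist(lapply(games[visitor.cols], as.character), use.names = FALSE)
  home.ids    <- unlist(lapply(games[home.cols], as.character), use.names = FALSE)
  long <- data.frame(
    Player = c(visitor.ids, home.ids),
    # Repeat the game result once per lineup slot so it lines up with the ids
    Win = c(rep(1 - games$HomeWin, times = 9), rep(games$HomeWin, times = 9)),
    stringsAsFactors = FALSE
  )
  W <- tapply(long$Win, long$Player, sum)
  N <- tapply(long$Win, long$Player, length)
  data.frame(Player = names(W), W = as.vector(W), L = as.vector(N - W),
             stringsAsFactors = FALSE)
}

# Made-up two-game log: the same nine starters on each side in both games,
# with the home team winning the first game and losing the second.
demo <- data.frame(HomeWin = c(1, 0))
for (i in 1:9) {
  demo[[paste0("VisitorBatting", i, "PlayerID")]] <- paste0("v", i)
  demo[[paste0("HomeBatting", i, "PlayerID")]] <- paste0("h", i)
}
rec <- all.records(demo)
```

In the demo every starter finishes with one win and one loss, since each side wins one of the two games. On the real data one would call `all.records(D)` and then keep only the rows whose `Player` is in `retroid.all`.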

I create a data frame `win.results`, adding `Player`, `W`, `L`, and `N` (sample size) variables.

    win.results <- data.frame(Player=retroid.all, W=S[1, ], L=S[2, ])
    win.results$N <- with(win.results, W + L)

We focus on players with at least one win or loss, and for these players, we compute a winning percentage `Win.Pct`.

    win.results2 <- filter(win.results, N > 0)
    win.results2$Win.Pct <- with(win.results2, 100 * W / N)

Limiting our search to players with at least 500 W/L decisions removes the pitchers from the data frame. The data frame is merged with columns from the `Master` data frame so we can see first and last names.

    win.results3 <- filter(win.results2, N >= 500)
    win.results3 <- merge(Master[, c("retroID", "nameFirst", "nameLast")],
                          win.results3, by.x="retroID", by.y="Player")

We construct a histogram of the player winning percentages. As expected, it is centered about 50%, with the extremes near 40% and 60%.

    library(MASS)
    truehist(win.results3$Win.Pct, xlab="Winning Percentage",
             main="Winning Pct for All Players in Jeter Years with 500+ Games")

By use of the `arrange` function in the `dplyr` package, we sort the data frame by Winning Percentage. We first list the top 10 players in the Jeter era.

    win.results3 <- arrange(win.results3, desc(Win.Pct))
    win.results3[1:10, -1]

    ##    nameFirst nameLast    W    L    N Win.Pct
    ## 1      Julio   Franco  316  205  521   60.65
    ## 2     Bernie Williams  980  636 1616   60.64
    ## 3      Jorge   Posada  988  654 1642   60.17
    ## 4       Paul  O'Neill  592  392  984   60.16
    ## 5      Jason  Varitek  829  550 1379   60.12
    ## 6      Derek    Jeter 1552 1038 2590   59.92
    ## 7      Brett  Gardner  299  204  503   59.44
    ## 8      David  Justice  534  375  909   58.75
    ## 9    Chipper    Jones 1403  996 2399   58.48
    ## 10  Robinson     Cano  788  565 1353   58.24

Jeter appears as number 6 on the top 10 list, together with his Yankee teammates Bernie Williams, Jorge Posada, Paul O'Neill, Brett Gardner, David Justice, and Robinson Cano. The interesting players are the non-Yankees: Julio Franco (1), Jason Varitek (5), and Chipper Jones (9).

Some may be interested in the bottom 10 position players during the Jeter years.

    win.results3[548:557, -1]

    ##     nameFirst  nameLast   W   L    N Win.Pct
    ## 548     Ronny    Cedeno 291 401  692   42.05
    ## 549     Gregg Jefferies 235 324  559   42.04
    ## 550      Mike   Sweeney 566 782 1348   41.99
    ## 551     Angel    Berroa 289 400  689   41.94
    ## 552   Starlin    Castro 250 349  599   41.74
    ## 553     Bobby Higginson 535 760 1295   41.31
    ## 554      Emil     Brown 224 336  560   40.00
    ## 555      Ryan    Doumit 298 448  746   39.95
    ## 556     David   DeJesus 472 711 1183   39.90
    ## 557     Shane    Halter 193 309  502   38.45

By the way, Shane Halter played for the Royals, Tigers, Mets, and Angels. I don’t think we see any Yankees on this bottom 10 list.

I guess the message of this post is that “winning” players play on winning teams. One thing that we don’t understand about baseball and other sports very well is the role of teamwork (team chemistry?) in winning. If we pose the right questions, I believe we have the data to draw some interesting conclusions about teamwork.