Team Streaks, Part I

The media and baseball fans are fascinated by team winning and losing streaks, and by the significance of streaks that seem long. One of the subplots in the movie Moneyball was the 20-game winning streak of the Oakland Athletics in the 2002 season. In this post, we will use R to download all of the game logs for a particular season and write an R function to extract all of the winning and losing streaks of a particular team. By looking at the lengths of streaks for all teams in the 2002 season, we can see whether Oakland’s streak was unusual. All of the R code and functions can be found on my GitHub gist site here.

The following function load.gamelog will read in the Retrosheet game log file for a particular season. The inputs are the season and the vector of names of the variables.

load.gamelog <- function(season, headers){
  # Download the zipped game log for the season, e.g. gl2002.zip
  download.file(
    url = paste("http://www.retrosheet.org/gamelogs/gl", season, ".zip"
              , sep="")
    , destfile = paste("gl", season, ".zip", sep="")
  )
  unzip(paste("gl", season, ".zip", sep=""))
  # The extracted file has no header row, so names are attached afterwards
  gamelog <- read.table(paste("gl", season, ".txt", sep="")
                        , sep=",", stringsAsFactors=FALSE)
  names(gamelog) <- headers
  # Remove the downloaded archive and the extracted text file
  file.remove(paste("gl", season, ".zip", sep=""))
  file.remove(paste("gl", season, ".txt", sep=""))
  gamelog
}

The file “headerinfo.R”, which is included in the GitHub gist, creates a vector Headers containing the variable names. We use the load.gamelog function to read in the game logs for the 2002 season. Note that these game logs are stored in the data frame gl2002.

source("headerinfo.R")
gl2002 <- load.gamelog(2002, Headers)

The function find.team.streaks finds the length of all winning and losing streaks for a specific team for a particular season.

find.team.streaks <- function(team, data){
  # streaks() returns the lengths of all runs of 1's in a 0/1 vector y
  streaks <- function(y){
    n <- length(y)
    # Pad with a 0 on each end so every run of 1's lies between two 0's
    where <- c(0, y, 0) == 0
    location.zeros <- (0 : (n + 1))[where]
    # The gap between consecutive 0's, minus 1, is the run length
    streak.lengths <- diff(location.zeros) - 1
    streak.lengths[streak.lengths > 0]
  }
  # Home games: a win if the home team outscored the visitor
  home <- subset(data, HomeTeam == team)
  home$GameNumber <- home$HomeTeamGameNumber
  home$Win <- with(home,
                  ifelse(HomeRunsScore > VisitorRunsScored, 1, 0))
  # Road games: a win if the visitor outscored the home team
  visiting <- subset(data, VisitingTeam == team)
  visiting$GameNumber <- visiting$VisitingTeamGameNumber
  visiting$Win <- with(visiting,
                  ifelse(HomeRunsScore < VisitorRunsScored, 1, 0))
  # Combine and sort by the team's game number to get the season in order
  streak.data <- rbind(home, visiting)
  streak.data <- streak.data[order(streak.data$GameNumber), ]
  winning.streaks <- streaks(streak.data$Win)
  losing.streaks <- streaks(1 - streak.data$Win)
  list(Winning = winning.streaks, Losing = losing.streaks)
}
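To see how the inner streaks helper works, it can be copied out of find.team.streaks and run on a short made-up sequence of wins and losses. The sketch below does that and cross-checks the result against base R's rle function. The 10-game vector y is hypothetical, not taken from the game logs.

```r
# The inner helper from find.team.streaks, copied out so it can run on its own.
streaks <- function(y){
  n <- length(y)
  where <- c(0, y, 0) == 0
  location.zeros <- (0 : (n + 1))[where]
  streak.lengths <- diff(location.zeros) - 1
  streak.lengths[streak.lengths > 0]
}

# Hypothetical 10-game sequence: 1 = win, 0 = loss.
y <- c(1, 1, 0, 1, 0, 0, 1, 1, 1, 0)

win.streaks <- streaks(y)        # runs of 1's: 2, 1, 3
loss.streaks <- streaks(1 - y)   # runs of 0's: 1, 2, 1

# Cross-check with run-length encoding from base R.
r <- rle(y)
rle.wins <- r$lengths[r$values == 1]
rle.losses <- r$lengths[r$values == 0]

print(win.streaks)
print(loss.streaks)
```

Padding the vector with a 0 on each end is what lets a streak that starts in the first game or ends in the last game be counted like any other.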

We use the find.team.streaks function to find the streak lengths of Oakland (team abbreviation “OAK”) and Philadelphia (team abbreviation “PHI”) for the 2002 season.

find.team.streaks("OAK", gl2002)
## $Winning
##  [1]  3  3  2  1  2  4  1  1  1  1  1  4  3  1  8  8  1  1  1  2  4  2  3
## [24]  1  1  2  4  2 20  3  2  1  5  4
##
## $Losing
##  [1] 2 2 3 1 2 1 3 4 3 4 1 2 1 1 1 3 1 1 1 1 1 1 1 4 2 1 1 2 1 3 1 1 2
find.team.streaks("PHI", gl2002)
## $Winning
##  [1] 1 1 1 2 1 1 1 1 7 1 3 1 2 4 1 1 1 1 2 2 2 1 1 2 3 4 1 3 2 5 6 4 1 2 2
## [36] 1 4 1
##
## $Losing
##  [1] 1 1 1 1 3 1 4 6 1 1 6 4 2 2 1 1 2 1 2 1 1 1 3 1 1 5 1 1 2 4 3 1 3 6 1
## [36] 1 1 2 1

We see that Oakland, besides the 20-game winning streak, also had two 8-game winning streaks, and its longest losing streak was only 4 games. The Phillies’ longest winning streak was 7 games, and they had three 6-game losing streaks.
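One quick sanity check on this output is that the winning streak lengths should add up to the team's win total, and the losing streak lengths to its loss total. The sketch below copies Oakland's streak lengths from the output above into two vectors (the names oak.winning and oak.losing are mine) and checks them against Oakland's 103-59 record in 2002.

```r
# Oakland's 2002 streak lengths, copied from the printed output above.
oak.winning <- c(3, 3, 2, 1, 2, 4, 1, 1, 1, 1, 1, 4, 3, 1, 8, 8, 1, 1, 1,
                 2, 4, 2, 3, 1, 1, 2, 4, 2, 20, 3, 2, 1, 5, 4)
oak.losing <- c(2, 2, 3, 1, 2, 1, 3, 4, 3, 4, 1, 2, 1, 1, 1, 3, 1, 1, 1,
                1, 1, 1, 1, 4, 2, 1, 1, 2, 1, 3, 1, 1, 2)

sum(oak.winning)          # 103 wins
sum(oak.losing)           # 59 losses
max(oak.winning)          # 20, the famous streak
sum(oak.winning == 8)     # 2 winning streaks of exactly 8 games
max(oak.losing)           # 4
```

The totals give 162 games, so no game was dropped when the home and road games were combined.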

The vector teams contains the team abbreviation for all teams. Then using the sapply function, we find the lengths of winning and losing streaks for all teams in the 2002 season.

teams <- as.character(unique(gl2002$HomeTeam))
S <- sapply(teams, find.team.streaks, gl2002)
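It helps to know what kind of object sapply returns here. Because find.team.streaks returns a list with two components for every team, sapply simplifies the results to a matrix of mode list, with one row per component and one column per team. The sketch below illustrates this with a hypothetical stand-in function (toy.streaks) and made-up team labels "AAA" and "BBB"; it is not the real game log data.

```r
# Hypothetical stand-in for find.team.streaks: returns a list of two
# numeric vectors whose lengths differ from team to team.
toy.streaks <- function(team, data){
  list(Winning = data[[team]]$w, Losing = data[[team]]$l)
}

# Made-up streak lengths for two made-up teams.
toy.data <- list(AAA = list(w = c(2, 1, 3), l = c(1, 2)),
                 BBB = list(w = c(1, 4),    l = c(3, 1, 1)))
toy.teams <- names(toy.data)

S.toy <- sapply(toy.teams, toy.streaks, toy.data)

dim(S.toy)                  # 2 rows (Winning, Losing) by 2 columns (teams)
rownames(S.toy)
colnames(S.toy)
S.toy[["Winning", "BBB"]]   # double brackets extract one numeric vector
```

Each cell of the matrix holds a whole vector of streak lengths, which is why double brackets are needed to pull a team's streaks out.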

We create a data frame containing the lengths of all streaks. There are three variables: Team, Streak, and Type (whether it is a winning streak or a losing streak).

D <- NULL
for(j in teams)
   D <- rbind(D, data.frame(Team=j, Type="Winning",
                Streak=S[["Winning", j]]))
for(j in teams)
   D <- rbind(D, data.frame(Team=j, Type="Losing",
                Streak=S[["Losing", j]]))
head(D)
##   Team    Type Streak
## 1  ANA Winning      1
## 2  ANA Winning      2
## 3  ANA Winning      2
## 4  ANA Winning      1
## 5  ANA Winning      8
## 6  ANA Winning      1
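Growing a data frame with rbind inside a loop is fine for 30 teams, but an alternative is to collect the pieces in a list and bind them once at the end. Here is a minimal sketch of that variant, assuming a small list-matrix S.toy with made-up team labels and streak lengths in place of the real S.

```r
# Hypothetical list-matrix with the same layout as S:
# rows are streak types, columns are teams, each cell is a numeric vector.
S.toy <- matrix(list(c(2, 1, 3), c(1, 2), c(1, 4), c(3, 1, 1)),
                nrow = 2,
                dimnames = list(c("Winning", "Losing"), c("AAA", "BBB")))

# Collect one small data frame per (type, team) cell, then bind once.
pieces <- list()
for(type in rownames(S.toy))
  for(team in colnames(S.toy))
    pieces[[length(pieces) + 1]] <-
      data.frame(Team = team, Type = type, Streak = S.toy[[type, team]])
D.toy <- do.call(rbind, pieces)

print(D.toy)
```

The rows come out in the same order as in the two loops above: all winning streaks by team, then all losing streaks by team.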

I next construct a graph using ggplot2 showing the lengths of all streaks for all teams in the 2002 season. I jitter the points so that individual points are visible, and I facet the plot by streak type to compare the lengths of the winning and losing streaks.

library(ggplot2)
print(ggplot(D, aes(Team, Streak)) +
     geom_point(position="jitter") +
     coord_flip() +
     facet_wrap(~ Type) +
     ggtitle("Lengths of Streaks in 2002 Season"))

[Figure “teamstreaks”: lengths of winning and losing streaks for each team, 2002 season]

In the 2002 season, Oakland’s 20-game winning streak stands out, but Cleveland, Seattle, and Anaheim all had 10-game winning streaks. On the losing side, there were five losing streaks of length 10 or longer; Tampa Bay’s 15-game losing streak was the longest. Chapter 10 of Analyzing Baseball With R illustrates the use of R to find individual batting streaks from Retrosheet play-by-play data. In a follow-up post, I’ll talk about methods of figuring out the significance of team streaks.
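Summaries like these can be read off the graph, but they can also be computed directly from a data frame with the Team, Type, and Streak variables. The sketch below shows one way to do it with aggregate and subset. It uses a small made-up data frame D.toy with hypothetical team labels, so the numbers are not the 2002 results.

```r
# Made-up streak data with the same three variables as D.
D.toy <- data.frame(
  Team   = c("AAA", "AAA", "AAA", "BBB", "BBB", "BBB"),
  Type   = c("Winning", "Winning", "Losing", "Winning", "Losing", "Losing"),
  Streak = c(12, 3, 4, 5, 10, 15))

# Longest streak for each team and streak type.
longest <- aggregate(Streak ~ Team + Type, data = D.toy, FUN = max)

# All streaks of length 10 or longer, counted by type.
long.streaks <- subset(D.toy, Streak >= 10)
long.counts <- table(long.streaks$Type)

print(longest)
print(long.counts)
```

Applied to the full data frame D, the same two calls would list each team's longest winning and losing streak and count the streaks of 10 or more games.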
