Team Streaks, Part II

In the previous post, we found the lengths of all winning and losing streaks of teams during a particular season. Here we look more carefully at the significance of clumpy win sequences and clumpy loss sequences.

We illustrate the basic ideas first. Suppose we represent a teams’s wins and losses by a binary sequence where a 1 corresponds to a win and a 0 to a loss.

Here are the results of the first 20 games of the Phillies 2002 season:

results <- c(0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1)

Suppose we compute the spacings or lengths of the gaps between victories. The function find.gaps will compute these spacings.

find.gaps <- function(x) {
        n <- length(x)
        ab.hit <- c((1:n)[x == 1], n + 1)
        diff(c(0, ab.hit)) - 1
    }
find.gaps(results)
## [1] 1 1 1 1 0 3 1 4 0

Here we see the Phillies had several losing streaks of lengths 3 and 4. One way to measure the streakiness in this sequence is the sum of squared spacings.

sum(find.gaps(results) ^ 2)
## [1] 30

Large values of this “clumpiness” measure indicate streakiness.

We can decide if this measure is large enough by use of a permutation test of randomness. If a team does not have any streaky tendency, then all possible permutations of this sequence of 8 wins and 12 losses are equally likely. We implement this test by randomly mixing up the game results (0’s and 1’s) (using the function sample ), compute the clumpiness measure, and repeat this result 1000 times. We construct a histogram of the 1000 clumpiness measures (under randomness) and show the observed measure for the Phillies as a vertical line.

S = replicate(1000, sum(find.gaps(sample(results))^ 2))
library(MASS)
truehist(S, xlab="Clumpy Measure")
abline(v=30, lwd=3, col="red")
text(35, .05, "OBSERVED", col="red")

clumpy1

To see if the observed streakiness (30) is extreme, we compute a tail-probability (a p-value). If this value is close to 0, this indicates the team is unsually streaky; if the value is close to 1, this indicates the team is unusually consistent.

We write a function perm test to implement this test and output the p-value. We illustrate it for the Phillies first 20 games in the 2002 season – since this p-value is moderate in size, we conclude this sequence was not unusually streaky or consistent.

permtest <- function (y, ITER = 1000) {
    S = replicate(ITER, sum(find.gaps(sample(y)) ^ 2))
    mean(S >= sum(find.gaps(y) ^ 2))
}
permtest(results)
## [1] 0.787

Let’s do this for all teams in the 2002 season. The function load.gamelog will read in the Retrosheet gamelog file for a particular season. The inputs are the season and the vector of names of the variables.

load.gamelog <- function(season, headers){
  download.file(
    url <- paste("http://www.retrosheet.org/gamelogs/gl", season, ".zip"
              , sep="")
    , destfile <- paste("gl", season, ".zip", sep="")
  )
  unzip(paste("gl", season, ".zip", sep=""))
  gamelog <- read.table(paste("gl", season, ".txt", sep="")
                        , sep=",", stringsAsFactors=F)
  names(gamelog) <- headers
  file.remove(paste("gl", season, ".zip", sep=""))
  file.remove(paste("gl", season, ".txt", sep=""))
  gamelog
}

The file headerinfo.R creates a vector Header containing the variable names. We use the load.gamelog function to read in the game logs for the 2002 season.

source("headerinfo.R")
gl2002 <- load.gamelog(2002, Headers)

The function find.team.sequence gives the win/loss data (1 for a win and 0 for a loss) for a specific team for a particular season.

find.team.sequence <- function(team, data){
  home <- subset(data, HomeTeam == team)
  home$GameNumber <- home$HomeTeamGameNumber
  home$Win <- with(home, 
                  ifelse(HomeRunsScore > VisitorRunsScored, 1, 0))
  visiting <- subset(data, VisitingTeam == team)
  visiting$GameNumber <- visiting$VisitingTeamGameNumber
  visiting$Win <- with(visiting, 
                  ifelse(HomeRunsScore < VisitorRunsScored, 1, 0))
  streak.data <- rbind(home, visiting)
  streak.data <- streak.data[order(streak.data$GameNumber), ]
  streak.data$Win
}

We use the find.team.sequence function to find the win/loss sequence of Oakland (team abbreviation “OAK”) and Philadelphia (team abbreviation “PHI”) for the 2002 season.

find.team.sequence("OAK", gl2002)
##   [1] 1 1 1 0 0 1 1 1 0 0 1 1 0 0 0 1 0 1 1 0 0 1 1 1 1 0 1 0 0 0 1 0 0 0 0
##  [36] 1 0 0 0 1 0 0 0 0 1 0 1 1 1 1 0 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 0 1 1 1
##  [71] 1 1 1 1 1 0 0 0 1 0 1 0 1 0 1 1 0 1 1 1 1 0 1 1 0 1 1 1 0 1 0 0 0 0 1
## [106] 0 0 1 1 0 1 1 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
## [141] 1 1 1 0 0 0 1 1 0 1 0 1 1 1 1 1 0 0 1 1 1 1
find.team.sequence("PHI", gl2002)
##   [1] 0 1 0 1 0 1 0 1 1 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 1 1 1 1 1 1 1
##  [36] 0 1 0 0 0 0 0 0 1 1 1 0 0 0 0 1 0 0 1 1 0 0 1 1 1 1 0 1 0 1 0 0 1 0 1
##  [71] 0 0 1 1 0 1 1 0 1 1 0 1 0 0 0 1 0 1 1 0 1 1 1 0 0 0 0 0 1 1 1 1 0 1 0
## [106] 1 1 1 0 0 1 1 0 0 0 0 1 1 1 1 1 0 0 0 1 1 1 1 1 1 0 1 1 1 1 0 0 0 1 0
## [141] 0 0 0 0 0 1 1 0 1 1 0 1 0 1 1 1 1 0 0 1 0

The vector teams contains the team abbreviation for all teams. We collect the win/loss sequences for all teams and put it in the data frame D

teams <- as.character(unique(gl2002$HomeTeam))
D <- NULL
for(j in teams)
  D <- rbind(D, 
      data.frame(Team=j, Result=find.team.sequence(j, gl2002)))

The function clumpy will compute the sum of squared spacings of a win/loss binary sequence.

clumpy <- function(y){
    find.gaps <- function(x) {
        n <- length(x)
        ab.hit <- c((1:n)[x == 1], n + 1)
        diff(c(0, ab.hit)) - 1
    }
    sum(find.gaps(y) ^ 2)
}

Illustrate these functions for the 2002 Oakland Athletics.

oakland <- find.team.sequence("OAK", gl2002)
clumpy(oakland)
## [1] 139
permtest(oakland)
## [1] 0.188

Using the ddply function in the plyr function, we compute for each team, the sum of squared spacings and the p-value of the permutation test. In addition, by looking at the spacings between losses (instead of victories), we compute the sum of squared spacings and the associated p-value. So we can explore the significance of both losing streaks and winning streaks.

library(plyr)
S2 <- ddply(D, .(Team), summarize,
       Stat.Win=clumpy(Result),
       P.Value.Win=permtest(Result),
       Stat.Lose=clumpy(1 - Result),
       P.Value.Lose=permtest(1 - Result))

We plot the p-value (streakiness in winning sequence) against the p-value (streakiness in losing sequence) for all teams.

ggplot(S2, aes(P.Value.Win, P.Value.Lose, label=Team)) +
 geom_point() + geom_text() +
 geom_smooth(method="lm",se=FALSE) +
 scale_x_continuous(limits=c(0, 1)) +
 scale_y_continuous(limits=c(0, 1)) +
 annotate("text", x = 0.1, y = 0.1, 
          label = "STREAKY", color="red", size=10) +
 annotate("text", x = 0.85, y = 1.0, 
          label = "CONSISTENT", color="blue", size=10)  

permtest

This graph is interesting — teams in the lower left portion of the plot (small p-values) tend to be the streaky teams in the 2002 season, and the teams in the upper right portion (large p-values) tend to be the consistent teams. The line indicates a positive relationship — teams who tend to be streaky in their pattern of winning games tend also to be consistent in their pattern of losing games.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: