In the previous post, we found the lengths of all winning and losing streaks of teams during a particular season. Here we look more carefully at the significance of clumpy win sequences and clumpy loss sequences.

We illustrate the basic ideas first. Suppose we represent a teams’s wins and losses by a binary sequence where a 1 corresponds to a win and a 0 to a loss.

Here are the results of the first 20 games of the Phillies 2002 season:

results <- c(0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1)

Suppose we compute the spacings or lengths of the gaps between victories. The function ` find.gaps`

will compute these spacings.

find.gaps <- function(x) { n <- length(x) ab.hit <- c((1:n)[x == 1], n + 1) diff(c(0, ab.hit)) - 1 } find.gaps(results)

## [1] 1 1 1 1 0 3 1 4 0

Here we see the Phillies had several losing streaks of lengths 3 and 4. One way to measure the streakiness in this sequence is the sum of squared spacings.

sum(find.gaps(results) ^ 2)

## [1] 30

Large values of this “clumpiness” measure indicate streakiness.

We can decide if this measure is large enough by use of a permutation test of randomness. If a team does not have any streaky tendency, then all possible permutations of this sequence of 8 wins and 12 losses are equally likely. We implement this test by randomly mixing up the game results (0’s and 1’s) (using the function ` sample `

), compute the clumpiness measure, and repeat this result 1000 times. We construct a histogram of the 1000 clumpiness measures (under randomness) and show the observed measure for the Phillies as a vertical line.

S = replicate(1000, sum(find.gaps(sample(results))^ 2)) library(MASS) truehist(S, xlab="Clumpy Measure") abline(v=30, lwd=3, col="red") text(35, .05, "OBSERVED", col="red")

To see if the observed streakiness (30) is extreme, we compute a tail-probability (a p-value). If this value is close to 0, this indicates the team is unsually streaky; if the value is close to 1, this indicates the team is unusually consistent.

We write a function `perm test`

to implement this test and output the p-value. We illustrate it for the Phillies first 20 games in the 2002 season – since this p-value is moderate in size, we conclude this sequence was not unusually streaky or consistent.

permtest <- function (y, ITER = 1000) { S = replicate(ITER, sum(find.gaps(sample(y)) ^ 2)) mean(S >= sum(find.gaps(y) ^ 2)) } permtest(results)

## [1] 0.787

Let’s do this for all teams in the 2002 season. The function `load.gamelog`

will read in the Retrosheet gamelog file for a particular season. The inputs are the season and the vector of names of the variables.

load.gamelog <- function(season, headers){ download.file( url <- paste("http://www.retrosheet.org/gamelogs/gl", season, ".zip" , sep="") , destfile <- paste("gl", season, ".zip", sep="") ) unzip(paste("gl", season, ".zip", sep="")) gamelog <- read.table(paste("gl", season, ".txt", sep="") , sep=",", stringsAsFactors=F) names(gamelog) <- headers file.remove(paste("gl", season, ".zip", sep="")) file.remove(paste("gl", season, ".txt", sep="")) gamelog }

The file `headerinfo.R`

creates a vector `Header`

containing the variable names. We use the `load.gamelog`

function to read in the game logs for the 2002 season.

source("headerinfo.R") gl2002 <- load.gamelog(2002, Headers)

The function `find.team.sequence`

gives the win/loss data (1 for a win and 0 for a loss) for a specific team for a particular season.

find.team.sequence <- function(team, data){ home <- subset(data, HomeTeam == team) home$GameNumber <- home$HomeTeamGameNumber home$Win <- with(home, ifelse(HomeRunsScore > VisitorRunsScored, 1, 0)) visiting <- subset(data, VisitingTeam == team) visiting$GameNumber <- visiting$VisitingTeamGameNumber visiting$Win <- with(visiting, ifelse(HomeRunsScore < VisitorRunsScored, 1, 0)) streak.data <- rbind(home, visiting) streak.data <- streak.data[order(streak.data$GameNumber), ] streak.data$Win }

We use the `find.team.sequence`

function to find the win/loss sequence of Oakland (team abbreviation “OAK”) and Philadelphia (team abbreviation “PHI”) for the 2002 season.

find.team.sequence("OAK", gl2002)

## [1] 1 1 1 0 0 1 1 1 0 0 1 1 0 0 0 1 0 1 1 0 0 1 1 1 1 0 1 0 0 0 1 0 0 0 0 ## [36] 1 0 0 0 1 0 0 0 0 1 0 1 1 1 1 0 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 0 1 1 1 ## [71] 1 1 1 1 1 0 0 0 1 0 1 0 1 0 1 1 0 1 1 1 1 0 1 1 0 1 1 1 0 1 0 0 0 0 1 ## [106] 0 0 1 1 0 1 1 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 ## [141] 1 1 1 0 0 0 1 1 0 1 0 1 1 1 1 1 0 0 1 1 1 1

find.team.sequence("PHI", gl2002)

## [1] 0 1 0 1 0 1 0 1 1 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 1 1 1 1 1 1 1 ## [36] 0 1 0 0 0 0 0 0 1 1 1 0 0 0 0 1 0 0 1 1 0 0 1 1 1 1 0 1 0 1 0 0 1 0 1 ## [71] 0 0 1 1 0 1 1 0 1 1 0 1 0 0 0 1 0 1 1 0 1 1 1 0 0 0 0 0 1 1 1 1 0 1 0 ## [106] 1 1 1 0 0 1 1 0 0 0 0 1 1 1 1 1 0 0 0 1 1 1 1 1 1 0 1 1 1 1 0 0 0 1 0 ## [141] 0 0 0 0 0 1 1 0 1 1 0 1 0 1 1 1 1 0 0 1 0

The vector `teams`

contains the team abbreviation for all teams. We collect the win/loss sequences for all teams and put it in the data frame `D`

teams <- as.character(unique(gl2002$HomeTeam)) D <- NULL for(j in teams) D <- rbind(D, data.frame(Team=j, Result=find.team.sequence(j, gl2002)))

The function `clumpy`

will compute the sum of squared spacings of a win/loss binary sequence.

clumpy <- function(y){ find.gaps <- function(x) { n <- length(x) ab.hit <- c((1:n)[x == 1], n + 1) diff(c(0, ab.hit)) - 1 } sum(find.gaps(y) ^ 2) }

Illustrate these functions for the 2002 Oakland Athletics.

oakland <- find.team.sequence("OAK", gl2002) clumpy(oakland)

## [1] 139

permtest(oakland)

## [1] 0.188

Using the `ddply`

function in the `plyr`

function, we compute for each team, the sum of squared spacings and the p-value of the permutation test. In addition, by looking at the spacings between losses (instead of victories), we compute the sum of squared spacings and the associated p-value. So we can explore the significance of both losing streaks and winning streaks.

library(plyr) S2 <- ddply(D, .(Team), summarize, Stat.Win=clumpy(Result), P.Value.Win=permtest(Result), Stat.Lose=clumpy(1 - Result), P.Value.Lose=permtest(1 - Result))

We plot the p-value (streakiness in winning sequence) against the p-value (streakiness in losing sequence) for all teams.

ggplot(S2, aes(P.Value.Win, P.Value.Lose, label=Team)) + geom_point() + geom_text() + geom_smooth(method="lm",se=FALSE) + scale_x_continuous(limits=c(0, 1)) + scale_y_continuous(limits=c(0, 1)) + annotate("text", x = 0.1, y = 0.1, label = "STREAKY", color="red", size=10) + annotate("text", x = 0.85, y = 1.0, label = "CONSISTENT", color="blue", size=10)

This graph is interesting — teams in the lower left portion of the plot (small p-values) tend to be the streaky teams in the 2002 season, and the teams in the upper right portion (large p-values) tend to be the consistent teams. The line indicates a positive relationship — teams who tend to be streaky in their pattern of winning games tend also to be consistent in their pattern of losing games.

Hi Jim !

Very interesting article. I am trying to map this knowledge to football games, but on football games there also is “Draw”. So a typical sequence could be something like: {L,L,D,L,W}. I am thinking on different ways to solve that, but not sure 100%. So I would appreciate a lot an advice from you.

Greetings !

Sergio.