# Team Streaks, Part II

In the previous post, we found the lengths of all winning and losing streaks of teams during a particular season. Here we look more carefully at the significance of clumpy win sequences and clumpy loss sequences.

We illustrate the basic ideas first. Suppose we represent a team’s wins and losses by a binary sequence where a 1 corresponds to a win and a 0 to a loss.

Here are the results of the first 20 games of the Phillies 2002 season:

```
results <- c(0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1)
```

Suppose we compute the spacings, or lengths of the gaps, between victories (including the gap before the first win and the gap after the last one). The function `find.gaps` will compute these spacings.

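```
find.gaps <- function(x) {
  n <- length(x)
  # positions of the wins, plus a sentinel one past the last game
  ab.hit <- c((1:n)[x == 1], n + 1)
  # number of losses before each win (and after the last win)
  diff(c(0, ab.hit)) - 1
}
find.gaps(results)
```
```
## [1] 1 1 1 1 0 3 1 4 0
```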
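As a cross-check, the nonzero spacings should simply be the lengths of the runs of losses. A minimal sketch, assuming base R's `rle` (run-length encoding) is an acceptable way to find those runs; the name `loss.runs` is introduced here only for illustration.

```r
results <- c(0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1)

find.gaps <- function(x) {
  n <- length(x)
  ab.hit <- c((1:n)[x == 1], n + 1)
  diff(c(0, ab.hit)) - 1
}

# Run-length encoding: lengths of consecutive identical values
runs <- rle(results)
loss.runs <- runs$lengths[runs$values == 0]

gaps <- find.gaps(results)
# Zero spacings come from back-to-back wins or a win in the first or
# last game; dropping them should leave exactly the losing streaks
same.streaks <- identical(as.numeric(gaps[gaps > 0]), as.numeric(loss.runs))
same.streaks
```

If the two approaches agree, `same.streaks` is `TRUE`.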

Here we see the Phillies had one losing streak of length 3 and one of length 4 in these 20 games. One way to measure the streakiness in this sequence is the sum of squared spacings.

```
sum(find.gaps(results) ^ 2)
```
```
## [1] 30
```

Large values of this “clumpiness” measure indicate streakiness.
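To see why squaring rewards clumping, consider two eight-game sequences with the same record (four wins, four losses). The sequences `alternating` and `clustered` below are made up for illustration and do not come from the game logs.

```r
find.gaps <- function(x) {
  n <- length(x)
  ab.hit <- c((1:n)[x == 1], n + 1)
  diff(c(0, ab.hit)) - 1
}

alternating <- c(1, 0, 1, 0, 1, 0, 1, 0)  # losses spread out
clustered   <- c(1, 1, 1, 1, 0, 0, 0, 0)  # losses bunched together

m.alt <- sum(find.gaps(alternating) ^ 2)  # spacings 0 1 1 1 1 -> 4
m.clu <- sum(find.gaps(clustered) ^ 2)    # spacings 0 0 0 0 4 -> 16
c(alternating = m.alt, clustered = m.clu)
```

Both sequences have spacings that add up to 4 (the number of losses), but squaring gives far more weight to the single long gap.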

We can decide whether this measure is large enough by using a permutation test of randomness. If a team has no streaky tendency, then all possible arrangements of this sequence of 8 wins and 12 losses are equally likely. We implement the test by randomly shuffling the game results (the 0’s and 1’s) with the function `sample`, computing the clumpiness measure, and repeating this process 1000 times. We then construct a histogram of the 1000 clumpiness measures (under randomness) and show the observed measure for the Phillies as a vertical line.

```
S <- replicate(1000, sum(find.gaps(sample(results)) ^ 2))
library(MASS)
truehist(S, xlab = "Clumpy Measure")
abline(v = 30, lwd = 3, col = "red")
text(35, .05, "OBSERVED", col = "red")
```

To see if the observed streakiness (30) is extreme, we compute a tail probability (a p-value): the proportion of simulated clumpiness measures that are at least as large as the observed value. If this value is close to 0, this indicates the team is unusually streaky; if the value is close to 1, this indicates the team is unusually consistent.

We write a function `permtest` to implement this test and output the p-value. We illustrate it for the Phillies’ first 20 games in the 2002 season. Since this p-value is neither close to 0 nor close to 1, we conclude this sequence was not unusually streaky or consistent.

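```
permtest <- function(y, ITER = 1000) {
  # clumpiness of ITER random permutations of the sequence
  S <- replicate(ITER, sum(find.gaps(sample(y)) ^ 2))
  # proportion of permutations at least as clumpy as the observed sequence
  mean(S >= sum(find.gaps(y) ^ 2))
}
permtest(results)
```
```
## [1] 0.787
```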
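Because `permtest` draws random permutations, its output changes a little from run to run. Below is a sketch of one way to make the result reproducible and to gauge its simulation error, assuming the usual binomial standard error for a simulated proportion. The seed value and the name `mc.se` are arbitrary choices and not part of the original analysis.

```r
results <- c(0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1)

find.gaps <- function(x) {
  n <- length(x)
  ab.hit <- c((1:n)[x == 1], n + 1)
  diff(c(0, ab.hit)) - 1
}

permtest <- function(y, ITER = 1000) {
  S <- replicate(ITER, sum(find.gaps(sample(y)) ^ 2))
  mean(S >= sum(find.gaps(y) ^ 2))
}

ITER <- 1000
set.seed(2002)                        # arbitrary seed, for reproducibility only
p <- permtest(results, ITER = ITER)
mc.se <- sqrt(p * (1 - p) / ITER)     # approximate Monte Carlo standard error
c(p.value = p, mc.se = mc.se)
```

With 1000 iterations the standard error is at most about 0.016, so differences in the second decimal place between runs are to be expected.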

Let’s do this for all teams in the 2002 season. The function `load.gamelog` will read in the Retrosheet gamelog file for a particular season. The inputs are the season and the vector of names of the variables.

```
load.gamelog <- function(season, headers) {
  zipfile <- paste("gl", season, ".zip", sep = "")
  txtfile <- paste("gl", season, ".txt", sep = "")
  # download and unpack the game log for the season
  download.file(
    url = paste("http://www.retrosheet.org/gamelogs/gl", season, ".zip",
                sep = ""),
    destfile = zipfile
  )
  unzip(zipfile)
  gamelog <- read.table(txtfile, sep = ",", stringsAsFactors = FALSE)
  names(gamelog) <- headers
  # remove the downloaded files once the data are read in
  file.remove(zipfile)
  file.remove(txtfile)
  gamelog
}
```

The file `headerinfo.R` creates a vector `Headers` containing the variable names. We use the `load.gamelog` function to read in the game logs for the 2002 season.

```
source("headerinfo.R")
gl2002 <- load.gamelog(2002, Headers)
```

The function `find.team.sequence` gives the win/loss data (1 for a win and 0 for a loss) for a specific team for a particular season.

```
find.team.sequence <- function(team, data) {
  # home games: a win when the home team outscores the visitor
  home <- subset(data, HomeTeam == team)
  home$GameNumber <- home$HomeTeamGameNumber
  home$Win <- with(home,
                   ifelse(HomeRunsScore > VisitorRunsScored, 1, 0))
  # road games: a win when the visitor outscores the home team
  visiting <- subset(data, VisitingTeam == team)
  visiting$GameNumber <- visiting$VisitingTeamGameNumber
  visiting$Win <- with(visiting,
                       ifelse(HomeRunsScore < VisitorRunsScored, 1, 0))
  # combine and put the games in the order they were played
  streak.data <- rbind(home, visiting)
  streak.data <- streak.data[order(streak.data$GameNumber), ]
  streak.data$Win
}
```
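A quick way to check the logic is to run the function on a tiny made-up game log. The data frame `toy` below is hypothetical (three invented games between invented teams `AAA` and `BBB`); it only assumes the column names used in the function above.

```r
find.team.sequence <- function(team, data) {
  home <- subset(data, HomeTeam == team)
  home$GameNumber <- home$HomeTeamGameNumber
  home$Win <- with(home, ifelse(HomeRunsScore > VisitorRunsScored, 1, 0))
  visiting <- subset(data, VisitingTeam == team)
  visiting$GameNumber <- visiting$VisitingTeamGameNumber
  visiting$Win <- with(visiting, ifelse(HomeRunsScore < VisitorRunsScored, 1, 0))
  streak.data <- rbind(home, visiting)
  streak.data <- streak.data[order(streak.data$GameNumber), ]
  streak.data$Win
}

# Hypothetical three-game log: AAA is home in games 1 and 3, away in game 2
toy <- data.frame(
  VisitingTeam           = c("BBB", "AAA", "BBB"),
  HomeTeam               = c("AAA", "BBB", "AAA"),
  VisitingTeamGameNumber = c(1, 2, 3),
  HomeTeamGameNumber     = c(1, 2, 3),
  VisitorRunsScored      = c(2, 5, 4),
  HomeRunsScore          = c(3, 1, 1),
  stringsAsFactors = FALSE
)

aaa <- find.team.sequence("AAA", toy)  # wins games 1 and 2, loses game 3: 1 1 0
bbb <- find.team.sequence("BBB", toy)  # the mirror image: 0 0 1
```

Since every game here involves both teams, the two sequences should be exact complements of each other.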

We use the `find.team.sequence` function to find the win/loss sequence of Oakland (team abbreviation “OAK”) and Philadelphia (team abbreviation “PHI”) for the 2002 season.

```
find.team.sequence("OAK", gl2002)
```
```
##   [1] 1 1 1 0 0 1 1 1 0 0 1 1 0 0 0 1 0 1 1 0 0 1 1 1 1 0 1 0 0 0 1 0 0 0 0
##  [36] 1 0 0 0 1 0 0 0 0 1 0 1 1 1 1 0 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 0 1 1 1
##  [71] 1 1 1 1 1 0 0 0 1 0 1 0 1 0 1 1 0 1 1 1 1 0 1 1 0 1 1 1 0 1 0 0 0 0 1
## [106] 0 0 1 1 0 1 1 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
## [141] 1 1 1 0 0 0 1 1 0 1 0 1 1 1 1 1 0 0 1 1 1 1
```
```
find.team.sequence("PHI", gl2002)
```
```
##   [1] 0 1 0 1 0 1 0 1 1 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 1 1 1 1 1 1 1
##  [36] 0 1 0 0 0 0 0 0 1 1 1 0 0 0 0 1 0 0 1 1 0 0 1 1 1 1 0 1 0 1 0 0 1 0 1
##  [71] 0 0 1 1 0 1 1 0 1 1 0 1 0 0 0 1 0 1 1 0 1 1 1 0 0 0 0 0 1 1 1 1 0 1 0
## [106] 1 1 1 0 0 1 1 0 0 0 0 1 1 1 1 1 0 0 0 1 1 1 1 1 1 0 1 1 1 1 0 0 0 1 0
## [141] 0 0 0 0 0 1 1 0 1 1 0 1 0 1 1 1 1 0 0 1 0
```

The vector `teams` contains the abbreviations of all teams. We collect the win/loss sequences for all teams and put them in the data frame `D`.

```
teams <- as.character(unique(gl2002$HomeTeam))
D <- NULL
# stack one (Team, Result) block per team
for (j in teams) {
  D <- rbind(D,
             data.frame(Team = j, Result = find.team.sequence(j, gl2002)))
}
```

The function `clumpy` will compute the sum of squared spacings of a win/loss binary sequence.

```
clumpy <- function(y) {
  find.gaps <- function(x) {
    n <- length(x)
    ab.hit <- c((1:n)[x == 1], n + 1)
    diff(c(0, ab.hit)) - 1
  }
  sum(find.gaps(y) ^ 2)
}
```

We illustrate these functions for the 2002 Oakland Athletics.

```
oakland <- find.team.sequence("OAK", gl2002)
clumpy(oakland)
```
```
## [1] 139
```
```
permtest(oakland)
```
```
## [1] 0.188
```

Using the `ddply` function in the `plyr` package, we compute, for each team, the sum of squared spacings between wins and the p-value of the permutation test. In addition, by looking at the spacings between losses (instead of victories), we compute a second sum of squared spacings and its associated p-value. So we can explore the significance of both losing streaks and winning streaks.

```
library(plyr)
S2 <- ddply(D, .(Team), summarize,
            Stat.Win = clumpy(Result),
            P.Value.Win = permtest(Result),
            Stat.Lose = clumpy(1 - Result),
            P.Value.Lose = permtest(1 - Result))
```
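If `plyr` is not available, the same per-team summary can be sketched in base R by splitting the results by team. The data frame `D.toy` and its two invented teams are placeholders for the real `D`, and `clumpy` and `permtest` are repeated (in a slightly condensed form) so the block runs on its own.

```r
find.gaps <- function(x) {
  n <- length(x)
  ab.hit <- c((1:n)[x == 1], n + 1)
  diff(c(0, ab.hit)) - 1
}
clumpy <- function(y) sum(find.gaps(y) ^ 2)
permtest <- function(y, ITER = 1000) {
  S <- replicate(ITER, clumpy(sample(y)))
  mean(S >= clumpy(y))
}

# Hypothetical data: AAA is as clumpy as possible, BBB alternates perfectly
D.toy <- data.frame(
  Team   = rep(c("AAA", "BBB"), each = 10),
  Result = c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
             1, 0, 1, 0, 1, 0, 1, 0, 1, 0)
)

# One win/loss vector per team, then one summary row per team
by.team <- split(D.toy$Result, D.toy$Team)
S2.toy <- data.frame(
  Team         = names(by.team),
  Stat.Win     = sapply(by.team, clumpy),
  P.Value.Win  = sapply(by.team, permtest),
  Stat.Lose    = sapply(by.team, function(y) clumpy(1 - y)),
  P.Value.Lose = sapply(by.team, function(y) permtest(1 - y)),
  row.names = NULL
)
S2.toy
```

In this toy example the clumped team should get p-values near 0 and the alternating team p-values of 1, which matches the reading of small p-values as streaky and large ones as consistent.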

We plot the p-value (streakiness in winning sequence) against the p-value (streakiness in losing sequence) for all teams.

```
library(ggplot2)
ggplot(S2, aes(P.Value.Win, P.Value.Lose, label = Team)) +
  geom_point() + geom_text() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_continuous(limits = c(0, 1)) +
  scale_y_continuous(limits = c(0, 1)) +
  annotate("text", x = 0.1, y = 0.1,
           label = "STREAKY", color = "red", size = 10) +
  annotate("text", x = 0.85, y = 1.0,
           label = "CONSISTENT", color = "blue", size = 10)
```

This graph is interesting: teams in the lower left portion of the plot (small p-values) tend to be the streaky teams in the 2002 season, and the teams in the upper right portion (large p-values) tend to be the consistent teams. The line indicates a positive relationship: teams who tend to be streaky in their pattern of winning games tend also to be streaky in their pattern of losing games.


### One response

1. Hi Jim !

Very interesting article. I am trying to map this knowledge to football games, but on football games there also is “Draw”. So a typical sequence could be something like: {L,L,D,L,W}. I am thinking on different ways to solve that, but not sure 100%. So I would appreciate a lot an advice from you.

Greetings !

Sergio.