Modeling Team Competition

Some History

Back in October 1992 (almost 22 years ago — remarkable how time flies), I organized a Ohio Statistics conference at Bowling Green and was very fortunate to have Bill James as the main speaker. His talk addressed the simple question “Does the Best Team Win the World Series?”. To answer this question, Bill ran a simulation. He assumed that teams all possessed specific talents and simulated a full season and post-season assuming these talents. He repeated this process for many seasons, collecting the team talents and season results. He showed that the best team, that is the team with the best talent, had a relatively small chance of winning the World Series. (By the way, Bill’s Bowling Green paper together with some statistical work by myself and Hal Stern appeared in Chance.)

Anyway, Bill’s study inspired me to write more about simulating baseball seasons, in Chapter 12 of Curve Ball and Chapter 9 of Analyzing Baseball with R. The key model I used in the simulation was the famous Bradley-Terry model. Here I describe this model and show how one can fit this model easily to win/loss data for a specific season using the R package BradleyTerry2.

Here is the Bradley-Terry model. Suppose the 30 teams have specific talents (abilities) that we denote by a_1, ..., a_{30}. The probability team i defeats team j is given by

P({\rm team} \, i \, {\rm beats} \, {\rm team} \, j) = \frac{exp(a_i - a_j)} {1 + exp(a_i - a_j)}

This is a special case of a logistic model which is a common regression model where the response variable is binary (0 or 1).

Here’s how I use R to fit this model to data from the 2013 season.

Setup

The function load.gamelog will read in the Retrosheet gamelog file for a particular season. The inputs are the season and the vector of names of the variables.

load.gamelog <- function(season, headers){
  download.file(
    url <- paste("http://www.retrosheet.org/gamelogs/gl", season, ".zip"
              , sep="")
    , destfile <- paste("gl", season, ".zip", sep="")
  )
  unzip(paste("gl", season, ".zip", sep=""))
  gamelog <- read.table(paste("gl", season, ".txt", sep="")
                        , sep=",", stringsAsFactors=F)
  names(gamelog) <- headers
  file.remove(paste("gl", season, ".zip", sep=""))
  file.remove(paste("gl", season, ".txt", sep=""))
  gamelog
}

The file headerinfo.R (found on my github gist site) creates a vector Header containing the variable names. We use the load.gamelog function to read in the game logs for the 2013 season.

source("headerinfo.R")
gl2013 <- load.gamelog(2013, Headers)

Next, I create a new variable Win that is 1 or 0 depending if the home team wins (1) or loses (0).

gl2013$Win <- with(gl2013, 
               ifelse(HomeRunsScore > VisitorRunsScored, 1, 0))

Using the summarize function in the dplyr package, I find the number of home wins and away wins for each matchup of two teams. I show what the first few rows of this data frame look like.

library(dplyr)
d <- summarize(group_by(gl2013, HomeTeam, VisitingTeam),
               home.wins=sum(Win), away.wins=sum(1 - Win))
head(d)
## Source: local data frame [6 x 4]
## Groups: HomeTeam
## 
##   HomeTeam VisitingTeam home.wins away.wins
## 1      ANA          BAL         1         3
## 2      ANA          BOS         2         1
## 3      ANA          CHA         2         2
## 4      ANA          CHN         1         1
## 5      ANA          CLE         0         3
## 6      ANA          DET         3         0

Fit the Model

We can now fit the Bradley-Terry model using the function Btm in the BradleyTerry2 package. I’m following the syntax in the example of the function Btm.

library(BradleyTerry2)
model1 <- BTm(cbind(home.wins, away.wins), HomeTeam, VisitingTeam,
              data=d, id= "team")
model1
## Bradley Terry model fit by glm.fit 
## 
## Call:  BTm(outcome = cbind(home.wins, away.wins), player1 = HomeTeam, 
##     player2 = VisitingTeam, id = "team", data = d)
## 
## Coefficients:
##  teamARI   teamATL   teamBAL   teamBOS   teamCHA   teamCHN   teamCIN  
##  0.07120   0.36552   0.25565   0.51730  -0.34796  -0.26937   0.29493  
##  teamCLE   teamCOL   teamDET   teamHOU   teamKCA   teamLAN   teamMIA  
##  0.32887  -0.10462   0.35421  -0.64300   0.20159   0.31251  -0.40908  
##  teamMIL   teamMIN   teamNYA   teamNYN   teamOAK   teamPHI   teamPIT  
## -0.08390  -0.26797   0.24895  -0.13366   0.40467  -0.14482   0.38195  
##  teamSDN   teamSEA   teamSFN   teamSLN   teamTBA   teamTEX   teamTOR  
## -0.05402  -0.16107  -0.03923   0.44673   0.38255   0.28207   0.00848  
##  teamWAS  
##  0.13789  
## 
## Degrees of Freedom: 540 Total (i.e. Null);  511 Residual
## Null Deviance:       772 
## Residual Deviance: 670   AIC: 1530

The output gives the estimates at the team talents a_1, ..., a_{30}. One can not get unique estimates at all 30 talents with imposing some restriction, so by convention the first talent is equal to a_1 = 0.

Add a Home Advantage Effect?

One way to improve this model is to add a home effect since teams are more likely to win at home.

P({\rm team} \, i \, {\rm beats} \, {\rm team} \, j) = \frac{exp(a_i - a_j + h)} {1 + exp(a_i - a_j + h)}

We add a home advantage variable at.home and refit the model using the home advantage effect.

d$HomeTeam <- data.frame(team = d$HomeTeam, at.home = 1)
d$VisitingTeam <- data.frame(team = d$VisitingTeam, at.home = 0)
model2 <- update(model1, formula = ~ team + at.home)
model2
## Bradley Terry model fit by glm.fit 
## 
## Call:  BTm(outcome = cbind(home.wins, away.wins), player1 = HomeTeam, 
##     player2 = VisitingTeam, formula = ~team + at.home, id = "team", 
##     data = d)
## 
## Coefficients:
##  teamARI   teamATL   teamBAL   teamBOS   teamCHA   teamCHN   teamCIN  
##  0.07167   0.36793   0.25728   0.52084  -0.34969  -0.27086   0.29660  
##  teamCLE   teamCOL   teamDET   teamHOU   teamKCA   teamLAN   teamMIA  
##  0.33057  -0.10495   0.35567  -0.64716   0.20269   0.31446  -0.41076  
##  teamMIL   teamMIN   teamNYA   teamNYN   teamOAK   teamPHI   teamPIT  
## -0.08428  -0.27022   0.25035  -0.13421   0.40725  -0.14545   0.38492  
##  teamSDN   teamSEA   teamSFN   teamSLN   teamTBA   teamTEX   teamTOR  
## -0.05447  -0.16273  -0.03915   0.44973   0.38564   0.28218   0.00867  
##  teamWAS   at.home  
##  0.13922   0.15739  
## 
## Degrees of Freedom: 540 Total (i.e. Null);  510 Residual
## Null Deviance:       772 
## Residual Deviance: 656   AIC: 1520

Using the Model

Let’s illustrate using the fitted model. Suppose Atlanta is playing Detroit in Atlanta — we see that the respective team ability estimates are 0.36793 and 0.35567 and the home effect estimate is 0.15739. So the probability Atlanta defeats Detroit in this game is estimated to be
p = \frac{\exp(0.36793 - 0.35567 + 0.15739)}{1 + \exp(0.36793 - 0.35567 + 0.15739)} = 0.54
So the probability Atlanta wins this particular game is 0.54.

Is the B-T model a good model for team competition?

Well, the B-T model certainly is an approximation to real life. For example, I believe it is unrealistic to assume that a team has a constant talent throughout the season, and I don’t quite believe each team has the same home advantage. But I believe this B-T model is helpful for understanding the random behavior of baseball competition. In particular, it helps us learn about the impact of changes in the playoff structure (such as the addition of a second wild card team) to baseball competition.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: