## Some History

Back in October 1992 (almost 22 years ago — remarkable how time flies), I organized a Ohio Statistics conference at Bowling Green and was very fortunate to have Bill James as the main speaker. His talk addressed the simple question “Does the Best Team Win the World Series?”. To answer this question, Bill ran a simulation. He assumed that teams all possessed specific talents and simulated a full season and post-season assuming these talents. He repeated this process for many seasons, collecting the team talents and season results. He showed that the best team, that is the team with the best talent, had a relatively small chance of winning the World Series. (By the way, Bill’s Bowling Green paper together with some statistical work by myself and Hal Stern appeared in Chance.)

Anyway, Bill’s study inspired me to write more about simulating baseball seasons, in Chapter 12 of Curve Ball and Chapter 9 of Analyzing Baseball with R. The key model I used in the simulation was the famous Bradley-Terry model. Here I describe this model and show how one can fit this model easily to win/loss data for a specific season using the R package `BradleyTerry2`

.

Here is the Bradley-Terry model. Suppose the 30 teams have specific talents (abilities) that we denote by . The probability team i defeats team j is given by

This is a special case of a logistic model which is a common regression model where the response variable is binary (0 or 1).

Here’s how I use R to fit this model to data from the 2013 season.

## Setup

The function ` load.gamelog `

will read in the Retrosheet gamelog file for a particular season. The inputs are the season and the vector of names of the variables.

load.gamelog <- function(season, headers){ download.file( url <- paste("http://www.retrosheet.org/gamelogs/gl", season, ".zip" , sep="") , destfile <- paste("gl", season, ".zip", sep="") ) unzip(paste("gl", season, ".zip", sep="")) gamelog <- read.table(paste("gl", season, ".txt", sep="") , sep=",", stringsAsFactors=F) names(gamelog) <- headers file.remove(paste("gl", season, ".zip", sep="")) file.remove(paste("gl", season, ".txt", sep="")) gamelog }

The file `headerinfo.R`

(found on my github gist site) creates a vector Header containing the variable names. We use the `load.gamelog`

function to read in the game logs for the 2013 season.

source("headerinfo.R") gl2013 <- load.gamelog(2013, Headers)

Next, I create a new variable `Win`

that is 1 or 0 depending if the home team wins (1) or loses (0).

gl2013$Win <- with(gl2013, ifelse(HomeRunsScore > VisitorRunsScored, 1, 0))

Using the `summarize`

function in the `dplyr`

package, I find the number of home wins and away wins for each matchup of two teams. I show what the first few rows of this data frame look like.

library(dplyr) d <- summarize(group_by(gl2013, HomeTeam, VisitingTeam), home.wins=sum(Win), away.wins=sum(1 - Win)) head(d) ## Source: local data frame [6 x 4] ## Groups: HomeTeam ## ## HomeTeam VisitingTeam home.wins away.wins ## 1 ANA BAL 1 3 ## 2 ANA BOS 2 1 ## 3 ANA CHA 2 2 ## 4 ANA CHN 1 1 ## 5 ANA CLE 0 3 ## 6 ANA DET 3 0

## Fit the Model

We can now fit the Bradley-Terry model using the function `Btm`

in the `BradleyTerry2`

package. I’m following the syntax in the example of the function `Btm`

.

library(BradleyTerry2) model1 <- BTm(cbind(home.wins, away.wins), HomeTeam, VisitingTeam, data=d, id= "team") model1 ## Bradley Terry model fit by glm.fit ## ## Call: BTm(outcome = cbind(home.wins, away.wins), player1 = HomeTeam, ## player2 = VisitingTeam, id = "team", data = d) ## ## Coefficients: ## teamARI teamATL teamBAL teamBOS teamCHA teamCHN teamCIN ## 0.07120 0.36552 0.25565 0.51730 -0.34796 -0.26937 0.29493 ## teamCLE teamCOL teamDET teamHOU teamKCA teamLAN teamMIA ## 0.32887 -0.10462 0.35421 -0.64300 0.20159 0.31251 -0.40908 ## teamMIL teamMIN teamNYA teamNYN teamOAK teamPHI teamPIT ## -0.08390 -0.26797 0.24895 -0.13366 0.40467 -0.14482 0.38195 ## teamSDN teamSEA teamSFN teamSLN teamTBA teamTEX teamTOR ## -0.05402 -0.16107 -0.03923 0.44673 0.38255 0.28207 0.00848 ## teamWAS ## 0.13789 ## ## Degrees of Freedom: 540 Total (i.e. Null); 511 Residual ## Null Deviance: 772 ## Residual Deviance: 670 AIC: 1530

The output gives the estimates at the team talents . One can not get unique estimates at all 30 talents with imposing some restriction, so by convention the first talent is equal to .

## Add a Home Advantage Effect?

One way to improve this model is to add a home effect since teams are more likely to win at home.

We add a home advantage variable `at.home`

and refit the model using the home advantage effect.

d$HomeTeam <- data.frame(team = d$HomeTeam, at.home = 1) d$VisitingTeam <- data.frame(team = d$VisitingTeam, at.home = 0) model2 <- update(model1, formula = ~ team + at.home) model2 ## Bradley Terry model fit by glm.fit ## ## Call: BTm(outcome = cbind(home.wins, away.wins), player1 = HomeTeam, ## player2 = VisitingTeam, formula = ~team + at.home, id = "team", ## data = d) ## ## Coefficients: ## teamARI teamATL teamBAL teamBOS teamCHA teamCHN teamCIN ## 0.07167 0.36793 0.25728 0.52084 -0.34969 -0.27086 0.29660 ## teamCLE teamCOL teamDET teamHOU teamKCA teamLAN teamMIA ## 0.33057 -0.10495 0.35567 -0.64716 0.20269 0.31446 -0.41076 ## teamMIL teamMIN teamNYA teamNYN teamOAK teamPHI teamPIT ## -0.08428 -0.27022 0.25035 -0.13421 0.40725 -0.14545 0.38492 ## teamSDN teamSEA teamSFN teamSLN teamTBA teamTEX teamTOR ## -0.05447 -0.16273 -0.03915 0.44973 0.38564 0.28218 0.00867 ## teamWAS at.home ## 0.13922 0.15739 ## ## Degrees of Freedom: 540 Total (i.e. Null); 510 Residual ## Null Deviance: 772 ## Residual Deviance: 656 AIC: 1520

## Using the Model

Let’s illustrate using the fitted model. Suppose Atlanta is playing Detroit in Atlanta — we see that the respective team ability estimates are 0.36793 and 0.35567 and the home effect estimate is 0.15739. So the probability Atlanta defeats Detroit in this game is estimated to be

So the probability Atlanta wins this particular game is 0.54.

## Is the B-T model a good model for team competition?

Well, the B-T model certainly is an approximation to real life. For example, I believe it is unrealistic to assume that a team has a constant talent throughout the season, and I don’t quite believe each team has the same home advantage. But I believe this B-T model is helpful for understanding the random behavior of baseball competition. In particular, it helps us learn about the impact of changes in the playoff structure (such as the addition of a second wild card team) to baseball competition.