Calculation of Win Probabilities, Part II

In an earlier post, I described how to compute end-of-inning win probabilities for the home team using Retrosheet game log data. Usually, though, one wants win probabilities after each play, which can then be used to find the so-called WPA (win probability added) values that we can attribute to specific players. Here I describe how to use Retrosheet play-by-play data to compute these win probabilities, and I’ll contrast the resulting values with the ones posted on the Baseball-Reference site.

In a previous post, I described how to download all play-by-play data for a specific season from Retrosheet and compute the run values (from the run expectancy table) for all players. We will be using these run values in this work.

Computing the Inning Win Probabilities

I wrote a new function prob.win.game2.R that will download the Retrosheet play-by-play data for a specific season and use it to estimate the probability the home team wins at the end of each half-inning. Essentially, I collect the winning margin (home score minus visitor score) and the ultimate game outcome (1 if the home team wins, and 0 otherwise) at the end of a specific half-inning and fit the logistic model
$\log \left( \frac{p} {1 - p}\right) = \beta_0 + \beta_1 \times (\text{WIN MARGIN})$
where $p$ is the probability the home team wins. We fit 16 of these models, one for each of the half-innings top-of-first, bottom-of-first, …, top-of-eighth, bottom-of-eighth. (In the 9th inning and beyond, we use the same model that was used for the 8th inning.) For example, at the end of the 4th inning, we have the fit
$\log \left( \frac{p} {1 - p}\right) = 0.074 + 0.632 \times (\text{WIN MARGIN})$
So each additional run of lead after 4 innings increases the home team’s log-odds (logit) of winning by 0.632.
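To see what this fit means on the probability scale, here is a minimal Python sketch (not the R code behind prob.win.game2.R) that inverts the logit for a range of margins. The only numbers taken from the post are the two end-of-4th-inning coefficients; the function and variable names are my own.

```python
import math

# Coefficients quoted in the post for the end of the 4th inning.
B0, B1 = 0.074, 0.632

def home_win_prob(margin, b0=B0, b1=B1):
    """Inverse logit of b0 + b1 * margin, where margin = home score - visitor score."""
    logit = b0 + b1 * margin
    return 1.0 / (1.0 + math.exp(-logit))

# Tabulate the fitted probability for margins from -3 to +3 runs.
for margin in range(-3, 4):
    print(f"margin {margin:+d}: P(home wins) = {home_win_prob(margin):.3f}")

# On the odds scale, each extra run multiplies p / (1 - p) by exp(B1).
odds_ratio = math.exp(B1)
print(f"odds multiplier per run: {odds_ratio:.2f}")
```

Because the model is linear on the logit scale, the effect of one more run on the probability itself is largest in close games and shrinks as the lead grows.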

Computing the Play Win Probabilities

What we actually want is the probability that the home team wins after each play in an inning. For example, suppose the home team has runners on 1st and 2nd with one out in the 4th. What is the probability they will win? By the law of total probability,
$P(\text{Win}) = \sum_R P(R) \, P(\text{Win} \mid R),$
where $P(R)$ is the probability the team scores $R$ runs in the remainder of the inning, and $P(\text{Win} \mid R)$ is the probability the team wins, evaluated at the end of the inning, if they have scored $R$ runs. This sum is tedious to compute, but we can approximate it by
$P(\text{Win}) \approx P(\text{Win} \mid \bar R),$
where $\bar R$ is the expected number of runs in the remainder of the inning, and $P(\text{Win} \mid \bar R)$ is the probability the home team wins (computed using the logistic model) if the team has scored $\bar R$ runs. For example, suppose that the home team currently has a 2-run lead, with runners on 1st and 2nd and two outs in the bottom of the fourth. The expected number of runs in the remainder of the inning is 0.40 (from 2014 data). Using the logistic model, the logit of the probability that the home team wins, evaluated at the end of that inning, is estimated to be
$\log \left( \frac{p} {1 - p}\right) = 0.074 + 0.632 \times (2 + 0.40) = 1.59,$
and the probability the home team wins the game is exp(1.59) / (1 + exp(1.59)) = 0.83.
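The arithmetic above can be checked with a short Python sketch, which also compares the approximation against the full sum over run outcomes. The coefficients, the 2-run lead, and the 0.40 expected runs come from the post. The run distribution `run_dist` is NOT real data: it is a made-up distribution with mean 0.40, included only to show how the two calculations relate.

```python
import math

B0, B1 = 0.074, 0.632  # end-of-4th-inning fit quoted in the post

def win_prob_from_margin(margin):
    """P(home wins) from the logistic model, given the home lead in runs."""
    return 1.0 / (1.0 + math.exp(-(B0 + B1 * margin)))

lead = 2              # current home lead
expected_runs = 0.40  # expected runs in the rest of the inning (2014 data, from the post)

# Approximation used in the post: plug the expected runs into the model.
approx = win_prob_from_margin(lead + expected_runs)

# HYPOTHETICAL distribution of runs scored in the rest of the inning.
# Invented for illustration only; chosen so that its mean is 0.40.
run_dist = {0: 0.77, 1: 0.12, 2: 0.06, 3: 0.04, 4: 0.01}
mean_runs = sum(r * p for r, p in run_dist.items())

# Law of total probability: average the model's output over the run outcomes.
exact = sum(p * win_prob_from_margin(lead + r) for r, p in run_dist.items())

print(f"mean of hypothetical distribution: {mean_runs:.2f}")
print(f"approximation P(Win | mean runs):  {approx:.3f}")
print(f"sum over run outcomes:             {exact:.3f}")
```

Under this invented distribution the two numbers differ by roughly a percentage point, which gives a feel for the size of the approximation error. A real comparison would need the empirical run distribution for each base-out state.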

To compute these win probabilities, I first use the function compute.runs.expectancy to compute all of the run values and then use a new function compute.win.probs to compute the win probabilities after all plays. This last function uses both the data frame that contains the Retrosheet data and run values, and also the data frame containing the logistic regression coefficients for all half-innings.
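As a rough sketch of the kind of lookup this involves (in Python, and not the actual compute.win.probs code), the function below picks the model for a play's half-inning and applies the approximation described earlier. The dictionary keys (`inning`, `half`, `margin`, `exp_runs`) are hypothetical names, and subtracting the expected runs when the visitors are batting is my assumption, since the post only works through a home-team example. The coefficient table holds only the one fit quoted in the post.

```python
import math

def model_key(inning, half):
    """Pick the model for a half-inning; innings 9 and later reuse the 8th-inning model."""
    return (min(inning, 8), half)

def play_win_prob(play, coefs):
    """Approximate P(home wins) after a play.

    play  : dict with hypothetical keys 'inning', 'half' ('top' or 'bottom'),
            'margin' (home score - visitor score) and 'exp_runs' (expected
            runs for the batting team in the rest of the half-inning).
    coefs : dict mapping (inning, half) -> (b0, b1).
    """
    b0, b1 = coefs[model_key(play["inning"], play["half"])]
    # Assumption: expected runs add to the home margin when the home team
    # bats (bottom half) and subtract from it when the visitors bat (top half).
    sign = 1 if play["half"] == "bottom" else -1
    logit = b0 + b1 * (play["margin"] + sign * play["exp_runs"])
    return 1.0 / (1.0 + math.exp(-logit))

# Only the end-of-4th-inning fit is quoted in the post.
coefs = {(4, "bottom"): (0.074, 0.632)}

# The worked example: 2-run home lead, 0.40 expected runs, bottom of the 4th.
play = {"inning": 4, "half": "bottom", "margin": 2, "exp_runs": 0.40}
p = play_win_prob(play, coefs)
print(round(p, 2))  # 0.83, matching the worked example
```

Applying this to every row of a play-by-play table would give one win probability per play; the differences between consecutive rows are then the raw material for WPA.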

Displaying Play Win Probabilities

I have saved the data frame containing all of this work for the 2014 season on my website. You can plot the win probabilities for a specific game by downloading the data frame and sourcing in a new function graph.game that will do the plotting. The function outputs the rows of the data frame corresponding to the game. Here is an illustration of how it works: we’ll plot the win probabilities for the New York Yankees at Tampa Bay Rays game of April 18, 2014. The plays