Calculation of Win Probabilities, Part II

In an earlier post, I described how one can compute end-of-inning win probabilities for the home team using Retrosheet game log data. One is usually interested in computing win probabilities after each play, and using these to find so-called WPA (win-probability added) values that we can attribute to specific players. Here we describe using Retrosheet play-by-play data to compute these win probabilities and we’ll contrast these values with the ones posted on the Baseball-Reference site.

In a previous post, we described the process of downloading all play-by-play data for a specific season from Retrosheet and computing the run values (from the run expectancy table) for all players. We will be using these run values in this work.

Computing the Inning Win Probabilities

I wrote a new function prob.win.game2.R that will download the Retrosheet play-by-play data for a specific season and use these to estimate the probability the home team wins at the end of each half-inning. Essentially, I collect the winning margin (home score – visitor score) and the ultimate game outcome (1 if the home team wins, and 0 otherwise) at the end of a specific half-inning and fit the logistic model
\log \left( \frac{p} {1 - p}\right) = \beta_0 + \beta_1 \times (WIN MARGIN)
where p is the probability the home team wins. We fit 16 of these models, one for each of the half-innings top-of-first, bottom-of-first, …, top-of-eighth, bottom-of-eighth. (In the 9th inning and beyond, we use the same model that was used for the 8th inning.) For example, at the end of the 4th inning, we have the fit
\log \left( \frac{p} {1 - p}\right) = 0.074 + 0.632 \times (WIN MARGIN)
So by leading by an additional run after 4 innings, the home team’s probability of winning (on the logit scale) is increased by 0.632.
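To make the fitting step concrete, here is a minimal sketch of how one of these half-inning models might be fit with glm. It is not the code inside prob.win.game2.R: the data are simulated from the end-of-4th-inning coefficients quoted above rather than taken from Retrosheet, and the variable names (margin, home_win) are my own placeholders.

```r
# Simulated stand-in for end-of-4th-inning data (NOT Retrosheet data).
set.seed(1)
n <- 5000
margin <- sample(-5:5, n, replace = TRUE)        # home score - visitor score
p_true <- plogis(0.074 + 0.632 * margin)         # coefficients quoted in the post
home_win <- rbinom(n, size = 1, prob = p_true)   # 1 if the home team wins, else 0

# Logistic regression of game outcome on the win margin
fit <- glm(home_win ~ margin, family = binomial)
coef(fit)   # estimates should land near 0.074 and 0.632

# One extra run of lead adds 0.632 on the logit scale,
# but the change on the probability scale depends on the current lead.
margins <- 0:3
probs <- plogis(0.074 + 0.632 * margins)
round(probs, 3)
```

The last two lines show why the "increase of 0.632" is stated on the logit scale: the same extra run moves the win probability by a different amount depending on where the home team already stands.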

Computing the Play Win Probabilities

Actually, we want to compute the probability the home team wins after each play in an inning. For example, suppose the home team has runners on 1st and 2nd with one out in the 4th. What is the probability they will win? By the law of total probability,
P(Win) = \sum P(R) P(Win | R),
where P(R) is the probability the team scores R runs in the inning, and P(Win | R) is the probability the team wins at the end of the inning if they have scored R runs. This probability is tedious to compute, but we can approximate it by
P(Win) \approx P(Win | \bar R),
where \bar R is the expected run value, and P(Win | \bar R) is the probability the home team wins (computed using the logistic model) if the team has scored \bar R runs. For example, suppose that a home team currently has a 2-run lead, and runners on 1st and 2nd with two outs in the bottom of the fourth. The expected number of runs in the remainder of the inning is 0.40 (from 2014 data). Using the logistic model, the logit of the probability of the home team winning at the bottom of that inning is estimated to be
\log \left( \frac{p} {1 - p}\right) = 0.074 + 0.632 \times (2 + 0.40) = 1.59,
and the probability the home team wins the game is exp(1.59) / (1 + exp(1.59)) = 0.83.
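The arithmetic above can be checked in a few lines of R. This sketch also compares the plug-in approximation P(Win | \bar R) with the full sum over R. The run distribution p_runs is hypothetical: I made it up so that its mean is 0.40, and it is not estimated from the 2014 data.

```r
b0 <- 0.074; b1 <- 0.632   # end-of-4th-inning coefficients from the post
lead <- 2                  # current home lead
rbar <- 0.40               # expected runs in the rest of the inning

# Plug-in approximation: evaluate the logistic model at lead + expected runs
logit <- b0 + b1 * (lead + rbar)
p_approx <- plogis(logit)  # same as exp(logit) / (1 + exp(logit))

# Hypothetical distribution of runs scored in the rest of the inning
# (illustration only; chosen so that the mean is 0.40)
runs <- 0:4
p_runs <- c(0.78, 0.10, 0.07, 0.04, 0.01)
stopifnot(isTRUE(all.equal(sum(p_runs), 1)),
          isTRUE(all.equal(sum(runs * p_runs), rbar)))

# Full sum: P(Win) = sum over R of P(R) * P(Win | R)
p_exact <- sum(p_runs * plogis(b0 + b1 * (lead + runs)))

round(c(logit = logit, approx = p_approx, exact = p_exact), 3)
```

Under this made-up distribution the two answers should be close but not identical, since the logistic curve is not linear in the runs scored. With a run distribution estimated from real data, the size of the gap could be different.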

To compute these win probabilities, I first use the function compute.runs.expectancy to compute all of the run values and then use a new function compute.win.probs to compute the win probabilities after all plays. This last function uses both the data frame that contains the Retrosheet data and run values, and also the data frame containing the logistic regression coefficients for all half-innings.

Displaying Play Win Probabilities

I have saved the data frame containing all of this work for the 2014 season on my website. You can plot the win probabilities for a specific game by downloading the data frame and sourcing a new function graph.game that will do the plotting. The function outputs the rows of the data frame corresponding to the game. Here is an illustration of how it works: we’ll plot the win probabilities for the New York Yankees at Tampa Bay Rays game of April 18, 2014.

load(url("http://bayes.bgsu.edu/baseball/pbp2014winprob.Rdata"))
library(devtools)
source_gist("a8e5255c98954d151974")
d.game <- graph.game(d, "TBA201404180")

[Figure: win probabilities by play for the New York Yankees at Tampa Bay Rays game of April 18, 2014]

You can plot win probabilities for any game for the 2014 season — just change the game code in the input of graph.game .

Checking My Work

If you look at the Baseball-Reference description of this game here, you’ll see the same type of win probabilities plotted and each play has an associated value of WPA. I’m not sure of the exact method that was used, but it is easy to download their data and compare their values of WPA with my values. Here is a scatterplot of the two sets of values for this particular game.

[Figure: scatterplot of the Baseball-Reference WPA values against my WPA values for this game]

I’m encouraged that the WPA values are similar in this case. In particular, both methods would identify the important plays in this game. The key play was James Loney’s single in the bottom of the 7th that raised the Tampa win probability by over 40%.
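For readers who want to see how a WPA value follows from the win probabilities, here is a minimal sketch. The WPA of a play is the change in win probability from just before the play to just after it. The vector wp below is hypothetical and does not come from the saved 2014 data frame, whose column names I am not assuming here.

```r
# Hypothetical home-team win probabilities: before the first play,
# then after each of four plays (illustration only)
wp <- c(0.50, 0.47, 0.52, 0.41, 0.83)

# WPA of each play = win probability after the play minus before the play
wpa <- diff(wp)

# The play with the largest swing in either direction
key_play <- which.max(abs(wpa))

round(wpa, 2)
key_play
```

Because the WPA values are successive differences, they sum to the total change in win probability over the sequence of plays.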

The code for the two new functions prob.win.game2.R and compute.win.probs can be found on my gist site here.

6 responses

  1. Hi Jim, great stuff. Thanks for sharing. I did something similar. I downloaded retrosheet play by play data from the past four seasons. At first my plan was to simulate the runs created stats, as shown in the Marchi/Albert book, so I built a shiny app to do just that. But then I decided to take it further and use the information to calculate expected run states and prob of scoring at least one run, and to calculate the break even success rates of all sorts of base running plays (tag from third, steal, advance two bases on a single, and sac bunts), all depending on the bases/outs state. I show that runners are usually far too cautious (particularly when tagging- or not tagging- from third). I also show that in all circumstances the sac bunt gets you to a worse state, with the one exception being that a successful bunt with a runner on second and no outs slightly increases the prob of scoring at least one run. There is also an input box allowing the user to see the expected end state, depending on the entered ‘success rate’.

    I then built a second shiny app, using the probability distributions of runs scored in a full half inning by the home and away teams, along with the distribution of the current half inning. I simulate several thousand games to calculate the probability of each team winning the game. I also allow the user to see how the probs will be affected by bunts, steals, and tags, all before the play as well as after both a successful and unsuccessful outcome.

    I’d be happy to share my code with you, although admittedly yours is far more glamorous than mine.

    Please take a look. I’d look forward to hearing your thoughts.

    Mark Malter

    https://malter61.shinyapps.io/BaseballStats/
    https://malter61.shinyapps.io/gamePredictor/

    1. Mark, those are neat Shiny applications; they seemed very easy to use. My only suggestion is to try to graphically display some of the output so I can quickly see some of the effects. It seems that 99% of sabermetrics results are shown through tables, when a well-constructed graph can communicate much better.

      1. Thanks Jim. Good idea. Hopefully I’ll get the graphics going soon.

  2. I took your advice Jim. So far I’ve added graphics for stolen bases and sac bunts. The user can play with the baserunning success rate to see how the expected end state changes.

  3. Hi Jim,

    I recently came across your post while investigating methods for calculating win expectancies and have a quick question about the example above. Where do the values for beta-0 (0.074) and beta-1 (0.632) come from, specifically, as they relate to the probabilities for innings and leads presented in the first part of your series on win expectancies? I’m 40+ years from my college stats courses, so any explanation in baseball terms would be greatly appreciated.

    Thanks!

    1. John:

      In that post, I was fitting a logistic model where log(p / (1 – p)) is a linear function of the runs ahead at the end of the inning, where p is the probability the team wins the game. So log(p / (1 – p)) = beta0 + beta1 * R and you use a standard algorithm to find beta0 and beta1.

      Jim
