When you visit Baseball Reference’s box score pages, such as the extended box score of the final game of the 2014 season, you will see a display of win probabilities. One particular graph shows the probability of the Giants winning the game after each play. These probabilities are used to measure player performance. For example, Madison Bumgarner was the most valuable player of this game since his total win probability added (WPA) over all of his plays was 0.603. (On a side note, Jay Bennett and I presented these types of calculations in Curve Ball.)
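
As a small illustration of how a total WPA is tallied, here is a sketch in R. The players and win probabilities below are invented for illustration and are not taken from any real game. The WPA of a play is taken to be the change in the team's win probability across that play, and a player's total is the sum over his plays.

```r
# Hypothetical data: a team's win probability before and after five plays.
# Player labels and probabilities are made up for illustration only.
plays <- data.frame(
  player    = c("A", "B", "A", "C", "A"),
  wp_before = c(0.50, 0.55, 0.52, 0.60, 0.58),
  wp_after  = c(0.55, 0.52, 0.60, 0.58, 0.70)
)

# WPA of a play = win probability after the play minus before the play.
plays$wpa <- plays$wp_after - plays$wp_before

# Total WPA for each player = sum of WPA over that player's plays.
total_wpa <- tapply(plays$wpa, plays$player, sum)
total_wpa
```

Because each play starts where the previous one ended, the WPA values for all plays sum to the overall change in win probability.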

It is straightforward to compute these win probabilities. A first step in this calculation is to compute, at the end of each inning, the probability the home team wins the game. We illustrate one method of performing this computation using Retrosheet game logs for a particular season.

A function `plot.prob.home` is available on this GitHub gist page. Let me outline the main steps of this function.

- The game logs for the particular season are downloaded from Retrosheet. A short file is also downloaded from my site that gives the header (variable names) for the game log file.
- The game log file contains the line scores for the home and visiting teams. The function will parse these line scores and create a numerical variable of runs scored for all innings. (The string function `strsplit` is helpful for doing this parsing.)
- It didn’t seem obvious how to parse line scores for games with an inning in which 10 or more runs are scored, so I have omitted those line scores. I don’t think this would impact the results very much.
- A logistic regression model is used to develop a smooth curve for predicting the probability $p$ of a home team victory given the home team's lead. This model has the form

  $$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \, \text{lead}.$$

  This model is applied at the end of each of the first eight innings. The regression intercept $\beta_0$ represents the home team advantage, and the slope $\beta_1$ represents the additional advantage for the home team (on the logit scale) for each additional run of lead.
- I use the `ggplot2` package to plot the probability the home team wins as a function of the inning number (1 through 8) for each of the home team leads -4, -3, -2, -1, 0, 1, 2, 3, 4.
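
The parsing and model-fitting steps above can be sketched in a few lines of R. This is not the code from the gist. The helper names `parse_line_score` and `home_lead` are hypothetical, the line score format (one character per inning, `x` for an unplayed bottom half, parentheses around an inning of 10 or more runs) is my reading of the Retrosheet game log layout, the model is fit with base R's `glm` rather than anything from `arm`, and the game outcomes are simulated with made-up coefficients so that the example runs without downloading data.

```r
# Parse one line score string into a numeric vector of runs per inning.
# Assumed format: one character per inning, "x" for an unplayed bottom half,
# parentheses around any inning with 10 or more runs.
parse_line_score <- function(line) {
  if (grepl("(", line, fixed = TRUE)) return(NULL)  # omit 10+ run innings
  chars <- strsplit(line, "")[[1]]
  chars[chars == "x"] <- "0"
  as.numeric(chars)
}

# Home team lead at the end of a given inning (NA if a line score is omitted).
home_lead <- function(home_line, visitor_line, inning) {
  home <- parse_line_score(home_line)
  visitor <- parse_line_score(visitor_line)
  if (is.null(home) || is.null(visitor)) return(NA_real_)
  sum(home[1:inning]) - sum(visitor[1:inning])
}

lead_after_5 <- home_lead("01002000x", "000100000", 5)

# Simulated data standing in for one inning of a season of game logs.
# The coefficients 0.15 and 0.6 are invented for illustration only.
set.seed(2014)
n <- 2000
lead <- sample(-4:4, n, replace = TRUE)
home_win <- rbinom(n, 1, plogis(0.15 + 0.6 * lead))

# Logistic regression of the home win indicator on the home team lead.
fit <- glm(home_win ~ lead, family = binomial)

# Smoothed probability of a home win for leads of -4 through 4.
pred <- predict(fit, newdata = data.frame(lead = -4:4), type = "response")
round(pred, 3)
```

With real game logs, one such fit would be made for each of innings 1 through 8, using the lead at the end of that inning as the predictor.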

Here are a couple of illustrations of the use of this function for two seasons, 1980 and 2013. Assuming you have the packages `devtools`, `arm`, and `ggplot2` installed, you can just type the following R code into the Console window to see these graphs and have the probabilities displayed in a data frame.

    library(devtools)
    source_gist("70a166149f71622fed97")
    plot.prob.home(1993)

    plot.prob.home(2013)

Several comments about what we see in these graphs.

- Note that the home team has a slight advantage at the beginning of the game, with a probability of winning of about 0.54. The size of the home advantage has stayed pretty consistent through recent baseball seasons. Also, the home team maintains this slight advantage for tied games at the end of each inning.
- As one might expect, the probability the home team wins with a small lead gets closer to 1 as the game approaches the end. One way of measuring the impact of a team’s closer is to compare these late-game probabilities across teams.
- Following up on this last comment, if you look at the description of the WPA methodology, you will note that these probability calculations are for an “average team”, and it would be desirable to get probability and WPA estimates that take into account the offensive and defensive capabilities of the individual teams.
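
The first comment can also be read on the logit scale. Assuming a logistic model in which the logit of the home win probability is an intercept plus a slope times the home team lead, a tied-game probability of 0.54 corresponds to an intercept of log(0.54 / 0.46), or about 0.16. The short R sketch below does this conversion and then shows how the probability changes with the lead. The slope of 0.5 is a hypothetical value chosen for illustration, not an estimate from the game logs.

```r
# Tied-game home win probability quoted in the comments above.
p_tied <- 0.54

# Intercept on the logit scale: log(p / (1 - p)).
beta0 <- qlogis(p_tied)

# Hypothetical slope: extra advantage (logit scale) per run of lead.
beta1 <- 0.5

# Convert back to probabilities for home team leads of -4 through 4.
leads <- -4:4
win_prob <- plogis(beta0 + beta1 * leads)
data.frame(lead = leads, win_prob = round(win_prob, 3))
```

A larger slope, as one would expect in later innings, pushes the probabilities for nonzero leads closer to 0 and 1 while leaving the tied-game value at 0.54.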

The calculation of these win probabilities is a start towards computing the WPA for each baseball play. The WPAs can be found by combining this work with run expectancies; the details will appear in a future post.
