Beyond Runs Expectancy – Some Background

Introduction: Modeling Runs Scored in an Inning

In last week’s post, I presented a different way of understanding runs potential over different states of an inning defined by the runners on base and the number of outs. The problem of modeling runs scored has been addressed by many people over the years. For example, Keith Woolner in this article describes an RPI formula, and Patriot describes in this post the Tango distribution and the use of a zero-inflated negative binomial distribution.

Some years ago, I did explore the use of standard distributions to represent the runs scored in a half-inning. What I found was that the Poisson model underestimated the probability of a scoreless inning and the negative binomial model (which uses an additional parameter) was unsatisfactory. I could have continued with more sophisticated probability models, but the ordinal response approach seemed pretty easy to apply.

Since some readers may not be familiar with the cumulative logit ordinal response regression model, I will introduce this model by assuming that there is an unobserved run-scoring variable behind the scenes. This motivates the familiar logistic model, and by using more cutpoints, the ordinal regression model. There is an important assumption with this ordinal model, called proportional odds, and I’ll give evidence that this is approximately true for the run scoring data.

Last, I’ll contrast runs expectancy with my runs advantage tables over 20 seasons of data. Tom Tango suggested that, in my ordering of states, the “120” state (runners on 1st and 2nd) should go before the “003” state (runner on third). Runs expectancy suggests that “120” is more valuable than “003” when there are 0 or 2 outs, but runs advantage gives the opposite conclusion.

Logistic Regression

Here is a graphical way to think about logistic regression. Suppose you could measure the run-scoring performance $Z$ of a team — you don’t observe this, but it is the culmination of the team’s ability to get on base and move runners to home. We assume that $Z$ has a logistic (bell-shaped) curve distribution with mean $\mu$ and scale 1 that I show below.

I’ve added a vertical line at $Z = 0$. For each inning we observe whether one or more runs score ($Z > 0$) or no run scores ($Z \le 0$). The mean $\mu$ might depend on other things such as the team, home versus away, or other covariates. This representation is equivalent to a logistic regression model.
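A minimal sketch of this latent-variable picture, using only the Python standard library. The function names and the values of `mu` are illustrative assumptions, not estimates from the run-scoring data. If $Z$ is logistic with mean $\mu$ and scale 1, then $P(Z > 0) = 1 / (1 + e^{-\mu})$, which is the usual logistic regression form. The simulation draws the unobserved $Z$ directly and checks that the share of draws above 0 agrees with the formula.

```python
import math
import random


def prob_run_scores(mu):
    """P(Z > 0) when Z is logistic with mean mu and scale 1."""
    return 1.0 / (1.0 + math.exp(-mu))


def simulate_prob(mu, n=100_000, seed=1):
    """Monte Carlo check: draw the latent Z and count how often Z > 0."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        u = rng.random()
        while u == 0.0:  # avoid log(0) in the inverse-CDF step
            u = rng.random()
        z = mu + math.log(u / (1.0 - u))  # inverse-CDF draw from the logistic
        if z > 0:
            hits += 1
    return hits / n


# mu = -1.0 is a made-up value, chosen only to illustrate the calculation
print(prob_run_scores(-1.0), simulate_prob(-1.0))
```

The two printed numbers should be close, which is the sense in which thresholding the latent $Z$ at 0 is the same model as logistic regression.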

More Cutpoints – Ordinal Regression

In logistic regression we just recorded whether one or more runs score in an inning. Suppose instead we record which of the following five outcomes happens:

no runs scored, 1 run, 2 runs, 3 runs, 4 or more runs

Now we have an ordinal response. We represent the ordinal response model using the same picture, but now four lines divide the five possible outcomes. The mean $\mu$ is a function of the things that affect run scoring, such as the state of the inning (number of outs and bases occupied). So ordinal regression and logistic regression really follow the same representation based on the logistic curve for $Z$. The only difference is that we have one cutpoint (0) for the binary response in logistic regression and four cutpoints for the ordinal response in ordinal regression. When we fit the ordinal regression model, we estimate the boundaries (cutpoints) and the covariate effects.
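To make the cutpoint picture concrete, here is a small sketch that turns a mean $\mu$ and four cutpoints into the probabilities of the five outcomes. It assumes the convention $P(R \le j) = F(c_j - \mu)$, with $F$ the standard logistic cdf, so a larger $\mu$ shifts probability toward more runs. The cutpoints and the value of `mu` are hypothetical, not fitted values.

```python
import math


def logistic_cdf(x):
    """Standard logistic cdf F(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def ordinal_probs(mu, cutpoints):
    """Outcome probabilities when a latent Z ~ Logistic(mu, 1) is cut at
    the sorted cutpoints. With k cutpoints there are k + 1 outcomes."""
    # cumulative probabilities P(R <= j), with 1.0 closing the last category
    cum = [logistic_cdf(c - mu) for c in cutpoints] + [1.0]
    probs = [cum[0]]
    for k in range(1, len(cum)):
        probs.append(cum[k] - cum[k - 1])
    return probs


# hypothetical cutpoints separating 0, 1, 2, 3, and 4+ runs
cutpoints = [0.0, 1.0, 2.0, 3.0]
print(ordinal_probs(0.3, cutpoints))
```

With a single cutpoint at 0, `ordinal_probs(mu, [0.0])` returns the two probabilities of the binary logistic model, which is the sense in which logistic regression is the one-cutpoint special case.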

Proportional Odds

There is one big assumption in ordinal logistic regression. Suppose we wish to compare the run scoring in two different states, say “no runners on with 0 outs” and “runner on 1st with 1 out”. We assume that the run scoring in the two states follows logistic curves of the same shape: the two curves differ only in location and have the same spread.

Suppose we collect the run scoring $R$ in two states and compute the logits

$LOGIT = \log\left(\frac{P(R \le j)}{P(R > j)}\right)$

and we plot these logits against the boundary (value of $j$) for each state. The model assumes that these logit curves are parallel. That is,

LOGIT(state 2) = LOGIT(state 1) + constant.

We can state this in terms of odds, the ratio of the probability of runs less than or equal to $j$ to the probability of runs greater than $j$.

$ODDS = \frac{P(R \le j)}{P(R > j)}$

For one state the ODDS is a constant multiple of the ODDS for a second state, where the constant does not depend on the boundary $j$. This is the proportional odds assumption.

ODDS(state 2) = constant $\times$ ODDS(state 1).
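The proportional odds property can be checked numerically under the latent logistic model. The sketch below assumes $P(R \le j) = F(c_j - \mu)$ with $F$ the standard logistic cdf. The cutpoints and the two state means are made-up values for illustration only.

```python
import math


def logistic_cdf(x):
    return 1.0 / (1.0 + math.exp(-x))


def cumulative_odds(mu, cutpoints):
    """Odds P(R <= j) / P(R > j) at each boundary for a state with mean mu."""
    odds = []
    for c in cutpoints:
        p_le = logistic_cdf(c - mu)
        odds.append(p_le / (1.0 - p_le))
    return odds


cutpoints = [0.0, 1.0, 2.0, 3.0]    # hypothetical boundaries
mu_state1, mu_state2 = 0.0, 0.8     # hypothetical state means

odds1 = cumulative_odds(mu_state1, cutpoints)
odds2 = cumulative_odds(mu_state2, cutpoints)

# the ratio is the same at every boundary j
ratios = [o2 / o1 for o1, o2 in zip(odds1, odds2)]
print(ratios)
```

Every entry of `ratios` equals `exp(mu_state1 - mu_state2)`, so the constant in the display above is determined entirely by the difference in the two state means.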

We can check this assumption graphically for the 2022 season data. Below I plot the observed logits as a function of the boundary for all 24 states defined by number of outs and runners on base. Generally these logit curves do look parallel — this suggests that the ordinal regression model is reasonable for the run-scoring data.
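As a sketch of how the observed logits for one state could be computed, the function below takes the counts of innings ending with 0, 1, 2, 3, and 4 or more runs and returns the logit at each of the four boundaries. The counts in the example are invented for illustration and are not the 2022 data.

```python
import math


def observed_logits(counts):
    """counts[k] = number of innings with k runs; the last entry is the
    top category ('4 or more'). Returns log(P(R <= j) / P(R > j)) for each
    boundary j. Assumes every boundary has innings on both sides;
    otherwise the logit is undefined."""
    total = sum(counts)
    logits = []
    at_or_below = 0
    for c in counts[:-1]:
        at_or_below += c
        above = total - at_or_below
        logits.append(math.log(at_or_below / above))
    return logits


# invented counts for a single state: 0, 1, 2, 3, 4+ runs
example_counts = [50, 20, 15, 10, 5]
print(observed_logits(example_counts))
```

Applying this to each of the 24 states and plotting the results against $j$ gives one curve per state. Under proportional odds the curves differ only by a vertical shift.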

Comparing Runs Expectancy with Ordinal Coefficients

There is an interesting comparison between the familiar runs expectancy table and my “runs advantage table”, which displays the ordinal regression coefficients for all bases/outs situations.

Here is the run expectancy table (using data for 20 seasons 2000-2019) and the graph of these values. Following Tom Tango’s suggestion, the “Bases Score” is defined to be

Bases Score = 1 (runner on 1st) + 2 (runner on 2nd) + 4 (runner on 3rd)
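The Bases Score is easy to compute from the three-character state codes used in this post. In this small sketch, the function name and the assumption that an empty base is coded as "0" are mine.

```python
def bases_score(state):
    """Bases Score for a code such as '120' or '003', where a '0' in a
    position means that base is empty."""
    weights = (1, 2, 4)  # runner on 1st, 2nd, 3rd
    return sum(w for ch, w in zip(state, weights) if ch != "0")


states = ["000", "100", "020", "120", "003", "103", "023", "123"]
for s in states:
    print(s, bases_score(s))
```

The eight bases situations get scores 0 through 7. In particular “120” scores 3 and “003” scores 4, so “120” sits immediately before “003” in this ordering.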

The columns of the table below are ordered by Bases Score. Note that, for 0 and 2 outs, the “003” situation (column 5) has a lower runs expectancy than the “120” situation (column 4).

Here is the corresponding table and graph for my ordinal regression coefficients (again using 20 seasons of data). We note:

• We don’t have the ordering that we saw for runs expectancies — the bases situation “120” has a smaller coefficient value than “003” for each of the three outs values.
• Generally, the ordinal coefficients are monotone increasing as a function of the bases score for each value of outs.
• I should emphasize that these coefficients are based on the proportional odds assumption, so perhaps more study should be made to compare the “120” and “003” situations.

Takeaways

• Probability Modeling of Runs Scored? Run scoring in baseball is not well represented by typical probability models like the Poisson or negative binomial, which motivates the ordinal response regression approach.
• Ordinal and logistic regression are based on the same conceptual logistic distribution framework — you just have more boundaries with the ordinal approach.
• The proportional-odds assumption in ordinal regression seems to be supported by the observed run scoring data. But violations of this assumption for particular cases would be interesting for further study.
• Comparing Value of States. The ordinal regression coefficient estimates support the belief that the runner on 3rd situation is more valuable (from a run scoring perspective) than the runners on 1st and 2nd situation. The runs expectancy values don’t support this pattern when there are 0 or 2 outs.
• How Can a Team Score 0.5 Runs? I think the runs advantage table is better than runs expectancy since it tells you something about the distribution of runs scored, not just the average.
• Other Applications. I have illustrated the use of ordinal regression models in earlier posts. In this post, I explore the ordinal response that is the outcome in balls put in play and in this post, I use an ordinal model to relate expected wOBA with the Statcast barrels definition.