As you may know, I am a Bayesian statistician and so I think of inference and prediction from a Bayesian perspective. Although many of my posts have illustrated Bayesian thinking, I thought it would be good to have a new series of posts introducing Bayesian modeling in simple settings. Here I introduce some features of Bayes, using a model that should be familiar to all “quantitative” baseball fans. (By the way, I know that some MLB teams have placed ads for analysts who are familiar with Bayesian inference, so MLB understands the usefulness of this inferential approach.)
Bill James’ Pythagorean Formula
Many years ago, Bill James found empirically a simple relationship between a team’s wins and losses and the runs scored and allowed. This Pythagorean relationship has the simple form that the ratio of wins (W) to losses (L) is equal to a power of the ratio of the runs scored (R) to runs allowed (RA):

W / L = (R / RA) ^ k
This formula is easy to demonstrate graphically. Below I construct a scatterplot of W/L against R/RA for the 2019 MLB teams and overlay a Pythagorean curve using the exponent value k = 1.8.
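As a quick numerical illustration of the formula, here is a short sketch (the post works in R; this is an equivalent Python version, and the runs ratio of 1.1 is a made-up example value, not a 2019 figure):

```python
# Pythagorean prediction: W/L = (R/RA)^k, and since W + L = games,
# wins = games * (R/RA)^k / (1 + (R/RA)^k).
def pythag_wins(runs_ratio, k=1.8, games=162):
    wl = runs_ratio ** k           # predicted win/loss ratio
    return games * wl / (1 + wl)   # convert the ratio to wins

# A hypothetical team scoring 10% more runs than it allows:
print(round(pythag_wins(1.1), 1))  # about 87.9 wins
```

A team with equal runs scored and allowed comes out at exactly 81 wins, as it should.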
A Statistical Model Based on the Formula
Here is a Bayesian extension of this Pythagorean relationship. We assume that a team’s actual win/loss ratio W/L is normally distributed with a mean given by the runs ratio raised to a power k, and a standard deviation σ. We write this as

W/L ~ Normal((R/RA)^k, σ)
The unknown parameters of this model (the quantities we want to learn about) are the exponent value k and the standard deviation σ. In Bayesian thinking, we regard these unknown parameter values as random and we express our beliefs about the location of these values by means of a prior probability distribution.
Figuring Out the Prior
Okay, this is one of the tricky Bayesian things — how does one specify priors for the model parameters k and σ? I describe some ways of doing this below.
- Independence. First, I am going to assume that my prior beliefs about the power parameter k are independent or unrelated to my prior beliefs about σ. That seems to be a reasonable assumption and it makes it easier to construct a prior.
- Prior for the Pythagorean exponent k? When Bill James introduced this formula, he said that k was in the neighborhood of the value 2. So a reasonable guess at k would be 2 and this would be the mean of my prior. So I start with the belief that k is normally distributed with mean 2 and a standard deviation S1. The value of the standard deviation S1 reflects the strength of my prior belief that k is equal to 2. Suppose I don’t have a lot of confidence in the guess of 2, and think that k could plausibly be any value between 1.5 and 2.5. Based on this statement, I assume that S1 = 0.5. My prior for k is normal with mean 2 and standard deviation 0.5.
- Prior for the standard deviation σ? It is pretty hard to construct a prior for the standard deviation σ. This standard deviation reflects the variation of a team’s win/loss ratio for a fixed value of the runs ratio. (Looking at the scatterplot above, these would be the vertical deviations from the average for a fixed value of the horizontal variable R/RA.) This might be a hard parameter to guess at since we aren’t that familiar with teams’ win/loss ratios and how they can vary. In cases like this when it is hard to think of a prior, one can use a weakly informative prior for σ that reflects little knowledge about this parameter. Here is one weakly informative prior we can try: σ has an exponential distribution with rate 1.
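Two quick numerical checks on these prior choices, sketched in Python. The 1.5 to 2.5 range for k is exactly one standard deviation either side of the mean 2, so it holds roughly 68% of the prior probability; and for an exponential with rate 1, P(σ > c) = exp(−c), which shows how much mass this weak prior puts on large values of σ:

```python
from math import erf, exp, sqrt

def norm_cdf(x, mean, sd):
    # normal CDF expressed through the error function
    return 0.5 * (1 + erf((x - mean) / (sd * sqrt(2))))

# Prior probability that k lies between 1.5 and 2.5 under normal(2, 0.5):
p_k = norm_cdf(2.5, 2, 0.5) - norm_cdf(1.5, 2, 0.5)
print(round(p_k, 3))  # 0.683

# Under sigma ~ exponential(rate 1), the survival function is exp(-c):
print(round(exp(-0.5), 3))  # P(sigma > 0.5) is about 0.607
print(round(exp(-1.0), 3))  # P(sigma > 1.0) is about 0.368
```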
Simulating from the Predictive Distribution
- Does this prior make sense? So we have stated a prior where the exponent k is normal(2, 0.5) and σ is exponential(1). Is this a reasonable prior? Before we go further, a good idea is to perform so-called predictive checks. Basically, what we do is to simulate data, that is win/loss ratios, from the predictive distribution assuming our prior and see if this simulated predicted data looks like data we might see in baseball seasons. If this simulated data looks fine, then our prior is reasonable.
- A predictive simulation. Suppose we have a team whose runs ratio is equal to 1.2 — that is, this team scores 20% more runs than it allows. What values of the W/L ratio would we predict for this team assuming our prior model? We do this in two steps: (1) We first simulate values of k and σ from our prior (from normal and exponential distributions) and (2) simulate W/L ratios from a normal distribution with mean (1.2)^k and standard deviation σ. I can do this using three R commands.
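The post carries out these two steps in R; here is an equivalent NumPy sketch (the sample size of 1000 and the seed are my choices, and note that NumPy parameterizes the exponential by its scale, which is the mean, so rate 1 corresponds to scale 1):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Step 1: draw (k, sigma) pairs from the prior
k = rng.normal(2, 0.5, n)        # k ~ normal(2, 0.5)
sigma = rng.exponential(1, n)    # sigma ~ exponential(rate 1), i.e. mean 1

# Step 2: draw W/L ratios for a team with runs ratio R/RA = 1.2
wl_ratio = rng.normal(1.2 ** k, sigma)
```

With σ allowed to be this large, some of the simulated W/L ratios come out negative, which foreshadows the problem below.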
Since we are not familiar with win/loss ratios, I will convert those ratios to games won in a 162 game season, and then display a histogram of the games won.
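The conversion follows from W/L = r and W + L = 162, which give W = 162·r/(1 + r). A small sketch of this step (the ratios here are made-up illustrative values, including a negative one of the kind the prior simulation produces):

```python
def ratio_to_wins(r, games=162):
    # W/L = r and W + L = games imply W = games * r / (1 + r)
    return games * r / (1 + r)

for r in [1.0, 1.5, -0.2]:   # hypothetical simulated W/L ratios
    print(round(ratio_to_wins(r), 1))
# prints 81.0, 97.2, and -40.5 (a negative number of wins!)
```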
Even the casual baseball fan will realize that this plot does not make sense — how can a baseball team win a negative number of games or even win 150 games in a 162-game season?
What Went Wrong and Adjusting My Prior
What we showed is that if we assume a particular prior for k and σ, we get unrealistic predictions for the number of games won (in a full season) by a team that scores 20% more runs than it allows. If our prior results in bad predictions, this suggests that we need to adjust our prior. Recall our prior is
k is normal(2, 0.5) and σ is exponential(1)
After some reflection, I think my prior on k is reasonable, but the prior on σ puts a lot of probability on large values of σ, which seems unreasonable given the scatterplot that we saw earlier. (The variation of the points for fixed values of R/RA is certainly smaller than 1.) I am going to revise my prior on σ so that the mean is 0.1 instead of 1, so my prior is now
k is normal(2, 0.5) and σ is exponential(10)
(By the way, the rate parameter of an exponential is the reciprocal of the mean, so a rate of 10 is the same as a mean of 1/10 = 0.1.) Let’s simulate new data using this prior. Again assume we have a team that scores 20% more runs than it allows. Using the same simulation scheme, here is a histogram of the number of wins for this team in a 162-game season. This plot looks more reasonable, although it seems somewhat likely for this team to win 100+ games.
Comments and Looking Ahead
- This post was inspired by my reading of the 2nd edition of Richard McElreath’s Bayesian regression text Statistical Rethinking, where McElreath talks a lot about the use of predictive simulations to understand the consequences of a particular choice of prior.
- The idea here is to introduce Bayesian modeling in a setting that is familiar. We took Bill James’ Pythagorean formula and created a corresponding regression model, allowing for variability in the W/L ratio for a fixed R/RA ratio.
- One advantage of Bayesian thinking is that it gives one the opportunity to use expert information in constructing a prior. But we see that prior construction can be challenging, especially when dealing with a parameter like σ that is harder to interpret and specify. One is tempted, as we did, to try a weakly informative prior instead of doing the extra work to specify an informative prior.
- Predictive simulation checks are very helpful in seeing if our prior makes sense. We saw that the choice of a weak prior for σ was not great since it led to predictions for game wins that seemed counter to what we know about games won in a season. This predictive check motivated us to think harder, and revise our prior for σ.
- In part II of this post, we’ll continue the study of this model and use our prior and data from a recent season in a Bayesian posterior analysis. In particular, we will illustrate how we can perform both inference (learning about the locations of k and σ) and prediction (how many games a team will win in a future season).