The baseball world is still in shock after watching Daniel Murphy’s home run prowess in the MLB playoffs. One natural question to ask is “how surprising is this result based on what we knew about Murphy before the 2015 playoffs?” As a Bayesian, let me describe an exercise that illustrates “learning from data” and might provide some insight into this question.
Murphy’s Home Run Talent
We don’t know Murphy’s talent or ability to hit a home run during the 2015 playoffs. Let’s represent this talent by a number P, which is the probability that Murphy hits a home run in a single AB.
A Prior for P
I’d like to construct a prior for P based on my beliefs about Murphy’s home run ability before the playoffs. We have some data, namely the number of HR and AB for the seven seasons that Murphy has been in professional baseball. I assume that Murphy has a unique probability, say P_j, of hitting a home run in the jth season, and I assume that P_1, ..., P_7 come from a common talent distribution. (This is often called a random-effects model.) Fitting the model to the home run data for the seven seasons, I get the following curve for the talent distribution. I think this is a reasonable approximation to my beliefs about Murphy’s home run ability before the 2015 playoffs.
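The actual fit in the post is done in R with the LearnBayes package; here is a Python sketch of the same beta-binomial random-effects idea. The season HR/AB totals below are made-up placeholders, not Murphy’s real numbers, and the beta family for the talent distribution is the standard choice for this model.

```python
# Sketch of the random-effects (beta-binomial) fit: the season talents
# P_j are assumed to come from a Beta(a, b) talent distribution, and we
# estimate (a, b) by maximum likelihood. HR/AB totals are hypothetical.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import betabinom

hr = np.array([2, 12, 6, 6, 13, 9, 14])             # hypothetical season HR
ab = np.array([131, 508, 391, 338, 563, 596, 538])  # hypothetical season AB

def neg_log_lik(theta):
    a, b = np.exp(theta)                 # optimize on the log scale so a, b > 0
    return -betabinom.logpmf(hr, ab, a, b).sum()

fit = minimize(neg_log_lik, x0=np.log([2.0, 80.0]), method="Nelder-Mead")
a_hat, b_hat = np.exp(fit.x)
print(a_hat, b_hat, a_hat / (a_hat + b_hat))  # fitted Beta and its mean talent
```

The fitted Beta(a_hat, b_hat) curve plays the role of the talent distribution graphed in the post.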
Predicting the NLDS
Based on my prior, I can make a prediction about the number of home runs that Murphy will hit in the NLDS against the Dodgers. I can simulate the predictive distribution by first simulating a home run ability P from my prior and then simulating a number of home runs from a binomial distribution with sample size 21 (the number of at-bats in the NLDS) and probability of success P. I get the following predictions (in 10,000 simulations) for the number of home runs. Note that the actual number of home runs Murphy hit during the NLDS (3) is unusual, but clearly possible in my 10,000 simulations.
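This two-stage simulation (mirroring the post’s rbeta and rbinom calls in R) can be sketched in Python as follows; the Beta(20, 780) prior is a hypothetical stand-in for the fitted talent curve, not the actual fitted values.

```python
# Prior predictive simulation of NLDS home runs: draw a talent P from
# a (hypothetical) Beta(20, 780) prior, then draw HR from Binomial(21, P).
import numpy as np

rng = np.random.default_rng(42)
a, b = 20.0, 780.0          # hypothetical talent prior (mean 0.025)
n_sim, ab_nlds = 10_000, 21

p = rng.beta(a, b, size=n_sim)    # one talent draw per simulation
hr = rng.binomial(ab_nlds, p)     # NLDS home runs given that talent

print(np.bincount(hr))            # predictive counts of 0, 1, 2, ... HR
print(np.mean(hr >= 3))           # chance of 3 or more HR under the prior
```

Under these assumed prior values, 3 home runs lands in the tail of the predictive distribution but still shows up in the simulations.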
Updating My Prior
Now that we have observed Murphy hit 3 home runs in 21 AB, my beliefs about Murphy’s home run hitting will change. There is an easy way to update my prior with this new information, since a beta prior is conjugate to the binomial likelihood. I have graphed below my new prior and compared it with the original prior.
Predicting the NLCS
Based on my new prior, I can make predictions about Murphy’s home run output during the NLCS against the Cubs. I perform the same type of simulation exercise: simulate P from my new prior, and then simulate a number of home runs from a binomial distribution with 17 AB. The graph shows the prediction for the number of NLCS home runs based on 10,000 simulations. Murphy actually hit 4 more home runs in the NLCS. It is interesting to note that this performance against the Cubs was more surprising, given the current beliefs about Murphy’s talent, than the three-home-run performance against the Dodgers.
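The NLCS simulation is the same recipe with the updated prior and 17 at-bats. A Python sketch, where Beta(23, 798) is the hypothetical prior after adding the NLDS data (3 HR, 18 outs) to an assumed Beta(20, 780):

```python
# Predictive simulation of NLCS home runs from the updated prior.
import numpy as np

rng = np.random.default_rng(7)
a_new, b_new = 23.0, 798.0   # hypothetical prior updated after the NLDS
n_sim, ab_nlcs = 10_000, 17

p = rng.beta(a_new, b_new, size=n_sim)
hr = rng.binomial(ab_nlcs, p)

print(np.mean(hr >= 4))      # how surprising were Murphy's 4 NLCS homers?
```

Under these assumed values, 4 or more home runs in 17 AB is rarer in the predictive distribution than 3 in 21 was under the original prior, which is the sense in which the NLCS performance was the more surprising one.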
Predicting the World Series
Now this is the million-dollar question. How many home runs will Murphy hit in the 2015 World Series? We can take the same general approach. First, I revise my beliefs about Murphy’s home run talent P. After seeing Murphy hit four homers in four games in the NLCS, my (new) prior will be adjusted so that his mean rate is about 0.032. If I knew the number of ABs that Murphy will have in the World Series, I could predict the number of home runs he will hit. There are two uncertainties here: we don’t know the length of the World Series, and we don’t know if the AL pitchers will even pitch to Murphy (remember all of the intentional walks given to Barry Bonds?). But I could make reasonable guesses (that is, priors) at the number of games and at the likelihood that Murphy would have an official at-bat, and then make predictions based on my prior on P and this other information.
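One way to fold in those extra uncertainties is to simulate them too. In the sketch below, the talent prior (its mean matching the 0.032 rate mentioned above), the series-length probabilities, and the at-bats-per-game guess are all illustrative assumptions, not fitted quantities:

```python
# World Series prediction with two extra layers of uncertainty:
# the number of games played and Murphy's official at-bats per game.
import numpy as np

rng = np.random.default_rng(2015)
a_ws, b_ws = 27.0, 811.0    # hypothetical prior (mean ~0.032) after the NLCS
n_sim = 10_000

# hypothetical prior on the series lasting 4, 5, 6, or 7 games
games = rng.choice([4, 5, 6, 7], size=n_sim, p=[0.15, 0.25, 0.3, 0.3])
ab = rng.binomial(games * 5, 0.8)   # rough guess: about 4 official AB per game

p = rng.beta(a_ws, b_ws, size=n_sim)
hr = rng.binomial(ab, p)
print(np.bincount(hr))      # predictive distribution of World Series HR
```

Each simulated series draws a length, an AB total, and a talent, so the final predictive distribution reflects all three sources of uncertainty at once.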
What Have We Learned?
I think many of the naive calculations about the “degree of surprise” of Murphy’s post-season home run hitting overstate the rareness of this event. One type of calculation computes the probability of a home run in six consecutive games by taking an estimated probability of hitting a home run in a single game and raising it to the 6th power. But this calculation ignores the fact that we chose this “6 home run” event precisely because it was interesting, and so the resulting probability is too small.
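For concreteness, the naive calculation being criticized looks like this; the per-game rate is a made-up illustration, and the point is that the tiny answer should not be taken at face value:

```python
# The naive "6th power" calculation: it treats the 6-game streak as if
# it were specified in advance, when in fact the event was selected
# after the fact because it happened, so the answer is too small.
p_game = 0.12               # hypothetical chance of homering in a given game
print(p_game ** 6)          # a very small probability, naively computed
```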
I think my Bayesian exercise is closer to the beliefs of a manager. Initially he did not think that Murphy’s home run ability was unusual, but his opinions were modified after seeing his three home runs in the NLDS. Watching Murphy hit HR in four consecutive games against the Cubs was surprising, but not so surprising given what he accomplished in the NLDS. These Bayesian calculations can be modified with different assumptions. A nice feature of this approach is that it allows one to easily revise one’s beliefs about players’ abilities given new information, and to easily make predictions for future games.
Of course, all of the calculations are done using R. The fitting of the random-effects model is done with the LearnBayes package, the simulations use the rbeta and rbinom functions, and the graphs are made with the ggplot2 package.