If you search for “openWAR” on this blog site, you’ll see some posts about the use of the openWAR package in downloading play-by-play MLBAM data and computing WAR statistics for players. In the last post, I illustrated how one easily downloads MLBAM play-by-play data using this package. Also, we have cited the JQAS paper that describes this implementation of WAR. Since this paper was written for a statistical audience and might be tough reading with all of the equations, I thought I’d make some general comments about the paper and the methodology so one understands better what is happening behind the scenes with openWAR.
What’s the Point?
Although WAR is a popular measure of the total contribution of ball players, including defense, hitting, and baserunning, there are different implementations of WAR, and so the recipe remains a mysterious black box for most fans. One aim of the paper is to provide a clear implementation of the WAR methodology, hence “open” WAR. Another motivation of the paper is statistical: any measure of performance like WAR is just an estimate of a player’s talent, and so it is of interest to develop standard errors or margins of error for WAR measures.
Start with Run Values
One starts with the familiar run value of a plate appearance. One computes the expected number of runs before and after the plate appearance. The change in expected runs plus any runs scored on the play represents the value of the plate appearance. I’ll use a schematic diagram from the paper to help describe the methodology. As indicated in that diagram, this run value is attributed to the offense and its negative is attributed to the defense.
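To make the run value calculation concrete, here is a minimal Python sketch. The run expectancy numbers, the state labels, and the function names are all made up for illustration; they are not the estimates or the code used by openWAR.

```python
# Hypothetical run expectancy table: expected runs in the rest of the inning
# for a few (runners, outs) states. The numbers are made up for illustration.
RUN_EXPECTANCY = {
    ("---", 0): 0.50,
    ("1--", 0): 0.90,
    ("-2-", 0): 1.10,
    ("---", 1): 0.25,
}


def expected_runs(state):
    """Expected runs for a (runners, outs) state; zero once the inning is over."""
    runners, outs = state
    if outs >= 3:
        return 0.0
    return RUN_EXPECTANCY[state]


def run_value(before, after, runs_scored):
    """Change in expected runs plus any runs scored on the play."""
    return expected_runs(after) - expected_runs(before) + runs_scored


# A leadoff single: bases empty, no outs -> runner on first, no outs.
single = run_value(("---", 0), ("1--", 0), runs_scored=0)
# A leadoff home run: the state is unchanged, but one run scores.
home_run = run_value(("---", 0), ("---", 0), runs_scored=1)
# A leadoff out: bases empty, no outs -> bases empty, one out.
out = run_value(("---", 0), ("---", 1), runs_scored=0)

print(round(single, 2), round(home_run, 2), round(out, 2))
```

The offense would be credited with each of these values and the defense with the negative of each.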
Suppose a particular play has a positive offensive run value. It is hard to interpret this value since there are many confounding factors. For example, the offense may have this positive run value because the batter was hitting in an “offensive-friendly” ballpark or was hitting against a pitcher from the opposite side. It is desirable to adjust this run value for these confounding factors.
Statistically, one performs this adjustment by the use of regression. One fits a linear model where one wishes to predict run value based on the ballpark and the platoon effect. One obtains predictions of the form
Predicted run value = constant + a1 ball_park1 + … + a30 ball_park30 + b opposite_side
where ball_park1, …, ball_park30 and opposite_side are indicators for the particular ball park and the opposite side (1 if true and 0 if not true), and a1, …, a30, b are the regression coefficient estimates.
From this fitted model, one computes the residual which is equal to
Residual = Actual run value – Predicted run value
This residual, which appears in the above figure, removes the effect of ball park and platoon side from the run values.
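Here is a small Python sketch of this adjustment on invented data with only two ball parks. It fits the linear model by ordinary least squares and computes the residuals. The data, the helper names `solve` and `fit_ols`, and the choice to drop one park indicator as a baseline (so the coefficients are uniquely determined alongside the constant) are my assumptions for the sketch, not details taken from the paper.

```python
def solve(A, b):
    """Solve the square system A x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    p = len(X[0])
    XtX = [[sum(row[j] * row[k] for row in X) for k in range(p)] for j in range(p)]
    Xty = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(p)]
    return solve(XtX, Xty)


# Toy plays: (ball park, opposite-side indicator, run value).
# The values are invented for illustration only.
plays = [
    ("A", 0, 0.10), ("A", 1, 0.30), ("A", 0, -0.20), ("A", 1, 0.00),
    ("B", 1, 0.50), ("B", 0, 0.20), ("B", 1, 0.40), ("B", 0, -0.10),
]

# Design matrix: constant, indicator for park B (park A is the baseline),
# and the opposite-side (platoon) indicator.
X = [[1.0, 1.0 if park == "B" else 0.0, float(opp)] for park, opp, _ in plays]
y = [value for _, _, value in plays]

coef = fit_ols(X, y)
predicted = [sum(c * x for c, x in zip(coef, row)) for row in X]

# Residual = actual run value - predicted run value.
residuals = [actual - fit for actual, fit in zip(y, predicted)]
```

By construction, these residuals average to zero within each ball park and within each platoon situation, which is the sense in which the park and platoon effects have been removed.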
Adjustments, adjustments, and more adjustments
Basically this same regression method is used to adjust the run values for other variables that influence run production. Let’s focus here on the offensive side. We have already adjusted the run value for platoon and ball park effects. Suppose a batter hits a single with runners on 1st and 2nd, scoring a run with the runner on 1st moving to third. Part of this run scoring and advancement of runners is attributed to the batter, but part of the success is also due to the runners — maybe the runner advancement is more than one would expect based on the bases/outs situation. The authors perform another regression for this adjustment.
- First one uses regression to predict the park- and platoon-adjusted run value (the residual from the first regression) using the outs/runners state and the type of batting event. After doing this, one computes a new residual that removes the effect of outs/runners and batting play. In words, the regression model says that the adjusted run value equals the value predicted from the outs/runners state and batting event plus this new residual.
- This new residual measures the benefit of the baserunners above average: if the runners take more extra bases than expected, then the residual will be positive. If there are multiple base runners, one partitions this value among the runners.
- Last, the fitted value from the model is used to measure the portion of run value attributed to the hitter.
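The three steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the data are invented, the fitted value is taken to be the average adjusted run value for each state/event combination (which is what a regression with a separate indicator for every combination would give), and the residual is split equally among the runners as a placeholder, since this post does not describe the paper's partition rule.

```python
from collections import defaultdict
from statistics import mean

# Toy plays: outs/runners state, batting event, park- and platoon-adjusted
# run value, and the runners on base. All values are invented for illustration.
plays = [
    {"state": ("12-", 0), "event": "single", "adjusted": 0.90, "runners": ["r1", "r2"]},
    {"state": ("12-", 0), "event": "single", "adjusted": 0.50, "runners": ["r3", "r4"]},
    {"state": ("1--", 1), "event": "single", "adjusted": 0.40, "runners": ["r5"]},
    {"state": ("1--", 1), "event": "single", "adjusted": 0.20, "runners": ["r6"]},
]

# Fitted value: the average adjusted run value for each state/event pair.
groups = defaultdict(list)
for play in plays:
    groups[(play["state"], play["event"])].append(play["adjusted"])
fitted = {key: mean(values) for key, values in groups.items()}

for play in plays:
    key = (play["state"], play["event"])
    # Fitted value: the portion of the run value credited to the hitter.
    play["hitter"] = fitted[key]
    # New residual: what the baserunners added above (or below) average.
    play["baserunning"] = play["adjusted"] - play["hitter"]
    # Placeholder assumption: split the residual equally among the runners.
    runners = play["runners"]
    share = play["baserunning"] / len(runners) if runners else 0.0
    play["runner_shares"] = {runner: share for runner in runners}

for play in plays:
    print(play["state"], play["event"], play["hitter"], play["runner_shares"])
```

In the first toy play the runners advanced further than is typical for a single in that situation, so their residual is positive, while the hitter receives the same average credit on both singles hit in that state.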
This explains in a general way the right-hand side of the authors’ schematic diagram. The authors use the notation RAA to denote run values, where the superscripts (hit, runner, pitch, field) represent the run contributions for different types of baseball play.
The Defensive Side
Figuring out what to do with the negative of the run value is more complicated since the defense is both the pitcher and the fielders. Some plays such as walks, strikeouts and home runs are entirely the pitcher’s responsibility. The success of balls in play depends partly on the pitcher and partly on the fielders. Here the location of the batted ball is a factor. A more complete discussion requires another post. But again the methodology rests on the general notion of adjustment: one fits a regression model based on variables that one believes are relevant for run production, and the residual represents the run contribution adjusted for these variables. In fact, it can be said that one of the most important statistical ideas in baseball is the notion of adjustment. (A raw baseball statement without any information about context is not very informative.)
One BIG idea in the openWAR methodology is the use of regression to make adjustments for run production, and the purpose of this post is to give some insight (not using equations) about this adjustment method. A second BIG idea is the “above replacement” idea: once one finds a player’s total run contribution, one wishes to see how much better he is than a replacement player, and that requires a careful definition of “replacement”. The last BIG idea is giving standard errors for these openWAR estimates, and I’ll explain the authors’ bootstrap procedure in a future post.