I’m continuing my discussion of the openWAR methodology implemented in the openWAR package. Specifically, I’ll focus on the adjustment issue. To quote from the Baumer, Jensen and Matthews paper (I’ve added some bold for emphasis),
“We begin our modeling of offensive run value by adjusting (run value) for several factors beyond the control of the hitter or baserunners that make it difficult to compare run values across contexts. Specifically, we want to first adjust for the ballpark of the event and any platoon advantage the batter may have over the pitcher.”
I’ll illustrate this adjustment, which is incorporated into the openWAR methodology.
Adjusting run value for ballpark
In a previous post, I illustrated computing the run values for all plate appearances from the play-by-play data scraped using the openWAR package. It is straightforward to compute the mean run value for all 2015 players. In the graph below, I plot the mean run value against the number of plate appearances. I add a smoothing curve to the plot; it indicates (as expected) that players with more PAs (the regulars) tend to have larger mean run values. I have plotted the values for the Rockies players in red. From this graph, it seems that most of the Rockies are good hitters, since the red points tend to lie above the smooth.
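As a rough sketch of that summary step, the base R code below computes the number of plate appearances and the mean run value for each batter. The tiny data frame and its column names (batter, run_value) are hypothetical stand-ins for the play-by-play data, not the actual openWAR output.

```r
# Hypothetical play-by-play data: one row per plate appearance.
# The column names (batter, run_value) are illustrative assumptions.
pa <- data.frame(
  batter    = c("A", "A", "A", "B", "B", "C"),
  run_value = c(0.5, -0.25, 0.25, -0.25, 1.25, -0.5)
)

# Number of plate appearances and mean run value for each batter
n_pa    <- tapply(pa$run_value, pa$batter, length)
mean_rv <- tapply(pa$run_value, pa$batter, mean)

# Collect the two summaries in one data frame, one row per batter
summary_df <- data.frame(
  batter  = names(n_pa),
  PA      = as.vector(n_pa),
  mean_rv = as.vector(mean_rv)
)
print(summary_df)
```

With real data, one could plot mean_rv against PA and overlay a smoother such as lowess() to get a picture like the one described above.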
But wait: this is not a fair comparison, since the Rockies players have an advantage. They play half of their games in Coors Field, which might explain their hitting success. We want to adjust, or control, the batting performance measure for the ballpark.
I’ll illustrate this process using regression, since this approach applies to general adjustments (the adjustment variable can be either discrete or continuous).
- In the play-by-play dataset, I define a new variable bat_team that gives the ballpark where the player is hitting.
- Using the R lm function, I fit a regression of Runs on the factor variable bat_team. Inspecting the regression coefficients shows which ballparks are advantageous to the hitter and which ones are not.
To perform the adjustment, we compute a residual:
RESIDUAL = ACTUAL RUNS – FITTED RUNS FROM MODEL
So each plate appearance now has a new performance measure, RESIDUAL, that adjusts the run value for the ballpark.
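The regression and residual steps can be sketched in a few lines of base R. This is a minimal illustration on simulated data, not the code from the gist: the three park codes, the size of the park effects, and the noise level are all made up, and only the names bat_team, Runs, and RESIDUAL come from the text.

```r
# Simulated plate appearances. The park codes, park effects, and noise
# level are hypothetical and chosen only to make the mechanics visible.
set.seed(1)
n <- 3000
pa <- data.frame(
  bat_team = sample(c("COL", "NYN", "SFN"), n, replace = TRUE)
)
park_effect <- c(COL = 0.05, NYN = 0.00, SFN = -0.03)
pa$Runs <- unname(park_effect[as.character(pa$bat_team)]) + rnorm(n, sd = 0.4)

# Regress run value on the ballpark factor
fit <- lm(Runs ~ factor(bat_team), data = pa)
# The intercept is the mean for the baseline park (the first level, COL);
# the other coefficients are differences from that baseline.
print(coef(fit))

# RESIDUAL = ACTUAL RUNS - FITTED RUNS FROM MODEL
pa$RESIDUAL <- pa$Runs - fitted(fit)

# After the adjustment, the mean residual within every park is zero
# (up to floating-point error)
park_means_after <- tapply(pa$RESIDUAL, pa$bat_team, mean)
print(round(park_means_after, 10))
```

Because the only predictor is a factor, the fitted value for each plate appearance is simply the mean run value in that park, so the residual measures performance relative to the park average. Using lm rather than subtracting group means directly keeps the same code usable when the adjustment variable is continuous.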
The following graph plots the mean residuals against PA for all hitters in the 2015 season, with a smoothing curve. Again, the red dots correspond to the Rockies. What do we see? Now the red dots fall fairly evenly on both sides of the curve. This tells us that the Rockies hitters are really average hitters; they appeared to be better in the earlier graph because of the Coors Field effect.
Adjusting run value for the platoon effect
The same methodology can be used to adjust the batting performance for the platoon effect (batter side and pitcher arm). I break down the PA data by the platoon variable; the following graph shows the mean run values against the PA for each of the possible splits. I have added red horizontal lines that show the general platoon effects: left-handed hitters tend to do well against right-handed pitchers, hitters against pitchers of the same side seem “average”, and left-handed hitters against left-handed pitchers tend to do worse.
To adjust for the platoon effect, I fit a regression model where the predictor is one of the four platoon types (I created a new factor variable that pastes together the batter side and pitcher side variables). Again, I compute the residuals, and the following graph shows the mean residual values for each platoon situation. Note that the red horizontal lines are now at zero, which shows that we’ve adjusted the run values for the platoon effect.
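Here is a small sketch of the platoon adjustment under the same caveats as before: the data are simulated, the column names stand and throws are assumed names for batter side and pitcher arm, and the size of the platoon advantage is invented for illustration.

```r
# Simulated plate appearances; stand (batter side) and throws (pitcher arm)
# are assumed column names, and the effect size below is hypothetical.
set.seed(2)
n <- 4000
pa <- data.frame(
  stand  = sample(c("L", "R"), n, replace = TRUE),
  throws = sample(c("L", "R"), n, replace = TRUE)
)

# Paste the two sides into a single four-level factor, e.g. "L vs R"
pa$platoon <- factor(paste(pa$stand, "vs", pa$throws))

# Run values with a small advantage for left-handed batters
# facing right-handed pitchers
pa$Runs <- ifelse(pa$platoon == "L vs R", 0.02, 0) + rnorm(n, sd = 0.4)

# Regress on the platoon factor and keep the residuals as the adjusted value
fit <- lm(Runs ~ platoon, data = pa)
pa$RESIDUAL <- residuals(fit)

# Mean run value by platoon type before and after the adjustment
before <- tapply(pa$Runs, pa$platoon, mean)
after  <- tapply(pa$RESIDUAL, pa$platoon, mean)
print(round(cbind(before, after), 4))
```

The "after" column is zero for every platoon type, which corresponds to the red horizontal lines sitting at zero in the adjusted graph.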
Baumer et al. perform many other adjustments of run values in their paper. But the methodology is the same: one first identifies suitable variables that influence run production, uses regression to understand the effect of these variables, and computes residuals to perform the adjustment. In fact, much of sabermetrics research is focused on the identification of confounding effects (like ballpark and platoon effect) and the removal of these effects through adjustment.
Although I have not focused on the programming, the gist site contains all of the R work for this example.