Linear Weights on R

I recently received an email from Michael who was reading Curve Ball and was wondering how to compute the linear weights measure of offensive performance. (My coauthor Jay Bennett spent several chapters describing the assortment of measures of batting performance.) In this post, I’ll describe the R computation of these linear weights measures using data from the Lahman package and then discuss some interesting extensions of this method.

Computing the Linear Weights

Using the `Teams` data frame in the `Lahman` package, I first compute the number of singles (hits minus doubles, triples, and home runs) and then standardize all of the count variables by dividing each by the number of games.

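```
library(Lahman)
library(dplyr)

# Singles are not stored directly, so derive them from hits,
# then convert each count to a per-game rate.
Teams <- mutate(Teams,
                R_G = R / G,
                X1B = H - X2B - X3B - HR,
                X1B_G = X1B / G,
                X2B_G = X2B / G,
                X3B_G = X3B / G,
                HR_G = HR / G,
                BB_G = BB / G,
                SB_G = SB / G,
                CS_G = CS / G)
```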

Using the R function `lm`, I predict team runs per game (using data from the 2000 through 2014 seasons) from the per-game counts of singles, doubles, triples, home runs, walks, stolen bases, and caught stealing. The estimated coefficients in this linear fit are similar in value to the coefficients reported in Curve Ball, which were estimated using data from earlier seasons.

```
# Regress runs per game on the per-game event rates, seasons 2000-2014.
fit2 <- lm(R_G ~ X1B_G + X2B_G + X3B_G + HR_G + BB_G + SB_G + CS_G,
           data = filter(Teams, yearID >= 2000, yearID <= 2014))
fit2
# Coefficients:
# (Intercept)        X1B_G        X2B_G        X3B_G         HR_G
#    -3.07429      0.55615      0.77286      1.20914      1.49853
#        BB_G         SB_G         CS_G
#     0.33919      0.13084      0.07992
```
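One way to read these weights is to apply them to a team's per-game line. The sketch below hardcodes the coefficients printed above and applies them to a made-up team. The per-game rates in `team` are hypothetical values chosen only for illustration, and the names `weights`, `team`, and `pred_R_G` are not from the original analysis.

```r
# Coefficients copied from the 2000-2014 fit shown above.
weights <- c("(Intercept)" = -3.07429, X1B_G = 0.55615, X2B_G = 0.77286,
             X3B_G = 1.20914, HR_G = 1.49853, BB_G = 0.33919,
             SB_G = 0.13084, CS_G = 0.07992)

# Hypothetical per-game rates for an illustrative team (not real data).
team <- c(X1B_G = 6.0, X2B_G = 1.8, X3B_G = 0.2, HR_G = 1.0,
          BB_G = 3.3, SB_G = 0.6, CS_G = 0.25)

# Predicted runs per game = intercept + sum of (weight * rate).
pred_R_G <- unname(weights["(Intercept)"] + sum(weights[names(team)] * team))
pred_R_G
```

With the fitted model object available, `predict(fit2, newdata = ...)` on a one-row data frame with the same column names should give the same number, up to rounding of the printed coefficients.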

Graphing the Residuals

This model describes in a general way how singles, doubles, etc. contribute to scoring runs. But some teams are more efficient than others at turning these events into runs. To illustrate, I plot the residuals (actual minus predicted runs per game) of this least-squares fit for the seasons 2000 through 2014. Each point corresponds to a team-season, and I have labeled the four team-seasons with large residuals (larger than 0.4 in absolute value).

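```
library(ggplot2)

# Same rows, in the same order, as the data used to fit fit2,
# so the residuals line up with the teams.
teams <- filter(Teams, yearID >= 2000, yearID <= 2014)
teams <- mutate(teams, residual = residuals(fit2))

# Plot residuals by season and label the team-seasons beyond +/- 0.4.
ggplot(teams, aes(yearID, residual)) +
  geom_point() +
  geom_smooth() +
  geom_label(data = filter(teams, residual > .4 | residual < -.4),
             aes(label = teamID)) +
  ggtitle("Residuals from Least-Squares Fit with Several Large Residuals Labeled")
```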

The 2002 Phillies, the 2005 Diamondbacks, and the 2001 Giants were unusually poor at creating runs given their numbers of singles, doubles, etc. In contrast, the 2013 Cardinals were remarkably efficient at creating runs (there is probably a story one could write that would help to explain why).

Yearly Effects

Note that I used data over 15 seasons to estimate this linear weights model. That raises the question: how do the weights vary by season? I would expect the single-season weights to show considerable variation, since one has at most 30 data points (one per team) to fit a model with 7 inputs. Below I use the `sapply` function to find season-specific coefficients for the fifty seasons from 1965 through 2014 and then graph the regression coefficients for a single, a double, and a home run over these 50 seasons.

```
# Fit the linear weights model to one season and return its coefficients.
fit.season <- function(season){
  coef(lm(R_G ~ X1B_G + X2B_G + X3B_G + HR_G + BB_G + SB_G + CS_G,
          data = filter(Teams, yearID == season)))
}

# One row of coefficients per season, 1965 through 2014.
S <- data.frame(t(sapply(1965:2014, fit.season)))
S <- mutate(S, Season = 1965:2014)

# Plot the home run (red), single (blue), and double (green) weights by season.
ggplot(S, aes(Season, HR_G)) +
  geom_point(color = "red") + geom_smooth(color = "red") +
  geom_point(aes(Season, X1B_G), color = "blue") +
  geom_smooth(aes(Season, X1B_G), color = "blue") +
  geom_point(aes(Season, X2B_G), color = "green") +
  geom_smooth(aes(Season, X2B_G), color = "green") +
  annotate("text", label = "HR", x = 1995, y = 1.6, size = 8,
           colour = "red") +
  annotate("text", label = "2B", x = 1995, y = 0.8, size = 8,
           colour = "green") +
  annotate("text", label = "1B", x = 1995, y = 0.6, size = 8,
           colour = "blue") +
  ggtitle("Regression Coefficients for Seasons 1965-2014") +
  ylab("Regression Coefficient")
```

There are two basic take-aways from this graph. First, the estimated coefficients from single-season data are pretty unstable across seasons. Second, there is no clear trend in these estimates over time: the weight for a home run stays near 1.5, the weight for a double near 0.75, and the weight for a single near 0.50.
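The instability is roughly what one would expect from sample size alone. As a rough illustration, the simulation below uses made-up data rather than the Lahman data: it generates team-seasons from fixed, assumed weights and compares how much the fitted home run weight varies when the model is fit to 30 teams versus 450 team-seasons. The assumed weights, the spread of the inputs, the noise level, and the reduced three-input model are all simplifying assumptions.

```r
set.seed(1)

# Simulate n team-seasons from assumed weights (0.5, 0.75, 1.5) and
# return the fitted home run weight. All numbers here are illustrative.
sim_hr_weight <- function(n) {
  d <- data.frame(X1B_G = rnorm(n, 6.0, 0.4),
                  X2B_G = rnorm(n, 1.8, 0.2),
                  HR_G  = rnorm(n, 1.0, 0.2))
  d$R_G <- -3 + 0.5 * d$X1B_G + 0.75 * d$X2B_G + 1.5 * d$HR_G +
    rnorm(n, 0, 0.15)
  unname(coef(lm(R_G ~ X1B_G + X2B_G + HR_G, data = d))["HR_G"])
}

# Repeat the fit many times at each sample size.
one_season <- replicate(200, sim_hr_weight(30))    # one season of 30 teams
pooled     <- replicate(200, sim_hr_weight(450))   # 15 seasons pooled

# Spread of the estimated home run weight at each sample size.
c(sd_one_season = sd(one_season), sd_pooled = sd(pooled))
```

Even though the true weight never changes in this simulation, the single-season estimates scatter much more widely than the pooled ones, so bouncing in the season-by-season estimates need not reflect real changes in the value of a home run.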

As a side comment, fitting this linear weights model across seasons motivates the consideration of a multilevel model. This type of model would allow one to get improved season-specific regression estimates and also allow one to quantify the variation of these coefficients across seasons.
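To give a flavor of what such a model might look like, here is a minimal sketch on simulated data. It assumes the `lme4` package, which is my choice for illustration and only one of several ways to fit a multilevel model. The data frame `sim`, the single home run input, and all of the numeric values are hypothetical. The block skips the fit if `lme4` is not installed.

```r
set.seed(2)

# Simulated data: 15 seasons x 30 teams, with a home run weight that
# varies by season around 1.5 (all values are illustrative assumptions).
n_season <- 15
n_team <- 30
sim <- data.frame(season = factor(rep(2000:2014, each = n_team)),
                  HR_G = rnorm(n_season * n_team, 1.0, 0.2))
hr_w <- rnorm(n_season, 1.5, 0.1)   # season-specific home run weights
sim$R_G <- 3 + hr_w[as.integer(sim$season)] * sim$HR_G +
  rnorm(nrow(sim), 0, 0.15)

ml_fit <- NULL
if (requireNamespace("lme4", quietly = TRUE)) {
  # Random slope for HR_G by season; the fixed effect is the average weight.
  ml_fit <- lme4::lmer(R_G ~ HR_G + (0 + HR_G | season), data = sim)
  print(lme4::fixef(ml_fit))       # average intercept and home run weight
  print(lme4::VarCorr(ml_fit))     # season-to-season spread of the weight
  print(head(coef(ml_fit)$season)) # partially pooled season-specific weights
}
```

The season-specific weights from `coef()` are pulled toward the overall average, which is the sense in which a multilevel model can improve on fitting each season separately.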

6 responses

1. Shouldn’t there be some mention of the intercept when talking about the coefficients of your linear model?

   1. Josh, you’re right. But in this case, the intercept does not have a simple interpretation. If I had centered the inputs, say by subtracting the means, then the intercept estimate would be helpful.

2. Under the Yearly Effects section I get the following error when running your code to create the data frame “S”:

   Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, …) :
   0 (non-NA) cases

   1. Patrick, I’m not sure about your problem. I’d first try out the fit.season function with a particular season input, before trying out sapply.

3. Hey Jim – great book. Are you saying that this “cluster luck” (being able to string together hits to form runs) is a skill rather than good fortune?

   1. George, there are many reasons for the clustering effect (strong bats in the middle of the order, a weak pitcher, etc.). I think it would be difficult to control for these different effects to show that there was any momentum in hitting.