Recently, Sean Lahman released the 2013 version of his baseball database. We’ll use this post to illustrate downloading the Lahman database, reading the season-to-season batting data into R, and using these data to explore the OPS measures of all players in the 2012 and 2013 seasons.

We begin by visiting http://www.seanlahman.com/baseball-archive/statistics/ and clicking on the 2013 Beta Version, comma-delimited format. This will download a zip file. When you unzip it, you’ll see a collection of csv files named “AllstarFull.csv”, “Appearances.csv”, etc.

In RStudio, you can import the season-to-season Batting.csv file by clicking the “Import Dataset” button in the Environment tab. Indicate that you are importing a text file, navigate to the file location, and specify that the first line is a header containing the variable names. Equivalently, you can read the file directly with `read.csv` (adjust the path to wherever you unzipped the files):

Batting <- read.csv("C:/Users/Jim/Desktop/lahman2013/Batting.csv")

We use the `subset` function to create a new data frame `Batting.12.13` containing data only for the 2012 and 2013 seasons.

Batting.12.13 <- subset(Batting, yearID==2012 | yearID==2013)

One issue is that the data file contains a separate line for each team a player appeared with in a season. For example, there are two lines for Marlon Byrd’s 2013 season, one for the Mets and one for the Pirates. We use the `ddply` function in the `plyr` package to collapse the hitting stats (AB, H, RBI, etc.) over the “stint” variable, leaving one line per player and season.

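library(plyr)

# Sum the counting stats in columns 6 to 23 over a player's stints
sum.function <- function(d){
  d1 <- d[, 6:23]
  apply(d1, 2, sum)
}
# Overwrite Batting.12.13 so the later steps use one row per player and season
Batting.12.13 <- ddply(Batting.12.13, .(playerID, yearID), sum.function)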
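For readers who do not have `plyr` installed, the same kind of collapse can be sketched in base R with `aggregate`. The toy data frame below is made up for illustration (the player IDs and stat values are hypothetical, not taken from the Lahman files), and only three stat columns are shown.

```r
# Hypothetical toy data: one player with two stints in 2013, one with a single stint
toy <- data.frame(
  playerID = c("playera01", "playera01", "playerb01"),
  yearID   = c(2013, 2013, 2013),
  stint    = c(1, 2, 1),
  AB       = c(300, 100, 500),
  H        = c(90, 25, 140),
  HR       = c(15, 3, 20)
)

# Sum AB, H, and HR within each playerID/yearID combination
collapsed <- aggregate(cbind(AB, H, HR) ~ playerID + yearID, data = toy, FUN = sum)
print(collapsed)
```

One caveat with this approach: the formula interface of `aggregate` drops rows containing missing values by default, so it may not match the `ddply` result exactly on data with NAs.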

Next we define the OPS statistic, first by defining the SLG and OBP variables and then adding the two to get the OPS values.

Batting.12.13$X1B <- with(Batting.12.13, H - X2B - X3B - HR)
Batting.12.13$SLG <- with(Batting.12.13, (X1B + 2 * X2B + 3 * X3B + 4 * HR) / AB)
Batting.12.13$OBP <- with(Batting.12.13, (H + BB + HBP) / (AB + BB + HBP + SF))
Batting.12.13$OPS <- with(Batting.12.13, SLG + OBP)
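To make the arithmetic concrete, here is a small worked example that applies the same formulas to a single stat line. The season totals are made-up numbers chosen for easy arithmetic, not any real player's record.

```r
# Hypothetical season totals for a single batter (made-up numbers)
AB <- 500; H <- 150; X2B <- 30; X3B <- 5; HR <- 20
BB <- 50; HBP <- 5; SF <- 5

X1B <- H - X2B - X3B - HR                       # singles: 95
SLG <- (X1B + 2 * X2B + 3 * X3B + 4 * HR) / AB  # total bases per at-bat: 250 / 500 = 0.5
OBP <- (H + BB + HBP) / (AB + BB + HBP + SF)    # 205 / 560, about 0.366
OPS <- SLG + OBP                                # about 0.866
round(c(SLG = SLG, OBP = OBP, OPS = OPS), 3)
```

Note that a player with zero at-bats gets NaN for SLG, since R evaluates 0 / 0 as NaN. The 300 AB filter applied later removes such rows.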

We want to reshape the data frame so that the 2012 and 2013 hitting stats for a particular player are on the same line. We do this in two steps: first we use the `subset` function to extract the 2012 and 2013 statistics into separate data frames, and then we use the `merge` function to merge the two data frames, matching by the “playerID” variable. Since both data frames share the same column names, `merge` appends the suffix “.x” to the 2012 columns and “.y” to the 2013 columns.

Batting.2012 <- subset(Batting.12.13, yearID==2012)
Batting.2013 <- subset(Batting.12.13, yearID==2013)
merged.Batting <- merge(Batting.2012, Batting.2013, by="playerID")
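The behavior of `merge` is easier to see on a tiny example. The sketch below uses two hypothetical per-season data frames (the IDs and values are invented) to show which rows survive the merge and how the columns are named afterwards.

```r
# Hypothetical per-season data frames (made-up IDs and values)
b2012 <- data.frame(playerID = c("playera01", "playerb01"),
                    yearID = 2012, AB = c(450, 320), OPS = c(0.800, 0.700))
b2013 <- data.frame(playerID = c("playera01", "playerc01"),
                    yearID = 2013, AB = c(500, 410), OPS = c(0.750, 0.820))

# Default merge is an inner join: only players present in both seasons survive
m <- merge(b2012, b2013, by = "playerID")
names(m)   # the key comes first, then the .x (2012) and .y (2013) columns
```

Only "playera01" appears in both data frames, so the merged result has a single row. Players who batted in just one of the two seasons are dropped, which is what we want when comparing seasons.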

Finally, we limit our exploration to players who had at least 300 AB in each season.

merged.Batting.300 <- subset(merged.Batting, AB.x >= 300 & AB.y >=300)

We are interested in the change in a player’s OPS from 2012 to 2013, so we define the “Improvement” variable.

merged.Batting.300$Improvement <- with(merged.Batting.300, OPS.y - OPS.x)

We use traditional R graphics to construct a scatterplot of each player’s 2012 OPS against the improvement in OPS. To show the general pattern, we fit a least-squares line and overlay it on the graph with the `abline` function. Since we are interested in identifying players, we use the `text` function to plot the first five characters of each playerID code on the scatterplot.

with(merged.Batting.300,
     plot(OPS.x, Improvement, pch=".",
          xlab="2012 OPS", ylab="Improvement in 2013 OPS",
          main="OPS for 2 Years for MLB Batters With Minimum 300 AB"))
fit <- lm(Improvement ~ OPS.x, data=merged.Batting.300)
abline(fit, lwd=3, col="red")
with(merged.Batting.300,
     text(OPS.x, Improvement, substr(playerID, 1, 5)))

What do we see in this graph?

- We see the familiar regression effect — players who had low 2012 OPS values tend to improve in 2013, and players who did well in 2012 (like Joey Votto) tend to decline.
- Even though OPS is a pretty good measure of batting ability, the variability in the improvements is remarkable.
- For example, Hanley Ramirez (label “ramir”) had a .260 increase in OPS, Chris Davis (label “davis”) had a .180 improvement, Carlos Ruiz (label “ruizc”) had a .250 drop in OPS, and Melky Cabrera (label “cabre”) had a .220 drop in OPS.
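
The regression effect does not require any real change in ability. As a rough illustration, the simulation below gives each hypothetical batter a fixed "true" OPS and adds independent chance variation in each of two seasons. All of the numbers (200 batters, a mean of .750, standard deviations of .050) are assumptions chosen for illustration, not estimates from the Lahman data.

```r
set.seed(1)                                     # arbitrary seed for reproducibility
n <- 200                                        # hypothetical number of batters
ability <- rnorm(n, mean = 0.750, sd = 0.050)   # assumed true OPS talent, same in both seasons
ops1 <- ability + rnorm(n, sd = 0.050)          # observed OPS in season 1 = talent + chance
ops2 <- ability + rnorm(n, sd = 0.050)          # observed OPS in season 2 = talent + chance
improvement <- ops2 - ops1

# Regress improvement on the season 1 value, as in the scatterplot above
fit.sim <- lm(improvement ~ ops1)
coef(fit.sim)                                   # the slope on ops1 is expected to be negative
```

Even though no simulated batter's talent changes, the fitted slope comes out negative: batters with high season 1 values were, on average, partly lucky, and that luck does not carry over. With equal talent and chance variances, as assumed here, the theoretical slope is -0.5.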

Obviously, baseball GMs (are you reading, Rubén Amaro?) should not get excited about a player’s OPS value in a single season. Extreme OPS values tend to move toward the average, and one needs to look at a player’s career season-to-season batting statistics to make reasonable predictions about his batting performance in the 2014 season.
