Our book came out last week; you can download the Kindle version today at Amazon. Essentially, the book is about sabermetrics and how one can explore the wealth of publicly available baseball data using the statistical system R. Max and I live about 4536 miles apart, but we are both into baseball, statistics, and R, so it was a nice collaboration.

Here’s a simple study that one can do using the Lahman package in R. One point that was obvious in the recent World Series was that players were striking out a lot. That motivates two questions: (1) how has the strikeout rate changed over the history of the World Series, and (2) was the 2013 Series unusual with respect to strikeouts?

I open up RStudio (a nice interface for R) and load the Lahman package.

library(Lahman)

The Lahman package contains all of the datasets from Sean Lahman’s database. We’ll focus on the data frame BattingPost, which contains postseason batting statistics, including those for the World Series. We are only interested in World Series data for the seasons since 1903, so we use the subset function to create a new data frame with just the rows we want.

wsdata = subset(BattingPost, round=="WS" & yearID >= 1903)

What I want to do is compute the sum of at-bats and sum of strikeouts for each series. This is conveniently done using the ddply function in the plyr package which we load.

library(plyr)

The call below says we want to break up wsdata by the yearID variable, and for each yearID compute AB, the sum of at-bats, and SO, the sum of strikeouts.

so.data <- ddply(wsdata, .(yearID), summarize,
                 AB = sum(AB, na.rm=TRUE),
                 SO = sum(SO, na.rm=TRUE))
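To see what this grouping step produces without loading any packages, here is a minimal sketch on a small made-up data frame. The rows in toy are invented for illustration and are not real World Series numbers, and the sketch uses base R's aggregate in place of ddply, so it is an alternative formulation rather than the code used above.

```r
# Toy stand-in for wsdata: invented rows, not real World Series numbers.
toy <- data.frame(
  yearID = c(1903, 1903, 1905, 1905, 1905),
  AB     = c(4, 3, 5, NA, 2),
  SO     = c(1, 0, 2, 1, NA)
)

# One row per yearID holding the summed at-bats and strikeouts.
# na.rm = TRUE is passed through to sum(), so missing values are skipped.
toy.sums <- aggregate(toy[c("AB", "SO")],
                      by = list(yearID = toy$yearID),
                      FUN = sum, na.rm = TRUE)
print(toy.sums)
```

The result has the same shape as so.data: one row per season with columns yearID, AB, and SO.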

From baseball-reference.com, we collect the strikeout data for the recent 2013 series and add this to the current data frame.

so.data <- rbind(so.data,
                 data.frame(yearID = 2013,
                            AB = 194 + 201,
                            SO = 59 + 43))

We compute a new variable SO.Rate equal to the percentage of at-bats that are strikeouts.

so.data$SO.Rate <- with(so.data, 100 * SO / AB)
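As a quick check of the formula, the 2013 totals entered earlier can be plugged in by hand. This sketch only redoes that arithmetic; the variable names are made up for the example.

```r
# 2013 totals from the rbind step: both teams' at-bats and strikeouts.
ab.2013 <- 194 + 201   # 395 at-bats
so.2013 <- 59 + 43     # 102 strikeouts

# Percentage of at-bats that ended in a strikeout.
rate.2013 <- 100 * so.2013 / ab.2013
print(round(rate.2013, 1))   # 25.8
```

So roughly one at-bat in four ended in a strikeout in the 2013 Series.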

We use the plot function to construct a scatterplot of strikeout rate by season, and add a lowess smoothing curve to see the basic pattern. Last, we use the identify function to label, by clicking on the plot, some interesting seasons whose strikeout rates don’t follow the general pattern.

with(so.data, plot(yearID, SO.Rate))

with(so.data, lines(lowess(yearID, SO.Rate)))

with(so.data, identify(yearID, SO.Rate,
                       n=11, labels=yearID))
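The identify function needs an interactive graphics device and mouse clicks, so the labeled plot cannot be reproduced from a script. One possible non-interactive alternative, sketched here on invented data, is to label the points that sit farthest from the lowess curve. The toy data frame, the planted 1963 outlier, and the choice of two labels are all assumptions for illustration, not the approach used in this post.

```r
# Toy stand-in for so.data: an invented rising trend with one planted outlier.
toy <- data.frame(yearID = 1950:1969)
toy$SO.Rate <- 10 + 0.3 * (toy$yearID - 1950)
toy$SO.Rate[toy$yearID == 1963] <- toy$SO.Rate[toy$yearID == 1963] + 8

# lowess() returns the fitted curve sorted by x; yearID is already sorted,
# so fit$y lines up row by row with toy.
fit <- lowess(toy$yearID, toy$SO.Rate)
resid <- toy$SO.Rate - fit$y

# Indices of the two seasons farthest from the smooth curve.
far <- order(abs(resid), decreasing = TRUE)[1:2]

plot(toy$yearID, toy$SO.Rate, xlab = "yearID", ylab = "SO.Rate")
lines(fit)
text(toy$yearID[far], toy$SO.Rate[far], labels = toy$yearID[far], pos = 1)
```

With real data one would replace toy by so.data and pick the number of labels by eye; the seasons chosen this way need not match the ones picked by hand with identify.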

As we would expect, we see a steady increase in strikeout rates over the years. Although 2013 had a high strikeout rate, there were higher rates in the recent seasons 2001, 2009 and 2012 (remember the Tigers were in last year’s series). There were also some historical seasons with high strikeout rates. One notable season was 1963, when the Dodgers, with Sandy Koufax and Don Drysdale, overwhelmed the Yankees in 4 games.

Nice post. I created an interactive version of the same plot using rCharts (a package that I authored) and Polycharts. Here is a link: http://ramnathv.github.io/rChartsNYT/

For some reason when I try and get that graphic using the same code, I get one data point… Can someone please tell me what is going on? Much appreciated

Felix:

Here’s a new version of the code using the dplyr and ggplot2 packages:

library(Lahman)

library(dplyr)

library(ggplot2)

wsdata <- filter(BattingPost, round == "WS", yearID >= 1903)

so.data <- summarize(group_by(wsdata, yearID),
                     AB = sum(AB, na.rm=TRUE),
                     SO = sum(SO, na.rm=TRUE),
                     SO.Rate = 100 * SO / AB)

ggplot(so.data, aes(yearID, SO.Rate)) +
  geom_point() + geom_smooth()