Comparing the plyr and dplyr packages

Yesterday, I was revisiting the R code from Chapter 8 of Analyzing Baseball Using R on career trajectories. In our book, I focused on the plyr package for the “splitting, applying, and combining data” operation. But I have recently been using the dplyr package and have noticed a clear advantage, especially in speed. I thought it would be worthwhile to compare the two packages on several baseball examples.

Collapsing Over Batting Stints

In the Lahman database, there is a separate line for each stint of a player in a season. For example, Adam Dunn played for two teams in 2008, and his hitting stats appear on two lines of the Batting data frame.

library(Lahman)
subset(Batting, playerID == "dunnad01" & yearID == 2008)
##      playerID yearID stint teamID lgID   G G_batting  AB  R  H X2B X3B HR RBI
##23644 dunnad01   2008     1    CIN   NL 114       114 373 58 87  14   0 32  74
##23645 dunnad01   2008     2    ARI   NL  44        44 144 21 35   9   0  8  26
##      SB CS BB  SO IBB HBP SH SF GIDP G_old
##23644  1  1 80 120   6   6  0  5    4   114
##23645  1  0 42  44   7   1  0  0    3    44

One common operation is to collapse over the stint variable. Here we will compute the number of hits and at-bats for all player/seasons in baseball history.

Here’s how I would do this with the plyr package. I write a function collapse.stint that sums the at-bats and hits for a particular data frame, and then I use the ddply function to do the collapsing over all player/seasons.

library(Lahman)
library(plyr)
collapse.stint <- function(d){
  data.frame(AB=sum(d$AB), H=sum(d$H))
}
Batting.new <- ddply(Batting, .(playerID, yearID),
                     collapse.stint)

Here is the equivalent using the function summarize in the dplyr package.

library(dplyr)
Batting.new <- summarize(group_by(Batting, playerID, yearID),
                         AB=sum(AB), H=sum(H))
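
As a quick check, Adam Dunn’s two 2008 stints should collapse to a single row with AB = 373 + 144 = 517 and H = 87 + 35 = 122; something along these lines (using dplyr’s filter on the Batting.new created above) will confirm it:

# look at the collapsed 2008 line for Adam Dunn
filter(Batting.new, playerID == "dunnad01", yearID == 2008)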

There is a dramatic difference in speed. Using the system.time function, I found that the ddply version took 60.197 seconds and the summarize version took only 0.223 seconds. That is a BIG increase in speed.
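
The exact timings will vary by machine and package version, but a comparison along these lines reproduces the measurement:

# time the plyr version
system.time(
  ddply(Batting, .(playerID, yearID), collapse.stint)
)

# time the dplyr version
system.time(
  summarize(group_by(Batting, playerID, yearID), AB=sum(AB), H=sum(H))
)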

Fitting Many Models

One can use both packages to fit many models. For example, suppose we are interested in fitting the Pythagorean Formula model

\log \left( \frac{W}{L}\right) = \beta \log \left( \frac{R}{RA} \right)

for team data for all seasons from 1900 to 2013. (Note that I have written this using logs so it becomes a linear model.)
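
This form comes from the usual Pythagorean expectation with exponent \beta,

\frac{W}{W + L} = \frac{R^\beta}{R^\beta + RA^\beta},

which rearranges to W/L = (R/RA)^\beta; taking logs of both sides gives the no-intercept linear model above.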

Here is how I would use the plyr package. I first write a small function pythag.model and then use ddply.

pythag.model <- function(d){
  coef(lm(log(W / L) ~ 0 + log(R / RA), data=d))
}
Fits <- ddply(subset(Teams, yearID >= 1900),
              .(yearID), pythag.model)

With the dplyr package, I use the do function to fit all of the models and then summarize to pick up the estimated slopes.

models <- group_by(subset(Teams, yearID >= 1900), yearID) %>% 
  do(mod = lm(log(W / L) ~ 0 + log(R / RA), data = .))
Fits <- summarize(models,
                  yearID = yearID,
                  B = coef(mod))

In this example, both packages were very fast: about 0.22 seconds for each method to perform the 114 regression fits (one per season from 1900 through 2013).
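
Either way, Fits contains one estimated slope per season. A quick graph of the slopes against yearID (a sketch, assuming the dplyr version of Fits with its slope column B and the ggplot2 package) shows how the fitted Pythagorean exponent has moved over the seasons:

library(ggplot2)
# plot the estimated Pythagorean exponents by season
ggplot(Fits, aes(x = yearID, y = B)) +
  geom_point() +
  geom_smooth(method = "loess") +
  labs(x = "Season", y = "Estimated exponent")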

In my work, I plan to primarily use the dplyr package for a variety of data wrangling operations, and I will illustrate dplyr functions in my BGSU “Computing with Data” course next fall.
