# Comparing the plyr and dplyr packages

Yesterday, I was revisiting the R code from Chapter 8 of Analyzing Baseball Data with R on career trajectories. In our book, I focused on the use of the plyr package for the “splitting, applying and combining data” operation. But I have recently been using the dplyr package and have noticed a clear advantage, especially in terms of speed. I thought it would be worthwhile to compare the two packages on several baseball examples.

### Collapsing Over Batting Stints

In the Lahman database, there will be a separate line for each stint of a player in a season. For example, Adam Dunn played for two teams in 2008 and his hitting stats are represented by two lines in the Batting data frame.

```r
library(Lahman)
subset(Batting, playerID == "dunnad01" & yearID == 2008)

##      playerID yearID stint teamID lgID   G G_batting  AB  R  H X2B X3B HR RBI
##23644 dunnad01   2008     1    CIN   NL 114       114 373 58 87  14   0 32  74
##23645 dunnad01   2008     2    ARI   NL  44        44 144 21 35   9   0  8  26
##      SB CS BB  SO IBB HBP SH SF GIDP G_old
##23644  1  1 80 120   6   6  0  5    4   114
##23645  1  0 42  44   7   1  0  0    3    44
```


One common operation is to collapse over the stint variable. Here we will compute the number of hits and at-bats for all player/seasons in baseball history.

Here’s how I would do this in the plyr package. I write a function collapse.stint that sums the at-bats and hits for a particular data frame, and then I use the ddply function to do the collapsing over all player/seasons.

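```r
library(Lahman)
library(plyr)

# Sum at-bats and hits over all rows (stints) of one player/season
collapse.stint <- function(d){
  data.frame(AB=sum(d$AB), H=sum(d$H))
}

# Split by player and season, apply collapse.stint, combine the results
Batting.new <- ddply(Batting, .(playerID, yearID),
                     collapse.stint)
```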


Here is the equivalent using the function summarize in the dplyr package.

```r
library(dplyr)
Batting.new <- summarize(group_by(Batting, playerID, yearID),
                         AB=sum(AB), H=sum(H))
```
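As a quick check of what collapsing over stint does, here is a minimal sketch that applies the same summarize call to a toy data frame holding only Adam Dunn's two 2008 stints. The AB and H values are copied from the output shown earlier; the names dunn and dunn.2008 are made up for this illustration.

```r
library(dplyr)

# Toy data: Adam Dunn's two 2008 stints (AB and H taken from the Batting output)
dunn <- data.frame(
  playerID = "dunnad01",
  yearID   = 2008,
  stint    = c(1, 2),
  AB       = c(373, 144),
  H        = c(87, 35)
)

# Collapse over stint: one row per player/season
dunn.2008 <- summarize(group_by(dunn, playerID, yearID),
                       AB = sum(AB), H = sum(H))
dunn.2008
```

The single resulting row has AB = 517 and H = 122, the sums over his two stints.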


There is a dramatic difference in speed: using the system.time function, I found that it took 60.197 seconds using ddply and 0.223 seconds using summarize. That is a BIG increase in speed, roughly 270 times faster.
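For readers who want to try the comparison themselves, here is a rough sketch of how such timings can be collected with system.time. To keep the run short, it uses only seasons from 2010 on (the subset bat and the names time.plyr, time.dplyr, B1 and B2 are my own choices for this sketch, not part of the timings quoted above); replacing bat with the full Batting data frame gives the full comparison.

```r
library(Lahman)
library(plyr)
library(dplyr)   # loaded after plyr, so summarize() refers to the dplyr version

# Assumption for this sketch: a smaller subset so the plyr step finishes quickly
bat <- subset(Batting, yearID >= 2010)

collapse.stint <- function(d){
  data.frame(AB = sum(d$AB), H = sum(d$H))
}

# Time the plyr approach
time.plyr <- system.time(
  B1 <- ddply(bat, .(playerID, yearID), collapse.stint)
)

# Time the dplyr approach
time.dplyr <- system.time(
  B2 <- summarize(group_by(bat, playerID, yearID),
                  AB = sum(AB), H = sum(H))
)

# Elapsed wall-clock seconds for each method
c(plyr = time.plyr[["elapsed"]], dplyr = time.dplyr[["elapsed"]])
```

The actual numbers will depend on the machine, the package versions, and the size of the Lahman release, so they should not be expected to match the figures above.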

### Fitting Many Models

One can use both packages to fit many models. For example, suppose we are interested in fitting the Pythagorean Formula model $\log \left( \frac{W}{L}\right) = \beta \log \left( \frac{R}{RA} \right)$ for team data for all seasons from 1900 to 2013. (Note that I have written this using logs so it becomes a linear model.)
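For reference, here is a short sketch of where the log form comes from, assuming the usual statement of the Pythagorean formula with a single unknown exponent $\beta$ (here $W$, $L$, $R$ and $RA$ are a team's wins, losses, runs scored and runs allowed):

$$
\frac{W}{W+L} = \frac{R^\beta}{R^\beta + RA^\beta}
\;\Longrightarrow\;
\frac{W}{L} = \left(\frac{R}{RA}\right)^{\beta}
\;\Longrightarrow\;
\log\left(\frac{W}{L}\right) = \beta \log\left(\frac{R}{RA}\right)
$$

The last expression is a regression through the origin of $\log(W/L)$ on $\log(R/RA)$, which is why the model formulas in the code use `0 +` to drop the intercept.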

Here is how I would use the plyr package. I first write a small function pythag.model and then use ddply.

```r
pythag.model <- function(d){
  coef(lm(log(W / L) ~ 0 + log(R / RA), data=d))
}
Fits <- ddply(subset(Teams, yearID >= 1900),
              .(yearID), pythag.model)
```


Using the dplyr package, I use the do function to fit all the models and then use summarize to pick up the estimated slopes.

```r
models <- group_by(subset(Teams, yearID >= 1900), yearID) %>%
  do(mod = lm(log(W / L) ~ 0 + log(R / RA), data = .))
Fits <- summarize(models,
                  yearID = yearID,
                  B = coef(mod)
)
```
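As an aside, the same slopes can also be obtained without do by calling lm directly inside summarize, since the columns of each group are visible to the model formula. This is only a sketch of an alternative, not the method timed in this post, and the name Fits2 is made up here.

```r
library(Lahman)
library(dplyr)

# One no-intercept regression per season; [[1]] extracts the single slope
Fits2 <- subset(Teams, yearID >= 1900) %>%
  group_by(yearID) %>%
  summarize(B = coef(lm(log(W / L) ~ 0 + log(R / RA)))[[1]])

head(Fits2)
```

The result has one row per season, with the estimated exponent in the column B.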


In this example, both packages were very fast: about 0.22 seconds for each method to perform these 114 regression fits, one for each season from 1900 to 2013.

In my work, I plan to primarily use the dplyr package for a variety of data wrangling operations, and I will illustrate dplyr functions in my BGSU “Computing with Data” course next fall.