Yesterday, I was revisiting the R code from Chapter 8 of Analyzing Baseball Using R on career trajectories. In our book, I focused on the use of the `plyr` package for the "splitting, applying and combining data" operation. But I have recently been using the `dplyr` package and have noticed a clear advantage, especially in terms of speed. I thought it would be worthwhile to compare the two packages on several baseball examples.

### Collapsing Over Batting Stints

In the Lahman database, there is a separate line for each stint of a player in a season. For example, Adam Dunn played for two teams in 2008, and his hitting stats are represented by two lines in the Batting data frame.

```r
library(Lahman)
subset(Batting, playerID == "dunnad01" & yearID == 2008)
```

```
##       playerID yearID stint teamID lgID   G G_batting  AB  R  H X2B X3B HR RBI
## 23644 dunnad01   2008     1    CIN   NL 114       114 373 58 87  14   0 32  74
## 23645 dunnad01   2008     2    ARI   NL  44        44 144 21 35   9   0  8  26
##       SB CS BB  SO IBB HBP SH SF GIDP G_old
## 23644  1  1 80 120   6   6  0  5    4   114
## 23645  1  0 42  44   7   1  0  0    3    44
```

One common operation is to collapse over the `stint` variable. Here we will compute the number of hits and at-bats for all player/seasons in baseball history.

Here’s how I would do this in the `plyr` package. I write a function `collapse.stint` that sums the at-bats and hits for a particular data frame, and then I use the `ddply` function to do the collapsing over all player/seasons.

```r
library(Lahman)
library(plyr)
collapse.stint <- function(d){
  data.frame(AB = sum(d$AB), H = sum(d$H))
}
Batting.new <- ddply(Batting, .(playerID, yearID), collapse.stint)
```

Here is the equivalent using the function `summarize` in the `dplyr` package.

```r
library(dplyr)
Batting.new <- summarize(group_by(Batting, playerID, yearID),
                         AB = sum(AB), H = sum(H))
```
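Either way, the collapsed totals should agree. Here is a minimal base-R cross-check of the collapsing step, using Adam Dunn's two 2008 lines shown above (this toy data frame is hand-copied from the Batting output, so it runs without either package).

```r
# Adam Dunn's two 2008 stints, copied from the Batting output above
dunn <- data.frame(playerID = "dunnad01", yearID = 2008,
                   stint = 1:2, AB = c(373, 144), H = c(87, 35))
# Collapse over stint: sum AB and H within each player/season
collapsed <- aggregate(cbind(AB, H) ~ playerID + yearID,
                       data = dunn, FUN = sum)
collapsed
#   playerID yearID  AB   H
# 1 dunnad01   2008 517 122
```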

There is a dramatic difference in speed. Using the `system.time` function, I found that the `ddply` version took 60.197 seconds and the `summarize` version took 0.223 seconds, a **BIG** increase in speed.

### Fitting Many Models

One can use both packages to fit many models. For example, suppose we are interested in fitting the Pythagorean Formula model, W/L = (R/RA)^k, for team data for all seasons from 1900 to 2013. (Note that I have written this using logs, log(W/L) = k log(R/RA), so it becomes a linear model.)
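To see the mechanics of this fit on its own, here is a toy single-season example in base R, with made-up win and run totals; the no-intercept formula matches the log-linear model above.

```r
# Made-up team totals for one season (illustration only)
d <- data.frame(W = c(95, 80, 67), L = c(67, 82, 95),
                R = c(800, 720, 650), RA = c(650, 730, 800))
# Regression through the origin: log(W/L) = k * log(R/RA)
fit <- lm(log(W / L) ~ 0 + log(R / RA), data = d)
coef(fit)  # the estimated Pythagorean exponent k
```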

Here is how I would use the `plyr` package. I first write a small function `pythag.model` and then use `ddply`.

```r
pythag.model <- function(d){
  coef(lm(log(W / L) ~ 0 + log(R / RA), data = d))
}
Fits <- ddply(subset(Teams, yearID >= 1900), .(yearID), pythag.model)
```

Using the `dplyr` package, I use the `do` function to fit all the models and then use `summarize` to pick up the estimated slopes.

```r
models <- group_by(subset(Teams, yearID >= 1900), yearID) %>%
  do(mod = lm(log(W / L) ~ 0 + log(R / RA), data = .))
Fits <- summarize(models, yearID = yearID, B = coef(mod))
```

In this example, both packages were very fast — about 0.22 seconds for each method for performing these 115 regression fits.
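The same split-a-data-frame, fit-a-model-per-group idea can also be sketched in base R with `split` and `sapply`, which is a handy cross-check on what either package is doing. The two-season data frame below is made up for illustration.

```r
# Toy Teams-like data: two seasons, three teams each (made-up numbers)
Teams.toy <- data.frame(yearID = rep(c(2012, 2013), each = 3),
                        W = c(95, 80, 67, 90, 85, 70),
                        L = c(67, 82, 95, 72, 77, 92),
                        R = c(800, 720, 650, 780, 740, 660),
                        RA = c(650, 730, 800, 670, 710, 790))
# Split by season, fit the log-linear model in each group,
# and collect the estimated exponents
slopes <- sapply(split(Teams.toy, Teams.toy$yearID),
                 function(d) coef(lm(log(W / L) ~ 0 + log(R / RA), data = d)))
slopes  # one estimated exponent per season
```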

In my work, I plan to primarily use the `dplyr` package for a variety of data wrangling operations, and I will illustrate `dplyr` functions in my BGSU “Computing with Data” course next fall.