Introducing data.table and Exploring Values of Home Runs

(NOTE: Evan and I are anxiously anticipating Ichiro Suzuki’s 3000th career hit and we plan a special post to commemorate this event. In the meantime, we’ll introduce the data.table package.)

Although I typically use the dplyr package in my work, the data.table package is an attractive alternative package for implementing many of these data wrangling tasks. Here I briefly introduce data.table for exploring how team home run hitting impacts run scoring for 2015 teams.

Read in csv Files Fast

The fread function in data.table quickly reads in csv files. To illustrate, below I read in the play-by-play Retrosheet data for a single season (over 189,000 rows). I measure the time reading this csv file in using both read.csv and fread functions — we see that fread is about 15 times faster.

 
library(data.table)
system.time(read.csv(file.loc))  # 17.743 sec
   user  system elapsed 
 16.318   0.610  17.622 
system.time(fread(file.loc))     # 1.182 sec
   user  system elapsed 
  1.067   0.075   1.168 

Variable Types: No Factors

I made some changes in my compute.runs.expectancy function to use fread . I convert the data frame to a data table by the data.table function.

 
d <- compute.runs.expectancy_new(2015)
d <- data.table(d)

One attractive feature of fread is that all non-numerical columns of the output data frame are stored as character-type, not factors. This is nice default behavior — it is a bit annoying to have factor variables that you want to convert to numeric.

Basic Data Table Tasks

Once a data frame has been converted to a data.table, say DT, then the basic command for data wrangling is

DT[i, j, by]

which is read “filter according to i”, then “calculate by j”, and “grouping by” the by variable. As we will see, this is a compact way of performing many of the typical tasks with data.tables (that are also data frames).

How many home runs were hit in the 2015 season?

Here I want to filter by EVENT_CD == 23 (the home run flag) and calculate using the .N variable.

 
d[EVENT_CD==23, .(count=.N)]
   count
1:  4909

Count of home runs for each team?

First, I use several data.table commands to define the BAT_TEAM . (Note that I use an initial comma, since I am not filtering and I use the := symbol to define a new variable.) Then I repeat the same command as above, but now adding BAT_TEAM to do the grouping.

 
d[, HOME_TEAM_ID := substr(GAME_ID, 1, 3)]
d[, BAT_TEAM := ifelse(BAT_HOME_ID==0, as.character(AWAY_TEAM_ID), 
                       HOME_TEAM_ID)]
Totals <- d[EVENT_CD==23, .(count=.N), BAT_TEAM]
head(Totals)
   BAT_TEAM count
1:      KCA   139
2:      ANA   176
3:      OAK   146
4:      TEX   172
5:      SEA   198
6:      HOU   230

Arranging a data.table by the value of a variable.

I use the order argument on the variable count to sort the home run counts for the 2015 teams.

 
Totals <- Totals[order(-count)]
head(Totals)
   BAT_TEAM count
1:      TOR   232
2:      HOU   230
3:      BAL   217
4:      NYA   212
5:      SEA   198
6:      LAN   187

Run Production by Home Runs?

Next, I want to explore how many runs are produced (or more precisely, the RBI) with runners on base. First I tabulate the number of RBI’s for home runs — here I am filtering over home runs, computing a count, and grouping by the runs scored variable.

 
(S <- d[EVENT_CD==23, .(count=.N), RUNS.SCORED])
   RUNS.SCORED count
1:           1  2887
2:           2  1360
3:           3   554
4:           4   108

Here I compute the total runs scored, total home runs, and total runs scored as a result of home runs:

 
d[, .(RUNS=sum(RUNS.SCORED), 
             HR=sum(EVENT_CD==23),
             RUNS.HR=sum(RUNS.SCORED * (EVENT_CD==23)))]
    RUNS   HR RUNS.HR
1: 20646 4909    7701

By adding BAT_TEAM at the end, I do this same computation for each team.

 
S <- d[, .(RUNS=sum(RUNS.SCORED), 
            HR=sum(EVENT_CD==23),
            RUNS.HR=sum(RUNS.SCORED * (EVENT_CD==23))), BAT_TEAM]
head(S)
   BAT_TEAM RUNS  HR RUNS.HR
1:      KCA  724 139     218
2:      ANA  661 176     257
3:      OAK  694 146     241
4:      TEX  751 172     258
5:      SEA  656 198     306
6:      HOU  729 230     347

Last, I create a new variable that is the fraction of runs produced by home runs — I include the names of the other variables since I still want them in the data.table.

 
S <- S[, .(BAT_TEAM, RUNS, HR, RUNS.HR, F_RUNS_HR=RUNS.HR / RUNS)]
head(S)
   BAT_TEAM RUNS  HR RUNS.HR F_RUNS_HR
1:      KCA  724 139     218 0.3011050
2:      ANA  661 176     257 0.3888048
3:      OAK  694 146     241 0.3472622
4:      TEX  751 172     258 0.3435419
5:      SEA  656 198     306 0.4664634
6:      HOU  729 230     347 0.4759945

Below I graph the runs and fraction of runs produced by home runs for all 2015 teams.

 
library(ggplot2)
ggplot(S, aes(RUNS, RUNS.HR / RUNS, label=BAT_TEAM)) + 
  geom_point() +
  geom_label() +
  ggtitle("Runs and Fraction by Home Runs for All Teams")

datatable_HR1
We see that although Toronto had the most number of home runs, the fraction of runs produced by home runs was high for Baltimore, Houston and the Yankees.

Or it might be interesting to plot the number of runs produced by home runs against the number of home runs not produced by home runs.

 
ggplot(S, aes(RUNS - RUNS.HR, RUNS.HR, label=BAT_TEAM)) + 
  geom_point() +
  geom_label() +
  ggtitle("Runs Scored by Home Runs and Other Runs")

datatable_hr2
Here we see a negative relationship — teams that produce many runs by home runs tend not to produce many runs not by home runs.

Runs Values of Home Runs

As we know, runs batted in is different from “runs created” since the opportunity to score runs decreases when you clear the bases with a home run. I have a RUNS.VALUE variable in my data frame. I use data.table to compute the average number of runs created by home runs. This is a reflection of how well a team hits home runs in “good” situations with runners on base.

 
S <- d[, .(RUNS=sum(RUNS.SCORED), 
           HR=sum(EVENT_CD==23),
           AVG.RUNS.VALUE=sum(RUNS.VALUE * (EVENT_CD==23)) /
              sum(EVENT_CD==23)), BAT_TEAM]
head(S)
   BAT_TEAM RUNS  HR AVG.RUNS.VALUE
1:      KCA  724 139       1.395463
2:      ANA  661 176       1.319281
3:      OAK  694 146       1.446837
4:      TEX  751 172       1.350066
5:      SEA  656 198       1.380826
6:      HOU  729 230       1.341109
 
ggplot(S, aes(HR, AVG.RUNS.VALUE, label=BAT_TEAM)) + 
  geom_point() +
  geom_label() +
  ggtitle("Runs and Average Home Run Values for all")

datatable_hr3
We see that the average runs value for each Yankees home run was high (close to 1.5); in contrast, Tampa Bay’s average runs value per home run was only about 1.25. (I presume they had a lot of solo home runs.)

Final Words

Anyway, I hope this post has helped the reader get introduced to the data.table package. It is very powerful and I appreciate the succinct way you can do operations in a single line of code. I likely will introduce data.table in my Computing with Data course this fall.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: