(NOTE: Evan and I are anxiously anticipating Ichiro Suzuki’s 3000th career hit and we plan a special post to commemorate this event. In the meantime, we’ll introduce the
Although I typically use the
dplyr package in my work, the
data.table package is an attractive alternative package for implementing many of these data wrangling tasks. Here I briefly introduce
data.table for exploring how team home run hitting impacts run scoring for 2015 teams.
Read in csv Files Fast
fread function in
data.table quickly reads in csv files. To illustrate, below I read in the play-by-play Retrosheet data for a single season (over 189,000 rows). I measure the time reading this csv file in using both
fread functions — we see that
fread is about 15 times faster.
library(data.table) system.time(read.csv(file.loc)) # 17.743 sec user system elapsed 16.318 0.610 17.622 system.time(fread(file.loc)) # 1.182 sec user system elapsed 1.067 0.075 1.168
Variable Types: No Factors
I made some changes in my
compute.runs.expectancy function to use
fread . I convert the data frame to a data table by the
d <- compute.runs.expectancy_new(2015) d <- data.table(d)
One attractive feature of
fread is that all non-numerical columns of the output data frame are stored as character-type, not factors. This is nice default behavior — it is a bit annoying to have factor variables that you want to convert to numeric.
Basic Data Table Tasks
Once a data frame has been converted to a data.table, say DT, then the basic command for data wrangling is
DT[i, j, by]
which is read “filter according to i”, then “calculate by j”, and “grouping by” the by variable. As we will see, this is a compact way of performing many of the typical tasks with data.tables (that are also data frames).
How many home runs were hit in the 2015 season?
Here I want to filter by
EVENT_CD == 23 (the home run flag) and calculate using the
d[EVENT_CD==23, .(count=.N)] count 1: 4909
Count of home runs for each team?
First, I use several
data.table commands to define the
BAT_TEAM . (Note that I use an initial comma, since I am not filtering and I use the
:= symbol to define a new variable.) Then I repeat the same command as above, but now adding
BAT_TEAM to do the grouping.
d[, HOME_TEAM_ID := substr(GAME_ID, 1, 3)] d[, BAT_TEAM := ifelse(BAT_HOME_ID==0, as.character(AWAY_TEAM_ID), HOME_TEAM_ID)] Totals <- d[EVENT_CD==23, .(count=.N), BAT_TEAM] head(Totals) BAT_TEAM count 1: KCA 139 2: ANA 176 3: OAK 146 4: TEX 172 5: SEA 198 6: HOU 230
data.table by the value of a variable.
I use the
order argument on the variable
count to sort the home run counts for the 2015 teams.
Totals <- Totals[order(-count)] head(Totals) BAT_TEAM count 1: TOR 232 2: HOU 230 3: BAL 217 4: NYA 212 5: SEA 198 6: LAN 187
Run Production by Home Runs?
Next, I want to explore how many runs are produced (or more precisely, the RBI) with runners on base. First I tabulate the number of RBI’s for home runs — here I am filtering over home runs, computing a count, and grouping by the runs scored variable.
(S <- d[EVENT_CD==23, .(count=.N), RUNS.SCORED]) RUNS.SCORED count 1: 1 2887 2: 2 1360 3: 3 554 4: 4 108
Here I compute the total runs scored, total home runs, and total runs scored as a result of home runs:
d[, .(RUNS=sum(RUNS.SCORED), HR=sum(EVENT_CD==23), RUNS.HR=sum(RUNS.SCORED * (EVENT_CD==23)))] RUNS HR RUNS.HR 1: 20646 4909 7701
BAT_TEAM at the end, I do this same computation for each team.
S <- d[, .(RUNS=sum(RUNS.SCORED), HR=sum(EVENT_CD==23), RUNS.HR=sum(RUNS.SCORED * (EVENT_CD==23))), BAT_TEAM] head(S) BAT_TEAM RUNS HR RUNS.HR 1: KCA 724 139 218 2: ANA 661 176 257 3: OAK 694 146 241 4: TEX 751 172 258 5: SEA 656 198 306 6: HOU 729 230 347
Last, I create a new variable that is the fraction of runs produced by home runs — I include the names of the other variables since I still want them in the data.table.
S <- S[, .(BAT_TEAM, RUNS, HR, RUNS.HR, F_RUNS_HR=RUNS.HR / RUNS)] head(S) BAT_TEAM RUNS HR RUNS.HR F_RUNS_HR 1: KCA 724 139 218 0.3011050 2: ANA 661 176 257 0.3888048 3: OAK 694 146 241 0.3472622 4: TEX 751 172 258 0.3435419 5: SEA 656 198 306 0.4664634 6: HOU 729 230 347 0.4759945
Below I graph the runs and fraction of runs produced by home runs for all 2015 teams.
library(ggplot2) ggplot(S, aes(RUNS, RUNS.HR / RUNS, label=BAT_TEAM)) + geom_point() + geom_label() + ggtitle("Runs and Fraction by Home Runs for All Teams")
Or it might be interesting to plot the number of runs produced by home runs against the number of home runs not produced by home runs.
ggplot(S, aes(RUNS - RUNS.HR, RUNS.HR, label=BAT_TEAM)) + geom_point() + geom_label() + ggtitle("Runs Scored by Home Runs and Other Runs")
Runs Values of Home Runs
As we know, runs batted in is different from “runs created” since the opportunity to score runs decreases when you clear the bases with a home run. I have a
RUNS.VALUE variable in my data frame. I use data.table to compute the average number of runs created by home runs. This is a reflection of how well a team hits home runs in “good” situations with runners on base.
S <- d[, .(RUNS=sum(RUNS.SCORED), HR=sum(EVENT_CD==23), AVG.RUNS.VALUE=sum(RUNS.VALUE * (EVENT_CD==23)) / sum(EVENT_CD==23)), BAT_TEAM] head(S) BAT_TEAM RUNS HR AVG.RUNS.VALUE 1: KCA 724 139 1.395463 2: ANA 661 176 1.319281 3: OAK 694 146 1.446837 4: TEX 751 172 1.350066 5: SEA 656 198 1.380826 6: HOU 729 230 1.341109
ggplot(S, aes(HR, AVG.RUNS.VALUE, label=BAT_TEAM)) + geom_point() + geom_label() + ggtitle("Runs and Average Home Run Values for all")
We see that the average runs value for each Yankees home run was high (close to 1.5); in contrast, Tampa Bay’s average runs value per home run was only about 1.25. (I presume they had a lot of solo home runs.)
Anyway, I hope this post has helped the reader get introduced to the
data.table package. It is very powerful and I appreciate the succinct way you can do operations in a single line of code. I likely will introduce
data.table in my Computing with Data course this fall.