(NOTE: Evan and I are anxiously anticipating Ichiro Suzuki’s 3000th career hit and we plan a special post to commemorate this event. In the meantime, we’ll introduce the data.table
package.)
Although I typically use the dplyr
package in my work, the data.table
package is an attractive alternative package for implementing many of these data wrangling tasks. Here I briefly introduce data.table
for exploring how team home run hitting impacts run scoring for 2015 teams.
Read in csv Files Fast
The fread
function in data.table
quickly reads in csv files. To illustrate, below I read in the play-by-play Retrosheet data for a single season (over 189,000 rows). I measure the time reading this csv file in using both read.csv
and fread
functions — we see that fread
is about 15 times faster.
library(data.table) system.time(read.csv(file.loc)) # 17.743 sec user system elapsed 16.318 0.610 17.622 system.time(fread(file.loc)) # 1.182 sec user system elapsed 1.067 0.075 1.168
Variable Types: No Factors
I made some changes in my compute.runs.expectancy
function to use fread
. I convert the data frame to a data table by the data.table
function.
d <- compute.runs.expectancy_new(2015) d <- data.table(d)
One attractive feature of fread
is that all non-numerical columns of the output data frame are stored as character-type, not factors. This is nice default behavior — it is a bit annoying to have factor variables that you want to convert to numeric.
Basic Data Table Tasks
Once a data frame has been converted to a data.table, say DT, then the basic command for data wrangling is
DT[i, j, by]
which is read “filter according to i”, then “calculate by j”, and “grouping by” the by variable. As we will see, this is a compact way of performing many of the typical tasks with data.tables (that are also data frames).
How many home runs were hit in the 2015 season?
Here I want to filter by EVENT_CD == 23
(the home run flag) and calculate using the .N
variable.
d[EVENT_CD==23, .(count=.N)] count 1: 4909
Count of home runs for each team?
First, I use several data.table
commands to define the BAT_TEAM
. (Note that I use an initial comma, since I am not filtering and I use the :=
symbol to define a new variable.) Then I repeat the same command as above, but now adding BAT_TEAM
to do the grouping.
d[, HOME_TEAM_ID := substr(GAME_ID, 1, 3)] d[, BAT_TEAM := ifelse(BAT_HOME_ID==0, as.character(AWAY_TEAM_ID), HOME_TEAM_ID)] Totals <- d[EVENT_CD==23, .(count=.N), BAT_TEAM] head(Totals) BAT_TEAM count 1: KCA 139 2: ANA 176 3: OAK 146 4: TEX 172 5: SEA 198 6: HOU 230
Arranging a data.table
by the value of a variable.
I use the order
argument on the variable count
to sort the home run counts for the 2015 teams.
Totals <- Totals[order(-count)] head(Totals) BAT_TEAM count 1: TOR 232 2: HOU 230 3: BAL 217 4: NYA 212 5: SEA 198 6: LAN 187
Run Production by Home Runs?
Next, I want to explore how many runs are produced (or more precisely, the RBI) with runners on base. First I tabulate the number of RBI’s for home runs — here I am filtering over home runs, computing a count, and grouping by the runs scored variable.
(S <- d[EVENT_CD==23, .(count=.N), RUNS.SCORED]) RUNS.SCORED count 1: 1 2887 2: 2 1360 3: 3 554 4: 4 108
Here I compute the total runs scored, total home runs, and total runs scored as a result of home runs:
d[, .(RUNS=sum(RUNS.SCORED), HR=sum(EVENT_CD==23), RUNS.HR=sum(RUNS.SCORED * (EVENT_CD==23)))] RUNS HR RUNS.HR 1: 20646 4909 7701
By adding BAT_TEAM
at the end, I do this same computation for each team.
S <- d[, .(RUNS=sum(RUNS.SCORED), HR=sum(EVENT_CD==23), RUNS.HR=sum(RUNS.SCORED * (EVENT_CD==23))), BAT_TEAM] head(S) BAT_TEAM RUNS HR RUNS.HR 1: KCA 724 139 218 2: ANA 661 176 257 3: OAK 694 146 241 4: TEX 751 172 258 5: SEA 656 198 306 6: HOU 729 230 347
Last, I create a new variable that is the fraction of runs produced by home runs — I include the names of the other variables since I still want them in the data.table.
S <- S[, .(BAT_TEAM, RUNS, HR, RUNS.HR, F_RUNS_HR=RUNS.HR / RUNS)] head(S) BAT_TEAM RUNS HR RUNS.HR F_RUNS_HR 1: KCA 724 139 218 0.3011050 2: ANA 661 176 257 0.3888048 3: OAK 694 146 241 0.3472622 4: TEX 751 172 258 0.3435419 5: SEA 656 198 306 0.4664634 6: HOU 729 230 347 0.4759945
Below I graph the runs and fraction of runs produced by home runs for all 2015 teams.
library(ggplot2) ggplot(S, aes(RUNS, RUNS.HR / RUNS, label=BAT_TEAM)) + geom_point() + geom_label() + ggtitle("Runs and Fraction by Home Runs for All Teams")
We see that although Toronto had the most number of home runs, the fraction of runs produced by home runs was high for Baltimore, Houston and the Yankees.
Or it might be interesting to plot the number of runs produced by home runs against the number of home runs not produced by home runs.
ggplot(S, aes(RUNS - RUNS.HR, RUNS.HR, label=BAT_TEAM)) + geom_point() + geom_label() + ggtitle("Runs Scored by Home Runs and Other Runs")
Here we see a negative relationship — teams that produce many runs by home runs tend not to produce many runs not by home runs.
Runs Values of Home Runs
As we know, runs batted in is different from “runs created” since the opportunity to score runs decreases when you clear the bases with a home run. I have a RUNS.VALUE
variable in my data frame. I use data.table to compute the average number of runs created by home runs. This is a reflection of how well a team hits home runs in “good” situations with runners on base.
S <- d[, .(RUNS=sum(RUNS.SCORED), HR=sum(EVENT_CD==23), AVG.RUNS.VALUE=sum(RUNS.VALUE * (EVENT_CD==23)) / sum(EVENT_CD==23)), BAT_TEAM] head(S) BAT_TEAM RUNS HR AVG.RUNS.VALUE 1: KCA 724 139 1.395463 2: ANA 661 176 1.319281 3: OAK 694 146 1.446837 4: TEX 751 172 1.350066 5: SEA 656 198 1.380826 6: HOU 729 230 1.341109
ggplot(S, aes(HR, AVG.RUNS.VALUE, label=BAT_TEAM)) + geom_point() + geom_label() + ggtitle("Runs and Average Home Run Values for all")
We see that the average runs value for each Yankees home run was high (close to 1.5); in contrast, Tampa Bay’s average runs value per home run was only about 1.25. (I presume they had a lot of solo home runs.)
Final Words
Anyway, I hope this post has helped the reader get introduced to the data.table
package. It is very powerful and I appreciate the succinct way you can do operations in a single line of code. I likely will introduce data.table
in my Computing with Data course this fall.