Welcome to R – Part 4 – The data.table Package

Will We See Baseball in 2020?

It is disappointing to see the current breakdown in negotiations between the owners and the players. But at this time it appears that there will be a shortened 2020 baseball season where the number of games and playoff structure will be determined by Major League Baseball. I look forward to exploring current baseball data, although one wonders what it means for a team to be the champion of a shortened 2020 season.

Introduction to data.table

If you have been following my Welcome to R series, we have talked in Parts 1, 2, and 3 about vectors, data frames, and graphical displays for studying relationships. In this Part 4 of the series, I will introduce the data.table package, which is an enhancement of the default data frame structure in R for handling tabular data. The R community, especially the RStudio folks, embraces the tidyverse, which includes the dplyr package for manipulating data frames. But I think data.table, which is used by many data scientists, is a very attractive alternative to dplyr. The goal here is to introduce the basic data.table syntax and illustrate it on a small baseball study. I hope this will encourage the reader to try data.table out.

Reading in Some Statcast Data

In my ABWRdata package, available on GitHub, I have included several datasets to accompany our Analyzing Baseball Data with R book. Using the fread() function in the data.table package, I can read in a Statcast dataset directly from my GitHub site. This dataset contains 100K records from the 2017 season. Each row corresponds to a pitch, and there are 91 variables in total.

By default, the fread() function will save the result as a data table.

library(data.table)
sc <- fread("https://raw.githubusercontent.com/bayesball/ABWRdata/master/data/statcast2017.txt")

What is a data.table?

We are already familiar with a data frame in R. One can think of a data table as an enhanced version of a data frame, or maybe a data frame on steroids. A data table is also a data frame, so you can manipulate a data table in the same way as a data frame. But we’ll see that a data table has extra capabilities that make it useful for performing data science operations quickly using a concise notation.

data.table Basic Syntax

We are already familiar with the bracket notation in a data frame — if we have a data frame, say DF, then DF[i, j] refers to rows i and variables j. In a data table DT, we have the general notation

DT[i, j, by]

where i indicates how we select rows, j indicates how we choose variables, and by indicates how we divide the data table by groups.
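As a quick sketch of how the three pieces fit together in one call, here is a tiny made-up table (the table DT and its columns batter, strikes, and speed are hypothetical, not part of the Statcast data).

```r
library(data.table)

# Hypothetical toy table (not the Statcast data): six pitches to two batters
DT <- data.table(
  batter  = c("A", "A", "A", "B", "B", "B"),
  strikes = c(0, 1, 2, 0, 1, 2),
  speed   = c(90, 85, 80, 95, 88, 84)
)

# i:  keep the rows with at least one strike
# j:  compute the mean speed over those rows
# by: do this separately for each batter
res <- DT[strikes >= 1, .(mean_speed = mean(speed)), by = batter]
print(res)
```

The result has one row per batter, with the mean taken only over the rows that passed the filter in i.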

It seems best to illustrate these different operations by examples using our Statcast data table.

Selecting Rows

These examples will look familiar to those who use the bracket notation to select rows in data frames. In the first example, I choose the rows where the player name is “Mike Trout”. In the second example, I choose the rows corresponding to fastball pitches, and the third example selects the pitches thrown to Mike Trout that are curveballs. Note several things: (1) I can just use the variable name in the brackets (I don’t need to use the $ notation), and (2) I don’t need an extra comma inside the brackets.

sc[player_name == "Mike Trout"]
sc[pitch_type %in%
        c("FC", "FF", "FO", "FS", "FT")]
sc[player_name == "Mike Trout"
            & pitch_type == "CU"]

Selecting Variables

The j argument lets one select one or more variables from the data table. Below I first select the pitch type, and then I select four variables. Again note that I just use the variable name, and when I want to select more than one variable, I place the variable names inside .().

sc[, pitch_type]
sc[, .(plate_x, plate_z, description, events)]
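One detail that can surprise new users is that a single bare variable name in j returns a plain vector, while wrapping names in .() returns a data table. Here is a small sketch on a made-up table (DT and its values are hypothetical; only the column names are borrowed from the Statcast data).

```r
library(data.table)

# Hypothetical toy table standing in for the Statcast data
DT <- data.table(
  pitch_type = c("FF", "CU", "FF"),
  plate_x    = c(0.1, -0.4, 0.3),
  plate_z    = c(2.5, 1.8, 3.0)
)

v  <- DT[, pitch_type]           # bare name: a character vector
d1 <- DT[, .(pitch_type)]        # inside .(): a one-column data.table
d2 <- DT[, .(plate_x, plate_z)]  # several names: a two-column data.table

print(class(v))
print(class(d1))
print(names(d2))
```

So if a later step expects a table, say for chaining another bracket operation, the .() form is the safer choice.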

Computing Summaries and Summaries by Group

Selecting rows and variables in data tables should look pretty familiar to R users. But the j argument can also be used to compute summaries. Here are a couple of examples.

First I want to compute the mean launch speed for all balls put in play (use type == "X" to select the rows in play).

sc[type == "X", mean(launch_speed)]

[1] 86.69978

Okay, the mean launch speed is 86.7 mph. How does the speed vary by the number of strikes in the count? To answer this question, we now use the by argument: we add a .(strikes) value as the third argument. The output below shows the mean launch speed for 0, 1, and 2 strikes. Since the summary in j is not given a name, data.table labels it V1. If the third argument is .(strikes, balls), it will produce the mean launch speed for each combination of strikes and balls.

sc[type == "X", mean(launch_speed),
   .(strikes)]

   strikes       V1
1:       1 86.69287
2:       2 86.13330
3:       0 87.44394
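To make the two-variable grouping concrete, here is a sketch on a small made-up table (DT and its launch speeds are hypothetical). It also names the summary in j, which replaces the default V1 label.

```r
library(data.table)

# Hypothetical toy table: six balls in play with made-up launch speeds
DT <- data.table(
  type         = "X",
  strikes      = c(0, 0, 1, 1, 1, 2),
  balls        = c(0, 1, 0, 0, 1, 1),
  launch_speed = c(90, 88, 85, 87, 91, 80)
)

# One row per (strikes, balls) combination that occurs in the data;
# naming the summary gives the column a readable label instead of V1
res <- DT[type == "X",
          .(mean_speed = mean(launch_speed)),
          by = .(strikes, balls)]
print(res)
```

Only the combinations that actually appear in the data show up in the result, in order of first appearance.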

Swing and Miss Study

To illustrate using data.table in a small study, consider a player hitting a line drive. First, the batter has to decide to swing. Next, the batter has to make contact with the pitch. Finally, assuming that the batter swings and doesn’t miss the pitch, he hits a line drive. How does the likelihood of each of these three steps of this process depend on the count? We are interested in ordering the counts from most likely to least likely for each of these operations.

Some Preliminaries

First I construct vectors consisting of the swing and miss situations.

swing_situations <- c("hit_into_play", "foul",
                      "swinging_strike",
                      "swinging_strike_blocked",
                      "missed_bunt", "hit_into_play_no_out",
                      "foul_bunt", "foul_tip",
                      "hit_into_play_score")
miss_situations <-
  c("swinging_strike", "swinging_strike_blocked")

I define new variables Count, swing, and miss. The last two are indicator variables equal to 1 if the event (swing or miss) occurs and 0 if it does not.

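# := adds columns by reference, so sc is updated in place
sc[, Count := paste(balls, strikes, sep = "-")]
# swing and miss are 1/0 indicators built from the pitch description
sc[, c("swing", "miss") :=
     .(ifelse(description %in%
                swing_situations, 1, 0),
       ifelse(description %in%
                miss_situations, 1, 0))]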
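Here is a small sketch of what the := operator does, using a made-up table (DT and the shortened swing_values vector are hypothetical, for illustration only). It shows that the table is updated even when the result is not assigned back, and that the mean of a 1/0 indicator is simply the fraction of rows where the event occurred.

```r
library(data.table)

# Hypothetical toy table with a few Statcast-style description values
DT <- data.table(
  balls       = c(0, 1, 3),
  strikes     = c(0, 2, 2),
  description = c("ball", "swinging_strike", "foul")
)

# Shortened list of swing outcomes, for illustration only
swing_values <- c("swinging_strike", "foul")

# := adds the new columns to DT itself
DT[, Count := paste(balls, strikes, sep = "-")]
DT[, swing := as.integer(description %in% swing_values)]
print(DT)

# The mean of a 1/0 indicator is the fraction of swings
swing_rate <- DT[, mean(swing)]
print(swing_rate)
```

This is why the later summaries can use mean(swing) and mean(miss) directly as swing and miss rates.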

Swinging by Count

I compute the number of occurrences N and the fraction of swings M, and group these summaries for each possible count, and then I order the result by the fraction M in descending order. Note the reluctance of the batter to swing in early counts like 0-0 and counts where he has the advantage (like 2-0 or 3-0). It is most likely for the batter to swing in two-strike counts with the exception of a 0-2 count.

sc[, .(.N, M = round(mean(swing), 3)), 
    by = Count][order(- M)]

    Count     N     M
 1:   3-2  4930 0.727
 2:   2-2  8195 0.640
 3:   2-1  5285 0.582
 4:   1-2  9536 0.580
 5:   3-1  2290 0.578
 6:   1-1 10328 0.536
 7:   0-2  6374 0.507
 8:   0-1 12832 0.468
 9:   2-0  3531 0.432
10:   1-0 10095 0.421
11:   0-0 25542 0.289
12:   3-0  1062 0.103
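The second pair of brackets in the code above is a chained operation: it works on the table produced by the first pair. Here is a sketch on a made-up table (DT and its values are hypothetical) that splits the chain into two steps and also shows that an unnamed .N gets the column name N.

```r
library(data.table)

# Hypothetical toy table: swing indicator for eight pitches in three counts
DT <- data.table(
  Count = c("0-0", "0-0", "0-0", "0-1", "0-1", "3-2", "3-2", "3-2"),
  swing = c(0, 0, 1, 1, 0, 1, 1, 1)
)

# Step 1: summarize by count; the unnamed .N becomes a column called N
by_count <- DT[, .(.N, M = mean(swing)), by = Count]

# Step 2: sort the summary table by M in descending order
sorted <- by_count[order(-M)]

# The chained form does both steps in one expression
chained <- DT[, .(.N, M = mean(swing)), by = Count][order(-M)]
print(chained)
```

Writing N = .N instead of .N gives the same column; it only makes the name explicit.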

Making Contact by Count

Next, assuming that the batter has swung (so we only consider rows where swing == 1), we compute the number of swings N and the fraction of misses M, grouping by count. Here we order in ascending order, so the top rows correspond to the counts where the batter is least likely to miss, that is, most likely to make contact. Generally, it appears that the batter is most likely to make contact when he is ahead in the count. The three counts with the lowest miss rates all have three balls. He is most likely to miss on a 0-2 count.

sc[swing == 1,
    .(.N, M = round(mean(miss), 3)), 
   by = Count][order(M)]

    Count    N     M
 1:   3-0  109 0.083
 2:   3-2 3585 0.155
 3:   3-1 1324 0.162
 4:   2-0 1526 0.172
 5:   2-1 3076 0.203
 6:   2-2 5248 0.216
 7:   1-0 4250 0.227
 8:   1-1 5535 0.230
 9:   0-0 7384 0.232
10:   1-2 5534 0.248
11:   0-1 6002 0.248
12:   0-2 3230 0.253

Hitting a Line Drive by Count

Last, assuming that the batter has swung (swing == 1) and has made contact (miss == 0), we are interested in finding the fraction of line drives, again ordering from most likely to least likely. Here the pattern is not quite as obvious. The most likely line drive counts are 1-0 and 2-2, and the least likely line drive counts are 3-0 and 1-2.

sc[swing == 1 & miss == 0, 
    .(N = .N,
     M = round(mean(bb_type == "line_drive",
                     na.rm = TRUE), 3)),
    by = Count][order(-M)]

    Count    N     M
 1:   1-0 3287 0.267
 2:   2-2 4115 0.266
 3:   1-1 4264 0.258
 4:   2-0 1264 0.257
 5:   3-2 3031 0.251
 6:   0-1 4515 0.250
 7:   2-1 2453 0.249
 8:   0-0 5673 0.248
 9:   3-1 1110 0.241
10:   0-2 2412 0.237
11:   3-0  100 0.236
12:   1-2 4160 0.235

Why Do I Like data.table?

Many years ago, I became infatuated with the programming language APL. I appreciated the simplicity and power of programming using vectors and matrices, and APL had special symbols for performing these operations. One could do a relatively complex calculation using one or two lines of APL code. With all of the special symbols, it could be hard to read APL code, but I was impressed with the efficient use of code.

In a similar way, I think data.table provides a very impressive, succinct syntax for implementing the common data science operations on a table. Once you know the basic bracketing operations in a data frame, it is a small step to learn the enhanced operations in the data.table package. If one is thinking about the size of the vocabulary one needs to learn, there is no contest between the data.table and dplyr packages: data.table is a clear winner. (By the way, I am disappointed that RStudio, one of the main commercial R companies, appears to ignore the data.table package in their documentation. I would think they should introduce the most useful R packages on their site.)

I encourage you to try data.table out. Here is a blog post that provides a more detailed introduction to the data.table package. It is easy to install the package from CRAN and I have posted all of the R work in this post as a Markdown file on my Github Gist site. I have explored here how the rate of swinging, contact, and line drives depends on the count. The interested reader could instead look at how the average launch speed or average launch angle depends on the situation. At the very least, I provide some easy access to some Statcast data that is fun to play with.
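As a starting point for that exercise, here is a sketch of the launch speed and launch angle summary by count. It runs on a made-up table (toy and its values are hypothetical) and assumes the real data has launch_speed and launch_angle columns; with the Statcast data one would replace toy with sc.

```r
library(data.table)

# Hypothetical toy rows that mimic a few Statcast columns; values are made up
toy <- data.table(
  type         = c("X", "X", "X", "X", "S", "B"),
  balls        = c(0, 0, 2, 2, 1, 0),
  strikes      = c(0, 0, 1, 1, 2, 0),
  launch_speed = c(95, 85, 100, 90, NA, NA),
  launch_angle = c(10, 20, 30, -10, NA, NA)
)
toy[, Count := paste(balls, strikes, sep = "-")]

# Mean launch speed and launch angle for balls in play, grouped by count
# and ordered from the hardest-hit count to the softest
res <- toy[type == "X",
           .(N = .N,
             speed = mean(launch_speed, na.rm = TRUE),
             angle = mean(launch_angle, na.rm = TRUE)),
           by = Count][order(-speed)]
print(res)
```

The na.rm = TRUE argument is a precaution in case some balls in play are missing a measurement.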

2 responses

  1. Really great explanation. I’m an R noob and have dabbled with dplyr a good bit but data.table is so Show. Can’t wait to play with it on my own. Question though, why do use .N in the first 2 scenarios then N = .N in the last?

    1. I believe in the second case, you are just naming the .N variable to be N.
