Modifying and Querying a PITCHf/x database with dplyr

In my last post, we got up and running with a PITCHf/x database. This time, I’ll share some thoughts on querying this database. I am by no means a database expert, so please share ideas for improvement if you have them. Before we get started, let me share some opinions I’ve gathered while developing pitchRx.

A bit of my philosophy

When handling such a large and complex data source, there are a number of benefits to using a database. In my opinion, the biggest benefit is the ability to work around memory limitations on your machine. That is, if you read.table or read.csv every time you want to run a large scale analysis, you will most likely exhaust memory limitations on your machine and causes it to be sluggish. Another benefit is a more reproducible workflow with less overhead due to managing loads of spreadsheets. I hope this post helps motivate pitchRx users to think carefully about the data required for a particular analysis and construct an appropriate database query when necessary.

pitchRx tries not to make assumptions about the particular type of analysis that you’d like to run. Its main goal is to provide PITCHf/x data in its raw form. pitchRx does add some columns for sake of convenience (such as the pitch count and pitcher/batter names), but some of you may want to modify this raw form yourself. I encourage pitchRx users to perform modifications and also let me know if something important is missing or could be improved. That being said, this post demonstrates some methods I’ve found useful for querying and modifying a database from within R using dplyr and DBI.

First things first, let’s establish connection to our PITCHf/x database. Note that my database is saved under my ~/pitchfx directory as “pitchRx.sqlite”.

library(dplyr)
db <- src_sqlite('~/pitchfx/pitchRx.sqlite3')
db
## src:  sqlite 3.7.17 [pitchRx.sqlite3]
## tbls: action, atbat, coach, game, hip, media, pitch, player, po, runner,
##   umpire

When in doubt, start with atbat

I view the atbat table a central component the database, since it has a direct link to the most tables. Thus, in most cases, it makes sense to first subset the atbat table based on our level of interest then join it will another table. Before we explore some common subsetting operations, let’s grab the atbat table.

atbats <- tbl(db, 'atbat')

It’s important to note that we haven’t actually brought this data into R. dplyr is smart enough to postpone data collection until it’s absolutely necessary. In fact, atbats is currently a list containing meta-information that dplyr will use to (eventually) perform an SQL query. To emphasize my point, let’s take a look at the SQL query associated with atbats.

atbats$query
## <Query> SELECT 'pitcher', 'batter', 'num', 'b', 's', 'o', 'start_tfs', 'start_tfs_zulu', 'stand', 'b_height', 'p_throws', 'atbat_des', 'atbat_des_es', 'event', 'score', 'home_team_runs', 'away_team_runs', 'url', 'inning_side', 'inning', 'next_', 'event2', 'event3', 'batter_name', 'pitcher_name', 'event4', 'gameday_link'
## FROM 'atbat'
## <SQLiteConnection: DBI CON (7835, 0)>

This query would pull the entire atbat table into R if we were to collect(atbats). The neat thing about dplyr is that we can modify atbats as if it were a data frame before performing an expensive query.

Appending a date column

NOTE: This section is outdated since pitchRx now adds a date column to the atbat table; however, this section may still be useful if you’d like to append a column to a table in your database.

Dates are embedded between the 5th and 15th character of values in the “gameday_link” column. Using dplyr’s mutate function, we could append a date column to the atbat table.

atbat_date <- mutate(atbats, date = substr(gameday_link, 15L, -10L))
atbat_date$query
## <Query> SELECT 'pitcher', 'batter', 'num', 'b', 's', 'o', 'start_tfs', 'start_tfs_zulu', 'stand', 'b_height', 'p_throws', 'atbat_des', 'atbat_des_es', 'event', 'score', 'home_team_runs', 'away_team_runs', 'url', 'inning_side', 'inning', 'next_', 'event2', 'event3', 'batter_name', 'pitcher_name', 'event4', 'gameday_link', SUBSTR('gameday_link', 15,  - 10) AS 'date'
## FROM 'atbat'
## <SQLiteConnection: DBI CON (7835, 0)>

One lofty goal of dplyr is to provide a set of high level verbs that abstract out common SQL operations. This allows dplyr to be agnostic in regards to where your data actually lives. While this is often very nice, one also has to be careful of various “gotchas” when mapping R functions to SQL functions. For example, arguments to the substr function in R are not exactly the same as arguments to the SQL core function SUBSTR. For this reason, you might find the core function reference helpful at some point (at least I did when writing this post :). Using dplyr’s compute function, we can perform this query without actually bringing the resulting “atbat_date” table into R.

compute(atbat_date, name = 'atbat_date', temporary = FALSE)
## Source: sqlite 3.7.17 [pitchRx.sqlite3]
## From: atbat_date [1,248,357 x 28]
## 
##    pitcher batter num b s o start_tfs start_tfs_zulu stand b_height
## 1   277417 493479   1 0 3 1                              R     5-11
## 2   277417 493468   3 0 3 3                              R      6-0
## 3   277417 493488   8 0 3 2                              L      6-3
## 4   277417 493470   2 0 0 2                              R      6-1
## 5   277417 493472   7 0 0 1                              R      6-3
## 6   277417 507000   9 0 3 3                              L     5-11
## 7   216897 493479  32 0 0 1                              R     5-11
## 8   216897 493470  33 0 0 2                              R      6-1
## 9   216897 506988  24 0 3 1                              R      6-2
## 10  216897 493468  34 0 0 3                              R      6-0
## ..     ...    ... ... . . .       ...            ...   ...      ...
## Variables not shown: p_throws (chr), atbat_des (chr), atbat_des_es (chr),
##   event (chr), score (chr), home_team_runs (chr), away_team_runs (chr),
##   url (chr), inning_side (chr), inning (dbl), next_ (chr), event2 (chr),
##   event3 (chr), batter_name (chr), pitcher_name (chr), event4 (chr),
##   gameday_link (chr), date (chr)
db # Note the new atbat_date table
## src:  sqlite 3.7.17 [pitchRx.sqlite3]
## tbls: action, atbat, atbat_date, coach, game, hip, media, pitch, player,
##   po, runner, umpire

Once this table is computed, there is no reason to hang onto the original atbat table. We can easily remove this table using the DBI package (R’s database interface). Both pitchRx and dplyr import this package, so it will be installed already.

library(DBI)
dbRemoveTable(db$con, name = 'atbat')
## [1] TRUE
db # Note removal of atbat table
## src:  sqlite 3.7.17 [pitchRx.sqlite3]
## tbls: action, atbat_date, coach, game, hip, media, pitch, player, po,
##   runner, umpire

It is also a good idea to rename the new “atbat_date” table as “atbat” so that pitchRx can recognize it when updating the databaseDBI’s dbSendQuery function allows us to send arbitrary queries to the database, so let’s use it to rename the table.

dbSendQuery(db$con, 'ALTER TABLE atbat_date RENAME TO atbat')
## <SQLiteResult: DBI RES (7835, 0, 13)>
db # Now atbat has a date column
## src:  sqlite 3.7.17 [pitchRx.sqlite3]
## tbls: action, atbat, coach, game, hip, media, pitch, player, po, runner,
##   umpire

Query by date

Assuming we have data on multiple years, we could grab 2013 data in the following manner. Also, before we perform an SQL query, we can check the  using explain:

atbat13 <- filter(tbl(db, 'atbat'), date >= '2013_01_01' & date <= '2014_01_01') 
explain(atbat13)
## <SQL>
## SELECT 'pitcher', 'batter', 'num', 'b', 's', 'o', 'start_tfs', 'start_tfs_zulu', 'stand', 'b_height', 'p_throws', 'atbat_des', 'atbat_des_es', 'event', 'score', 'home_team_runs', 'away_team_runs', 'url', 'inning_side', 'inning', 'next_', 'event2', 'event3', 'batter_name', 'pitcher_name', 'event4', 'gameday_link', 'date'
## FROM 'atbat'
## WHERE 'date' >= '2013_01_01' AND 'date' <= '2014_01_01'
## 
## <PLAN>
##   selectid order from                          detail
## 1        0     0    0 SCAN TABLE atbat (~111111 rows)

Notice the SCAN TABLE atbat part under . This means that SQL will actually check each record to see if it meets our date criteria when the query is performed. We can do much better than this if we know a little about indicies and query planning. If we create an index on the date column, SQL no longer has to check every record and can perform a binary search instead. Let’s use dbSendQuery again, this time to create an index on the date column.

dbSendQuery(db$con, 'CREATE INDEX date_idx ON atbat(date)')
## <SQLiteResult: DBI RES (7835, 0, 18)>
explain(atbat13)
## <SQL>
## SELECT 'pitcher', 'batter', 'num', 'b', 's', 'o', 'start_tfs', 'start_tfs_zulu', 'stand', 'b_height', 'p_throws', 'atbat_des', 'atbat_des_es', 'event', 'score', 'home_team_runs', 'away_team_runs', 'url', 'inning_side', 'inning', 'next_', 'event2', 'event3', 'batter_name', 'pitcher_name', 'event4', 'gameday_link', 'date'
## FROM 'atbat'
## WHERE 'date' >= '2013_01_01' AND 'date' <= '2014_01_01'
## 
## <PLAN>
##   selectid order from
## 1        0     0    0
##                                                                      detail
## 1 SEARCH TABLE atbat USING INDEX date_idx (date>? AND date<?) (~62500 rows)

Now that we have an index on the date column, the  section now says SEARCH TABLE atbat USING INDEX and as a result, our query will be faster. Note that indices only have to be created once, so now that we have date_idx, we can use it for other queries.

Subsetting by player

In addition to dates, we often restrict our analysis to the player level. Thankfully, pitchRx adds player names to the atbat table so we can subset the atbats by name rather than ID. Suppose I’m interested in gathering every pitch thrown by Clayton Kershaw in 2013. I can simply apply another filter to atbat13:

kershaw13 <- filter(atbat13, pitcher_name == 'Clayton Kershaw')

Since we are subsetting by pitcher name, it’s a good idea to create another index on the pitcher_name column.

dbSendQuery(db$con, 'CREATE INDEX pitcher_idx ON atbat(pitcher_name)')
## <SQLiteResult: DBI RES (7835, 0, 20)>
explain(kershaw13)
## <SQL>
## SELECT 'pitcher', 'batter', 'num', 'b', 's', 'o', 'start_tfs', 'start_tfs_zulu', 'stand', 'b_height', 'p_throws', 'atbat_des', 'atbat_des_es', 'event', 'score', 'home_team_runs', 'away_team_runs', 'url', 'inning_side', 'inning', 'next_', 'event2', 'event3', 'batter_name', 'pitcher_name', 'event4', 'gameday_link', 'date'
## FROM 'atbat'
## WHERE 'date' >= '2013_01_01' AND 'date' <= '2014_01_01' AND 'pitcher_name' = 'Clayton Kershaw'
## 
## <PLAN>
##   selectid order from
## 1        0     0    0
##                                                                  detail
## 1 SEARCH TABLE atbat USING INDEX pitcher_idx (pitcher_name=?) (~2 rows)

Joining atbat with pitch

Remember that the pitch table is the one that contains the valuable PITCHf/x information so we usually want to join this data with the relevant atbat information.

pitches <- tbl(db, 'pitch')
k13 <- inner_join(pitches, kershaw13, by = c('num', 'gameday_link'))

Note that I have included both the “num” and “gameday_link” as columns to join by. The “gameday_link” column gives us an indication of which game a particular pitch/atbat came from and the “num” column keeps track of the atbat within a game. Together, these columns create a unique key which allows us to join these tables. Note that an inner join will drop records that don’t match on the unique key. In other words, this join will drop pitches where Kershaw was not the pitcher.

For reasons similar to before, this query will be much, much faster if we create an index for the pitch table corresponding to the unique key.

dbSendQuery(db$con, 'CREATE INDEX pitch_idx ON pitch(gameday_link, num)')
## <SQLiteResult: DBI RES (7835, 0, 24)>

Finally, let’s collect the query and be amazed at how fast we get a result.

kershaw <- collect(k13)

Chaining operations in dplyr

dplyr‘s %.% operator allows ’piping’ of objects through a series of operations. This %.% guy can save us a lot of typing and produces much cleaner code when chaining many operations together. For example, I could grab all of Verlander’s pitches in 2013 like so:

atbats <- tbl(db, 'atbat') %.%
  filter(date >= '2013_01_01' & date <= '2014_01_01') %.%
  filter(pitcher_name == 'Justin Verlander')
pitches <- tbl(db, 'pitch')  
verlander <- collect(inner_join(pitches, atbats, by = c('num', 'gameday_link')))

Animating pitches

I know you’re dying for some graphics at this point, so let’s use the animateFX function from the pitchRx package to animate “average” trajectories for each one of Kershaw’s pitch types. I recommend using the saveHTML function from the animation package so you can view animations in a web browser.

library(pitchRx)
library(animation)
saveHTML(
  animateFX(kershaw, avg.by = 'pitch_types', layer = list(theme_bw(), facet_grid(.~stand))),
  img.name = "kershaw"
)

And now Verlander’s:

saveHTML(
  animateFX(verlander, avg.by = 'pitch_types', layer = list(theme_bw(), facet_grid(. ~ stand))),
  img.name = "verlander"
)

6 responses

  1. Great info and explanation. I was struggling with this. The indexing sped up my queries immensely. Thank you for sharing.

  2. Hi,

    Is dbSendQuery still available in 3.8.6? I get Error: could not find function “dbSendQuery”.

    Thanks,
    Jon

    1. dbSendQuery() is in the DBI package, so load the library first: library(DBI)

    2. Thank you. That was an easy one. I had to scrape one year at a time but I finally have all of the data and was able to animate pitches for John Lackey’s pitches this year. Very cool.

      I have hit a brick wall trying to migrate the R code into R Markdown. Have you had any luck creating animations in R Markdown?

      Cheers,
      Jon

  3. I noticed that several libraries disappeared from my work laptop today so I had to reinstall the pitchRx library. When I ran through my code to animate pitches for Jake Arieta, I noticed the strike zone was missing. I am unable to reproduce my animation that I created back in December 2015 which is located here, http://www.vizitdata.com/Cubs/index.html. I do get the animation but the black box outlining the strike zone are missing. Any ideas?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: