In my last post, we got up and running with a PITCHf/x database. This time, I’ll share some thoughts on querying this database. I am by no means a database expert, so please share ideas for improvement if you have them. Before we get started, let me share some opinions I’ve gathered while developing pitchRx.
A bit of my philosophy
When handling such a large and complex data source, there are a number of benefits to using a database. In my opinion, the biggest benefit is the ability to work around memory limitations on your machine. That is, if you read.table
or read.csv
every time you want to run a large scale analysis, you will most likely exhaust memory limitations on your machine and causes it to be sluggish. Another benefit is a more reproducible workflow with less overhead due to managing loads of spreadsheets. I hope this post helps motivate pitchRx
users to think carefully about the data required for a particular analysis and construct an appropriate database query when necessary.
pitchRx
tries not to make assumptions about the particular type of analysis that you’d like to run. Its main goal is to provide PITCHf/x data in its raw form. pitchRx
does add some columns for sake of convenience (such as the pitch count and pitcher/batter names), but some of you may want to modify this raw form yourself. I encourage pitchRx
users to perform modifications and also let me know if something important is missing or could be improved. That being said, this post demonstrates some methods I’ve found useful for querying and modifying a database from within R using dplyr and DBI.
First things first, let’s establish connection to our PITCHf/x database. Note that my database is saved under my ~/pitchfx
directory as “pitchRx.sqlite”.
library(dplyr) db <- src_sqlite('~/pitchfx/pitchRx.sqlite3') db
## src: sqlite 3.7.17 [pitchRx.sqlite3] ## tbls: action, atbat, coach, game, hip, media, pitch, player, po, runner, ## umpire
When in doubt, start with atbat
I view the atbat table a central component the database, since it has a direct link to the most tables. Thus, in most cases, it makes sense to first subset the atbat table based on our level of interest then join it will another table. Before we explore some common subsetting operations, let’s grab the atbat table.
atbats <- tbl(db, 'atbat')
It’s important to note that we haven’t actually brought this data into R. dplyr
is smart enough to postpone data collection until it’s absolutely necessary. In fact, atbats
is currently a list containing meta-information that dplyr
will use to (eventually) perform an SQL query. To emphasize my point, let’s take a look at the SQL query associated with atbats
.
atbats$query
## <Query> SELECT 'pitcher', 'batter', 'num', 'b', 's', 'o', 'start_tfs', 'start_tfs_zulu', 'stand', 'b_height', 'p_throws', 'atbat_des', 'atbat_des_es', 'event', 'score', 'home_team_runs', 'away_team_runs', 'url', 'inning_side', 'inning', 'next_', 'event2', 'event3', 'batter_name', 'pitcher_name', 'event4', 'gameday_link' ## FROM 'atbat' ## <SQLiteConnection: DBI CON (7835, 0)>
This query would pull the entire atbat table into R if we were to collect(atbats)
. The neat thing about dplyr
is that we can modify atbats
as if it were a data frame before performing an expensive query.
Appending a date column
NOTE: This section is outdated since pitchRx now adds a date column to the atbat table; however, this section may still be useful if you’d like to append a column to a table in your database.
Dates are embedded between the 5th and 15th character of values in the “gameday_link” column. Using dplyr
’s mutate
function, we could append a date column to the atbat table.
atbat_date <- mutate(atbats, date = substr(gameday_link, 15L, -10L)) atbat_date$query
## <Query> SELECT 'pitcher', 'batter', 'num', 'b', 's', 'o', 'start_tfs', 'start_tfs_zulu', 'stand', 'b_height', 'p_throws', 'atbat_des', 'atbat_des_es', 'event', 'score', 'home_team_runs', 'away_team_runs', 'url', 'inning_side', 'inning', 'next_', 'event2', 'event3', 'batter_name', 'pitcher_name', 'event4', 'gameday_link', SUBSTR('gameday_link', 15, - 10) AS 'date' ## FROM 'atbat' ## <SQLiteConnection: DBI CON (7835, 0)>
One lofty goal of dplyr
is to provide a set of high level verbs that abstract out common SQL operations. This allows dplyr
to be agnostic in regards to where your data actually lives. While this is often very nice, one also has to be careful of various “gotchas” when mapping R functions to SQL functions. For example, arguments to the substr
function in R are not exactly the same as arguments to the SQL core function SUBSTR
. For this reason, you might find the core function reference helpful at some point (at least I did when writing this post :). Using dplyr
’s compute
function, we can perform this query without actually bringing the resulting “atbat_date” table into R.
compute(atbat_date, name = 'atbat_date', temporary = FALSE)
## Source: sqlite 3.7.17 [pitchRx.sqlite3] ## From: atbat_date [1,248,357 x 28] ## ## pitcher batter num b s o start_tfs start_tfs_zulu stand b_height ## 1 277417 493479 1 0 3 1 R 5-11 ## 2 277417 493468 3 0 3 3 R 6-0 ## 3 277417 493488 8 0 3 2 L 6-3 ## 4 277417 493470 2 0 0 2 R 6-1 ## 5 277417 493472 7 0 0 1 R 6-3 ## 6 277417 507000 9 0 3 3 L 5-11 ## 7 216897 493479 32 0 0 1 R 5-11 ## 8 216897 493470 33 0 0 2 R 6-1 ## 9 216897 506988 24 0 3 1 R 6-2 ## 10 216897 493468 34 0 0 3 R 6-0 ## .. ... ... ... . . . ... ... ... ... ## Variables not shown: p_throws (chr), atbat_des (chr), atbat_des_es (chr), ## event (chr), score (chr), home_team_runs (chr), away_team_runs (chr), ## url (chr), inning_side (chr), inning (dbl), next_ (chr), event2 (chr), ## event3 (chr), batter_name (chr), pitcher_name (chr), event4 (chr), ## gameday_link (chr), date (chr)
db # Note the new atbat_date table
## src: sqlite 3.7.17 [pitchRx.sqlite3] ## tbls: action, atbat, atbat_date, coach, game, hip, media, pitch, player, ## po, runner, umpire
Once this table is computed, there is no reason to hang onto the original atbat table. We can easily remove this table using the DBI package (R’s database interface). Both pitchRx
and dplyr
import this package, so it will be installed already.
library(DBI) dbRemoveTable(db$con, name = 'atbat')
## [1] TRUE
db # Note removal of atbat table
## src: sqlite 3.7.17 [pitchRx.sqlite3] ## tbls: action, atbat_date, coach, game, hip, media, pitch, player, po, ## runner, umpire
It is also a good idea to rename the new “atbat_date” table as “atbat” so that pitchRx
can recognize it when updating the database. DBI
’s dbSendQuery
function allows us to send arbitrary queries to the database, so let’s use it to rename the table.
dbSendQuery(db$con, 'ALTER TABLE atbat_date RENAME TO atbat')
## <SQLiteResult: DBI RES (7835, 0, 13)>
db # Now atbat has a date column
## src: sqlite 3.7.17 [pitchRx.sqlite3] ## tbls: action, atbat, coach, game, hip, media, pitch, player, po, runner, ## umpire
Query by date
Assuming we have data on multiple years, we could grab 2013 data in the following manner. Also, before we perform an SQL query, we can check the using
explain
:
atbat13 <- filter(tbl(db, 'atbat'), date >= '2013_01_01' & date <= '2014_01_01') explain(atbat13)
## <SQL> ## SELECT 'pitcher', 'batter', 'num', 'b', 's', 'o', 'start_tfs', 'start_tfs_zulu', 'stand', 'b_height', 'p_throws', 'atbat_des', 'atbat_des_es', 'event', 'score', 'home_team_runs', 'away_team_runs', 'url', 'inning_side', 'inning', 'next_', 'event2', 'event3', 'batter_name', 'pitcher_name', 'event4', 'gameday_link', 'date' ## FROM 'atbat' ## WHERE 'date' >= '2013_01_01' AND 'date' <= '2014_01_01' ## ## <PLAN> ## selectid order from detail ## 1 0 0 0 SCAN TABLE atbat (~111111 rows)
Notice the SCAN TABLE atbat
part under . This means that SQL will actually check each record to see if it meets our date criteria when the query is performed. We can do much better than this if we know a little about indicies and query planning. If we create an index on the date column, SQL no longer has to check every record and can perform a binary search instead. Let’s use
dbSendQuery
again, this time to create an index on the date column.
dbSendQuery(db$con, 'CREATE INDEX date_idx ON atbat(date)')
## <SQLiteResult: DBI RES (7835, 0, 18)>
explain(atbat13)
## <SQL> ## SELECT 'pitcher', 'batter', 'num', 'b', 's', 'o', 'start_tfs', 'start_tfs_zulu', 'stand', 'b_height', 'p_throws', 'atbat_des', 'atbat_des_es', 'event', 'score', 'home_team_runs', 'away_team_runs', 'url', 'inning_side', 'inning', 'next_', 'event2', 'event3', 'batter_name', 'pitcher_name', 'event4', 'gameday_link', 'date' ## FROM 'atbat' ## WHERE 'date' >= '2013_01_01' AND 'date' <= '2014_01_01' ## ## <PLAN> ## selectid order from ## 1 0 0 0 ## detail ## 1 SEARCH TABLE atbat USING INDEX date_idx (date>? AND date<?) (~62500 rows)
Now that we have an index on the date column, the section now says
SEARCH TABLE atbat USING INDEX
and as a result, our query will be faster. Note that indices only have to be created once, so now that we have date_idx
, we can use it for other queries.
Subsetting by player
In addition to dates, we often restrict our analysis to the player level. Thankfully, pitchRx
adds player names to the atbat table so we can subset the atbats by name rather than ID. Suppose I’m interested in gathering every pitch thrown by Clayton Kershaw in 2013. I can simply apply another filter
to atbat13
:
kershaw13 <- filter(atbat13, pitcher_name == 'Clayton Kershaw')
Since we are subsetting by pitcher name, it’s a good idea to create another index on the pitcher_name column.
dbSendQuery(db$con, 'CREATE INDEX pitcher_idx ON atbat(pitcher_name)')
## <SQLiteResult: DBI RES (7835, 0, 20)>
explain(kershaw13)
## <SQL> ## SELECT 'pitcher', 'batter', 'num', 'b', 's', 'o', 'start_tfs', 'start_tfs_zulu', 'stand', 'b_height', 'p_throws', 'atbat_des', 'atbat_des_es', 'event', 'score', 'home_team_runs', 'away_team_runs', 'url', 'inning_side', 'inning', 'next_', 'event2', 'event3', 'batter_name', 'pitcher_name', 'event4', 'gameday_link', 'date' ## FROM 'atbat' ## WHERE 'date' >= '2013_01_01' AND 'date' <= '2014_01_01' AND 'pitcher_name' = 'Clayton Kershaw' ## ## <PLAN> ## selectid order from ## 1 0 0 0 ## detail ## 1 SEARCH TABLE atbat USING INDEX pitcher_idx (pitcher_name=?) (~2 rows)
Joining atbat with pitch
Remember that the pitch table is the one that contains the valuable PITCHf/x information so we usually want to join this data with the relevant atbat information.
pitches <- tbl(db, 'pitch') k13 <- inner_join(pitches, kershaw13, by = c('num', 'gameday_link'))
Note that I have included both the “num” and “gameday_link” as columns to join by. The “gameday_link” column gives us an indication of which game a particular pitch/atbat came from and the “num” column keeps track of the atbat within a game. Together, these columns create a unique key which allows us to join these tables. Note that an inner join will drop records that don’t match on the unique key. In other words, this join will drop pitches where Kershaw was not the pitcher.
For reasons similar to before, this query will be much, much faster if we create an index for the pitch table corresponding to the unique key.
dbSendQuery(db$con, 'CREATE INDEX pitch_idx ON pitch(gameday_link, num)')
## <SQLiteResult: DBI RES (7835, 0, 24)>
Finally, let’s collect
the query and be amazed at how fast we get a result.
kershaw <- collect(k13)
Chaining operations in dplyr
dplyr
‘s %.%
operator allows ’piping’ of objects through a series of operations. This %.%
guy can save us a lot of typing and produces much cleaner code when chaining many operations together. For example, I could grab all of Verlander’s pitches in 2013 like so:
atbats <- tbl(db, 'atbat') %.% filter(date >= '2013_01_01' & date <= '2014_01_01') %.% filter(pitcher_name == 'Justin Verlander') pitches <- tbl(db, 'pitch') verlander <- collect(inner_join(pitches, atbats, by = c('num', 'gameday_link')))
Animating pitches
I know you’re dying for some graphics at this point, so let’s use the animateFX
function from the pitchRx
package to animate “average” trajectories for each one of Kershaw’s pitch types. I recommend using the saveHTML
function from the animation package so you can view animations in a web browser.
library(pitchRx) library(animation) saveHTML( animateFX(kershaw, avg.by = 'pitch_types', layer = list(theme_bw(), facet_grid(.~stand))), img.name = "kershaw" )
And now Verlander’s:
saveHTML( animateFX(verlander, avg.by = 'pitch_types', layer = list(theme_bw(), facet_grid(. ~ stand))), img.name = "verlander" )
Great info and explanation. I was struggling with this. The indexing sped up my queries immensely. Thank you for sharing.
Hi,
Is dbSendQuery still available in 3.8.6? I get Error: could not find function “dbSendQuery”.
Thanks,
Jon
dbSendQuery() is in the DBI package, so load the library first: library(DBI)
Thank you. That was an easy one. I had to scrape one year at a time but I finally have all of the data and was able to animate pitches for John Lackey’s pitches this year. Very cool.
I have hit a brick wall trying to migrate the R code into R Markdown. Have you had any luck creating animations in R Markdown?
Cheers,
Jon
Have a look at knitr’s ‘animate’ chunk option — http://yihui.name/knitr/options/
I have an example here — https://raw.githubusercontent.com/cpsievert/pitchRx/gh-pages/index.Rmd
I noticed that several libraries disappeared from my work laptop today so I had to reinstall the pitchRx library. When I ran through my code to animate pitches for Jake Arieta, I noticed the strike zone was missing. I am unable to reproduce my animation that I created back in December 2015 which is located here, http://www.vizitdata.com/Cubs/index.html. I do get the animation but the black box outlining the strike zone are missing. Any ideas?