
Starting and updating a PITCHf/x database with pitchRx and dplyr

First of all, let me say that I am truly honored Jim Albert invited me to join Ben Baumer in contributing to this blog. I hope to share some useful knowledge I’ve gained in relation to baseball data and R over the past couple of years. Most of my experience comes from developing the R package pitchRx, which provides tools for collecting, storing, and visualizing PITCHf/x data. If you’re interested, my pitchRx paper is available online via the RJournal, but you can also catch the cliff notes here.

Setting up a PITCHf/x Database

For my first post, I want to clearly explain how to get up and running with a PITCHf/x database. There are many database options, but most are difficult and/or time-consuming to set up. Luckily, the dplyr package makes it incredibly simple to create a SQLite database.

library(dplyr)
my_db <- src_sqlite("pitchRx.sqlite3", create = TRUE)

This will create a file named ‘pitchRx.sqlite3’ in the current directory. This file is the SQLite database. There are many SQLite interfaces, but we will do everything from within the R session. Most importantly, the my_db object contains a connection that allows us to interact with the database from within R! Before writing any data, make sure the database is empty (i.e., it has no tables).

my_db
## src:  sqlite 3.7.17 [pitchRx.sqlite3]
## tbls:
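
Equivalently, you can list the tables directly through the connection using the DBI package (which dplyr uses under the hood). A quick sketch:

library(DBI)
dbListTables(my_db$con)  # an empty database returns character(0)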

Using the scrape function in the pitchRx package, we can collect and store data in my_db using the database connection (my_db$con). This approach also works with a MySQL database connection (see the sketch after the snippets below).

library(pitchRx)
scrape(start = "2008-01-01", end = Sys.Date() - 1, connect = my_db$con)

This snippet of code will download, manipulate, and store all available PITCHf/x data from the start of the 2008 season up to yesterday. Here is an example of just one of the thousands of files it will query. This is a lot of data (expect it to take a few hours to run), but your patience will pay off: you will soon have data on every pitch, atbat, and other “play” over the past six years. One may also obtain supplementary data on games, players, coaches, umpires, and even locations of balls hit into play from other files using the suffix argument (here are some example files):

files <- c("inning/inning_hit.xml", "miniscoreboard.xml", "players.xml")
scrape(start = "2008-01-01", end = Sys.Date() - 1, suffix = files, connect = my_db$con)
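
As mentioned above, the same workflow applies to a MySQL database: create the source with dplyr’s src_mysql function and pass its connection to scrape. This is only a sketch; the database name, host, user, and password below are placeholders you would replace with your own:

# Hypothetical MySQL connection; replace the credentials with your own
my_mysql_db <- src_mysql(dbname = "pitchfx", host = "localhost",
                         user = "username", password = "password")
scrape(start = "2008-01-01", end = Sys.Date() - 1, connect = my_mysql_db$con)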

Once this code is done running, you should have these tables in your database:

my_db
## src:  sqlite 3.7.17 [pitchRx.sqlite3]
## tbls: action, atbat, coach, game, hip, media, pitch, player, po, runner, umpire

Note that the ‘pitch’ table contains PITCHf/x data. The other tables can easily be linked to the ‘pitch’ table (or to each other) depending on the needs of the analysis (see the sketch below). I recommend this post for an introduction to retrieving (and manipulating) data from your database within R.
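
For instance, here is a minimal sketch of joining pitches to their corresponding atbats with dplyr, assuming the pitchRx convention that both tables share the num and gameday_link columns:

# Lazily reference the two tables, then join on their shared keys
pitch <- tbl(my_db, "pitch")
atbat <- tbl(my_db, "atbat")
pitches <- inner_join(pitch, atbat, by = c("num", "gameday_link"))

Since tbl() is lazy, the join is translated to SQL and no data is pulled into R until you collect() the result.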

Updating your PITCHf/x Database

NOTE: The section below is a bit outdated since pitchRx can now perform automatic updates

For those of you who want to keep your database up-to-date throughout the season, one option is to use appropriate start and end values. However, this can lead to unintended results when scraping data on a day when some games have yet to be played. I see at least two immediate work-arounds: (1) only scrape data for a particular day after all of its games have been played, or (2) remove duplicate rows from your tables after updating (a sketch of this appears below).
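
For option (2), one approach is to send plain SQL over the same connection to rebuild each table without duplicates. This is only a sketch using the ‘pitch’ table (repeat for the others), and it assumes a version of DBI recent enough to provide dbExecute():

library(DBI)
# Rebuild the table keeping only distinct rows, then swap it into place
dbExecute(my_db$con, "CREATE TABLE pitch_dedup AS SELECT DISTINCT * FROM pitch")
dbExecute(my_db$con, "DROP TABLE pitch")
dbExecute(my_db$con, "ALTER TABLE pitch_dedup RENAME TO pitch")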

No matter which route you take, the first thing you’ll want is the date of the most recent data point in your database. This is easy if you have the ‘game’ table from the miniscoreboard.xml files, since it has an ‘original_date’ column that can be used to find the most recent date.

# Pull the original_date column into R, then find the latest date
dates <- collect(select(tbl(my_db, "game"), original_date))
max.date <- max(as.Date(dates$original_date, format = "%Y/%m/%d"), na.rm = TRUE)

If you’re taking option (1), you can set start to be max.date + 1 and end to be Sys.Date() without worrying about data being duplicated in the database.

# Append new PITCHf/x data
scrape(start = max.date + 1, end = Sys.Date(), connect = my_db$con)
# Append other data
scrape(start = max.date + 1, end = Sys.Date(), suffix = files, connect = my_db$con)