Starting and updating a PITCHf/x database with pitchRx and dplyr

First of all, let me say that I am truly honored Jim Albert invited me to join Ben Baumer in contributing to this blog. I hope to share some useful knowledge I’ve gained in relation to baseball data and R over the past couple years. Most of my experience comes from developing the R package pitchRx which provides tools for collecting, storing and visualizing PITCHf/x data. If your interested, my pitchRx paper is available online via the RJournal, but you also can catch the cliff notes here.

Setting up a PITCHf/x Database

For my first post, I want to clearly explain how to get up and running with a PITCHf/x database. There are a lot of database options, but most are difficult and/or time consuming to setup. Luckily, the dplyr package makes it incredibly simple to create a SQLite database.

library(dplyr)
my_db <- src_sqlite("pitchRx.sqlite3", create = TRUE)

This will create a file in the current directory named ‘pitchRx.sqlite3’. This is the SQLite database. There are many SQLite interfaces, but we will do everything from within the R session. Most importantly, the my_db object contains a connection that allows us to interact with the database from within R! Before we write data to this database, make sure the database is empty (i.e., it has no tables).

my_db
## src:  sqlite 3.7.17 [pitchRx.sqlite3]
## tbls:

Using the scrape function in the pitchRx package, we can collect and store data in my_db using the database connection (my_db$con). This approach also works with a MySQL database connection.

library(pitchRx)
scrape(start = "2008-01-01", end = Sys.Date() - 1, connect = my_db$con)

This snippet of code will download, manipulate, and store all available PITCHf/x starting with the 2008 season and all the way up to yesterday. Here is an example of just one of thousands of files it will query. This is a lot of data – it takes a few hours to run – but your patience will pay off as you will soon have data on every pitch, atbat, and other “plays” over the past six years. One may also obtain supplementary data on games, players, coaches, umpires, and even locations of balls hit into play from other files using the suffix argument (here are some example files):

files <- c("inning/inning_hit.xml", "miniscoreboard.xml", "players.xml")
scrape(start = "2008-01-01", end = Sys.Date() - 1, suffix = files, connect = my_db$con)

Once this code is done running, you should have these tables in your database:

my_db
## src:  sqlite 3.7.17 [pitchRx.sqlite3]
## tbls: action, atbat, coach, game, hip, media, pitch, player, po, runner, umpire

Note that the ‘pitch’ table contains PITCHf/x data. The other tables can easily be linked to the ‘pitch’ table (or to each other) depending on the needs of the analysis. I recommend this post for an introduction to retrieving (and manipulating) data from your database within R.

Updating your PITCHf/x Database

NOTE: The section below is a bit outdated since pitchRx can now perform automatic updates

For those of you that want to keep your database up-to-date throughout the season, one option is to use appropriate start and end values. However, this can lead to unintended results when scraping data on a particular day where game(s) have yet to be played. I see at least two immediate work-arounds: (1) only scrape data for a particular day after all the games are played (2) remove duplicate rows from your tables after updating.

No matter what route you take, the first thing you’ll want is the date of the most recent data point in your database. Doing this is easy if you have the ‘game’ table from the miniscoreboard.xml files since it has an ‘original_date’ column which can be used to find the most recent date.

dates <- collect(select(tbl(my_db, "game"), original_date))
max.date <- max(as.Date(dates[!is.na(dates)], "%Y/%m/%d"))

If you’re taking option (1), we can set start to be max.date + 1 and end to be Sys.Date() without worrying about data being duplicated in the database.

# Append new PITCHf/x data
scrape(start = max.date + 1, end = Sys.Date(), connect = my_db$con)
# Append other data
scrape(start = max.date + 1, end = Sys.Date(), suffix = files, connect = my_db$con)

22 responses

  1. Hey Carson,

    Thanks for contributing to this blog! I’ve used XML2R to set up a pitch f/x database similar to how pitchRx does it, it worked very well. One thing I’ve had trouble doing is matching hit locations to the atbat table – there are duplicates when there are errors so it’s tough to match things up exactly. Have you looked into this at all?

    Thanks again for your great work!

    Nick Wheatley-Schaller

    1. Hi Nick,

      Wow! That was a quick comment 😉

      Glad you find XML2R and pitchRx useful. I haven’t looked into joining the hit locations with the atbat table, but it’s on my TODO list. Could you explain in greater detail what you mean by “there are duplicates when there are errors”? It would be great if could post this as an issue on GitHub with a reproducible example — http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example

  2. Carson thanks for doing a tutorial like this! After some trial and error (and figuring out how to install packages and such) I got the first couples lines of code to work. Big accomplishment, I know. Before using the scrape function to download a couple hours worth of data, should I be concerned about it slowing my computer down a lot? (I use a relatively fast/uncluttered Macbook Air if that helps) Or are these txt files light on memory and such. Thanks!

    1. Hmm. The reason for using the database connection is to avoid creating a bunch of txt files. Are they txt files or csv files? It could be that your database connection expired at some point. If connection is expired, scrape will currently fail to write to the database and write csv files instead…

      1. Sorry, poor word choice on my part, I don’t have txt files of the data. I haven’t actually scraped any data yet so my database is empty. I was just asking generally, how big (MB wise) a database of this size is and if I should be concerned of it significantly slowing down my laptop.

      2. Oh, OK. The database that I have is about 4.5 GB. Saving a file of this size to your disk shouldn’t slow down your computer. However, your computer would slow down if you try to read all that data into virtual memory. The main benefit of using a database is that we can query subsets of the data without bringing all of it into virtual memory.

      3. Gotcha, I’ll spend some time tomorrow trying understand the rest of the article. Keep it up and I’ll try to follow along.

  3. Carson, I was going to download all the data tonight, but for some reason I cannot get the pitchRx package to work. When I type in: library(pitchRx), I get the following error message:

    Error in dyn.load(file, DLLpath = DLLpath, …) :
    unable to load shared object ‘/Library/Frameworks/R.framework/Versions/3.1/Resources/library/rgl/libs/rgl.so’:
    dlopen(/Library/Frameworks/R.framework/Versions/3.1/Resources/library/rgl/libs/rgl.so, 6): Library not loaded: /opt/X11/lib/libGLU.1.dylib
    Referenced from: /Library/Frameworks/R.framework/Versions/3.1/Resources/library/rgl/libs/rgl.so
    Reason: image not found
    Error: package or namespace load failed for ‘pitchRx’

    Any idea what the issue is?

    1. I haven’t seen that error before and it appears to be difficult for me to diagnose since it seems to be due to your system configuration. Here are a couple threads that might be helpful. Let me know how it goes!

      https://github.com/hadley/devtools/issues/373

      http://r.789695.n4.nabble.com/Problem-with-dyn-load-in-R-2-13-0-td3447872.html

    2. Hi psingman, I have the same problem with you while installing package. It almost drives me crazy. Could you tell me how to figure it in detailed please. Thanks very much!

      1. I read into this issue a bit more. Its seems like installing the rgl package from source might fix the issue —

        install.packages(“rgl”, type = “source”)

        If that doesn’t work, I suggest following the advice in these threads —

        http://stackoverflow.com/questions/11365391/trouble-while-loading-r-package-rgl
        http://stackoverflow.com/questions/9878693/error-in-loading-rgl-package-with-mac-os-x

    3. You’ve likely solved this by now, but in case anyone else has this problem and comes across this thread via Google like I did, here’s a solution that worked for me: https://www.mail-archive.com/freesurfer@nmr.mgh.harvard.edu/msg28885.html

  4. Hi Carson,
    I appear to be doing something wrong

    my_db <- src_sqlite("pitchRx.sqlite3", create = TRUE)
    scrape(start = "2008-01-01", end = Sys.Date(), connect = my_db$con)
    ….
    http://gd2.mlb.com/components/game/mlb/year_2008/month_03/day_03/gid_2008_03_03_chnmlb_seamlb_1/inning/inning_all.xml
    Error in names(value) <- sub("\\.", "_", names(value)) :
    #attempt to set an attribute on NULL

    other attached packages:
    [1] rCharts_0.4.2 googleVis_0.5.1 zoo_1.7-11 data.table_1.9.2
    [5] shiny_0.9.1.9000 pitchRx_1.3 ggplot2_0.9.3.1 stringr_0.6.2
    [9] RSQLite.extfuns_0.0.1 RSQLite_0.11.4 DBI_0.2-7 dplyr_0.1.3
    [13] Rcpp_0.11.1 devtools_1.5

    1. Hmm. Make sure my_db is empty (in a clean session) before you use scrape(). If that doesn’t work, let’s continue the discussion on GitHub. This is what I see:

      > my_db
      src: sqlite 3.7.17 [pitchRx.sqlite3]
      tbls:
      > scrape(start = “2008-01-01”, end = Sys.Date(), connect = my_db$con)
      If file names don’t print right away, please be patient.
      http://gd2.mlb.com/components/game/mlb/year_2008/month_02/day_28/gid_2008_02_28_bocbbc_bosmlb_1/inning/inning_all.xml
      http://gd2.mlb.com/components/game/mlb/year_2008/month_02/day_29/gid_2008_02_29_detmlb_tormlb_1/inning/inning_all.xml

  5. Thanks for getting back
    I created a new sqlite3 db, restarted R session and now it appears to be chugging along nicely
    Presumably as and when I want to add new data, I just repeat code dropping the create=True argument and changing dates. And that once I have created the indices they do not need to be redone each time? Although that is trivial to redo, of course

    1. Great! create=T is deceptive because it will actually connect to an existing db if the filename matches (and won’t create a new one). I don’t think you have to recreate indices after an update.

  6. I run into this error:

    > library(pitchRx)
    > library(RSQLite)
    > library(dplyr)
    >
    > my_db <- src_sqlite("pitchRx.sqlite3", create = TRUE)
    Loading required package: RSQLite.extfuns
    Error: RSQLite.extfuns package required to effectively use sqlite db
    In addition: Warning message:
    In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, :
    there is no package called ‘RSQLite.extfuns’

    1. Try installing this package —

      install.packages(“RSQLite.extfuns”)

      1. Ha! Ok I was silly for not trying that on my own. It worked.

      2. Thanks.

  7. Hi,

    I stumbled upon this blog yesterday and it is excellent! I am just getting started with my database. I successfully loaded tbls: action, atbat, game, media, pitch, po, runner.

    When I try to bring in other files <- c("inning/inning_hit.xml", "miniscoreboard.xml", "players.xml") it gets stuck on Collecting garbage.

    I'd like to have all of the action, atbat, coach, game, hip, media, pitch, player, po, runner, umpire.

    Any thoughts on why scrape does not return hip,player and umpire tables?

    Thanks,
    Jon

    1. Hmm, I would try again in a new R session. If you still have problems, please share your code and post an issue here — https://github.com/cpsievert/pitchRx/issues/new

Leave a comment