Automatic PITCHf/x database updates with pitchRx

This post is a follow up of sorts to modifying and querying a PITCHf/x database. The query portion of that post is still very relevant, but I didn't want to impose the hack to obtain dates upon users, so I updated pitchRx to explicitly include dates by default. I also added some functionality to make database updates painless.

Dates are now supported!

Since dates are so vital to PITCHf/x analysis in practice, I decided to officially add support for a 'date' column in the 'atbat' table. Just to demonstrate, let's create a new database, add one day's worth of data to it, and print a few records from the date column:

library(dplyr)
library(pitchRx) 
# Make sure you have pitchRx 1.5 -- as 1.4 has a couple bugs...
# You can `devtools::install_github("cpsievert/pitchRx")` to get newest version
packageVersion("pitchRx") 
## [1] '1.5'
db <- src_sqlite("new-db.sqlite3", create = TRUE)
scrape(start = "2014-06-01", end = "2014-06-01", connect = db$con)
select(tbl(db, "atbat"), date)
## Source: sqlite 3.7.17 [new-db.sqlite3]
## From: atbat [1,096 x 1]
## 
##          date
## 1  2014_06_01
## 2  2014_06_01
## 3  2014_06_01
## 4  2014_06_01
## 5  2014_06_01
## 6  2014_06_01
## 7  2014_06_01
## 8  2014_06_01
## 9  2014_06_01
## 10 2014_06_01
## ..        ...

If you already have a PITCHf/x database without a date column, scrape now knows to create this date column for you (using a process similar to what I describe here).

Introducing update_db

In the past, my recommended solution for updating a PITCHf/x database was admittingly ambiguous/painful, so I've decided to automate that job. The new update_db function is my attempt at doing so. Give this function a database connection and it knows how to append new data to all your existing tables. It does this by finding the most recent game from the 'gameday_link' column in the 'atbat' table, then grabs data for all games after that date. For this reason, if your tables conflict in terms of the most recent date, do not use this function and use scrape to perform an update instead. Now, by default, update_db includes games up to and including yesterday. The reason is that files are updated in real-time on MLBAM's servers which can lead to missing/duplicate records if you aren't incredibly careful about adding today's data to your database. Of course, you can still get your hands on today's data, but don't say I didn't warn you if you append it to your database! 🙂

dat <- scrape(start = Sys.Date(), end = Sys.Date())

Automatic updates!

I'm a bit embarassed it took me so long to learn about cron jobs considering how easy it is to create scheduled jobs to run R scripts on a unix/linux based machine. Thanks to cron jobs, I no longer have to think about updating my database – it updates by itself. To make this happen, I first save this script under the /Users/cpsievert/pitchfx/ directory. This directory also has my PITCHf/x database saved as pitchRx.sqlite3. Next, I open my terminal and enter

sudo crontab -e

After entering my password, this opens my default text editor where I then have:

0 12 * * * Rscript /Users/cpsievert/pitchfx/update.R

This tells my machine to update my database everyday at noon. If you use Windows, you won’t be able to use cron, but you can still schedule an automated task (see Jake’s comment below).

Advertisement

8 responses

  1. I’m a Windows user and I was able to set up an automated process with Task Scheduler. First, modify Carson’s update.R script to point to the location of your database file. Then on Windows 7 do the following,

    1. Start Menu > All Programs > Accessories > System Tools > Task Scheduler
    2. Click *Create Basic Task*
    3. Choose task name, description, and trigger schedule.
    4. Action *start a program*
    5. In the program/script box, put the file path to your R.exe program. File path in quotes. (eg. “C:\Program Files\R\R-2.13.1\bin\R.exe”)
    6. In the optional arguments box add,
    CMD BATCH –vanilla –slave “file path to update.R script”

    vanilla and slave are optional, but from what I read they look like good things to include.

    Done.

  2. Carson, thanks for providing pitchRx! I’m trying to download the database and was able to get 2008-2010, but starting with 2011 am encountering errors. I am doing:
    update_db(my_db$con, end = “2011/2/24”)
    and getting:
    http://gd2.mlb.com/components/game/mlb/year_2011/month_02/day_24/gid_2011_02_24_fanbbc_phimlb_1/inning/inning_hit.xml
    Successfully copied hip table to database connection.
    Collecting garbage
    libidn: warning: libiconv not installed, cannot convert data to UTF-8
    Error in function (type, msg, asError = TRUE) :
    Could not resolve host: Þ; No data record of requested type
    In addition: Warning messages:
    1: In export(connect, name = i, value = tables[[i]], template = fields[[i]]) :
    The value data.frame has variables that are not in the corresponding table. Writing data.frame to a new table instead.
    2: In export(connect, name = i, value = tables[[i]], template = fields[[i]]) :
    The value data.frame has variables that are not in the corresponding table. Writing data.frame to a new table instead.

    If I do a scrape I only see the could not resolve host message, but I suspect iconv is the real problem. I am running Windows Vista and using the http://www.zlatkovic.com/libxml.en.html Sxslt dependencies as part of installing openWAR.

    It seems there is a problem with iconv in some other contexts:
    http://www.perzl.org/aix/index.php?n=FAQs.FAQs#wget-warning-msg
    http://smartos.org/bugview/OS-2658
    http://stackoverflow.com/questions/18664740/libiconv-error-when-using-twitter-package-on-r
    http://stackoverflow.com/questions/9922562/how-to-resolve-curl-error-7-couldnt-connect-to-host

    Can you offer any suggestions on how to debug this? For example, is there any way to get the cURL handle pitchRx is using?

    1. I recently switched from RCurl to httr in hopes of avoiding some of these ambiguous errors. I have a feeling if you grab the most recent version of XML2R off of github it may fix your issue.

  3. Thanks for the suggestion. I grabbed the most recent github XML2R and am getting a different error:


    http://gd2.mlb.com/components/game/mlb/year_2011/month_02/day_24/gid_2011_02_24_fanbbc_phimlb_1/inning/inning_hit.xml
    Successfully copied hip table to database connection.
    Collecting garbage
    Error in names(l) <- url.count :
    'names' attribute [1] must be the same length as the vector [0]
    In addition: Warning messages:
    1: In export(connect, name = i, value = tables[[i]], template = fields[[i]]) :
    The value data.frame has variables that are not in the corresponding table. Writing data.frame to a new table instead.
    2: In export(connect, name = i, value = tables[[i]], template = fields[[i]]) :
    The value data.frame has variables that are not in the corresponding table. Writing data.frame to a new table instead.
    3: In urlsToDocs(urls, async = async, quiet = quiet) :
    No content found. Please double check your urls.

    What mystifies me is why everything seemed to work until I got to the 2011 data.

    P.S. I think I fixed my iconv problem. There was an old iconv.dll in my PATH.

    1. If it helps I can reproduce the error inside the RStudio browser. The problem is occurring in:
      listsToObs <- function(l, urls, append.value=TRUE, as.equiv=TRUE, url.map=TRUE) {
      #add a prefix to the names of the list to help content to a url
      url.count <- paste0("url", seq_len(length(urls)))
      url_map <- cbind(url_key=url.count, url=urls)
      browser()
      names(l) l
      list()
      Warning message:
      In urlsToDocs(urls, async = async, quiet = quiet) :
      No content found. Please double check your urls.

      1. Would you mind posting your problem on GitHub?

        https://github.com/cpsievert/pitchRx/issues

        I’m not sure if this will help, but you should use “2011-02-24″ — not “2011/2/24″.

        Also, please report what you see when you run —

        db
        gidz <- sort(DBI::dbGetQuery(db$con, "SELECT DISTINCT gameday_link FROM atbat")[,1])
        head(gidz)
        tail(gidz)

      2. Posted on github as https://github.com/cpsievert/pitchRx/issues/18
        along with some additional information (as requested and I also added sessionInfo()).
        Thanks for looking into this!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: