Starting and updating a PITCHf/x database with pitchRx and dplyr

By cpsievert on March 24, 2014

First of all, let me say that I am truly honored Jim Albert invited me to join Ben Baumer in contributing to this blog. I hope to share some useful knowledge I’ve gained in relation to baseball data and R over the past couple years. Most of my experience comes from developing the R package pitchRx which provides tools for collecting, storing and visualizing PITCHf/x data. If your interested, my pitchRx paper is available online via the RJournal, but you also can catch the cliff notes here.

Setting up a PITCHf/x Database

For my first post, I want to clearly explain how to get up and running with a PITCHf/x database. There are a lot of database options, but most are difficult and/or time consuming to setup. Luckily, the dplyr package makes it incredibly simple to create a SQLite database.

library(dplyr)
my_db <- src_sqlite("pitchRx.sqlite3", create = TRUE)

This will create a file in the current directory named ‘pitchRx.sqlite3’. This is the SQLite database. There are many SQLite interfaces, but we will do everything from within the R session. Most importantly, the my_db object contains a connection that allows us to interact with the database from within R! Before we write data to this database, make sure the database is empty (i.e., it has no tables).

my_db
## src:  sqlite 3.7.17 [pitchRx.sqlite3]
## tbls:

Using the scrape function in the pitchRx package, we can collect and store data in my_db using the database connection (my_db$con). This approach also works with a MySQL database connection.

library(pitchRx)
scrape(start = "2008-01-01", end = Sys.Date() - 1, connect = my_db$con)

This snippet of code will download, manipulate, and store all available PITCHf/x starting with the 2008 season and all the way up to yesterday. Here is an example of just one of thousands of files it will query. This is a lot of data – it takes a few hours to run – but your patience will pay off as you will soon have data on every pitch, atbat, and other “plays” over the past six years. One may also obtain supplementary data on games, players, coaches, umpires, and even locations of balls hit into play from other files using the suffix argument (here are some example files):

files <- c("inning/inning_hit.xml", "miniscoreboard.xml", "players.xml")
scrape(start = "2008-01-01", end = Sys.Date() - 1, suffix = files, connect = my_db$con)

Once this code is done running, you should have these tables in your database:

my_db
## src:  sqlite 3.7.17 [pitchRx.sqlite3]
## tbls: action, atbat, coach, game, hip, media, pitch, player, po, runner, umpire

Note that the ‘pitch’ table contains PITCHf/x data. The other tables can easily be linked to the ‘pitch’ table (or to each other) depending on the needs of the analysis. I recommend this post for an introduction to retrieving (and manipulating) data from your database within R.

Updating your PITCHf/x Database

NOTE: The section below is a bit outdated since pitchRx can now perform automatic updates

For those of you that want to keep your database up-to-date throughout the season, one option is to use appropriate start and end values. However, this can lead to unintended results when scraping data on a particular day where game(s) have yet to be played. I see at least two immediate work-arounds: (1) only scrape data for a particular day after all the games are played (2) remove duplicate rows from your tables after updating.

No matter what route you take, the first thing you’ll want is the date of the most recent data point in your database. Doing this is easy if you have the ‘game’ table from the miniscoreboard.xml files since it has an ‘original_date’ column which can be used to find the most recent date.

dates <- collect(select(tbl(my_db, "game"), original_date))
max.date <- max(as.Date(dates[!is.na(dates)], "%Y/%m/%d"))

If you’re taking option (1), we can set start to be max.date + 1 and end to be Sys.Date() without worrying about data being duplicated in the database.

# Append new PITCHf/x data
scrape(start = max.date + 1, end = Sys.Date(), connect = my_db$con)
# Append other data
scrape(start = max.date + 1, end = Sys.Date(), suffix = files, connect = my_db$con)

Posted in: pitchRx, R | Tagged: dplyr, PITCHf/x, pitchRx, R, SQLite

22 responses

neastws March 24, 2014 at 3:43 am | Reply

Hey Carson,

Thanks for contributing to this blog! I’ve used XML2R to set up a pitch f/x database similar to how pitchRx does it, it worked very well. One thing I’ve had trouble doing is matching hit locations to the atbat table – there are duplicates when there are errors so it’s tough to match things up exactly. Have you looked into this at all?

Thanks again for your great work!

Nick Wheatley-Schaller
1. cpsievert March 24, 2014 at 3:51 am | Reply
  
  Hi Nick,
  
  Wow! That was a quick comment 😉
  
  Glad you find XML2R and pitchRx useful. I haven’t looked into joining the hit locations with the atbat table, but it’s on my TODO list. Could you explain in greater detail what you mean by “there are duplicates when there are errors”? It would be great if could post this as an issue on GitHub with a reproducible example — http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
psingman March 28, 2014 at 4:38 am | Reply

Carson thanks for doing a tutorial like this! After some trial and error (and figuring out how to install packages and such) I got the first couples lines of code to work. Big accomplishment, I know. Before using the scrape function to download a couple hours worth of data, should I be concerned about it slowing my computer down a lot? (I use a relatively fast/uncluttered Macbook Air if that helps) Or are these txt files light on memory and such. Thanks!
1. cpsievert March 28, 2014 at 6:02 am | Reply
  
  Hmm. The reason for using the database connection is to avoid creating a bunch of txt files. Are they txt files or csv files? It could be that your database connection expired at some point. If connection is expired, scrape will currently fail to write to the database and write csv files instead…
  1. psingman March 28, 2014 at 6:58 am
    
    Sorry, poor word choice on my part, I don’t have txt files of the data. I haven’t actually scraped any data yet so my database is empty. I was just asking generally, how big (MB wise) a database of this size is and if I should be concerned of it significantly slowing down my laptop.
  2. cpsievert March 28, 2014 at 7:09 am
    
    Oh, OK. The database that I have is about 4.5 GB. Saving a file of this size to your disk shouldn’t slow down your computer. However, your computer would slow down if you try to read all that data into virtual memory. The main benefit of using a database is that we can query subsets of the data without bringing all of it into virtual memory.
  3. psingman March 28, 2014 at 8:00 am
    
    Gotcha, I’ll spend some time tomorrow trying understand the rest of the article. Keep it up and I’ll try to follow along.
psingman April 15, 2014 at 12:11 am | Reply

Carson, I was going to download all the data tonight, but for some reason I cannot get the pitchRx package to work. When I type in: library(pitchRx), I get the following error message:

Error in dyn.load(file, DLLpath = DLLpath, …) :
unable to load shared object ‘/Library/Frameworks/R.framework/Versions/3.1/Resources/library/rgl/libs/rgl.so’:
dlopen(/Library/Frameworks/R.framework/Versions/3.1/Resources/library/rgl/libs/rgl.so, 6): Library not loaded: /opt/X11/lib/libGLU.1.dylib
Referenced from: /Library/Frameworks/R.framework/Versions/3.1/Resources/library/rgl/libs/rgl.so
Reason: image not found
Error: package or namespace load failed for ‘pitchRx’

Any idea what the issue is?
1. cpsievert April 15, 2014 at 12:31 am | Reply
  
  I haven’t seen that error before and it appears to be difficult for me to diagnose since it seems to be due to your system configuration. Here are a couple threads that might be helpful. Let me know how it goes!
  
  https://github.com/hadley/devtools/issues/373
  
  http://r.789695.n4.nabble.com/Problem-with-dyn-load-in-R-2-13-0-td3447872.html
2. Sherry May 2, 2014 at 3:30 am | Reply
  
  Hi psingman, I have the same problem with you while installing package. It almost drives me crazy. Could you tell me how to figure it in detailed please. Thanks very much!
  1. cpsievert May 2, 2014 at 4:28 am
    
    I read into this issue a bit more. Its seems like installing the rgl package from source might fix the issue —
    
    install.packages(“rgl”, type = “source”)
    
    If that doesn’t work, I suggest following the advice in these threads —
    
    http://stackoverflow.com/questions/11365391/trouble-while-loading-r-package-rgl
    http://stackoverflow.com/questions/9878693/error-in-loading-rgl-package-with-mac-os-x
3. nosajwp May 27, 2014 at 4:25 am | Reply
  
  You’ve likely solved this by now, but in case anyone else has this problem and comes across this thread via Google like I did, here’s a solution that worked for me: https://www.mail-archive.com/freesurfer@nmr.mgh.harvard.edu/msg28885.html
Andrew Clark (@pssGuy) April 15, 2014 at 8:56 pm | Reply

Hi Carson,
I appear to be doing something wrong

my_db <- src_sqlite("pitchRx.sqlite3", create = TRUE)
scrape(start = "2008-01-01", end = Sys.Date(), connect = my_db$con)
….
http://gd2.mlb.com/components/game/mlb/year_2008/month_03/day_03/gid_2008_03_03_chnmlb_seamlb_1/inning/inning_all.xml
Error in names(value) <- sub("\\.", "_", names(value)) :
#attempt to set an attribute on NULL

other attached packages:
[1] rCharts_0.4.2 googleVis_0.5.1 zoo_1.7-11 data.table_1.9.2
[5] shiny_0.9.1.9000 pitchRx_1.3 ggplot2_0.9.3.1 stringr_0.6.2
[9] RSQLite.extfuns_0.0.1 RSQLite_0.11.4 DBI_0.2-7 dplyr_0.1.3
[13] Rcpp_0.11.1 devtools_1.5
1. cpsievert April 15, 2014 at 9:16 pm | Reply
  
  Hmm. Make sure my_db is empty (in a clean session) before you use scrape(). If that doesn’t work, let’s continue the discussion on GitHub. This is what I see:
  
  > my_db
  src: sqlite 3.7.17 [pitchRx.sqlite3]
  tbls:
  > scrape(start = “2008-01-01”, end = Sys.Date(), connect = my_db$con)
  If file names don’t print right away, please be patient.
  http://gd2.mlb.com/components/game/mlb/year_2008/month_02/day_28/gid_2008_02_28_bocbbc_bosmlb_1/inning/inning_all.xml
  http://gd2.mlb.com/components/game/mlb/year_2008/month_02/day_29/gid_2008_02_29_detmlb_tormlb_1/inning/inning_all.xml
  
  …
Andrew Clark (@pssGuy) April 15, 2014 at 11:31 pm | Reply

Thanks for getting back
I created a new sqlite3 db, restarted R session and now it appears to be chugging along nicely
Presumably as and when I want to add new data, I just repeat code dropping the create=True argument and changing dates. And that once I have created the indices they do not need to be redone each time? Although that is trivial to redo, of course
1. cpsievert April 16, 2014 at 12:09 am | Reply
  
  Great! create=T is deceptive because it will actually connect to an existing db if the filename matches (and won’t create a new one). I don’t think you have to recreate indices after an update.
skyjo May 1, 2014 at 5:02 am | Reply

I run into this error:

> library(pitchRx)
> library(RSQLite)
> library(dplyr)
>
> my_db <- src_sqlite("pitchRx.sqlite3", create = TRUE)
Loading required package: RSQLite.extfuns
Error: RSQLite.extfuns package required to effectively use sqlite db
In addition: Warning message:
In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, :
there is no package called ‘RSQLite.extfuns’
1. cpsievert May 1, 2014 at 5:20 am | Reply
  
  Try installing this package —
  
  install.packages(“RSQLite.extfuns”)
  1. skyjo May 1, 2014 at 6:15 pm
    
    Ha! Ok I was silly for not trying that on my own. It worked.
  2. skyjo May 1, 2014 at 6:33 pm
    
    Thanks.
ecocarlisle December 10, 2015 at 9:22 pm | Reply

Hi,

I stumbled upon this blog yesterday and it is excellent! I am just getting started with my database. I successfully loaded tbls: action, atbat, game, media, pitch, po, runner.

When I try to bring in other files <- c("inning/inning_hit.xml", "miniscoreboard.xml", "players.xml") it gets stuck on Collecting garbage.

I'd like to have all of the action, atbat, coach, game, hip, media, pitch, player, po, runner, umpire.

Any thoughts on why scrape does not return hip,player and umpire tables?

Thanks,
Jon
1. cpsievert December 11, 2015 at 6:41 pm | Reply
  
  Hmm, I would try again in a new R session. If you still have problems, please share your code and post an issue here — https://github.com/cpsievert/pitchRx/issues/new

	Jim Albert on retrosheet Package and Compari…
	addisonmcg99 on retrosheet Package and Compari…
	Jim Albert on Calculation of Win Probabiliti…
	John Purlia on Calculation of Win Probabiliti…
	bbaumer21 on New Edition of Analyzing Baseb…

Exploring Baseball Data with R

Starting and updating a PITCHf/x database with pitchRx and dplyr

Setting up a PITCHf/x Database

Updating your PITCHf/x Database

22 responses

Leave a comment Cancel reply

Recent Posts

Recent Comments

Archives

Categories

Meta

Exploring Baseball Data with R

Starting and updating a PITCHf/x database with pitchRx and dplyr

Setting up a PITCHf/x Database

Updating your PITCHf/x Database

Share this:

Related

22 responses

Leave a comment Cancel reply

Recent Posts

Recent Comments

Archives

Categories

Meta