I’m learning more about reproducible research, and I happened to find a helpful post by Jon Zelner, who describes how to use the docopt package to execute an R script as a program from the command line. Let me illustrate this with a baseball example.
Graphing Batting Averages
Suppose I’m interested in graphing the distribution of batting averages of all players with a minimum number of at-bats in a particular season. I write the following R script, which uses 2014 data and a minimum of 300 at-bats.
require(Lahman)
require(dplyr)
require(ggplot2)
y <- 2014   # season
m <- 300    # minimum number of at-bats
png("myplot.png")
# collect each player's season totals of hits and at-bats
d <- filter(Batting, yearID==y)
S <- summarize(group_by(d, playerID), H=sum(H), AB=sum(AB))
p <- ggplot(filter(S, AB>=m), aes(H/AB)) +
  geom_histogram(aes(x=H/AB, y=..density..)) +
  geom_density(size=2, color="red") +
  ggtitle(paste(y, 'Batting Averages')) +
  theme(plot.title = element_text(size = rel(2), hjust=0.5, color = "blue"))
# print explicitly: source() does not auto-print a ggplot object
print(p)
dev.off()
If I source this file within R, I’ll find a file “myplot.png” that contains a histogram and density plot of the batting averages.
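To see why the script sums H and AB within each playerID before applying the cutoff, here is a minimal sketch on a made-up data frame. The toy values are hypothetical and are not taken from Lahman. The assumption is that a player can have more than one row in a season, for example after a mid-season trade.

```r
library(dplyr)

# Hypothetical toy data: player "a" has two rows in the same season
toy <- data.frame(
  playerID = c("a", "a", "b"),
  yearID   = c(2014, 2014, 2014),
  H        = c(50, 40, 100),
  AB       = c(200, 150, 320)
)

# Same steps as the script: pick the season, then total H and AB per player
S <- toy %>%
  filter(yearID == 2014) %>%
  group_by(playerID) %>%
  summarize(H = sum(H), AB = sum(AB))

# Apply the at-bat cutoff to the season totals and compute the average
qualified <- S %>%
  filter(AB >= 300) %>%
  mutate(AVG = H / AB)

print(qualified)
```

Player "a" clears the 300 at-bat cutoff only after the two rows are combined (200 + 150 = 350), which is the point of grouping first and filtering second.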
Making the R Script a Program
Suppose I want to run this script as a program from Terminal (the command line interface for Macs) outside of R. There will be three inputs: y (season), m (minimum number of at-bats) and o (output file). I revise my R script as follows:
#!/usr/bin/Rscript
require(docopt)
# Usage text parsed by docopt. Each option is separated from its
# description by two spaces, as docopt expects.
'Usage:
  bavg_normal.R [-y <season> -m <min_ab> -o <output>]

Options:
  -y <season>  Season [default: 2014]
  -m <min_ab>  Minimum number of at-bats [default: 300]
  -o <output>  Output file [default: bavg.png]
' -> doc
opts <- docopt(doc)
require(Lahman)
require(dplyr)
require(ggplot2)
png(opts$o)
# inputs arrive as character strings, so convert where numbers are needed
d <- filter(Batting, yearID==as.numeric(opts$y))
S <- summarize(group_by(d, playerID), H=sum(H), AB=sum(AB))
p <- ggplot(filter(S, AB>=as.numeric(opts$m)), aes(H/AB)) +
  geom_histogram(aes(x=H/AB, y=..density..)) +
  geom_density(size=2, color="red") +
  ggtitle(paste(opts$y, 'Batting Averages')) +
  theme(plot.title = element_text(size = rel(2), hjust=0.5, color = "blue"))
# explicit print so the plot is drawn however the script is run
print(p)
dev.off()
What extra things did I add?
- I added a string variable doc that indicates the syntax of the command line, including the options and default values.
- The docopt function (from the package of the same name) puts the inputs into a list opts; each element is character valued and holds the corresponding input.
- Last, I write the plotting code using the inputs opts$y, opts$m, and opts$o, making sure to convert them to numeric values where needed.
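The parsing step can be tried out inside an R session without going to the command line. The sketch below assumes the docopt package is installed and uses the args argument of docopt() to stand in for what would be typed in Terminal. The usage string is a stand-in written in the same spirit as the script's doc string, not a copy of it.

```r
library(docopt)

# Stand-in usage string: three optional inputs, each with a default
doc <- 'Usage:
  bavg_normal.R [-y <season> -m <min_ab> -o <output>]

Options:
  -y <season>  Season [default: 2014]
  -m <min_ab>  Minimum number of at-bats [default: 300]
  -o <output>  Output file [default: bavg.png]
'

# Simulate the command line "-y 1964"; -m and -o fall back to their defaults
opts <- docopt(doc, args = c("-y", "1964"))

str(opts$y)                 # stored as a character string
season <- as.numeric(opts$y)  # convert before using it in a comparison
min_ab <- as.numeric(opts$m)
print(c(season, min_ab))
print(opts$o)
```

Because every parsed value is a string, a comparison such as yearID == opts$y relies on implicit coercion; converting once with as.numeric() keeps the intent explicit.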
Running the R “Program” from the Command Line
Now I can quit R, open up Terminal and navigate to the folder that contains this script called “bavg_normal.R”. Then I type
chmod +x bavg_normal.R
./bavg_normal.R -y 1964 -m 300 -o bavg_1964.png
The first line makes the file “bavg_normal.R” executable, and the second line runs it as a program with the inputs year 1964, a minimum of 300 AB, and output file “bavg_1964.png”.
I look in my folder and see a graphics file containing a histogram of the 1964 batting averages of players with at least 300 AB.
Running R scripts as executable programs seems to be attractive, and I hope to be able to illustrate the usefulness of this feature in future posts.
NOTE: My colleague Maria commented that this is similar to using BATCH mode (R CMD BATCH), where you run R scripts from the terminal. Perhaps each method has particular advantages for certain applications.
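For comparison, here is a rough sketch of how the same three inputs might be handled with base R alone, using commandArgs(). The helper name parse_inputs and the choice of positional arguments (season, minimum AB, output file, in that order) are assumptions made for illustration. Unlike docopt, this gives no usage message or named options.

```r
# Hypothetical helper: read season, minimum AB, and output file from
# positional arguments, falling back to the defaults used in this post.
parse_inputs <- function(args = commandArgs(trailingOnly = TRUE)) {
  defaults <- c("2014", "300", "bavg.png")
  # Fill in any trailing arguments that were not supplied
  if (length(args) < 3) {
    args <- c(args, defaults[(length(args) + 1):3])
  }
  list(y = as.numeric(args[1]), m = as.numeric(args[2]), o = args[3])
}

# Simulate supplying only the season
inputs <- parse_inputs(c("1964"))
str(inputs)
```

The trade-off is that the order of the arguments now carries the meaning, so a user has to remember it, whereas the docopt version documents itself through the usage string.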