In the previous post Jim mentioned we live some 4,000 miles apart.
There are several on-line services that provide flying distances between any two places, but since this is an R blog I’m going to show how to calculate them in R. And since it’s a blog about baseball, I’ll then show an application to baseball: in this post I’ll just compute distances between ballparks used for the 2012 MLB season; in another post we will see how to calculate how far teams travel during a season.
The R package ggmap provides a function named geocode which, given a location (a city, an address…), returns its latitude and longitude coordinates.
Let’s get the coordinates for where both of us live.

    library(ggmap)
    JA = geocode("Findlay, Ohio")
    MM = geocode("Sasso Marconi, Bologna")

The ggmap package also features a mapdist function that calculates travel distances (driving, for example) between places using the Google Maps API. Unfortunately that won’t work in our case because there’s an ocean in the way.
Thus we will be using geosphere, a package which implements spherical trigonometry, useful for calculating great-circle distances on the globe.

    library(geosphere)
    distHaversine(JA, MM) * 0.000621371
    ## [1] 4536.283

The second line calculates the Haversine distance between our home towns; the constant 0.000621371 is the factor that converts meters to miles.
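For the curious, here is a rough base R sketch of what the Haversine formula does under the hood. The function name haversine_miles is my own invention, and the Earth radius of 6378137 meters is an assumption (I believe it matches the default used by distHaversine, but check the package documentation before relying on that).

```r
# Hypothetical helper: great-circle distance in miles between two points,
# given longitudes and latitudes in decimal degrees.
# r is the assumed Earth radius in meters.
haversine_miles <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  # Haversine of the central angle between the two points
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding errors pushing sqrt(a) above 1
  meters <- 2 * r * asin(pmin(1, sqrt(a)))
  meters * 0.000621371  # meters to miles
}

# A quarter of the way around the equator
quarter_equator <- haversine_miles(0, 0, 90, 0)
```

Since geocode returns a data frame with lon and lat columns, a call along the lines of haversine_miles(JA$lon, JA$lat, MM$lon, MM$lat) should land very close to the distHaversine result above.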
OK, let’s now move to baseball. In order to run the following code you should first download (and unzip) from Retrosheet the Game Log file for the 2012 season, the latest currently available (2013 should be available soon).

    options(stringsAsFactors = FALSE)
    season = 2012
    # download the 2012 Game Log file, unzip it, and place it in a folder of choice
    # then modify the following line accordingly
    games = read.table(paste("your/gamelogs/folder/gl", season, ".txt", sep=""), sep=",")
    # download game_log_header.csv from our GitHub repository
    glheaders = read.csv("game_log_header.csv")
    names(games) = names(glheaders)

First we tell R not to treat strings as factors (as it would do by default) and we set the season to 2012.
Then we read the Game Log file into R and replace its column names. The file game_log_header.csv is available for download at the GitHub page dedicated to the book, as are all the code and files used in the book.
Then we store the identifiers of the parks that hosted at least one game in the 2012 season in the vector parks.

    parks = unique(games$ParkID)

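If you want to see the read, rename, and deduplicate steps in action without downloading anything, here is a small sketch on made-up data. Everything in it is hypothetical: the three columns, the park identifiers and the team codes are invented for illustration, and a real Game Log file has many more fields.

```r
# Toy stand-in for a Game Log file: comma-separated, no header row.
# All values below are invented.
toy_log <- "20120401,PRK01,AAA
20120402,PRK01,AAA
20120403,PRK02,BBB"

# Toy stand-in for game_log_header.csv: a single line of column names.
toy_header <- "Date,ParkID,HomeTeam"

toy_games <- read.table(text = toy_log, sep = ",", stringsAsFactors = FALSE)
toy_glheaders <- read.csv(text = toy_header)

# Same trick as above: borrow the column names from the header file
names(toy_games) <- names(toy_glheaders)

# One entry per park, in order of first appearance
toy_parks <- unique(toy_games$ParkID)
```

The header file contributes nothing but its names, which is why reading it with read.csv and then calling names() is enough.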
A really useful database about ballparks is available at the SeamHeads website. At the bottom of the page is a link for downloading the Excel files composing the database. The xlsx package will be useful to read such files in R: the read.xlsx function used in the code below tells R to load the data in the first worksheet of the workbook (the second argument, 1, is the sheet index).

    # get park info from KJOK database @ SeamHeads
    library(xlsx)
    allparks = read.xlsx("your/path/to/KJOK/database/Parks.xlsx", 1)
    # keep only parks used for the season
    parksinfo = subset(allparks, PARKID %in% parks)

Now, let’s get the coordinates of the ballparks. Notice that I actually get the coordinates for the cities in which ballparks are located (so, for example, Citi Field and Yankee Stadium will get the same coordinates). You can try to modify the code to include the name of the ballpark in the string to be geocoded: that will get more precise results for most of the parks, but you’ll have to manually assign coordinates to a few of them that are not correctly geocoded.

    locations = geocode(paste(parksinfo$CITY, parksinfo$STATE, sep=", "))
    parksinfo = cbind(parksinfo, locations)

The cbind function in the second line above simply appends the two coordinate columns that make up the locations data frame to the right of the parksinfo data frame.
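As for the suggestion of geocoding by ballpark name, here is a minimal sketch of how the query strings could be built. The data frame below is a made-up stand-in for parksinfo (fake names, cities and states); only the NAME, CITY and STATE column names come from the code in this post.

```r
# Made-up stand-in for parksinfo; the values are invented.
toy_parksinfo <- data.frame(
  NAME  = c("Example Park", "Sample Field"),
  CITY  = c("Springfield", "Riverton"),
  STATE = c("XX", "YY"),
  stringsAsFactors = FALSE
)

# City-level query, as used in this post
city_query <- paste(toy_parksinfo$CITY, toy_parksinfo$STATE, sep = ", ")

# Park-level query: prepend the ballpark name to the city and state
park_query <- paste(toy_parksinfo$NAME, city_query, sep = ", ")

# park_query could then be passed to geocode() in place of city_query
# (not run here, since geocoding needs network access)
```

After geocoding the longer strings it is worth eyeballing the returned coordinates, since any park that lands in the wrong place will need its lon and lat overwritten by hand.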
Let’s see how the geocoding went by plotting the ballparks on a map. The get_map function from the ggmap package grabs the relevant map from Google services, then ggmap() lets us display data on it using the ggplot syntax shown in Chapter 6 of our book.

    map = get_map("united states", zoom=4, maptype="roadmap")
    ggmap(map) +
      geom_point(data=parksinfo, aes(x=lon, y=lat), size=3) +
      geom_text(data=parksinfo, aes(x=lon, y=lat, label=NAME), size=4, col="blue", vjust=-.5)

The geom_point function adds points to the map at the positions specified in the aesthetics (aes) parameter, while geom_text adds labels just above those points (the vertical positioning is fine-tuned by the vjust=-.5 setting in the call above).
Here’s the resulting output.
The code above returns a couple of warnings in the R console: that’s because the Tokyo Dome, used for a couple of games in 2012, lies outside the map and can’t be displayed.
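If the warnings bother you, one option is to drop the parks that fall outside the map before plotting. Here is a small sketch of that idea; the in_map helper, the toy coordinates and the longitude and latitude limits are all assumptions of mine, not values taken from the actual map.

```r
# Made-up parks: two roughly within the continental US, one far outside it
toy_parks <- data.frame(
  NAME = c("Inside A", "Inside B", "Far Away"),
  lon  = c(-84, -118, 140),
  lat  = c(41, 34, 36),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only rows whose coordinates fall inside the
# given longitude and latitude ranges (rows with missing coordinates
# are dropped too, because which() ignores NA)
in_map <- function(df, lon_range, lat_range) {
  keep <- df$lon >= lon_range[1] & df$lon <= lon_range[2] &
    df$lat >= lat_range[1] & df$lat <= lat_range[2]
  df[which(keep), ]
}

# Assumed limits, picked by eye for a map of the United States
visible <- in_map(toy_parks, lon_range = c(-130, -60), lat_range = c(20, 55))
```

Passing the filtered data frame to geom_point and geom_text should then plot without complaints. I believe the map returned by get_map stores its actual extent in attr(map, "bb"), which would be a better source for the limits than guessing, but take that as an assumption to verify.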
Let’s stop here for today. We’ll be back soon with a follow-up post, in which we will calculate how far each team traveled during the 2012 season, and compare the results with team mileages of the past.