Calculating distances in R

In the previous post Jim mentioned we live some 4,000 miles apart.

There are several online services that provide flying distances between any two places, but since this is an R blog I’m going to show how to calculate them in R. And since it’s a blog about baseball, I’ll then show an application to baseball: in this post I’ll just compute distances between ballparks used for the 2012 MLB season; in another post we will see how to calculate how far teams travel during a season.

The R package ggmap provides a function named geocode which, given a location (a city, an address…), returns its latitude and longitude coordinates.

Let’s get the coordinates for where both of us live.

library(ggmap)
JA = geocode("Findlay, Ohio")
MM = geocode("Sasso Marconi, Bologna")

The ggmap package also features a mapdist function that calculates travel distances between places (driving, by default) using the Google Maps API. Unfortunately that won’t work in our case because there’s an ocean in the way.

Thus we will be using geosphere, a package that implements spherical trigonometry and is useful for calculating great-circle distances on the globe.

library(geosphere)
distHaversine(JA, MM) * 0.000621371
[1] 4536.283

After loading the package, the second line calculates the Haversine distance between our home towns. distHaversine returns meters, so we multiply by 0.000621371 to convert the result to miles.
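If you are curious about what happens under the hood, here is a rough base-R sketch of the Haversine formula. The function name haversine_m and the hand-typed coordinates are my own illustrative assumptions (they are not geocoded values); the radius of 6378137 meters is the default used by distHaversine.

```r
# Haversine great-circle distance, in meters, between two c(lon, lat) points.
# r is the radius of the sphere (6378137 m is the distHaversine default).
haversine_m <- function(p1, p2, r = 6378137) {
  to_rad <- pi / 180
  lon1 <- p1[1] * to_rad
  lat1 <- p1[2] * to_rad
  lon2 <- p2[1] * to_rad
  lat2 <- p2[2] * to_rad
  a <- sin((lat2 - lat1) / 2)^2 +
    cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2)^2
  # min() guards against rounding pushing sqrt(a) slightly above 1
  2 * r * asin(min(1, sqrt(a)))
}

meters_to_miles <- 0.000621371

# Approximate coordinates typed by hand for illustration (longitude first)
findlay <- c(-83.65, 41.04)
sasso_marconi <- c(11.25, 44.40)
haversine_m(findlay, sasso_marconi) * meters_to_miles
```

Note that longitude comes first and latitude second, which is the same order geocode and geosphere use.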

OK, let’s now move to baseball. In order to run the following code you should first download (and unzip) from Retrosheet the Game Log file for the 2012 season, the latest currently available (2013 should be available soon).

options(stringsAsFactors=FALSE)
season = 2012
# download the 2012 Game Log file, unzip it, and place it in a folder of choice
# then modify the following line accordingly
games = read.table(paste("your/gamelogs/folder/gl", season, ".txt", sep=""), sep=",")
#download game_log_header.csv from our GitHub repository
glheaders = read.csv("game_log_header.csv")
names(games) = names(glheaders)

First we tell R not to treat strings as factors (the default behavior in R versions before 4.0.0) and we set the season to 2012.
Then we read the Game Log file into R and assign its column names, since the file comes without a header row. The file game_log_header.csv is available for download at the GitHub page dedicated to the book, as are all the code and files used in the book.
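To see the header trick in isolation, here is a small sketch that runs without downloading anything. The two data rows and all column names except ParkID are made up for illustration; the real Game Log file has many more columns.

```r
# Two made-up rows standing in for a headerless Game Log file
raw_lines <- "20120328,SEA,OAK,TOK01
20120405,BOS,DET,DET05"
toy_games <- read.csv(text = raw_lines, header = FALSE,
                      stringsAsFactors = FALSE)
names(toy_games)   # R's default names for a headerless file

# Hypothetical header file: only its column names matter, it has no data rows
toy_headers <- read.csv(text = "Date,VisitingTeam,HomeTeam,ParkID")

# Copy the names over, as done with games and glheaders
names(toy_games) <- names(toy_headers)
str(toy_games)
```

This only works when the header file lists the columns in the same order, and in the same number, as the data file.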

Then we store the identifiers of parks that featured a game in the 2012 season in the vector parks.

parks = unique(games$ParkID)

A really useful database about ballparks is available at the SeamHeads website. At the bottom of the page is a link for downloading the Excel files that make up the database. The xlsx package is useful for reading such files into R: the read.xlsx function used in the code below tells R to load the data from the first worksheet of the Parks.xlsx file.

#get park info from KJOK database @ SeamHeads
library(xlsx)
allparks = read.xlsx("your/path/to/KJOK/database/Parks.xlsx", 1)
#keep only parks used for the season
parksinfo = subset(allparks, PARKID %in% parks)

Now, let’s get the coordinates of the ballparks. Notice that I actually get the coordinates for the cities in which the ballparks are located (so, for example, Citi Field and Yankee Stadium will get the same coordinates). You can try modifying the code to include the name of the ballpark in the string to be geocoded: that will give more precise results for most of the parks, but you’ll have to manually assign coordinates to the few that are not correctly geocoded.
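For those who want to try the ballpark-level variant, a sketch could look like the one below. The toy_parks data frame is a hand-made stand-in for parksinfo, the coordinates are approximate values typed by hand, and the park needing a manual fix is chosen arbitrarily; the geocode call itself is commented out because it needs ggmap and network access.

```r
# Toy stand-in for parksinfo, with the NAME, CITY and STATE columns used in this post
toy_parks <- data.frame(
  NAME  = c("Fenway Park", "Wrigley Field"),
  CITY  = c("Boston", "Chicago"),
  STATE = c("MA", "IL"),
  stringsAsFactors = FALSE
)

# Build "park, city, state" query strings instead of "city, state"
queries <- paste(toy_parks$NAME, toy_parks$CITY, toy_parks$STATE, sep = ", ")
queries

# locations <- geocode(queries)   # the real call, left out of this sketch

# Pretend the first park was not geocoded correctly (illustrative values only)
locations <- data.frame(lon = c(NA, -87.66), lat = c(NA, 41.95))

# Manually assign coordinates to the park that failed
fix <- toy_parks$NAME == "Fenway Park"
locations$lon[fix] <- -71.10
locations$lat[fix] <- 42.35
locations
```

The rest of this post sticks with the simpler city-level lookup: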

locations = geocode(paste(parksinfo$CITY, parksinfo$STATE, sep=", "))
parksinfo = cbind(parksinfo, locations)

The cbind function in the second line above simply appends the two coordinate columns that make up the locations data frame to the right of the parksinfo data frame.
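With lon and lat columns in place, the distances between ballparks can be computed in one go. Below is a minimal sketch using the distm function from geosphere on a toy data frame; the three parks and their approximate city-level coordinates are typed by hand for illustration, so the numbers will not match what geocode returns exactly.

```r
library(geosphere)

# Toy stand-in for parksinfo after geocoding (approximate city coordinates)
toy_parks <- data.frame(
  NAME = c("Fenway Park", "Wrigley Field", "Safeco Field"),
  lon  = c(-71.06, -87.63, -122.33),
  lat  = c(42.36, 41.88, 47.61),
  stringsAsFactors = FALSE
)

# distm wants longitude first, then latitude, and returns meters
coords <- as.matrix(toy_parks[, c("lon", "lat")])
park_dist <- distm(coords, fun = distHaversine) * 0.000621371

# Label rows and columns so the matrix reads like a mileage chart
dimnames(park_dist) <- list(toy_parks$NAME, toy_parks$NAME)
round(park_dist)
```

Swapping toy_parks for parksinfo should give the full park-to-park mileage matrix, provided every park was geocoded successfully.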

Let’s see how the geocoding process went by plotting the ballparks on a map. The get_map function from the ggmap package grabs the relevant map from Google services, then ggmap() lets us display data on it using the ggplot syntax shown in Chapter 6 of our book.

map = get_map("united states", zoom=4, maptype="roadmap")
ggmap(map) +
  geom_point(data=parksinfo, aes(x=lon, y=lat), size=3) +
  geom_text(data=parksinfo, aes(x=lon, y=lat, label=NAME), size=4, col="blue", vjust=-.5)

The geom_point function adds points to the map at the positions specified in the aesthetics (aes) parameter, while geom_text adds labels just above those points (the vertical positioning is fine-tuned with the vjust parameter).

Here’s the resulting output.

Ballparks used in the 2012 season mapped.

The code above returns a couple of warnings in the R console: that’s because it can’t display the Tokyo Dome, which was used for a couple of games in 2012 and falls outside the area covered by the map.
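If the warnings bother you, one option is to drop the off-map parks before plotting. Here is a rough sketch: the bounding box is a hypothetical one that roughly covers the contiguous United States (it is not read from the map object), and the toy coordinates are approximate.

```r
# Toy stand-in for parksinfo, including one park far outside the map
toy_parks <- data.frame(
  NAME = c("Fenway Park", "Safeco Field", "Tokyo Dome"),
  lon  = c(-71.06, -122.33, 139.75),
  lat  = c(42.36, 47.61, 35.71),
  stringsAsFactors = FALSE
)

# Hypothetical bounding box (min, max) for longitude and latitude
bounds <- list(lon = c(-130, -60), lat = c(20, 55))

# TRUE for parks whose coordinates fall inside the box
on_map <- toy_parks$lon >= bounds$lon[1] & toy_parks$lon <= bounds$lon[2] &
  toy_parks$lat >= bounds$lat[1] & toy_parks$lat <= bounds$lat[2]

toy_parks[!on_map, "NAME"]        # parks that would be left off the plot
plot_parks <- toy_parks[on_map, ]
plot_parks
```

Passing the filtered data frame to geom_point and geom_text in place of parksinfo should then plot only the parks that fit on the map.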

Let’s stop here for today. We’ll be back soon with a follow-up post, in which we will calculate how far each team traveled during the 2012 season and compare the results with team mileages of the past.
