Calculating distances in R

In the previous post Jim mentioned we live some 4,000 miles apart.

There are several online services that provide flying distances between any two places, but since this is an R blog I’m going to show how to calculate them in R. And since it’s a blog about baseball, I’ll then show an application to baseball: in this post I’ll just compute distances between ballparks used for the 2012 MLB season; in another post we will see how to calculate how far teams travel during a season.

The R package `ggmap` provides a function named `geocode` which, given a location (a city, an address…), returns its latitude and longitude coordinates.

Let’s get the coordinates for where both of us live.

```
library(ggmap)
JA = geocode("Findlay, Ohio")
MM = geocode("Sasso Marconi, Bologna")
```

The `ggmap` package also features a `mapdist` function that calculates travel distances (driving, for example) between places using the Google Maps API. Unfortunately that won’t work in our case because there’s an ocean in the way.

Thus we will be using `geosphere`, a package that implements spherical trigonometry and is useful for calculating great-circle distances on the globe.

```
library(geosphere)
# distHaversine() returns meters; multiply by 0.000621371 to get miles
distHaversine(JA, MM) * 0.000621371
# [1] 4536.283
```

After loading the package, the call to `distHaversine` calculates the Haversine (great-circle) distance between our home places. The function returns meters, so the multiplying constant is the factor that converts meters to miles.
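If you are curious about what happens under the hood, here is a minimal base R sketch of the Haversine formula. It is my own illustration, not the code `geosphere` actually runs. The function name `haversine_m` is made up, the coordinates are rough values typed in by hand rather than geocoded, and the radius is the default that `distHaversine` uses (6,378,137 meters).

```r
# Haversine great-circle distance in base R (a sketch, not geosphere's code).
# p1 and p2 are c(lon, lat) in decimal degrees; r is the sphere radius in meters.
haversine_m = function(p1, p2, r = 6378137) {
  to_rad = pi / 180
  lon1 = p1[1] * to_rad; lat1 = p1[2] * to_rad
  lon2 = p2[1] * to_rad; lat2 = p2[2] * to_rad
  a = sin((lat2 - lat1) / 2)^2 +
    cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2)^2
  # min() guards against rounding pushing sqrt(a) slightly above 1
  2 * r * asin(min(1, sqrt(a)))
}

# Approximate coordinates typed in by hand (assumed, not geocoded)
findlay = c(-83.65, 41.04)
sasso_marconi = c(11.25, 44.40)

meters_per_mile = 1609.344
haversine_m(findlay, sasso_marconi) / meters_per_mile
```

With these hand-typed coordinates the result should land close to the `geosphere` figure, though not exactly on it, since the geocoded points differ slightly.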

OK, let’s now move to baseball. In order to run the following code you should first download (and unzip) the Game Log file for the 2012 season from Retrosheet. It is the latest currently available (2013 should be available soon).

```
options(stringsAsFactors = FALSE)
season = 2012
# download the 2012 Game Log file, unzip it, and place it in a folder of your choice,
# then modify the following line accordingly
games = read.table(paste0("your/gamelogs/folder/gl", season, ".txt"), sep = ",")
# download game_log_header.csv from our GitHub repository
glheaders = read.csv("game_log_header.csv")
names(games) = names(glheaders)
```

First we tell R not to treat strings as factors (as it would do by default) and we set the season to 2012. Then we read the Game Log file into R and assign its column names, since the file itself comes without a header row. The file `game_log_header.csv` is available for download at the GitHub page dedicated to the book, as are all the code and files used in the book.

Then we store the identifiers of parks that featured a game in the 2012 season in the vector `parks`.

```
parks = unique(games$ParkID)
```
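To see what the reading, renaming and `unique` steps do without downloading anything, here is a small self-contained sketch. Everything in it is made up for illustration: the three-column layout, the team codes and the park identifiers are placeholders, and the real Game Log file has many more columns.

```r
# Toy stand-in for a headerless, comma-separated game log
# (three made-up columns and placeholder park IDs, for illustration only)
toy_log = "20120401,AAA,PRK01
20120402,BBB,PRK02
20120403,AAA,PRK01"
games_toy = read.table(text = toy_log, sep = ",", stringsAsFactors = FALSE)
names(games_toy)   # without a header row, R assigns default names

# Hypothetical header; only ParkID matches a name used in this post
names(games_toy) = c("Date", "HomeTeam", "ParkID")

# Each park appears once, no matter how many games it hosted
parks_toy = unique(games_toy$ParkID)
parks_toy
```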

A really useful database about ballparks is available at the Seamheads website. At the bottom of the page is a link for downloading the Excel files that make up the database. The `xlsx` package is handy for reading such files into R: the `read.xlsx` function used in the code below tells R to load the data in the first worksheet of the `Parks.xlsx` file.

```
# get park info from the KJOK database @ Seamheads
library(xlsx)
allparks = read.xlsx("your/path/to/KJOK/database/Parks.xlsx", 1)
# keep only parks used for the season
parksinfo = subset(allparks, PARKID %in% parks)
```

Now, let’s get the coordinates of the ballparks. Notice that I actually get the coordinates for the cities in which the ballparks are located (so, for example, Citi Field and Yankee Stadium will get the same coordinates). You can try to modify the code to include the name of the ballpark in the string to be geocoded: that will give more precise results for most of the parks, but you’ll have to manually assign coordinates to the few that are not correctly geocoded.

```
locations = geocode(paste(parksinfo$CITY, parksinfo$STATE, sep=", "))
parksinfo = cbind(parksinfo, locations)
```

The `cbind` function in the second line above simply appends the two coordinate columns that make up the `locations` data frame to the right of the `parksinfo` data frame.
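Here is a small sketch of both ideas on a toy table, so it runs without calling any geocoding service: building the park-level query strings suggested above, and appending coordinates with `cbind`. The `PARKID` values are placeholders and the coordinates are rough values I typed in by hand, standing in for what `geocode` would return.

```r
# Toy parks table; PARKID values are placeholders
parks_toy_info = data.frame(
  PARKID = c("PRK01", "PRK02"),
  NAME   = c("Citi Field", "Yankee Stadium"),
  CITY   = c("New York", "New York"),
  STATE  = c("NY", "NY"),
  stringsAsFactors = FALSE
)

# City-level query (as in the post) vs. park-level query (the suggested variation)
city_query = paste(parks_toy_info$CITY, parks_toy_info$STATE, sep = ", ")
park_query = paste(parks_toy_info$NAME, city_query, sep = ", ")
park_query

# Stand-in for what geocode() would return: one lon/lat row per query
# (rough hand-typed values, not real geocoder output)
locations_toy = data.frame(lon = c(-73.85, -73.93), lat = c(40.76, 40.83))

# cbind() appends the lon and lat columns to the right of the parks table
parks_toy_info = cbind(parks_toy_info, locations_toy)
names(parks_toy_info)
```

The two city-level queries are identical, which is why both parks end up with the same coordinates in the post; the park-level strings differ, so a geocoder has a chance to tell them apart.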

Let’s see how the geocoding process went by plotting the ballparks on a map. The `get_map` function from the `ggmap` package grabs the relevant map from Google services; `ggmap()` then lets us display data on it using the `ggplot2` syntax shown in Chapter 6 of our book.

```
map = get_map("united states", zoom=4, maptype="roadmap")
ggmap(map) +
  geom_point(data=parksinfo, aes(x=lon, y=lat), size=3) +
  geom_text(data=parksinfo, aes(x=lon, y=lat, label=NAME), size=4, col="blue", vjust=-.5)
```

The `geom_point` function adds points to the map at the positions specified in the aesthetics (`aes`) parameter, while `geom_text` adds labels just above those points (the vertical positioning is fine-tuned with the `vjust` parameter).

Here’s the resulting output.

Figure: ballparks used in the 2012 season, mapped.

The code above returns a couple of warnings in the R console: that’s because the Tokyo Dome, used for a couple of games in 2012, falls outside the area covered by the map and cannot be displayed.
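If you would rather know in advance which parks fall off the map, one option is to compare their coordinates with the map's bounding box before plotting. The sketch below uses a made-up bounding box and rough hand-typed coordinates so that it runs on its own. With a real map object, as far as I know, `attr(map, "bb")` holds the actual corners.

```r
# Hypothetical bounding box roughly covering the map of the United States;
# with a real ggmap object, attr(map, "bb") should hold the actual corners
bbox = list(lon_min = -130, lon_max = -60, lat_min = 20, lat_max = 55)

# Toy coordinates (approximate, typed in by hand)
parks_xy = data.frame(
  NAME = c("Citi Field", "Tokyo Dome"),
  lon  = c(-73.85, 139.75),
  lat  = c(40.76, 35.71),
  stringsAsFactors = FALSE
)

# TRUE for parks whose coordinates fall inside the box
inside = with(parks_xy,
  lon >= bbox$lon_min & lon <= bbox$lon_max &
  lat >= bbox$lat_min & lat <= bbox$lat_max)

on_map  = parks_xy[inside, ]   # rows that can be drawn
off_map = parks_xy[!inside, ]  # rows that would trigger the warnings
off_map$NAME
```

Passing only the `on_map` rows to `geom_point` and `geom_text` would then leave nothing outside the plotting area.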

Let’s stop here for today. We’ll be back soon with a follow-up post, in which we will calculate how much each team traveled during the 2012 season and compare the results with teams’ mileages from past seasons.
