How much do teams travel?

Last week we showed how to geocode locations and calculate distances in R. At the end of the post we mapped the ballparks used for the 2012 season.

The goal today is to compute the distance each team traveled during the season.
First, we’ll reuse some of the code from the last post.


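# packages used below (assumed to be the same ones as in last post):
# xlsx for read.xlsx, ggmap for geocode, geosphere for distm
library(xlsx)
library(ggmap)
library(geosphere)

#set the season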
season = 2012

# download the 2012 Game Log file, unzip it, and place it in a folder of choice
# then modify the following line accordingly
games = read.table(paste("your/gamelogs/folder/gl", season, ".txt", sep=""), sep=",")
#download game_log_header.csv from our GitHub repository (
glheaders = read.csv("game_log_header.csv")
names(games) = names(glheaders)

#vector of unique park identifiers
parks = unique(games$ParkID)

#get park info from KJOK database @ SeamHeads
allparks = read.xlsx("your/path/to/KJOK/database/Parks.xlsx", 1)

#keep only parks used for the season
parksinfo = subset(allparks, PARKID %in% parks)

#get lat/lon coordinates of ballparks
locations = geocode(paste(parksinfo$CITY, parksinfo$STATE, sep=", "))
parksinfo = cbind(parksinfo, locations)

Let’s now calculate the distances between any pair of ballparks. That’s accomplished by the distm function of the geosphere package (introduced at the beginning of last post).

# matrix of distances
distances = distm(locations)

The resulting object distances is a 31×31 matrix. Below the first 4 rows and 4 columns are displayed.

        [,1]    [,2]      [,3]      [,4]
[1,]       0 1936759 3087662.6 3709945.1
[2,] 1936759       0 1188874.1 1978930.8
[3,] 3087663 1188874       0.0  929071.8
[4,] 3709945 1978931  929071.8       0.0

There is no reference to the ballparks in the matrix above. This is resolved by naming the rows and the columns of the distances matrix.

rownames(distances) = parksinfo$PARKID
colnames(distances) = parksinfo$PARKID
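
For intuition about the numbers in the matrix: by default distm relies on geosphere's distHaversine, that is, the great-circle distance on a sphere of radius 6,378,137 meters. Below is a minimal base-R sketch of that formula. The haversine helper and the approximate city coordinates are illustrative assumptions, not part of the original code.

```r
# Great-circle (haversine) distance in meters between two lon/lat points.
# Illustrative helper only; r matches the default radius used by
# geosphere::distHaversine.
haversine = function(lon1, lat1, lon2, lat2, r = 6378137){
  toRad = pi / 180
  dlat = (lat2 - lat1) * toRad
  dlon = (lon2 - lon1) * toRad
  a = sin(dlat / 2)^2 + cos(lat1 * toRad) * cos(lat2 * toRad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# approximate city coordinates (assumed values, for illustration)
anaheim   = c(lon = -117.91, lat = 33.84)
arlington = c(lon = -97.11,  lat = 32.74)

meters = haversine(anaheim[["lon"]], anaheim[["lat"]],
                   arlington[["lon"]], arlington[["lat"]])
miles = meters * 0.000621371
```

With these rough coordinates the result should land close to the 1,936,759 meters reported between the first two parks, with small differences due to the exact geocoded points.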

Before moving on, we transform the 31×31 distances matrix into a 961×3 data frame, having, as its columns, the starting ballpark, the arriving ballpark and the distance between the two in meters. Then we add a fourth column, featuring the distance in miles.

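# transform matrix to data frame
# (one base-R way to reshape: one row per pair of parks, row index varying fastest)
distances = as.data.frame(as.table(distances), stringsAsFactors = FALSE)
names(distances) = c("fromPARKID", "toPARKID", "meters")
# calculate distance in miles
distances$miles = distances$meters * 0.000621371
# display first lines of the distances data.frame
head(distances)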

  fromPARKID toPARKID  meters    miles
1      ANA01    ANA01       0    0.000
2      ARL02    ANA01 1936759 1203.446
3      ATL02    ANA01 3087663 1918.584
4      BAL12    ANA01 3709945 2305.252
5      BOS07    ANA01 4158206 2583.788
6      CHI11    ANA01 2791219 1734.383
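
To see the reshaping on something small enough to read at a glance, here is a sketch with a 3×3 matrix whose distances are invented for illustration. as.data.frame(as.table()) is one base-R way to go from a named matrix to one row per pair; other reshaping functions would work as well.

```r
# toy symmetric distance matrix in meters (invented values)
toy = matrix(c(   0, 1000, 2500,
               1000,    0, 1800,
               2500, 1800,    0), nrow = 3, byrow = TRUE)
rownames(toy) = c("ANA01", "ARL02", "ATL02")
colnames(toy) = c("ANA01", "ARL02", "ATL02")

# one row per (row name, column name) pair; the row index varies fastest
toylong = as.data.frame(as.table(toy), stringsAsFactors = FALSE)
names(toylong) = c("fromPARKID", "toPARKID", "meters")
toylong$miles = toylong$meters * 0.000621371

nrow(toylong)   # 9 rows, i.e. 3 x 3
```

The same logic explains why the 31×31 matrix becomes a data frame with 961 rows.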

The following code defines a function which calculates the total number of miles a team has traveled during the season, while moving from park to park.
It requires, as parameters, the code of a team, the data frame containing games information, and the data frame of the distances between parks.
Take a look at the comments inside the function to get an idea of how the final result is computed.

getTravel = function(team, gamesdata, distancematrix){
  #select games (home+away) played by team
  #and keep the game number for the team and the park identifier
  homegames = subset(gamesdata, HomeTeam==team)[,c("HomeTeamGameNumber", "ParkID")]
  awaygames = subset(gamesdata, VisitingTeam==team)[,c("VisitingTeamGameNumber", "ParkID")]
  #rename the HomeTeamGameNumber and VisitingTeamGameNumber columns to make them consistent
  names(homegames)[1] = "gameNumber"
  names(awaygames)[1] = "gameNumber"
  #combine the homegames and awaygames data frames by rows
  allgames = rbind(homegames, awaygames)
  #identify the previous game "gameNumber"
  allgames$previousGame = allgames$gameNumber - 1
  #merge the allgames data frame with itself
  #matching gameNumber with previousGame
  #to compute where the previous game was played
  allgames = merge(allgames, allgames[,c("gameNumber", "ParkID")], by.x="previousGame", by.y="gameNumber", suffixes=c("", "previous"))
  #merge the allgames data frame with the data frame containing park-to-park distances
  #to get game by game travel
  allgames = merge(allgames, distancematrix, by.x=c("ParkIDprevious", "ParkID"), by.y=c("fromPARKID", "toPARKID"))
  #compute total travel for the season
  sum(allgames$miles)
}

Let’s test the function for a couple of teams, the Seattle Mariners and the New York Yankees.

getTravel("SEA", games, distances)
[1] 49897.11

getTravel("NYA", games, distances)
[1] 29094.99
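
The self-merge at the heart of getTravel may be easier to follow on a toy schedule. The sketch below uses four invented games and made-up mileages; only the merge logic mirrors the function.

```r
# toy schedule: game number and park for one team (invented)
sched = data.frame(gameNumber = 1:4,
                   ParkID = c("SEA03", "SEA03", "OAK01", "ANA01"),
                   stringsAsFactors = FALSE)
sched$previousGame = sched$gameNumber - 1

# attach the park of the previous game;
# game 1 has no previous game, so it drops out of the merge
legs = merge(sched, sched[, c("gameNumber", "ParkID")],
             by.x = "previousGame", by.y = "gameNumber",
             suffixes = c("", "previous"))

# toy park-to-park distances in miles (made-up numbers)
toydist = data.frame(fromPARKID = c("SEA03", "SEA03", "OAK01"),
                     toPARKID   = c("SEA03", "OAK01", "ANA01"),
                     miles      = c(0, 680, 340),
                     stringsAsFactors = FALSE)

# look up the distance of each leg and add them up
legs = merge(legs, toydist,
             by.x = c("ParkIDprevious", "ParkID"),
             by.y = c("fromPARKID", "toPARKID"))
total = sum(legs$miles)   # 0 + 680 + 340 = 1020
```

Note that consecutive games in the same park contribute zero miles, and that the trip to the first game of the season is not counted, since game 1 has no previous game to match.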

Finally, let’s calculate the seasonal travel for all 30 teams and display summary statistics for it.
First we create a data frame consisting of team identifiers, then we use the sapply function to apply getTravel to each team in the data frame.

seasonTravel = data.frame(teamID = unique(games$VisitingTeam))
seasonTravel$miles = sapply(seasonTravel$teamID, function(x) getTravel(x, games, distances))

Let’s quickly summarize the miles column, featuring the seasonal travel of the 30 teams.

summary(seasonTravel$miles)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  23420   27310   29900   32100   35160   50960

And here are the 30 clubs sorted by miles traveled, from lowest (the Cincinnati Reds) to highest (the Oakland A’s, barely edging the Mariners thanks to their opening trip to Japan).

seasonTravel[order(seasonTravel$miles), ]

   teamID    miles
21    CIN 23423.67
22    DET 24712.97
10    KCA 25340.03
23    CLE 25760.27
17    MIL 26012.98
11    MIN 26222.81
19    PIT 26512.58
5     WAS 27195.23
2     SLN 27664.46
30    CHN 27703.80
3     TOR 28170.20
12    NYA 29094.99
13    CHA 29427.09
7     ATL 29592.97
8     PHI 29735.92
29    NYN 30058.90
15    COL 31294.53
26    BAL 31849.27
20    ARI 32045.36
28    HOU 33538.70
6     MIA 34210.73
4     BOS 34227.03
9     LAN 35476.72
27    SDN 36117.23
24    TEX 36548.32
18    TBA 37454.80
14    SFN 38549.61
16    ANA 44308.27
1     SEA 49897.11
25    OAK 50956.80

Note that by simply modifying the value of season at the beginning of the code, you can get travel figures for any MLB season (provided you have downloaded the relevant game log file from Retrosheet).
Here are, for example, the results for 1950, before expansion was on the horizon.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  10170   10670   10950   11260   11620   13140

Notice how the difference between the most and the least traveled teams was about 3,000 miles back then, roughly a ninth of the 27,500-mile gap between the A’s and the Reds in 2012.

Now the big question is: how much did the 50,000 miles of travel affect the playing performance of the A’s and the Mariners?

