Using ggmap and Lahman to Find the Hometown College Rosters

Hey everyone. First time poster here. I am honored to be able to contribute here at Baseball With R, and would like to thank Jim, Max, Ben, and Carson for the opportunity to contribute.

For those of you that don’t know me (probably mostly all of you), my name is Brian Mills, and I am an Assistant Professor at the University of Florida. My general research area is under the umbrella of Managerial Sports Economics; however, I have worked a lot with R over the years. I’ve had an interest in baseball since the age of 5, playing through college at the Division III level.  I gained experience with R as a graduate student at Michigan, which helped my inquisitive nature blossom with baseball data. Prior to being here, I ran a series R tutorials at my old—now basically defunct—blog called sab-R-metrics.  I hope that I can provide some fun posts for everyone to practice with R.

For my first post, I wanted to use the Lahman, ggplot2, and ggmap package to do a little bit of exploration into the geographic distribution of hometowns and colleges of players on MLB teams. I am going to start pretty simple here, and hopefully build on it for the next post or two.

The other contributors here have talked a good bit about Lahman across this blog, so I’m not going to give too much description here. And Max has already used the ggmap package last year to look at stadium distances. This package is used to access the Google Maps API and grab the driving, walking, or biking distance between two places on the map. It is a relatively simple function, which asks for two character strings describing the places. In fact, these character strings can be pretty flexible. You could type “Oriole Park” and “University of Alabama”, and it will return the distance between these two places. This will be helpful, since we have three different place types in my first few posts here:

  • Zip Codes
  • College/University Names
  • Player Birth Places (City, State, Country)

Normally, I would recommend grabbing the team stadium names from Lahman, but I wanted to work with various types of inputs to show that it is a pretty flexible function, and Max already went through getting those from Lahman and using them in the context of distance measurement. Let’s start by loading in the data and getting our variables set.

You will want to make sure to also put this file in your directory and name it “TeamZips.csv” (or, alternatively, change the file name in the code provided below to whatever you want). It is a comma separated data file containing the zip codes of each MLB stadium. The code below uses merge to integrate the “Salaries” and “SchoolsPlayers” data from the Lahman package into a single data frame. It merges each by the playerID, and informs R that we want to include all rows of “Salaries” (but not “SchoolsPlayers”). I then subset the data to only include only the 2013 season, and those players with an entry for where they played collegiate baseball.

#load in Lahman library for data on player schools
library(Lahman)

#merge salary-team data with player-school data
SalSch <- merge(Salaries, SchoolsPlayers, by="playerID", all.x=T)

#subset data to remove non-college players, and data before 2013
SSal <- subset(SalSch, SalSch$yearID==2013 & is.na(SalSch$schoolID)==F)
row.names(SSal) <- 1:nrow(SSal)

Next, I merge this data set with the Schools data from Lahman so that we have full character string names of the colleges and universities, rather than just the schoolID. We’ll need the full names for use with the ggmap::mapdist function.

#add in information on school names
SSal <- merge(SSal, Schools, by="schoolID", all.x=T)

I then grab the zip code file and merge this with the “SSal” data set using the teamID variable. Again, we want all rows of the “SSal” data to be included, but not necessarily all rows of the zips data frame. Also, when loading into R, the zip codes were not read as characters, which is required for the mapdist function. Therefore, I double check this, and tell R to read these as characters.

#get zip codes of team stadiums
setwd("c:/Users/bmmillsy/Dropbox/Baseball With R")
zips <- read.csv(file="TeamZips.csv", h=T)

#merge stadium zip codes with player data
SSal <- merge(SSal, zips, by="teamID", all.x=T)
head(SSal)

  teamID  schoolID  playerID yearID lgID  salary yearMin yearMax
1    ARI       lsu  hillaa01   2013   NL 5500000    2001    2003
2    ARI       usc kenneia01   2013   NL 4265000    2004    2006
3    ARI arizonast bloomwi01   2013   NL 1900000    1997    1999
4    ARI   clemson  sippto01   2013   NL 1275000    2004    2004
5    ARI  arkansas hinsker01   2013   NL 1350000    1996    1998
6    ARI    swmost zieglbr01   2013   NL 3150000    2000    2003
                         schoolName   schoolCity schoolState schoolNick     zip
1        Louisiana State University  Baton Rouge          LA     Tigers '85004'
2 University of Southern California  Los Angeles          CA    Trojans '85004'
3          Arizona State University        Tempe          AZ Sun Devils '85004'
4                Clemson University      Clemson          SC     Tigers '85004'
5            University of Arkansas Fayetteville          AR Razorbacks '85004'
6         Missouri State University  Springfield          MO      Bears '85004'

#ensure zip codes are entered as characters for mapdist function
SSal$zip <- as.character(SSal$zip)
is.character(SSal$zip)

[1] TRUE

Next, we’ll create a variable called distCat, which identifies unique School-Team Zip Code matches within the data. This will be used to scroll through the data frame one by one, and return the respective distance between that zip code, and the player’s college.

#create variable of zip code and college
SSal$distCat <- paste(SSal$zip,"_",SSal$schoolName,sep="")
head(SSal$distCat)

[1] "'85004'_Louisiana State University"        "'85004'_University of Southern California"
[3] "'85004'_Arizona State University"          "'85004'_Clemson University&quot;               
[5] "'85004'_University of Arkansas"            "'85004'_Missouri State University" 

Now let me preface the following code with the caveat that I shouldn’t have to write a for loop to get this done. The function should do this automatically by including the character vectors of zip and schoolName. But it turns out that mapdist does not handle missing data (or, more specifically, returning missing data) very well from my experience, and I’ve found that the function just stops running once it hits one of these. It kept failing to return any distance data any time this happened. So, I made a dirty workaround as described below. This is a great thing about using R: the ability to look inside functions and snoop around allows for finding ways around issues you’re having with your code.

First, I initialize three vectors. The first, “AllDist”, will be where the data from mapdist is stored from each distance measurement. The other two, m and seconds, are created as missing values to place into the mapdist output when there are missing values. For whatever reason, a missing distance returns NA values for all but meters and seconds, and therefore won’t attach to the existing data frame. Maybe Jim, Carson or Ben can help out with this one, but my workaround seems to do the trick (and we get to play with for loops).

This loop uses the distCat variable to identify unique combinations of team zip codes and its respective players’ college or university. For each unique identifier here, we take the zip and schoolName, and place them into the mapdist function. We want driving distance (side note: mapdist does not work for overseas addresses, as we’ll see in my next post) and this returns distance in both imperial and metric measurements, as well as driving time. If there is a missing value returned, then we know that we’ll have to build up the output so that it easily attaches to our “AllDist” data frame. This is the reason for the if portion of the loop. Then we add the distCat variable so that we can merge this with our larger data frame. Keep in mind that Google limits you to 2,500 queries per 24 hours, so this isn’t optimal for very large data sets.

#get stadium distances from college through Google Maps API
library(ggmap)

#initiate object
AllDist <- NULL
m <- NA
seconds <- NA

#run for loop with mapdist
for(i in unique(SSal$distCat)) {
     j <- SSal$zip[SSal$distCat==i][1]
     k <- SSal$schoolName[SSal$distCat==i][1]
     d <- mapdist(from=j, to=k, mode="driving", output="simple")
          if(is.na(d[,3])) {
               d <- data.frame(d[,1:2],m,d[,3:4],seconds,d[,5:6])
               }
          AllDist <- data.frame(rbind(AllDist, d))
     }
AllDist$distCat <- paste(AllDist$from,"_",AllDist$to,sep="")

#merge distances into SSal data
SSal <- merge(SSal, AllDist, by="distCat", all.x=T)
head(SSal)

                           distCat teamID  schoolID  playerID yearID lgID   salary yearMin yearMax
1 '02215'_Arizona State University    BOS arizonast pedrodu01   2013   AL 10250000    2002    2004
2        '02215'_Auburn University    BOS    auburn  rossda01   2013   AL  3100000    1996    1997
3 '02215'_Florida State University    BOS floridast  drewst01   2013   AL  9500000    2002    2004
4       '02215'_Gonzaga University    BOS   gonzaga mortecl01   2013   AL   504500    2006    2007
5 '02215'_McNeese State University    BOS   mcneese buchhcl01   2013   AL  5750000    2004    2004
6  '02215'_Oregon State University    BOS  oregonst ellsbja01   2013   AL  9000000    2003    2005
                schoolName   schoolCity schoolState schoolNick     zip    from                       to
1 Arizona State University        Tempe          AZ Sun Devils '02215' '02215' Arizona State University
2        Auburn University       Auburn          AL     Tigers '02215' '02215'        Auburn University
3 Florida State University  Tallahassee          FL  Seminoles '02215' '02215' Florida State University
4       Gonzaga University      Spokane          WA   Bulldogs '02215' '02215'       Gonzaga University
5 McNeese State University Lake Charles          LA    Cowboys '02215' '02215' McNeese State University
6  Oregon State University    Corvallis          OR    Beavers '02215' '02215'  Oregon State University
        m       km    miles seconds  minutes    hours
1 4243876 4243.876 2637.145  137839 2297.317 38.28861
2 1943765 1943.765 1207.856   64630 1077.167 17.95278
3 2110217 2110.217 1311.289   69806 1163.433 19.39056
4 4450428 4450.428 2765.496  143379 2389.650 39.82750
5 2747486 2747.486 1707.288   89492 1491.533 24.85889
6 5104309 5104.309 3171.818  164139 2735.650 45.59417

Taking a look at the data now, we can see that everything seems to have merged nicely together, and for each player, we have the distance that his college is away from the stadium. Let’s now summarize the distance data by team.

#summarize average distance by team
sumSchool <- data.frame(tapply(SSal$miles, SSal$teamID, mean))
sumSchool$teamID <- row.names(sumSchool)
colnames(sumSchool)[1] <- "miles"

#count the number of collegiate players on each roster
sumSchool$numPlayers <- tapply(SSal$miles, SSal$teamID, length)
row.names(sumSchool) <- 1:nrow(sumSchool)
sumSchool <- subset(sumSchool, is.na(sumSchool$miles)==FALSE)

#take a look at the data
sumSchool

I’m likely going to get some grief from my blogging co-authors for not using dplyr. I’ll use it in the future, but there were some logistical issues with my work computer and updating R so that dplyr could be included in this post. Instead, I’m using tapply to calculate the average player distance by team, as well as the number of players on each team that went played baseball in college. Some quick and dirty column and row name manipulation, and we should now have a new data frame called “sumSchool” with three variables: miles, teamID, and numPlayers.

Let’s just do some very simple data exploration with this summary data frame by making bar plots of the average distance for each team as well as the number of college players on each team for 2013. Below, I make sure to have the bar plot sorted from shortest to longest average distance, and from fewest to most college players, respectively. Alternatively, you could sort them both by the average distance, so that each team lines up in both bar graphs.

#create barplots of summarized data
png(file="CollegeDist.png", height=800, width=1300)
par(mfrow=c(2,1))
barplot(sort(sumSchool$miles),
     main="Average Players' College Distance",
     ylab="Avg. Miles from Stadium to College",
     ylim=c(0,2000))
grid()

barplot(sort(sumSchool$numPlayers),
     main="College Players",
     ylab="Number of College Players on Roster",
     ylim=c(0,25))
grid()
dev.off()

CollegeDist

You should now have a png file in your working directory. Here, we see that Tampa Bay has the longest average distance between their stadium and their collegiate players, while Atlanta has the shortest average distance. Meanwhile, San Diego and the LA Angels have the most collegiate players, while the Rockies and Tigers seem to be made up mostly of non-collegiate players.

Now I’m going to steal a little bit of code from Carson and take a look at some boxplots of the distance data by team. For this, we’ll return to the SSal data frame and load in the ggplot2 package. The boxplots should tell us a bit more about the distribution of player college distances for each team, rather than just the average distance.

#create boxplot of distances by team
library(ggplot2)
SSal2 <- names(sort(tapply(SSal$miles, INDEX = SSal$teamID, median)))

png(file="milesBoxplot.png", h=750, w=450)
qplot(data = SSal2, x = reorder(teamID, miles, FUN=median), y = miles, geom = "boxplot") + coord_flip() + ylab("Miles from College") + xlab("Team")
dev.off()

milesBoxplot

We can see here that while the Angels have the lowest average distance away, there is substantial variation among their players. Colorado, on the other hand, has pretty low variation; however, we saw earlier that they only had three collegiate players on the roster.

Now, this is all pretty simple exploration—especially just looking at bar plots—but next time, we’ll combine the mapdist function with some birthplace information for U.S. players, fit a (overly simplistic) regression model, and see if teams with players that were born closer to the stadium see higher levels of attendance, on average.

Leave a comment