Using ggmap and Lahman to Find the Hometown College Rosters
Hey everyone. First time poster here. I am honored to be able to contribute here at Baseball With R, and would like to thank Jim, Max, Ben, and Carson for the opportunity to contribute.
For those of you that don’t know me (probably mostly all of you), my name is Brian Mills, and I am an Assistant Professor at the University of Florida. My general research area is under the umbrella of Managerial Sports Economics; however, I have worked a lot with R over the years. I’ve had an interest in baseball since the age of 5, playing through college at the Division III level. I gained experience with R as a graduate student at Michigan, which helped my inquisitive nature blossom with baseball data. Prior to being here, I ran a series R tutorials at my old—now basically defunct—blog called sab-R-metrics. I hope that I can provide some fun posts for everyone to practice with R.
For my first post, I wanted to use the Lahman, ggplot2, and ggmap package to do a little bit of exploration into the geographic distribution of hometowns and colleges of players on MLB teams. I am going to start pretty simple here, and hopefully build on it for the next post or two.
The other contributors here have talked a good bit about Lahman across this blog, so I’m not going to give too much description here. And Max has already used the ggmap package last year to look at stadium distances. This package is used to access the Google Maps API and grab the driving, walking, or biking distance between two places on the map. It is a relatively simple function, which asks for two character strings describing the places. In fact, these character strings can be pretty flexible. You could type “Oriole Park” and “University of Alabama”, and it will return the distance between these two places. This will be helpful, since we have three different place types in my first few posts here:
- Zip Codes
- College/University Names
- Player Birth Places (City, State, Country)
Normally, I would recommend grabbing the team stadium names from Lahman, but I wanted to work with various types of inputs to show that it is a pretty flexible function, and Max already went through getting those from Lahman and using them in the context of distance measurement. Let’s start by loading in the data and getting our variables set.
You will want to make sure to also put this file in your directory and name it “TeamZips.csv” (or, alternatively, change the file name in the code provided below to whatever you want). It is a comma separated data file containing the zip codes of each MLB stadium. The code below uses merge
to integrate the “Salaries” and “SchoolsPlayers” data from the Lahman package into a single data frame. It merges each by the playerID, and informs R that we want to include all rows of “Salaries” (but not “SchoolsPlayers”). I then subset the data to only include only the 2013 season, and those players with an entry for where they played collegiate baseball.
#load in Lahman library for data on player schools library(Lahman) #merge salary-team data with player-school data SalSch <- merge(Salaries, SchoolsPlayers, by="playerID", all.x=T) #subset data to remove non-college players, and data before 2013 SSal <- subset(SalSch, SalSch$yearID==2013 & is.na(SalSch$schoolID)==F) row.names(SSal) <- 1:nrow(SSal)
Next, I merge this data set with the Schools data from Lahman so that we have full character string names of the colleges and universities, rather than just the schoolID. We’ll need the full names for use with the ggmap::mapdist
function.
#add in information on school names SSal <- merge(SSal, Schools, by="schoolID", all.x=T)
I then grab the zip code file and merge this with the “SSal” data set using the teamID variable. Again, we want all rows of the “SSal” data to be included, but not necessarily all rows of the zips data frame. Also, when loading into R, the zip codes were not read as characters, which is required for the mapdist
function. Therefore, I double check this, and tell R to read these as characters.
#get zip codes of team stadiums setwd("c:/Users/bmmillsy/Dropbox/Baseball With R") zips <- read.csv(file="TeamZips.csv", h=T) #merge stadium zip codes with player data SSal <- merge(SSal, zips, by="teamID", all.x=T) head(SSal) teamID schoolID playerID yearID lgID salary yearMin yearMax 1 ARI lsu hillaa01 2013 NL 5500000 2001 2003 2 ARI usc kenneia01 2013 NL 4265000 2004 2006 3 ARI arizonast bloomwi01 2013 NL 1900000 1997 1999 4 ARI clemson sippto01 2013 NL 1275000 2004 2004 5 ARI arkansas hinsker01 2013 NL 1350000 1996 1998 6 ARI swmost zieglbr01 2013 NL 3150000 2000 2003 schoolName schoolCity schoolState schoolNick zip 1 Louisiana State University Baton Rouge LA Tigers '85004' 2 University of Southern California Los Angeles CA Trojans '85004' 3 Arizona State University Tempe AZ Sun Devils '85004' 4 Clemson University Clemson SC Tigers '85004' 5 University of Arkansas Fayetteville AR Razorbacks '85004' 6 Missouri State University Springfield MO Bears '85004' #ensure zip codes are entered as characters for mapdist function SSal$zip <- as.character(SSal$zip) is.character(SSal$zip) [1] TRUE
Next, we’ll create a variable called distCat, which identifies unique School-Team Zip Code matches within the data. This will be used to scroll through the data frame one by one, and return the respective distance between that zip code, and the player’s college.
#create variable of zip code and college SSal$distCat <- paste(SSal$zip,"_",SSal$schoolName,sep="") head(SSal$distCat) [1] "'85004'_Louisiana State University" "'85004'_University of Southern California" [3] "'85004'_Arizona State University" "'85004'_Clemson University" [5] "'85004'_University of Arkansas" "'85004'_Missouri State University"
Now let me preface the following code with the caveat that I shouldn’t have to write a for loop to get this done. The function should do this automatically by including the character vectors of zip and schoolName. But it turns out that mapdist
does not handle missing data (or, more specifically, returning missing data) very well from my experience, and I’ve found that the function just stops running once it hits one of these. It kept failing to return any distance data any time this happened. So, I made a dirty workaround as described below. This is a great thing about using R: the ability to look inside functions and snoop around allows for finding ways around issues you’re having with your code.
First, I initialize three vectors. The first, “AllDist”, will be where the data from mapdist
is stored from each distance measurement. The other two, m and seconds, are created as missing values to place into the mapdist
output when there are missing values. For whatever reason, a missing distance returns NA values for all but meters and seconds, and therefore won’t attach to the existing data frame. Maybe Jim, Carson or Ben can help out with this one, but my workaround seems to do the trick (and we get to play with for loops).
This loop uses the distCat variable to identify unique combinations of team zip codes and its respective players’ college or university. For each unique identifier here, we take the zip and schoolName, and place them into the mapdist
function. We want driving distance (side note: mapdist
does not work for overseas addresses, as we’ll see in my next post) and this returns distance in both imperial and metric measurements, as well as driving time. If there is a missing value returned, then we know that we’ll have to build up the output so that it easily attaches to our “AllDist” data frame. This is the reason for the if
portion of the loop. Then we add the distCat variable so that we can merge this with our larger data frame. Keep in mind that Google limits you to 2,500 queries per 24 hours, so this isn’t optimal for very large data sets.
#get stadium distances from college through Google Maps API library(ggmap) #initiate object AllDist <- NULL m <- NA seconds <- NA #run for loop with mapdist for(i in unique(SSal$distCat)) { j <- SSal$zip[SSal$distCat==i][1] k <- SSal$schoolName[SSal$distCat==i][1] d <- mapdist(from=j, to=k, mode="driving", output="simple") if(is.na(d[,3])) { d <- data.frame(d[,1:2],m,d[,3:4],seconds,d[,5:6]) } AllDist <- data.frame(rbind(AllDist, d)) } AllDist$distCat <- paste(AllDist$from,"_",AllDist$to,sep="") #merge distances into SSal data SSal <- merge(SSal, AllDist, by="distCat", all.x=T) head(SSal) distCat teamID schoolID playerID yearID lgID salary yearMin yearMax 1 '02215'_Arizona State University BOS arizonast pedrodu01 2013 AL 10250000 2002 2004 2 '02215'_Auburn University BOS auburn rossda01 2013 AL 3100000 1996 1997 3 '02215'_Florida State University BOS floridast drewst01 2013 AL 9500000 2002 2004 4 '02215'_Gonzaga University BOS gonzaga mortecl01 2013 AL 504500 2006 2007 5 '02215'_McNeese State University BOS mcneese buchhcl01 2013 AL 5750000 2004 2004 6 '02215'_Oregon State University BOS oregonst ellsbja01 2013 AL 9000000 2003 2005 schoolName schoolCity schoolState schoolNick zip from to 1 Arizona State University Tempe AZ Sun Devils '02215' '02215' Arizona State University 2 Auburn University Auburn AL Tigers '02215' '02215' Auburn University 3 Florida State University Tallahassee FL Seminoles '02215' '02215' Florida State University 4 Gonzaga University Spokane WA Bulldogs '02215' '02215' Gonzaga University 5 McNeese State University Lake Charles LA Cowboys '02215' '02215' McNeese State University 6 Oregon State University Corvallis OR Beavers '02215' '02215' Oregon State University m km miles seconds minutes hours 1 4243876 4243.876 2637.145 137839 2297.317 38.28861 2 1943765 1943.765 1207.856 64630 1077.167 17.95278 3 2110217 2110.217 1311.289 69806 1163.433 19.39056 4 4450428 4450.428 2765.496 143379 2389.650 39.82750 5 2747486 2747.486 1707.288 89492 1491.533 24.85889 6 5104309 5104.309 3171.818 164139 2735.650 45.59417
Taking a look at the data now, we can see that everything seems to have merged nicely together, and for each player, we have the distance that his college is away from the stadium. Let’s now summarize the distance data by team.
#summarize average distance by team sumSchool <- data.frame(tapply(SSal$miles, SSal$teamID, mean)) sumSchool$teamID <- row.names(sumSchool) colnames(sumSchool)[1] <- "miles" #count the number of collegiate players on each roster sumSchool$numPlayers <- tapply(SSal$miles, SSal$teamID, length) row.names(sumSchool) <- 1:nrow(sumSchool) sumSchool <- subset(sumSchool, is.na(sumSchool$miles)==FALSE) #take a look at the data sumSchool
I’m likely going to get some grief from my blogging co-authors for not using dplyr. I’ll use it in the future, but there were some logistical issues with my work computer and updating R so that dplyr could be included in this post. Instead, I’m using tapply
to calculate the average player distance by team, as well as the number of players on each team that went played baseball in college. Some quick and dirty column and row name manipulation, and we should now have a new data frame called “sumSchool” with three variables: miles, teamID, and numPlayers.
Let’s just do some very simple data exploration with this summary data frame by making bar plots of the average distance for each team as well as the number of college players on each team for 2013. Below, I make sure to have the bar plot sorted from shortest to longest average distance, and from fewest to most college players, respectively. Alternatively, you could sort them both by the average distance, so that each team lines up in both bar graphs.
#create barplots of summarized data png(file="CollegeDist.png", height=800, width=1300) par(mfrow=c(2,1)) barplot(sort(sumSchool$miles), main="Average Players' College Distance", ylab="Avg. Miles from Stadium to College", ylim=c(0,2000)) grid() barplot(sort(sumSchool$numPlayers), main="College Players", ylab="Number of College Players on Roster", ylim=c(0,25)) grid() dev.off()
You should now have a png
file in your working directory. Here, we see that Tampa Bay has the longest average distance between their stadium and their collegiate players, while Atlanta has the shortest average distance. Meanwhile, San Diego and the LA Angels have the most collegiate players, while the Rockies and Tigers seem to be made up mostly of non-collegiate players.
Now I’m going to steal a little bit of code from Carson and take a look at some boxplots of the distance data by team. For this, we’ll return to the SSal data frame and load in the ggplot2 package. The boxplots should tell us a bit more about the distribution of player college distances for each team, rather than just the average distance.
#create boxplot of distances by team library(ggplot2) SSal2 <- names(sort(tapply(SSal$miles, INDEX = SSal$teamID, median))) png(file="milesBoxplot.png", h=750, w=450) qplot(data = SSal2, x = reorder(teamID, miles, FUN=median), y = miles, geom = "boxplot") + coord_flip() + ylab("Miles from College") + xlab("Team") dev.off()
We can see here that while the Angels have the lowest average distance away, there is substantial variation among their players. Colorado, on the other hand, has pretty low variation; however, we saw earlier that they only had three collegiate players on the roster.
Now, this is all pretty simple exploration—especially just looking at bar plots—but next time, we’ll combine the mapdist
function with some birthplace information for U.S. players, fit a (overly simplistic) regression model, and see if teams with players that were born closer to the stadium see higher levels of attendance, on average.
Recent Comments