Team Attendance and Nearby Hometown Players

In my first post here at Baseball With R, we took a look at the distance between players’ MLB teams and where they played in college. Today, I’m going to continue on with this example, and integrate a little more data from the Lahman package: attendance.

There’s usually a lot of chatter about people coming out to see the hometown boy. Joe Mauer and the Twins. Jason Heyward and the Braves. Jered Weaver and the Angels. Cal Ripken and the Orioles. Ok, so maybe he’s a bit outside our data sample. But you get the idea. There’s some rhetoric stating that teams should go after a prospect that is a hometown guy. But are the players playing closer to home really attracting more fans? Let’s take an overly simplistic look at the relationship.

For the purposes of this post, we’ll again use the team zip code file from last time, and look at U.S. based players. To mix things up, we’ll be looking at birthplace instead of college this time around as well. We should start by loading in our necessary libraries for today. Note that I’m introducing a new library, which I’ll explain a little later.

##load in libraries
library(Lahman)
library(ggmap)
library(visreg)

##set working directory and load in zip code
setwd("c:/Documents/Baseball With R")
zips <- read.csv(file="TeamZips.csv", h=T)

And we’ll again deal with data from the Salaries data, merged with the Master data in the Lahman package, using data from 2008 in the American League only (and excluding Houston, since they switched during the time of our data). We subsequently merge this with our zip code data and make sure that the zip codes are characters for the mapdist function.

###now use the Master file instead of schools to get birthCity
SalSub <- subset(Salaries, Salaries$yearID > 2008 & Salaries$lgID=="AL" & Salaries$teamID!="HOU")
SalCity <- merge(SalSub, Master, by="playerID", all.x=T)

##ensure zip codes are entered as characters for mapdist function
SalCity <- merge(SalCity, zips, by="teamID", all.x=T)
SalCity$zip <- as.character(SalCity$zip)
is.character(SalCity$zip)
[1] TRUE

Next, we create a full home address for each American born player that has this information entered in the data set. From there, we create the distCat variable to identify unique Birthplace-Stadium matches in the data:

##create variable for full address
SalCity$birthAdd <- ifelse(is.na(SalCity$birthState)==FALSE, 
            paste(SalCity$birthCity,", ",SalCity$birthState,", ",SalCity$birthCountry,sep=""),
            paste(SalCity$birthCity,", ",SalCity$birthCountry,sep=""))

##create variable of zip code and college
SalCity$distCat <- paste(SalCity$zip,"_",SalCity$birthAdd,sep="")

And we can take a look at the data to make sure everything looks OK:

head(SalCity)

  teamID  playerID yearID lgID  salary birthYear birthMonth birthDay birthCountry birthState    birthCity
1    BAL johnsji04   2010   AL  440000      1983          6       27          USA         NY Johnson City
2    BAL walkeja01   2009   AL 4500000      1971          7        1          USA         TN  McMinnville
3    BAL wiggity01   2010   AL 3500000      1977         10       11          USA         CA    San Diego
4    BAL bergebr02   2010   AL  405000      1985          9       25          USA         CA    Fairfield
5    BAL wietema01   2010   AL  400000      1986          5       21          USA         SC  Goose Creek
6    BAL hendrma01   2010   AL 1200000      1974          6       23          USA         WA Mount Vernon
  deathYear deathMonth deathDay deathCountry deathState deathCity nameFirst    nameLast       nameGiven weight
1        NA         NA       NA         <NA>       <NA>      <NA>       Jim     Johnson    James Robert    240
2        NA         NA       NA         <NA>       <NA>      <NA>     Jamie      Walker      James Ross    190
3        NA         NA       NA         <NA>       <NA>      <NA>        Ty   Wigginton        Ty Allen    225
4        NA         NA       NA         <NA>       <NA>      <NA>      Brad    Bergesen  Bradley Steven    210
5        NA         NA       NA         <NA>       <NA>      <NA>      Matt     Wieters Matthew Richard    240
6        NA         NA       NA         <NA>       <NA>      <NA>      Mark Hendrickson      Mark Allan    240
  height bats throws      debut  finalGame  retroID   bbrefID deathDate  birthDate     zip              birthAdd
1     78    R      R 2006-07-29 2013-09-29 johnj010 johnsji04      <NA> 1983-06-27 '21201' Johnson City, NY, USA
2     74    L      L 1997-04-02 2009-06-02 walkj001 walkeja01      <NA> 1971-07-01 '21201'  McMinnville, TN, USA
3     72    R      R 2002-05-16 2013-07-05 wiggt001 wiggity01      <NA> 1977-10-11 '21201'    San Diego, CA, USA
4     74    L      R 2009-04-21 2012-10-01 bergb002 bergebr02      <NA> 1985-09-25 '21201'    Fairfield, CA, USA
5     77    B      R 2009-05-29 2013-09-28 wietm001 wietema01      <NA> 1986-05-21 '21201'  Goose Creek, SC, USA
6     81    L      L 2002-08-06 2011-08-28 hendm001 hendrma01      <NA> 1974-06-23 '21201' Mount Vernon, WA, USA
                        distCat
1 '21201'_Johnson City, NY, USA
2  '21201'_McMinnville, TN, USA
3    '21201'_San Diego, CA, USA
4    '21201'_Fairfield, CA, USA
5  '21201'_Goose Creek, SC, USA
6 '21201'_Mount Vernon, WA, USA

As I did last time, I’m going to use a for loop in order to grab the distance data using mapdist. This code is identical to my last post, but this time we measure the distance between full City-State addresses and the zip codes for the stadium that those players now call home.

##get stadium distances from birthplaces through Google Maps API
AllDist <- NULL
m <- NA
seconds <- NA
for(i in unique(SalCity$distCat)) {
            j <- SalCity$zip[SalCity$distCat==i][1]
            k <- SalCity$birthAdd[SalCity$distCat==i][1]
            d <- mapdist(from=j, to=k, mode="driving", output="simple")
            if(is.na(d[,3])) {
                        d <- data.frame(d[,1:2],m,d[,3:4],seconds,d[,5:6])
                        }
            AllDist <- data.frame(rbind(AllDist, d))
            }
AllDist$distCat <- paste(AllDist$from,"_",AllDist$to,sep="")

##merge distances into SalCity data—excludes those crossing bodies of water
SalCity <- merge(SalCity, AllDist, by="distCat", all.x=T)
SalCity$tmYrID <- paste(SalCity$teamID,"_",SalCity$yearID,sep="")
head(SalCity)

                      distCat teamID  playerID yearID lgID   salary birthYear birthMonth birthDay birthCountry
1    '02215'_Abilene, TX, USA    BOS lackejo01   2010   AL 18700000      1978         10       23          USA
2    '02215'_Abilene, TX, USA    BOS lackejo01   2012   AL 15950000      1978         10       23          USA
3    '02215'_Abilene, TX, USA    BOS lackejo01   2013   AL 15950000      1978         10       23          USA
4    '02215'_Abilene, TX, USA    BOS lackejo01   2011   AL 15950000      1978         10       23          USA
5    '02215'_Atlanta, GA, USA    BOS hermije01   2010   AL  3345000      1984          1       30          USA
6 '02215'_Bainbridge, GA, USA    BOS  rossda01   2013   AL  3100000      1977          3       19          USA
  birthState  birthCity deathYear deathMonth deathDay deathCountry deathState deathCity nameFirst nameLast
1         TX    Abilene        NA         NA       NA         <NA>       <NA>      <NA>      John   Lackey
2         TX    Abilene        NA         NA       NA         <NA>       <NA>      <NA>      John   Lackey
3         TX    Abilene        NA         NA       NA         <NA>       <NA>      <NA>      John   Lackey
4         TX    Abilene        NA         NA       NA         <NA>       <NA>      <NA>      John   Lackey
5         GA    Atlanta        NA         NA       NA         <NA>       <NA>      <NA>    Jeremy  Hermida
6         GA Bainbridge        NA         NA       NA         <NA>       <NA>      <NA>     David     Ross
    nameGiven weight height bats throws      debut  finalGame  retroID   bbrefID deathDate  birthDate     zip
1 John Derran    235     78    R      R 2002-06-24 2013-09-24 lackj001 lackejo01      <NA> 1978-10-23 '02215'
2 John Derran    235     78    R      R 2002-06-24 2013-09-24 lackj001 lackejo01      <NA> 1978-10-23 '02215'
3 John Derran    235     78    R      R 2002-06-24 2013-09-24 lackj001 lackejo01      <NA> 1978-10-23 '02215'
4 John Derran    235     78    R      R 2002-06-24 2013-09-24 lackj001 lackejo01      <NA> 1978-10-23 '02215'
5 Jeremy Ryan    220     75    L      R 2005-08-31 2012-04-26 hermj001 hermije01      <NA> 1984-01-30 '02215'
6  David Wade    230     74    R      R 2002-06-29 2013-09-28 rossd001  rossda01      <NA> 1977-03-19 '02215'
             birthAdd    from                  to       m       km    miles seconds  minutes    hours   tmYrID
1    Abilene, TX, USA '02215'    Abilene, TX, USA 3133530 3133.530 1947.176  102473 1707.883 28.46472 BOS_2010
2    Abilene, TX, USA '02215'    Abilene, TX, USA 3133530 3133.530 1947.176  102473 1707.883 28.46472 BOS_2012
3    Abilene, TX, USA '02215'    Abilene, TX, USA 3133530 3133.530 1947.176  102473 1707.883 28.46472 BOS_2013
4    Abilene, TX, USA '02215'    Abilene, TX, USA 3133530 3133.530 1947.176  102473 1707.883 28.46472 BOS_2011
5    Atlanta, GA, USA '02215'    Atlanta, GA, USA 1770484 1770.484 1100.179   59022  983.700 16.39500 BOS_2010
6 Bainbridge, GA, USA '02215' Bainbridge, GA, USA 2018130 2018.130 1254.066   70463 1174.383 19.57306 BOS_2013

Now that the data is all together, let’s take a look at some summaries of our distances by each team in each year:

##summarize average distance by team
sumDist <- data.frame(tapply(SalCity$miles, SalCity$tmYrID, mean, na.rm=T))
sumDist$tmYrID <- row.names(sumDist)
colnames(sumDist)[1] <- "miles"
sumDist$numPlayers <- tapply(SalCity$miles, SalCity$tmYrID, length)
row.names(sumDist) <- 1:nrow(sumDist)
sumDist <- subset(sumDist, is.na(sumDist$miles)==FALSE)
head(sumDist)

     miles   tmYrID numPlayers
1 1252.589 BAL_2009         26
2 1456.551 BAL_2010         26
3 1473.730 BAL_2011         26
4 1328.243 BAL_2012         28
5 1132.858 BAL_2013         26
6 1735.820 BOS_2009         29

And add in attendance and Win-Loss records for each, merging the two together to run a regression model.

##add in other team information from 2010 through 2014
teamDat <- subset(Teams, Teams$yearID > 2008 & Teams$lgID=="AL" & Teams$teamID!="HOU")
teamDat$tmYrID <- paste(teamDat$teamID,"_",teamDat$yearID,sep="")
head(teamDat)

     yearID lgID teamID franchID divID Rank   G Ghome   W  L DivWin WCWin LgWin WSWin   R   AB    H X2B X3B  HR  BB   SO
2596   2009   AL    MIN      MIN     C    1 163    82  87 76      Y     N     N     N 817 5608 1539 271  40 172 585 1021
2597   2009   AL    DET      DET     C    2 163    81  86 77      N     N     N     N 743 5540 1443 245  35 183 540 1114
2598   2009   AL    CHA      CHW     C    3 162    81  79 83      N     N     N     N 724 5463 1410 246  20 184 534 1022
2599   2009   AL    CLE      CLE     C    4 162    81  65 97      N     N     N     N 773 5568 1468 314  28 161 582 1211
2600   2009   AL    KCA      KCR     C    4 162    81  65 97      N     N     N     N 686 5532 1432 276  51 144 457 1091
2601   2009   AL    NYA      NYY     E    1 162    81 103 59      Y     N     Y     Y 915 5660 1604 325  21 244 663 1014
      SB CS HBP SF  RA  ER  ERA CG SHO SV IPouts   HA HRA BBA  SOA   E  DP   FP               name
2596  85 32  45 57 765 726 4.50  4   7 48   4359 1542 185 466 1052  89 133 0.98    Minnesota Twins
2597  72 33  61 39 745 690 4.34  4   9 42   4341 1449 182 594 1102 106 164 0.98     Detroit Tigers
2598 113 49  62 39 732 663 4.16  4  11 36   4319 1438 169 507 1119 128 156 0.98  Chicago White Sox
2599  84 31  81 50 865 806 5.07  5   6 25   4302 1570 183 598  986 114 169 0.98  Cleveland Indians
2600  88 29  42 32 842 765 4.83 10   9 34   4278 1486 166 600 1153 127 159 0.98 Kansas City Royals
2601 111 28  54 39 753 687 4.28  3   8 51   4350 1386 181 574 1260 101 131 0.98   New York Yankees
                            park attendance BPF PPF teamIDBR teamIDlahman45 teamIDretro   tmYrID
2596 Hubert H Humphrey Metrodome    2416237  99  98      MIN            MIN         MIN MIN_2009
2597               Comerica Park    2567165 101 102      DET            DET         DET DET_2009
2598         U.S. Cellular Field    2284163 105 105      CHW            CHA         CHA CHA_2009
2599                Jacobs Field    1766242  95  95      CLE            CLE         CLE CLE_2009
2600            Kauffman Stadium    1797891  97  99      KCR            KCA         KCA KCA_2009
2601          Yankee Stadium III    3719358 105 103      NYY            NYA         NYA NYA_2009


sumTeam <- data.frame(tapply(teamDat$W, teamDat$tmYrID, sum), 
            tapply(teamDat$L, teamDat$tmYrID, sum), 
            tapply(teamDat$attendance, teamDat$tmYrID, sum))
sumTeam$teamID <- row.names(sumTeam)
colnames(sumTeam) <- c("W", "L", "Att", "tmYrID")
sumTeam <- subset(sumTeam, is.na(sumTeam$W)==F)
row.names(sumTeam) <- 1:nrow(sumTeam)
head(sumTeam)

   W  L     Att   tmYrID
1 64 98 1907163 BAL_2009
2 66 96 1733018 BAL_2010
3 69 93 1755461 BAL_2011
4 93 69 2102240 BAL_2012
5 85 77 2357561 BAL_2013
6 95 67 3062699 BOS_2009


teams <- merge(sumTeam, sumDist, by="tmYrID", all=T)
teams$teamID <- substr(teams$tmYrID, start=1, stop=3)
teams$yearID <- substr(teams$tmYrID, start=5, stop=8)
teams$AttGame <- teams$Att/81
head(teams)

    tmYrID  W  L     Att    miles numPlayers teamID yearID  AttGame
1 BAL_2009 64 98 1907163 1252.589         26    BAL   2009 23545.22
2 BAL_2010 66 96 1733018 1456.551         26    BAL   2010 21395.28
3 BAL_2011 69 93 1755461 1473.730         26    BAL   2011 21672.36
4 BAL_2012 93 69 2102240 1328.243         28    BAL   2012 25953.58
5 BAL_2013 85 77 2357561 1132.858         26    BAL   2013 29105.69
6 BOS_2009 95 67 3062699 1735.820         29    BOS   2009 37811.10

You should be able to see a row for each team in each year, with the respective team information as the variable columns. This is in a nice structure to use the lm function, with Attendance Per Game as the dependent variable, and Wins and miles as our predictors of attendance. Additionally, I include team level dummy variables to control for unobserved factors within each market. This is a simplistic way to include individual intercept terms for team attendance, since we know that the Yankees are always going to have higher attendance than the Rays just based on their market. (Note: there are some important assumptions I am ignoring for this analysis for the sake of showing the functionality of R—exploratory analysis and checking assumptions in the data should always be performed).

One of the nice things about R is that it automatically knows that you want individual (0,1) dummy variables for each of the levels of a factor in your variables. It even leaves out a level to ensure we don’t fall into the dummy variable trap. Note that when fitting models in R, you generally want to ensure you save the model as an object. Otherwise, returning the output will require re-running the model. This isn’t a big deal for what we’re doing today, but it’s essential when estimating models like Carson did last week.

##fit regression of attendance on wins and miles with team dummy fixed effects
fitAtt <- lm(AttGame ~ W + miles + factor(teamID), data=teams)
summary(fitAtt)

Call:
lm(formula = AttGame ~ W + miles + factor(teamID), data = teams)

Residuals:
    Min      1Q  Median      3Q     Max 
-7518.8 -1856.8    20.8  1674.9  6975.4 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       28290.277   5339.330   5.298 2.21e-06 ***
W                    57.335     46.310   1.238 0.221040    
miles                -6.230      3.083  -2.021 0.048274 *  
factor(teamID)BOS 15263.899   2569.903   5.939 2.13e-07 ***
factor(teamID)CHA  1289.917   2138.997   0.603 0.549002    
factor(teamID)CLE -4066.934   1983.864  -2.050 0.045232 *  
factor(teamID)DET  8753.694   2079.895   4.209 9.77e-05 ***
factor(teamID)KCA -4179.994   2091.697  -1.998 0.050722 .  
factor(teamID)LAA 14843.023   2084.774   7.120 2.63e-09 ***
factor(teamID)MIN  8890.144   2119.000   4.195 0.000102 ***
factor(teamID)NYA 20258.870   2261.144   8.960 2.88e-12 ***
factor(teamID)OAK -3129.982   2342.939  -1.336 0.187178    
factor(teamID)SEA  2964.032   2531.427   1.171 0.246783    
factor(teamID)TBA -4508.083   2103.983  -2.143 0.036667 *  
factor(teamID)TEX  7407.394   2488.599   2.977 0.004356 ** 
factor(teamID)TOR   981.022   2029.312   0.483 0.630748    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3136 on 54 degrees of freedom
Multiple R-squared:  0.8921,    Adjusted R-squared:  0.8621 
F-statistic: 29.77 on 15 and 54 DF,  p-value: < 2.2e-16

Here, you can see we have the coefficient estimate for wins, miles, and each individual team dummy fixed effect as compared to the Anaheim Angels (R uses the first level in alphabetical order as the left-out level). You also have a standard error, and t-value and p-value for each coefficient in the model. Conveniently, R even stars your statistically significant predictor variables.

The variables we’re most interested in are wins and miles. We see that wins aren’t actually a significant predictor of attendance in our model. This is likely because the dummy variables for the teams account for a lot of the attendance variation over the short time period here.

But we do see a statistically significant estimate for miles. Note that it is negative, meaning the higher the average miles away that the team’s U.S.-born players live, the lower the attendance for that team. For every mile larger the average distance is, we expect to see about 6 fewer people at each game for that team.

Sometimes it’s not so easy to visualize coefficient estimates when controlling for other factors. One way to visualize the relationship between attendance and average birthplace distance—while holding constant individual team effects and team wins—is to use the partial residuals from the model. In fact, the package visreg that we loaded up earlier will do this for us with a few simple lines of code. This package plots the partial residuals along with a regression line and the respective confidence interval for the estimate directly from the model fit we named fitAtt. Below, I also save the file as a .png to ensure it’s in my working directory for future viewing.

##show slope of coefficient estimate with confidence band
png(file="AttDistSlope.png", height=550, width=700)
visreg(fitAtt, "miles", type="conditional",
            fill=list(col="#00990033"), 
            point=list(cex=1.8, col="#00009999"), 
            line=list(col=c("darkred", "blue"), lwd=3),
            overlay=T, 
            main="Attendance Increase by Average Distance to U.S. Mainland Players' Birthplace", xlab="Distance", ylab="Attendance per Game")
grid()
dev.off()

AttDistSlope

You can see here the partial residuals plotted in blue, with the regression line in red, and the confidence interval in green. This function works nicely when using interactions in your regressions. For dummy/factor variables, it allows separate panels to see the slope across each factor. For continuous variable interactions, it allows the plotting of multiple slopes on the same plot for various levels of that interaction. Those require a bit more work and some data transformation, but can be extremely helpful in interpreting regression output in R. I’ll hopefully get into visualizing interactions in some more useful regressions in the future, possibly even with Generalized Additive Model output that Carson presented last week (supposedly supported by visreg).

I actually expected to get a quick and easy null result here with respect to player birthplace distance and include it as part of my last post. I don’t actually believe we’re finding a real (and almost certainly not causal) effect here. So what do I think is going on? I suspect there are plenty of things that have gone unaccounted for in this simplistic model, and it could be a product of the dummy effects in the regression. But I mainly wanted to use this post to show a bit more about the Lahman package and mapdist function, while introducing visreg .

In any case, hopefully these two first posts of mine have shown the power of R in grabbing and merging together data sets, as well as its flexibility when working with certain functions like mapdist , and visualizing relationships to help along interpretability of models.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: