Monthly Archives: November, 2016

Where Were the American MLB Players Born?

One of my fun units in my data science class is mapping. I wanted to show the relative ease of creating region maps colored with an interesting measurement for each region. This post demonstrates scraping several datasets from HTML tables, getting map data with the map_data function, merging this map data with the data frame of measurements, and constructing the map using the ggplot2 package.

A Question and Relevant Data

I am wondering about the locations of the birthplaces of the American players who played in the current MLB season. Doing a little searching, I found this page from Baseball Almanac that displays the number of players born in each state:

[Screenshot: Baseball Almanac page listing the number of players born in each state]

Scraping the Data

Using the readHTMLTable function in the XML package, I read this html page into R.

 
library(XML)
d <- readHTMLTable("http://www.baseball-almanac.com/players/birthplace.php?y=2016")
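
readHTMLTable returns a list with one element per table on the page, so it helps to look at what came back before picking one. Here is a minimal sketch of that inspection step, using a tiny made-up page parsed from a string instead of the real site (the HTML and its values are invented for illustration):

```r
library(XML)

# A tiny made-up page standing in for the real one.
html <- "<html><body><table>
  <tr><th>State</th><th>Players</th></tr>
  <tr><td>Alabama</td><td>22</td></tr>
  <tr><td>Alaska</td><td>1</td></tr>
</table></body></html>"

# Parse the string, then pull every <table> into a list of data frames.
doc <- htmlParse(html, asText = TRUE)
tabs <- readHTMLTable(doc, stringsAsFactors = FALSE)

length(tabs)        # number of tables found on the page
sapply(tabs, dim)   # rows and columns of each, to spot the one you want
```

On a real page the list usually has several elements, and the dimensions are a quick way to find the table that holds the data.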

Some Data Cleaning

Below, I pick out the list element containing the table, choose the relevant rows and columns, convert each column to character type, and make a long vector of values.

 
d1 <- d[[1]][14:30, 2:4]
for (j in 1:3)
  d1[, j] <- as.character(d1[, j])
d1 <- unlist(d1)
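
To see what unlist does to a block of columns, here is a small sketch on a toy data frame (the column names and most values are made up; only the "Name (count)" shape follows the scraped table):

```r
# Toy stand-in for the multi-column block of the scraped table.
toy <- data.frame(a = c("Alabama (22)", "Alaska (1)"),
                  b = c("Iowa (5)", "Kansas (8)"),
                  stringsAsFactors = FALSE)

# unlist() walks the data frame one column at a time, so the result
# holds all of column a, followed by all of column b.
flat <- unlist(toy)
print(unname(flat))
```

Because the flattening goes column by column, the position of any one entry in the long vector depends on how the table was laid out on the page.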

A typical value in my vector is “Alabama (22)”. I want to extract the number from each string, so I use the str_extract function in the stringr package. The completed vector of counts is stored in values.

 
library(stringr)
values <- as.numeric(str_extract(d1, '[0-9]+'))

Similarly, I want to extract the state names; all of the names will be in the vector state_names.

 
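# A-Z (not A-z) limits the class to letters; str_trim drops the space captured before "("
state_names <- str_trim(str_extract(d1, '[a-zA-Z ]+'))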
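
The two extractions can be tried on a couple of strings in isolation. This is only a sketch on illustrative values in the same "Name (count)" shape, not the scraped data itself:

```r
library(stringr)

# Illustrative strings in the same shape as the scraped values.
x <- c("Alabama (22)", "New York (15)")

# First run of digits in each string, converted to a number.
counts <- as.numeric(str_extract(x, "[0-9]+"))

# First run of letters and spaces. This also captures the space
# before "(", so trim it before using the names as join keys.
raw_names <- str_extract(x, "[a-zA-Z ]+")
clean_names <- str_trim(raw_names)

print(counts)
print(clean_names)
```

A trailing space is easy to miss when printing, but it is enough to make a later join on the state name fail silently.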

I remove the District of Columbia data from each vector.

 
values <- values[-48]
state_names <- state_names[-48]
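
Dropping an entry by position works only as long as the page layout stays the same. A possible alternative is to drop it by name, sketched here on toy vectors (the counts are made up, and the exact label the page uses for D.C. is an assumption):

```r
# Toy vectors; the label "District of Columbia" is assumed, not confirmed.
state_names <- c("Delaware", "District of Columbia", "Florida")
values <- c(3, 2, 60)

# Build one logical mask and apply it to both vectors so they stay aligned.
keep <- state_names != "District of Columbia"
state_names <- state_names[keep]
values <- values[keep]

print(state_names)
print(values)
```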

I create a data frame with two variables, State and N_Player.

 
final <- data.frame(State = state_names,
                    N_Player = values,          # already numeric
                    stringsAsFactors = FALSE)   # keep State as character for the join

Collecting State Populations

The raw number of players in a state should be standardized by, say, the state’s population, so I want to collect current state populations. I found a Wikipedia page with the data and read it into R with the htmltab function in the htmltab package.

 
library(htmltab)
p <- htmltab("https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population")

More Cleaning

I do some cleaning of this data frame: I select the rows corresponding to the 50 states, use shorter variable names, remove the commas from the population values, and convert the results to numeric type.

 
p <- p[c(1:29, 31:49, 51:52), 3:4]
names(p) <- c("State", "Population")
p$Population <- as.numeric(str_replace_all(p$Population, ",", ""))
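
The comma step matters because as.numeric cannot parse a string with thousands separators. A minimal sketch on made-up strings:

```r
library(stringr)

# Made-up population strings with thousands separators.
pop_text <- c("1,234,567", "89,012")

# Strip every comma first; as.numeric() on the raw strings
# would give NA with a warning.
pop <- as.numeric(str_replace_all(pop_text, ",", ""))
print(pop)
```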

Merging the Two Datasets

Okay, now I merge the baseball data with the population data using the inner_join function in the dplyr package.

 
library(dplyr)
final_data <- inner_join(final, p, by = "State")
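
An inner join quietly drops any row whose key has no partner, so it is worth checking what was lost. Here is a sketch with toy data frames (all numbers are made up, and one state name carries a deliberate stray space) that uses anti_join to list the unmatched rows:

```r
library(dplyr)

# Toy data; note the stray trailing space in "New York ".
players <- data.frame(State = c("Alabama", "Alaska", "New York "),
                      N_Player = c(22, 1, 15),
                      stringsAsFactors = FALSE)
pops <- data.frame(State = c("Alabama", "Alaska", "New York"),
                   Population = c(5e6, 7e5, 2e7),
                   stringsAsFactors = FALSE)

# Rows that match on State in both tables.
joined <- inner_join(players, pops, by = "State")

# Rows of players with no match in pops: the ones the join dropped.
unmatched <- anti_join(players, pops, by = "State")

print(nrow(joined))
print(unmatched)
```

If the real join returns fewer than 50 rows, the same anti_join call shows which state names need cleaning.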

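I create a density variable Pct_MLB, which divides the number of players by the corresponding state’s population and expresses the result as a percentage. I also add a lowercase copy of the state name, state, to match the lowercase region names used in the map data.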

 
final_data <- mutate(final_data, 
                     Pct_MLB = 100 * N_Player / Population,
                     state=tolower(State))
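
As a worked example of the scale involved, here is the same arithmetic on made-up round numbers. The per-million version is only a suggestion for readability, not something used in the code above:

```r
# Made-up round numbers for one hypothetical state.
n_player <- 50
population <- 10000000

# Percentage of the population, as in Pct_MLB.
pct_mlb <- 100 * n_player / population        # 5e-04 percent

# The same rate expressed as players per million residents.
per_million <- 1e6 * n_player / population    # 5

print(pct_mlb)
print(per_million)
```

The percentages are tiny, but since the map only uses them for relative coloring, the choice of unit does not change the picture.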

Merging with Mapping Data

I collect the map data for the United States using the map_data function, which is part of ggplot2 and reads the state boundaries from the maps package. Then I merge this mapping data with my measurement data using another application of inner_join.

 
library(ggplot2)  # map_data() is a ggplot2 function
library(mapdata)
states <- map_data("state")
states <- inner_join(states, final_data, by=c("region"="state"))
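
Before joining, it can help to compare the region names in the map data with your own keys. This sketch assumes the maps package is installed; the vector my_states is a hypothetical stand-in for the lowercase state names in the measurement data:

```r
library(ggplot2)  # map_data() lives here; it needs the maps package installed

# All region names available in the "state" map, already lowercase.
map_regions <- unique(map_data("state")$region)

# Hypothetical lowercase keys from the measurement data.
my_states <- c("california", "alaska", "hawaii")

# States with no polygons to draw; the inner join drops these.
missing_from_map <- setdiff(my_states, map_regions)
print(missing_from_map)
```

As far as I know, the "state" map covers only the contiguous states, so Alaska and Hawaii will not appear on the final map even though they are in the data.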

Ready to Graph

I am ready to graph using ggplot2. I use geom_polygon with Pct_MLB as the fill variable. I choose a color range of yellow (low) to red (high) so the states are clearly distinguished.

 
library(ggplot2)
ggplot(data = states) + 
  geom_polygon(aes(x = long, y = lat, 
          fill = Pct_MLB, group = group), color = "white") + 
  coord_fixed(1.3) +
  scale_fill_gradient(low = "yellow", high = "red") +
  ggtitle("Density of 2016 MLB Players by State")

[Figure: map of the states colored by Pct_MLB, titled "Density of 2016 MLB Players by State"]

As expected, we see a high density of baseball players born in California and Florida. But there are a number of other states with “larger” densities of baseball players, such as Texas, Georgia, Oklahoma, and some of the midwestern states. By adjusting for population size, I think this display is perhaps more meaningful than simply scanning a table of the number of players from each state.

If you want to try this out, all of the code above can be found as a single chunk on my gist site.