Where Were the American MLB Players Born?
One of the fun units in my data science class is mapping. I wanted to show how easy it is to create region maps in which each region is colored by an interesting measurement. This post demonstrates scraping several datasets from html tables, getting map data using the mapdata package, merging this map data with the data frame of measurements, and constructing the map using the ggplot2 package.
A Question and Relevant Data
I am wondering about the locations of the birthplaces of the American players who played in the current MLB season. Doing a little searching, I found a page from Baseball Almanac that displays the number of players born in each state.
Scraping the Data
Using the readHTMLTable function in the XML package, I read this html page into R.
library(XML)
# readHTMLTable returns a list with one element per html table on the page
d <- readHTMLTable("http://www.baseball-almanac.com/players/birthplace.php?y=2016")
Some Data Cleaning
Below, I pick out the list element containing the table, choose the relevant rows and columns, convert each column to character type, and make a long vector of values.
d1 <- d[][14:30, 2:4]
# convert each of the three columns to character
for (j in 1:3) d1[, j] <- as.character(d1[, j])
# flatten the three columns into one long vector
d1 <- unlist(d1)
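Finding the list element that holds the table usually takes a little poking around. Here is a minimal sketch of one way to do that, using a toy list in place of the scraped result; the element names and sizes are hypothetical, and the real page may be laid out differently.

```r
# Toy stand-in for the list returned by readHTMLTable();
# the element names and sizes here are hypothetical.
tables <- list(
  nav    = NULL,
  layout = data.frame(V1 = c("a", "b")),
  births = data.frame(V1 = 1:40, V2 = 1:40, V3 = 1:40, V4 = 1:40)
)

# Rows and columns of each element (NA for elements that are not tables)
dims <- t(sapply(tables, function(x) {
  if (is.null(x)) c(rows = NA, cols = NA) else c(rows = nrow(x), cols = ncol(x))
}))
print(dims)

# The element with the most rows is a reasonable first candidate
biggest <- names(which.max(dims[, "rows"]))
biggest
```

Once a candidate is identified, head() on that element shows which rows and columns are worth keeping.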
A typical value in my vector is “Alabama (22)”. I want to extract the number from each string in my vector, which I do with the str_extract function in the stringr package. The completed list of counts is stored in values.
library(stringr)
# pull the first run of digits out of each string
values <- as.numeric(str_extract(d1, '[0-9]+'))
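To see what the pattern does, here is a small sketch on a few strings in the same "State (count)" shape. Only the Alabama value comes from the page; the other two are made up for illustration.

```r
library(stringr)

# Toy strings in the "State (count)" shape; the last two counts are made up
cells <- c("Alabama (22)", "New York (31)", "Wyoming (1)")

# str_extract returns the first match of the pattern in each string
counts <- as.numeric(str_extract(cells, "[0-9]+"))
counts
```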
Similarly, I want to extract the state names; all of the names will be stored in the vector state_names.
# [a-zA-Z ] keeps letters and spaces; str_trim drops the space left before "("
state_names <- str_trim(str_extract(d1, '[a-zA-Z ]+'))
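One detail worth checking here: a pattern that allows spaces will also pick up the space sitting in front of the opening parenthesis, and a name with a trailing space will not match the same name in another table. A quick sketch on toy strings (the New York count is made up):

```r
library(stringr)

cells <- c("Alabama (22)", "New York (31)")

# Letters and spaces only, so the match stops at "("
raw_names <- str_extract(cells, "[a-zA-Z ]+")
nchar(raw_names)    # one more than the visible name because of the trailing space

# str_trim removes leading and trailing whitespace but keeps interior spaces
clean_names <- str_trim(raw_names)
clean_names
```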
I remove the District of Columbia data from each vector.
values <- values[-48]
state_names <- state_names[-48]
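Dropping an entry by position works as long as the page layout does not change. As a hedged alternative, the same entry can be dropped by name; the vectors below are toy stand-ins with made-up counts, not the scraped data.

```r
# Toy stand-ins for the two vectors; the counts are made up
state_names_toy <- c("Delaware", "District of Columbia", "Florida")
values_toy      <- c(3, 2, 80)

# Logical index that is FALSE only for the entry to drop
keep <- state_names_toy != "District of Columbia"

state_names_toy <- state_names_toy[keep]
values_toy      <- values_toy[keep]
```

Using one logical index for both vectors keeps the names and counts aligned.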
I create a data frame with two variables, State and N_Player.
final <- data.frame(State=state_names, N_Player=as.numeric(values))
Collecting State Populations
The raw number of players in a state should be standardized by, say, the state’s population, so I want to collect current state populations. I found a Wikipedia page with the data and read it into R with the htmltab function in the htmltab package.
library(htmltab)
p <- htmltab("https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population")
I do some cleaning of this data frame: I select the rows corresponding to the 50 states, use shorter variable names, remove the commas from the population values, and convert the results to numeric type.
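# keep the rows for the 50 states and the name and population columns
p <- p[c(1:29, 31:49, 51:52), 3:4]
names(p) <- c("State", "Population")
# strip the thousands separators before converting to numeric
p$Population <- as.numeric(str_replace_all(p$Population, ",", ""))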
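The comma removal matters because as.numeric returns NA for a string like "1,234". A minimal sketch with made-up population strings:

```r
library(stringr)

# Made-up population strings with thousands separators
pop_text <- c("39,250,017", "585,501")

# Converting directly gives NA (with a coercion warning, suppressed here)
direct <- suppressWarnings(as.numeric(pop_text))

# Removing the commas first gives usable numbers
cleaned <- as.numeric(str_replace_all(pop_text, ",", ""))
cleaned
```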
Merging the Two Datasets
Okay, now I merge the baseball data with the population data using the inner_join function in the dplyr package.
library(dplyr)
# match rows on the shared State column
final_data <- inner_join(final, p, by = "State")
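An inner join silently drops rows that have no match, so it can be worth checking for leftovers. One way, sketched here on toy data frames with made-up numbers (the trailing space in "Hawaii " is deliberate), is dplyr's anti_join:

```r
library(dplyr)

# Toy data frames; all numbers are made up
players <- data.frame(State = c("Texas", "Ohio", "Hawaii "),
                      N_Player = c(10, 5, 1),
                      stringsAsFactors = FALSE)
pops <- data.frame(State = c("Texas", "Ohio", "Hawaii"),
                   Population = c(1000, 500, 100),
                   stringsAsFactors = FALSE)

# Rows of players that have no matching State in pops
unmatched <- anti_join(players, pops, by = "State")
unmatched
```

An empty result means every state found a partner; anything listed points to a spelling or whitespace mismatch to fix before joining.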
I create a density variable Pct_MLB, the number of players divided by the corresponding state’s population, expressed as a percentage. I also add a lowercase version of the state name, since the map data uses lowercase region names.
final_data <- mutate(final_data, Pct_MLB = 100 * N_Player / Population, state=tolower(State))
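To get a feel for the scale of Pct_MLB, here is the arithmetic for a hypothetical state with 100 players and 20 million residents (both numbers made up). The per-million version is only an alternative scaling I am assuming some readers would find easier to read; it is not used in the map.

```r
# Hypothetical state: both numbers are made up
n_player   <- 100
population <- 2e7

# Same formula as Pct_MLB
pct <- 100 * n_player / population
pct            # 5e-04, that is 0.0005 percent

# Alternative scaling: players per million residents
per_million <- 1e6 * n_player / population
per_million    # 5
```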
Merging with Mapping Data
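I collect the map data for the United States using the map_data function, which is part of ggplot2; the state outlines themselves come from the maps package, which is loaded along with the mapdata package. Then I merge this mapping data with my measurement data using another application of inner_join.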
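library(ggplot2)   # provides map_data
library(mapdata)   # loads the maps package that supplies the "state" outlines
states <- map_data("state")
# region holds lowercase state names, so join on the lowercase state column
states <- inner_join(states, final_data, by = c("region" = "state"))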
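The "state" map covers the lower 48 states and the District of Columbia, so Alaska and Hawaii drop out of this join. A quick way to see what is lost on either side is setdiff; the vectors below are small toy subsets, not the full data.

```r
# Toy subsets of the two sets of names
map_regions <- c("alabama", "arizona", "district of columbia")
data_states <- c("alabama", "alaska", "arizona", "hawaii")

# States with measurements but no polygons
not_on_map <- setdiff(data_states, map_regions)

# Regions with polygons but no measurements
no_data <- setdiff(map_regions, data_states)

not_on_map
no_data
```

With the real objects, the same check would compare final_data$state against unique(map_data("state")$region).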
Ready to Graph
I am ready to graph using ggplot2. I use geom_polygon with Pct_MLB as the fill variable. I choose a color range of yellow (low) to red (high) so the states are clearly distinguished.
library(ggplot2)
ggplot(data = states) +
  geom_polygon(aes(x = long, y = lat, fill = Pct_MLB, group = group),
               color = "white") +
  coord_fixed(1.3) +
  scale_fill_gradient(low = "yellow", high = "red") +
  ggtitle("Density of 2016 MLB Players by State")
As expected, we see a high density of baseball players born in California and Florida. But there are a number of other states with relatively large densities of baseball players, such as Texas, Georgia, Oklahoma, and some of the midwestern states. By adjusting for population size, I think this display is more meaningful than simply scanning a table of the number of players from each state.
If you want to try this out, all of the code above can be found as a single chunk on my gist site.