In my “computing with data” course, one nice topic is text mining, and one activity creates a word cloud of the popular words from a book, a speech, or a group of tweets. An easy way to create a word cloud is with the wordcloud package: all one needs is a vector of popular words and a vector of the corresponding frequencies.
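As a minimal sketch of that idea, the snippet below passes a vector of words and a vector of counts to the `wordcloud()` function. The words and counts are made-up toy values, not data from the course, and the `scale` setting is just one reasonable choice.

```r
library(wordcloud)

# Hypothetical toy data: words and their matching frequencies
words <- c("baseball", "pitcher", "catcher", "inning", "homer")
freqs <- c(50, 30, 20, 15, 10)

# Word placement is random, so fix the seed for a reproducible picture
set.seed(1)

# min.freq = 1 keeps every word; random.order = FALSE plots the most
# frequent words first, which puts them near the center
wordcloud(words = words, freq = freqs, min.freq = 1,
          scale = c(4, 0.8), random.order = FALSE)
```

The two vectors must have the same length, since each frequency is paired with the word in the same position.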
In a conversation with one of our current students, I wondered aloud whether some type of word cloud display could be created using ggplot2 graphics. This seemed appealing since ggplot2 could be used to create panels or use color as an extra attribute. Anyway, a little Google search led me to this post, which describes how to create a word cloud by clever use of the ggrepel package. I apply these ideas to learn about popular first names among ballplayers.
We know that the popularity of first names has gone through dramatic changes over time (one can explore the names in the babynames package to see these changes). So I would expect that the names of ballplayers have also changed, and I could use word clouds to visualize these changes. I divided the history of baseball into four eras by birth year, 1820-1895, 1896-1937, 1938-1969, and 1970-1996 (these eras contain approximately equal numbers of players), and created the following display using ggplot2. The size of the text corresponds to the number of players with that first name, and I have highlighted (using a different color) the names with frequencies of 100 or higher. Since I am a Jim, I notice that Jim was a relatively popular name among ballplayers in the first three eras but has disappeared from the 1970-1996 era (Matt, Chris, and Mike are popular instead). I don’t see an obvious Latin American influence, with the exception of Carlos and Luis in the most recent era. (It would be interesting to do a word study of the first names of all of the current playoff teams.)
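To make the era binning concrete, here is a small base R sketch of how `cut()` assigns birth years to the four eras. The break points and labels are the ones described above; the sample birth years are made up for illustration and chosen to fall on both sides of each boundary.

```r
# Era boundaries from the post. cut() builds intervals that are open on
# the left and closed on the right, e.g. (1895, 1937].
breaks <- c(1820, 1895, 1937, 1969, 1996)
era_labels <- paste("Birthyears",
                    c("1820-1895", "1896-1937", "1938-1969", "1970-1996"))

# Hypothetical birth years straddling each boundary
birth_years <- c(1850, 1895, 1896, 1937, 1938, 1969, 1970, 1996, 2000)

# Years outside (1820, 1996] become NA
era <- cut(birth_years, breaks = breaks, labels = era_labels)
data.frame(birth_years, era)
```

Because the intervals are open on the left, a birth year of exactly 1820 would also map to `NA`, as would any year after 1996. Rows like these need to be filtered out before counting names.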
Here is the code I used. As you use dplyr operations more often, it seems convenient and perhaps more intuitive to use piping.
library(Lahman)
library(tidyverse)
library(ggrepel)
library(ggthemes)

# Note: recent Lahman releases name this table People instead of Master
Master %>%
  # Bin birth years into the four eras
  mutate(Byear = factor(cut(birthYear, c(1820, 1895, 1937, 1969, 1996)),
                        labels = paste("Birthyears",
                                       c("1820-1895", "1896-1937",
                                         "1938-1969", "1970-1996")))) %>%
  # Count players by era and first name
  group_by(Byear, nameFirst) %>%
  summarize(Count = n()) %>%
  # Drop players whose birth year is missing or outside the eras
  filter(is.na(Byear) == FALSE) %>%
  # Keep the 30 most common names in each era
  group_by(Byear) %>%
  arrange(desc(Count)) %>%
  slice(1:30) %>%
  # Put every label at the same point and let ggrepel push them apart
  ggplot() +
  aes(x = 1, y = 1, size = Count, label = nameFirst,
      color = Count >= 100) +
  geom_text_repel(segment.size = 0, force = 100) +
  # guide = "none" replaces the deprecated guide = FALSE
  scale_size(range = c(2, 10), guide = "none") +
  scale_y_continuous(breaks = NULL) +
  scale_x_continuous(breaks = NULL) +
  labs(x = '', y = '') +
  facet_wrap(~ Byear) +
  theme_solarized() +
  theme(strip.text = element_text(color = "red", size = 16,
                                  lineheight = 5.0),
        plot.title = element_text(colour = "blue", size = 18,
                                  hjust = 0.5, vjust = 0.8, angle = 0)) +
  ggtitle("Popular First Names of Ballplayers")