First Names, Word Clouds, and ggplot2


In my “computing with data” course, one nice topic is text mining, and one activity is to create a word cloud of the popular words from a book, a speech, or a group of tweets. An easy way to create a word cloud is with the wordcloud package: all one needs is a vector of popular words and a vector of corresponding frequencies.
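As a minimal sketch of that words-plus-frequencies interface, here is a call to wordcloud() on a tiny made-up vocabulary. The words and counts are invented purely for illustration and do not come from any real text.

```r
library(wordcloud)

# Hypothetical toy data: five words and invented frequencies
words <- c("data", "model", "plot", "code", "graph")
freqs <- c(40, 25, 18, 12, 7)

set.seed(1)  # the layout is randomized, so fix the seed for a repeatable picture
wordcloud(words = words, freq = freqs,
          min.freq = 1,           # keep even the rarest word
          random.order = FALSE)   # place the most frequent words first
```

The two vectors must have the same length, with freqs[i] giving the count for words[i].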

In a conversation with one of our current students, I wondered aloud whether some type of word cloud display could be created using ggplot2 graphics. This seemed appealing since ggplot2 could be used to create panels or to use color as an extra attribute. Anyway, a little Google search led me to this post, which describes how to create a word cloud by clever use of the ggrepel package. I apply these ideas to learn about popular first names among ballplayers.
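The ggrepel idea can be sketched on toy data as follows. Every label is plotted at the same point, and geom_text_repel() pushes the labels apart so that they do not overlap. The words, counts, and panel names below are made up for illustration, and the threshold of 20 for the color highlight is arbitrary.

```r
library(ggplot2)
library(ggrepel)

# Hypothetical toy data: invented words, counts, and two panels
toy <- data.frame(
  word  = c("alpha", "beta", "gamma", "delta", "epsilon", "zeta"),
  count = c(30, 22, 9, 27, 14, 6),
  panel = rep(c("A", "B"), each = 3)
)

p <- ggplot(toy, aes(x = 1, y = 1, size = count, label = word,
                     color = count >= 20)) +
  geom_text_repel(segment.size = 0, force = 100) +  # no connecting segments, strong repulsion
  scale_size(range = c(3, 10), guide = "none") +    # text size tracks the count
  facet_wrap(~ panel) +                             # one small cloud per panel
  theme_void()
p
```

Because this is ordinary ggplot2, faceting and color mapping come for free, which is the appeal over a standalone word cloud function.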

Popular names

We know that the popularity of first names has gone through dramatic changes over time (one can explore the names in the babynames package to see these changes). So I would expect the names of ballplayers to have changed as well, and word clouds are one way to visualize these changes. I divided the history of baseball into four eras by birth year (1820-1895, 1896-1937, 1938-1969, and 1970-1996), chosen so that each era contains approximately the same number of players, and created the following display using ggplot2. The size of the text corresponds to the number of players with that first name, and I have highlighted (using a different color) the names with frequencies of 100 or higher. Since I am a Jim, I notice that Jim was a relatively popular name among ballplayers in the first three eras but has disappeared from the 1970-1996 era (Matt, Chris, and Mike are popular instead). I don’t see an obvious Latin American influence, with the exception of Carlos and Luis in the most recent era. (It would be interesting to do a word study of the first names of the players on all of the current playoff teams.)
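To see how birth years map onto the four eras, here is a small base R sketch using cut(). The birth years are made-up test values sitting on the era boundaries, not real player data.

```r
# Hypothetical birth years chosen to sit on the era boundaries
birth_years <- c(1850, 1895, 1896, 1937, 1938, 1969, 1970, 1996)

# cut() uses right-closed intervals by default: (1820, 1895], (1895, 1937], ...
era <- cut(birth_years,
           breaks = c(1820, 1895, 1937, 1969, 1996),
           labels = c("1820-1895", "1896-1937", "1938-1969", "1970-1996"))

table(era)  # two test years land in each era

# Years outside the breaks become NA
outside <- cut(c(1800, 2000),
               breaks = c(1820, 1895, 1937, 1969, 1996))
```

Note that a birth year of exactly 1820 would also come out as NA, because the first interval is open on the left; passing include.lowest = TRUE to cut() would include it.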


The code

Here is the code I used.  As you use dplyr operations more often, it seems convenient and perhaps more intuitive to use piping.


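# Assumes Master is the player table from the Lahman package
# (newer releases of Lahman name this table People).
library(Lahman)
library(dplyr)
library(ggplot2)
library(ggrepel)   # geom_text_repel()
library(ggthemes)  # theme_solarized()

Master %>%
  # Bin birth years into four eras; cut() intervals are right-closed
  mutate(Byear = cut(birthYear,
                     breaks = c(1820, 1895, 1937, 1969, 1996),
                     labels = c("1820-1895", "1896-1937",
                                "1938-1969", "1970-1996"))) %>%
  group_by(Byear, nameFirst) %>%
  summarize(Count = n()) %>%
  # The tested expression is missing in the source; is.na(Byear) is assumed,
  # which drops players whose birth year is missing or outside the eras
  filter(is.na(Byear) == FALSE) %>%
  group_by(Byear) %>%
  arrange(desc(Count)) %>%
  slice(1:30) %>%  # the 30 most common names in each era
  ggplot() +
  # Every label starts at the same point; ggrepel spreads them out
  aes(x = 1, y = 1, size = Count, label = nameFirst,
      color = Count >= 100) +
  geom_text_repel(segment.size = 0, force = 100) +
  scale_size(range = c(2, 10), guide = "none") +
  scale_y_continuous(breaks = NULL) +
  scale_x_continuous(breaks = NULL) +
  labs(x = '', y = '') +
  facet_wrap(~ Byear) +
  theme_solarized() +
  theme(strip.text = element_text(
          color = "red", size = 16, lineheight = 5.0),
        plot.title = element_text(colour = "blue",
                                  size = 18,
                                  hjust = 0.5, vjust = 0.8, angle = 0)) +
  ggtitle("Popular First Names of Ballplayers")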