First Names, Word Clouds, and ggplot2


In my “computing with data” course, one nice topic is text mining, and one activity creates a word cloud of the popular words from a book or speech or group of tweets.  An easy way of creating a word cloud uses the wordcloud package — one can create a word cloud if one has a vector of popular words and a vector of corresponding frequencies.
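To make the "two vectors" idea concrete, here is a minimal sketch of a wordcloud call. The words and counts are toy values I made up for illustration, and the particular `scale` and `min.freq` settings are just one reasonable choice, not a recommendation.

```r
# Minimal sketch: the words and counts below are invented toy values.
library(wordcloud)

words <- c("baseball", "pitcher", "batter", "inning", "umpire")
freq  <- c(40, 25, 18, 12, 5)

# The two vectors must line up element by element.
stopifnot(length(words) == length(freq))

set.seed(1)  # word placement involves randomness
wordcloud(words = words, freq = freq,
          min.freq = 1,          # keep even the rarest toy word
          scale = c(4, 0.8),     # largest and smallest text sizes
          random.order = FALSE)  # most frequent words are placed first
```

In a real text-mining activity the two vectors would come from a word-frequency table of the book, speech, or tweets.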

In a conversation with one of our current students, I wondered aloud if some type of word cloud display could be created using ggplot2 graphics.  This seemed appealing since ggplot2 could be used to create panels or use color as an extra attribute.  Anyway, a little Google searching led me to this post, which describes how to create a word cloud by clever use of the ggrepel package. I apply these ideas to learn about popular first names among ballplayers.

Popular names

We know that the popularity of first names has gone through dramatic changes in history (one can explore the names in the babynames package to see these changes).  So I would expect that the names of ballplayers have also changed, and I could use word clouds to visualize these changes.  I divided the history of baseball into four eras by birth year: 1820-1895, 1896-1937, 1938-1969, and 1970-1996 (these eras contain approximately equal numbers of players), and created the following display using ggplot2.  The size of the text corresponds to the number of players with that first name, and I have highlighted (using a different color) the names with frequencies of 100 or higher.  Since I am a Jim, I notice that Jim was a relatively popular name among ballplayers in the first three eras, but has disappeared from the 1970-1996 era (Matt, Chris, and Mike are popular instead).  I don’t see an obvious Latin American influence with the exception of Carlos and Luis in the most recent era.  (It would be interesting to do a word study of the first names of all of the current playoff teams.)
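One quick way to check how players fall into the four eras is to bin the birth years with `cut()` and tabulate. This is only a sketch: `count_by_era` is a hypothetical helper name, and the birth years below are invented toy values rather than real player data.

```r
# Hypothetical helper: tabulate birth years by era.
# cut() uses right-closed intervals, so the bins are
# (1820,1895], (1895,1937], (1937,1969], (1969,1996].
count_by_era <- function(birth_year,
                         breaks = c(1820, 1895, 1937, 1969, 1996),
                         labels = c("1820-1895", "1896-1937",
                                    "1938-1969", "1970-1996")) {
  era <- cut(birth_year, breaks = breaks, labels = labels)
  # Years outside the breaks (or missing) are counted under NA.
  table(era, useNA = "ifany")
}

# Invented toy birth years, including one out-of-range and one missing value.
toy_years <- c(1850, 1895, 1896, 1937, 1950, 1969, 1970, 1996, 2001, NA)
count_by_era(toy_years)
```

With the real player table one would pass its birth-year column (for example `Master$birthYear`) instead of `toy_years`.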


The code

Here is the code I used.  As you use dplyr operations more often, it seems convenient and perhaps more intuitive to use piping.


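library(Lahman)    # assumed source of the Master table (named People in newer Lahman releases)
library(dplyr)
library(ggplot2)
library(ggrepel)   # geom_text_repel()
library(ggthemes)  # theme_solarized()

Master %>%
  # Bin birth years into four eras; years outside the breaks become NA
  mutate(Byear = factor(cut(birthYear,
                            c(1820, 1895, 1937, 1969, 1996)),
                        labels = c("1820-1895", "1896-1937",
                                   "1938-1969", "1970-1996"))) %>%
  # Count the players with each first name within each era
  group_by(Byear, nameFirst) %>%
  summarize(Count = n()) %>%
  # Drop players whose birth year is missing or outside the four eras
  filter(!is.na(Byear)) %>%
  # Keep the 30 most common names in each era
  group_by(Byear) %>%
  arrange(desc(Count)) %>%
  slice(1:30) %>%
  # Put every label at the same point and let ggrepel push them apart
  ggplot(aes(x = 1, y = 1, size = Count, label = nameFirst,
             color = Count >= 100)) +
  geom_text_repel(segment.size = 0, force = 100) +
  scale_size(range = c(2, 10), guide = "none") +
  scale_y_continuous(breaks = NULL) +
  scale_x_continuous(breaks = NULL) +
  labs(x = '', y = '') +
  facet_wrap(~ Byear) +
  theme_solarized() +
  theme(strip.text = element_text(
          color = "red", size = 16, lineheight = 5.0),
        plot.title = element_text(colour = "blue",
                                  size = 18,
                                  hjust = 0.5, vjust = 0.8, angle = 0)) +
  ggtitle("Popular First Names of Ballplayers")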
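To see the ggrepel trick in isolation, here is a stripped-down sketch on a toy data frame. The names come from the display above, but the counts are invented for illustration. It also sets `segment.color = NA`, which a reader suggests in the comments as a way to suppress connecting lines if they show up on your system.

```r
library(ggplot2)
library(ggrepel)

# Toy data: the counts are made up, not taken from the player table.
toy <- data.frame(
  name  = c("Jim", "Mike", "Chris", "Matt", "Carlos", "Luis"),
  count = c(120, 150, 110, 105, 60, 55)
)

set.seed(2)  # label placement involves randomness

# All labels start at the same point (1, 1); the repulsion
# between labels spreads them out into a cloud.
p <- ggplot(toy, aes(x = 1, y = 1, size = count, label = name,
                     color = count >= 100)) +
  geom_text_repel(segment.color = NA,  # hide connecting segments
                  force = 100) +       # strong repulsion between labels
  scale_size(range = c(2, 10), guide = "none") +
  theme_void()

print(p)
```

Adding `facet_wrap()` on a grouping variable then gives one small cloud per panel, which is the idea behind the four-era display.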




3 responses

  1. Oh, this is neat! However, when I run your code, the plots are cluttered with lines connecting the center of each facet with the label. (Facets do not affect it.) Any idea why this happens and how to get rid of it? Many thanks for any hint!

    This is where I run your script:

    R Version:
    platform x86_64-w64-mingw32
    arch x86_64
    os mingw32
    system x86_64, mingw32
    version.string R version 3.4.3 (2017-11-30)
    nickname Kite-Eating Tree

    1. I just tried out the code and it seemed to work fine. I don’t know what is happening at your end. Sorry not to be of more help.

  2. Set segment.color to NA to get rid of these unwanted lines, should they appear for you.
