One of the great pitchers in baseball history was Christy Mathewson. Since I went to Bucknell University, I became familiar with Mathewson, as he was one of the greatest athletes to attend Bucknell. (Interestingly, Baseball Reference lists 24 Bucknell players who have played in the major leagues.) The football stadium at Bucknell is named the “Christy Mathewson–Memorial Stadium”.
Mathewson wrote about his baseball pitching experiences in the book Pitching in a Pinch, copyrighted in 1912. This book is freely available as part of Project Gutenberg. It is an interesting read — much of the book is devoted to the pitcher/batter matchups that we focus on today. Anyway, Mathewson’s book is a good springboard for introducing the tidy approach to text mining described in the new book by Julia Silge and David Robinson.
Load in the R packages
I will illustrate some useful functions in the tidytext package. First I read in the packages for this example, including gutenbergr, where one can import text for over 5000 books in Project Gutenberg.
library(tidytext)
library(tidyverse)
library(gutenbergr)
library(wordcloud)
library(Lahman)
Download the book
I load Mathewson’s book using the gutenberg_download function with the id number of 33291.
cm <- gutenberg_download(33291)
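gutenberg_download returns a data frame with the columns gutenberg_id and text, and applying as.character to it collapses each column into one long string of text. I first create a data frame with two variables, line and text.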
text_df <- data.frame(line = 1:length(cm), text = as.character(cm), stringsAsFactors = FALSE)
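As a possible variant, here is a minimal sketch that numbers the rows of the text column directly, so that each row holds one line of the book and the id number stays out of the text. It assumes only that the downloaded object has gutenberg_id and text columns. The object cm_toy and its two sentences are invented stand-ins, so the sketch runs without a download.

```r
library(tibble)

# Toy stand-in for the result of gutenberg_download(33291).
# The two lines of text are invented for illustration.
cm_toy <- tibble(
  gutenberg_id = c(33291L, 33291L),
  text = c("The pitcher threw the ball.", "McGraw watched the game.")
)

# One row per line of text, numbered in order of appearance.
text_df_toy <- tibble(
  line = seq_along(cm_toy$text),
  text = cm_toy$text
)

text_df_toy
```

With the real download, replacing cm_toy by cm should give one row per line of the book.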
Creating a tidy text data frame
The new function unnest_tokens converts the data frame to a token data frame, where each word of the book is on a single line; we call this a tidy text data frame.
text_df %>% unnest_tokens(word, text, to_lower = TRUE) -> tidy_text
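To see what the one-word-per-row format looks like, here is a small sketch on an invented two-line data frame. The names toy_df and toy_tokens and the sentences are made up for illustration.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Invented two-line example in the same shape as text_df.
toy_df <- tibble(
  line = 1:2,
  text = c("The pitcher threw the ball.", "McGraw watched the game.")
)

# Split the text column into one lowercase word per row.
# The line column is carried along, so each word keeps its line number.
toy_tokens <- toy_df %>%
  unnest_tokens(word, text, to_lower = TRUE)

toy_tokens
```

Punctuation is dropped during tokenizing, and to_lower = TRUE is why "McGraw" shows up as "mcgraw" in the frequency tables.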
I am not interested in commonly used words called stop words. A data frame of stop words is stored in the tidytext package and the following will remove the stop words from my data frame.
data(stop_words)
tidy_text %>% anti_join(stop_words) -> tidy_text
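The stop word list can also be extended with one's own entries before the anti_join. The sketch below is one way this might be done on invented tokens. The custom_stop_words name and the "custom" lexicon label are my own choices, not part of tidytext.

```r
library(dplyr)
library(tibble)
library(tidytext)

data(stop_words)

# Invented tokens, including a stray id number.
toy_tokens <- tibble(
  word = c("the", "pitcher", "threw", "the", "ball", "33291")
)

# Add a hypothetical custom entry to the built-in list.
# stop_words has the columns word and lexicon.
custom_stop_words <- bind_rows(
  stop_words,
  tibble(word = "33291", lexicon = "custom")
)

# Keep only the rows whose word is not in the extended list.
toy_clean <- toy_tokens %>%
  anti_join(custom_stop_words, by = "word")

toy_clean
```

Naming the join column with by = "word" also avoids the message that anti_join prints when it has to guess the column.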
Frequency tables of words
We are typically interested in creating a frequency table of the words used in the book. The count function will create this table, and I use ggplot2 to create a bar graph of the words that appear more than 50 times in the book. I first remove the book id number “33291” before I construct the graph.
tidy_text %>% count(word, sort = TRUE)
# A tibble: 4,751 x 2
   word        n
   <chr>   <int>
 1 33291    6329
 2 ball      391
 3 game      382
 4 mcgraw    239
 5 base      232
 6 time      184
 7 club      179
 8 hit       176
 9 pitcher   173
10 players   161
# ... with 4,741 more rows
Here’s a graph of the frequencies:
tidy_text %>%
  count(word, sort = TRUE) %>%
  filter(n > 50) %>%
  filter(! word %in% c("33291")) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()
Of course, we always like creating word clouds of the popular words used.
tidy_text %>%
  count(word) %>%
  filter(! word %in% c("33291")) %>%
  mutate(word = reorder(word, n)) %>%
  with(wordcloud(word, n, max.words = 40))
Exploring word lengths
It might be interesting to explore the long words that Mathewson used. I first create a new variable called Length using the str_length function.
tidy_text %>% mutate(Length = str_length(word)) -> tidy_text
I sort the frequency table of words by word length and display the 20 longest words.
tidy_text %>%
  count(word) %>%
  mutate(Length = str_length(word)) %>%
  arrange(desc(Length)) %>%
  slice(1:20)
# A tibble: 20 x 3
   word                 n Length
   <chr>            <int>  <int>
 1 responsibilities     1     16
 2 inconsistencies      1     15
 3 intellectuality      1     15
 4 notwithstanding      2     15
 5 conclusiveness       2     14
 6 conversational       6     14
 7 correspondence       1     14
 8 correspondents       3     14
 9 discouragement       1     14
10 eccentricities       1     14
11 misinformation       1     14
12 misinterpreted       1     14
13 responsibility       2     14
14 satisfactorily       1     14
15 simultaneously       2     14
16 transportation       3     14
17 undergraduates       1     14
18 administering        1     13
19 agriculturist        1     13
20 assassination        1     13
What baseball players are mentioned in Mathewson’s book? I extract the last names of all ballplayers from the Master data frame in the Lahman database and then merge this list with the word list from Mathewson’s book. I keep the original capitalization here so that words can match the capitalized last names. This method isn’t perfect: for example, “York” appears in the book as part of New York, the city where Mathewson played, rather than as a player’s name. Still, it is helpful for seeing some of the famous ballplayers who played during the same era.
select(Master, nameLast) %>%
  count(nameLast) %>%
  select(nameLast) -> MS
text_df %>%
  unnest_tokens(word, text, to_lower = FALSE) %>%
  count(word) -> S
inner_join(S, MS, by = c("word" = "nameLast")) %>%
  arrange(desc(n)) -> S2
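One rough way to handle false matches such as “York” is to filter out a hand-made list of words after the join. The sketch below uses invented stand-ins for the name list and the word counts. The names last_names, word_counts and not_players, and all of the counts, are hypothetical.

```r
library(dplyr)
library(tibble)

# Invented stand-ins for the Lahman last names and the book's word counts.
last_names <- tibble(nameLast = c("McGraw", "Wagner", "York"))
word_counts <- tibble(
  word = c("McGraw", "ball", "York", "Wagner", "game"),
  n = c(5L, 9L, 4L, 2L, 7L)
)

# Words assumed, for this sketch, to be place names rather than players.
not_players <- c("York")

# Keep words that match a last name, drop the known false matches,
# and sort by how often each name appears.
name_counts <- word_counts %>%
  inner_join(last_names, by = c("word" = "nameLast")) %>%
  filter(!word %in% not_players) %>%
  arrange(desc(n))

name_counts
```

The match is case sensitive, so a lowercase word would not be joined to a capitalized last name.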
A few things I learned in this particular post.
- Since the R code sometimes gets garbled in a blog post, all of the R code for recreating this work is available on my GitHub Gist site.
- The tidytext package is a nice entry into the world of text processing. The accompanying book by Silge and Robinson seems helpful, and I plan on using this text to introduce text mining to a group on campus.
- There is a treasure of books available in Project Gutenberg that can be used in text mining. Next, I am interested in trying to implement sentiment analysis following guidance from Silge and Robinson’s text.
- I plan on reading more from Mathewson’s book. It is an interesting look at how baseball was played over 100 years ago.