Monthly Archives: November, 2017

Text Mining “Pitching in a Pinch”

[Photo: Christy Mathewson in a New York uniform]

Introduction

One of the great pitchers in baseball history was Christy Mathewson. Since I went to Bucknell University, I became familiar with Mathewson, as he was one of the greatest athletes to attend Bucknell. (Interestingly, Baseball Reference lists 24 Bucknell players who have played in the major leagues.) The football stadium at Bucknell is named the “Christy Mathewson–Memorial Stadium”.

Mathewson wrote about his pitching experiences in the book Pitching in a Pinch, copyrighted in 1912. This book is freely available as part of Project Gutenberg. It is an interesting read: much of the book is devoted to the pitcher/batter matchups that we still focus on today. Mathewson’s book is also a good springboard for introducing the tidy approach to text mining described in the new book by Julia Silge and David Robinson, Text Mining with R.

Load in the R packages

I will illustrate some useful functions in the tidytext package. First I load the packages for this example, including tidytext and gutenbergr, which lets one import the text of over 5000 books in Project Gutenberg.

library(tidytext)
library(tidyverse)
library(gutenbergr)
library(wordcloud)
library(Lahman)

Download the book

I load Mathewson’s book using the gutenberg_download function with the id number of 33291.

cm <- gutenberg_download(33291)

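The object cm returned by gutenberg_download is a data frame with two columns, gutenberg_id and text, with one row per line of the book. I create a data frame with two variables, line and text. Note that as.character(cm) collapses each column of cm into one long string, so the id number 33291 ends up in the text; this is why it shows up as a “word” in the counts further on.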

text_df <- data.frame(line = 1:length(cm),
                      text = as.character(cm),
                      stringsAsFactors = FALSE)
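
A more direct construction is possible if cm is a data frame with gutenberg_id and text columns, which is what gutenberg_download returns as far as I know. Here is a minimal sketch under that assumption. To keep it runnable without a download, it uses a tiny stand-in called cm_demo whose lines of text are invented, not taken from the book; cm_demo and text_df_demo are names I made up for the illustration.

```r
library(dplyr)

# Stand-in for the result of gutenberg_download(33291): a tibble with
# gutenberg_id and text columns. The text lines below are invented.
cm_demo <- tibble(
  gutenberg_id = 33291L,
  text = c("The pitcher threw the ball.",
           "",
           "McGraw watched the game.")
)

# Keep one row per line of the book and number the lines.
# The id column is dropped, so it never gets mixed into the text.
text_df_demo <- cm_demo %>%
  transmute(line = row_number(), text = text)

text_df_demo
```

With this form, each row of the data frame is one line of the book, and the id number would not need to be filtered out of the word counts later.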

Creating a tidy text data frame

The unnest_tokens function converts the data frame to a token data frame with one word of the book per row; we call this tidy_text.

text_df %>%
  unnest_tokens(word, text, to_lower = TRUE) ->
  tidy_text
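
To see what unnest_tokens does, here is a small sketch on two invented sentences (demo_df and demo_tokens are made-up names, and the sentences are not from the book). Each word becomes its own row, punctuation is dropped, and to_lower = TRUE converts everything to lower case.

```r
library(dplyr)
library(tidytext)

# Two invented lines of text, numbered like the lines of a book.
demo_df <- tibble(
  line = 1:2,
  text = c("The pitcher threw the ball.", "McGraw watched the game!")
)

# One row per word; the line column records where each word came from.
demo_tokens <- demo_df %>%
  unnest_tokens(word, text, to_lower = TRUE)

demo_tokens
```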

I am not interested in commonly used words such as “the” and “of”, which are called stop words. A data frame of stop words is stored in the tidytext package, and the following removes the stop words from my data frame.

data(stop_words)
tidy_text %>%
  anti_join(stop_words, by = "word") -> tidy_text
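
The anti_join step keeps only the rows of the first data frame that have no match in the second. A short sketch with a handful of invented words (demo_words and demo_kept are made-up names) shows the effect:

```r
library(dplyr)
library(tidytext)

# A few tokens, including the stop word "the" twice.
demo_words <- tibble(word = c("the", "pitcher", "threw", "the", "ball"))

# Drop every row whose word appears in the stop_words data frame.
demo_kept <- demo_words %>%
  anti_join(stop_words, by = "word")

demo_kept
```

Naming the join column with by = "word" avoids the message that dplyr prints when it has to guess the column.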

Frequency tables of words

We are typically interested in creating a frequency table of the words used in the book. The count function will create this table, and I use ggplot2 to create a bar graph of the words that appear more than 50 times in the book. I remove the book id number “33291” before I construct the graph.

tidy_text %>%
  count(word, sort = TRUE)

# A tibble: 4,751 x 2
   word        n
   <chr>   <int>
 1 33291    6329
 2 ball      391
 3 game      382
 4 mcgraw    239
 5 base      232
 6 time      184
 7 club      179
 8 hit       176
 9 pitcher   173
10 players   161
# ... with 4,741 more rows
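
As a small illustration of how count builds this kind of table, here is a sketch on an invented vector of tokens (demo_words and demo_freq are made-up names). It drops the id string first and then counts, with sort = TRUE putting the most frequent word at the top.

```r
library(dplyr)

# Invented tokens, with the book id mixed in as in the table above.
demo_words <- tibble(
  word = c("ball", "game", "ball", "33291", "mcgraw", "ball", "game")
)

# Remove the id, then tabulate the remaining words by frequency.
demo_freq <- demo_words %>%
  filter(word != "33291") %>%
  count(word, sort = TRUE)

demo_freq
```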

Here’s a graph of the frequencies:

tidy_text %>%
 count(word, sort = TRUE) %>%
 filter(n > 50) %>%
 filter(! word %in% c("33291")) %>%
 mutate(word = reorder(word, n)) %>%
 ggplot(aes(word, n)) +
 geom_col() +
 xlab(NULL) +
 coord_flip()

[Figure: bar graph of the words that appear more than 50 times in the book]

Of course, we always like creating word clouds of the popular words used.

tidy_text %>%
 count(word) %>%
 filter(! word %in% c("33291")) %>%
 mutate(word = reorder(word, n)) %>%
 with(wordcloud(word, n, max.words = 40))

[Figure: word cloud of the 40 most frequent words]

Exploring word lengths

It might be interesting to explore the long words that Mathewson used. I first create a new variable called Length using the str_length function.

tidy_text %>% mutate(Length = str_length(word)) ->
 tidy_text

I sort the frequency table of words by word length and display the 20 longest words.

tidy_text %>% count(word) %>%
 mutate(Length = str_length(word)) %>%
 arrange(desc(Length)) %>% slice(1:20)

# A tibble: 20 x 3
   word                 n Length
   <chr>            <int>  <int>
 1 responsibilities     1     16
 2 inconsistencies      1     15
 3 intellectuality      1     15
 4 notwithstanding      2     15
 5 conclusiveness       2     14
 6 conversational       6     14
 7 correspondence       1     14
 8 correspondents       3     14
 9 discouragement       1     14
10 eccentricities       1     14
11 misinformation       1     14
12 misinterpreted       1     14
13 responsibility       2     14
14 satisfactorily       1     14
15 simultaneously       2     14
16 transportation       3     14
17 undergraduates       1     14
18 administering        1     13
19 agriculturist        1     13
20 assassination        1     13
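
The same sort-by-length idea can be tried on a few rows taken from the two tables in this post. This is only a sketch; demo_counts and longest are names I made up for it.

```r
library(dplyr)
library(stringr)

# Four words and their counts, copied from the tables above.
demo_counts <- tibble(
  word = c("ball", "responsibilities", "game", "notwithstanding"),
  n = c(391L, 1L, 382L, 2L)
)

# Add the number of characters in each word, then keep the two longest.
longest <- demo_counts %>%
  mutate(Length = str_length(word)) %>%
  arrange(desc(Length)) %>%
  slice(1:2)

longest
```

The frequent words are short and the long words are rare, which is the pattern the full table shows.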

Contemporary players?

Which baseball players are mentioned in Mathewson’s book? I extract the last names of all ballplayers from the Master data frame in the Lahman database and then merge this list with the word list from Mathewson’s book. I keep the original capitalization of the words this time so that they can match the capitalized last names. This method isn’t perfect. For example, “York” turns up as a match even though in the book it is part of the name of the city where Mathewson played. Still, it is helpful for seeing some of the famous ballplayers who played during the same era.

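# Distinct last names of all players in the Lahman database
# (in recent versions of Lahman this table is named People)
select(Master, nameLast) %>%
  count(nameLast) %>% select(nameLast) -> MS
# Count the words of the book, keeping capitalization
text_df %>%
  unnest_tokens(word, text, to_lower = FALSE) %>%
  count(word) -> S
# Keep only the words that match a last name, most frequent first
inner_join(S, MS, by = c("word" = "nameLast")) %>%
  arrange(desc(n)) -> S2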
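
Here is a small sketch of the matching step that runs without the Lahman package. The names in names_demo are a hypothetical stand-in for the nameLast column, and the two sentences in text_demo are invented, so the counts mean nothing beyond the illustration.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the distinct last names in the Lahman table.
names_demo <- tibble(nameLast = c("Mathewson", "McGraw", "York"))

# Two invented lines of text.
text_demo <- tibble(
  line = 1:2,
  text = c("McGraw sent Mathewson to the mound in New York.",
           "McGraw knew the game.")
)

# Tokenize without lowercasing so words can match capitalized names,
# count the words, and keep only those that appear in the name list.
matches <- text_demo %>%
  unnest_tokens(word, text, to_lower = FALSE) %>%
  count(word) %>%
  inner_join(names_demo, by = c("word" = "nameLast")) %>%
  arrange(desc(n))

matches
```

The sketch also reproduces the “York” problem: the word is matched because it is in the name list, even though the sentence refers to the city.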

[Figure: last names from the Lahman database that appear in the book]

Takeaways

A few things I learned in this particular post.

  1. Since the R code sometimes gets garbled in a blog post, all of the R code for recreating this work is available on my GitHub Gist site.
  2. The tidytext package is a nice entry into the world of text processing. The accompanying book by Silge and Robinson seems helpful, and I plan on using this text to introduce text mining to a group on campus.
  3. There is a treasure of books available in Project Gutenberg that can be used in text mining. Next, I’m interested in trying sentiment analysis, following the guidance in Silge and Robinson’s text.
  4. I plan on reading more of Mathewson’s book. It is an interesting look at how baseball was played over 100 years ago.