Monthly Archives: January, 2014

Talkin’ baseball (on induction day)

A couple of weeks ago the BBWAA elected three players to the Baseball Hall of Fame. At the end of July, Greg Maddux, Tom Glavine and Frank Thomas, together with managers Joe Torre, Tony La Russa and Bobby Cox (previously elected by the Veterans Committee), will be honored in Cooperstown. On that day, like most of those who preceded them, they will give induction speeches.

On the Baseball Hall of Fame website you can find transcripts of many of the inductees’ speeches. Here, for example, you can find Babe Ruth’s speech.
If you have watched recent induction speeches on TV, you have certainly noticed that The Babe did not deliver a very long talk.
Just browse through the Hall’s website and you’ll see that monologues in Cooperstown used to be shorter than they are now.

I took the time to grab 135 induction speeches from the website, storing them in text files that I have made available on our GitHub.
And then I had some fun running some R code on them.

Let’s start by loading Babe Ruth’s acceptance speech.

speech = readLines("your/path/to/HOF/speeches/1939GeorgeHermanRuth.txt")

One of the R packages for text mining is named tm. We load that package, put the speech text into a corpus, and remove the punctuation.

library(tm)

myCorpus = Corpus(VectorSource(speech))
myCorpus = tm_map(myCorpus, removePunctuation)

Then we construct a document-term matrix; summing the counts of all its cells gives the length of the speech in terms of words.

dtm = DocumentTermMatrix(myCorpus)
sum(dtm)
[1] 128

Barely over a couple of seasons of homers for the Babe!
The most recent speech by a player I have grabbed is Andre Dawson’s. If you rerun the code above on his speech, you get 2148 words, a more than fifteen-fold increase!

Let’s try to see when hall-of-famers started talking a lot.
We first prepare an empty data frame where we will store the number of words for each speech. Then, with the help of the list.files function we get the file names of the speeches at our disposal. Finally we loop through all the files and store the word counts in the data frame we prepared.

# prepare data.frame with filename, count
speechLength = data.frame(speech=character(0), words=numeric(0))
# read speeches file names
speechFiles = list.files(path="your/path/to/HOF/speeches/")

# loop through speech and count words
for(speechFile in speechFiles){
  speech = readLines(paste("your/path/to/HOF/speeches/", speechFile, sep=""))  
  myCorpus = Corpus(VectorSource(speech))
  myCorpus = tm_map(myCorpus, removePunctuation)
  dtm = DocumentTermMatrix(myCorpus)
  speechLine = data.frame(speech = speechFile, words=sum(dtm))
  speechLength = rbind(speechLength, speechLine)
}

# add speech year
speechLength$year = as.numeric(substr(speechLength$speech, 1, 4))

Now it’s time to use R’s graphics engine, with the help of the ggplot2 package (see some ggplot2 tricks here).

library(ggplot2)

ggplot(data=speechLength, aes(x=year, y=words)) +
  geom_point()
Word counts of Hall of Fame induction speeches.

Before 1980 only a handful of baseball immortals exceeded 1000 words. The first guy to surpass the mark, back in 1966, eclipsed Hank Greenberg’s loquacity (565 words ten years before) by a country mile. In fact, he eclipsed the 2000-word mark as well!
Who’s the guy?
You can look it up! (using the subset function… but I have just given you quite a hint!)
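As a sketch of what that lookup looks like (the data frame is recreated here with a couple of rows so the snippet is self-contained: Ruth’s count is from the post, the second row is a made-up toy entry, not a real speech):

```r
# toy stand-in for the speechLength data frame built in the loop above
speechLength = data.frame(
  speech = c("1939GeorgeHermanRuth.txt", "1966SomeTalkativeInductee.txt"),
  words  = c(128, 2100),
  year   = c(1939, 1966)
)

# pre-1980 speeches that ran past 1000 words
subset(speechLength, year < 1980 & words > 1000)
```

Run the same `subset` call on the real data frame and the culprit pops right out.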

If you would like to visualize what these guys have actually talked about, R provides a package and a function to easily draw wordclouds, which are a very popular way to display the content of monologues (you have certainly seen lots of them during electoral campaigns).

In the code below, before generating the wordcloud, we use a few functions from the tm package to convert the text to lower case and to remove punctuation, numbers and stopwords.

speech = readLines("your/path/to/HOF/speeches/1966CharlesDillonStengel.txt")
myCorpus = Corpus(VectorSource(speech))

# set text to lower case, remove punctuation, numbers and stopwords
myCorpus = tm_map(myCorpus, content_transformer(tolower))
myCorpus = tm_map(myCorpus, removePunctuation)
myCorpus = tm_map(myCorpus, removeNumbers)
myCorpus = tm_map(myCorpus, removeWords, stopwords("english"))

# package for wordclouds
library(wordcloud)
# generate wordcloud
wordcloud(myCorpus, min.freq = 3, rot.per=0, scale=c(3,.3))

Just change the file name in the first line of the above code to get wordclouds for the hall-of-famers of your choice.
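If you plan to do that for several inductees, the cleaning-and-plotting steps can be wrapped in a small convenience function. This is just a sketch of mine (the name speechCloud is made up, and it assumes the tm and wordcloud packages are loaded as in the code above):

```r
# hypothetical convenience wrapper around the steps shown above;
# assumes library(tm) and library(wordcloud) have already been loaded
speechCloud = function(speechFile, minFreq = 3) {
  speech = readLines(speechFile)
  myCorpus = Corpus(VectorSource(speech))
  myCorpus = tm_map(myCorpus, content_transformer(tolower))
  myCorpus = tm_map(myCorpus, removePunctuation)
  myCorpus = tm_map(myCorpus, removeNumbers)
  myCorpus = tm_map(myCorpus, removeWords, stopwords("english"))
  wordcloud(myCorpus, min.freq = minFreq, rot.per = 0, scale = c(3, .3))
}

# e.g.: speechCloud("your/path/to/HOF/speeches/1966CharlesDillonStengel.txt")
```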

Casey Stengel (the first Mr. 2000).

Wordcloud for Casey Stengel’s induction speech.

Tony Gwynn (the record holder, at 3527 words, at least for the considered speeches).

Wordcloud for Tony Gwynn’s induction speech.

Satchel Paige (with his third-person habits).

Wordcloud for Satchel Paige’s induction speech.