Author Archive: bbaumer21

Scraping the Web for Analytics Directors

I am writing this traveling back from the SABR Analytics Conference, where I was lucky enough to see many friends and colleagues in the baseball analytics industry. [BTW, I also took home some hardware.]

These conferences always get me thinking about the state of the analytics industry in baseball. And in particular, about where teams are in terms of their embrace of analytics, and who is doing the work for them. This is a subject I’ve written about in the past, and it never fails to be an interesting mental exercise.

When Andrew Zimbalist and I researched this for The Sabermetric Revolution in 2013, and later when I circled back in 2015 for the ESPN article, I thought about trying to automate the gathering of this information. But each time, it seemed like more effort than it was worth – there are, after all, only 30 teams – and each would have to be vetted individually anyway.

However, new tools (rvest) have made this a little easier, and while I won’t claim that the results you’ll see today are perfect, I think this is still a nice exercise.

So, in this post I’ll show you how you can scrape data from the web to build a data table of baseball analytics directors.

Setup

What makes this possible is that all of the team websites are run by MLB.com, and they all have a fairly uniform layout. In particular, they all have a page called “Front Office” that has a relatively stable format (more on that later). This fact is obscured somewhat by the domain names. So for example, if you Google: “dodgers front office”, you’ll come to a page with this URL:

(http://losangeles.dodgers.mlb.com/team/front_office.jsp?c_id=la)

If you want the Red Sox front office instead, it seems like you’d have to change both the c_id parameter to bos and the domain name to boston.redsox.mlb.com. But if you try the URL:

(http://www.mlb.com/team/front_office.jsp?c_id=la)

it works. Thus, all we have to change is the c_id parameter, and that is easy. We will need a list of the team IDs. I tried to do this using the Lahman data.

library(Lahman)
library(dplyr)
teamIds <- Teams %>%
  filter(yearID == 2014) %>%
  select(teamID)
teamIds <- as.character(teamIds$teamID)

Unfortunately, only about half of these IDs match the ones that MLB.com uses. You could try some more clever way of getting these, but I just ended up hard-coding the list. (Again, there are only 30 of them!)

teamIds <- c("ARI", "ATL", "BAL", "BOS", "CWS", "CHC",
             "CIN", "CLE", "COL", "DET", "HOU", "KC",
             "ANA", "LA", "MIA", "MIL", "MIN", "NYY",
             "NYM", "OAK", "PHI", "PIT", "SD", "SEA",
             "SF", "STL", "TB", "TEX", "TOR", "WAS")

Scraping

Now that we have the URLs, we can use the rvest package to bring the data embedded on these pages into R. The critical insight here is that an HTML document is highly structured, and rvest exploits this structure to provide various ways of extracting only the information that we want.

If the data on these front office pages were in a <table> tag, this would be a short post, because the html_table function in rvest would do nearly all of the work for us. Unfortunately, the data are in <dl> tags, so we have to work a little harder. We use the html_nodes function to grab only those <dl> elements.

library(rvest)
url <- paste0("http://www.mlb.com/team/front_office.jsp?c_id=la")
dl_list <- read_html(url) %>%
  html_nodes("dl")
dl_list
## {xml_nodeset (23)}
##  [1] <dl>\n          <dt>Chairman</dt><dd><a href="javascript:void(windo ...
##  [2] <dl>\n\t\t<dt>Executive Assistant to President and CEO</dt><dd>Cher ...
##  [3] <dl>\n        \n              \n\t\t  <dt>Vice President, Amateur & ...
##  [4] <dl>              \n          <dt>Manager, Baseball Operations-Glen ...
##  [5] <dl>\n                    \n\t\t  <dt>Assistant, Baseball Operation ...
##  [6] <dl>       \n          <dt>Vice President, Marketing and Broadcasti ...
##  [7] <dl>\n          \n          <dt>Director, Public Relations</dt><dd> ...
##  [8] <dl>\t\t \n\t\t  <dt>Vice President, External Affairs and Community ...
##  [9] <dl>\n          \n          <dt>Vice President, Finance</dt><dd>Eri ...
## [10] <dl>          \n          <dt>Senior Director, Human Resources</dt> ...
## [11] <dl>\n          <dt>Vice President, Information Technology</dt><dd> ...
## [12] <dl>         \n          <dt>Senior Counsel</dt><dd>Chad Gunderson< ...
## [13] <dl>\n                     \n          <dt>Head Athletic Trainer</d ...
## [14] <dl>\n          <dt>Vice President, Merchandise</dt><dd>Allister An ...
## [15] <dl>\n           <dt>Manager, Planning &amp; Development</dt><dd>An ...
## [16] <dl>         \n          <dt>Vice President, Corporate Partnerships ...
## [17] <dl>\n          <dt>Vice President, Premium Sales &amp; Services</d ...
## [18] <dl>\n          <dt>Vice President, Ticket Sales</dt><dd>David Sieg ...
## [19] <dl>\n          <dt>Official Scorers</dt><dd>Jerry White, Ed Munson ...
## [20] <dl>\n          <dt>Vice President, Security and Guest Services</dt ...
## ...
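
For contrast, here is a minimal sketch of the easy case mentioned above, in which the data sit in a <table>. The markup and the names in it are made up for illustration; the real pages did not look like this.

```r
library(rvest)

# Hypothetical markup: the real MLB.com pages used <dl>, not <table>
html <- '<table>
  <tr><th>Title</th><th>Name</th></tr>
  <tr><td>Director, Baseball Analytics</td><td>Jane Doe</td></tr>
  <tr><td>Baseball Analyst</td><td>John Roe</td></tr>
</table>'

# read_html() accepts literal markup as well as a URL
staff <- read_html(html) %>%
  html_node("table") %>%
  html_table()
staff
```

With a table, one call returns a data frame whose column names come from the <th> cells, which is exactly the work we have to do by hand for the <dl> lists.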

Each <dl> tag contains a description list. In most cases, each “row” consists of a job title within <dt> tags and a person’s name in <dd> tags. We can extract both using html_nodes again.

titles <- html_nodes(dl_list[[2]], "dt") %>%
  html_text()
people <- html_nodes(dl_list[[2]], "dd") %>%
  html_text()

The two vectors titles and people should have the same length, so we can put them together in a data frame.

data.frame(title = titles, person = people)
##                                                                   title
## 1                              Executive Assistant to President and CEO
## 2                               Administrative Assistant, EVP/CMO & CFO
## 3 Administrative Assistant, Sr. Vice President Planning and Development
## 4                                            Assistant to Tommy Lasorda
##           person
## 1   Cheryl Rampy
## 2 Desiree Juarez
## 3 Cristin Haught
## 4  Sean Mobasser
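
Since the live pages can change or disappear, it may help to try the same extraction on a small inline document. This is only a sketch: the markup and names below are invented, and it assumes the simplest case of one <dt> and one <dd> per row.

```r
library(rvest)

# Hypothetical description list, written for illustration only
html <- '<dl>
  <dt>Director, Baseball Analytics</dt><dd>Jane Doe</dd>
  <dt>Baseball Analyst</dt><dd>John Roe</dd>
</dl>'

toy_dl <- read_html(html) %>%
  html_nodes("dl")

# Same two calls as above: titles from <dt>, names from <dd>
toy_titles <- html_nodes(toy_dl[[1]], "dt") %>%
  html_text()
toy_people <- html_nodes(toy_dl[[1]], "dd") %>%
  html_text()

toy <- data.frame(title = toy_titles, person = toy_people,
                  stringsAsFactors = FALSE)
toy
```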

OK, great. So all we need to do is code this as a function, iterate over all the teams, and combine the results. Unfortunately – as is so often the case with web scraping – the consistency of the formatting varies over different teams:

  • Sometimes, there are empty rows. We want to ignore these.
  • Seattle’s page includes a list of the board of directors at the top that is in a completely different format. We don’t care about this, so we’ll just ignore it.
  • De Jon Watson’s listing contains two <dt> tags instead of one <dt> and one <dd>. This is just an HTML coding error that we need to overcome.
  • The Yankees page contains footnotes. These should be ignored.
  • Finally, most of the time the titles are in the <dt> tags and the names are in the <dd> tags, but not always. Furthermore, they are sometimes switched even within one team’s listing. Our solution is to look for words that would only appear in a job title and never in a name (e.g. “Director”, “Manager”, etc.) and switch them if necessary.

Our function looks like this:

dl_to_table <- function(dl) {
  titles <- html_nodes(dl, "dt") %>%
    html_text()
  people <- html_nodes(dl, "dd") %>%
    html_text()
  # skip empty rows
  if (length(titles) < 1) {
    return(NULL)
  }
  # strip board of directors stuff from SEA
  if (length(titles) == 1 && any(grepl(";", titles))) {
    return(NULL)
  }
  # De Jon Watson coding bug
  if (titles[1] == "De Jon Watson") {
    people[1] <- titles[2]
    titles <- titles[1]
  }
  # remove NYY footnote
  people <- people[!grepl("\\* of Martinique", people)]
  # switch people and titles if necessary
  if (sum(grepl("(Director|President|Executive|Chief|Managing|General)", people)) > 0) {
    temp <- titles
    titles <- people
    people <- temp
  }
  return(data.frame(title = titles, person = people))
}

This will give us a data.frame of front office employees and their job titles, but only for one department.

dl_to_table(dl_list[[2]])
##                                                                   title
## 1                              Executive Assistant to President and CEO
## 2                               Administrative Assistant, EVP/CMO & CFO
## 3 Administrative Assistant, Sr. Vice President Planning and Development
## 4                                            Assistant to Tommy Lasorda
##           person
## 1   Cheryl Rampy
## 2 Desiree Juarez
## 3 Cristin Haught
## 4  Sean Mobasser
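
The switching step inside dl_to_table can be checked on its own. The sketch below pulls that logic into a hypothetical helper, fix_order (not part of the original function), and runs it on made-up vectors.

```r
# Words that should only ever appear in a job title (same pattern as above)
title_words <- "(Director|President|Executive|Chief|Managing|General)"

# Hypothetical helper: swap the two vectors if the "people" look like titles
fix_order <- function(titles, people) {
  if (any(grepl(title_words, people))) {
    list(titles = people, people = titles)
  } else {
    list(titles = titles, people = people)
  }
}

# Made-up example in which the page has names in <dt> and titles in <dd>
swapped <- fix_order(titles = c("Jane Doe", "John Roe"),
                     people = c("Director, Analytics", "General Counsel"))
swapped$titles

# Already in the right order, so nothing changes
same <- fix_order(titles = "Director, Analytics", people = "Jane Doe")
same$people
```

Note that the swap is all-or-nothing for a given <dl>, and that it only fires when at least one entry contains one of the listed words, so a list made up entirely of, say, “Analyst” titles would pass through unswapped.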

For each team, we’ll need to iterate over all of these departments, each one corresponding to a <dl> tag. lapply will handle the iteration, and bind_rows (which replaces the rbind_all function that older versions of dplyr provided) will put the resulting list of data.frames into a single data.frame. We also need to know to which team each list corresponds.

get_jobs <- function(code) {
  message(paste("Retrieving data for", code))
  url <- paste0("http://www.mlb.com/team/front_office.jsp?c_id=", code)
  dl_list <- read_html(url) %>%
    html_nodes("dl")
  out <- lapply(dl_list, dl_to_table) %>%
    bind_rows() %>%
    mutate(team = code)
  return(out)
}

Now we can grab the entire list for a single team with a single command.

get_jobs("LA")
## Source: local data frame [226 x 3]
## 
##                                                                          title
##                                                                          (chr)
## 1                                                                     Chairman
## 2                                                                      Partner
## 3                                                                      Partner
## 4                                                                      Partner
## 5                                                                      Partner
## 6                                          President and CEO of the LA Dodgers
## 7                                             President of Baseball Operations
## 8                                                     Executive Vice President
## 9                         Executive Vice President and Chief Marketing Officer
## 10 Chief Financial Officer and Managing Director of Guggenheim Baseball Manage
## ..                                                                         ...
## Variables not shown: person (chr), team (chr)

It remains only to loop over all of the teams. Once again the lapply + bind_rows paradigm is an elegant way to do this in R.

full <- lapply(teamIds, get_jobs) %>%
  bind_rows()
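
One practical wrinkle with a loop like this is that a single failed request stops the whole run. A minimal sketch of one way around that, using base R’s tryCatch, is below. The names safe_get and get_jobs_stub are hypothetical, the stub stands in for get_jobs so the example runs offline, and the failing "XXX" code is invented.

```r
# Stand-in for get_jobs() so this runs without a network connection;
# the "XXX" failure is made up to simulate a page that cannot be read
get_jobs_stub <- function(code) {
  if (code == "XXX") stop("page not found")
  data.frame(title = "Chairman", person = "Someone", team = code,
             stringsAsFactors = FALSE)
}

# Hypothetical wrapper: return NULL instead of stopping on an error
safe_get <- function(code, fetch = get_jobs_stub) {
  tryCatch(fetch(code), error = function(e) {
    message("Skipping ", code, ": ", conditionMessage(e))
    NULL
  })
}

results <- lapply(c("LA", "XXX", "BOS"), safe_get)

# Drop the failures, then stack what is left (base R, no dplyr needed)
full_toy <- do.call(rbind, Filter(Negate(is.null), results))
full_toy
```

In the real pipeline one would pass fetch = get_jobs and the full teamIds vector; the teams that were skipped show up in the messages and can be retried by hand.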

Analyzing the results

One thing that is immediately obvious – and problematic for our interests – is that not all teams put their full list of employees on the web. More damaging, while most of the teams at least put Baseball Operations on the web, the Mets and Yankees don’t.

full %>%
  group_by(team) %>%
  summarize(N = n()) %>%
  print.data.frame()
##    team   N
## 1   ANA 142
## 2   ARI 345
## 3   ATL 251
## 4   BAL 198
## 5   BOS 216
## 6   CHC 247
## 7   CIN 226
## 8   CLE 237
## 9   COL 157
## 10  CWS 191
## 11  DET 184
## 12  HOU 263
## 13   KC 189
## 14   LA 226
## 15  MIA 272
## 16  MIL 270
## 17  MIN 191
## 18  NYM  37
## 19  NYY  47
## 20  OAK 123
## 21  PHI 243
## 22  PIT 277
## 23   SD 253
## 24  SEA 203
## 25   SF 255
## 26  STL 176
## 27   TB 192
## 28  TEX 235
## 29  TOR 623
## 30  WAS 364

There isn’t much we can do about this. The next step is to search for analytics employees. Here again, there aren’t formal standards as to who is and who is not “in analytics”, but we can use some simple regular expressions to search for job titles that sound like analytics. At the same time, we don’t want to include business analytics or IT folks, so we have to exclude them.

analytics <- full %>%
  filter(grepl("(Research|Analytics|Analyst|Quantitative|Informatics|Baseball Information Services|Baseball Systems)", title)) %>%
  filter(!grepl("(Business|Marketing|Ticket|CRM|Radio|Finance|Financial|IT|Accounting|Color|HR|Telecommunication|Desktop|TV|Sales|Network|Brand|Television|Procurement|Counsel|Information Technology|Client|Payroll|Systems Analyst|Technical Analyst|System Analyst|Programmer Analyst)", title))
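
To see what the two patterns keep and drop, here is a small base R sketch that applies them to a handful of made-up job titles (none of these are taken from the scraped data). It uses plain subsetting rather than filter so that it runs without dplyr.

```r
include <- "(Research|Analytics|Analyst|Quantitative|Informatics|Baseball Information Services|Baseball Systems)"
exclude <- "(Business|Marketing|Ticket|CRM|Radio|Finance|Financial|IT|Accounting|Color|HR|Telecommunication|Desktop|TV|Sales|Network|Brand|Television|Procurement|Counsel|Information Technology|Client|Payroll|Systems Analyst|Technical Analyst|System Analyst|Programmer Analyst)"

# Hypothetical titles, chosen to exercise both patterns
toy_titles <- c("Director, Baseball Analytics",
                "Business Analyst",
                "Vice President, Ticket Sales",
                "Quantitative Analyst",
                "Systems Analyst",
                "Color Analyst")

# Keep titles that match the first pattern and not the second
kept <- toy_titles[grepl(include, toy_titles) & !grepl(exclude, toy_titles)]
kept
```

Because grepl is case-sensitive by default, the short alternatives such as “IT” and “HR” only match capitalized abbreviations and not the same letters inside an ordinary word like “Quantitative”.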

A first question is how many analytical employees each team has.

analytics %>%
  group_by(team) %>%
  summarize(N = n()) %>%
  arrange(desc(N))
## Source: local data frame [27 x 2]
## 
##     team     N
##    (chr) (int)
## 1     TB    13
## 2     LA    11
## 3    HOU     7
## 4    MIL     7
## 5    CHC     6
## 6    PIT     6
## 7    WAS     5
## 8    ANA     4
## 9     KC     4
## 10   MIN     4
## ..   ...   ...

While there are many obvious limitations to these counts, the Dodgers’ recent hiring binge is evident, as are the substantial staffs of the Rays, Astros, and Pirates.

My initial interest was in identifying the director of analytics on each team. We can use grepl to narrow the list down to these people. We use a right_join here because we also want to note the teams for which we couldn’t find an analytics director.

analytics %>%
  filter(grepl("(Director|Senior)", title) & !grepl("Assistant", title)) %>%
  right_join(data.frame(teamIds), by = c("team" = "teamIds")) %>%
  print.data.frame()
##                                                      title
## 1              Director of Baseball Analytics and Research
## 2                                                     <NA>
## 3                             Director, Baseball Analytics
## 4                                \nSenior Baseball Analyst
## 5                                                     <NA>
## 6                         Director, Research & Development
## 7                                                     <NA>
## 8       Senior Director, Baseball Research and Development
## 9                       Senior Developer, Baseball Systems
## 10                                                    <NA>
## 11                                                    <NA>
## 12                    Director of Research and Development
## 13                  Director Baseball Operations/Analytics
## 14            Director Baseball Analytics/Player Personnel
## 15            Director Baseball Analytics/Research Science
## 16                         Director, Quantitative Analysis
## 17                        Director, Research & Development
## 18                      Director,  Analytics & Development
## 19                              Senior Director, Analytics
## 20                       Director-Research and Development
## 21                                                    <NA>
## 22                                                    <NA>
## 23                                                    <NA>
## 24                              Director, Baseball Systems
## 25             Director, Baseball Research and Development
## 26                          Director, Baseball Informatics
## 27                 Director, Baseball Information Services
## 28                                                    <NA>
## 29 Director, Minor League Operations/Quantitative Analysis
## 30                                                    <NA>
## 31                            Director, Scouting Analytics
## 32             Director, Pitching Research and Development
## 33                              Director, Baseball Systems
## 34                          Director of Baseball Analytics
## 35                                     Director, Analytics
## 36               Director, Baseball Research & Development
##                 person team
## 1         Dr. Ed Lewis  ARI
## 2                 <NA>  ATL
## 3         Sarah Gelles  BAL
## 4          Tom Tippett  BOS
## 5                 <NA>  CWS
## 6          Chris Moore  CHC
## 7                 <NA>  CIN
## 8       Sky Andrecheck  CLE
## 9          Don Crislip  CLE
## 10                <NA>  COL
## 11                <NA>  DET
## 12          Mike Fast   HOU
## 13       Mike Groopman   KC
## 14       John Williams   KC
## 15         Daniel Mack   KC
## 16      Jonathan Luman  ANA
## 17        Doug Fearing   LA
## 18         Royce Cohen   LA
## 19          Jason Pare  MIA
## 20   Daniel Turkenkopf  MIL
## 21                <NA>  MIN
## 22                <NA>  NYY
## 23                <NA>  NYM
## 24       Rob Naberhaus  OAK
## 25          Andy Galdi  PHI
## 26             Dan Fox  PIT
## 27       Matt Klotsche   SD
## 28                <NA>  SEA
## 29   Yeshayah Goldfarb   SF
## 30                <NA>  STL
## 31       Shawn Hoffman   TB
## 32         Joshua Kalk   TB
## 33       Brian Plexico   TB
## 34      Todd Slavinsky  TEX
## 35         Joe Sheehan  TOR
## 36 Samuel Mondry-Cohen  WAS

I don’t see any false positives here – all of the people we identified belong on this list. Where are our misses?

  • ATL: Matt Grabowski is “Assistant Director, Scouting and Analytics”, but we have specifically excluded “Assistant”s
  • CWS: It’s not clear that anyone really fits the description. Dan Fabian (Senior Director of Baseball Operations) is in a more general role, and Dan Strittmatter is still junior (Coordinator of Baseball Information)
  • CIN: Sam Grossman is now Assistant GM (Congratulations, Sam!), and Michael Schatz has come over from Oakland but doesn’t appear on the website yet.
  • COL: Trevor Patch is “Manager – Baseball Analytics”
  • DET: Sam Menzin is now Director of Baseball Operations, but it’s not clear who should be here in his stead
  • MIN: Jack Goin is at the Manager level
  • NYY: Michael Fishman is now Assistant GM, and they don’t put any data on the website, so it’s hard to know who is now the analytics director
  • NYM: TJ Barra is “Manager, Baseball Research and Development”, but they don’t put any data on the website
  • SEA: Jesse Smith is “Manager – Baseball Analytics”
  • STL: The Cardinals’ director of analytics is, um, no longer working for the team

So there you have it. Now that the Phillies and Marlins have made hires this offseason, pretty much every team in baseball has someone at the Manager or Director level dedicated to overseeing baseball analytics.

Epilogue

You could, of course, use these data to search for other roles within baseball, but I’ll leave that to you.

full %>%
  filter(grepl("Assistant General Manager", title))