Scraping Baseball-Reference Pages, Mike Piazza and Junior

Baseball-Reference is a great source of baseball data. For each MLB player in history, it contains much of the Retrosheet data, and it also lists many of the modern batting measures for each season. Moreover, the web pages on this site have a convenient structure that facilitates reading the tables directly into R. In this post, we’ll illustrate the use of the XML package to read in several Baseball-Reference pages and compare career trajectories of historical players using any batting statistic of interest. In honor of Mike Piazza’s and Ken Griffey’s recent induction into the HOF, we’ll compare their hitting trajectories against those of similar great catchers and outfielders.

Scraping a B-R Table

Look at a Baseball-Reference page for a non-pitcher, say Barry Bonds.

Note the url —

This url is the character string, followed by the first letter of the player’s last name, a slash, the usual Lahman player id, and .shtml. This simple structure makes it easy to download the page with the readHTMLTable function.
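
As a minimal sketch of that recipe, the helper below pastes the pieces together. The name player_page_url is hypothetical, and "BASE/" is only a stand-in for the character string mentioned above, which is not reproduced here.

```r
# Hypothetical helper: build a player-page URL from a base string and a
# Lahman-style player id, following the recipe described in the text.
player_page_url <- function(base_url, player_id) {
  first_letter <- substr(player_id, 1, 1)   # first letter of the last name
  paste0(base_url, first_letter, "/", player_id, ".shtml")
}

# "BASE/" is a placeholder, not the real address.
bonds_url <- player_page_url("BASE/", "bondsba01")
print(bonds_url)
```

With the real base string in place of the placeholder, the result could be passed straight to readHTMLTable.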

On this Baseball-Reference page there are a number of html tables. “Standard Batting” gives the season-by-season standard batting statistics, “Player Value” gives a number of modern batting measures, “Postseason Batting” gives batting statistics for each postseason that Bonds played in, etc. We will focus on the first two tables.

To download this page into R, we use the readHTMLTable function from the XML package.

library(XML)
d <- readHTMLTable("")

Note that the variable d is a list with the following elements:

 [1] "NULL"               "batting_standard"  
 [3] "batting_value"      "batting_postseason"
 [5] "standard_fielding"  "NULL"              
 [7] "NULL"               "NULL"              
 [9] "NULL"               "salaries" 
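
Since several elements carry the name "NULL", it can help to keep only the named tables. The sketch below does this on a small mock list; d_mock and its one-row data frames are made up for illustration and only mimic the shape of what readHTMLTable returns.

```r
# Mock stand-in for the scraped list; the values are placeholders,
# not real statistics.
d_mock <- list(
  "NULL"             = NULL,
  "batting_standard" = data.frame(Year = "Y1", HR = "0"),
  "batting_value"    = data.frame(Year = "Y1", WAR = "0.0"),
  "NULL"             = NULL
)

# Keep only the elements whose name is not the string "NULL".
tables <- d_mock[names(d_mock) != "NULL"]
print(names(tables))
```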

We can access the Standard Batting table:

standard <- d$batting_standard
head(standard, 1)
  Year Age      Tm Lg  G  PA  AB  R  H 2B 3B HR RBI SB CS BB SO
1 1985  20 PIT-min  A 71 296 254 49 76 16  4 13  37 15  3 37 52
    BA  OBP  SLG  OPS OPS+  TB GDP HBP SH SF IBB Pos     Awards
1 .299 .383 .547 .930      139       0  1  4   0     PRW · CARL

tail(standard, 1)
     Year Age  Tm Lg   G  PA  AB  R  H 2B 3B HR RBI SB CS  BB SO
24 2007 ★  42 SFG NL 126 477 340 75 94 14  0 28  66  5  0 132 54
     BA  OBP  SLG   OPS OPS+  TB GDP HBP SH SF IBB  Pos Awards
24 .276 .480 .565 1.045  169 192  13   3  0  2  43 *7/D     AS

Note that, unlike what you see on the Baseball-Reference page, the first row contains stats for Bonds’ first minor league season and the last row is his final MLB season (the “Total” rows are removed).

Plotting Bonds’ Home Run Counts

Okay, suppose we wish to plot Bonds’ home run numbers against age for only his MLB seasons. (We have to be careful about the variable types: Age and HR are both factor variables that need to be converted to numeric.)

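library(dplyr); library(ggplot2)
standard <- mutate(standard, Age = as.numeric(as.character(Age)),
                   HR = as.numeric(as.character(HR)))
ggplot(filter(standard, Lg == "NL"), aes(Age, HR)) + geom_point()  # geom assumed; the original line is cut off here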
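
The factor caveat is easy to trip over, so here is a small base R illustration on an illustrative vector (hr is a stand-in, not a column taken from the scraped table). Calling as.numeric directly on a factor returns the internal level codes; going through as.character first recovers the printed values.

```r
# An illustrative factor of home run counts, as a scraped column might arrive.
hr <- factor(c("16", "25", "24", "19"))

codes  <- as.numeric(hr)                 # integer level codes, not the counts
counts <- as.numeric(as.character(hr))   # the counts themselves

print(codes)
print(counts)
```

The as.character step is harmless if the column is already a character vector, so the same conversion works in either case.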


Comparing Trajectories

The ease of directly reading Baseball-Reference tables motivated me to write a function, compare_batting_trajectories (in the file comparing_baseball_trajectories.R), that compares the trajectories of a group of players using a table and an associated batting statistic. The inputs to the function are …

  • Names: a vector of player names in which each name is either a full name “Barry Bonds” or the associated Lahman playerID “bondsba01” (Sometimes we have to use the playerID for players like Pete Rose and Ken Griffey Jr. who don’t have unique MLB names.)
  • table: what Baseball-Reference table is accessed (default is “batting_value”, but one can also use “batting_standard”)
  • stat: what batting statistic is used (default is “oWAR”)
  • NCOL: how many columns to use in the ggplot2 facet display (default is 1)
  • playerIDs: a vector indicating if we are using playerIDs or not in the Names vector (default is FALSE)

– I was careful to remove minor league stats.
– When a player played for multiple teams over a season, I summed the measures over the teams to get a summary measure (this works for the batting_value table). This summation won’t make sense for some measures like a batting average.
– For the batting_standard table, I used the “TOT” value for team to find the seasons with multiple teams.
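
The first two cleaning steps can be sketched in base R as follows. This is not the code inside the function, only one way to do it under stated assumptions: the data frame value and its numbers are made up, and minor league rows are assumed to be identifiable by the “-min” suffix in the Tm column (as in the “PIT-min” row shown earlier).

```r
# Made-up rows in the shape of a value table: one minor league season and
# one season (age 22) split between two teams.
value <- data.frame(
  Age = c(20, 21, 22, 22),
  Tm  = c("AAA-min", "BBB", "BBB", "CCC"),
  WAR = c(NA, 2.0, 1.5, 0.5)
)

# Drop minor league rows (assumption: their team label contains "-min").
mlb <- value[!grepl("-min", value$Tm, fixed = TRUE), ]

# Sum the statistic over teams within each season, indexed here by age.
by_age <- aggregate(WAR ~ Age, data = mlb, FUN = sum)
print(by_age)
```

Summing is reasonable for a counting measure like WAR, but a rate such as batting average would need to be recomputed from its components instead.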

Mike Piazza and Three Similar Catchers

I have always been interested in Mike Piazza since he was born in Norristown, PA, which is close to where I grew up in suburban Philadelphia. According to Bill James’ similarity scores as reported on Baseball-Reference, Piazza is similar to Johnny Bench, Yogi Berra, and Carlton Fisk. I initially compare the trajectories of the four players with respect to OPS+ (one of the batting statistics on the “Standard Batting” table). This is a measure of total offensive contribution. One nice feature of OPS+ is that it adjusts to the league average: an OPS+ of 100 corresponds to an average performance.

d <- compare_batting_trajectories(c("Mike Piazza", "Johnny Bench", 
                                    "Yogi Berra", "Carlton Fisk"),
                                  stat="OPS+", NCOL=2)
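
To see why 100 marks an average hitter, here is a simplified sketch of the commonly cited OPS+ formula. It ignores the park adjustment that the published values include, the helper name ops_plus is hypothetical, and the league values in the example are made up.

```r
# Simplified OPS+ (no park adjustment): on-base and slugging percentages are
# each scaled by the league value, then combined and put on a 100 scale.
ops_plus <- function(obp, slg, lg_obp, lg_slg) {
  100 * (obp / lg_obp + slg / lg_slg - 1)
}

# A hitter who exactly matches the (made-up) league values scores 100.
average_hitter <- ops_plus(0.330, 0.420, lg_obp = 0.330, lg_slg = 0.420)
print(average_hitter)
```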


Let’s next compare the four players with respect to WAR (this is reported in the Player Value table). In contrast to OPS+, WAR measures the overall performance of a player (including offense and defense) over the value of a replacement player.

d <- compare_batting_trajectories(c("Mike Piazza", "Johnny Bench", 
                                    "Yogi Berra", "Carlton Fisk"),
                                  stat="WAR", NCOL=2)


The OPS+ graph confirms that Piazza was probably the best hitting catcher of the four, especially early in his career. But the WAR graph, which includes both offensive and defensive contributions, seems to indicate that Johnny Bench contributed more wins to his team.

Junior Griffey

Here we compare the WAR contributions of Ken Griffey Jr with three similar outfielders — we have to use Griffey’s playerID since we want to distinguish his stats from those of his father.

d <- compare_batting_trajectories(c("griffke02", "Frank Robinson",
                                    "Rafael Palmeiro", "Gary Sheffield"),
                                  stat="WAR", NCOL=2,
                                  playerIDs=c(TRUE, FALSE, FALSE, FALSE))


Although these four players have similar career stats, they have different career trajectories.

  • Junior Griffey had a remarkable career from 20 to 30, but did not age well.
  • Frank Robinson maintained a large WAR value until about age 34 and then declined, but at a more gradual rate than Griffey.
  • Rafael Palmeiro had a pretty symmetric WAR pattern — increasing from 20 to 25, pretty constant from 25 to 35, and a decline in his final seasons.
  • Gary Sheffield, in contrast, had an increasing pattern of season WAR values until age 35 and then declined.

I’d encourage the reader to try this function with different groups of players and different measures — it would be nice to develop a Shiny app for this.


2 responses

  1. A concerned citizen

    So… scraping BRef sites is a direct violation of their terms of service, and is something Sean’s known to not be okay with. Just saying. Maybe you talked to him about this, I don’t know, but if you didn’t then this seems ill-advised.

    1. Thanks for your concern. I did communicate with Sean. He said that it was okay to keep the post, but they don’t support mass downloading. Also these functions may not work with future redesigns of the pages.
