Bryce Harper was in the news over the weekend, but for the wrong reasons. Both Harper and Joey Votto are having remarkable on-base percentage seasons, so I thought it would be interesting to explore how each of them has been getting on base through the 2015 season. I’ll show, step by step, how one can produce an interesting graphic comparison.
First, I find the game log data from Baseball-Reference for both players. The XML package makes it relatively easy to pick up the appropriate tables directly from the web pages.
library(XML)
d <- readHTMLTable("http://www.baseball-reference.com/players/gl.cgi?id=harpebr03&t=b&year=2015")
harper <- d$batting_gamelogs
d <- readHTMLTable("http://www.baseball-reference.com/players/gl.cgi?id=vottojo01&t=b&year=")
votto <- d$batting_gamelogs
One annoying thing is that all of the variables are read into the data frames as factors. I quickly convert all of the columns for both data frames to character type (using the as.character function).
for (j in 1:38) harper[, j] <- as.character(harper[, j])
for (j in 1:38) votto[, j] <- as.character(votto[, j])
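The reason for going through character first is that calling as.numeric directly on a factor returns its internal level codes rather than the printed values. The sketch below shows this on a made-up toy data frame (the values are illustrative, not taken from Baseball-Reference). It also shows one way to convert every column without hard-coding the column count, which is an alternative to the 1:38 loop and assumes nothing about the table's width.

```r
# Toy data frame standing in for a scraped table (values are made up).
toy <- data.frame(Rk = c("1", "2", "Rk"),
                  H  = c("2", "10", "H"),
                  stringsAsFactors = TRUE)   # force factor columns, as in the post

# Pitfall: as.numeric() on a factor returns the internal level codes,
# not the numbers shown on the page.
codes <- as.numeric(toy$H)

# Convert every column to character without hard-coding a column count.
# Assigning into toy[] keeps the data frame structure.
toy[] <- lapply(toy, as.character)

sapply(toy, class)
```

After this step the header text such as "H" is still present as an ordinary string, so it can be filtered out before the numeric conversion.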
Using dplyr functions, I select the on-base variables and filter out the rows that are not numeric (the header rows that are repeated inside the table). I create a new data frame combining the Harper and Votto data frames, and finally I convert all of the character data to numeric so I can make computations with these data.
library(dplyr)
select(harper, Rk, AB, H, BB, HBP, SF) %>%
  filter(H != "H") -> Harper
select(votto, Rk, AB, H, BB, HBP, SF) %>%
  filter(H != "H") -> Votto
BothPlayers <- rbind(data.frame(Player="Harper", Harper),
                     data.frame(Player="Votto", Votto))
for(j in 2:7) BothPlayers[, j] <- as.numeric(BothPlayers[, j])
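To see what the filter and the numeric conversion are doing, here is a minimal sketch on a made-up game log with one repeated header row. It uses base R subsetting instead of dplyr so that it runs without extra packages. The data frame name gamelog and its values are hypothetical.

```r
# Toy game log with a repeated header row, stored as character columns
# (made-up values, not real game data).
gamelog <- data.frame(Rk = c("1", "2", "Rk", "3"),
                      AB = c("4", "3", "AB", "5"),
                      H  = c("2", "0", "H",  "3"),
                      stringsAsFactors = FALSE)

# Keep only the rows where H is not the literal header text "H".
clean <- gamelog[gamelog$H != "H", ]

# Only number-like strings remain, so every column converts cleanly.
clean[] <- lapply(clean, as.numeric)

clean
```

Had the header row been left in, as.numeric would have turned its entries into NA with a warning, which is why the filter comes first.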
I use the ggplot2 package to construct my graph of the game-to-game OBP values for both players. I add smoothing curves to look at the pattern of each player across the 2015 season. I played with the span argument in geom_smooth to get a reasonable smooth.
library(ggplot2)
ggplot(BothPlayers, aes(Rk, (H + BB + HBP) / (AB + BB + HBP + SF), color=Player)) +
  geom_point() +
  facet_wrap(~ Player, ncol=1) +
  geom_smooth(span=.4) +
  ylab("OBP") +
  xlab("Game Number") +
  ggtitle("2015 Game OBPs for Bryce Harper and Joey Votto")
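The y aesthetic in the plot is the usual on-base percentage formula, (H + BB + HBP) / (AB + BB + HBP + SF), applied to each game separately. As a rough sketch on made-up counts (not real 2015 data), the helper below, which I call obp purely for illustration, computes the per-game values and the aggregate value over all games.

```r
# Made-up counts for four games (not real 2015 data).
games <- data.frame(AB  = c(4, 3, 0, 5),
                    H   = c(2, 0, 0, 3),
                    BB  = c(1, 2, 0, 0),
                    HBP = c(0, 0, 0, 0),
                    SF  = c(0, 0, 0, 1))

# Hypothetical helper: OBP from any list or data frame holding these counts.
obp <- function(d) with(d, (H + BB + HBP) / (AB + BB + HBP + SF))

# One value per game; a game with no plate appearances gives 0/0 = NaN.
game_obp <- obp(games)

# Aggregate OBP: sum the counts first, then apply the formula once.
total_obp <- obp(as.list(colSums(games)))
```

Two things are worth noting. A game with no plate appearances (for instance, an appearance only as a defensive replacement) produces NaN and will not be drawn as a point. And the aggregate OBP is a ratio of sums, so it generally differs from the plain average of the per-game values.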
We see from the graph that Harper has been extremely consistent in getting on base: his “local OBP” has been close to .500 for the entire season. In contrast, Votto had a relatively mediocre OBP in the first half (I see a slump around games 20-40), but he has been “on fire” in the second half of the 2015 season, with an average OBP larger than .500 (see the cluster of 1.000 OBP values around game 100). I believe that Harper is a shoo-in for the NL MVP this season, partly due to his on-base ability and partly due to his slugging performance.
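One way to put rough numbers on a “local OBP” is a moving window over consecutive games. This is a different technique from the smooth that geom_smooth draws, and the function name rolling_obp and the window width below are hypothetical choices for illustration. The sketch sums the times on base and the plate appearances within each window and then divides, so games with more plate appearances count for more.

```r
# Hypothetical helper: OBP over every run of `width` consecutive games.
# on_base = H + BB + HBP per game; pa = AB + BB + HBP + SF per game.
rolling_obp <- function(on_base, pa, width) {
  n <- length(pa)
  if (n < width) return(numeric(0))
  # Cumulative sums with a leading 0 make each window sum a difference.
  cs_on <- c(0, cumsum(on_base))
  cs_pa <- c(0, cumsum(pa))
  end <- width:n                       # index of the last game in each window
  (cs_on[end + 1] - cs_on[end + 1 - width]) /
    (cs_pa[end + 1] - cs_pa[end + 1 - width])
}

# Made-up per-game totals (not real 2015 data).
on_base <- c(3, 2, 0, 3)
pa      <- c(5, 5, 0, 6)
local   <- rolling_obp(on_base, pa, width = 2)
```

With the real data, on_base and pa would be built from the H, BB, HBP, AB and SF columns of one player's rows, and a wider window would give a steadier curve at the cost of detail.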