xFIP, K/9, and BB/9 Across Season Halves: Little Tricks with Base R Graphics

If you’re a regular reader of this blog, you likely have noticed that the ggplot2 package is put to good use. Carson has even built it right into his pitchRx package. You may have also noticed that I haven’t used it here. One of the main reasons is that I just haven’t taken the time to sit down with the package and figure out the syntax for it. It has some great virtues, and I encourage you to learn it as well.

However, there is also something to be said about being able to fiddle with the base graphics in R, and in doing so, I’ve learned more about some of the packages available that we might otherwise not think about using for graphics themselves. So, in this post, I’m going to have a little bit of fun and play around with color scales and for loops in the context of R’s base graphics.

For this post, I’m going to use the scales package, so make sure to have that library installed and ready to go. We’ll be using data I put together on pitcher splits across the first and second “halves” of the season, as defined by Fangraphs. The data come directly from Fangraphs for 2012 through 2014, and you can grab them right here. We’ll use these data to see if there are any patterns among starting pitchers who are traded at the deadline, along with some other exploratory analyses. After all, it might be interesting to know if pitchers improve after being traded to a contender, perhaps due to a little extra motivation from a playoff push (Side Note: if that excites you, you’re going to be disappointed with our limited data). Let’s go ahead and load up the data, then take a look at what we have:

###set working directory
setwd("c:/...")

###load data
dat <- read.csv(file="SPHalves12to14.csv", header=TRUE)
head(dat)
  year half              name  g gs    ip   k9  bb9  hr9 babip lob_pct gb_pct hr_fb  era  fip xfip war
1 2013    1        Jake Peavy 11 11  67.0 8.87 2.01 1.34 0.293   0.708  0.367 0.118 4.30 3.73 3.53 1.1
2 2012    1 Francisco Liriano 19 14  83.1 8.86 5.40 0.76 0.296   0.660  0.460 0.101 5.08 4.16 4.29 0.8
3 2012    1  Justin Verlander 18 18 132.2 8.68 2.04 0.75 0.246   0.746  0.403 0.083 2.58 2.97 3.35 3.7
4 2012    1      Zack Greinke 19 19 111.0 9.00 2.11 0.41 0.335   0.729  0.538 0.066 3.32 2.38 2.80 3.5
5 2012    1        Chris Sale 15 15 101.2 8.59 2.12 0.44 0.255   0.802  0.450 0.056 2.21 2.59 3.25 3.3
6 2012    1       R.A. Dickey 17 17 120.0 9.23 1.95 0.68 0.258   0.795  0.512 0.102 2.40 2.79 2.90 3.2

nrow(dat)
[1] 573

You should be able to see that we have an indicator of which half we’re looking at for each observation, as well as useful things like strikeouts per 9 innings pitched (k9), walks per 9 innings pitched (bb9), and fielding independent pitching (fip and xfip). We’ll focus on these for today. Note that there are 573 observations in the data. That is an odd number, so at least one pitcher-year is missing a half, meaning that we don’t have qualified halves for some of the pitchers in the sample. That’s OK, as we’re just doing some exploratory looks.
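If you want to see which pitcher-years are missing a half, a quick count of rows per pitcher-year does the job. The sketch below uses a tiny made-up data frame (`toy`, with invented pitcher names) in place of `dat`, so the names and numbers are assumptions for illustration only.

```r
# Toy stand-in for `dat` (made-up rows, not the Fangraphs data)
toy <- data.frame(
  name = c("Pitcher A", "Pitcher A", "Pitcher B", "Pitcher C", "Pitcher C"),
  year = c(2012, 2012, 2012, 2013, 2013),
  half = c(1, 2, 1, 1, 2)
)

# Number of rows (halves) available for each pitcher-year
halves_per_py <- table(paste(toy$name, toy$year, sep = "_"))

# How many pitcher-years have one half vs. two halves
n_by_count <- table(halves_per_py)
n_by_count

# Pitcher-years with only one qualified half
single_half <- names(halves_per_py)[halves_per_py == 1]
single_half
```

Swapping `toy` for `dat` should give the same kind of summary for the real file.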

First, I want to create a new identifier that allows us to match the pitcher-year specific halves with one another. Then, we’ll create an indicator variable that a given pitcher was traded near the deadline to a contender. I grabbed a short list of these pitchers for the years of our data from Baseball America’s Trade Central pages. Let’s go ahead and do this now with the code below for the pitchers that were traded to contenders:

###create player-year unique identifier
dat$name_year <- paste(dat$name,"_",dat$year,sep="")

###identify players traded near deadline
dat$contenderTrade <- ifelse(dat$name_year=="" |
    dat$name_year=="Jeff Samardzija_2014" |
    dat$name_year=="Brandon McCarthy_2014" |
    dat$name_year=="Jake Peavy_2014" |
    dat$name_year=="Justin Masterson_2014" |
    dat$name_year=="Jon Lester_2014" |
    dat$name_year=="John Lackey_2014" |
    dat$name_year=="Drew Smyly_2014" |
    dat$name_year=="David Price_2014" |
    dat$name_year=="Scott Feldman_2013" |
    dat$name_year=="Ricky Nolasco_2013" |
    dat$name_year=="Matt Garza_2013" |
    dat$name_year=="Jake Peavy_2013" |
    dat$name_year=="Bud Norris_2013" |
    dat$name_year=="Brett Myers_2012" |
    dat$name_year=="Anibal Sanchez_2012" |
    dat$name_year=="Zack Greinke_2012" |
    dat$name_year=="Francisco Liriano_2012" |
    dat$name_year=="Paul Maholm_2012" |
    dat$name_year=="Ryan Dempster_2012" |
    dat$name_year=="Joe Blanton_2012", 1, 0)

###take a look
head(dat)
  year half              name  g gs    ip   k9  bb9  hr9 babip lob_pct gb_pct hr_fb  era  fip xfip war              name_year contenderTrade
1 2013    1        Jake Peavy 11 11  67.0 8.87 2.01 1.34 0.293   0.708  0.367 0.118 4.30 3.73 3.53 1.1        Jake Peavy_2013              1
2 2012    1 Francisco Liriano 19 14  83.1 8.86 5.40 0.76 0.296   0.660  0.460 0.101 5.08 4.16 4.29 0.8 Francisco Liriano_2012              1
3 2012    1  Justin Verlander 18 18 132.2 8.68 2.04 0.75 0.246   0.746  0.403 0.083 2.58 2.97 3.35 3.7  Justin Verlander_2012              0
4 2012    1      Zack Greinke 19 19 111.0 9.00 2.11 0.41 0.335   0.729  0.538 0.066 3.32 2.38 2.80 3.5      Zack Greinke_2012              1
5 2012    1        Chris Sale 15 15 101.2 8.59 2.12 0.44 0.255   0.802  0.450 0.056 2.21 2.59 3.25 3.3        Chris Sale_2012              0
6 2012    1       R.A. Dickey 17 17 120.0 9.23 1.95 0.68 0.258   0.795  0.512 0.102 2.40 2.79 2.90 3.2       R.A. Dickey_2012              0
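As an aside, the long chain of `==` and `|` comparisons can be written more compactly with `%in%`. This is only a sketch of an alternative: `toy` and the shortened `traded` vector are stand-ins I made up for illustration, using a few of the names from the list above.

```r
# Toy stand-in for `dat`; names are taken from the first rows shown above
toy <- data.frame(
  name = c("Jake Peavy", "Justin Verlander", "Zack Greinke", "Chris Sale"),
  year = c(2013, 2012, 2012, 2012),
  stringsAsFactors = FALSE
)
toy$name_year <- paste(toy$name, toy$year, sep = "_")

# Shortened version of the traded list (the full list is in the code above)
traded <- c("Jake Peavy_2013", "Zack Greinke_2012", "Jon Lester_2014")

# %in% returns TRUE/FALSE; as.integer() turns that into 1/0
toy$contenderTrade <- as.integer(toy$name_year %in% traded)
toy
```

Keeping the list in its own vector also makes it easier to add or remove pitchers later.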

Now we’re ready to start exploring. First, let’s take a look at whether there is any noticeable pattern for pitchers in the first vs. second half with respect to strikeouts per 9 innings pitched. If things are systematic, we should see the lines, on average, moving upward or downward from left to right in our figure. We’ll get to the traded players later to see if they differ from the pack in any way. In this plot, I am going to make use of a for loop to plot each player’s respective statistics for the first and second half.

The colors I’ll use allow for more customization. In a “#RRGGBBTT” string, the first three pairs of digits give the amounts of Red, Green, and Blue, respectively, and the last pair (TT) sets the opacity of the color. Each pair is a hexadecimal number running from 00 to FF (0 to 255 in decimal), and for the last pair 00 is fully transparent while FF is fully opaque. We’ll leave this number relatively low so that we can see the overlap of lines throughout the plot. Further, I use rect and grid to lay down a gray rectangle background and a reference grid for the axes, respectively. Prior to plotting each player, I also calculate the average first and second half numbers for each of our variables of interest. We can draw the averages on our plots for reference, and drawing the individual lines gives us an idea of the variability in our data, sort of like a fancy replacement for error bars (NOTE: you should not use this in lieu of error bars–again, we’re just being exploratory).
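To check what one of these color strings actually means, base R can decode it with `col2rgb()` and rebuild it with `rgb()`. A small sketch using the translucent blue from the plots below; the variable names are mine.

```r
# Decode the translucent blue used for the player lines
blue20 <- col2rgb("#0000ff20", alpha = TRUE)
blue20   # rows: red, green, blue, alpha, each on a 0-255 scale

# Hex pair "20" is 32 in decimal, so the color is about 13% opaque
alpha_frac <- blue20["alpha", 1] / 255
alpha_frac

# Build the same color from decimal components
rebuilt <- rgb(0, 0, 255, 32, maxColorValue = 255)
rebuilt
```

Each pair is a hexadecimal number, which is why “ff” is the maximum for a channel.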

###aggregate first and second half stats
datagg <- data.frame(tapply(dat$k9, dat$half, mean))
datagg$half <- c(1,2)
colnames(datagg)[1] <- "k9"

datagg$xfip <- tapply(dat$xfip, dat$half, mean)
datagg$bb9 <- tapply(dat$bb9, dat$half, mean)
datagg$fip <- tapply(dat$fip, dat$half, mean)
datagg$era <- tapply(dat$era, dat$half, mean)
datagg
        k9 half     xfip      bb9      fip      era
1 7.412933    1 3.813781 2.713428 3.825583 3.780601
2 7.401655    2 3.754931 2.523345 3.716793 3.639379
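The same table of means can also be built in one call with `aggregate()` rather than one `tapply()` per column. Here is a sketch on a small made-up data frame (`toy` is an assumption, not the real data).

```r
# Toy stand-in for `dat` (made-up numbers)
toy <- data.frame(
  half = c(1, 2, 1, 2),
  k9   = c(8.0, 9.0, 6.0, 7.0),
  xfip = c(3.5, 3.0, 4.5, 4.0),
  bb9  = c(2.0, 2.5, 3.0, 3.5)
)

# One row per half, one column of means per statistic
toyagg <- aggregate(cbind(k9, xfip, bb9) ~ half, data = toy, FUN = mean)
toyagg
```

One difference to keep in mind: the formula interface of `aggregate()` drops rows with missing values by default, whereas `tapply(..., mean)` would return NA for a group containing one.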

###plot changes in k9 for first and 2nd half
png(file="K9Base.png", height=500, width=600)
plot(dat$k9 ~ dat$half, type="n", ylim=c(4, 12), main="K/9 1st & 2nd Half", xlab="Half", ylab="K/9")
rect(-1000, -1000, 1000, 1000, col="#00000010")
grid(lty="dashed")
for(i in unique(dat$name_year)) {
    lines(dat$k9[dat$name_year==i] ~ dat$half[dat$name_year==i], lwd=2, col="#0000ff20")
    points(dat$k9[dat$name_year==i] ~ dat$half[dat$name_year==i], cex=1, pch=16, col="#0000ff40")
    }
lines(datagg$k9 ~ datagg$half, lwd=2, col="black")
points(datagg$k9 ~ datagg$half, pch=16, cex=2)
dev.off()

K9Base

Notice that the for loop plots the K/9 for each pitcher, i, individually. Then we add on the average for the first and second halves from 2012 through 2014. There isn’t much going on with the average–strikeouts per nine innings as a whole don’t seem to be going up or down when comparing the two halves–and we don’t see much systematic pattern with the individual lines, either. They’re largely criss-crossing throughout the plot.

Now let’s specifically identify our pitchers that were traded around the deadline to see if they take it up a notch after being traded to a contender. To do this, we’ll plot them separately with a different color to highlight them in our plot. In our last plot, we used blue (the third pair of digits in our RGB colors, using “ff” to ramp up blue to the max). For the traded pitchers, we’ll use green (so we’ll put “ff” in the second pair of digits in the RGB colors).

###now color by whether the player was traded or not
png(file="K9traded.png", height=500, width=600)
plot(dat$k9 ~ dat$half, type="n", ylim=c(4, 12), main="K/9 1st & 2nd Half", xlab="Half", ylab="K/9")
rect(-1000, -1000, 1000, 1000, col="#00000010")
grid(lty="dashed")
for(i in unique(dat$name_year)) {
    lines(dat$k9[dat$name_year==i & dat$contenderTrade==0] ~ dat$half[dat$name_year==i & dat$contenderTrade==0], lwd=2, col="#0000ff20")
    points(dat$k9[dat$name_year==i & dat$contenderTrade==0] ~ dat$half[dat$name_year==i & dat$contenderTrade==0], cex=1, pch=16, col="#0000ff20")
    }
for(i in unique(dat$name_year)) {
    lines(dat$k9[dat$name_year==i & dat$contenderTrade==1] ~ dat$half[dat$name_year==i & dat$contenderTrade==1], lwd=2, col="#00ff0090")
    points(dat$k9[dat$name_year==i & dat$contenderTrade==1] ~ dat$half[dat$name_year==i & dat$contenderTrade==1], cex=1, pch=16, col="#00ff0090")
    }
lines(datagg$k9 ~ datagg$half, lwd=2, col="black")
points(datagg$k9 ~ datagg$half, pch=16, cex=2)
dev.off()

K9traded

This time, you can easily see each of the pitchers that was traded. However, we don’t see any ramping up of the strikeout rate for these guys in the second half. Oh well. Let’s keep tinkering with our graphics anyway.
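Since the same plotting steps keep repeating, it can be handy to wrap them in a function. The helper below, `plot_halves()`, is hypothetical (my own name and arguments, not something from a package), and it uses the x, y form of `plot()`, `lines()`, and `points()` rather than the formula form. The `toy` data frame is made up so the sketch runs on its own.

```r
# Hypothetical helper: one translucent line plus points per pitcher-year
plot_halves <- function(d, stat, ylim, main, ylab,
                        col_line = "#0000ff20", col_point = "#0000ff40") {
  plot(d$half, d[[stat]], type = "n", ylim = ylim,
       main = main, xlab = "Half", ylab = ylab)
  rect(-1000, -1000, 1000, 1000, col = "#00000010")  # gray background
  grid(lty = "dashed")
  for (i in unique(d$name_year)) {
    rows <- d$name_year == i
    lines(d$half[rows], d[[stat]][rows], lwd = 2, col = col_line)
    points(d$half[rows], d[[stat]][rows], cex = 1, pch = 16, col = col_point)
  }
  invisible(length(unique(d$name_year)))  # number of pitcher-years drawn
}

# Toy stand-in for `dat` (made-up numbers)
toy <- data.frame(
  name_year = c("A_2012", "A_2012", "B_2012", "B_2012"),
  half = c(1, 2, 1, 2),
  k9   = c(8.0, 9.0, 6.0, 7.0)
)

pdf(NULL)  # null device, so nothing is written to disk
n_drawn <- plot_halves(toy, "k9", ylim = c(4, 12),
                       main = "K/9 1st & 2nd Half", ylab = "K/9")
dev.off()
```

With something like this in place, highlighting a subset is a matter of calling the loop body again on the traded rows with a different color.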

Next, we’ll take a look at xFIP instead. Maybe there’s more of a pattern of fielding independent pitching success, rather than just strikeout rate.

###take a look at xFIP
png(file="xFIPtraded.png", height=500, width=600)
plot(dat$xfip ~ dat$half, type="n", ylim=c(2, 5.5), main="xFIP 1st & 2nd Half", xlab="Half", ylab="xFIP")
rect(-1000, -1000, 1000, 1000, col="#00000010")
grid(lty="dashed")
for(i in unique(dat$name_year)) {
    lines(dat$xfip [dat$name_year==i & dat$contenderTrade==0] ~ dat$half[dat$name_year==i & dat$contenderTrade==0], lwd=2, col="#0000ff20")
    points(dat$xfip [dat$name_year==i & dat$contenderTrade==0] ~ dat$half[dat$name_year==i & dat$contenderTrade==0], cex=1, pch=16, col="#0000ff20")
    }
for(i in unique(dat$name_year)) {
    lines(dat$xfip [dat$name_year==i & dat$contenderTrade==1] ~ dat$half[dat$name_year==i & dat$contenderTrade==1], lwd=2, col="#00ff0090")
    points(dat$xfip [dat$name_year==i & dat$contenderTrade==1] ~ dat$half[dat$name_year==i & dat$contenderTrade==1], cex=1, pch=16, col="#00ff0090")
    }
lines(datagg$xfip ~ datagg$half, lwd=2, col="black")
points(datagg$xfip ~ datagg$half, pch=16, cex=2)
dev.off()

xFIPtraded

Again, there doesn’t seem to be much going on here with respect to traded pitchers, though maybe there’s a small downward slope overall for xFIP from the first half to the second half as represented by our aggregate black line.

Since we don’t see much going on with our traded pitchers, let’s instead take a look at another way to color our lines to perhaps see different patterns in the data. This time, we’ll scale our color values relative to the strikeout rate. This is silly and redundant to begin with, because we know that K/9 on the vertical axis will be related to colors scaled by K/9, but later we’ll use these same colors with xFIP on the y-axis and keep K/9 as the variable by which to scale the colors.

The first thing I do below is load the scales package and rescale our K/9 variable to run between 10 and 99, so that every value is a two-digit number that can be dropped straight into a color string. Keep in mind that R reads each pair as hexadecimal, so these values actually correspond to red levels from 16 to 153 out of 255. That covers only part of the red range, but it is plenty for our purposes. Note that I square the strikeout rate to get some spread in our rescale. Then we’ll just use the paste function to put these numbers in the first pair (the level of Red to use), along with no green (00), some blue (at 40), and an alpha value of 50, which works out to roughly 31 percent opacity (hex 50 is 80 out of 255). Then, for the color, we simply pass the variable that we made using paste. Lastly, I included the figure side-by-side with our original, unscaled figure that we started with.
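If you would rather use the full range of the red channel, one option is to rescale to 0 through 255 and let `sprintf()` write the hexadecimal pair. This is a sketch of that alternative, not what the code below does: the `k9` values are made up, and the linear rescale is written out in base R as a stand-in for `rescale()` from scales.

```r
# Made-up K/9 values standing in for dat$k9
k9 <- c(4.5, 6.0, 7.4, 9.0, 11.5)

# What the digits 10 and 99 mean when R reads them as hex pairs
strtoi(c("10", "99"), base = 16L)   # 16 and 153

# Base-R linear rescale of k9^2 onto 0-255 (stand-in for scales::rescale)
sq  <- k9^2
red <- as.integer(round((sq - min(sq)) / (max(sq) - min(sq)) * 255))

# "%02X" writes each integer as a two-digit hexadecimal pair
colork9_full <- paste0("#", sprintf("%02X", red), "004050")
colork9_full
```

The lowest value gets red “00” and the highest gets “FF”, with green, blue, and alpha left as in the post.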

###playing with rescaled variable value determined colors
library(scales)
dat$k9C <- round(rescale(dat$k9^2, to=c(10, 99)), 0)
dat$colork9 <- paste("#",dat$k9C,"004050",sep="")

png(file="K9ColoredCompare.png", height=600, width=1200)
par(mfrow=c(1,2))
plot(dat$k9 ~ dat$half, type="n", ylim=c(4, 12), main="K/9 1st & 2nd Half", xlab="Half", ylab="K/9")
rect(-1000, -1000, 1000, 1000, col="#00000010")
grid(lty="dashed")
for(i in unique(dat$name_year)) {
    lines(dat$k9[dat$name_year==i] ~ dat$half[dat$name_year==i], lwd=2, col="#0000ff20")
    points(dat$k9[dat$name_year==i] ~ dat$half[dat$name_year==i], cex=1, pch=16, col="#0000ff40")
    }
plot(dat$k9 ~ dat$half, type="n", ylim=c(3, 12), main="K/9 1st & 2nd Half", xlab="Half", ylab="K/9")
rect(-1000, -1000, 1000, 1000, col="#00000010")
grid(lty="dashed")
for(i in unique(dat$name_year)) {
    lines(dat$k9[dat$name_year==i] ~ dat$half[dat$name_year==i], lwd=2, col=dat$colork9[dat$name_year==i])
    }
for(i in unique(dat$name_year)) {
    points(dat$k9[dat$name_year==i] ~ dat$half[dat$name_year==i], cex=1, pch=16, col=dat$colork9[dat$name_year==i])
    }
dev.off()

K9ColoredCompare

Notice that, as we would of course expect, the color slowly fades from light purple to a darker bluish gray as we go downward on the plot. This makes sense: pitchers with the highest strikeout rates sit high on our y-axis, and higher K/9 values put more red into our color scale. In other words, our bluish gray gets more reddish purple as the strikeout rate rises. So, things worked. But this sort of thing is a bit redundant: why do we need both a color and an axis to tell us the same thing? Color is best used on two-dimensional plots when it tells us something about a third variable. So let’s move on.

If xFIP and K/9 are related, we should see the colors change as we move downward on the plot as we do with the strikeout rate, though, maybe not as starkly as above. Let’s try and see.

###now look at xfip colored by K/9
png(file="xFIPcolorK9.png", height=500, width=600)
plot(dat$xfip ~ dat$half, type="n", ylim=c(2, 5.5), main="xFIP 1st & 2nd Half", xlab="Half", ylab="xFIP")
rect(-1000, -1000, 1000, 1000, col="#00000010")
grid(lty="dashed")
for(i in unique(dat$name_year)) {
    lines(dat$xfip[dat$name_year==i] ~ dat$half[dat$name_year==i], lwd=2, col=dat$colork9[dat$name_year==i])
    }
for(i in unique(dat$name_year)) {
    points(dat$xfip[dat$name_year==i] ~ dat$half[dat$name_year==i], cex=1, pch=16, col=dat$colork9[dat$name_year==i])
    }
dev.off()

xFIPcolorK9

Hopefully you can see that as xFIP improves (moves downward on the y-axis), our line color becomes more and more red/lighter purple. This is expected, given that xFIP is calculated with a function that heavily depends on strikeout rate. There’s a bit more noise here, of course, but the relationship seems to be there.
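Pasting digits into a color string is fun, but base R also ships tools for building a color scale directly. The sketch below is an alternative to the approach in this post, using `colorRampPalette()`, `cut()`, and `adjustcolor()` from base R; the `k9` values are made up.

```r
# Made-up K/9 values standing in for dat$k9
k9 <- c(4.5, 6.0, 7.4, 9.0, 11.5)

# 100 colors running from blue (low K/9) to red (high K/9)
pal <- colorRampPalette(c("blue", "red"))(100)

# Assign each value to one of 100 equal-width bins across the observed range
bin <- cut(k9, breaks = 100, labels = FALSE, include.lowest = TRUE)

# Look up the color for each bin and make it half transparent
col_k9 <- adjustcolor(pal[bin], alpha.f = 0.5)
col_k9
```

The resulting vector can be passed to `col=` in `lines()` or `points()` in the same way as `dat$colork9`.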

Lastly, let’s take a look at a scatterplot of xFIP and K/9 and leave aside our first half/second half information. This time, we’ll use a different third variable to color our points: walks per 9 innings pitched (BB/9). Let’s see if we find a pattern here.

###plot xFIP as function of K/9
dat$bb9C <- round(rescale(dat$bb9^.9, to=c(10, 99)), 0)
dat$colorbb9 <- paste("#",dat$bb9C,"004080",sep="")

png(file="xFIPbyK9byBB9.png", height=500, width=650)
plot(dat$xfip ~ dat$k9, type="n", ylim=c(2, 5.5), xlim=c(4,12), main="xFIP by K/9", xlab="K/9", ylab="xFIP")
rect(-1000, -1000, 1000, 1000, col="#00000010")
grid(lty="dashed")
for(i in unique(dat$name_year)) {
    points(dat$xfip[dat$name_year==i] ~ dat$k9[dat$name_year==i], cex=1.5, pch=16, col=dat$colorbb9[dat$name_year==i])
    }
dev.off()

xFIPbyK9byBB9

You can see here that, as expected from the pattern in colors as it relates to xFIP and K/9 in our previous plot, there is a downward slope with xFIP improving as K/9 increases. However, you should also note that there is another clear relationship with the colors involved: given K/9, as xFIP increases, the color of our dots gets more red. Since we colored our plot by adding more red when BB/9 is high, we can reasonably infer that xFIP increases as BB/9 increases. Sure, this is all pretty straightforward, but I’d encourage you to fiddle around with color to see if you can come up with some less obvious relationships in your data.
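If you want a number to go with the colors, a regression of xFIP on both K/9 and BB/9 gives the "given K/9" comparison directly. The sketch below only shows the mechanics: the `toy` values are made up and built from an exact formula, so the coefficients say nothing about real pitchers.

```r
# Made-up values; xfip is constructed exactly as 5 - 0.3*k9 + 0.4*bb9
toy <- data.frame(
  k9  = c(5, 6, 7, 8, 9, 10),
  bb9 = c(3.0, 2.0, 3.5, 1.5, 2.5, 2.0)
)
toy$xfip <- 5 - 0.3 * toy$k9 + 0.4 * toy$bb9

# The bb9 coefficient is the change in xFIP per extra walk per 9, holding K/9 fixed
fit <- lm(xfip ~ k9 + bb9, data = toy)
coef(fit)
```

Running the same `lm()` call on `dat` would show how strong the BB/9 relationship is in the actual sample.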

Hopefully, at worst, this post is instructive in having fun with R and discovering new ways to think about displaying your data and being creative about using packages that you may otherwise not think to use in these instances.
