# A Graph of a Batting Average

What does it mean for a baseball hitter to hit “for average”? Here’s one way to dissect a batting average into its components and graph them, using traditional graphics in R to construct the display for a particular player and season of interest.

How does a player get a base hit? He gets a hit by

• not striking out, and then either
• hitting a home run, typically over the fence, or
• getting a hit on a ball put into play

This motivates the following decomposition of all outcomes of an at-bat (AB). We divide all at-bats first by SO/not-SO, then by HR/not-HR, and then by HIT (in play)/OUT (in play).

If we define the following rates:

• the strikeout rate SO.RATE = SO / AB
• the home run rate HR.RATE = HR / (AB – SO)
• the balls-in-play batting average BABIP = (H – HR) / (AB – SO – HR)

Then one can show that a batting average AVG = H / AB can be written as
$AVG = \left(1 - SO.Rate\right) \times \left(HR.Rate + (1 - HR.Rate) \times BABIP\right)$
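To see why this identity holds, one can substitute the count definitions and cancel. The short derivation below takes the home run rate to be HR / (AB – SO), which is the form used in the McGwire calculation later in this post.

$1 - SO.Rate = \frac{AB - SO}{AB}$

$HR.Rate + (1 - HR.Rate) \times BABIP = \frac{HR}{AB - SO} + \frac{AB - SO - HR}{AB - SO} \times \frac{H - HR}{AB - SO - HR} = \frac{H}{AB - SO}$

$AVG = \frac{AB - SO}{AB} \times \frac{H}{AB - SO} = \frac{H}{AB}$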

Bill James, in one of his Baseball Abstracts, used the area of a rectangle to represent the runs created for a particular player, where the sides of the rectangle represented on-base and slugging abilities. Likewise, we can use areas of shaded rectangles to represent the different components of a batting average.

Here is an outline of this construction. I’m illustrating this construction using Mark McGwire’s famous 1998 season where he had 509 at-bats, 152 hits, 70 home runs, and 155 strikeouts. (We’ll shortly hear about the number of votes McGwire will get for the HOF.)

* Start with a unit square where the horizontal side corresponds to the strikeout rate (0 to 1).

* Draw a vertical line at Mark’s strikeout rate of 155 / 509 = 0.305. The area of the rectangle to the left of this line is equal to his strikeout rate.

* In the “no strikeout” region, draw a horizontal line at Mark’s home run rate of 70 / (509 – 155) = 0.198. The area of this “HR” rectangle
$\left(1 - SO.Rate\right) \times HR.Rate$
represents the first component of the batting average.

* The area of the upper-right rectangle is the proportion of AB where he did not get a SO or a HR. We mark off the balls-in-play hit rate of (152 – 70) / (509 – 155 – 70) = 0.289 with a vertical line.

The area of the “H” rectangle
$\left(1 - SO.Rate\right) \times \left((1 - HR.Rate) \times BABIP\right)$
is the proportion of AB where Mark got an in-play hit — it represents the second component of the batting average.

* Last, we shade the two types of hits (home runs and in-play) in red and blue, respectively. The sum of the two shaded areas is the batting average:
$AVG = AREA.OF.RED.REGION + AREA.OF.BLUE.REGION$
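As a quick check of the construction above, the three rates and the two shaded areas can be computed directly from McGwire’s 1998 totals. This is a minimal sketch in base R; the variable names are my own choices for illustration and are not taken from plot.batting.average.

```r
# Mark McGwire's 1998 totals, as quoted in the text
AB <- 509; H <- 152; HR <- 70; SO <- 155

so_rate <- SO / AB                     # strikeouts per at-bat
hr_rate <- HR / (AB - SO)              # home runs per non-strikeout at-bat
babip   <- (H - HR) / (AB - SO - HR)   # in-play hits per ball put in play

red_area  <- (1 - so_rate) * hr_rate                  # home run component
blue_area <- (1 - so_rate) * (1 - hr_rate) * babip    # in-play hit component

# Print the rates and the areas, rounded to three decimals
print(round(c(SO.RATE = so_rate, HR.RATE = hr_rate, BABIP = babip), 3))
print(round(c(RED = red_area, BLUE = blue_area, AVG = red_area + blue_area), 3))

# The two shaded areas should add back up to H / AB
print(all.equal(red_area + blue_area, H / AB))
```

The red area simplifies to 70 / 509 (about 0.138) and the blue area to 82 / 509 (about 0.161), so the two together recover the .299 batting average.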

The function plot.batting.average will construct this graph using traditional R graphics. All that is needed is the Lahman package, which contains the season batting data for all players, and the dplyr package. The inputs to the function are the name of the player in quotes and the season. (If you inspect the function, you’ll see that I use the plot function to set up the square, the lines function to draw the segments, and rect to draw the shaded rectangles.) The graph displays the three rates and gives the areas of the two regions that make up the batting average.
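For readers who want to see the drawing steps without the data lookup, here is a rough sketch that takes the four counts directly and uses only plot, rect, and lines. It is not the code behind plot.batting.average: the helper name sketch_avg_plot, the axis labels, and the choice to measure the in-play hit rectangle from the left edge of the “no strikeout” region are all assumptions made for illustration.

```r
# Hypothetical helper (not the author's plot.batting.average):
# draws the unit-square decomposition from raw season counts.
sketch_avg_plot <- function(AB, H, HR, SO, main = "") {
  so_rate <- SO / AB
  hr_rate <- HR / (AB - SO)
  babip   <- (H - HR) / (AB - SO - HR)

  # Set up an empty unit square (axis labels are an assumption)
  plot(c(0, 1), c(0, 1), type = "n", asp = 1,
       xlab = "Strikeout rate", ylab = "Home run rate", main = main)
  rect(0, 0, 1, 1)

  # Home runs (red): the no-strikeout strip, below the home run rate
  rect(so_rate, 0, 1, hr_rate, col = "red")

  # In-play hits (blue): above the home run line, with width equal to
  # the BABIP share of the no-strikeout strip (assumed to start at the left)
  x_hit <- so_rate + (1 - so_rate) * babip
  rect(so_rate, hr_rate, x_hit, 1, col = "blue")

  # Dividing segments for the three rates
  lines(c(so_rate, so_rate), c(0, 1))
  lines(c(so_rate, 1), c(hr_rate, hr_rate))
  lines(c(x_hit, x_hit), c(hr_rate, 1))

  # Return the two shaded areas without printing them
  invisible(c(HR = (1 - so_rate) * hr_rate,
              IN.PLAY = (1 - so_rate) * (1 - hr_rate) * babip))
}

# McGwire's 1998 counts from the text
areas <- sketch_avg_plot(509, 152, 70, 155, main = "Mark McGwire, 1998")
print(round(areas, 3))
```

Returning the two areas invisibly makes it easy to confirm that they sum to H / AB for whatever counts are passed in.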

We illustrate this batting average decomposition for several “interesting batters”.

The 1998 Mark McGwire. This is the year when McGwire hit 70 home runs. He had a high strikeout rate of 30%, but he had a remarkable home run rate of 20% and a below-average BABIP of 29%. (Note that this function is available on my GitHub Gist site.)

```r
library(devtools)
library(Lahman)   # season batting data
library(dplyr)
source_gist("6afd88ed3e48fd62b7b6")   # defines plot.batting.average()
plot.batting.average("Mark McGwire", 1998)
```


The 2001 Mark McGwire. This was the last season of McGwire’s career. His strikeout rate rose to 40%, his home run rate dropped to 16%, and his BABIP was a measly 18%. This was a distinctive season since the component of his AVG due to home runs was actually higher than the component due to hits in play.

```r
plot.batting.average("Mark McGwire", 2001)
```
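The claim about the two components can be checked by hand from the rounded rates quoted above. This small sketch uses only those rounded percentages (not the exact season counts), so the results are approximate; the variable names are my own.

```r
# Rounded 2001 rates quoted in the text (approximate, not exact counts)
so_2001    <- 0.40
hr_2001    <- 0.16
babip_2001 <- 0.18

hr_part      <- (1 - so_2001) * hr_2001                     # home run component
in_play_part <- (1 - so_2001) * (1 - hr_2001) * babip_2001  # in-play hit component

print(c(HOME.RUN = hr_part, IN.PLAY = in_play_part,
        AVG = hr_part + in_play_part))

# Is the home run component the larger of the two?
print(hr_part > in_play_part)
```

With these rounded inputs the home run component is 0.096 and the in-play component is about 0.091, which agrees with the observation that home runs contributed the larger share.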


The 2010 Adam Dunn. Adam Dunn is an interesting hitter whose high strikeout rate is compensated by a good home run rate and a pretty good BABIP.

```r
plot.batting.average("Adam Dunn", 2010)
```


The 2013 Dan Uggla. In this season, Uggla had a reasonable home run rate, but his high strikeout rate and low BABIP resulted in a 0.179 batting average.

```r
plot.batting.average("Dan Uggla", 2013)
```


The 2004 Ichiro Suzuki. In this remarkable season, Suzuki had a low strikeout rate combined with a high BABIP rate of 40%.

```r
plot.batting.average("Ichiro Suzuki", 2004)
```


The 1980 George Brett. How did Brett get close to a .400 batting average? He combined a very low strikeout rate with a reasonable home run rate and a strong batting average on balls put in play.

```r
plot.batting.average("George Brett", 1980)
```


The 1941 Ted Williams. The characteristics of Williams’ .406 season were similar to Brett’s. The notable difference was Williams’ high home run rate.

```r
plot.batting.average("Ted Williams", 1941)
```


These graphs show how players “hit for average” and might be a useful way to compare batters. At the least, they might discourage the imprecise habit of simply saying that a player “hits for average”. As these graphs indicate, players with a high batting average typically have low strikeout rates, and it is also possible to boost one’s batting average by hitting home runs (think Mark McGwire).