Baseball, as we all know, is going through some significant changes. Strikeouts and home runs are up, hits are down, and games are getting longer. What is the impact of these changes on measures of performance? Here I’ll explore visually how the distributions of batting averages of regular players have changed from 1995 to the current season. I’m doing this study for two reasons: first, it is an interesting thing to explore, and second, it will give me an opportunity to demonstrate different ways of displaying many distributions of data. (We are currently talking about comparing distributions of quantitative data in my graphics course, and my data science students will shortly have a graphics miniproject.)
I’m focusing on batting averages of regulars, which I define as those players with at least 428 at-bats. (Why 428? When I scraped some SI data, the minimum number of at-bats for the players on their leader board was 428.) I want to remove players such as pitchers, who have lower batting averages. I selected data for the 1995, 1998, 2001, 2004, 2007, 2010, 2013, 2016, and 2018 seasons. The number of regular players in these seasons ranged from 117 to 176.
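The filtering rule is simple enough to sketch in a few lines. The snippet below is in Python rather than the R used for this post, and the record layout, field names, and sample rows are all made up for illustration; only the 428 at-bat cutoff comes from the text above.

```python
from collections import namedtuple

# Hypothetical record layout; the field names are assumptions for this sketch.
BattingLine = namedtuple("BattingLine", ["player", "season", "ab", "h"])

MIN_AB = 428  # cutoff for a "regular" used in this post


def regulars(lines, min_ab=MIN_AB):
    """Return (player, season, avg) for every line with at least min_ab at-bats."""
    return [(b.player, b.season, b.h / b.ab) for b in lines if b.ab >= min_ab]


# Made-up rows, not real players
sample = [
    BattingLine("Player A", 2018, 550, 165),
    BattingLine("Player B", 2018, 428, 107),
    BattingLine("Pitcher C", 2018, 60, 6),
]

kept = regulars(sample)
for player, season, avg in kept:
    print(f"{season} {player}: {avg:.3f}")
```

The low at-bat row is dropped before any average is computed, which is the point of the cutoff.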
Graph 1: A Basic Scatterplot
A basic thing to try is simply to plot the AVG against the Season for all seasons, which I show below. This is a poor graph since there is significant overplotting and one cannot see where the AVGs are concentrated.
Graph 2: A Jittered Scatterplot
A simple way to improve the scatterplot is to vertically jitter the points, which I show below. Now one gets a better sense of where the AVGs are distributed, and the outliers (like Chris Davis’ 0.168 AVG in 2018) are more clearly distinguished. I am not that enthused about this graph since one still has significant overplotting.
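Jittering amounts to adding a small random offset along the Season axis so that points sharing a season no longer sit on top of each other; the AVG values themselves are left alone. Here is a minimal Python sketch of the idea, where the function name, the 0.4 width, and the seed are assumptions chosen for illustration:

```python
import random


def jitter(positions, width=0.4, seed=2018):
    """Return positions shifted by independent uniform noise in [-width, width]."""
    rng = random.Random(seed)  # fixed seed so the picture is reproducible
    return [p + rng.uniform(-width, width) for p in positions]


# Season coordinate for five hypothetical players
seasons = [1995, 1995, 1995, 1998, 1998]
jittered = jitter(seasons)

for s, j in zip(seasons, jittered):
    print(s, round(j, 3))
```

Keeping the width well under the gap between plotted seasons keeps the groups visually separate.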
Graph 3: Error Bar Plots
When one wants to compare distributions, a common strategy is to plot means together with some “error bars” indicating variability about the means. Below I show the mean AVGs where the error bars add and subtract one standard deviation of AVG. I don’t like this graph for several reasons. First, it will not be obvious to the reader what is being graphed. If you are primarily interested in the error in estimating the mean, then you would plot the mean plus and minus a standard error, which is the standard deviation divided by the square root of the sample size; that is not what I am plotting below. Second, this graph shows only about the middle two-thirds of each distribution (assuming a roughly bell-shaped distribution), so much of the distribution is not graphed.
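The difference between the two kinds of error bar is easy to see numerically. The Python sketch below uses made-up AVGs (not the data behind the graph) and a hypothetical helper name; it computes the mean, the sample standard deviation, and the standard error, and then checks how much of a normal distribution falls within one standard deviation of the mean.

```python
import math
import statistics
from statistics import NormalDist


def mean_sd_se(values):
    """Return (mean, sample standard deviation, standard error of the mean)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)          # spread of individual AVGs
    se = sd / math.sqrt(len(values))       # uncertainty in the mean itself
    return mean, sd, se


# Made-up AVGs for nine hypothetical regulars
avgs = [0.244, 0.250, 0.262, 0.271, 0.280, 0.287, 0.295, 0.301, 0.310]
mean, sd, se = mean_sd_se(avgs)
print(f"mean={mean:.4f}  sd={sd:.4f}  se={se:.4f}")

# Share of a normal distribution within one SD of its mean
coverage = NormalDist().cdf(1) - NormalDist().cdf(-1)
print(f"coverage of mean +/- 1 SD under normality: {coverage:.3f}")
```

With well over a hundred regulars per season, the standard error would be less than a tenth of the standard deviation, so the two choices give very different-looking bars.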
Graph 4: Parallel Boxplots
I like showing parallel boxplots of the AVGs, as shown below. I like boxplots since they clearly display the middle 50% of each distribution (the box component), show the tail components (the whiskers of the boxplots), and set apart individual AVGs that are unusually large or unusually small. We see several interesting things:
- The median AVG of regulars is clearly decreasing over time, and the 2018 median is the smallest in this group of seasons.
- We have outstandingly high AVGs in the early seasons, but unusually small AVGs in the later seasons.
- The spreads of the middle 50% of regular AVGs appear to be stable over this period.
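For readers who want to see what goes into a boxplot, here is a rough Python sketch of the usual ingredients: the quartiles, the fences at 1.5 times the interquartile range, the whisker ends, and the points flagged as outliers. The function name is hypothetical, the quartile method is an assumption (plotting software may compute hinges slightly differently), and apart from the 0.168 value mentioned earlier the AVGs are made up.

```python
import statistics


def boxplot_summary(values, k=1.5):
    """Quartiles, whisker ends, and outliers using fences at k * IQR."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = sorted(v for v in values if v < low_fence or v > high_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),    # most extreme points still inside the fences
        "whisker_high": max(inside),
        "outliers": outliers,
    }


# One very low AVG plus made-up values for illustration
avgs = [0.168, 0.240, 0.250, 0.255, 0.260, 0.265, 0.270, 0.280, 0.290]
summary = boxplot_summary(avgs)
for name, value in summary.items():
    print(name, value)
```

The box spans q1 to q3, the whiskers run to the most extreme values inside the fences, and anything beyond is drawn as a separate point.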
Graph 5: Violin Plots
One problem with boxplots is that they are a bit discontinuous in appearance, and one might prefer smoother estimates of the distributions. A violin plot is a modern graphical method of displaying a distribution; basically, it is a mirrored version of a density estimate display. I show parallel violin plots of the AVGs below. One gets a better sense of the shape of the AVG distributions in this display.
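To make the "mirrored density estimate" idea concrete, here is a small Python sketch that builds a Gaussian kernel density estimate by hand and reflects it about a center line. The function names, the bandwidth of 0.01, and the AVGs are assumptions for illustration; plotting packages choose the bandwidth automatically and may scale the widths differently.

```python
import math


def gaussian_kde(data, bandwidth):
    """Return a function giving the Gaussian kernel density estimate at x."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data)

    return density


def violin_outline(data, bandwidth, grid):
    """For each grid value, return (value, left edge, right edge) of the violin."""
    f = gaussian_kde(data, bandwidth)
    return [(x, -f(x), f(x)) for x in grid]


# Made-up AVGs for illustration
avgs = [0.244, 0.250, 0.262, 0.271, 0.280, 0.287, 0.295, 0.301, 0.310]
step = 0.001
grid = [0.150 + step * i for i in range(251)]   # 0.150 to 0.400
outline = violin_outline(avgs, bandwidth=0.01, grid=grid)

# Sanity check: one side of the violin should enclose an area close to 1
area = sum(right for _, _, right in outline) * step
print(round(area, 3))
```

Drawing the left and right edges against the grid values gives the violin shape; a wider violin at a given AVG means more players near that value.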
Graph 6: Violin Plots Plus
One thing that is missing is some indication of a typical AVG in each distribution. Below is my “violin plots plus” display, where I add the mean AVG as a red dot. Like the medians, the mean AVGs have been decreasing over time.
Graph 7: Ridgeline Plots
An attractive newer way of comparing distributions is the so-called ridgeline plot, where you show overlapping density estimates, as shown below. What is interesting here is that I get a better sense of the shapes of these AVG distributions. Note that the AVG distribution in 1995 appears bimodal, with modes around .260 and .300 and the .300 mode the more prominent. In contrast, the 2018 AVG distribution is bimodal but in a different way: the more prominent mode is at .250, with a smaller mode at .290. This might be worth exploring further; at the least, it calls into question the common assumption that AVGs are bell-shaped.
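One way to follow up on apparent bimodality is to locate the local peaks of a density estimate directly. The Python sketch below does this on a grid with a hand-rolled Gaussian kernel estimate. The function names, the bandwidth, and the two-cluster data are all made up for illustration, so the peaks it reports say nothing about the real 1995 or 2018 seasons.

```python
import math


def kde_on_grid(data, bandwidth, grid):
    """Gaussian kernel density estimate evaluated at each grid point."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data)
        for x in grid
    ]


def find_modes(grid, density):
    """Grid points where the density is higher than at both neighbours."""
    return [
        grid[i]
        for i in range(1, len(grid) - 1)
        if density[i] > density[i - 1] and density[i] > density[i + 1]
    ]


# Made-up AVGs forming two clusters; NOT the data from this post
avgs = [0.245, 0.248, 0.250, 0.252, 0.255,
        0.295, 0.298, 0.300, 0.302, 0.305]
grid = [0.200 + 0.001 * i for i in range(151)]   # 0.200 to 0.350
density = kde_on_grid(avgs, bandwidth=0.005, grid=grid)
modes = find_modes(grid, density)
print([round(m, 3) for m in modes])
```

The number of peaks found this way depends heavily on the bandwidth: a larger value smooths two nearby modes into one, and a smaller value can turn noise into extra bumps, so it is worth trying several before reading much into the shape.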
Here’s some general advice about comparing distributions of numeric data.
- Don’t just plot the means. (You are missing a lot of information about the data.)
- It is not much better to plot error bars. (Readers may not understand what you are plotting.)
- Boxplots are helpful both for understanding the center and spread of the middle 50% of the data and for identifying outliers. Boxplots have been around a while (they were John Tukey’s idea from almost 50 years ago), and they remain popular in exploratory work.
- Density graphs (either using violin or ridgeline graphs) are helpful, especially when you have large amounts of data. Unusual shapes of distributions such as bimodality can be picked up by these displays.
- Experiment with different types of graphs. Usually the first graph you try is not the best one. Also, the best type of graph may depend on your data and the characteristic of the distribution that you are primarily interested in.
All of the R code to obtain this data and construct these plots is available on my GitHub Gist site.