Category Archives: Uncategorized

New Edition of TSUB and Model/Data Simulations

I’m back from Australia. It was interesting giving several baseball talks to groups that know little about baseball, but I think many of the ideas of my talk would carry over to cricket.

New Edition of Teaching Statistics Using Baseball

You might know that I wrote an introductory statistics text “Teaching Statistics Using Baseball” (TSUB) based solely on baseball examples. Since TSUB was written in 2003, I thought a revision was needed primarily to update the examples with current players and data. (For example, Barry Bonds is out and Mike Trout is in.) The revision of TSUB is now available — see the book web page for general information and examples how to use R or StatCrunch in teaching a course using this book. In particular, I wrote a package tsub that contains all of the datasets used in the book.

By the way, it is encouraging that other authors, such as Tabor and Franklin and Rothman have more recently written introductory statistics texts with a sports theme.

Model/Data Simulations

One of the innovative aspects of TSUB is its use of Bayesian thinking to introduce statistical inference. I thought it would be instructive to illustrate a simulation-based way of performing Bayesian inference for a hitting probability. I call this “model/data simulation” — I coauthored a paper in the American Statistician back in 2001 that describes this method in more detail.

Construct a Prior

We are interested in learning about P, the probability a particular batter gets a hit. Suppose we think that values of P of .15, .16, …, .39, .4 are all possible and our prior assigns each value the same probability. (The choice of uniform prior can be modified if one believes that particular values of P are more or less likely.)

Decide on a Statistic

We are going to observe the batting outcomes (hit or out) for, say 500 at-bats — we can represent the outcomes as the sequence

0, 1, 0, 0, 1, 1, 0, 0, 0, …

where 1 (0) denotes a hit (out). We will compute a Statistic based on these 500 outcomes — possible Statistics are (1) the number of hits, or (2) the largest “ofer” (that is, the largest gap between consecutive base hits).

Model/Data Simulation

Okay, here is how one implements the algorithm. First one simulates a value of P from my prior, call it P0. Then one simulates batting outcomes assuming P = P0 and computes the value of the statistic S. One repeats this a large number of times, obtaining many ordered pairs (P, S).

Inference

To implement Bayes’ rule, suppose that we observe S = S_obs. Then we look at the simulated values of P only for the pairs where S = S_obs. This sample represents a simulated sample from the posterior distribution of P conditional on S = S_obs. We perform various inferences by summarizing this posterior sample.

R Functions

I’ve written two R functions model_data_simulation() and inference_plot() (available on my github gist site) that implements this method. The first function does the model/data simulation and the second function implements the Bayesian inference once we observe S = S_obs. These functions can be read into the workspace by use of the source_gist function.

library(devtools)
source_gist("2a78aaf69a0e54fe1a35fbedeb662e62",
            filename="model_data_simulation.R")

Example 1: Statistic is the Sum

Suppose we wish to learn about P based on the sum of the binary outcomes, that is, the number of hits. In the R script, we first define a vector P_vector containing the possible values of P between 0.15 and 0.40. Then we use the model_data_simulation function with arguments P_vector , N (the number of at-bats), and mystat , the choice of statistic (here sum ). The output is a plot of the simulated draws of P and S. As one might anticipate, there is a positive association between the value of P and the number of hits in 500 at-bats

P_vector <- seq(.15, .40, by=.01)
S1 <- model_data_simulation(P_vector,
                            N=500,
                            mystat=sum)

Suppose we observe the batter get 150 hits in the 500 at-bats. We use inference_plot with arguments S1 (the simulated draws of P and S from the model/data simulation) and 150 (the observed value of the statistic). The output is a density plot of the simulated values of P conditional on S = 150. We show the location of a 50% interval estimate — here the probability P is in the interval (.290, .310) is 0.50.

inference_plot(S1, 150)

Example 2: Statistic is the Largest Ofer

To illustrate the use of a different statistic, suppose we focus on the largest “ofer” value — that is, the length of the longest streak of 0’s in the sequence. First I write a function max_ofer that computes the largest ofer (note that I’m using a function from my BayesTestStreak package). Then I rerun model_data_simulation where the statistics function is max_ofer . Here note there is a negative association between P and the longest ofer — weak hitters tend to have larger ofers.

max_ofer <- function(y){
  require(BayesTestStreak)
  max(find.spacings(y)$y)
}
S2 <- model_data_simulation(P_vector,
                           N=500,
                           mystat=max_ofer)

Suppose you observe a hitter whose maximum ofer is 30. We run inference_plot using the statistics value of 30. We see we are 50% sure that P is between 0.160 and 0.200 — the batter is relatively weak.

inference_plot(S2, 30)

Using this Method in Teaching

I like this method of introducing inference for several reasons:

  1. It helps the student understand the distinction between the parameter and the statistic.
  2. It is flexible — one can play with different choices of statistic S.
  3. Inference is easy to implement — one performs inference by data analysis on a simulated sample, and one can make straight-forward interpretations of interval estimates. (One does not have to consider samples that you did not observe.)