Monthly Archives: June, 2016

Simulating a Half-Inning of Baseball

I am working on a method for determining the efficiency of a team in using its on-base events to score runs. Some baseball teams tend to scatter their on-base events through the nine innings and score few runs. (I am thinking of the 2016 Phillies, but maybe their problem is simply not producing many on-base events.) In contrast, other teams seem to be effective in bunching their on-base events to produce runs. In working on this method, I wanted to write a function to simulate the number of runs scored in a half-inning given the overall probabilities of the six basic events OUT, SINGLE, DOUBLE, TRIPLE, HOME RUN, and WALK/HBP, and some basic runner advancement rules. Anyway, I thought it was a good topic for a post.

Runner Movements

In my simulation, I wanted to use realistic runner movements on base hits. For example, if there is a runner on first, then sometimes a single will leave runners on first and second, and sometimes you’ll have runners on first and third. Using the 2015 Retrosheet data, I found the frequencies of the different outcomes, which can be converted to probabilities. For example, when there is a runner on 1st, a single advanced the runner to 2nd 3940 times and advanced the runner to 3rd 1411 times, so the probabilities of the final states “runners on 1st and 2nd” and “runners on 1st and 3rd” are approximated to be 3940 / (3940 + 1411) = 0.736 and 1411 / (3940 + 1411) = 0.264. In a similar fashion, I found the runner advancement probabilities for singles and doubles in all runner situations. Here is the runner advancement matrix for a single. Each row is the runner state before the single and each column is the runner state after it. (My notation is that “110” means runners on first and second.)

    000   100 010 001   110   101 011   111
000   0 1.000   0   0 0.000 0.000   0 0.000
100   0 0.000   0   0 0.736 0.264   0 0.000
010   0 0.576   0   0 0.000 0.424   0 0.000
001   0 1.000   0   0 0.000 0.000   0 0.000
110   0 0.000   0   0 0.359 0.239   0 0.401
101   0 0.000   0   0 0.766 0.234   0 0.000
011   0 0.518   0   0 0.000 0.482   0 0.000
111   0 0.000   0   0 0.344 0.223   0 0.433
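
The conversion from counts to probabilities can be sketched in a few lines of R. This is only an illustration of the arithmetic for the "100" row, using the two Retrosheet counts quoted above; the variable names (counts_100, probs_100) are made up for this example and are not necessarily those used in the gist.

```r
# Counts of end states after a single with a runner on 1st
# (2015 Retrosheet counts quoted in the text above).
counts_100 <- c("110" = 3940, "101" = 1411)

# Divide each count by the row total to get transition probabilities.
probs_100 <- counts_100 / sum(counts_100)
round(probs_100, 3)

# Store the result in the "100" row of an 8 x 8 matrix of zeros
# whose rows (start states) and columns (end states) share the same labels.
all_bases <- c("000", "100", "010", "001", "110", "101", "011", "111")
Prob_Single <- matrix(0, 8, 8, dimnames = list(all_bases, all_bases))
Prob_Single["100", names(probs_100)] <- probs_100
```

Repeating this for each starting state would fill in the remaining rows, each of which should sum to 1.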

The Probabilities of the Six Basic Events

Also to make my simulation more realistic, I used the frequencies of the six basic events from 2015 data.

   OUT     BB     1B     2B     3B     HR 
0.6817 0.0863 0.1543 0.0454 0.0052 0.0270 

Note that I’m ignoring other events such as steals, sacrifices, wild pitches and passed balls that help to produce runs, but I’m interested in how well I can estimate run scoring just using this information.
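
As a small sketch of how these probabilities can drive a simulation, the vector below copies the six values from the table and draws a few plate-appearance outcomes with sample. The seed and the number of draws are arbitrary choices for the illustration.

```r
# Probabilities of the six basic events (2015 values from the table above).
prob <- c(OUT = 0.6817, BB = 0.0863, "1B" = 0.1543,
          "2B" = 0.0454, "3B" = 0.0052, HR = 0.0270)

# The rounded values total 0.9999 rather than exactly 1; sample() treats
# prob as weights and rescales them, so this does not cause a problem.
sum(prob)

set.seed(2016)  # arbitrary seed, only so the draws are reproducible
events <- sample(names(prob), size = 10, replace = TRUE, prob = prob)
events
```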

The Simulation Function

On my gist site, I have two functions — the function runs_setup defines the runner advancement matrices and a vector that gives the probabilities of the six basic events, and the function simulate_half_inning does the actual simulation. Here’s the crux of the actual simulation. Using the sample function, I simulate an event. I simulate the runner advancement — sometimes, the advancement is deterministic (for example, the bases will always be “001” after a triple), but otherwise I simulate the runner advancement using the Prob_Single and Prob_Double matrices. I update the number of outs and runs scored, and then repeat until we have three outs.

 while(outs < 3){
   # draw the next event from the six basic events
   event <- sample(names(setup$prob), size=1, prob=setup$prob)
   # singles, doubles, walks: draw the new base state from the
   # row of the advancement matrix for the current base state
   if (event=="1B") new_bases <- sample(all_bases, 1,
                              prob=setup$Prob_Single[bases, ])
   if (event=="2B") new_bases <- sample(all_bases, 1,
                              prob=setup$Prob_Double[bases, ])
   if (event=="BB") new_bases <- sample(all_bases, 1,
                              prob=setup$Prob_Walk[bases, ])
   # deterministic outcomes
   if (event=="3B") new_bases <- "001"
   if (event=="HR") new_bases <- "000"
   if (event=="OUT") new_bases <- bases
   # update the outs, the runs scored, and the base state
   outs <- outs + (event == "OUT")
   runs <- runs - (event == "OUT") +
           runs.transition(bases, new_bases)
   bases <- new_bases
 }
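
The run bookkeeping in the loop rests on a conservation rule: runners before the play plus the batter must equal runners after the play plus runs scored plus outs made. The sketch below is a hypothetical helper written for this illustration (count_runners and runs_on_play are invented names, not the gist's runs.transition). It assumes, as the subtraction of (event == "OUT") in the loop suggests, that the batter is counted as a potential run and an out is then taken away.

```r
# Hypothetical helpers for illustration only (not the gist's code).
# Number of occupied bases in a state string such as "110".
count_runners <- function(state) {
  sum(as.integer(strsplit(state, "")[[1]]))
}

# Runs scored on one plate appearance, from the conservation rule:
# runners before + batter = runners after + runs + outs made
runs_on_play <- function(bases, new_bases, is_out) {
  count_runners(bases) + 1 - count_runners(new_bases) - as.integer(is_out)
}

runs_on_play("110", "001", FALSE)  # triple with runners on 1st and 2nd: 2
runs_on_play("111", "000", FALSE)  # grand slam: 4
runs_on_play("100", "100", TRUE)   # out with the runner holding: 0
```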

How Do the Simulation Results Compare with Real Baseball?

I use the runs_setup function to do the setup, and the replicate function together with simulate_half_inning to simulate the runs scored for 10,000 half-innings:

st <- runs_setup()
R <- replicate(10000, simulate_half_inning(st))
round(prop.table(table(R)), 3)
    0     1     2     3     4     5     6     7     8    10    12 
0.762 0.123 0.060 0.033 0.014 0.005 0.002 0.001 0.000 0.000 0.000 

In these simulations, there were 0 runs scored in 76.2% of the half-innings, 1 run scored in 12.3% of the half-innings, and so on.
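
As a rough check on the scale of these numbers, the proportions can be turned into an average number of runs. The sketch below retypes the rounded table values, so the result is only approximate (the rare 8, 10, and 12 run innings round to 0.000); with the simulated vector R in hand, mean(R) would give the exact simulated average.

```r
# Simulated proportions copied from the table above (10,000 half-innings).
runs  <- c(0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 12)
props <- c(0.762, 0.123, 0.060, 0.033, 0.014, 0.005, 0.002, 0.001, 0, 0, 0)

# Expected runs per half-inning: sum of (runs * proportion).
mean_runs <- sum(runs * props)
mean_runs        # about 0.442

# Scaled up to nine half-innings.
9 * mean_runs    # about 3.98

# Proportion of half-innings with at least one run.
1 - props[runs == 0]
```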

I’m interested in how the simulated results compare to the real run production during the 2015 season. I’ve graphed the proportions of 0, 1, 2, … runs scored (in a half-inning) using the simulated and actual data. Actually, I have plotted the logit probability which is log (prob / (1 – prob)) — this makes it easier to visualize small probabilities.
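
The logit transformation itself is a one-liner. The sketch below applies it to the simulated proportions for 0 through 7 runs (retyped from the table above; the actual 2015 proportions are not reproduced here) and checks the hand-written formula against R's built-in qlogis function.

```r
# Simulated proportions for 0 to 7 runs, from the table above.
props <- c("0" = 0.762, "1" = 0.123, "2" = 0.060, "3" = 0.033,
           "4" = 0.014, "5" = 0.005, "6" = 0.002, "7" = 0.001)

# Logit computed directly from its definition ...
logit <- log(props / (1 - props))

# ... agrees with qlogis(), the quantile function of the logistic distribution.
all.equal(logit, qlogis(props))

round(logit, 2)
```

On the logit scale, the small probabilities for 4 or more runs are spread out instead of being squeezed against zero, which is what makes the comparison easier to see.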


Looking at the graph, it seems that the simple model predicted a higher proportion of 0-run innings and a lower proportion of 1-run innings than were actually observed. That makes sense, since the model is ignoring events like stolen bases and sacrifices that help produce a single run. It is interesting that the model seems good at predicting the proportions of innings with 2 or more runs.

In my work, one common theme is that simple models tend to be remarkably accurate in predicting the variability in baseball data. This simulation model is certainly wrong (for example, I was ignoring the variability in run scoring between teams), but the model seems useful in understanding the variation in run scoring.