Simulating a Half-Inning of Baseball

I am working on a method for determining the efficiency of a team in using its on-base events to score runs. Some baseball teams tend to scatter their on-base events through the nine innings and score few runs. (I am thinking of the 2016 Phillies, but maybe their problem is simply not producing many on-base events.) In contrast, other teams seem to be effective in bunching their on-base events to produce runs. In working on this method, I wanted to write a function to simulate the number of runs scored in a half-inning given the overall probabilities of the six basic events OUT, SINGLE, DOUBLE, TRIPLE, HOME RUN, and WALK/HBP, and some basic runner advancement rules. Anyway, I thought it was a good topic for a post.

Runner Movements

In my simulation, I wanted to use realistic movements of runners on base hits. For example, if there is a runner on first, then sometimes a single will leave runners on first and second, and sometimes you’ll have runners on first and third. Using the 2015 Retrosheet data, I found the frequencies of the different outcomes, which can be converted to probabilities. For example, when there is a runner on 1st, a single advanced the runner to 2nd 3940 times and to 3rd 1411 times, so the probabilities of the final states “runners on 1st and 2nd” and “runners on 1st and 3rd” are approximated to be 3940 / (3940 + 1411) = 0.736 and 1411 / (3940 + 1411) = 0.264. In a similar fashion, I found the runner advancement probabilities for singles and doubles in all runner situations. Here is the runner advancement matrix for a single; rows give the bases before the single and columns give the bases after it. (My notation is that “110” means runners on first and second.)

    000   100 010 001   110   101 011   111
000   0 1.000   0   0 0.000 0.000   0 0.000
100   0 0.000   0   0 0.736 0.264   0 0.000
010   0 0.576   0   0 0.000 0.424   0 0.000
001   0 1.000   0   0 0.000 0.000   0 0.000
110   0 0.000   0   0 0.359 0.239   0 0.401
101   0 0.000   0   0 0.766 0.234   0 0.000
011   0 0.518   0   0 0.000 0.482   0 0.000
111   0 0.000   0   0 0.344 0.223   0 0.433
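
The calculation behind one row of this matrix can be sketched in R as follows. The two counts are the ones quoted above; the variable names (counts, probs, row_100) are my own, chosen only for illustration, and are not taken from the gist.

```r
# Counts of end states after a single with a runner on first
# (2015 Retrosheet figures quoted in the text).
counts <- c("110" = 3940, "101" = 1411)

# Convert frequencies to probabilities
probs <- counts / sum(counts)
round(probs, 3)

# Place the probabilities in a full row over all eight base states
all_bases <- c("000", "100", "010", "001", "110", "101", "011", "111")
row_100 <- setNames(numeric(8), all_bases)
row_100[names(probs)] <- probs

# Every row of a transition matrix should sum to 1
stopifnot(isTRUE(all.equal(sum(row_100), 1)))
```

Repeating this for each starting state would give the eight rows of the matrix.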

The Probabilities of the Six Basic Events

Also to make my simulation more realistic, I used the frequencies of the six basic events from 2015 data.

   OUT     BB     1B     2B     3B     HR 
0.6817 0.0863 0.1543 0.0454 0.0052 0.0270 

Note that I’m ignoring other events such as steals, sacrifices, wild pitches and passed balls that help to produce runs, but I’m interested in how well I can estimate run scoring just using this information.
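
As a quick check of how these probabilities drive the simulation, here is a small sketch that draws plate-appearance events with sample. The vector name prob and the seed are my own choices for illustration. The rounded probabilities sum to 0.9999 rather than 1, which is fine since sample rescales the weights.

```r
# Event probabilities from the 2015 table above
prob <- c(OUT = 0.6817, BB = 0.0863, "1B" = 0.1543,
          "2B" = 0.0454, "3B" = 0.0052, HR = 0.0270)

set.seed(2015)  # arbitrary seed, only for reproducibility
events <- sample(names(prob), size = 100000, replace = TRUE, prob = prob)

# Simulated frequencies should be close to the input probabilities
round(prop.table(table(events))[names(prob)], 4)
```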

The Simulation Function

On my gist site, I have two functions: the function runs_setup defines the runner advancement matrices and a vector that gives the probabilities of the six basic events, and the function simulate_half_inning does the actual simulation. Here’s the crux of the actual simulation. Using the sample function, I simulate an event. Then I simulate the runner advancement. Sometimes the advancement is deterministic (for example, the bases will always be “001” after a triple), but otherwise I simulate it using the Prob_Single and Prob_Double matrices (walks go through a Prob_Walk matrix in the same way). I update the number of outs and runs scored, and then repeat until we have three outs.

 while(outs < 3){
   # draw one of the six basic events
   event <- sample(names(setup$prob), size=1, prob=setup$prob)
   # base state after the event: sampled for 1B, 2B, BB; fixed for 3B, HR, OUT
   if (event=="1B") new_bases <- sample(all_bases, 1, 
                              prob=setup$Prob_Single[bases, ])
   if (event=="2B") new_bases <- sample(all_bases, 1, 
                              prob=setup$Prob_Double[bases, ])
   if (event=="BB") new_bases <- sample(all_bases, 1, 
                              prob=setup$Prob_Walk[bases, ])
   if (event=="3B") new_bases <- "001"
   if (event=="HR") new_bases <- "000"
   if (event=="OUT") new_bases <- bases
   outs <- outs + (event == "OUT")
   # runs.transition() is defined in the gist; the OUT term apparently offsets
   # the batter, who would otherwise be counted as a run when he is retired
   runs <- runs - (event == "OUT") + 
           runs.transition(bases, new_bases) 
   bases <- new_bases
 }
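
The runs.transition function lives in the gist and is not shown in this post. One bookkeeping rule that is consistent with the update line above is "runners before, plus the batter, minus runners after". The sketch below implements that rule under this assumption; the names runners and runs_transition are hypothetical and the gist's version may differ.

```r
# Hypothetical helper: count the runners in a base state such as "110"
runners <- function(state) {
  sum(as.integer(strsplit(state, "")[[1]]))
}

# Runners before, plus the batter, minus runners after.
# On an out with unchanged bases this returns 1, which would explain why
# the loop subtracts (event == "OUT"): the retired batter is not a run.
runs_transition <- function(bases, new_bases) {
  runners(bases) + 1 - runners(new_bases)
}

runs_transition("100", "000")        # home run with a runner on first: 2
runs_transition("111", "001")        # bases-loaded triple: 3
runs_transition("100", "110")        # single, runner stops at second: 0
runs_transition("110", "110") - 1    # out, bases unchanged: 0
```

Under this rule everyone who was on base or at bat and is no longer on base afterwards has either scored or made the out, so no separate scoring logic is needed for each event type.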

How Do the Simulation Results Compare with Real Baseball?

I use the runs_setup function to do the setup, and the replicate function together with simulate_half_inning to simulate the runs scored for 10,000 half-innings:

st <- runs_setup()
R <- replicate(10000, simulate_half_inning(st))
round(prop.table(table(R)), 3)
0     1     2     3     4     5     6     7     8    10    12 
0.762 0.123 0.060 0.033 0.014 0.005 0.002 0.001 0.000 0.000 0.000 

In these simulations, there were 0 runs scored in 76.2% of the half-innings, 1 run scored in 12.3% of the half-innings, and so on.
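
A rough summary of this table is the average number of runs per half-inning it implies. The sketch below recomputes that from the rounded proportions printed above; the vector names are my own, and with the raw simulation output the same quantity is simply mean(R).

```r
# Simulated proportions copied from the table above (rounded to 3 places)
runs <- c(0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 12)
prop <- c(0.762, 0.123, 0.060, 0.033, 0.014, 0.005, 0.002, 0.001, 0, 0, 0)

# Expected runs per half-inning implied by the rounded table
mean_runs <- sum(runs * prop)
mean_runs

# Scaled to nine innings for one team
9 * mean_runs
```

Because the proportions are rounded, this gives only an approximate figure of about 0.44 runs per half-inning.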

I’m interested in how the simulated results compare to the real run production during the 2015 season. I’ve graphed the proportions of 0, 1, 2, … runs scored (in a half-inning) using the simulated and actual data. Actually, I have plotted the logit of each probability, log(prob / (1 - prob)), which makes it easier to visualize small probabilities.
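
For reference, the logit transformation can be sketched in a few lines of R, applied here to the nonzero simulated proportions from the table above (a proportion of exactly 0 would give -Inf). The helper name logit is my own; base R's qlogis computes the same quantity.

```r
# Logit transformation: log odds of a probability
logit <- function(p) log(p / (1 - p))

# Nonzero simulated proportions of 0, 1, ..., 7 runs from the table above
prop <- c("0" = 0.762, "1" = 0.123, "2" = 0.060, "3" = 0.033,
          "4" = 0.014, "5" = 0.005, "6" = 0.002, "7" = 0.001)
round(logit(prop), 2)

# Base R's qlogis() gives the same values
stopifnot(isTRUE(all.equal(logit(prop), qlogis(prop))))
```

On this scale the small proportions for high run counts are spread out instead of being squeezed together near zero.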


Looking at the graph, it seems that the simple model predicts a higher proportion of 0-run innings and a lower proportion of 1-run innings than actually occurred. That makes sense, since the model ignores events like stolen bases and sacrifices that help produce a single run. It is interesting that the model seems to do well in predicting the proportions of innings with 2 or more runs.

In my work, one common theme is that simple models tend to be remarkably accurate in predicting the variability in baseball data. This simulation model is certainly wrong (for example, I was ignoring the variability in run scoring between teams), but the model seems useful in understanding the variation in run scoring.

3 responses

  1. Awesome work, Jim. Are the simulated points for 3 < runs < 12 hiding perfectly behind the actual points?

  2. Josh, yes they are. I should probably redraw so that one set of points is not hidden.

  3. This program is great! Incidentally, I checked it using averages for the major leagues in 2018 using the Tm data here, and checking results for a few specific teams.

    Your system consistently UNDERCOUNTS runs. I figured it had to do with errors rather than stolen bases, sac flies, etc. The fielding average was 0.984 for 2018. I upped all on-base events by 1/.984, but that model consistently OVERCOUNTS runs. After sleeping on it, I decided that fielding percentage should not affect Bases on Balls or Home Runs, since in these cases no ball is in play. In addition, about 35% of plays involve assists. I tried multiplying (At Bats minus hits) by (1/.984 - 1) * 65% and adding (At Bats minus hits) times (1/.984 - 1)^2 * 35% to get an approximation of the number of additional base runners who got on thanks to errors and weren’t credited with a hit. I allocated the additional runners on errors proportionately between singles, doubles, triples, and the results were close to actual league totals.
    By squaring the fraction of assists, I allowed for multiple fielding attempts in assists. The third baseman might fumble a grounder to third, destroying an out and resulting in an error, or he might get the grounder, throw to second or first for a forced out, and the second fielder fumbles, so both the original assister and the final out putter have to field the ball cleanly.

    As I said, the system worked well for the major leagues as a whole, predicting 723 runs per season when the league averaged 721. It worked well for San Francisco, predicting about 607 runs where they scored 603, but way overestimated Boston’s runs, giving 906 when the actual number was 876 for the year.

    I think another investigation might be on fielding errors. I suspect that fielding errors do NOT happen randomly, but are more likely to occur with speedsters, who are more apt to stretch a hit into a double or even a triple. These players, credited with an out when in actuality they got on base, may be undervalued as hitters.
