How (And Why) We Built A World Series Simulator
WBEZ’s attempt to plot all possible futures for all of postseason baseball
This year, the Chicago Cubs evolved from lovable losers to a once-in-a-generation juggernaut.
As the Major League Baseball playoffs started in October, the team became the story everyone in Chicago was talking about. Still, we wanted to cover it in a different way, one that could appeal both to baseball fans and to those in our audience who aren’t as interested in sports.
We landed on a tool that allowed users to peek under the hood of the MLB playoffs by simulating the postseason as many times as they wanted, which we hope taught even baseball fans something new about their sport.
Why We Built It
Every year sites like Baseball Prospectus, Fangraphs, and FiveThirtyEight publish World Series odds for each team entering the playoffs. Those sites run thousands of simulations to publish a single percentage chance. But those numbers often don’t match up with what fans might expect.
That’s because, more than other sports, baseball is ruled by randomness. In 2014 Baseball Prospectus published an article looking at what it would take for a team to have even odds to win the World Series at the start of the playoffs. Their conclusion? It would take the best team in the history of baseball.
Each series is at most seven games, which isn’t enough time for subtle advantages to show up, and with four playoff rounds those unlikely occurrences can add up quickly. Because of that, most sites gave the Cubs only a 1-in-4 shot to win the World Series, even as heavy favorites.
Our goal was to explain that reality, but do more than just report that final percentage. We wanted to let users interact with the model and experience, even for a second, what it feels like to enter a team as good as the Cubs in a tournament as random as the MLB postseason. We felt that clicking a button four times and only seeing the Cubs win once (or not at all) drove home that message better than just saying they had a “25 percent chance.”
How We Built It
The first step was to come up with a way to simulate a single game. Bill James, one of the founders of sabermetrics, came up with a formula to estimate the winning percentage of one team against another. Known as log5, the formula only needs to know each team’s regular season winning percentage to calculate one team’s expected chances against the other.
probability = (team1pct - team1pct*team2pct)/(team1pct+team2pct-2*team1pct*team2pct)
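To make the formula concrete, here is a minimal Python sketch of log5 as written above. The function name and the two win totals are made up for illustration; they are not taken from the simulator's code or from any real team.

```python
def log5(team1_pct: float, team2_pct: float) -> float:
    """Chance that team 1 beats team 2, given each team's winning percentage."""
    numerator = team1_pct - team1_pct * team2_pct
    denominator = team1_pct + team2_pct - 2 * team1_pct * team2_pct
    return numerator / denominator


# Hypothetical records: a 100-win team against an 87-win team over 162 games.
favorite = 100 / 162
underdog = 87 / 162
print(round(log5(favorite, underdog), 3))  # 0.582
```

Even with a 13-win gap in the standings, the better team wins a single game only about 58 percent of the time, which is why short series are so unpredictable.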
Building from there, we built a quick-and-dirty script to simulate an entire postseason. It read from a JSON file with team information, then it created objects for each of the 10 playoff teams and nine series. Each series object contained the two teams involved, the number of games to play, the number of wins a team needed to advance, and what the next series was. A “play_series” function simulated all the games until one team was victorious, and advanced to the next round (or declared them our virtual World Series champions).
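The structure described above might look something like the following Python sketch. It is a reconstruction under assumptions, not the original script: the class names, the field names, the four hypothetical teams and the tiny two-round bracket are all invented for illustration, and only "play_series" borrows its name from the description.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Team:
    name: str
    win_pct: float  # regular season winning percentage


@dataclass
class Series:
    top: Optional[Team]
    bottom: Optional[Team]
    games: int                              # maximum games in the series
    next_series: Optional["Series"] = None  # None means this is the final

    @property
    def wins_needed(self) -> int:
        return self.games // 2 + 1          # 3 of 5, 4 of 7, 1 of 1


def log5(a: float, b: float) -> float:
    return (a - a * b) / (a + b - 2 * a * b)


def play_series(series: Series, rng: random.Random) -> Team:
    """Play games until one team has enough wins, then advance the winner."""
    p_top = log5(series.top.win_pct, series.bottom.win_pct)
    top_wins = bottom_wins = 0
    while max(top_wins, bottom_wins) < series.wins_needed:
        if rng.random() < p_top:
            top_wins += 1
        else:
            bottom_wins += 1
    winner = series.top if top_wins > bottom_wins else series.bottom
    nxt = series.next_series
    if nxt is not None:                     # fill the first open slot
        if nxt.top is None:
            nxt.top = winner
        else:
            nxt.bottom = winner
    return winner


# Hypothetical four-team bracket: two seven-game series feeding a final.
rng = random.Random(2016)
final = Series(None, None, games=7)
semi1 = Series(Team("A", 0.64), Team("B", 0.55), games=7, next_series=final)
semi2 = Series(Team("C", 0.58), Team("D", 0.54), games=7, next_series=final)
for s in (semi1, semi2):
    play_series(s, rng)
champion = play_series(final, rng)
print("Champion:", champion.name)
```

Passing in a seeded random.Random keeps a run repeatable for debugging; drop the seed to get a fresh postseason every time.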
Eventually I stumbled on a great article by Steve Staude in the Hardball Times that built off of James’ log5 formula using the odds ratio. Implementing Staude’s method allowed me to add home field advantage to the simulation (home teams normally win around 54 percent of the time). I also switched from using an unadjusted season winning percentage to BaseRuns, which estimates winning percentage from the hits a team actually got and allowed, and so smooths out the luck inherent even in MLB’s 162-game season. But that was as complicated as our model got, nowhere near what Fangraphs or FiveThirtyEight is doing.
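For readers who want to see how an odds ratio can carry a home field adjustment, here is a small Python sketch. It shows one common way to combine the pieces and is an assumption on my part, not necessarily Staude's exact formulation or the code in the simulator; the function names and the sample percentages are hypothetical. The only number taken from the text is the 54 percent home winning rate.

```python
def to_odds(p: float) -> float:
    return p / (1 - p)


def to_prob(odds: float) -> float:
    return odds / (1 + odds)


def home_win_probability(home_pct: float, away_pct: float,
                         home_edge: float = 0.54) -> float:
    """Chance the home team wins one game.

    home_pct and away_pct are each team's estimated winning percentage
    (for example, a BaseRuns-based figure). home_edge is how often an
    average home team beats an average road team.
    """
    odds = to_odds(home_pct) / to_odds(away_pct) * to_odds(home_edge)
    return to_prob(odds)


# Two evenly matched teams: the home side wins 54 percent of the time.
print(round(home_win_probability(0.5, 0.5), 3))     # 0.54
# Hypothetical .600 team hosting a .550 team, then visiting it.
print(round(home_win_probability(0.6, 0.55), 3))
print(round(1 - home_win_probability(0.55, 0.6), 3))
```

With home_edge set to 0.5 the odds ratio collapses back to plain log5, so the adjustment is a strict extension of the original formula.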
While this setup worked great—basically, it output results to the command line—the simulator needed an interface as well as a way to send the data from the simulator to the user.
At WBEZ, we are big fans of the NPR Visuals team’s work. We use their Daily Graphics Rig for smaller graphics and base many of our large projects on their app template. Both make it easy to develop and publish quickly; they also make a lot of decisions so we don’t have to.
The graphics rig is based mostly on D3, and while I wasn’t originally planning on using it for this project, the NPR tools made it easier to stick with that library than to try to force the simulator into something else. Using those tools, and cribbing from an open source NCAA tournament bracket generator, I was able to rewrite my original code to fit D3. In the end, this turned out to be a really good decision.
The rewritten code uses the same JSON file of teams, and processes it to create a list of all possible series. After creating a container SVG, D3 creates an element for each series and adds a text element for the top and bottom teams in the matchup.
When a user clicks the button to simulate a series, a function clears any previous information from the bracket. Using D3, we select each of the series objects, play a series with the teams, and advance the winner to the next series, which in turn gets simulated.
The NPR templates are also built using their pym.js, which made it easy to make the final graphic responsive. The only issue is that the graphics rig is set up to redraw the entire graphic on resize, and with dynamic data we couldn’t just reset to the original state. We solved that by splitting up the functions to simulate the series and to write the results to the bracket. When a user resizes their window, the render function redraws the bracket and then rewrites the results, but it doesn’t touch the underlying data.
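The idea of keeping the simulation separate from the drawing can be shown in a few lines. This is a language-neutral sketch written in Python rather than the project's D3 code; the matchups, the function names and the text "bracket" are all hypothetical stand-ins, and a coin flip replaces the real game-by-game simulation.

```python
import random

# Hypothetical matchups; the real bracket is built from the teams JSON file.
MATCHUPS = {
    "series_1": ("Team A", "Team B"),
    "series_2": ("Team C", "Team D"),
}


def simulate(matchups, rng):
    """Decide each series once and return the results as plain data.
    A coin flip stands in for the real simulation here."""
    return {name: rng.choice(teams) for name, teams in matchups.items()}


def render(matchups, results, width):
    """Rebuild every line of the bracket for the given width, then write in
    whatever results already exist. Reads the results, never changes them."""
    lines = []
    for name, (top, bottom) in matchups.items():
        label = f"{top} vs. {bottom}"[:width].ljust(width)
        lines.append(f"{label} | {results.get(name, '')}")
    return lines


results = simulate(MATCHUPS, random.Random(1))  # user clicks the button
wide = render(MATCHUPS, results, width=30)      # first draw
narrow = render(MATCHUPS, results, width=12)    # window resized: redraw only
print("\n".join(wide))
print("\n".join(narrow))
```

Because render only reads the stored results, it can run on every resize without rerolling a postseason the user has already seen.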
In testing, we found that users might want a sense of which team was the favorite in each matchup. Calculating the series probability is more complicated than just the odds for a single game. Again, Staude’s great Hardball Times article explained how to use a cumulative binomial distribution to estimate a team’s chance in a series of any given length.
I built a function in R that generated the odds ratio chance of Team A winning versus Team B, then ran that through the pbinom function to make a probability matrix for all possible matchups. Finally, I imported that into the main simulator, and displayed the probability for each matchup next to the result.
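The same calculation can be sketched in Python, with a hand-rolled cumulative binomial standing in for R's pbinom. This is an illustration under simplifying assumptions, not the original R function: it uses plain log5 with a constant per-game probability (no home field adjustment), and the team names and percentages are hypothetical.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), the quantity R's pbinom(k, n, p) gives."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def series_win_probability(game_pct: float, games: int) -> float:
    """Chance of winning a best-of-`games` series given a per-game chance.

    Pretending all the games are played does not change who wins the series,
    so this is the chance of winning at least a majority of them.
    """
    wins_needed = games // 2 + 1
    return 1 - binom_cdf(wins_needed - 1, games, game_pct)


def log5(a: float, b: float) -> float:
    return (a - a * b) / (a + b - 2 * a * b)


# Hypothetical teams and winning percentages.
teams = {"A": 0.64, "B": 0.58, "C": 0.54}

# matrix[x][y] is the chance that x beats y in a seven-game series.
matrix = {
    a: {b: series_win_probability(log5(pa, pb), 7)
        for b, pb in teams.items() if b != a}
    for a, pa in teams.items()
}

print(round(series_win_probability(0.6, 7), 4))  # 0.7102
for name, row in matrix.items():
    print(name, {opp: round(p, 3) for opp, p in row.items()})
```

A team that wins 60 percent of individual games still loses a seven-game series almost three times in ten, which is the gap between single-game and series odds the bracket was meant to show.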
As a final touch, we tried to design the graphic to match the iconic Wrigley Field scoreboard. It has the dual advantages of being easily recognizable and easy to do, with only two colors and straight lines. Thanks, 1930s designers!
The Final Score
This project was different from anything we’d tried before at WBEZ and came with a lot of moving parts. (Here’s the repo.)
We had to decide if the simulator should live with our story about probability in baseball or on its own. We opted to combine them, but I wouldn’t rule out doing it differently next time. There were also a number of extra features (such as letting users set team strength, simulating the postseason from its current state, and displaying odds predictions by team, like FiveThirtyEight et al.) that we had to do without.
We saved a lot of time by using open source tools from NPR and reusing past templates. Much of the graphics code came from the graphics rig, and the bones of the design came from another story I did based around a large embed graphic last year. That helped make sure I didn’t spend a lot of development time on what was just a fun way to look at sports.
Judging by analytics and comments, our intended audience liked it, which means it was a success, given that sports don’t make up a big part of WBEZ’s daily coverage. And it gave us something we could share throughout the baseball playoffs, whether the Cubs won or lost.
Chris Hagan is a web and data producer at Chicago Public Media, and learned to code to scrape stats off MLB.com. You can follow him @chrishagan.