I recently wrote an article about Todd Frazier’s stolen bases for BP Southside, the Baseball Prospectus White Sox site, and in doing so did a decent amount of digging into the different advanced measures of base-stealing productivity—something that would take into account all the necessary components and spit out a measure of runs saved or lost. I got frustrated by a few things, and so decided to type this up, as I think it encapsulates a lot of issues in public sports analysis. All of this is written from a baseball perspective, but it applies at least as much to hockey and probably even more to basketball.
Before I start, I want to say that the various sports stats sites strike me in many ways as emblematic of the promise of the internet: vast amounts of cross-indexed information, usable with minimal technical ability, synthesizing months and years of work often done by volunteers, and shared with anyone who wants it. So I certainly don’t want any of what follows to suggest I don’t appreciate the work that has been done or don’t like using the sites I’m discussing.
For most advanced baseball stats, you get them from one of three places: Baseball Prospectus, Baseball-Reference, and FanGraphs. (Disclosure: I write for sites operated by both Baseball Prospectus and FanGraphs.) For something like base-stealing, where there are two different versions (B-R doesn’t seem to have a standalone stat for this), you have to make a judgment call about which one to use. I use a few primary criteria for this:
- How closely tailored is the metric to the specific question I want answered?
- How comprehensive is the measure? In other words, does it take into account everything I think it should in this situation?
- How transparent and understandable is the measure? For example, could I decompose it to understand the impact of an individual play / game on this measurement? Alternatively, could I break it down to understand the impact of a single decision that was made in the metric’s construction?
- How accurate is the measure? What assumptions does it make, how reasonable are those assumptions in practice, etc.? (You can contrast this with #2 by saying that #2 is how good the theory is and #4 is how good the implementation is.)
Obviously these criteria are interconnected—a more comprehensive measurement is less likely to be transparent but may be more accurate than something that works with broader strokes—but they’re what I think about when I look at these things.
- Both of these metrics (BP’s SBR and FanGraphs’ wSB) are trying to compute how many runs Todd Frazier has created from his decisions to steal bases, so both are in pretty good shape on this front.
- These measures, from what I can tell, are about equally comprehensive. SBR takes run expectancy—for instance, treating steals of second differently from steals of third—into account, and wSB doesn’t. On the other hand, wSB debits runners for each time they don’t take off, which is a subtle but important decision that corrects puzzling SBR results like Paul Konerko being an “average” base-stealer because he never tried to steal bases. Neither metric considers secondary (tertiary?) factors like the impact of stolen base attempts on pitcher and batter behavior, defensive positioning, etc.
- wSB is quite transparent in its computations. There’s a simple formula, and its motivations are pretty well laid out. If you wanted to compute wSB from projections, or over a portion of the season, it’d take you basically no time in a spreadsheet. For SBR, by contrast, there aren’t any computational details, just a two-sentence description with no way for me to understand the smaller decisions that go into it or recreate it under different circumstances.
- It’s pretty hard to assess how good the decisions that go into SBR are, because there’s no transparency. (That said, there are some apparent contradictions in the numbers: as of this writing, Jimmy Rollins has 4 SB opportunities on the leaderboard despite having 5 SB and 2 CS, so something seems wrong there.) For wSB, there are a couple of puzzling decisions, and a couple that seem just wrong:
- Why is the run value of a stolen base equal to 0.2 runs forever? This ignores temporal variation: advancing a base is more useful if there are fewer homers hit, for instance, and that varies over time. (It also ignores the differences between stealing second, third, and home, but we covered that in point 2.)
- Where does the 0.075 term come from?
- Why compute opportunities only for runners at first base, and not second and third? Why count times when there was a runner at second as opportunities, but not times the player reached on an error or a fielder’s choice? None of these would have a huge impact in aggregate, but they’d make the numbers more correct.
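To make the transparency point concrete, the wSB formula as FanGraphs publishes it fits in a few lines of code. The sketch below uses the constants discussed above (0.2 runs per steal and the unexplained 0.075 term); the player and league totals in the usage example are hypothetical, not Frazier’s actual line.

```python
# Sketch of the published wSB formula, as I understand it from FanGraphs'
# glossary: wSB = SB*runSB + CS*runCS - lgwSB * (1B + BB + HBP - IBB).

RUN_SB = 0.2  # fixed run value of a steal (the constant questioned above)

def run_cs(runs_per_out):
    # Run cost of a caught stealing; 0.075 is the term whose provenance
    # is unclear.
    return -(2 * runs_per_out + 0.075)

def lg_rate(lg_sb, lg_cs, lg_1b, lg_bb, lg_hbp, lg_ibb, runs_per_out):
    # League stolen-base runs per opportunity, where "opportunity" means
    # a time on first base (1B + BB + HBP - IBB).
    opps = lg_1b + lg_bb + lg_hbp - lg_ibb
    return (lg_sb * RUN_SB + lg_cs * run_cs(runs_per_out)) / opps

def wsb(sb, cs, singles, bb, hbp, ibb, league_rate, runs_per_out):
    # Player's steal runs minus what a league-average runner would have
    # produced in the same opportunities.
    opps = singles + bb + hbp - ibb
    return sb * RUN_SB + cs * run_cs(runs_per_out) - league_rate * opps

# Hypothetical league environment and player line:
rate = lg_rate(2500, 1000, 26000, 14000, 1500, 1000, 0.11)
player = wsb(13, 5, 90, 40, 5, 2, rate, 0.11)
```

Note that a runner with opportunities and zero attempts comes out negative under this formula, which is exactly the “debit for not taking off” that distinguishes wSB from SBR.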
So neither of these metrics grades out very highly. I find it perplexing and frustrating that when I want to analyze one of the simpler parts of baseball, our most statistically advanced sport, I’m stuck relying on two metrics with what appear to be clear flaws.
Besides my minor gripes with these two stats, there are two generalizations I want to make. One is that, in the era of databases and servers, we should be wary of people who tolerate biases in their “advanced” stats for the sake of simplicity. wSB’s being derivable from the Lahman database (or the Macmillan Baseball Encyclopedia) was useful in the 1990s, but it’s silly now. Simplified wOBA or OPS is useful if I want to save 10 minutes coding something for a blog post, or want to do something computationally intensive, and we should preserve those and similar metrics for such cases, but they’re not acceptable as bottom-line metrics that thousands of fans look at every day.
We have the play-by-play data and the computing power to measure some things more exactly, and we should do it. For park adjustment, we don’t need to assume a player had half his games at home and half at neutral road parks, because we know how many batters a pitcher faced in each park. For league adjustment, we can handle the nuances of interleague play and the DH without just throwing our hands up. (This is why, despite some concerns, I like BP’s Deserved Run Average on the whole; it seems much more flexible than a lot of other baseball metrics.)
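As a toy illustration of the park-adjustment point, here is the difference between the box-score shortcut (half home, half neutral road) and weighting each park’s factor by the batters actually faced there. The park factors and PA counts below are invented for the example, not real values.

```python
# Illustrative sketch: weighting park factors by actual plate appearances
# in each park, versus assuming half the work came at home. All numbers
# are made up for the example.

def naive_park_factor(home_pf):
    # The old shortcut: half home, half neutral (PF = 1.0) road.
    return 0.5 * home_pf + 0.5 * 1.0

def weighted_park_factor(pa_by_park, pf_by_park):
    # Play-by-play version: weight each park's factor by the PA that
    # actually occurred there.
    total_pa = sum(pa_by_park.values())
    return sum(pa * pf_by_park[park]
               for park, pa in pa_by_park.items()) / total_pa

# Hypothetical pitcher season: an uneven road schedule shifts the
# adjustment away from what the naive split assumes.
pa = {"home": 400, "coors": 60, "other_road": 340}
pf = {"home": 0.95, "coors": 1.15, "other_road": 1.0}
```

With these invented numbers, the naive method says 0.975 while the weighted version says about 0.986; the gap comes entirely from knowing where the plate appearances actually happened.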
The other generalization is that obscuring how a metric is computed severely damages its credibility, especially when there is an easily accessible alternative. If you provide the code, or failing that a formula, or failing that a detailed explanation, I can understand what a number means, why it’s different from what I expected, and why it’s different from a similar number at a different site. When it’s just two sentences and I see something strange, what the hell am I supposed to do with that? And then if it’s wrong nobody fixes it, and if it’s right it doesn’t get used.
So in the spirit of all this, some requests for the big sites (FG, B-R, and BP), in roughly ascending order of how much work they are:
- Provide a good way for people to ask questions about your numbers. Mention it specifically on the contact page; make it an explicit employee job responsibility; put up a feedback form. It shouldn’t be contingent on my guessing which writer/editor/developer I should tweet at, or hoping that an email to email@example.com is going to go through. (I don’t mean to denigrate the efforts of the people who do get and respond to these queries, which I’ve seen at each major site; I just know that I not infrequently decide it’s too much work, and that’s a barrier that should surely be reduced or eliminated.)
- Write and publish full explanations of your metrics. Describe where each term in a formula comes from. Link to a study someone did that justifies why you chose that number for replacement level. Explain what it doesn’t include and why. Work through examples. Keep the links and explanations up to date. Solicit feedback.
- Move beyond formulas and publish code. Publishing code makes it easier for people to:
- Identify errors in your implementation.
- Identify implicit assumptions that may need to be challenged.
- Repurpose and build off the work (and in doing so, spread the word and make the metrics more prominent).
- Learn what they need to learn to contribute to the community.
- Take a hard look at all your metrics (especially the ones that are considered to be best-in-class) and ask: could this be better? Is it built off box-score stats where play-by-play would be better? Does it omit something we know how to measure? Does it build in some dumb historical quirk that nobody really likes (like treating errors differently)? If you think the answer’s yes, then fix it.
All of these are especially true for anything built off play-by-play data, since those data are (as far as I know) available to everyone for a minimal investment of time and effort. For the sites I’m talking about, the strengths are largely in the infrastructure to publish a variety of data and tie it together in interesting ways; they aren’t (or shouldn’t be) in IP that’s kept intentionally obscure. So tell me what you’re doing and I’ll trust you more. A thoughtful license should mitigate most of the concerns about people doing things they shouldn’t with the fruits of your labor. For private data sources or extremely complex models, I understand that they can’t be open-sourced in the same way, though I disagree with a lot of the reasoning involved; if anything, that amplifies the need for thoughtful, thorough, clear explanations of what’s under the hood.
People sometimes claim that baseball has been “solved,” or that there aren’t big new advances to be made. They might be right, though they probably aren’t. But if it has been solved, we shouldn’t keep the solutions under lock and key. And if it hasn’t, then let’s get things out in the open, rather than letting errors languish and credibility erode.