Misadventures in Data Journalism

I don’t know the technical term for this fallacy, but it’s unfortunately common for people to assert a causal link between a subpopulation A and trait B, look at the incidence of B in A, and draw conclusions about the link without ever looking at B in the population that isn’t A. For instance, if you say a stock went down 5 points, that looks bad; if the whole market went down 10, that stock looks good; if companies similar to that stock went up 25, the stock looks bad. Context is key.

People who work with data habitually should be pretty good at avoiding this error; teams of people, including editors, should be very good at avoiding it. And yet we get stuff like this Wall Street Journal article.

If it’s paywalled for you, it shows that the incidence of babies named Shea has bounced around over time in the tristate area (NY, NJ, CT), and the author Andrew Beaton attributes that to Mets fandom. The teaser in the tweet suggests as much: “When the Mets are good, more NYC-area babies are named Shea.” (To be fair, if you read that sentence strictly it only suggests correlation, but I think it’s reasonable to say that it’s a correlation that’s only interesting if there’s at least a vague causal link.)

The meat of this article is this chart:

Before returning to my main issue with the article, I want to point out three issues with this chart:

  1. Doesn’t adjust for population growth. The New York City MSA grew by about 16% from 1990 to 2010, so that’s worth taking into account.
  2. Doesn’t account for randomness. If you picked a fake time trend and generated data from it, would it look any different? Probably not, I suspect.
  3. Doesn’t have a consistent way of picking notable years. The highlighted years were pretty clearly chosen post-hoc, as there’s some inconsistency as to whether the spike comes the same year as the good performance (1986, sort of 2000) or after (2007, sort of 2000), and the division championship in 1988 is actually a trough. (Plus the 1969 World Series and 1973 pennant don’t seem to have an impact.)
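
One rough way to probe the second point is to simulate it: pick a smooth fake trend with no team effect at all, draw noisy yearly counts around it, and check how often a "spike" lands within a year of a date chosen in advance. The sketch below is only an illustration under assumptions: the trend, the Poisson noise model, the target year, and the definition of a spike are all invented here and are not taken from the WSJ chart or the SSA data.

```python
import math
import random


def poisson(lam, rng):
    """Draw one Poisson(lam) value with Knuth's method (fine for small lam)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def local_peaks(counts):
    """Indices of years whose count is strictly above both neighbors."""
    return [
        i for i in range(1, len(counts) - 1)
        if counts[i] > counts[i - 1] and counts[i] > counts[i + 1]
    ]


def share_with_peak_near(trend, target_idx, window=1, n_sims=2000, seed=0):
    """Share of pure-noise series with a local peak within `window` years of target."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        counts = [poisson(lam, rng) for lam in trend]
        if any(abs(p - target_idx) <= window for p in local_peaks(counts)):
            hits += 1
    return hits / n_sims


# Hypothetical trend: expected births rise by one a year, with no team effect.
fake_trend = [20 + i for i in range(20)]
share = share_with_peak_near(fake_trend, target_idx=10)
print(f"Noise-only series with a spike near the target year: {share:.2f}")
```

If that share comes out large, a spike that lines up "sort of" with a good season is weak evidence on its own, which is the concern behind the third point as well.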

You should look for all of those, in particular the second and third, when people make charts like that, but again, I’m here to talk about the one that came to me first, which is that they draw this conclusion about the tristate area without looking at any national data.

Here’s the plot of Shea over time, nationally:


This doesn’t directly disprove that there’s a Mets effect, since the trends aren’t the same, but the uptick in tristate Sheas in the mid-1980s is the same as that huge jump in the national trend, and the positive trend afterwards is also seen in both datasets. So, without pretty strong corroboration, it seems wrong to assert that this is a tristate trend, and not a national trend.

Does that mean it’s not Mets-related? Not necessarily, since there are Mets fans all over the country. But I would suspect that a healthy majority of them are located in those three states, and removing them doesn’t visually change the trend at all, so again, it’s pretty aggressive to attribute the relationship to the Mets.


Finally, instead of using the other 47 states as a control, we can use a different one: the also somewhat common name “Shay.”


So, while the Shays ebb and flow in a way not too different from the Sheas for most of the past 30 years, there’s a persistent change in the Shea/Shay ratio right around the time the Mets got good for the first time that appears to be converging back after the close of Shea Stadium. Maybe that’s something, maybe it’s not; I didn’t adjust for gender and I don’t know what other demographic factors (e.g. ancestry) could affect this balance. My takeaway from the Shay analysis is that it provides minimal evidence for Mets fans driving Shea and some broad evidence for Shea Stadium driving the Shea/Shay ratio.
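
For anyone who wants to redo the Shea/Shay comparison, here is a minimal sketch of the ratio calculation. It assumes the national SSA files are plain text with one `name,sex,count` record per line and one file per year; check that layout against the actual download. The function names are hypothetical and the counts in the example are made up. Counts are summed across sexes, matching the caveat above that no gender adjustment was made.

```python
import csv
import io
from collections import defaultdict


def parse_year_file(year, text):
    """Parse 'name,sex,count' lines (assumed layout) into (year, name, sex, count)."""
    return [
        (year, name, sex, int(count))
        for name, sex, count in csv.reader(io.StringIO(text))
    ]


def ratio_by_year(rows, numerator="Shea", denominator="Shay"):
    """Yearly ratio of babies given `numerator` to babies given `denominator`."""
    totals = defaultdict(int)
    for year, name, _sex, count in rows:
        if name in (numerator, denominator):
            totals[(year, name)] += count
    ratios = {}
    for year in sorted({y for y, _ in totals}):
        denom = totals.get((year, denominator), 0)
        if denom:  # skip years with no babies of the denominator name
            ratios[year] = totals.get((year, numerator), 0) / denom
    return ratios


# Made-up counts, only to show the shape of the calculation.
sample = (
    parse_year_file(1985, "Shea,F,100\nShea,M,50\nShay,F,60\nShay,M,40\n")
    + parse_year_file(1986, "Shea,F,180\nShea,M,60\nShay,F,70\nShay,M,50\n")
)
ratios = ratio_by_year(sample)
print(ratios)
```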

What I’m interested in isn’t really whether this piece’s conclusion about Shea is accurate; more broadly, my point is that this shows the real challenges associated with publishing good empirical work in the rhythm of a daily paper or blog. For a piece like this to be high enough quality to run, the researcher has to have both the ability and the resources to take an extra couple hours (at least) thinking about and testing alternate hypotheses and doing sensitivities, and after doing that will quite likely end up with a muddier conclusion (less interesting to most readers) or a null result like this (really uninteresting to most readers). The most likely good outcome is that you get a bunch of stuff that doesn’t change your conclusion that you either cram in a footnote (hurting your style but keeping geeks like me off your back) or omit altogether (easier, but very bad professional practice).

(It’s a separate issue, but they also didn’t release data and code for this, which is a big pet peeve of mine. It’s probably too much to ask people to add the underlying analysis for every tiny post (especially this one, which was probably just Excel) to a repository of some sort, but even a link to the raw data would be nice.)

I think there’s a place for people who can use SQL and R in the newsroom; I even do some stuff like this myself. (Just tonight I did some Retrosheet queries to answer a question on Twitter, and pieces like this one about John Danks are pretty similar in concept to the Shea piece.) I really do question, though, whether trying to keep pace with the more traditional side of the newsroom is good for readers, writers, or outlets; given the guaranteed drop in both quality and relevance of the analysis, it’s hard for me to believe it’s anything but bad.

I used Social Security Administration data, found here, for all of my analysis. I haven’t had a chance to clean my code up to get it on GitHub yet, but it’ll make it there soon I expect.

Some Grumbling That Is and Is Not About Stolen Base Metrics

I recently wrote an article about Todd Frazier’s stolen bases for BP Southside, the Baseball Prospectus White Sox site, and in doing so did a decent amount of digging into the different advanced measures of base-stealing productivity—something that would take into account all the necessary components and spit out a measure of runs saved or lost. I got frustrated by a few things, and so decided to type this up, as I think it encapsulates a lot of issues in public sports analysis. All of this is written from a baseball perspective, but it applies at least as much to hockey and probably even more to basketball.

Before I start I want to say that the various sports stats sites strike me in many ways as emblematic of the promise of the internet. Vast amounts of cross-indexed information that can be used with minimal technical abilities, and synthesizing months and years of work done often by volunteers, shared with anyone who wants it. So I certainly don’t want any of what follows to suggest I don’t appreciate the work that has been done or don’t like using the sites I’m discussing.

For most advanced baseball stats, you get them from one of three different places: Baseball Prospectus, Baseball-Reference, and FanGraphs. (Disclosure: I write for sites operated by each of Baseball Prospectus and FanGraphs.) For something like base-stealing where there’s two different versions (B-R doesn’t seem to have a standalone stat for this), you have to make a judgment call about which ones to use. I use a few primary criteria for this:

  1. How closely tailored is the metric to the specific question I want answered?
  2. How comprehensive is the measure? In other words, does it take into account everything I think it should in this situation?
  3. How transparent and understandable is the measure? For example, could I decompose it to understand the impact of an individual play / game on this measurement? Alternatively, could I break it down to understand the impact of a single decision that was made in the metric’s construction?
  4. How accurate is the measure? What assumptions does it make, how reasonable are those assumptions in practice, etc.? (You can contrast this with #2 by saying that #2 is how good the theory is and #4 is how good the implementation is.)

Obviously these criteria are interconnected—a more comprehensive measurement is less likely to be transparent but may be more accurate than something that works with broader strokes—but they’re what I think about when I look at these things.

So how do BP’s SBR and FG’s wSB do when evaluated with these criteria? (Links are to the respective glossaries, which you should probably read before continuing.)

  1. Both of these metrics are trying to compute how many runs Todd Frazier has created from his decisions to steal bases, so both are in pretty good shape on this front.
  2. These measures, from what I can tell, are about equally comprehensive. SBR takes run expectancy—for instance, treating steals of second differently from steals of third—into account, and wSB doesn’t. On the other hand, wSB debits runners for each time they don’t take off, which is a subtle but important decision that corrects puzzling SBR results like Paul Konerko being an “average” base-stealer because he never tried to steal bases. Neither metric considers secondary (tertiary?) factors like the impact of stolen base attempts on pitcher and batter behavior, defensive positioning, etc.
  3. wSB is quite transparent in its computations. There’s a simple formula, and its motivations are pretty well laid-out. If you wanted to compute wSB from projections, or over a portion of the season, it’d take you basically no time in a spreadsheet. For SBR, by contrast, there aren’t any details for computing things—it’s a two sentence description with no way for me to understand the smaller decisions that go into it, or recreate it under different circumstances.
  4. It’s pretty hard to assess how good the decisions that go into SBR are, because there’s no transparency. (That said, there are some seeming contradictions in the numbers—for instance, as of this writing Jimmy Rollins has 4 SB opportunities on the leaderboard, despite having 5 SB and 2 CS, so something seems wrong there.) For wSB, there are a couple of puzzling decisions, and a couple seem just wrong:
    1. Why is the run value of a stolen base equal to 0.2 runs forever? This ignores temporal variation: advancing a base is more useful if there are fewer homers hit, for instance, and that varies over time. (It also ignores the differences between stealing second, third, and home, but we covered that in point 2.)
    2. Where does the 0.075 term come from?
    3. Why compute opportunities only for first base, and not second and third? Why include times that there was a runner at second as opportunities, but not times the player reached on an error or a fielder’s choice? None of these will have a huge impact in aggregate, but they’d make the numbers more correct.
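
To make the wSB questions concrete, here is a sketch of the calculation as I understand it from the FanGraphs glossary. Treat the exact form as an assumption and check it against the glossary before relying on it. The 0.2 and 0.075 constants are the ones questioned above; the opportunity count is singles plus walks plus hit-by-pitches minus intentional walks; the league totals and the runs-per-out value in the example are made up.

```python
RUN_SB = 0.2  # fixed run value of a stolen base, as questioned in point 1


def run_cs(runs_per_out):
    """Run cost of a caught stealing (negative); 0.075 is the unexplained term."""
    return -(2 * runs_per_out + 0.075)


def opportunities(line):
    """Times reaching first via single, walk, or HBP, excluding intentional walks."""
    return line["1B"] + line["BB"] + line["HBP"] - line["IBB"]


def wsb(player, league, runs_per_out):
    """Steal runs above what a league-average runner produces per opportunity."""
    rcs = run_cs(runs_per_out)
    league_rate = (league["SB"] * RUN_SB + league["CS"] * rcs) / opportunities(league)
    return (
        player["SB"] * RUN_SB
        + player["CS"] * rcs
        - league_rate * opportunities(player)
    )


# Made-up league totals and a runner who never attempts a steal.
league = {"SB": 2500, "CS": 900, "1B": 27000, "BB": 14000, "HBP": 1500, "IBB": 1000}
non_runner = {"SB": 0, "CS": 0, "1B": 100, "BB": 50, "HBP": 5, "IBB": 5}
print(round(wsb(non_runner, league, runs_per_out=0.17), 2))
```

With these inputs the non-runner comes out below zero, which is the "debit for not taking off" behavior described in point 2 of the previous list.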

So neither of these metrics grades out very highly. I find it perplexing and frustrating that, if I’d like to analyze one of the simpler parts of baseball, our most statistically advanced sport, I’m stuck relying on two metrics with what appear to be clear flaws.

Besides my minor gripes with these two stats, there are two generalizations I want to make from this. One is that, in the era of databases and servers, we should be wary of people who allow for biases, especially in their “advanced” stats, for the sake of simplicity. wSB’s being something you could derive from the Lahman database (or the Macmillan Baseball Encyclopedia) was useful in the 1990s, but it’s silly now. Simplified wOBA or OPS are useful if I want to save 10 minutes coding something for a blog post or want to do something computationally intensive, and we should preserve those and similar metrics for such cases, but they’re not acceptable for bottom line metrics that thousands of fans look at every day.

We have the play-by-play data and the computing power to measure some things more exactly, and we should do it. For park adjustment, we don’t need to assume a player had half his games at home and half at neutral road parks, because we know how many batters a pitcher faced in each park. For league adjustment, we can handle the nuances of interleague play and the DH without just throwing our hands up. (This is why, despite some concerns, I like BP’s Deserved Run Average on the whole; it seems much more flexible than a lot of other baseball metrics.)
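
As a small illustration of the park-adjustment point, the sketch below contrasts the "half home, half neutral road" shortcut with a factor weighted by the batters a pitcher actually faced in each park. The park labels, factors, and batter counts are invented, and real park adjustments involve more than a single multiplier per park; this only shows the weighting idea.

```python
def half_home_park_factor(home_factor, road_factor=1.0):
    """Shortcut: assume half the games at home and half at a neutral road park."""
    return (home_factor + road_factor) / 2


def weighted_park_factor(batters_faced_by_park, park_factors):
    """Average park factor weighted by batters faced in each park."""
    total = sum(batters_faced_by_park.values())
    return sum(
        faced * park_factors[park]
        for park, faced in batters_faced_by_park.items()
    ) / total


# Invented example: a pitcher with an uneven home/road split.
faced = {"HOME": 400, "ROAD_A": 250, "ROAD_B": 150}
factors = {"HOME": 1.10, "ROAD_A": 0.95, "ROAD_B": 1.05}

shortcut = half_home_park_factor(factors["HOME"])
weighted = weighted_park_factor(faced, factors)
print(f"shortcut={shortcut:.4f} weighted={weighted:.4f}")
```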

The other generalization is that if you obscure how a metric is computed you severely damage its credibility, especially when there is an easily accessible alternative. If you provide the code, or failing that a formula, or failing that a detailed explanation, I can understand what a number means, why it’s different from what I expected, why it’s different from a similar number at a different site. When it’s just two sentences and I see something strange, what the hell am I supposed to do with that? And then if it’s wrong nobody fixes it, and if it’s right it doesn’t get used.

So in the spirit of all this, some requests for the big sites (FG, B-R, and BP), in roughly ascending order of how much work they are:

  1. Provide a good way for people to ask questions about your numbers. Mention it specifically on the contact page; make it an explicit employee job responsibility; put up a feedback form. It shouldn’t be contingent on my guessing which writer/editor/developer I should tweet at, or hoping that an email to is going to go through. (I don’t mean to denigrate the efforts of the people who do get and respond to these queries, which I’ve seen at each major site; I just know that I not infrequently decide it’s too much work, and this is a barrier that should surely be reduced or eliminated.)
  2. Write and publish full explanations of your metrics. Describe where each term in a formula comes from. Link to a study someone did that justifies why you chose that number for replacement level. Explain what it doesn’t include and why. Work through examples. Keep the links and explanations up to date. Solicit feedback.
  3. Move beyond formulas and publish code. Publishing code makes it easier for people to:
    • Identify errors in your implementation.
    • Identify implicit assumptions that may need to be challenged.
    • Repurpose and build off the work (and in doing so, spread the word and make the metrics more prominent).
    • Learn what they need to learn to contribute to the community.
  4. Take a hard look at all your metrics (especially the ones that are considered to be best-in-class) and ask: could this be better? Is it built off box-score stats where play-by-play would be better? Does it omit something we know how to measure? Does it build in some dumb historical quirk that nobody really likes (like treating errors differently)? If you think the answer’s yes, then fix it.

All of these are especially true for anything built off play-by-play data, since those are (as far as I know) available to everyone for a minimal investment of time and effort. For the sites I’m talking about, their strengths are largely in the infrastructure to publish a variety of data and tie it together in interesting ways; they aren’t (or shouldn’t be) in IP that they’re keeping intentionally obscure. So tell me what you’re doing and I’ll trust you more. A thoughtful license should mitigate most of the concerns about people doing things they shouldn’t with the fruits of your labor. (For private data sources or extremely complex models, I understand that they can’t be open-sourced in the same way (though I disagree with a lot of the reasoning involved), but if anything that amplifies the need for thoughtful, thorough, clear explanations of what’s under the hood.)

People sometimes discuss how baseball has been “solved,” or that there aren’t big new advances to be made. They might be right, but they probably aren’t. But if it has been solved, we shouldn’t have to keep the solutions behind lock and key. And if it hasn’t, then let’s get things out in the open, rather than letting errors languish and credibility erode.

Trying to Beat PECOTA

I don’t have much experience doing strictly predictive modelling. There are a few reasons for this—it hasn’t been part of the job where I’ve worked, and building a prediction/projection system has never seemed worth it as a side sports project. But when Baseball Prospectus opened up a “Beat PECOTA” contest, I figured it was something I could do in a quick-and-dirty fashion. It’d be fun and I’d get a partial answer to something I’m curious about: in a baseball context, how much can fancy machine learning algorithms substitute for some of the very subtle domain knowledge that most projection systems rely on. So I took most of the afternoon and a lot of the evening the day before Opening Day and gave it a shot.

It wound up being not as quick and a fair bit dirtier than I was hoping for, but I got some submissions in (not quite how I’d’ve liked, as I’ll explain below), and I’m using this post to explain what I wound up doing, the issues I noticed as I did it, and some of the things that the model output suggests.

This post is also an experiment in writing and publishing something using RMarkdown (so you can see all the code and corresponding ugly-ass output with less writing involved). If you’re not interested in the R code, I apologize for the clutter.

Some Background

PECOTA predicts major league performance for a large number of players (hitters and pitchers), and the contest is pretty straightforward: for as many players as you want, pick over or under; if the player deviates from the projection by a certain amount (0.007 of True Average or 0.3 runs of Deserved Run Average) and exceeds a playing time threshold (80 PA or 20 IP), then you get 10 points for the right direction and lose 11.5 points for the wrong direction.
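
The scoring rule can be sketched in a few lines. This is an illustration, not BP's actual scoring code: the function name is hypothetical, and how exact ties at the margin or at the playing-time threshold are handled is an assumption here. "Over" and "under" refer to the number itself, so an "over" on DRA means a worse season.

```python
WIN_POINTS = 10.0
LOSS_POINTS = -11.5

# Probability of being right needed for a scored pick to break even.
BREAK_EVEN = -LOSS_POINTS / (WIN_POINTS - LOSS_POINTS)


def score_pick(pick, projected, actual, playing_time, margin, min_playing_time):
    """Points for one over/under pick under the rules described above.

    Assumption: a pick scores only if the deviation is at least `margin`
    and playing time is at least `min_playing_time`.
    """
    if playing_time < min_playing_time:
        return 0.0
    diff = actual - projected
    if abs(diff) < margin:
        return 0.0
    result = "over" if diff > 0 else "under"
    return WIN_POINTS if pick == result else LOSS_POINTS


# Hitter example: projected .260 TAv, actual .275 in 400 PA.
print(score_pick("over", 0.260, 0.275, 400, margin=0.007, min_playing_time=80))
print(f"break-even hit rate on scored picks: {BREAK_EVEN:.3f}")
```

Because a miss costs more than a hit gains, picks that get scored need to be right a bit more than half the time just to break even.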

Since BP archives PECOTA projections (sort of: PECOTA outputs change somewhat frequently as the inputs change, but you can grab a snapshot from around the same time each year), we can go back and score all of their projections in years past based on the current scoring rules. In theory, then, you can look for patterns in past over/unders and use that to predict how likely a player is to have a given result.

The Model

(Feel free to skip to the next section if you don’t care about the stats involved.) I decided to use a random forest model. You should read up on this class of models if you’re not familiar with it, but the basic theory is to grow a large number of individual decision trees on random selections of data using random choices of predictors. Because of the large number of trees, it is capable of avoiding much of the overfitting that a traditional regression model is subject to, which is key in this context because of the comparatively limited number of data points (roughly 300 each of hitters and pitchers per year).

By using a tree model, the random forest also incorporates non-linearities (for instance, a variable whose predictive power changes depending on the value of another variable) naturally without having them be pre-specified. Since I expected that few, if any, variables would have simple linear relationships with the outcome of interest, this is a huge plus for the random forest.

As with any class of models, RFs have their drawbacks: they require parameter tuning (e.g., figuring out how many trees to grow) and they don’t provide clear outputs (like tests of statistical significance or an R² figure), but for someone who’s just trying to throw something together (as I was) they make a lot of sense.

The Data

I trained the model on a dataset consisting of 2014 and 2015 PECOTA projections, along with some historical performance data (for instance, how they did relative to their projection the previous year). I decided not to train the model on seasons from before 2014 for two main reasons. The first was the amount of time it would take to pull and validate the data and to match it up with the external data sources I was using. The second was relevance: PECOTA changes substantively year-to-year, and so its blind spots in years past likely differ from any current gaps the model will find.

For pitchers, I also merged in information from Steamer and ZiPS, two other projection systems available from FanGraphs; in theory, discrepancies between PECOTA and other systems indicate a greater likelihood of PECOTA missing. I do think their inclusion did help, and would like to use them in any further analysis of this.

I originally intended to have my pitchers submission include ZiPS and Steamer, but due to a coding error I only found later (quick and very dirty!), that didn’t end up happening. Cleaning the datasets to merge them took a fair amount of time for a modest increase in predictive power, so I skipped doing that for the hitters.

One other huge pitching flaw: the PECOTA contest is only being judged on DRA. DRA wasn’t released until last year, so there are no projections for it in old data. There were projections for Fair Run Average (DRA’s sort-of predecessor), but FRA results were deprecated and aren’t available anymore. In the interest of speed, I just ran things with plain ERA. Plain ERA and DRA are very different stats, but I actually wouldn’t be surprised if this didn’t make a huge difference (i.e. that beating a DRA projection typically overlaps with beating an ERA projection).

The variables I wound up including in the two projections are:

  • Handedness
  • Height
  • Weight
  • League (hitters only, by accident; yet again, I did this very quickly)
  • Age
  • BP Breakout, Improve, Collapse, Attrition scores
  • Rookie indicator

For hitters:

  • Position
  • HR
  • BB
  • SO
  • AVG
  • OBP
  • SLG
  • tAV
  • PA
  • Prior year’s: projected tAV, tAV, PA, projection result

For pitchers:

  • BB9
  • SO9
  • GB%
  • ERA
  • IP
  • Prior year’s: projected ERA, ERA, projection result (lagged IP left out due to oversight)

I judged the different model specifications on how they did on a 30% validation set, using as my metric the actual BP scoring rules. I ultimately settled on judging them on all predictions, as restricting to high-certainty ones (or even positive expected value ones) seemed to decrease the sample size without necessarily improving the results; having chosen the model, I then fit it to the entire dataset, and predicted on the 2016 data.
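
The validation step can be sketched roughly as follows. This is not the code behind the post: the function names are hypothetical, and it assumes the model gives each player a probability of a scored "over" and of a scored "under", with the rest being a no-score. `min_ev=None` corresponds to judging on all predictions; setting a floor restricts scoring to picks above that expected value.

```python
import random

WIN, LOSS = 10.0, -11.5


def split_rows(rows, validation_share=0.3, seed=0):
    """Shuffle and split rows into (training, validation)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_val = round(len(shuffled) * validation_share)
    return shuffled[n_val:], shuffled[:n_val]


def expected_points(p_over, p_under):
    """Expected contest points for each side; the leftover probability scores 0."""
    return {
        "over": WIN * p_over + LOSS * p_under,
        "under": WIN * p_under + LOSS * p_over,
    }


def contest_score(predictions, outcomes, min_ev=None):
    """Total points when picking the higher-EV side for every player.

    predictions: list of (p_over, p_under); outcomes: 'over', 'under' or 'push'.
    """
    total = 0.0
    for (p_over, p_under), outcome in zip(predictions, outcomes):
        ev = expected_points(p_over, p_under)
        pick = max(ev, key=ev.get)
        if min_ev is not None and ev[pick] < min_ev:
            continue  # skip picks below the expected-value floor
        if outcome != "push":
            total += WIN if pick == outcome else LOSS
    return total


# Made-up validation predictions and results.
preds = [(0.6, 0.2), (0.3, 0.5), (0.4, 0.35)]
results = ["over", "over", "push"]
print(contest_score(preds, results), contest_score(preds, results, min_ev=2.0))
```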

Some Results

First: My full prediction set is in my GitHub; the ones I submitted to BP were a haphazard subset, due to their cutoff at 99 predictions per category and the issues I had submitting them in the first place (and they weren’t changed as I revised the code, so some screw-ups might persist there). If you’re curious about what this wonky black box spits out for your favorite player, go there.

Moving on to potentially more generalizable patterns, I’ve plotted the feature importance lists for the two models below. (Feature importance is determined by comparing actual prediction results to prediction results when that variable is randomly permuted. See this description.)
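
The permutation idea itself is simple enough to sketch generically. The version below is an illustration only: it uses a toy stand-in for a fitted model and plain accuracy on a fixed set of rows, whereas a random forest implementation may compute importance on out-of-bag samples and scale it differently. All names and data are made up.

```python
import random


def accuracy(model, rows, labels):
    """Share of rows where the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, column, n_repeats=20, seed=0):
    """Average drop in accuracy when one column is shuffled across rows."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(n_repeats):
        values = [row[column] for row in rows]
        rng.shuffle(values)
        permuted = [
            row[:column] + (value,) + row[column + 1:]
            for row, value in zip(rows, values)
        ]
        drops.append(baseline - accuracy(model, permuted, labels))
    return sum(drops) / len(drops)


def toy_model(row):
    """Stand-in for a fitted model: looks only at age (column 0)."""
    return "over" if row[0] > 27 else "under"


# Made-up (age, height) rows; labels come from the toy model itself.
rows = [(22, 70), (24, 75), (26, 72), (29, 74), (31, 71), (33, 73)]
labels = [toy_model(row) for row in rows]

age_importance = permutation_importance(toy_model, rows, labels, column=0)
height_importance = permutation_importance(toy_model, rows, labels, column=1)
print(age_importance, height_importance)
```

A variable the model never uses shows zero importance, while shuffling one it depends on degrades the predictions.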

A few takeaways from this whole exercise (nothing particularly novel):

  • The last two years, there’ve been apparently exploitable patterns in PECOTA’s projections of regular players, most concretely that batters are noticeably more likely to underperform than overperform. Those performance patterns also correlate with other existing variables.
  • Even with data issues, code sloppiness, moderate lack of know-how, and a big time crunch, it’s possible to put together an ML model that seems to perform fairly well in this prediction space.
  • There are much worse ways to spend the day before Opening Day than crunching baseball numbers.

I’m planning on following up on that first point with a more general and methodical study in the near future, as I think it’s potentially an area where some real understanding can be gained.

I linked my predictions earlier; all code and data from this are on GitHub, except for the PECOTA spreadsheets, which aren’t mine to distribute.

Sharing is Caring

For an article I’m working on (plan to see it at Hardball Times some time TBD), I had cause to analyze the data from the Fans Scouting Report run by Tom Tango. While FanGraphs hosts the data from 2009 onwards, I couldn’t find a clean version of the 2003–2008 dataset, so I pulled the data off Tango’s site and then did some annoying but largely insubstantial cleaning so they can be combined with the data available on FG.

For anyone that wants to read the code or use the data, I’ve posted the code and datasets on my GitHub. If you click around, you may notice that that’s all I have up there, but my intention is to post code and data for articles from now on, and ideally go back and fill in some of my old posts as well.

Finally, if you haven’t read my most recent piece, it went up about 5 weeks ago at THT; it’s a slightly out-there proposal to rearrange baseball’s schedule and alignment to improve the quality of the regular season. Read it here.

Some Housekeeping Links

Links to my two most recent pieces, both over at the Hardball Times:

From last week, a look at MLB trading patterns and which teams trade with each other the most.

And from March, a look at how the strike zone changes depending on day/night games and dome/outdoor games.

Obviously, I’ve been writing less this year, but I have several things in the pipeline, though it’s yet to be determined whether they get posted here, THT, or somewhere else.