Misadventures in Data Journalism

I don’t know the technical term for this fallacy, but it’s unfortunately common for people to assert a causal link between a subpopulation A and trait B, look at the incidence of B in A, and draw conclusions about the link without ever looking at B in the population that isn’t A. For instance, if you say a stock went down 5 points, that looks bad; if the whole market went down 10, that stock looks good; if companies similar to that stock went up 25, the stock looks bad. Context is key.

People who work with data habitually should be pretty good at avoiding this error; teams of people, including editors, should be very good at avoiding this issue. And yet we get stuff like this Wall Street Journal article.

If it’s paywalled for you, it shows that the incidence of babies named Shea has bounced around over time in the tristate area (NY, NJ, CT), and the author Andrew Beaton attributes that to Mets fandom. The teaser in the tweet suggests as much: “When the Mets are good, more NYC-area babies are named Shea.” (To be fair, if you read that sentence strictly it only suggests correlation, but I think it’s reasonable to say that it’s a correlation that’s only interesting if there’s at least a vague causal link.)

The meat of this article is this chart:

Before returning to my main issue with the article, I want to point out three issues with this chart:

  1. Doesn’t adjust for population growth. The New York City MSA grew by about 16% from 1990 to 2010, so that’s worth taking into account.
  2. Doesn’t account for randomness. If you picked a fake time trend and generated data from it, would it look any different? Probably not, I suspect.
  3. Doesn’t have a consistent way of picking notable years. The highlighted years were pretty clearly chosen post-hoc, as there’s some inconsistency as to whether the spike comes the same year as the good performance (1986, sort of 2000) or after (2007, sort of 2000), and the division championship in 1988 is actually a trough. (Plus the 1969 World Series and 1973 pennant don’t seem to have an impact.)
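To make the first two points concrete, here is a minimal Python sketch (not the code behind this post; every number in it is made up for illustration). It converts raw counts to a rate per 100,000 births, then simulates Poisson noise around a smooth trend to see how many "spike" years show up purely by chance. The trend, the spike threshold, and the function names are all assumptions.

```python
import math
import random

def rate_per_100k(count, births):
    """Convert a raw name count into a rate per 100,000 births."""
    return 100_000 * count / births

def poisson(lam, rng):
    """Draw one Poisson variate (Knuth's method; fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def simulate_counts(trend, rng):
    """Generate one fake series of yearly counts around a smooth trend."""
    return [poisson(lam, rng) for lam in trend]

def count_spikes(series, trend, factor=1.5):
    """Count years where the observed count exceeds factor * trend."""
    return sum(1 for obs, lam in zip(series, trend) if obs > factor * lam)

# Hypothetical smooth trend: expected Sheas per year rising from 10 to 30.
trend = [10 + 20 * t / 29 for t in range(30)]
rng = random.Random(0)  # fixed seed so the sketch is reproducible
spikes = [count_spikes(simulate_counts(trend, rng), trend) for _ in range(1000)]
print("average number of 'spike' years per fake series:", sum(spikes) / len(spikes))
```

If the fake series regularly produce a few spike years, then a handful of highlighted years in the real chart is weak evidence on its own.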

You should look for all of those, in particular the second and third, when people make charts like that, but again, I’m here to talk about the one that came to me first, which is that they draw this conclusion about the tristate area without looking at any national data.

Here’s the plot of Shea over time, nationally:


This doesn’t directly disprove that there’s a Mets effect, since the trends aren’t the same, but the uptick in tristate Sheas in the mid-1980s is the same as that huge jump in the national trend, and the positive trend afterwards is also seen in both datasets. So, without pretty strong corroboration, it seems wrong to assert that this is a tristate trend, and not a national trend.

Does that mean it’s not Mets-related? Not necessarily, since there are Mets fans all over the country. But I would suspect that a healthy majority of them are located in those three states, and removing them doesn’t visually change the trend at all, so again, it’s pretty aggressive to attribute the relationship to the Mets.
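The comparison I'm describing (the rate inside the region against the rate everywhere else) can be sketched in a few lines of Python. This is an illustration with invented counts, not the SSA data and not my actual code; the function names and the dictionary layout are assumptions.

```python
def rate(count, births):
    """Babies with the name per 100,000 births."""
    return 100_000 * count / births

def compare_regions(region, nation):
    """Return {year: (rate inside region, rate in the rest of the country)}.

    Both arguments map year -> (name_count, total_births); the national
    figures are assumed to include the region.
    """
    out = {}
    for year, (n_count, n_births) in nation.items():
        r_count, r_births = region[year]
        inside = rate(r_count, r_births)
        outside = rate(n_count - r_count, n_births - r_births)
        out[year] = (inside, outside)
    return out

# Made-up numbers, purely for illustration.
tristate = {1985: (20, 400_000), 1986: (40, 400_000)}
national = {1985: (220, 3_600_000), 1986: (440, 3_600_000)}
for year, (inside, outside) in compare_regions(tristate, national).items():
    print(year, inside, outside)
```

In this invented example the rate doubles both inside and outside the region, which is the pattern that should make you hesitate before calling it a regional effect.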


Finally, instead of using the other 47 states as a control, we can use a different one: the also somewhat common name “Shay.”


So, while the Shays ebb and flow in a way not too different from the Sheas for most of the past 30 years, there’s a persistent change in the Shea/Shay ratio right around the time the Mets got good for the first time that appears to be converging back after the close of Shea Stadium. Maybe that’s something, maybe it’s not; I didn’t adjust for gender and I don’t know what other demographic factors (e.g. ancestry) could affect this balance. My takeaway from the Shay analysis is that it provides minimal evidence for Mets fans driving Shea and some broad evidence for Shea Stadium driving the Shea/Shay ratio.
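For anyone who wants to repeat the Shea/Shay comparison, the ratio itself is simple to compute. A minimal Python sketch with invented counts (the function name and data layout are assumptions, not my actual code):

```python
def name_ratio(counts_a, counts_b):
    """Year-by-year ratio of two names' counts, skipping years where b is 0."""
    return {
        year: counts_a[year] / counts_b[year]
        for year in sorted(counts_a)
        if counts_b.get(year, 0) > 0
    }

# Made-up counts, purely for illustration.
shea = {1980: 30, 1990: 90, 2010: 60}
shay = {1980: 30, 1990: 45, 2010: 50}
print(name_ratio(shea, shay))
```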

What I’m interested in isn’t really whether this piece’s conclusion about Shea is accurate; more broadly, my point is that this shows the real challenges associated with publishing good empirical work in the rhythm of a daily paper or blog. For a piece like this to be high enough quality to run, the researcher has to have both the ability and the resources to take an extra couple hours (at least) thinking about and testing alternate hypotheses and doing sensitivities, and after doing that will quite likely end up with a muddier conclusion (less interesting to most readers) or a null result like this (really uninteresting to most readers). The most likely good outcome is that you get a bunch of stuff that doesn’t change your conclusion that you either cram in a footnote (hurting your style but keeping geeks like me off your back) or omit altogether (easier, but very bad professional practice).

(It’s a separate issue, but they also didn’t release data and code for this, which is a big pet peeve of mine. It’s probably too much to ask people to add the underlying analysis for every tiny post (especially this one, which was probably just Excel) to a repository of some sort, but even a link to the raw data would be nice.)

I think there’s a place for people who can use SQL and R in the newsroom; I even do some stuff like this myself. (Just tonight I did some Retrosheet queries to answer a question on Twitter, and pieces like this one about John Danks are pretty similar in concept to the Shea piece.) I really do question, though, whether trying to keep pace with the more traditional side of the newsroom is good for readers, writers, or outlets; given the guaranteed drop in both quality and relevance of the analysis, it’s hard for me to believe it’s anything but bad.

I used Social Security Administration data, found here, for all of my analysis. I haven’t had a chance to clean my code up to get it on GitHub yet, but it’ll make it there soon I expect.

Some Grumbling That Is and Is Not About Stolen Base Metrics

I recently wrote an article about Todd Frazier’s stolen bases for BP Southside, the Baseball Prospectus White Sox site, and in doing so did a decent amount of digging into the different advanced measures of base-stealing productivity—something that would take into account all the necessary components and spit out a measure of runs saved or lost. I got frustrated by a few things, and so decided to type this up, as I think it encapsulates a lot of issues in public sports analysis. All of this is written from a baseball perspective, but it applies at least as much to hockey and probably even more to basketball.

Before I start I want to say that the various sports stats sites strike me in many ways as emblematic of the promise of the internet. Vast amounts of cross-indexed information that can be used with minimal technical abilities, and synthesizing months and years of work done often by volunteers, shared with anyone who wants it. So I certainly don’t want any of what follows to suggest I don’t appreciate the work that has been done or don’t like using the sites I’m discussing.

For most advanced baseball stats, you get them from one of three different places: Baseball Prospectus, Baseball-Reference, and FanGraphs. (Disclosure: I write for sites operated by each of Baseball Prospectus and FanGraphs.) For something like base-stealing where there are two different versions (B-R doesn’t seem to have a standalone stat for this), you have to make a judgment call about which ones to use. I use a few primary criteria for this:

  1. How closely tailored is the metric to the specific question I want answered?
  2. How comprehensive is the measure? In other words, does it take into account everything I think it should in this situation?
  3. How transparent and understandable is the measure? For example, could I decompose it to understand the impact of an individual play / game on this measurement? Alternatively, could I break it down to understand the impact of a single decision that was made in the metric’s construction?
  4. How accurate is the measure? What assumptions does it make, how reasonable are those assumptions in practice, etc.? (You can contrast this with #2 by saying that #2 is how good the theory is and #4 is how good the implementation is.)

Obviously these criteria are interconnected—a more comprehensive measurement is less likely to be transparent but may be more accurate than something that works with broader strokes—but they’re what I think about when I look at these things.

So how do BP’s SBR and FG’s wSB do when evaluated with these criteria? (Links are to the respective glossaries, which you should probably read before continuing.)

  1. Both of these metrics are trying to compute how many runs Todd Frazier has created from his decisions to steal bases, so both are in pretty good shape on this front.
  2. These measures, from what I can tell, are about equally comprehensive. SBR takes run expectancy—for instance, treating steals of second differently from steals of third—into account, and wSB doesn’t. On the other hand, wSB debits runners for each time they don’t take off, which is a subtle but important decision that corrects puzzling SBR results like Paul Konerko being an “average” base-stealer because he never tried to steal bases. Neither metric considers secondary (tertiary?) factors like the impact of stolen base attempts on pitcher and batter behavior, defensive positioning, etc.
  3. wSB is quite transparent in its computations. There’s a simple formula, and its motivations are pretty well laid-out. If you wanted to compute wSB from projections, or over a portion of the season, it’d take you basically no time in a spreadsheet. For SBR, by contrast, there aren’t any details for computing things—it’s a two sentence description with no way for me to understand the smaller decisions that go into it, or recreate it under different circumstances.
  4. It’s pretty hard to assess how good the decisions that go into SBR are, because there’s no transparency. (That said, there are some seeming contradictions in the numbers—for instance, as of this writing Jimmy Rollins has 4 SB opportunities on the leaderboard, despite having 5 SB and 2 CS, so something seems wrong there.) For wSB, there are a couple of puzzling decisions, and a couple seem just wrong:
    1. Why is the run value of a stolen base equal to 0.2 runs forever? This ignores temporal variation: advancing a base is more useful if there are fewer homers hit, for instance, and that varies over time. (It also ignores the differences between stealing second, third, and home, but we covered that in point 2.)
    2. Where does the 0.075 term come from?
    3. Why compute opportunities only for first base, and not second and third? Why include times that there was a runner at second as opportunities, but not times the player reached on an error or a fielder’s choice? None of these will have a huge impact in aggregate, but they’d make the numbers more correct.
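To show how little machinery wSB needs, here is a Python sketch of the formula as I read it in the FanGraphs glossary: a fixed 0.2 runs per steal, a caught-stealing cost of 2 × runs-per-out + 0.075, and a league-average debit per time reaching first (1B + BB + HBP − IBB). Treat the exact form as my reading rather than FanGraphs' code; the dictionary keys and all the numbers below are invented for illustration.

```python
RUN_SB = 0.2  # fixed run value of a stolen base, as discussed above

def run_cs(runs_per_out):
    """Run value of a caught stealing (negative)."""
    return -(2 * runs_per_out + 0.075)

def opportunities(line):
    """Times the runner reached first: 1B + BB + HBP - IBB."""
    return line["singles"] + line["bb"] + line["hbp"] - line["ibb"]

def wsb(player, league, runs_per_out):
    """Stolen base runs relative to a league-average runner.

    player and league are dicts with keys sb, cs, singles, bb, hbp, ibb.
    """
    cs_value = run_cs(runs_per_out)
    league_rate = (league["sb"] * RUN_SB + league["cs"] * cs_value) / opportunities(league)
    raw = player["sb"] * RUN_SB + player["cs"] * cs_value
    return raw - league_rate * opportunities(player)

# Made-up league totals and a runner who never attempts a steal.
league = {"sb": 2500, "cs": 900, "singles": 27000, "bb": 14000, "hbp": 1700, "ibb": 900}
non_runner = {"sb": 0, "cs": 0, "singles": 100, "bb": 50, "hbp": 5, "ibb": 5}
print(round(wsb(non_runner, league, runs_per_out=0.17), 3))
```

The non-runner comes out negative, which is the "debit for not taking off" behavior mentioned earlier.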

So neither of these metrics grades out very highly. I find it perplexing and frustrating that, if I’d like to analyze one of the simpler parts of baseball, our most statistically advanced sport, I’m stuck relying on two metrics with what appear to be clear flaws.

Besides my minor gripes with these two stats, there are two generalizations I want to make from this. One is that, in the era of databases and servers, we should be wary of people who allow for biases, especially in their “advanced” stats, for the sake of simplicity. wSB’s being something you could derive from the Lahman database (or the Macmillan Baseball Encyclopedia) was useful in the 1990s, but it’s silly now. Simplified wOBA or OPS are useful if I want to save 10 minutes coding something for a blog post or want to do something computationally intensive, and we should preserve those and similar metrics for such cases, but they’re not acceptable for bottom line metrics that thousands of fans look at every day.

We have the play-by-play data and the computing power to measure some things more exactly, and we should do it. For park adjustment, we don’t need to assume a player had half his games at home and half at neutral road parks, because we know how many batters a pitcher faced in each park. For league adjustment, we can handle the nuances of interleague play and the DH without just throwing our hands up. (This is why, despite some concerns, I like BP’s Deserved Run Average on the whole; it seems much more flexible than a lot of other baseball metrics.)
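As a small illustration of the park-adjustment point, here is a Python sketch that weights park factors by the batters a pitcher actually faced in each park, next to the "half home, half neutral" shortcut. The park names, factors, and batter counts are invented, and real park factors are more involved than a single number per park.

```python
def weighted_park_factor(batters_faced_by_park, park_factors):
    """Average park factor, weighted by batters faced in each park."""
    total = sum(batters_faced_by_park.values())
    weighted = sum(bf * park_factors[park] for park, bf in batters_faced_by_park.items())
    return weighted / total

# Made-up example: 1.00 is a neutral park.
park_factors = {"home": 1.10, "road_a": 0.95, "road_b": 1.00}
batters_faced = {"home": 400, "road_a": 100, "road_b": 300}

exact = weighted_park_factor(batters_faced, park_factors)
shortcut = (park_factors["home"] + 1.00) / 2  # half home, half neutral
print(exact, shortcut)
```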

The other generalization is that if you obscure how a metric is computed you severely damage its credibility, especially when there is an easily accessible alternative. If you provide the code, or failing that a formula, or failing that a detailed explanation, I can understand what a number means, why it’s different from what I expected, why it’s different from a similar number at a different site. When it’s just two sentences and I see something strange, what the hell am I supposed to do with that? And then if it’s wrong nobody fixes it, and if it’s right it doesn’t get used.

So in the spirit of all this, some requests for the big sites (FG, B-R, and BP), in roughly ascending order of how much work they are:

  1. Provide a good way for people to ask questions about your numbers. Mention it specifically in the contact page; make it an explicit employee job responsibility; put up a feedback form. It shouldn’t be contingent on my guessing which writer/editor/developer I should tweet at, or hoping that an email to contact@website.com is going to go through. (I don’t mean to denigrate the efforts of the people who do get and respond to these queries, which I’ve seen at each major site; I just know that I not infrequently decide it’s too much work, and this is a barrier that should surely be reduced or eliminated.)
  2. Write and publish full explanations of your metrics. Describe where each term in a formula comes from. Link to a study someone did that justifies why you chose that number for replacement level. Explain what it doesn’t include and why. Work through examples. Keep the links and explanations up to date. Solicit feedback.
  3. Move beyond formulas and publish code. Publishing code makes it easier for people to:
    • Identify errors in your implementation.
    • Identify implicit assumptions that may need to be challenged.
    • Repurpose and build off the work (and in doing so, spread the word and make the metrics more prominent).
    • Learn what they need to learn to contribute to the community.
  4. Take a hard look at all your metrics (especially the ones that are considered to be best-in-class) and ask: could this be better? Is it built off box-score stats where play-by-play would be better? Does it omit something we know how to measure? Does it build in some dumb historical quirk that nobody really likes (like treating errors differently)? If you think the answer’s yes, then fix it.

All of these are especially true for anything built off play-by-play data, since those are (as far as I know) available to everyone for a minimal investment of time and effort. For the sites I’m talking about, their strengths are largely in the infrastructure to publish a variety of data and tie it together in interesting ways; they aren’t (or shouldn’t be) in IP that they’re keeping intentionally obscure. So tell me what you’re doing and I’ll trust you more. A thoughtful license should mitigate most of the concerns about people doing things they shouldn’t with the fruits of your labor. (For private data sources or extremely complex models, I understand that they can’t be open-sourced in the same way (though I disagree with a lot of the reasoning involved), but if anything that amplifies the need for thoughtful, thorough, clear explanations of what’s under the hood.)

People sometimes discuss how baseball has been “solved,” or that there aren’t big new advances to be made. They might be right; they probably aren’t. But if it has been solved, we shouldn’t have to keep the solutions behind lock and key. And if it hasn’t, then let’s get things out in the open, rather than letting errors languish and credibility erode.

Trying to Beat PECOTA

I don’t have much experience doing strictly predictive modelling. There are a few reasons for this: it hasn’t been part of the job where I’ve worked, and building a prediction/projection system has never seemed worth it as a side sports project. But when Baseball Prospectus opened up a “Beat PECOTA” contest, I figured it was something I could do in a quick-and-dirty fashion. It’d be fun and I’d get a partial answer to something I’m curious about: in a baseball context, how much can fancy machine learning algorithms substitute for some of the very subtle domain knowledge that most projection systems rely on? So I took most of the afternoon and a lot of the evening the day before Opening Day and gave it a shot.

It wound up being not as quick and a fair bit dirtier than I was hoping for, but I got some submissions in (not quite how I’d’ve liked, as I’ll explain below), and I’m using this post to explain what I wound up doing, the issues I noticed as I did it, and some of the things that the model output suggests.

This post is also an experiment in writing and publishing something using RMarkdown (so you can see all the code and corresponding ugly-ass output with less writing involved). If you’re not interested in the R code, I apologize for the clutter.

Some Background

PECOTA predicts major league performance for a large number of players (hitters and pitchers), and the contest is pretty straightforward: for as many players as you want, pick over or under; if the player deviates from the projection by a certain amount (0.007 of True Average or 0.3 runs of Deserved Run Average) and exceeds a playing time threshold (80 PA or 20 IP), then you get 10 points for the right direction and lose 11.5 points for the wrong direction.
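The scoring rules are easy to write down as code. This is a Python sketch of the rules as described above, not BP's implementation; in particular, whether a deviation of exactly the margin counts as a push is my assumption, and the function name is made up.

```python
def score_pick(pick, projected, actual, playing_time, margin, min_playing_time):
    """Score one over/under pick.

    pick is "over" or "under". Returns 10.0 for a correct pick, -11.5 for a
    wrong one, and 0.0 for a push or a player who didn't play enough.
    """
    if playing_time <= min_playing_time:
        return 0.0  # did not exceed the playing time threshold
    diff = actual - projected
    if abs(diff) < margin:
        return 0.0  # push (assumption: exactly hitting the margin counts as decided)
    result = "over" if diff > 0 else "under"
    return 10.0 if pick == result else -11.5

# Share of decided (non-push) picks you need to get right to break even.
BREAK_EVEN = 11.5 / (10 + 11.5)

# Hitter example: projected .260 tAV, actual .280 in 500 PA.
print(score_pick("over", 0.260, 0.280, 500, margin=0.007, min_playing_time=80))
print(round(BREAK_EVEN, 3))
```

The asymmetric payoff means you need to be right on roughly 53.5% of decided picks just to break even.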

Since BP archives PECOTA projections (sort of: PECOTA outputs change somewhat frequently as the inputs change, but you can grab a snapshot from around the same time each year), we can go back and score all of their projections in years past based on the current scoring rules. In theory, then, you can look for patterns in past over/unders and use that to predict how likely a player is to have a given result.

The Model

(Feel free to skip to the next section if you don’t care about the stats involved.) I decided to use a random forest model. You should read up on this class of models if you’re not familiar with it, but the basic theory is to grow a large number of individual decision trees on random selections of data using random choices of predictors. Because of the large number of trees, it is capable of avoiding much of the overfitting that a traditional regression model is subject to, which is key in this context because of the comparatively limited number of data points (roughly 300 each of hitters and pitchers per year).

By using a tree model, the random forest also incorporates non-linearities (for instance, a variable whose predictive power changes depending on the value of another variable) naturally without having them be pre-specified. Since I expected that few, if any, variables would have simple linear relationships with the outcome of interest, this is a huge plus for the random forest.

As with any class of models, RFs have their drawbacks: they require parameter tuning (e.g., figuring out how many trees to grow) and they don’t provide clear outputs (like tests of statistical significance or an R² figure), but for someone who’s just trying to throw something together (as I was) they make a lot of sense.
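If the idea of "many trees on random data with random predictors" is unfamiliar, here is a toy version in plain Python. It is deliberately simplified and is not what I ran (I used R): each "tree" is a single split (a stump), whereas a real random forest grows deep trees and resamples the candidate predictors at every split. All names and data here are invented.

```python
import random
from collections import Counter

def fit_stump(rows, labels, features):
    """Find the single best split (feature, threshold) among the given features."""
    best = None
    for f in features:
        for threshold in sorted({row[f] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[f] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            left_vote = Counter(left).most_common(1)[0]    # (label, count)
            right_vote = Counter(right).most_common(1)[0]
            correct = left_vote[1] + right_vote[1]
            if best is None or correct > best[0]:
                best = (correct, f, threshold, left_vote[0], right_vote[0])
    if best is None:  # no valid split: always predict the majority label
        majority = Counter(labels).most_common(1)[0][0]
        return (None, None, majority, majority)
    return best[1:]

def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    if feature is None:
        return left_label
    return left_label if row[feature] <= threshold else right_label

def fit_forest(rows, labels, n_trees=200, n_features=2, seed=0):
    """Fit stumps on bootstrap samples, each seeing a random subset of features."""
    rng = random.Random(seed)
    all_features = list(range(len(rows[0])))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]   # bootstrap sample
        feats = rng.sample(all_features, n_features)     # random predictors
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest

def predict_forest(forest, row):
    """Majority vote across all stumps."""
    return Counter(predict_stump(s, row) for s in forest).most_common(1)[0][0]

# Toy data: feature 0 drives the label; features 1 and 2 are noise.
rows = [(i, i % 2, i % 3) for i in range(20)]
labels = ["Over" if row[0] >= 10 else "Under" for row in rows]
forest = fit_forest(rows, labels)
print(predict_forest(forest, (15, 0, 1)), predict_forest(forest, (3, 1, 2)))
```

Any one stump can be fooled by its bootstrap sample or by only seeing noise features; the vote across many of them is what makes the ensemble stable.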

The Data

I trained the model on a dataset consisting of 2014 and 2015 PECOTA projections, along with some historical performance data (for instance, how they did relative to their projection the previous year). I decided not to train the model on seasons from before 2014 for two main reasons. The first was the amount of time it would take to pull and validate the data and to match it up with the external data sources I was using. The second was relevance: PECOTA changes substantively year-to-year, and so its blind spots in years past likely differ from any current gaps the model will find.

For pitchers, I also merged in information from Steamer and ZiPS, two other projection systems available from FanGraphs; in theory, discrepancies between PECOTA and other systems indicate a greater likelihood of PECOTA missing. I do think their inclusion did help, and would like to use them in any further analysis of this.

I originally intended to have my pitchers submission include ZiPS and Steamer, but due to a coding error I only found later (quick and very dirty!) that didn’t end up happening. Since cleaning the datasets to merge them took a fair amount of time for a modest increase in predictive power, I skipped doing that for the hitters.

One other huge pitching flaw: the PECOTA contest is only being judged on DRA. DRA wasn’t released until last year, so there are no projections for it in old data. There were projections for Fair Run Average (DRA’s sort-of predecessor), but FRA results were deprecated and aren’t available anymore. In the interest of speed, I just ran things with plain ERA. Plain ERA and DRA are very different stats, but I actually wouldn’t be surprised if this didn’t make a huge difference (i.e. that beating a DRA projection typically overlaps with beating an ERA projection).

The variables I wound up including in the two projections are:

  • Handedness
  • Height
  • Weight
  • League (hitters only, by accident; yet again, I did this very quickly)
  • Age
  • BP Breakout, Improve, Collapse, Attrition scores
  • Rookie indicator

For hitters:

  • Position
  • HR
  • BB
  • SO
  • AVG
  • OBP
  • SLG
  • tAV
  • PA
  • Prior year’s: projected tAV, tAV, PA, projection result

For pitchers:

  • BB9
  • SO9
  • GB%
  • ERA
  • IP
  • Prior year’s: projected ERA, ERA, projection result (lagged IP left out due to oversight)

I judged the different model specifications on how they did on a 30% validation set, using as my metric the actual BP scoring rules. I ultimately settled on judging them on all predictions, as restricting to high-certainty ones (or even positive expected value ones) seemed to decrease the sample size without necessarily improving the results; having chosen the model, I then fit it to the entire dataset, and predicted on the 2016 data.
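Two pieces of that process are easy to sketch in Python: holding out 30% of the data, and turning a model's class probabilities into an expected contest score for each possible pick. This is an illustration under the scoring rules above, not my R code; the function names and the probabilities in the example are invented.

```python
import random

def holdout_split(rows, validation_share=0.3, seed=0):
    """Shuffle and split rows into (training, validation)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * validation_share)
    return shuffled[cut:], shuffled[:cut]

def expected_value(p_right, p_wrong, win=10.0, loss=11.5):
    """Expected points for a pick; pushes (the remaining probability) score 0."""
    return win * p_right - loss * p_wrong

def best_pick(probs):
    """Return (pick, expected value) given probabilities for Over, Push, Under."""
    ev_over = expected_value(probs["Over"], probs["Under"])
    ev_under = expected_value(probs["Under"], probs["Over"])
    return ("over", ev_over) if ev_over >= ev_under else ("under", ev_under)

train, validation = holdout_split(list(range(10)))
print(len(train), len(validation))
print(best_pick({"Over": 0.30, "Push": 0.20, "Under": 0.50}))
```

Restricting to positive expected value picks would mean only submitting players where the second element is above zero.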

Some Results

First: My full prediction set is in my GitHub; the ones I submitted to BP were a haphazard subset, due to their cutoff at 99 predictions per category and the issues I had submitting them in the first place (and they weren’t changed as I revised the code, so some screw-ups might persist there). If you’re curious about what this wonky black box spits out for your favorite player, go there.

Moving on to potentially more generalizable patterns, I’ve plotted the feature importance lists for the two models below. (Feature importance is determined by comparing actual prediction results to prediction results when that variable is randomly permuted. See this description.)
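The permutation idea is simple enough to sketch directly. The Python below is a generic illustration, not what the R package computes internally (for one thing, it scores on the data you pass in rather than on out-of-bag samples); the toy model and data are invented.

```python
import random

def accuracy(predict, rows, labels):
    """Share of rows where the model's prediction matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, feature, n_repeats=20, seed=0):
    """Average drop in accuracy when one feature's column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for _ in range(n_repeats):
        column = [r[feature] for r in rows]
        rng.shuffle(column)
        permuted = [r[:feature] + (v,) + r[feature + 1:] for r, v in zip(rows, column)]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return sum(drops) / n_repeats

# Toy model that only looks at feature 0; feature 1 is noise.
def toy_model(row):
    return "Over" if row[0] > 5 else "Under"

rows = [(i, i % 2) for i in range(12)]
labels = [toy_model(r) for r in rows]
print(permutation_importance(toy_model, rows, labels, feature=0))
print(permutation_importance(toy_model, rows, labels, feature=1))
```

Shuffling a variable the model ignores costs nothing, while shuffling one it relies on hurts accuracy; the size of that drop is the importance score.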

wkdir <- "C:/Users/Frank/Documents/GitHub/BeatPecota/"
hittermodel <- readRDS(paste0(wkdir,"predictions/HitterModel.rds"))
pitchermodel <- readRDS(paste0(wkdir,"predictions/PitcherModel.rds"))


[Two plots: feature importance for the hitter and pitcher models]

The main thing that jumps out to me is that the PECOTA projected playing time is considered important for both models; the lagged projection is also important, as are pretty standard performance projection baselines (ERA and SLG, for instance).

Since these aren’t regression models, we can’t get coefficients that neatly tell us how these variables line up; a simple way to get a hint of this is to plot conditional distributions, i.e., how these variables are distributed for the different classes (Push, Over, and Under) in the training data.

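library(readr)    # read_csv
library(dplyr)    # %>%, arrange
library(ggplot2)  # ggplot, geom_density
library(caret)    # varImp (assumes the models were fit through caret)

pitcherinput <- read_csv(paste0(wkdir,"data/Pitcher Training Data.csv"))
hitterinput <- read_csv(paste0(wkdir,"data/Hitter Training Data.csv"))
pitimp = varImp(pitchermodel)$importance %>% as.matrix()
data_frame(var = row.names(pitimp),imp = pitimp[,1]) %>% arrange(-imp) %>% head(5) -> imppitch

# Pitching Plots

for (i in imppitch$var) {
  plot <- ggplot(pitcherinput,aes_string(x=i,group="ProjResult",color="ProjResult")) + geom_density()
  print(plot)  # plots built inside a for loop have to be printed explicitly
}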

[Five density plots: the most important pitcher variables, split by projection result]

hitimp = varImp(hittermodel)$importance %>% as.matrix()
data_frame(var = row.names(hitimp),imp = hitimp[,1]) %>% arrange(-imp) %>% head(5) -> imphit

# Hitting Plots

for (i in imphit$var) {
  plot <- ggplot(hitterinput,aes_string(x=i,group="ProjResult",color="ProjResult")) + geom_density()
  print(plot)  # plots built inside a for loop have to be printed explicitly
}

[Five density plots: the most important hitter variables, split by projection result]

Having just dropped a bunch of plots, I want to make a few things clear. One is that these show the reverse of the conditional probability you might be expecting, i.e., they show that players who beat their projections are more likely to have had a high PA (or whatever) projection, not necessarily that high PA projections mean a player is likely to beat their tAV projection. Another is that a large part of the appeal of a random forest is to capture interactions and non-linearities, so just looking at density plots is not going to tell the final story. The third is that, while these predictors seem to perform well against the null, I haven’t done any rigorous testing to see how different the distributions plotted are.
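The first caveat is easier to see with numbers. In the Python sketch below (invented counts, pushes left out for simplicity), most players who went over had a high playing-time projection, yet a player with a high projection is still more likely to go under than over.

```python
# Made-up joint counts of (projected playing time, result).
counts = {
    ("high", "Over"): 60, ("high", "Under"): 90,
    ("low", "Over"): 20, ("low", "Under"): 80,
}

def prob(counts, pa=None, result=None):
    """Share of all players matching the given playing time and/or result."""
    total = sum(counts.values())
    match = sum(
        n for (p, r), n in counts.items()
        if (pa is None or p == pa) and (result is None or r == result)
    )
    return match / total

# What the density plots show: P(high PA projection | went over).
p_high_given_over = prob(counts, "high", "Over") / prob(counts, result="Over")
# What you'd want for picking: P(went over | high PA projection).
p_over_given_high = prob(counts, "high", "Over") / prob(counts, pa="high")
print(round(p_high_given_over, 2), round(p_over_given_high, 2))
```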

Seeing that projected playing time seems to be positively correlated with performance relative to the projection is one of the two things I would say constitute “insight” out of this whole project. (Without knowing more about the spreadsheets I can’t be sure, but a possible explanation that playing time incorporates subjective depth-chart information, which in turn relates to subjective talent estimates, makes some intuitive sense.)

The other “insight” (or at least, something I didn’t know coming in) is summarized in the chunk below. The tables are just the raw results for all players; the plots show the distributions of the PECOTA error, with black lines for the average error and red lines delineating the push zone.

# Hitter Results

##  Over  Push Under 
##   206   138   266
hitterelig <- hitterinput %>% filter(PA_ACTUAL > 80)
## Warning in data.matrix(data): NAs introduced by coercion

## Error in filter(., PA_ACTUAL > 80): object 'PA_ACTUAL' not found
##  Over  Push Under 
##   206   108   266
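ggplot(hitterelig,aes(x=TAv_ACTUAL-TAv)) + geom_density() +
  geom_vline(xintercept=mean(hitterelig$TAv_ACTUAL-hitterelig$TAv,na.rm=T),color="black") +
  geom_vline(xintercept=0.007,color="red") +
  geom_vline(xintercept=-0.007,color="red")  # assumed: the original was cut off after the trailing "+"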

[Plot: distribution of PECOTA’s tAV error for eligible hitters]

# Pitcher Results

##  Over  Push Under 
##   216   185   250
pitcherelig <- pitcherinput %>% filter(IP_ACTUAL > 20)
## Warning in data.matrix(data): NAs introduced by coercion

## Error in filter(., IP_ACTUAL > 20): object 'IP_ACTUAL' not found
##  Over  Push Under 
##   216   130   250
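ggplot(pitcherelig,aes(x=ERA_ACTUAL-ERA)) + geom_density() + geom_vline(xintercept=mean(pitcherelig$ERA_ACTUAL-pitcherelig$ERA,na.rm=T),color="black") +
  geom_vline(xintercept=0.3,color="red") + geom_vline(xintercept=-0.3,color="red")  # second red line assumed; the original was cut off after "+"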

[Plot: distribution of PECOTA’s ERA error for eligible pitchers]

That is to say, PECOTA’s misses tend to be high (very much so for hitters, a bit for pitchers), meaning that going under on every player would have done very well for hitters and a bit better than breaking even for pitchers:

sum(c(-11.5,10) * table(pitcherelig$ProjResult)[c(1,3)])
## [1] 16
sum(c(-11.5,10) * table(hitterelig$ProjResult)[c(1,3)])
## [1] 291

The result for pitchers is borderline (p ≈ 0.05), while the hitter result is clearly significant (p ≈ 0.0004):

# Pitcher Statistical Significance

##  2-sample test for equality of proportions with continuity correction
## data:  table(pitcherelig$ProjResult)[c(1, 3)] out of rep(nrow(pitcherelig), 2)
## X-squared = 3.8369, df = 1, p-value = 0.05014
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -1.140320e-01 -6.192066e-05
## sample estimates:
##    prop 1    prop 2 
## 0.3624161 0.4194631
# Hitter Statistical Significance

##  2-sample test for equality of proportions with continuity correction
## data:  table(hitterelig$ProjResult)[c(1, 3)] out of rep(nrow(hitterelig), 2)
## X-squared = 12.435, df = 1, p-value = 0.0004215
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.16139821 -0.04549834
## sample estimates:
##    prop 1    prop 2 
## 0.3551724 0.4586207
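For anyone who wants to check these figures without R, the test statistic is short enough to compute by hand. The Python sketch below uses the usual textbook form of the two-proportion chi-squared test with a continuity correction; it is not R's source code, but it reproduces the X-squared and p-values above from the Over and Under counts in the eligible-player tables (596 pitchers, 580 hitters).

```python
import math

def two_proportion_test(x1, x2, n1, n2):
    """Chi-squared test for equal proportions with a continuity correction.

    Returns (statistic, p_value) on 1 degree of freedom.
    """
    pooled = (x1 + x2) / (n1 + n2)
    diff = abs(x1 / n1 - x2 / n2)
    correction = 0.5 * (1 / n1 + 1 / n2)
    adjusted = max(diff - correction, 0.0)
    variance = pooled * (1 - pooled) * (1 / n1 + 1 / n2)
    statistic = adjusted ** 2 / variance
    p_value = math.erfc(math.sqrt(statistic / 2))  # chi-squared tail, 1 df
    return statistic, p_value

print(two_proportion_test(216, 250, 596, 596))  # pitchers: Over vs. Under
print(two_proportion_test(206, 266, 580, 580))  # hitters: Over vs. Under
```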

(Because I didn’t notice this pattern until I wrote all of this up, I didn’t use “always go under” as a baseline for model computations. Having added it back in, the models do outperform it, so that’s nice.)

Without further research, I’m wary of concluding anything about this. It could be that PECOTA has a persistent flaw; it could be that PECOTA has had issues pricing in decreasing offensive production; it could be that this is a selection effect that an all-purpose projection system can’t really cover up; it could be that PECOTA just had two rough years from this standpoint. I’m looking forward to seeing what happens with these predictions this year, and also to doing some more digging to see if these patterns hold for other projection systems.


A few takeaways from this whole exercise (nothing particularly novel):

  • The last two years, there’ve been apparently exploitable patterns in PECOTA’s projections of regular players, most concretely that batters are noticeably more likely to underperform than overperform. Those performance patterns also correlate with other existing variables.
  • Even with data issues, code sloppiness, moderate lack of know-how, and a big time crunch, it’s possible to put together an ML model that seems to perform fairly well in this prediction space.
  • There are much worse ways to spend the day before Opening Day than crunching baseball numbers.

I’m planning on following up on that first point with a more general and methodical study in the near future, as I think it’s potentially an area where some real understanding can be gained.

I linked my predictions earlier; all code and data from this are on GitHub, except for the PECOTA spreadsheets, which aren’t mine to distribute.

Sharing is Caring

For an article I’m working on (plan to see it at Hardball Times some time TBD), I had cause to analyze the data from the Fans Scouting Report run by Tom Tango. While FanGraphs hosts the data from 2009 onwards, I couldn’t find a clean version of the 2003–2008 dataset, so I pulled the data off Tango’s site and then did some annoying but largely insubstantial cleaning so they can be combined with the data available on FG.

For anyone that wants to read the code or use the data, I’ve posted the code and datasets on my GitHub. If you click around, you may notice that that’s all I have up there, but my intention is to post code and data for articles from now on, and ideally go back and fill in some of my old posts as well.

Finally, if you haven’t read my most recent piece, it went up about 5 weeks ago at THT; it’s a slightly out-there proposal to rearrange baseball’s schedule and alignment to improve the quality of the regular season. Read it here.

Some Housekeeping Links

Links to my two most recent pieces, both over at the Hardball Times:

From last week, a look at MLB trading patterns and which teams trade with each other the most.

And from March, a look at how the strike zone changes depending on day/night games and dome/outdoor games.

Obviously, I’ve been writing less this year, but I have several things in the pipeline, though it’s yet to be determined whether they get posted here, THT, or somewhere else.

Some Fun With Jeopardy! Data

Post summary: Jeopardy!’s wagering and question weights make scores of all types (especially final scores) less repeatable and less predictable than raw numbers of questions answered. The upshot of this is that they make the results more random and less skill-based. This also makes it harder to use performance in one game to predict a second game’s result, but I present models to show that the prediction can be done somewhat successfully (though without a lot of granularity) using both decision tree and regression analysis. 

I watched Jeopardy! a lot when I was younger, since I’ve long been a bit of a trivia fiend. When I got to college, though, I started playing quizbowl, which turned me off from Jeopardy! almost entirely. One reason is that I couldn’t play along as well, as the cognitive skills involved in the two games sometimes conflict with each other. More importantly, though, quizbowl made it clear to me the structural issues with Jeopardy!. Even disregarding that the questions rarely distinguish between two players with respect to knowledge, the arbitrary overweighting of questions via variable question values (a $2,000 question is certainly not 10 times as hard or…well, anything as a $200 question) and random Daily Doubles makes it hard to be certain that the better player actually won on a given day.

With that said, I still watch Jeopardy! on occasion, and I figured it’d be a fun topic to explore at the end of my impromptu writing hiatus. To that end, I scraped some performance data (show level, not question level) from the very impressive J-Archive, giving me data for most matches from the last 18 seasons of the show.

After scraping the data, I stripped out tournament and other special matches (e.g. Kids Week shows), giving me a bit over 3,000 shows worth of data to explore, including everything up until late December 2014. I decided to look at which variables are the best indicator of player quality in a predictive manner. Put another way, if you want to estimate the probability that the returning champion wins, which stat should you look at? Here are the ones I could pull out of the data and thought would be worth including:

  • Final score (FS).
  • Score after Double Jeopardy! but before Final Jeopardy! (DS).
  • Total number of questions answered correctly and incorrectly (including Daily Doubles but excluding Final Jeopardy!); the number used is usually right answers (RQ) minus wrong answers, or question differential (QD).
  • Coryat score (CS), which is the score excluding Final Jeopardy! and assuming that all wagers on Daily Doubles were the nominal value of the clue (e.g. a Daily Double in a $800 square is counted as $800, regardless of how much is wagered).
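To make these definitions concrete, here is a minimal Python sketch that computes RQ, QD, DS, and CS from one player's responses. It is not the scraper or analysis code behind this post: the `Response` record and `summarize` function are hypothetical names, the four responses are made up, and charging a missed Daily Double at its nominal value in the Coryat score is an assumption (other conventions may treat that case differently).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Response:
    clue_value: int               # nominal value of the square on the board
    correct: bool                 # whether the response was right
    wager: Optional[int] = None   # amount risked; set only for Daily Doubles


def summarize(responses: List[Response]) -> dict:
    """Compute the pre-Final Jeopardy! metrics for one player in one game."""
    rq = sum(r.correct for r in responses)   # right answers
    wrong = len(responses) - rq              # wrong answers
    ds = 0   # score after Double Jeopardy!, using actual wagers
    cs = 0   # Coryat score, using nominal clue values only
    for r in responses:
        sign = 1 if r.correct else -1
        stake = r.wager if r.wager is not None else r.clue_value
        ds += sign * stake
        cs += sign * r.clue_value   # assumption: misses cost the nominal value
    return {"RQ": rq, "QD": rq - wrong, "DS": ds, "CS": cs}


# Made-up game: three right answers (one a Daily Double) and one miss.
game = [
    Response(200, True),
    Response(400, False),
    Response(800, True, wager=2000),   # Daily Double in an $800 square
    Response(1000, True),
]
stats = summarize(game)
print(stats)
```

In this toy game the Daily Double wager is the whole gap between DS and CS, which is exactly the kind of variation the Coryat score is meant to strip out.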

My initial inclination, as you can probably guess from the discussion above, is that purely looking at question difference should yield the best results. Obviously, though, we have to confirm this hypothesis, and to answer this, I decided to look at every player that played (at least) two matches, then see how well we could predict the result of the second match based on the player’s peripherals from the first match. While only having one data point per player isn’t ideal, this preserves a reasonable sample size and lets us avoid deciding how to aggregate stats across multiple matches. Overall, there are approximately 1,700 such players in my dataset, though the analysis I present below is based on a random sample of 80% of those players (the remainder is to be used for out-of-sample testing).

How well do these variables predict each other? This plot shows the correlation coefficients between our set of metrics in the first game and the second game. (Variables marked with 2 are the values from the second game played.)


In general, correct answers (RQ) do predict the second-game results better than the other candidates, but they aren’t too far ahead of the other metrics that aren’t final score. (The margin of error on these correlations is roughly ± 0.05.) Still, this provides some firm evidence that final score isn’t a very good predictor, and that RQ might be the best of the ones here.
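The ± 0.05 margin is roughly what a standard Fisher z interval gives for a sample of this size. A minimal Python sketch, assuming about 1,360 players (80% of the roughly 1,700) and a hypothetical correlation of 0.3, since the plotted values aren't reproduced in the text; `pearson` and `fisher_interval` are illustrative names, not the code used to make the plot.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fisher_interval(r, n, z_crit=1.96):
    """Approximate 95% interval for a correlation r estimated from n pairs."""
    z = math.atanh(r)                  # Fisher z-transform of r
    half = z_crit / math.sqrt(n - 3)   # half-width on the z scale
    return math.tanh(z - half), math.tanh(z + half)


# Assumed inputs: r = 0.3 is hypothetical, n = 1360 is 80% of ~1,700 players.
low, high = fisher_interval(0.3, 1360)
print(round(low, 3), round(high, 3))
```

The half-width barely moves for other moderate values of r, so one margin for the whole plot is a reasonable shorthand.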

We can also use the data to estimate how much noise each of the different elements adds. If you assume that the player with the highest QD played the best match, then we can compare how often the highest QD player finishes with the best performance according to the other metrics. I find that the highest QD player finishes with the highest Coryat (i.e. including question values but no wagering) 83% of the time, the highest pre-Final Jeopardy! total (i.e. including question values and Daily Doubles) 80% of the time, and the highest overall score (i.e. she wins) 70% of the time. This means that including Final Jeopardy! raises the chance that the best performer doesn’t play another day by 10 percentage points (from 20% to 30%), and weighting the questions at all accounts for 17 percentage points on its own; both of those figures seem very high to me, and speak to how much randomness influences what we watch.
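The comparison above amounts to counting how often two metrics agree on who had the best game. A minimal Python sketch with two made-up shows; the field names, the `top_player` and `agreement_rate` helpers, and breaking ties in favor of the first listed player are all assumptions for illustration, not the post's actual code.

```python
def top_player(show, metric):
    """Index of the player with the highest value of `metric` in one show.

    Ties go to the first listed player (an arbitrary choice for this sketch).
    """
    return max(range(len(show)), key=lambda i: show[i][metric])


def agreement_rate(shows, metric, baseline="QD"):
    """Share of shows where the top player by `baseline` also tops `metric`."""
    matches = sum(top_player(s, baseline) == top_player(s, metric) for s in shows)
    return matches / len(shows)


# Two made-up shows with three players each.
shows = [
    [{"QD": 20, "CS": 15000, "FS": 20000},
     {"QD": 15, "CS": 12000, "FS": 25000},   # wins despite a lower QD
     {"QD": 10, "CS": 8000, "FS": 5000}],
    [{"QD": 12, "CS": 9000, "FS": 10000},
     {"QD": 18, "CS": 14000, "FS": 22000},
     {"QD": 9, "CS": 7000, "FS": 0}],
]
coryat_rate = agreement_rate(shows, "CS")
final_rate = agreement_rate(shows, "FS")
print(coryat_rate, final_rate)
```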

What about predicting future wins, though? What’s the best way to do that? To start with, I ran a series of logistic regressions (common models used to estimate probabilities for a binary variable like wins/losses) to see how well these variables predict loss. The following plots show the curve that estimates the probability of a player winning their second match, with bars showing the actual frequency of wins and losses. For the sake of brevity, let’s just show one score plot and one questions plot:

[Figure: logit fit on final score] [Figure: logit fit on questions answered correctly]

Unsurprisingly, the variable that was (in simplistic terms) the best predictor of peripherals also appears to be a much better predictor of future performance. We can see this not only by eyeballing the fit, but also simply by looking at the curves. The upward trend is much more pronounced in the latter graph, which here means that there’s enough variation to meaningfully peg some players as having 25% chances of winning and some as having 65% chances of winning. (For reference, 42% of players win their second games.) Final score, by contrast, puts almost all of the values in between 35% and 55%, suggesting that the difference between a player that got 20 questions right and one that got 25 right (43% and 55%) is comparable to the difference between a player that finished with $21,000 and one that finished with $40,000. Because $40,000 is in the top 1% of final scores in our dataset, while 25 questions right is only in the 85th percentile, this makes clear that score simply isn’t as predictive as questions answered. (For the technically inclined, the AICs of the models using raw questions are noticeably lower than those of the models using various types of score. I left out the curves, but RQ is a little better than QD, noticeably better than DS and CS, and much better than FS.)
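For readers who haven't fit one of these, here is a minimal Python sketch of a one-variable logistic regression and its AIC, fit by Newton-Raphson on a dozen made-up (questions right, won next game) pairs. The data, the function names, and the fitting routine are all assumptions for illustration; they are not the models or numbers reported above.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def fit_logit(xs, ys, iterations=25):
    """Fit P(y=1) = sigmoid(a + b*x) by Newton-Raphson; returns (a, b)."""
    a = b = 0.0
    for _ in range(iterations):
        ps = [sigmoid(a + b * x) for x in xs]
        # Gradient of the log-likelihood with respect to a and b.
        g_a = sum(y - p for y, p in zip(ys, ps))
        g_b = sum((y - p) * x for x, y, p in zip(xs, ys, ps))
        # Entries of the negated Hessian, weighted by p * (1 - p).
        ws = [p * (1 - p) for p in ps]
        h_aa = sum(ws)
        h_ab = sum(w * x for w, x in zip(ws, xs))
        h_bb = sum(w * x * x for w, x in zip(ws, xs))
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: solve the 2x2 linear system by hand.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b


def aic(xs, ys, a, b, k):
    """AIC = 2k - 2 * log-likelihood, where k counts fitted parameters."""
    ll = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(a + b * x)
        ll += math.log(p) if y else math.log(1 - p)
    return 2 * k - 2 * ll


# Made-up data: questions right in game one, win (1) or loss (0) in game two.
xs = [10, 12, 14, 15, 16, 18, 20, 21, 22, 24, 25, 27]
ys = [0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1]
a, b = fit_logit(xs, ys)
aic_fit = aic(xs, ys, a, b, k=2)
aic_null = aic(xs, ys, 0.0, 0.0, k=1)   # intercept only; half the toy players win
print(aic_fit, aic_null)
```

A lower AIC means a better trade-off between fit and number of parameters, which is the sense in which the question-based models beat the score-based ones.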

One of many issues with these regression models is that they impose a continuous structure on the data (i.e. they don’t allow for big jumps at a particular number of questions answered) and they omit interactions between variables (for instance, finishing with $25,000 might mean something very different depending on whether the player got 15 or 25 questions correct). To try to get around these issues, I also created a decision tree, which (in very crude terms) uses the data to build a flow-chart that predicts results.
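The core step in building such a tree is picking the single cut on a variable that best separates wins from losses. Below is a minimal Python sketch of that step using Gini impurity as the splitting criterion, which is an assumption here; the data are made up and `gini` and `best_split` are illustrative names. A full tree repeats this search inside each resulting group and across every candidate variable.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 outcomes."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(xs, ys):
    """Return (cut, weighted impurity) for the best single cut on x."""
    pairs = list(zip(xs, ys))
    values = sorted(set(xs))
    best = None
    # Candidate cuts sit halfway between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        cut = (lo + hi) / 2
        left = [y for x, y in pairs if x < cut]
        right = [y for x, y in pairs if x >= cut]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if best is None or score < best[1]:
            best = (cut, score)
    return best


# Made-up data: questions right in game one, win (1) or loss (0) in game two.
questions = [10, 12, 14, 16, 18, 20, 22, 24]
won_next = [0, 0, 0, 0, 1, 0, 1, 1]
cut, impurity = best_split(questions, won_next)
print(cut, impurity)
```

Because the cut can land anywhere, a tree can represent a sharp jump at a particular number of questions, which a single logistic curve cannot.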

Unlike the regressions above, my software (the rpart package in R) automatically throws out variables that don’t substantively improve the model when it fits a tree, which means it will tell us what the best predictors are. Here’s a picture of the decision tree that gets spit out. Note that all numbers in that plot are from the dataset used to build the model, not including the testing dataset; moreover, FALSE refers to losses and TRUE to wins.

Decision Tree

This tells us that there’s some information in final score for distinguishing between contestants that answered a lot of questions correctly, but (as we suspected) questions answered is a lot more important. In fact, the model’s assessment of variable importance puts final score in dead last, behind correct answers, question difference, Coryat, and post-Double Jeopardy! score.

One last thing: now that we have our two models, which performs better? To test this, I did a couple of simple analyses using the portion of the data I had removed before building the models (roughly 330 games). For the logit model, I split the new data into 10 buckets based on the predicted probability that they would win their next game, then compared the actual winning percentage to the expected winning percentage for each bucket. I also calculated the confidence interval for the predictions because, with 33 games per bucket, there’s a lot of room for random variation to push the data around. (A technical note: the intervals come from simulations, because the probabilities within each bucket aren’t homogeneous.) A graphical representation of this is below, followed by the underlying numbers:

Logit Forest Plot

Logit Model Predictions by Decile

Decile   Expected Win %   Lower Bound   Upper Bound   Actual Win %
1        25               12            38            41
2        31               15            47            35
3        34               18            50            29
4        37               21            55            33
5        39               24            56            26
6        41               24            59            41
7        44               27            61            36
8        47               32            65            35
9        52               35            68            59
10       60               42            76            61

My takeaway from this (a very small sample!) is that the model does pretty well at very broadly assessing which players are likely to win and which aren’t, but the additional precision isn’t necessarily adding much, given how much overlap there is between the different deciles.
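A calibration check like the one in the decile table can be sketched as follows. This is a minimal Python version under stated assumptions: the 330 predicted probabilities and outcomes are randomly generated rather than the real test set, the function name `bucket_calibration` is hypothetical, and the intervals come from redrawing each game from its own predicted probability, which is one plausible reading of "simulations" rather than a confirmed reproduction of the method.

```python
import random


def bucket_calibration(probs, wins, n_buckets=10, n_sims=2000, seed=0):
    """Rows of (bucket, expected, lower, upper, actual) win rates.

    Assumes there are at least n_buckets observations.
    """
    rng = random.Random(seed)
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    size = len(order) / n_buckets
    rows = []
    for b in range(n_buckets):
        idx = order[round(b * size):round((b + 1) * size)]
        ps = [probs[i] for i in idx]
        expected = sum(ps) / len(ps)
        actual = sum(wins[i] for i in idx) / len(idx)
        # Each simulated game uses its own probability, since the
        # probabilities inside a bucket are not all the same.
        sims = sorted(
            sum(rng.random() < p for p in ps) / len(ps) for _ in range(n_sims)
        )
        lower = sims[int(0.025 * n_sims)]
        upper = sims[int(0.975 * n_sims) - 1]
        rows.append((b + 1, expected, lower, upper, actual))
    return rows


# Randomly generated stand-in for the held-out games (not the real data).
demo_rng = random.Random(42)
demo_probs = [demo_rng.uniform(0.2, 0.65) for _ in range(330)]
demo_wins = [demo_rng.random() < p for p in demo_probs]
rows = bucket_calibration(demo_probs, demo_wins)
for bucket, expected, lower, upper, actual in rows:
    print(bucket, round(expected, 2), round(lower, 2), round(upper, 2), round(actual, 2))
```

With only 33 games per bucket the simulated intervals come out wide, which is why neighboring deciles overlap so heavily.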

How about the tree model? The predictions from that one are pretty easy to summarize, because there are only three separate predictions:

Tree Model Predictions

Predicted Win %   Number of Wins   Number of Matches   Actual Winning %
36                84               247                 34
45                24               42                  57
63                26               48                  54

The lower-probability (larger) bucket looks very good; the smaller buckets don’t look quite as nice (though they are within the 95% confidence intervals for the predictions). There is a bit of gain, though: if you remove the second decision from the tree, the one that uses final score to split the latter two buckets, the predicted winning percentage is 56%, which is (up to rounding) exactly what we get in the out-of-sample testing if we combine those two buckets. With such a small out-of-sample test we shouldn’t ignore the discrepancies in those predictions, but it does suggest that the model is picking up something pretty valuable.
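The arithmetic behind those two claims can be checked directly from the tree table, reading each row as wins out of matches (84 of 247 is 34%). A minimal Python sketch, assuming an exact binomial range around each rounded predicted percentage; `binomial_interval` is an illustrative helper, not necessarily how the intervals mentioned above were computed.

```python
import math


def binomial_interval(n, p, level=0.95):
    """Central range of win counts for n games at win probability p."""
    tail = (1 - level) / 2
    pmf = [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    cum, lower = 0.0, 0
    for k in range(n + 1):          # walk up from 0 until the lower tail is spent
        cum += pmf[k]
        if cum > tail:
            lower = k
            break
    cum, upper = 0.0, n
    for k in range(n, -1, -1):      # walk down from n for the upper tail
        cum += pmf[k]
        if cum > tail:
            upper = k
            break
    return lower, upper


# (predicted win probability, wins, matches) from the tree table.
buckets = [(0.36, 84, 247), (0.45, 24, 42), (0.63, 26, 48)]
inside = []
for p, won, played in buckets:
    low, high = binomial_interval(played, p)
    inside.append(low <= won <= high)
    print(p, won, played, (low, high))

# Merging the two smaller buckets undoes the final-score split.
combined = (24 + 26) / (42 + 48)
print(round(100 * combined))
```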

Because of that and its greater simplicity, I’m inclined to pick the tree as the winner, but I don’t think it’s obviously superior, and you’re free to disagree. To further aid this evaluation, I intend to follow this post up in a few months to see how the models have performed with a few more months of predictions. In the meantime, keep an eye out for a small app that allows for some simple Jeopardy! metrics, including predicted next-game winning percentages and a means of comparing two performances.