regressions | That's a Clown Hypothesis, Bro!

Post summary: Jeopardy!’s wagering and question weights make scores of all types (especially final scores) less repeatable and less predictable than raw numbers of questions answered. The upshot of this is that they make the results more random and less skill-based. This also makes it harder to use performance in one game to predict a second game’s result, but I present models to show that the prediction can be done somewhat successfully (though without a lot of granularity) using both decision tree and regression analysis.

I watched Jeopardy! a lot when I was younger, since I’ve long been a bit of a trivia fiend. When I got to college, though, I started playing quizbowl, which turned me off from Jeopardy! almost entirely. One reason is that I couldn’t play along as well, as the cognitive skills involved in the two games sometimes conflict with each other. More importantly, though, quizbowl made the structural issues with Jeopardy! clear to me. Even disregarding that the questions rarely distinguish between two players with respect to knowledge, the arbitrary overweighting of questions via variable question values (a $2,000 question is certainly not 10 times as hard or…well, 10 times as anything as a $200 question) and random Daily Doubles makes it hard to be certain that the better player actually won on a given day.

With that said, I still watch Jeopardy! on occasion, and I figured it’d be a fun topic to explore at the end of my impromptu writing hiatus. To that end, I scraped some performance data (show level, not question level) from the very impressive J-Archive, giving me data for most matches from the last 18 seasons of the show.

After scraping the data, I stripped out tournament and other special matches (e.g. Kids Week shows), giving me a bit over 3,000 shows’ worth of data to explore, including everything up until late December 2014. I decided to look at which variables are the best indicators of player quality in a predictive sense. Put another way, if you want to estimate the probability that the returning champion wins, which stat should you look at? Here are the ones I could pull out of the data and thought would be worth including:

Final score (FS).
Score after Double Jeopardy! but before Final Jeopardy! (DS).
Total number of questions answered correctly and incorrectly (including Daily Doubles but excluding Final Jeopardy!); the number used is usually right answers (RQ) minus wrong answers, or question differential (QD).
Coryat score (CS), which is the score excluding Final Jeopardy! and assuming that all wagers on Daily Doubles were the nominal value of the clue (e.g. a Daily Double in a $800 square is counted as $800, regardless of how much is wagered).
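To make the definitions above concrete, here is a minimal Python sketch that computes RQ, QD, DS, and CS from clue-level responses. It is not the code behind this post (the scraped data was show level, not question level); the `Response` class, the `summarize` function, and the `penalize_missed_dd` flag are all hypothetical names. The flag exists because the description above doesn’t say how a missed Daily Double should count toward Coryat, so treating it as a loss of the nominal value is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Response:
    clue_value: int             # nominal dollar value of the square
    correct: bool
    daily_double: bool = False
    wager: int = 0              # only meaningful when daily_double is True


def summarize(responses, penalize_missed_dd=True):
    """Return RQ, QD, DS and CS for one player's non-Final responses."""
    rq = sum(1 for r in responses if r.correct)
    wrong = len(responses) - rq
    ds = 0  # score after Double Jeopardy!, using actual wagers
    cs = 0  # Coryat score, using nominal clue values only
    for r in responses:
        sign = 1 if r.correct else -1
        ds += sign * (r.wager if r.daily_double else r.clue_value)
        # Assumption: a missed Daily Double costs its nominal value
        # unless penalize_missed_dd is switched off.
        if r.correct or not r.daily_double or penalize_missed_dd:
            cs += sign * r.clue_value
    return {"RQ": rq, "QD": rq - wrong, "DS": ds, "CS": cs}


game = [
    Response(200, True),
    Response(800, True, daily_double=True, wager=2000),
    Response(400, False),
]
stats = summarize(game)
print(stats)  # {'RQ': 2, 'QD': 1, 'DS': 1800, 'CS': 600}
```

The $2,000 wager on the $800 square is what separates DS from CS here; RQ and QD ignore dollar values entirely.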

My initial inclination, as you can probably guess from the discussion above, is that purely looking at question difference should yield the best results. Obviously, though, we have to confirm this hypothesis, and to answer this, I decided to look at every player that played (at least) two matches, then see how well we could predict the result of the second match based on the player’s peripherals from the first match. While only having one data point per player isn’t ideal, this preserves a reasonable sample size and lets us avoid deciding how to aggregate stats across multiple matches. Overall, there are approximately 1,700 such players in my dataset, though the analysis I present below is based on a random sample of 80% of those players (the remainder is to be used for out-of-sample testing).
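The 80/20 split described above could look something like the following Python sketch. The function name, the seed, and the use of integers as stand-in player IDs are all assumptions for illustration; the original split was presumably done in R.

```python
import random


def split_players(player_ids, train_fraction=0.8, seed=2014):
    """Shuffle player IDs and split them into (training, held-out) lists."""
    ids = list(player_ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split reproducible
    cut = round(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


# Stand-in IDs for the roughly 1,700 players with at least two matches.
train, test = split_players(range(1700))
print(len(train), len(test))  # 1360 340
```

Splitting by player before fitting anything is what makes the later out-of-sample comparison meaningful.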

How well do these variables predict each other? This plot shows the correlation coefficients between our set of metrics in the first game and the second game. (Variables marked with 2 are the values from the second game played.)

In general, the number of correct answers does predict the second-game results better than the other candidates, but it’s not too far from the other metrics that aren’t final score. (The margin of error on these correlations is roughly ± 0.05.) Still, this provides some firm evidence that final score isn’t a very good predictor, and that RQ might be the best of the ones here.

We can also use the data to estimate how much noise each of the different elements adds. If you assume that the player with the highest QD played the best match, then we can compare how often the highest-QD player also finishes with the best performance according to the other metrics. I find that the highest-QD player finishes with the highest Coryat (i.e. including question values but no wagering) 83% of the time, the highest pre-Final Jeopardy! total (i.e. including question values and Daily Doubles) 80% of the time, and the highest overall score (i.e. she wins) 70% of the time. This means that including Final Jeopardy! raises the chance that the best performer doesn’t play another day by 10 percentage points (from 20% to 30%), and weighting the questions at all leaves the best performer behind on Coryat 17% of the time; both of those figures seem very high to me, and speak to how much randomness influences what we watch.
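The comparison in the paragraph above boils down to asking, game by game, whether the leader on one metric is also the leader on another. A hedged Python sketch follows; the function name and the tiny two-game dataset are invented for illustration, and ties are broken by player order, which the post does not specify.

```python
def top_agreement(games, reference="QD", other="FS"):
    """Share of games in which the leader on `reference` also leads on `other`.

    `games` is a list of games; each game is a list of per-player dicts.
    Ties go to the first player listed (an arbitrary assumption).
    """
    agree = 0
    for players in games:
        best_ref = max(players, key=lambda p: p[reference])
        best_other = max(players, key=lambda p: p[other])
        agree += best_ref is best_other
    return agree / len(games)


# Two made-up games, three players each.
toy_games = [
    [{"QD": 20, "FS": 15000}, {"QD": 14, "FS": 21000}, {"QD": 9, "FS": 4000}],
    [{"QD": 18, "FS": 25000}, {"QD": 12, "FS": 8000}, {"QD": 11, "FS": 0}],
]
print(top_agreement(toy_games))  # 0.5
```

Swapping `other` for a Coryat or pre-Final score field would give the other two percentages quoted above.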

What about predicting future wins, though? What’s the best way to do that? To start with, I ran a series of logistic regressions (common models used to estimate probabilities for a binary variable like wins/losses) to see how well these variables predict loss. The following plots show the curve that estimates the probability of a player winning their second match, with bars showing the actual frequency of wins and losses. For the sake of brevity, let’s just show one score plot and one questions plot:

[Figure: Logits Final Score]

Unsurprisingly, the variable that was (in simplistic terms) the best predictor of peripherals also appears to be a much better predictor of future performance. We can see this not only by eyeballing the fit, but also simply by looking at the curves. The upward trend is much more pronounced in the questions graph, which here means that there’s enough variation to meaningfully peg some players as having 25% chances of winning and some as having 65% chances of winning. (For reference, 42% of players win their second games.) Final score, by contrast, puts almost all of the values in between 35% and 55%, suggesting that the difference between a player that got 20 questions right and one that got 25 right (43% and 55%) is comparable to the difference between a player that finished with $21,000 and one that finished with $40,000. Because $40,000 is in the top 1% of final scores in our dataset, while 25 questions right is only in the 85th percentile, this makes clear that score simply isn’t as predictive as questions answered. (For the technically inclined, the AIC of the models using raw questions is noticeably lower than that of the models using various types of score. I left out the curves, but RQ is a little better than QD, noticeably better than DS and CS, and much better than FS.)
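For readers who want to see the shape of such a curve, here is a back-of-the-envelope Python sketch. It does not reproduce the fitted model; instead it backs out the one-variable logistic curve implied by the two rounded points quoted above (20 right answers at 43%, 25 at 55%), assuming those come from the RQ-only regression. All function names are hypothetical.

```python
import math


def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1 - p))


def fit_through(x1, p1, x2, p2):
    """Intercept and slope of the logistic curve through two (x, p) points."""
    slope = (logit(p2) - logit(p1)) / (x2 - x1)
    intercept = logit(p1) - slope * x1
    return intercept, slope


def win_probability(x, intercept, slope):
    """Logistic curve: P(win) = 1 / (1 + exp(-(a + b * x)))."""
    return 1 / (1 + math.exp(-(intercept + slope * x)))


intercept, slope = fit_through(20, 0.43, 25, 0.55)
for rq in (10, 15, 20, 25, 30):
    print(rq, round(win_probability(rq, intercept, slope), 2))
```

Because the inputs are rounded percentages, the implied slope (a bit under 0.1 log-odds per correct answer) is only a rough guide to the real coefficient.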

One of many issues with these regression models is that they impose a continuous structure on the data (i.e. they don’t allow for big jumps at a particular number of questions answered) and they omit interactions between variables (for instance, finishing with $25,000 might mean something very different depending on whether the player got 15 or 25 questions correct). To try to get around these issues, I also created a decision tree, which (in very crude terms) uses the data to build a flow-chart that predicts results.
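To give a feel for how a tree finds a “big jump,” here is a minimal Python sketch of a single split search using Gini impurity, a common criterion for classification trees (and, as far as I know, rpart’s default for them). The data are a made-up toy sample, not Jeopardy! results, and a real tree-fitting package does a good deal more (recursive splitting, pruning, surrogate splits).

```python
def gini(labels):
    """Gini impurity of a list of 0/1 outcomes."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(xs, ys):
    """Threshold t minimizing weighted impurity of the groups x < t and x >= t."""
    best = None
    n = len(xs)
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or score < best[1]:
            best = (t, score)
    return best


# Toy data: questions answered correctly and whether the next game was won.
questions = [10, 12, 15, 18, 22, 25, 27, 30]
won_next = [0, 0, 1, 0, 0, 1, 1, 1]
threshold, impurity = best_split(questions, won_next)
print(threshold)  # 25
```

Applying the same search again inside each resulting group, possibly on a different variable, is how interactions such as “final score matters only among players with many correct answers” can show up.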

Because of the way my software (the rpart package in R) fits the trees, it automatically throws out variables that don’t substantively improve the model. Unlike the regressions above, that means the tree itself will tell us what the best predictors are. Here’s a picture of the decision tree that gets spit out. Note that all numbers in that plot are from the dataset used to build the model, not including the testing dataset; moreover, FALSE refers to losses and TRUE to wins.

[Figure: Decision Tree]

This tells us that there’s some information in final score for distinguishing between contestants that answered a lot of questions correctly, but (as we suspected) questions answered is a lot more important. In fact, the model’s assessment of variable importance puts final score in dead last, behind correct answers, question difference, Coryat, and post-Double Jeopardy! score.

One last thing: now that we have our two models, which performs better? To test this, I did a couple of simple analyses using the portion of the data I had removed before building the models (roughly 330 games). For the logit model, I split the new data into 10 buckets based on the predicted probability that they would win their next game, then compared the actual winning percentage to the expected winning percentage for each bucket. I also calculated the confidence interval for the predictions because, with 33 games per bucket, there’s a lot of room for random variation to push the data around. (A technical note: the intervals come from simulations, because the probabilities within each bucket aren’t homogeneous.) A graphical representation of this is below, followed by the underlying numbers:

Logit Model Predictions by Decile
Decile	Expected Win %	Lower Bound	Upper Bound	Actual Win %
1	25	12	38	41
2	31	15	47	35
3	34	18	50	29
4	37	21	55	33
5	39	24	56	26
6	41	24	59	41
7	44	27	61	36
8	47	32	65	35
9	52	35	68	59
10	60	42	76	61
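As a quick check on the table, the Python sketch below flags the deciles whose actual win rate falls outside the listed bounds, and then shows one way the simulated intervals described above might be produced. The `simulated_interval` function is an assumption about the general approach, not the original code, and the flat 25% bucket at the end is a hypothetical example.

```python
import random

# (decile, expected, lower, upper, actual), copied from the table above.
DECILES = [
    (1, 25, 12, 38, 41), (2, 31, 15, 47, 35), (3, 34, 18, 50, 29),
    (4, 37, 21, 55, 33), (5, 39, 24, 56, 26), (6, 41, 24, 59, 41),
    (7, 44, 27, 61, 36), (8, 47, 32, 65, 35), (9, 52, 35, 68, 59),
    (10, 60, 42, 76, 61),
]

outside = [d for d, _, lo, hi, actual in DECILES if not lo <= actual <= hi]
print(outside)  # [1]


def simulated_interval(probs, n_sims=5000, level=0.95, seed=0):
    """Simulate win rates for a bucket whose games have unequal win chances."""
    rng = random.Random(seed)
    rates = sorted(
        sum(rng.random() < p for p in probs) / len(probs)
        for _ in range(n_sims)
    )
    low = rates[int((1 - level) / 2 * n_sims)]
    high = rates[int((1 + level) / 2 * n_sims) - 1]
    return low, high


# Hypothetical bucket: 33 games that each have a 25% predicted win chance.
low, high = simulated_interval([0.25] * 33)
print(round(low, 2), round(high, 2))
```

Only the first decile’s actual rate (41%) lands outside its bounds (12 to 38); the other nine sit inside theirs.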

My takeaway from this (a very small sample!) is that the model does pretty well at very broadly assessing which players are likely to win and which aren’t, but the additional precision isn’t necessarily adding much, given how much overlap there is between the different deciles.

How about the tree model? The predictions from that one are pretty easy to summarize, because there are only three separate predictions:

Tree Model Predictions
Predicted Win %	Number of Wins	Number of Matches	Actual Winning %
36	84	247	34
45	24	42	57
63	26	48	54

The lower probability (larger) bucket looks very good; the smaller buckets don’t look quite as nice (though they are within the 95% confidence intervals for the predictions). There is a bit of gain, though: if you remove the second decision from the tree (the one that uses final score to split the latter two buckets), the predicted winning percentage is 56%, which is (up to rounding) exactly what we get in the out-of-sample testing if we combine those two buckets. We shouldn’t ignore the discrepancies in those predictions with such a small out-of-sample test, but it does suggest that the model is picking up something pretty valuable.
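The arithmetic behind that paragraph can be reproduced in a few lines of Python. Each row of the tree table is read here as wins out of matches (84 of 247, 24 of 42, 26 of 48), which is the only reading under which the counts reproduce the listed percentages. The interval is a plain normal approximation around the predicted rate; that is my assumption, since the post doesn’t say how its 95% intervals were built.

```python
import math

# (predicted win rate, wins, matches) for the three tree buckets.
BUCKETS = [(0.36, 84, 247), (0.45, 24, 42), (0.63, 26, 48)]


def normal_interval(p, n, z=1.96):
    """Approximate 95% interval for an observed rate when the true rate is p."""
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half


for predicted, wins, matches in BUCKETS:
    low, high = normal_interval(predicted, matches)
    actual = wins / matches
    print(f"{predicted:.2f} -> {actual:.3f}, inside: {low <= actual <= high}")

# Merging the two smaller buckets undoes the final-score split.
combined = (24 + 26) / (42 + 48)
print(round(combined * 100))  # 56
```

All three actual rates fall inside these approximate intervals, and the merged bucket comes out at 50 wins in 90 matches.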

Because of that and its greater simplicity, I’m inclined to pick the tree as the winner, but I don’t think it’s obviously superior, and you’re free to disagree. To further aid this evaluation, I intend to follow this post up in a few months to see how the models have performed with a few more months of predictions. In the meantime, keep an eye out for a small app that allows for some simple Jeopardy! metrics, including predicted next-game winning percentages and a means of comparing two performances.

Like most White Sox fans, I was disappointed when Mark Buehrle left the team. I didn’t necessarily think they made a bad decision, but Buehrle is one of those guys that makes me really appreciate baseball on a sentimental level. He’s never seemed like a real ace, but he’s more interesting: he worked at a quicker pace than any other pitcher, was among the very best fielding pitchers, and held runners on like few others (it’s a bit out of date, but this post has him picking off two runners for each one that steals, which is astonishing).

In my experience, these traits are usually discussed as though they’re unrelated to his value as a pitcher, and the same could probably be said of the fielding skills possessed by guys like Jim Kaat and Greg Maddux. However, that’s covering up a non-negligible portion of what Buehrle has brought to his teams over the years; using a crude calculation of 10 runs per win, his 87 Defensive Runs Saved are equal to about 20% of his 41 WAR during the era for which we have DRS numbers. (Roughly half of that 20% is from fielding his position, with the other half coming from his excellent work in inhibiting base thieves. Defensive Runs Saved are a commonly used, all-encompassing defensive metric from Baseball Info Solutions. All numbers in this piece are from Fangraphs.) Buehrle’s extreme, but he’s not the only pitcher like this; Jake Westbrook had 62 DRS and only 18 WAR or so in the DRS era, which means the DRS equate to more than 30% of the WAR.
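The crude conversion above is simple enough to spell out. In this Python sketch the function name is hypothetical, and 10 runs per win is the same rough rule of thumb used in the text.

```python
RUNS_PER_WIN = 10  # crude rule of thumb used above


def drs_share_of_war(drs, war, runs_per_win=RUNS_PER_WIN):
    """Fraction of a pitcher's WAR that his Defensive Runs Saved equate to."""
    return (drs / runs_per_win) / war


buehrle = drs_share_of_war(87, 41)
westbrook = drs_share_of_war(62, 18)
print(f"Buehrle: {buehrle:.0%}, Westbrook: {westbrook:.0%}")
```

That works out to 8.7 wins against 41 WAR for Buehrle (about 21%) and 6.2 wins against roughly 18 WAR for Westbrook (about 34%).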

So fielding can make up a substantial portion of a pitcher’s value, but it seems like we rarely discuss it. That makes a fair amount of sense; single season fielding metrics are considered to be highly variable for position players who will be on the field for six times as many innings as a typical starting pitcher, and pitcher defensive metrics are less trustworthy even beyond that limitation. Still, though, I figured it’d be interesting to look at which sorts of pitchers tend to be better defensively.

For purposes of this study, I only looked at what I’ll think of as “fielding runs saved,” which is total Defensive Runs Saved less runs saved from stolen bases (rSB). (If you’re curious, there is a modest but noticeable 0.31 correlation between saving runs on stolen bases and fielding runs saved.) I also converted it into a rate stat by dividing by the number of innings pitched and then multiplying by 150 to give a full season rate. Finally, I restricted to aggregate data from the 331 pitchers who threw at least 300 innings (2 full seasons by standard reckoning) between 2007 and 2013; 2007 was chosen because it’s the beginning of the PitchF/X era, which I’ll get to in a little bit. My thought is that a sample size of 330 is pretty reasonable, and while players will have changed over the full time frame it also provides enough innings that the estimates will be a bit more stable.
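The rate stat described above can be sketched in Python as follows. The function names are hypothetical and the example pitcher’s numbers are made up purely to show the arithmetic.

```python
MIN_INNINGS = 300      # sample cutoff used above
SEASON_INNINGS = 150   # innings treated as a full season


def fielding_runs_per_150(drs, rsb, innings):
    """Fielding runs saved (DRS less stolen-base runs) per 150 innings."""
    return (drs - rsb) / innings * SEASON_INNINGS


def qualifies(innings):
    """Whether a pitcher clears the 300-inning minimum."""
    return innings >= MIN_INNINGS


# Hypothetical pitcher: 12 DRS, 4 of them from stolen bases, over 600 innings.
rate = fielding_runs_per_150(12, 4, 600)
print(round(rate, 2), qualifies(600))  # 2.0 True
```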

One aside is that DRS, as a counting stat, doesn’t adjust for how many opportunities a given fielder has, so a pitcher who induces lots of strikeouts and fly balls will necessarily have DRS values smaller in magnitude than another pitcher of the same fielding ability but different pitching style.

Below is a histogram of pitcher fielding runs/150 IP for the population in question:

If you’re curious, the extreme positive values are Greg Maddux and Jake Westbrook, and the extreme negative values are Philip Humber, Brandon League, and Daniel Cabrera.

This raises another set of questions: what sort of pitchers tend to be better fielders? To test this, I decided to use linear regression—not because I want to make particularly nuanced predictions using the estimates, but because it is a way to examine how much of a correlation remains between fielding and a given variable after controlling for other factors. Most of the rest of the post will deal with the regression methods, so feel free to skip to the bold text at the end to see what my conclusions were.

What jumped out to me initially is that Buehrle, R.A. Dickey, Westbrook, and Maddux are all extremely good fielding pitchers that aren’t hard throwers; to that end, I included their average velocity as one of the independent variables in the regression. (Hence the restriction to the PitchF/X era.) To control for the fact that harder throwers also strike out more batters and thus don’t have as many opportunities to make plays, I included the pitcher’s strikeouts per nine IP as a control as well.

It also seems plausible to me that there might be a handedness effect or a starter/reliever gap, so I added indicator variables for those to the model as well. (Given that righties and relievers throw harder than lefties and starters, controlling for velocity is key. Relievers are defined as those with at least half their innings in relief.) I also added in ground ball rate, with the thought that having more plays to make could have a substantial effect on the demonstrated fielding ability.
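To show what “controlling for” the other variables means mechanically, here is a self-contained Python sketch of ordinary least squares with the five predictors named above. Everything in it is illustrative: the data are synthetic, the relationship is planted by hand with no noise, and the helper names are hypothetical. It demonstrates the technique, not the post’s actual fit.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(rows, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


# Synthetic pitchers: velocity, K/9, GB%, lefty flag, reliever flag.
rng = random.Random(1)
rows, y = [], []
for _ in range(60):
    velo, k9, gb = rng.uniform(84, 96), rng.uniform(4, 11), rng.uniform(30, 60)
    rows.append([velo, k9, gb, rng.randint(0, 1), rng.randint(0, 1)])
    y.append(18 - 0.2 * velo + 0.06 * gb)  # planted, noise-free relationship

names = ["intercept", "velocity", "K/9", "GB%", "lefty", "reliever"]
for name, coef in zip(names, ols(rows, y)):
    print(f"{name:>9}: {coef: .4f}")
```

Each coefficient describes how the outcome moves with one variable while the others are held fixed, which is why velocity has to be in the model alongside the handedness and reliever indicators.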

There turns out to be a noticeable negative correlation between velocity and fielding ability. This doesn’t surprise me, as it’s consistent with harder throwers having a longer, more intense delivery that makes it harder for them to react quickly to a line drive or ground ball. According to the model, we’d associate each mile per hour increase with a 0.2 fielding run per season decrease; however, I’d shy away from doing anything with that estimate given how poor the model is. (The R-squared values on the models discussed here are all less than 0.2, which is not very good.) Even if we take that estimate at face value, though, it’s a pretty small effect, and one that’s hard to read much into.

We don’t see any statistically significant results for K/9, handedness, or starter/reliever status. (Remember that this doesn’t take into account runs saved through stolen base prevention; in that case, it’s likely that left handers will rate as superior and hard throwers will do better due to having a faster time to the plate, but I’ll save that for another post.) In fact, of the non-velocity factors considered, only ground ball rate has a significant connection to fielding; it’s positively related, with a rough estimate that a percentage point increase in groundball rate will have a pitcher snag 0.06 extra fielding runs per 150 innings. That is statistically significant, but it’s a very small amount in practice and I suspect it’s contaminated by the fact that an increase in ground ball rate is related to an increase in fielding opportunities.

To attempt to control for that contamination, I changed the model so that the dependent (i.e. predicted) variable was [fielding runs / (IP/150 * GB%)]. That stat is hard to interpret intuitively (if you elide the batters faced vs. IP difference, it’s fielding runs per groundball), so I’m not thrilled about using it, but for this single purpose it should be useful to help figure out if ground ball pitchers tend to be better fielders even after adjusting for additional opportunities.
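Written out as a Python sketch, the adjusted dependent variable looks like this. The function name is hypothetical, and expressing GB% as a fraction (0.5 rather than 50) is an assumption; using percentage points instead would only rescale the stat.

```python
def fielding_runs_per_gb_opportunity(fielding_runs, innings, gb_rate):
    """Fielding runs divided by (IP / 150 * GB%), per the formula above.

    gb_rate is the ground ball rate as a fraction, e.g. 0.5 for 50%.
    """
    return fielding_runs / ((innings / 150) * gb_rate)


# Hypothetical pitcher: 6 fielding runs over 300 innings with a 50% GB rate.
print(fielding_runs_per_gb_opportunity(6, 300, 0.5))  # 6.0
```

Dividing by ground ball rate is a stand-in for dividing by fielding chances, so two pitchers with the same raw total but different ground ball rates no longer look identical.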

As it turns out, the same variables are significant in the new model, meaning that even after controlling for the number of opportunities, GB pitchers and soft tossers are generally stronger fielders. The impact of one extra point of GB% is approximately equivalent to losing 0.25 mph off the average pitch speed; however, since pitch speed has a pretty small coefficient, we wouldn’t expect either of these things to have a large impact on pitcher fielding.

This was a lot of math for not a huge effect, so here’s a quick summary of what I found in case I lost you:

Harder throwers contribute less on defense even after controlling for having fewer defensive opportunities due to strikeouts. Ground ball pitchers contribute more than other pitchers even if you control for having more balls they can make plays on.
The differences here are likely to be very small and fairly noisy (especially if you remember that the DRS numbers themselves are a bit wonky), meaning that, while they apply in broad terms, there will be lots and lots of exceptions to the rule.
Handedness and role (i.e. starter/reliever) have no significant impact on fielding contribution.

All told, then, we shouldn’t be too surprised Buehrle is a great fielder, given that he doesn’t throw very hard. On the other hand, though, there are plenty of other soft tossers who are minus fielders (Freddy Garcia, for instance), so it’s not as though Buehrle was bound to be good at this. To me, that just makes him a little bit quirkier and reminds me of why I’ll have a soft spot for him above and beyond what he got just for being a great hurler for the Sox.

That's a Clown Hypothesis, Bro!

Sports analysis and commentary, mostly empirically-based.

Tag Archives: regressions

Some Fun With Jeopardy! Data

A Look at Pitcher Defense
