Category Archives: Research

A Look at Pitcher Defense

Like most White Sox fans, I was disappointed when Mark Buehrle left the team. I didn’t necessarily think they made a bad decision, but Buehrle is one of those guys who makes me really appreciate baseball on a sentimental level. He’s never seemed like a true ace, but he’s more interesting: he worked at a quicker pace than any other pitcher, was among the very best fielding pitchers, and held runners on like few others (it’s a bit out of date, but this post has him picking off two runners for every one that steals, which is astonishing).

In my experience, these traits are usually discussed as though they’re unrelated to his value as a pitcher, and the same could probably be said of the fielding skills possessed by guys like Jim Kaat and Greg Maddux. However, that covers up a non-negligible portion of what Buehrle has brought to his teams over the years; using a crude calculation of 10 runs per win, his 87 Defensive Runs Saved are equal to about 20% of his 41 WAR during the era for which we have DRS numbers. (Roughly half of that 20% is from fielding his position, with the other half coming from his excellent work in inhibiting base thieves. Defensive Runs Saved is a commonly used, all-encompassing defensive metric from Baseball Info Solutions. All numbers in this piece are from FanGraphs.) Buehrle’s extreme, but he’s not the only pitcher like this; Jake Westbrook had 62 DRS and only 18 WAR or so in the DRS era, which means his DRS equate to more than 30% of his WAR.
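The back-of-the-envelope conversion above is simple enough to sketch. This is just the crude 10-runs-per-win rule of thumb applied to the DRS and WAR figures quoted in the text; the function name is mine:

```python
# Rough conversion of Defensive Runs Saved (DRS) into a share of WAR,
# using the crude 10-runs-per-win rule of thumb from the text.
def drs_share_of_war(drs, war, runs_per_win=10):
    """Return DRS expressed as a fraction of total WAR."""
    return (drs / runs_per_win) / war

# Figures quoted above for Buehrle and Westbrook (DRS era only).
buehrle = drs_share_of_war(87, 41)    # about 0.21, i.e. roughly 20% of his WAR
westbrook = drs_share_of_war(62, 18)  # about 0.34, i.e. more than 30%
print(round(buehrle, 2), round(westbrook, 2))
```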

So fielding can make up a substantial portion of a pitcher’s value, but it seems like we rarely discuss it. That makes a fair amount of sense; single-season fielding metrics are considered highly variable even for position players, who are on the field for roughly six times as many innings as a typical starting pitcher, and pitcher defensive metrics are less trustworthy even beyond that limitation. Still, I figured it’d be interesting to look at which sorts of pitchers tend to be better defensively.

For purposes of this study, I only looked at what I’ll think of as “fielding runs saved,” which is total Defensive Runs Saved less runs saved from stolen bases (rSB). (If you’re curious, there is a modest but noticeable 0.31 correlation between saving runs on stolen bases and fielding runs saved.) I also converted it into a rate stat by dividing by the number of innings pitched and then multiplying by 150 to give a full-season rate. Finally, I restricted the sample to aggregate data from the 331 pitchers who threw at least 300 innings (two full seasons by standard reckoning) between 2007 and 2013; 2007 was chosen because it’s the beginning of the PitchF/X era, which I’ll get to in a little bit. My thought is that a sample of 331 pitchers is pretty reasonable, and while players will have changed over the full time frame, it also provides enough innings that the estimates will be a bit more stable.
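The rate conversion described above is straightforward; here’s a minimal sketch (the function and its arguments are mine, invented for illustration, as is the example pitcher):

```python
def fielding_runs_per_150(drs, rsb, innings):
    """Fielding runs saved (total DRS minus stolen-base runs saved, rSB),
    scaled to a 150-inning full-season rate."""
    return (drs - rsb) / innings * 150

# Hypothetical pitcher: 12 DRS, 4 of which came from controlling the
# running game, over 600 innings pitched.
print(fielding_runs_per_150(12, 4, 600))  # 2.0 fielding runs per 150 IP
```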

One aside is that DRS, as a counting stat, doesn’t adjust for how many opportunities a given fielder has, so a pitcher who induces lots of strikeouts and fly balls will necessarily have DRS values smaller in magnitude than another pitcher of the same fielding ability but different pitching style.

Below is a histogram of pitcher fielding runs/150 IP for the population in question:


If you’re curious, the extreme positive values are Greg Maddux and Jake Westbrook, and the extreme negative values are Philip Humber, Brandon League, and Daniel Cabrera.

This raises another set of questions: what sort of pitchers tend to be better fielders? To test this, I decided to use linear regression—not because I want to make particularly nuanced predictions using the estimates, but because it is a way to examine how much of a correlation remains between fielding and a given variable after controlling for other factors. Most of the rest of the post will deal with the regression methods, so feel free to skip to the bold text at the end to see what my conclusions were.

What jumped out to me initially is that Buehrle, R.A. Dickey, Westbrook, and Maddux are all extremely good fielding pitchers who aren’t hard throwers; to that end, I included average velocity as one of the independent variables in the regression. (Hence the restriction to the PitchF/X era.) To control for the fact that harder throwers also strike out more batters and thus don’t have as many opportunities to make plays, I included each pitcher’s strikeouts per nine innings as a control as well.

It also seems plausible to me that there might be a handedness effect or a starter/reliever gap, so I added indicator variables for those to the model as well. (Given that righties and relievers throw harder than lefties and starters, controlling for velocity is key. Relievers are defined as those with at least half their innings in relief.) I also added in ground ball rate, with the thought that having more plays to make could have a substantial effect on the demonstrated fielding ability.
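The regression setup described above can be sketched as follows. This is a toy version on made-up data (the real analysis was presumably run in a stats package; the variable names and all the synthetic numbers here are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 331  # same size as the real sample of pitchers

# Synthetic stand-ins for the independent variables described above.
velocity = rng.normal(91, 3, n)     # average pitch velocity (mph)
k_per_9 = rng.normal(7.5, 1.5, n)   # strikeouts per nine innings
lefty = rng.integers(0, 2, n)       # indicator: 1 if left-handed
reliever = rng.integers(0, 2, n)    # indicator: 1 if >= half innings in relief
gb_pct = rng.normal(44, 6, n)       # ground ball rate (%)

# Fake dependent variable: fielding runs per 150 IP, built with a small
# negative velocity effect and small positive GB% effect to mimic the
# direction of the results discussed below.
fld_per_150 = (-0.2 * (velocity - 91) + 0.06 * (gb_pct - 44)
               + rng.normal(0, 2, n))

# Design matrix with an intercept column, fit by ordinary least squares.
X = np.column_stack([np.ones(n), velocity, k_per_9, lefty, reliever, gb_pct])
coefs, *_ = np.linalg.lstsq(X, fld_per_150, rcond=None)
names = ["intercept", "velocity", "k9", "lefty", "reliever", "gb"]
print(dict(zip(names, coefs.round(3))))
```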

There turns out to be a noticeable negative correlation between velocity and fielding ability. This doesn’t surprise me, as it’s consistent with harder throwers having a longer, more intense delivery that makes it harder for them to react quickly to a line drive or ground ball. According to the model, we’d associate each mile per hour increase with a 0.2 fielding run per season decrease; however, I’d shy away from doing anything with that estimate given how poor the model is. (The R-squared values on the models discussed here are all less than 0.2, which is not very good.) Even if we take that estimate at face value, though, it’s a pretty small effect, and one that’s hard to read much into.

We don’t see any statistically significant results for K/9, handedness, or starter/reliever status. (Remember that this doesn’t take into account runs saved through stolen base prevention; in that case, it’s likely that left-handers will rate as superior and hard throwers will do better due to a quicker delivery to the plate, but I’ll save that for another post.) In fact, of the non-velocity factors considered, only ground ball rate has a significant connection to fielding; it’s positively related, with a rough estimate that a one percentage point increase in ground ball rate will net a pitcher an extra 0.06 fielding runs per 150 innings. That is statistically significant, but it’s a very small amount in practice, and I suspect it’s contaminated by the fact that an increase in ground ball rate is related to an increase in fielding opportunities.

To attempt to control for that contamination, I changed the model so that the dependent (i.e. predicted) variable was [fielding runs / (IP/150 * GB%)]. That stat is hard to interpret intuitively (if you elide the batters faced vs. IP difference, it’s fielding runs per groundball), so I’m not thrilled about using it, but for this single purpose it should be useful to help figure out if ground ball pitchers tend to be better fielders even after adjusting for additional opportunities.

As it turns out, the same variables are significant in the new model, meaning that even after controlling for the number of opportunities GB pitchers and soft tossers are generally stronger fielders. The impact of one extra point of GB% is approximately equivalent to losing 0.25 mph off the average pitch speed; however, since pitch speed has a pretty small coefficient we wouldn’t expect either of these things to have a large impact on pitcher fielding.

That was a lot of math for not much of an effect, so here’s a quick summary of what I found in case I lost you:

  • Harder throwers contribute less on defense even after controlling for having fewer defensive opportunities due to strikeouts. Ground ball pitchers contribute more than other pitchers even if you control for having more balls they can make plays on.
  • The differences here are likely to be very small and fairly noisy (especially if you remember that the DRS numbers themselves are a bit wonky), meaning that, while they apply in broad terms, there will be lots and lots of exceptions to the rule.
  • Handedness and role (i.e. starter/reliever) have no significant impact on fielding contribution.

All told, then, we shouldn’t be too surprised Buehrle is a great fielder, given that he doesn’t throw very hard. On the other hand, there are plenty of other soft tossers who are minus fielders (Freddy Garcia, for instance), so it’s not as though Buehrle was bound to be good at this. To me, that just makes him a little bit quirkier and reminds me why I’ll always have a soft spot for him, above and beyond what he earned just by being a great hurler for the Sox.


Picking a Pitch and the Pace of the Game

Here’s a short post to answer a straightforward question: do pitchers who throw a wider variety of pitches work more slowly? If it’s not clear, the idea is that a pitcher who throws several different pitches frequently will take longer between pitches because the catcher has to spend more time calling the game, perhaps with a corresponding increase in how often the pitcher shakes off the catcher.

To make a quick pass at this, I pulled FanGraphs data on how often each pitcher threw fastballs, sliders, curveballs, changeups, cutters, splitters, and knucklers, using data from 2009–13 on all pitchers with at least 200 innings. (See the data here. There are well-documented issues with the categorizations, but for a small question like this they are good enough.) The statistic used for how quickly the pitcher worked was the appropriately named Pace, which measures the number of seconds between pitches thrown.

To easily test the hypothesis, we need a single number to measure how even the pitcher’s pitch mix is, which we believe to be linked to the complexity of the decision they need to make. There are many ways to do this, but I decided to go with the Herfindahl-Hirschman Index, which is usually used to measure market concentration in economics. It’s computed by squaring the percentage share of each pitch and adding them together, so higher values mean things are more concentrated. (The theoretical max is 10,000.) As an example, Mariano Rivera threw 88.9% cutters and 11.1% fastballs over the time period we’re examining, so his HHI was 88.9² + 11.1² ≈ 8,026. David Price threw 66.7% fastballs, 5.8% sliders, 6.6% cutters, 10.6% curveballs, and 10.4% changeups, leading to an HHI of about 4,746. (See additional discussion below.) If you’re curious, the most and least concentrated repertoires split by role are in a table at the bottom of the post.
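The calculation is easy to reproduce; here’s a minimal sketch using the two pitch mixes quoted above (the pitch percentages are the ones from the text):

```python
def hhi(shares):
    """Herfindahl-Hirschman Index: sum of squared percentage shares.
    Maxes out at 10,000 when one pitch is thrown 100% of the time."""
    return sum(s ** 2 for s in shares)

# Mariano Rivera, 2009-13: 88.9% cutters, 11.1% fastballs.
print(hhi([88.9, 11.1]))                   # ~8026
# David Price: 66.7% FB, 5.8% SL, 6.6% CT, 10.6% CB, 10.4% CH.
print(hhi([66.7, 5.8, 6.6, 10.6, 10.4]))   # ~4746
```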

As an aside, I find two people on those leader/trailer lists most interesting. The first is Yu Darvish, who’s surrounded by junkballers—it’s pretty cool that he has such amazing stuff and still throws 4.5 pitches with some regularity. The second is that Bartolo Colon has, according to this metric, less variety in his pitch selection over the last five years than the two knuckleballers in the sample. He’s somehow a junkballer but with only one pitch, which is a pretty #Mets thing to be.

Back to business: after computing HHIs, I split the sample into 99 relievers and 208 starters, defined as pitchers who had at least 80% of their innings come in the respective role. I enforced the starter/reliever split because a) relievers have substantially less pitch diversity (unweighted mean HHI of 4928 vs. 4154 for starters, highly significant) and b) they pitch substantially slower, possibly due to pitching more with men on base and in higher leverage situations (unweighted mean Pace of 23.75 vs. 21.24, a 12% difference that’s also highly significant).

So, how does this HHI match up with pitching pace for these two groups? Pretty poorly. The correlation for starters is -0.11, which is the direction we’d expect but a very small correlation (and one that’s not statistically significant at p = 0.1, to the limited extent that statistical significance matters here). For relievers, it’s actually 0.11, which runs against our expectation but is also statistically and practically no different from 0. Overall, there doesn’t seem to be any real link, but if you want to gaze at the entrails, I’ve put scatterplots at the bottom as well.

One important note: a couple weeks back, Chris Teeter at Beyond the Box Score took a crack at the same question, though using a slightly different method. Unsurprisingly, he found the same thing. If I’d seen the article before I’d had this mostly typed up, I might not have gone through with it, but as it stands, it’s always nice to find corroboration for a result.


Relief Pitchers with Most Diverse Stuff, 2009–13
Name FB% SL% CT% CB% CH% SF% KN% HHI
1 Sean Marshall 25.6 18.3 17.7 38.0 0.5 0.0 0.0 2748
2 Brandon Lyon 43.8 18.3 14.8 18.7 4.4 0.0 0.0 2841
3 D.J. Carrasco 32.5 11.2 39.6 14.8 2.0 0.0 0.0 2973
4 Alfredo Aceves 46.5 0.0 17.9 19.8 13.5 2.3 0.0 3062
5 Logan Ondrusek 41.5 2.0 30.7 20.0 0.0 5.8 0.0 3102
Relief Pitchers with Least Diverse Stuff, 2009–13
Name FB% SL% CT% CB% CH% SF% KN% HHI
1 Kenley Jansen 91.4 7.8 0.0 0.2 0.6 0.0 0.0 8415
2 Mariano Rivera 11.1 0.0 88.9 0.0 0.0 0.0 0.0 8026
3 Ronald Belisario 85.4 12.7 0.0 0.0 0.0 1.9 0.0 7458
4 Matt Thornton 84.1 12.5 3.3 0.0 0.1 0.0 0.0 7240
5 Ernesto Frieri 82.9 5.6 0.0 10.4 1.1 0.0 0.0 7013
Starting Pitchers with Most Diverse Stuff, 2009–13
Name FB% SL% CT% CB% CH% SF% KN% HHI
1 Shaun Marcum 36.6 9.3 17.6 12.4 24.1 0.0 0.0 2470
2 Freddy Garcia 35.4 26.6 0.0 7.9 13.0 17.1 0.0 2485
3 Bronson Arroyo 42.6 20.6 5.1 14.2 17.6 0.0 0.0 2777
4 Yu Darvish 42.6 23.3 16.5 11.2 1.2 5.1 0.0 2783
5 Mike Leake 43.5 11.8 23.4 9.9 11.6 0.0 0.0 2812
Starting Pitchers with Least Diverse Stuff, 2009–13
Name FB% SL% CT% CB% CH% SF% KN% HHI
1 Bartolo Colon 86.2 9.1 0.2 0.0 4.6 0.0 0.0 7534
2 Tim Wakefield 10.5 0.0 0.0 3.7 0.0 0.0 85.8 7486
3 R.A. Dickey 16.8 0.0 0.0 0.2 1.5 0.0 81.5 6927
4 Justin Masterson 78.4 20.3 0.0 0.0 1.3 0.0 0.0 6560
5 Aaron Cook 79.7 9.7 2.8 7.6 0.4 0.0 0.0 6512

Boring methodological footnote: There’s one primary conceptual problem with using HHI, which is that in certain situations it gives a counterintuitive result for this application. For instance, consider a pitcher who throws a fastball 90% of the time and a change 10% of the time versus one who throws a fastball 90% of the time and a change and slider 5% each. Under our line of reasoning, the decisions they face should be nearly identical, since each throws his primary pitch 90% of the time; yet the HHI treats them differently (8,200 vs. 8,150), rating the three-pitch repertoire as more diverse. That makes sense in the context of market concentration, but not in this scenario. (The same issue holds for the Gini coefficient, for that matter.) There’s a very high correlation between HHI and the frequency of a pitcher’s most common pitch, though, and using the latter doesn’t change any of the conclusions of the post.

The Joy of the Internet, Pt. 2

I wrote one of these posts a while back about trying to figure out which game Bunk and McNulty attend in a Season 3 episode of The Wire. This time, I’m curious about a different game, and since we have a bit less information to go on, it took a bit more digging to find.

The intro to the Drake song “Connect” features the call of a home run being hit. Given that using an actual broadcast clip would probably have required getting the express written consent of MLB, my guess is that Drake had the call rerecorded by an announcer in the studio (as he implies around the 10:30 mark of this video). Still, does it match any games we have on record?

To start, I’m going to assume that this is a major league game, though there’s of course no way of knowing for sure. From the song, all we get is the count, the fact that it was a home run, the direction of the home run, and the name of the outfielder.  The first three are easy to hear, but the fourth is a bit tricky—a few lyrics sites (including the description of the video I linked) list it as “Molina,” but that can’t be the case, as none of the Molinas who’ve played in the bigs played the outfield.

RapGenius, however, lists it as “Revere,” and I’m going to go with that, since Ben Revere is an active major league center fielder and it seems likely that Drake would have sampled a recent game. So, can we find a game that matches all these parameters?

I first checked for only games Revere has played against the Blue Jays, since Drake is from Toronto and the RapGenius notes say (without a source) that the call is from a Jays game. A quick check of Revere’s game logs against the Jays, though, says that he’s never been on the field for a 3-1 homer by a Jay.

What about against any other team? Since checking this by hand wasn’t going to fly (har har), I turned to play-by-play data, available from the always-amazing Retrosheet. With the help of some code from possibly the nerdiest book I own, I was able to filter every play since Revere has joined the league to find only home runs hit to center when Revere was in center and the count was 3-1.
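The filtering step is conceptually simple once the play-by-play data is in a usable form. Here’s a sketch over hypothetical pre-parsed records; real Retrosheet event files need actual parsing first, and every field name below is invented for illustration:

```python
# Hypothetical pre-parsed play-by-play records; real Retrosheet event
# files are a compact text format that needs parsing before this step.
plays = [
    {"event": "HR", "direction": "CF", "count": "3-1",
     "cf_fielder": "Ben Revere", "batter": "Carlos Gomez"},
    {"event": "HR", "direction": "LF", "count": "3-1",
     "cf_fielder": "Ben Revere", "batter": "Jose Bautista"},
    {"event": "HR", "direction": "CF", "count": "2-2",
     "cf_fielder": "Ben Revere", "batter": "Edwin Encarnacion"},
]

# Keep only home runs to center, on a 3-1 count, with Revere in center.
matches = [
    p for p in plays
    if p["event"] == "HR"
    and p["direction"] == "CF"
    and p["count"] == "3-1"
    and p["cf_fielder"] == "Ben Revere"
]
print(matches)  # only the Gomez home run survives the filter
```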

Somewhat magically, there was only one: a first-inning shot by Carlos Gomez against the Twins in 2011. The video is here, for reference. I managed to find the Twins’ TV call via MLB.TV, and the linked video has the Brewers’ TV call; unsurprisingly, neither fits the sample, though I didn’t go looking for the radio calls. Still, the home run is such that it wouldn’t be surprising if one of the radio calls matched what Drake used, or if it was close and Drake had it rerecorded in a way that preserved the details of the play.

So, probably through dumb luck, Drake managed to pick a unique play to sample for his track. But even though it’s a baseball sample, I still click back to “Hold On, We’re Going Home” damn near every time I listen to the album.

Principals of Hitter Categorization

(Note: The apparent typo in the title is deliberate.)

In my experience with introductory statistics classes, both ones I’ve taken and ones I’ve heard about, they typically have two primary phases. The second involves hypothesis testing and regression, which entail trying to evaluate the statistical evidence regarding well-formulated questions. (Well, in an ideal world the questions are well-formulated. Not always the case, as I bitched about on Twitter recently.) This is the more challenging, mathematically sophisticated part of the course, and for those reasons it’s probably the one that people don’t remember quite so well.

What’s the first part? It tends to involve lots of summary statistics and plotting—means, scatterplots, interquartile ranges, all of that good stuff that one does to try to get a handle on what’s going on in the data. Ideally, some intuition regarding stats and data is getting taught here, but that (at least in my experience) is pretty hard to teach in a class. Because this part is more introductory and less complicated, I think this portion of statistics—which is called exploratory data analysis, though there are some aspects of the definition I’m glossing over—can get short shrift when people discuss cool stuff one can do with statistics (though data visualization is an important counterpoint here).

A slightly more complex technique one can do as part of exploratory data analysis is principal component analysis (PCA), which is a way of redefining a data set’s variables based on the correlations present therein. While a technical explanation can be found elsewhere, the basic gist is that PCA allows us to combine variables that are related within the data so that we can pack as much explanatory power as possible into them.

One classic application of this is to athletes’ scores in the decathlon in the Olympics (see example here). There are 10 events, which can be clustered into groups of similar events like the 100 meters and 400 meters and the shot put and discus. If we want to describe the two most important factors contributing to an athlete’s success, we might subjectively guess something like “running ability” and “throwing skill.” PCA can use the data to give us numerical definitions of the two most important factors determining the variation in the data, and we can explore interpretations of those factors in terms of our intuition about the event.

So, what if we take this idea and apply it to baseball hitting data? This would allow us to derive some new factors that explain a lot of the variation in hitting, and by using those factors judiciously we can use this as a way to compare different batters. This idea is not terribly novel—here are examples of some previous work—but I haven’t seen anyone taking the approach I have now. For this post, I’m focused more on what I will call hitting style, i.e. I’d like to set aside similarity based on more traditional results (e.g. home runs—this is the sort of similarity Baseball-Reference uses) in favor of lower-order data, namely a batter’s batted ball profile (e.g. line drive percentage and home run to fly ball ratio). However, the next step is certainly to see how these components correlate with traditional measures of power, for instance Isolated Slugging (ISO).

So, I pulled career-level data from FanGraphs for all batters with at least 1000 PA since 2002 (when batted ball data began being collected) on the following categories: line drive rate (LD%), ground ball rate (GB%), outfield fly ball rate (FB%), infield fly ball rate (IFFB%), home run/fly ball ratio (HR/FB), walk rate (BB%), and strikeout rate (K%). (See report here.) (I considered using infield hit rate as well, but it doesn’t fit in with the rest of these things—it’s more about speed and less about hitting, after all.)

I then ran the PCA on these data in R, and here are the first two components, i.e. the two weightings that together explain as much of the data as possible. (Things get a bit harder to interpret when you add a third dimension.) All data are normalized, so that coefficients are comparable, and it’s most helpful to focus on the signs and relative magnitudes: if one variable is weighted 0.6 and another -0.3, the takeaway is that the first is twice as important to the component as the second, and the two push the component in opposite directions.
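The analysis was done in R, but the mechanics of PCA are easy to sketch in any language. Here’s a minimal version on standardized synthetic data (the numbers are random stand-ins, not the real batted-ball data):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in for the 642-batter, 7-variable data set.
raw = rng.normal(size=(642, 7))

# Standardize each column, as in the real analysis, so weights are comparable.
z = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# PCA via singular value decomposition of the standardized data matrix.
_, singular_values, components = np.linalg.svd(z, full_matrices=False)

# Proportion of variance explained by each of the seven components.
explained = singular_values ** 2 / (singular_values ** 2).sum()
print(explained.round(3))      # non-increasing, sums to 1.0
print(components[0].round(3))  # the weights defining the first component

# A player's score on a component is the dot product of his standardized
# stats with that component's weights.
scores_pc1 = z @ components[0]
```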

Weights for First Two Principal Components
Variable PC1 PC2
LD% -0.030 0.676
GB% -0.459 0.084
FB% 0.526 0.093
IFFB% -0.067 -0.671
HR/FB 0.459 -0.137
BB% 0.375 0.205
K% 0.394 -0.126

The first two components explain 39% and 22%, respectively, of the overall variation in our data. (The next two explain 16% and 10%, respectively, so they are still important.) This means, basically, that we can explain about 60% of a given batter’s batted ball profile with only these two parameters. (I have all seven components with their importance in a table at the bottom of the post. It’s also worth noting that, as the later components explain less variation, their variance decreases and players are clustered close together on that dimension.)

Arguably the whole point of this exercise is to come up with a reasonable interpretation for these components, so it’s worth it for you to take a look at the values and the interplay between them. I would describe the two components (which we should really think of as axes) as follows:

  1. The first is a continuum: slap hitters who make a lot of contact, don’t walk much, and hit mostly ground balls (with few fly balls or home runs) sit at the negative end, and big boppers—three true outcomes guys—place at the positive end, as they walk a lot, strike out a lot, and hit more fly balls. This interpretation is borne out by the players with the largest magnitude values for this component (found below). For lack of a better term, let’s call this component BSF, for “Big Stick Factor.”
  2. The second measures, basically, what some people might call “line drive power.” It captures a propensity to hit the ball hard, weighting line drives positively and infield flies negatively. It also rewards guys with good batting eyes, since it weights walk rate positively and strikeout rate negatively. I think of it as assessing an old-fashioned view of what makes a good hitter: lots of contact and line drives, with less uppercutting and thus fewer pop-ups. Let’s call it LDP, for “Line Drive Power.” (I’m open to suggestions on both names.)

Here are some tables showing the top and bottom 10 for both BSF and LDP:

Extreme Values for BSF
Name PC1
1 Russell Branyan 5.338
2 Barry Bonds 5.257
3 Adam Dunn 4.768
4 Jack Cust 4.535
5 Ryan Howard 4.296
6 Jim Thome 4.278
7 Jason Giambi 4.237
8 Frank Thomas 4.206
9 Jim Edmonds 4.114
10 Mark Reynolds 3.890
633 Aaron Miles -3.312
634 Cesar Izturis -3.397
635 Einar Diaz -3.518
636 Ichiro Suzuki -3.523
637 Rey Sanchez -3.893
638 Luis Castillo -4.013
639 Juan Pierre -4.267
640 Wilson Valdez -4.270
641 Ben Revere -5.095
642 Joey Gathright -5.164
Extreme Values for LDP
Name PC2
1 Cory Sullivan 4.292
2 Matt Carpenter 4.052
3 Joey Votto 3.779
4 Joe Mauer 3.255
5 Ruben Tejada 3.079
6 Todd Helton 3.065
7 Julio Franco 2.933
8 Jason Castro 2.780
9 Mark Loretta 2.772
10 Alex Avila 2.747
633 Alexi Casilla -2.482
634 Rocco Baldelli -2.619
635 Mark Trumbo -2.810
636 Nolan Reimold -2.932
637 Marcus Thames -3.013
638 Tony Batista -3.016
639 Scott Hairston -3.041
640 Eric Byrnes -3.198
641 Jayson Nix -3.408
642 Jeff Mathis -3.668

These actually map pretty closely onto what some of our preexisting ideas might have been: the guys with the highest BSF are some of the archetypal three true outcomes players, while the guys with high LDP are guys we think of as good hitters with “doubles power,” as it were. It’s also interesting to note that these are not entirely correlated with hitter quality, as there are some mediocre players near the top of each list (though most of the players at the bottom aren’t too great). That suggests to me that this actually did a pretty decent job of capturing style, rather than just quality (though obviously it’s easier to observe someone’s style when they actually have strengths).

Now, another thing about this is that while we would think that BSF and LDP are correlated based on my qualitative descriptions, by construction there’s zero correlation between the two sets of values, so these are actually largely independent stats. Consider the plot below of BSF vs. LDP:

PCA Cloud

And this plot, which isolates some of the more extreme values:

Big Values

One final thing for this post: given that we have plotted these like coordinates, we can use the standard measure of distance between two points as a measure of similarity. For this, I’m going to stick with just the first two components. The two players most like each other in this sample form a slightly unlikely pair: Marlon Byrd, with coordinates (-0.756, 0.395), and Carlos Ruiz (-0.755, 0.397).
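With only two components, similarity is just ordinary Euclidean distance in the (BSF, LDP) plane; using the coordinates quoted above:

```python
import math

# (BSF, LDP) coordinates for the two closest players in the sample.
byrd = (-0.756, 0.395)
ruiz = (-0.755, 0.397)

# Euclidean distance between the two points.
distance = math.dist(byrd, ruiz)
print(round(distance, 5))  # ~0.00224
```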

As you can see below, though, their batted ball profiles don’t actually appear to be hugely similar. I spent a decent amount of time playing around with this; if you increase the number of components used from two to three or more, the similar players look much more similar in terms of these statistics. However, that gets away from the point of PCA, which is to abstract away from the raw data a bit. Thus, these pairs of similar players are players who have very similar amounts of BSF and LDP, rather than players who have the most similar statistics overall.

Comparison of Ruiz and Byrd
Name LD% GB% FB% IFFB% HR/FB BB% K%
Carlos Ruiz 0.198 0.455 0.255 0.092 0.074 0.098 0.111
Marlon Byrd 0.206 0.471 0.241 0.082 0.093 0.064 0.180

Another pair that’s approximately as close as Ruiz and Byrd is Mark Teahen (-0.420,-0.491) and Akinori Iwamura (-0.421,-0.490), with the third place pair being Yorvit Torrealba (-1.919, -0.500) and Eric Young (-1.909, -0.497), who are seven times farther apart than the first two pairs.

Who stands out as outliers? It’s not altogether surprising if you look at the labelled charts above, though not all of them are labelled. (Also, be wary of the scale—the graph is a bit squished, so many players are farther apart numerically than they appear visually.) Joey Gathright turns out to be by far the most unusual player in our data—the distance to his closest comp, Einar Diaz, is more than 1000x the distance from Ruiz to Byrd, more than thirteen times the average distance to a player’s nearest neighbor, and more than eleven standard deviations above that average nearest neighbor difference.

In this case, though, having a unique style doesn’t appear to be beneficial. You’ll note Gathright is at the bottom of the BSF list, and he’s pretty far down the LDP list as well, meaning that he somehow stumbled into a seven year career despite having no power of any sort. Given that he posted an extremely pedestrian 0.77 bWAR per 150 games (meaning about half as valuable as an average player), hit just one home run in 452 games, and had the 13th lowest slugging percentage of any qualifying non-pitcher since 1990, we probably shouldn’t be surprised that there’s nobody who’s quite like him.

The rest of the players on the outliers list are the ones you’d expect—guys with extreme values for one or both statistics: Joey Votto, Barry Bonds, Cory Sullivan, Matt Carpenter, and Mark Reynolds. Votto is the second biggest outlier, and he’s less than two thirds as far from his nearest neighbor (Todd Helton) as Gathright is from his. Two things to notice here:

  • To reiterate what I just said about Gathright, style doesn’t necessarily correlate with results. Cory Sullivan hit a lot of line drives (28.2%, the largest value in my sample—the mean is 20.1%) and popped out infrequently (3%, the mean is 10.1%). His closest comps are Matt Carpenter and Joe Mauer, which is pretty good company. And yet, he finished as a replacement level player with no power. Baseball is weird.
  • Many of the most extreme outliers are players where we are missing a big chunk of their careers, either because they haven’t actually had them yet or because the data are unavailable. Given that there’s some research indicating that various power-related statistics change with age, I suspect we’ll see some regression to the mean for guys like Votto and Carpenter. (For instance, I imagine Bonds’s profile would look quite different if it included the first 16 years of his career.)

This chart shows the three tightest pairs of players and the six biggest outliers:

New Comps

This is a bit of a lengthy post without necessarily an obvious point, but, as I said at the beginning, exploratory data analysis can be plenty interesting on its own, and I think this turned into a cool way of classifying hitters based on certain styles. An obvious extension is to find some way to merge both results and styles into one PCA analysis (essentially combining what I did with the Bill James/BR Similarity Score mentioned above), but I suspect that’s a big question, and one for another time.

If you’re curious, here’s a link to a public Google Doc with my principal components, raw data, and nearest distances and neighbors, and below is the promised table of PCA breakdown:

Weights and Explanatory Power of Principal Components
Variable PC1 PC2 PC3 PC4 PC5 PC6 PC7
LD% -0.030 0.676 -0.299 -0.043 0.629 0.105 -0.210
GB% -0.459 0.084 0.593 0.086 -0.044 0.020 -0.648
FB% 0.526 0.093 -0.288 -0.226 -0.434 -0.014 -0.626
IFFB% -0.067 -0.671 -0.373 0.247 0.442 -0.071 -0.379
HR/FB 0.459 -0.137 0.347 0.113 0.214 0.769 -0.000
BB% 0.375 0.205 0.156 0.808 -0.012 -0.373 -0.000
K% 0.394 -0.126 0.437 -0.461 0.415 -0.503 0.000
Proportion of Variance 0.394 0.218 0.163 0.102 0.069 0.053 0.000
Cumulative Proportion 0.394 0.612 0.775 0.877 0.947 1.000 1.000

Wear Down, Chicago Bears?

I watched the NFC Championship game the weekend before last via a moderately sketchy British stream. It used the Joe Buck/Troy Aikman feed, but whenever that went to commercials they had their own British commentary team whose level of insight, I think it’s fair to say, was probably a notch below what you’d get if you picked three thoughtful-looking guys at random out of an American sports bar. (To be fair, that’s arguably true of most of the American NFL studio crews as well.)

When discussing Marshawn Lynch, one of them brought out the old chestnut that big running backs wear down the defense and thus are likely to get big chunks of yardage toward the end of games, citing Jerome Bettis as an example of this. This is accepted as conventional wisdom when discussing football strategy, but I’ve never actually seen proof of this one way or another, and I couldn’t find any analysis of this before typing up this post.

The hypothesis I want to examine is that bigger running backs are more successful late in games than smaller running backs. All of those terms are tricky to define, so here’s what I’m going with:

  • Bigger running backs are determined by weight, BMI, or both. I’m using Pro Football Reference data for this, which has some limitations in that it’s not dynamic, but I haven’t heard of any source that has any dynamic information on player size.
  • Late in games is the simplest thing to define: fourth quarter and overtime.
  • More successful is going to be measured in terms of yards per carry. This is going to be compared to the YPC in the first three quarters to account for the baseline differences between big and small backs. The correlation between BMI and YPC is -0.29, which is highly significant (p = 0.0001). The low R squared (about 0.1) says that BMI explains about 10% of variation in YPC, which isn’t great but does say that there’s a meaningful connection. There’s a plot below of BMI vs. YPC with the trend line added; it seems like close to a monotonic effect to me, meaning that getting bigger is on average going to hurt YPC. (Assuming, of course, that the player is big enough to actually be an NFL back.)
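As a sketch of how that correlation, p-value, and R squared fit together, here's a toy version with simulated BMI and YPC data. The numbers below are made up (the slope and noise levels are my assumptions, chosen to produce a negative correlation of roughly the reported size); only the structure matches the real analysis.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated BMI and YPC for 166 backs; the real values come from
# Pro Football Reference.
bmi = rng.normal(31, 2, size=166)
ypc = 4.2 - 0.07 * (bmi - 31) + rng.normal(0, 0.45, size=166)

# Pearson correlation and its p-value; R^2 is just the square of r
# for a one-variable linear relationship.
r, p = stats.pearsonr(bmi, ypc)
r_squared = r**2  # share of YPC variance explained by BMI

print(f"r = {r:.2f}, p = {p:.4f}, R^2 = {r_squared:.2f}")
```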


My data set consisted of career-level data split into 4th quarter/OT and 1st-3rd quarters, which I subset to include only carries that came while the game was within 14 points (a cut popular with writers like Bill Barnwell; see about halfway down this post, for example) to strip out huge blowouts, which could distort the numbers. My timeframe was 1999 to the present, which is when PFR's play-by-play database begins. I then restricted the list of running backs to those with at least 50 carries both in the first three quarters and in the fourth quarter and overtime (166 in all). (I looked at different carry cutoffs, and they don't change any of my conclusions.)
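For concreteness, the filtering and splitting described above might look something like this in pandas. The column names and toy rows are my own invention for illustration, not the actual PFR schema:

```python
import pandas as pd

# Hypothetical carry-level data; the real play-by-play comes from
# Pro Football Reference (1999-present).
carries = pd.DataFrame({
    "player":  ["A", "A", "B", "B", "B"],
    "quarter": [2, 4, 1, 4, 5],    # 5 = overtime
    "margin":  [3, 10, 21, 7, 0],  # absolute score margin at the snap
    "yards":   [4, 6, 2, 8, 3],
})

# Drop carries that came in blowouts (margin > 14 points).
close = carries[carries["margin"] <= 14]

# Split each back's carries into early (Q1-3) and late (Q4/OT).
late = close["quarter"] >= 4
splits = close.groupby(["player", late])["yards"].agg(["count", "mean"])

# The real analysis kept only backs with >= 50 carries in BOTH splits
# (166 players); the same filter here would look like:
qualified = splits.unstack()["count"].dropna().min(axis=1) >= 50
print(splits)
```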

Before I dive into my conclusions, I want to preemptively bring up a big issue with this analysis, which is that it uses only aggregate-level data. This involves pairing up data from different games or even different years, which raises two problems immediately. The first is that we're not directly testing the hypothesis; I think it's closer in spirit to interpret it as "if a big running back gets lots of carries early on, his (or his team's) YPC will increase in the fourth quarter," which can only be examined with game-level data. I'm not entirely sure what metrics to look at, as there are a lot of confounds, but it's going in the bucket of ideas for future research.

The second is that, beyond having to look at this potential effect indirectly, we might actually have biases altering the perceived effect: when a player runs ineffectively in the first part of the game, he will probably get fewer carries at the end, partially because he is probably running against a good defense, and partially because his team is likely to be behind and thus passing more. This means that more of the fourth quarter carries likely come when a runner is having a good day, possibly biasing our data.

Finally, it’s possible that the way that big running backs wear the defense down is that they soften it up so that other running backs do better in the fourth quarter. This is going to be impossible to detect with aggregate data, and if this effect is actually present it will bias against finding a result using aggregate data, as it will be a lurking variable inflating the fourth quarter totals for smaller running backs.

Now, I’m not sure that any of these issues will necessarily ruin the results I get with the aggregate data, but they are caveats worth mentioning. I am planning to redo some of this analysis with play-by-play level data, but those data are rather messy, and I’m a little scared of the small sample sizes that come with looking at one quarter at a time, so I think presenting results using aggregated data still adds something to the conversation.

Enough equivocating, let’s get to some numbers. Below is a plot of fourth quarter YPC versus early game YPC; the line is the identity, meaning that points above the line are better in the fourth. The unweighted mean of the difference (Q4 YPC – Q1–3 YPC) is -0.14, with the median equal to -0.15, so by the regular measures a typical running back is less effective in the 4th quarter (on aggregate in moderately close games). (A paired t-test shows this difference is significant, with p < 0.01.)
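The significance check here is just a paired t-test on each back's two splits. A sketch with simulated splits (the means and spreads below are my assumptions, chosen to mimic the reported -0.14 average difference):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Simulated per-back YPC splits for 166 backs; built so that the late
# split averages about 0.14 yards worse, as in the post.
early_ypc = rng.normal(4.2, 0.4, size=166)
late_ypc = early_ypc - 0.14 + rng.normal(0, 0.3, size=166)

# Paired t-test: is the mean within-back difference nonzero?
t, p = stats.ttest_rel(late_ypc, early_ypc)
diff = np.mean(late_ypc - early_ypc)
print(f"mean diff = {diff:.2f}, t = {t:.2f}, p = {p:.4f}")
```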

[Plot: Q4 YPC vs. Q1-3 YPC]

A couple of individual observations jump out here, and if you’re curious, here’s who they are:

  • The guy in the top right, who’s very consistent and very good? Jamaal Charles. His YPC increases by about 0.01 yards in the fourth quarter, the second-smallest change in the data (Chester Taylor has a drop of about 0.001 yards, the smallest).
  • The outlier in the bottom right, meaning a major dropoff, is Darren Sproles, who has the highest early game YPC of any back in the sample.
  • The outlier in the top center with a major increase is Jerious Norwood.
  • The back on the left with the lowest early game YPC in our sample is Mike Cloud, whom I had never heard of. He’s the only guy below 3 YPC for the first three quarters.

A simple linear model gives us a best fit line of (Predicted Q4 YPC) = 1.78 + 0.54 * (Prior Quarters YPC), with an R squared of 0.12. That’s less predictive than I thought it would be, which suggests that there’s a lot of chance in these data and/or there is a lurking factor explaining the divergence. (It’s also possible this isn’t actually a linear effect.)
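A fit like that can be reproduced with ordinary least squares; here's a sketch on simulated pairs built to resemble the reported line (the simulated noise level is an assumption, so the fitted numbers will only be in the neighborhood of 1.78, 0.54, and 0.12):

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated early/late YPC pairs; the post's fitted line was
# Q4 = 1.78 + 0.54 * Q1-3 with R^2 = 0.12.
early = rng.normal(4.2, 0.4, size=166)
late = 1.78 + 0.54 * early + rng.normal(0, 0.55, size=166)

# Least-squares fit of late-game YPC on early-game YPC.
slope, intercept = np.polyfit(early, late, 1)

# R^2 from the residuals of the fit.
pred = intercept + slope * early
r2 = 1 - np.sum((late - pred) ** 2) / np.sum((late - np.mean(late)) ** 2)
print(f"Q4 YPC ~ {intercept:.2f} + {slope:.2f} * early, R^2 = {r2:.2f}")
```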

However, that lurking variable doesn’t appear to be running back size. Below is a plot showing running back BMI vs. (Q4 YPC – Q1–3 YPC); there doesn’t seem to be a real relationship. The plot below it shows the difference against fourth quarter carries (the horizontal line is the average value of -0.13), which somewhat suggests the effect shrinks as sample size increases, though these data are non-normal, so it’s not an easy thing to assess at a glance.

[Plot: BMI vs. YPC difference]

[Plot: fourth quarter carries vs. YPC difference]

That intuition is borne out if we look at the correlation between the two, with an estimate of 0.02 that is not close to significant (p = 0.78). Using weight and height instead of BMI gives us larger apparent effects, but they’re still not significant (r = 0.08 with p = 0.29 for weight, r = 0.10 with p = 0.21 for height). Throwing these variables into the regression predicting Q4 YPC from previous YPC also yields no effect close to significant, though I don’t put much stock in that, because I don’t think much of that model to begin with.

Our talking head, though, mentioned Lynch and Bettis by name. Do we see anything for them? Unsurprisingly, we don’t—Bettis has a net improvement of 0.35 YPC, with Lynch actually falling off by 0.46 YPC, though both of these are within one standard deviation of the average effect, so they don’t really mean much.

On a more general scale, it doesn’t seem like a change in YPC in the fourth quarter can be attributed to running back size. My hunch is that this is accurate, and that “big running backs make it easier to run later in the game” is one of those things that people repeat because it sounds reasonable. However, given all of the data issues I outlined earlier, I can’t conclude that with any confidence, and all we can say for sure is that it doesn’t show up in an obvious manner (though at some point I’d love to pick at the play by play data). At the very least, though, I think that’s reason for skepticism next time some ex-jock on TV mentions this.

Do Low Stakes Hockey Games Go To Overtime More Often?

Sean McIndoe wrote another piece this week about NHL overtime and the Bettman point (the 3rd point awarded for a game that is tied at the end of regulation—depending on your preferred interpretation, it’s either the point for the loser or the second point for the winner), and it raises some interesting questions. I agree with one part of his conclusion (the loser point is silly), but not with his proposed solution—I think a 10 or 15 minute overtime followed by a tie is ideal, and would rather get rid of the shootout altogether. (There may be a post in the future about different systems and their advantages/disadvantages.)

At one point, McIndoe is discussing how the Bettman point affects game dynamics, namely that it makes teams more likely to play for a tie:

So that’s exactly what teams have learned to do. From 1983-84 until the 1998-99 season, 18.4 percent of games went to overtime. Since the loser point was introduced, that number has jumped up to 23.5 percent. That’s far too big a jump to be a coincidence. More likely, it’s the result of an intentional, leaguewide strategy: Whenever possible, make sure the game gets to overtime.

In fact, if history holds, this is the time of year when we’ll start to see even more three-point games. After all, the more important standings become, the more likely teams will be to try to maximize the number of points available. And sure enough, this has been the third straight season in which three-point games have increased every month. In each of the last three full seasons, three-point games have mysteriously peaked in March.

So, McIndoe is arguing that teams are effectively playing for overtime later in the season because they feel a more acute need for points. If you’re curious, the trend he cites is statistically significant in my analysis, based on a simple correlation between the fraction of games tied at the end of regulation and the relative month of the season. If one assumes the effect is linear, each month the season goes on, a game becomes 0.5 percentage points more likely to go to overtime. (As an aside, I suspect a lot of the year-over-year trend is explained by a decrease in scoring over time, but that’s also a topic for another post.)
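To illustrate the size of that trend, here's a quick sketch of the month-by-month fit. The monthly OT shares below are made-up placeholders on the order of the real ones, chosen so the slope lands near the 0.5-points-per-month estimate:

```python
import numpy as np

# Hypothetical share of games going to OT by relative month of the
# season (Oct = 1 ... Apr = 7); placeholder values, not the real data.
months = np.arange(1, 8)
ot_share = np.array([0.215, 0.225, 0.220, 0.235, 0.230, 0.245, 0.250])

# Linear fit: how much does the OT share rise per month?
slope, intercept = np.polyfit(months, ot_share, 1)
print(f"{slope * 100:.2f} percentage points per month")
```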

I’m somewhat unconvinced of this, given that later in the year there are teams tanking for draft position (which would rather just take the loss) and teams in playoff contention that want to deprive rivals of the extra point. (Moreover, teams may also become more sensitive to playoff tiebreakers, the first of which is regulation and overtime wins.) If I had to guess, I would attribute the increase in ties to sloppier play caused by injuries and fatigue, but that’s something I’d like to investigate and hopefully will in the future.

Still, McIndoe’s idea is interesting, as it (along with his discussion of standings inflation, in which injecting more points into the standings makes everyone likelier to keep their jobs) suggests to me that there could be some element of collusion in hockey play, in that under some circumstances both teams will strategically maximize the likelihood of a game going to overtime. He believes that both teams will want the points in a playoff race. If this quasi-collusive mechanism is actually in place, where else might we see it?

My idea to test this is to look at interconference matchups. Why? This will hopefully be clear from looking at the considerations when a team wins in regulation instead of OT or a shootout:

  1. The other team gets one point instead of zero. Because the two teams are in different conferences, this has no effect on whether either team makes the playoffs, or their seeding in their own conference. The only way it matters is if a team suspects it would want home ice advantage in a matchup against the team it is playing…in the Stanley Cup Finals, which is so unlikely that a) it won’t play into a team’s plans and b) even if it did, would affect very few games. So, from this perspective there’s no incentive to win a 2 point game rather than a 3 point game.
  2. Regulation and overtime wins are a tiebreaker. However, points are much more important than the tiebreaker, so a decision that increases the probability of getting points will presumably dominate considerations about needing the regulation win. Between 1 and 2, we suspect that one team benefits when an interconference game goes to overtime, and the other is not hurt by the result.
  3. The two teams could be competing for draft position. If both teams are playing to lose, we would suspect this would be similar to a scenario in which both teams are playing to win, though that’s a supposition I can test some other time.

It seems to me, then, that if this incentive issue exists, we might see it in interconference games. Our hypothesis is that interconference games result in more three-point games than intraconference games.

Using data from Hockey Reference, I looked at the results of every regular season game since 1999, when overtime losses began getting teams a point, counting the number of games that went to overtime. (During the time they were possible, I included ties in this category.) I also looked at the stats restricted to games since 2005, when ties were abolished, and I didn’t see any meaningful differences in the results.

As it turns out, 24.0% of interconference games have gone to OT since losers started getting a point, compared with…23.3% of intraconference games. That difference isn’t statistically significant (p = 0.44); I haven’t done power calculations, but since our sample of interconference games has N > 3000, I’m not too worried about power. Moreover, given the point estimate (raw difference) of 0.7%, we are looking at such a small effect even if it were significant that I wouldn’t put much stock in it. (The corresponding figures for the shootout era are 24.6% and 23.1%, with a p-value of 0.22, so still not significant.)
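That comparison is a standard two-proportion z-test. Here's a sketch using counts roughly consistent with the percentages above; the exact game totals are my assumptions, so the p-value comes out near, but not exactly at, the 0.44 reported:

```python
import numpy as np
from scipy.stats import norm

# Assumed counts: 24.0% of ~3000 interconference games vs. 23.3% of
# ~11000 intraconference games went to OT. The totals are placeholders.
x1, n1 = 720, 3000     # interconference OT games, total games
x2, n2 = 2563, 11000   # intraconference OT games, total games

p1, p2 = x1 / n1, x2 / n2
pooled = (x1 + x2) / (n1 + n2)

# Two-proportion z-test with a pooled standard error.
se = np.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
p_value = 2 * norm.sf(abs(z))
print(f"diff = {p1 - p2:.3f}, z = {z:.2f}, p = {p_value:.2f}")
```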

My idea was that we would see more overtime games, not more shootout games, as it’s unclear how the incentives align for teams to prefer the shootout, but I looked at the numbers anyway. Since 2005, 14.2% of interconference games have gone to the skills competition, compared to 13.0% of intraconference games. Not to repeat myself too much, but that’s still not significant (p = 0.23). Finally, even if we look at shootouts as a fraction of games that do go to overtime, we see no substantive difference—57.6% for interconference games, 56.3% for intraconference games, p = 0.69.

So, what do we conclude from all of these null results? Well, not much, at least directly—such is the problem with null results, especially when we are testing an inference from another hypothesis. It suggests that NHL teams aren’t repeatedly and blatantly colluding to maximize points, and it also suggests that if you watch an interconference game you’ll get to see the players trying just as hard, so that’s good, if neither novel nor what we set out to examine. More to the point, my read is that this does throw some doubt on McIndoe’s claims about a deliberate increase in ties over the course of the season, as it shows that in another circumstance where teams have an incentive to play for a tie, there’s no evidence that they are doing so. However, I’d like to do several different analyses that ideally address this question more directly before stating that firmly.

Or, to borrow the words of a statistician I’ve worked with: “We don’t actually know anything, but we’ve tried to quantify all the stuff we don’t know.”

Casey Stengel: Hyperbole Proof

Today, as an aside in Jayson Stark’s column about replay:

“I said, ‘Just look at this as something you’ve never had before,'” Torre said. “And use it as a strategy. … And the fact that you only have two [challenges], even if you’re right — it’s like having a pinch hitter.’ Tony and I have talked about it. It’s like, ‘When are you going to use this guy?'”

But here’s the problem with that analogy: No manager would ever burn his best pinch hitter in the first inning, right? Even if the bases were loaded, and Clayton Kershaw was pitching, and you might never have a chance this good again.

No manager would do that? In the same way that no manager would ramble on and on when speaking before the Senate Antitrust Subcommittee. That is to say, Casey Stengel would do it. Baseball Reference doesn’t have the best interface for this, and it would have taken me a while to dig this out of Retrosheet, but Google led me to this managerial-themed quiz, which led me in turn to the Yankees-Tigers game from June 10, 1954. Casey pinch hit in the first inning—twice! I’m sure there are more examples of this, but this was the first one I could find.

Casey Stengel: great manager, and apparently immune to rhetorical questions.