# Stealing an Advantage from First and Third

(Note: Inspired by this post from Jeff Fogle, I decided to change the format up a bit for this post, specifically by putting an abstract at the beginning. We’ll see if it sticks.) This post looks at baserunning strategy with runners on first and third, specifically when to have the runner on first attempt to steal. My research suggests that teams may currently be employing this strategy in a non-optimal manner. While they start the runner as often as they should with one out, they should probably run more frequently with zero and two outs than they currently do. The gain from this aggressiveness is likely to be small, on the order of a few runs a season. Read on if you want to know how I came to this conclusion.

Back when I used to play a lot of the Triple Play series, I loved calling for a steal with runners on first and third. It seemed like you could basically always get the runner to second, and if he drew a throw then the runner on third would score. It’s one of those fun plays that introduced a bit of chaos and works disproportionately frequently in videogames. Is that last statement true? Well, I don’t know how frequently it worked in Triple Play 99, but I can look at how frequently it works in the majors. And it appears to work pretty darn frequently.*

* I haven’t found any prior research directly addressing this, but this old post by current Pirates analytics honcho Dan Fox obliquely touches on it. I’m pretty confident that his conclusions are different because he’s omitting an important case and focusing directly on double steals, and not because either one of us is wrong.

I used Retrosheet play-by-play data from 1989–2013, taking events classified as caught stealing, stolen bases, balks, and pickoffs with runners at first and third. I then removed caught stealing and steals where the runner on first remained on first at the end of the play, leaving 8500 events or so. That selection of events is similar to what Tom Tango et al. do in The Book and controls for the secondary effects of base stealing, but I added the restriction about the runner on first to remove failed squeezes, straight steals of home, and other things that aren’t related to what we’re looking at. This isn’t going to perfectly capture the events we want, but modulo the limitations of play-by-play data it’s the best cut of the data I could think of. (It’s missing two big things: the impact of running on batter performance and what happens when the runners go and the ball is put in play. The first would take a lot of digging to guess at, and the second is impossible to get from my data, so I’m going to postulate they have a small effect and leave it at that.)

So, let’s say we define an outcome to be successful if it leads to an increased run expectancy. (Run expectancy is computed empirically and is essentially the average number of runs scored in the remainder of an inning given where the baserunners are and how many outs there are.) In this particular scenario, increased run expectancy is equivalent to an outcome where both runners are safe, which occurs 82.7% of the time. For reference, league average stolen base percentage over this period is 69.9% (via the Lahman database), so that’s a sizeable difference in success rates (though the latter figure doesn’t account for pickoffs, errors, and balks). (For what it’s worth, both of those numbers have gone up between 4 and 6 percentage points in the last five years.)

How much of that is due to self-selection and how much is intrinsic to the situation itself? In other words, is this just a function of teams picking their spots? It’s hard to check every aspect of this (catcher, pitcher, leverage, etc.), so I chose to focus on one, which is the stolen base percentage of the runner on first. I used a three year centered average for the players (meaning if the attempt took place in 1999, I used their combined stolen base figures from 1998–2000), and it turns out that on aggregate runners on first during 1st and 3rd steal attempts are about one percentage point better than the league average. That’s noticeable and not meaningless, but given how large the gap in success rate is the increased runner quality can’t explain the whole thing.

Now, what if we want to look at the outcomes more granularly? The results are in the table below. (The zeros are actually zero, not rounded.)

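Outcomes of 1st/3rd Steal Attempts (Percentage)

Rows give the runner on third’s destination; columns give the runner on first’s destination.

| Runner on Third’s Destination | Out | 1st Base | 2nd Base | 3rd Base | Run |
|---|---|---|---|---|---|
| Out | 0.20 | 0.97 | 2.78 | 0.23 | 0.00 |
| 3rd Base | 12.06 | 0.00 | 69.89 | 0.00 | 0.00 |
| Run | 1.07 | 0.36 | 9.31 | 2.98 | 0.15 |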
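As a quick consistency check, the cells of the table can be summed directly: every outcome in which neither runner is out counts as a success. The sketch below simply copies the percentages from the table; the dictionary layout and the function name are illustrative choices, not part of the original analysis.

```python
# Outcome percentages copied from the table above.
# Keys: (runner on third's destination, runner on first's destination).
outcomes = {
    ("out", "out"): 0.20, ("out", "1B"): 0.97, ("out", "2B"): 2.78,
    ("out", "3B"): 0.23, ("out", "run"): 0.00,
    ("3B", "out"): 12.06, ("3B", "1B"): 0.00, ("3B", "2B"): 69.89,
    ("3B", "3B"): 0.00, ("3B", "run"): 0.00,
    ("run", "out"): 1.07, ("run", "1B"): 0.36, ("run", "2B"): 9.31,
    ("run", "3B"): 2.98, ("run", "run"): 0.15,
}


def both_safe_rate(table):
    """Sum the percentages of every outcome in which neither runner is out."""
    return sum(pct for (third, first), pct in table.items()
               if third != "out" and first != "out")


total = sum(outcomes.values())
success = both_safe_rate(outcomes)
print(f"total = {total:.2f}%, both runners safe = {success:.2f}%")
```

The both-safe cells add up to the 82.7% success rate quoted earlier, and the full table adds up to 100%.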

This doesn’t directly address run expectancy, which is what we need if we’re going to actually determine the utility of this tactic. If you take into account the number of outs, balks, and pickoffs and combine the historical probabilities seen in that table with Baseball Prospectus’s 2013 run expectancy tables*, you get that each attempt is worth about 0.07 runs. (Restricting to the last five years, it’s 0.09.) That’s something, but it’s not much: you’d need to have 144 attempts a year at that success rate to get an extra win, which isn’t likely to happen given that there are only about 200 1st and 3rd situations per team per year according to my quick count. Overall, the data suggest the break even success rate is on the order of 76%.**

* I used 2013 tables a) to simplify things and b) to make these historical rates more directly applicable to the current run environment.

** That’s computed using a slight simplification—I averaged the run values of all successful and unsuccessful outcomes separately, then calculated the break even point for that constructed binary process. Take the exact values with a grain of salt given the noise in the low-probability, high-impact outcomes (e.g. both runners score, both runners are out).
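The binary simplification described in that footnote can be sketched as follows. The run values below are hypothetical placeholders chosen only to illustrate the arithmetic (the averaged run values themselves are not reported here), and the function names are made up for this example.

```python
def break_even_rate(avg_gain, avg_loss):
    """Success probability at which a binary gamble has zero expected value.

    avg_gain: average run-expectancy gain of a successful outcome (positive).
    avg_loss: average run-expectancy loss of a failed outcome (positive).
    Solves p * avg_gain - (1 - p) * avg_loss = 0 for p.
    """
    return avg_loss / (avg_gain + avg_loss)


def expected_value(success_rate, avg_gain, avg_loss):
    """Expected run change per attempt for the same binary process."""
    return success_rate * avg_gain - (1 - success_rate) * avg_loss


# Hypothetical run values, NOT taken from the analysis above.
gain, loss = 0.25, 0.75
p_star = break_even_rate(gain, loss)
print(f"break-even rate: {p_star:.1%}")
print(f"EV at an 82.69% success rate: {expected_value(0.8269, gain, loss):+.3f} runs")
```

The larger the loss is relative to the gain, the higher the break-even rate, which is why the break-even point moves with the number of outs.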

There’s a wrinkle to this, though, which is that the stakes and decision making processes are going to be different with zero, one, or two outs. Historically, the expected value of running with first and third has actually been negative with one out (-0.04), whereas the EV for running with two outs is about twice the overall figure. (The one out EV is almost exactly 0 over the last five years, but I don’t want to draw too many conclusions from that if it’s a blip and not a structural change.) That’s a big difference, probably driven by the fact that the penalty for taking the out is substantially less with two outs, and it’s not due to a small sample, since two out attempts make up more than half the data. (For what it’s worth, there aren’t substantive discrepancies in the SB% of the runners involved between the different out states.) The table below breaks it down more clearly:

Success and Break Even Rates for 1st/3rd Steal Attempts by Outs

| Number of Outs | Historical Success Percentage | Break Even Percentage |
|---|---|---|
| 0 | 81.64 | 74.61 |
| 1 | 73.65 | 78.00 |
| 2 | 88.71 | 69.03 |
| Overall | 82.69 | 75.52 |

That third row is where I think there’s a lot of hay to be made, and I think the table makes a pretty clear case: managers should be quite aggressive about starting the runner if there’s a first and third with two outs, even if there’s a slightly below average runner at first. They should probably be a bit more aggressive than they currently are with no outs, and more conservative with one out.
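To make the comparison concrete, the gap between the historical success rate and the break-even rate can be computed for each out state. This is a small sketch using only the numbers in the table above; the verdict labels are shorthand for this example.

```python
# Historical success and break-even percentages from the table above.
rates = {
    "0 outs": (81.64, 74.61),
    "1 out": (73.65, 78.00),
    "2 outs": (88.71, 69.03),
    "Overall": (82.69, 75.52),
}

# Positive margin: runners have succeeded more often than they needed to.
margins = {outs: round(success - break_even, 2)
           for outs, (success, break_even) in rates.items()}

for outs, margin in margins.items():
    verdict = "room to run more" if margin > 0 else "running too often"
    print(f"{outs:8s} margin = {margin:+6.2f} points ({verdict})")
```

The two-out margin is nearly three times the zero-out margin, and one out is the only state where the margin is negative.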

There’s also plenty of room for this to happen more frequently; with two outs, the steal attempt rate last year was about 6.6% (it’s 5% with one out, and 4% with no outs). The number of possible attempts per team last year was roughly 200, split 100/70/30 between 2/1/0 outs, so there are some reasonable gains to be made. It’s not going to make a gigantic impact, but if a team sends the runner twice as often as they have been with two outs (about one extra time per 25 games), that’s a run gained, which is small but still an edge worth taking. Maybe my impulses when playing Triple Play had something to them after all.
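The back-of-the-envelope arithmetic in that paragraph can be written out as below. One input is an assumption: the two-out run value per attempt is taken as 0.14, i.e. roughly twice the 0.07 overall figure, and the extra attempts are assumed to succeed at the historical rate.

```python
# Figures quoted in the text; the two-out run value is an assumption,
# taken as roughly twice the 0.07 overall value per attempt.
two_out_situations = 100      # per team per season
attempt_rate = 0.066          # two-out steal attempt rate last year
runs_per_attempt = 0.14       # assumed: about 2 x 0.07
games = 162

current_attempts = two_out_situations * attempt_rate
extra_attempts = current_attempts          # doubling adds as many again
games_per_extra = games / extra_attempts
extra_runs = extra_attempts * runs_per_attempt

print(f"extra attempts per season: {extra_attempts:.1f}")
print(f"one extra attempt every {games_per_extra:.1f} games")
print(f"extra runs per season: {extra_runs:.2f}")
```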

# A Look at Pitcher Defense

Like most White Sox fans, I was disappointed when Mark Buehrle left the team. I didn’t necessarily think they made a bad decision, but Buehrle is one of those guys that makes me really appreciate baseball on a sentimental level. He’s never seemed like a real ace, but he’s more interesting: he worked at a quicker pace than any other pitcher, was among the very best fielding pitchers, and held runners on like few others (it’s a bit out of date, but this post has him picking off two runners for each one that steals, which is astonishing).

In my experience, these traits are usually discussed as though they’re unrelated to his value as a pitcher, and the same could probably be said of the fielding skills possessed by guys like Jim Kaat and Greg Maddux. However, that’s covering up a non-negligible portion of what Buehrle has brought to his teams over the years; using a crude calculation of 10 runs per win, his 87 Defensive Runs Saved are equal to about 20% of his 41 WAR during the era for which we have DRS numbers. (Roughly half of that 20% is from fielding his position, with the other half coming from his excellent work in inhibiting base thieves. Defensive Runs Saved is a commonly used, all-encompassing defensive metric from Baseball Info Solutions. All numbers in this piece are from FanGraphs.) Buehrle’s extreme, but he’s not the only pitcher like this; Jake Westbrook had 62 DRS and only 18 WAR or so in the DRS era, which means the DRS equate to more than 30% of the WAR.
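For anyone who wants to redo that conversion, here is a minimal sketch using the crude 10 runs per win figure and the DRS and WAR totals quoted above; the function name is invented for illustration.

```python
RUNS_PER_WIN = 10  # the crude conversion used in the text


def drs_share_of_war(drs, war, runs_per_win=RUNS_PER_WIN):
    """Fraction of a pitcher's WAR accounted for by Defensive Runs Saved."""
    return (drs / runs_per_win) / war


buehrle = drs_share_of_war(87, 41)
westbrook = drs_share_of_war(62, 18)
print(f"Buehrle: {buehrle:.0%}, Westbrook: {westbrook:.0%}")
```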

So fielding can make up a substantial portion of a pitcher’s value, but it seems like we rarely discuss it. That makes a fair amount of sense; single season fielding metrics are considered to be highly variable for position players who will be on the field for six times as many innings as a typical starting pitcher, and pitcher defensive metrics are less trustworthy even beyond that limitation. Still, though, I figured it’d be interesting to look at which sorts of pitchers tend to be better defensively.

For purposes of this study, I only looked at what I’ll think of as “fielding runs saved,” which is total Defensive Runs Saved less runs saved from stolen bases (rSB). (If you’re curious, there is a modest but noticeable 0.31 correlation between saving runs on stolen bases and fielding runs saved.) I also converted it into a rate stat by dividing by the number of innings pitched and then multiplying by 150 to give a full season rate. Finally, I restricted to aggregate data from the 331 pitchers who threw at least 300 innings (2 full seasons by standard reckoning) between 2007 and 2013; 2007 was chosen because it’s the beginning of the Pitch F/X era, which I’ll get to in a little bit. My thought is that a sample size of roughly 330 is pretty reasonable, and while players will have changed over the full time frame it also provides enough innings that the estimates will be a bit more stable.
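The rate stat described above is simple enough to write down directly. This is a sketch with a made-up pitcher; the function and argument names are illustrative.

```python
def fielding_runs_per_150(drs, rsb, innings):
    """Fielding runs saved (DRS minus stolen-base runs) per 150 innings."""
    if innings <= 0:
        raise ValueError("innings must be positive")
    return (drs - rsb) / innings * 150


# Hypothetical pitcher: 12 DRS, 4 of them from controlling the running game,
# over 600 innings.
rate = fielding_runs_per_150(drs=12, rsb=4, innings=600)
print(f"{rate:.1f} fielding runs per 150 IP")
```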

One aside is that DRS, as a counting stat, doesn’t adjust for how many opportunities a given fielder has, so a pitcher who induces lots of strikeouts and fly balls will necessarily have DRS values smaller in magnitude than another pitcher of the same fielding ability but different pitching style.

Below is a histogram of pitcher fielding runs/150 IP for the population in question:

If you’re curious, the extreme positive values are Greg Maddux and Jake Westbrook, and the extreme negative values are Philip Humber, Brandon League, and Daniel Cabrera.

This raises another set of questions: what sort of pitchers tend to be better fielders? To test this, I decided to use linear regression—not because I want to make particularly nuanced predictions using the estimates, but because it is a way to examine how much of a correlation remains between fielding and a given variable after controlling for other factors. Most of the rest of the post will deal with the regression methods, so feel free to skip to the bold text at the end to see what my conclusions were.

What jumped out to me initially is that Buehrle, R.A. Dickey, Westbrook, and Maddux are all extremely good fielding pitchers that aren’t hard throwers; to that end, I included their average velocity as one of the independent variables in the regression. (Hence the restriction to the Pitch F/X era.) To control for the fact that harder throwers also strike out more batters and thus don’t have as many opportunities to make plays, I included the pitcher’s strikeouts per nine IP as a control as well.

It also seems plausible to me that there might be a handedness effect or a starter/reliever gap, so I added indicator variables for those to the model as well. (Given that righties and relievers throw harder than lefties and starters, controlling for velocity is key. Relievers are defined as those with at least half their innings in relief.) I also added in ground ball rate, with the thought that having more plays to make could have a substantial effect on the demonstrated fielding ability.

There turns out to be a noticeable negative correlation between velocity and fielding ability. This doesn’t surprise me, as it’s consistent with harder throwers having a longer, more intense delivery that makes it harder for them to react quickly to a line drive or ground ball. According to the model, we’d associate each mile per hour increase with a 0.2 fielding run per season decrease; however, I’d shy away from doing anything with that estimate given how poor the model is. (The R-squared values on the models discussed here are all less than 0.2, which is not very good.) Even if we take that estimate at face value, though, it’s a pretty small effect, and one that’s hard to read much into.

We don’t see any statistically significant results for K/9, handedness, or starter/reliever status. (Remember that this doesn’t take into account runs saved through stolen base prevention; in that case, it’s likely that left handers will rate as superior and hard throwers will do better due to having a faster time to the plate, but I’ll save that for another post.) In fact, of the non-velocity factors considered, only ground ball rate has a significant connection to fielding; it’s positively related, with a rough estimate that a percentage point increase in groundball rate will have a pitcher snag 0.06 extra fielding runs per 150 innings. That is statistically significant, but it’s a very small amount in practice and I suspect it’s contaminated by the fact that an increase in ground ball rate is related to an increase in fielding opportunities.
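To illustrate what "controlling for other factors" means mechanically, here is a small ordinary least squares sketch in plain Python. It is not the regression actually run for this post: the eight pitchers are invented, and the response is constructed (an assumption made purely for illustration) so that only velocity and ground ball rate matter, using the rough effect sizes quoted above. The solver uses the normal equations and assumes the predictors are not perfectly collinear.

```python
def ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    rows: list of predictor lists (without the intercept column).
    y: list of responses. Returns [intercept, b1, b2, ...].
    """
    X = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(X[0])
    # Build the augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Hypothetical pitchers: [velocity (mph), K/9, GB%, lefty (0/1), reliever (0/1)].
pitchers = [
    [88.5, 5.8, 46.0, 1, 0],
    [93.1, 8.9, 41.0, 0, 0],
    [95.4, 10.2, 38.5, 0, 1],
    [90.2, 6.5, 55.0, 0, 0],
    [91.7, 7.4, 44.0, 1, 1],
    [86.9, 5.1, 50.5, 1, 0],
    [94.0, 9.5, 47.0, 0, 1],
    [89.6, 7.0, 42.5, 0, 0],
]
# Made-up response: -0.2 runs per mph and +0.06 runs per GB point, nothing else.
fielding_runs = [15.0 - 0.2 * velo + 0.06 * gb for velo, _, gb, _, _ in pitchers]

names = ["intercept", "velocity", "K/9", "GB%", "lefty", "reliever"]
for name, coef in zip(names, ols(pitchers, fielding_runs)):
    print(f"{name:10s} {coef:+.3f}")
```

Because strikeout rate rises with velocity in the invented data, a one-variable fit on K/9 would pick up a spurious negative slope; including velocity in the same model is what lets the K/9 coefficient fall back to roughly zero.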

To attempt to control for that contamination, I changed the model so that the dependent (i.e. predicted) variable was [fielding runs / (IP/150 * GB%)]. That stat is hard to interpret intuitively (if you elide the batters faced vs. IP difference, it’s fielding runs per groundball), so I’m not thrilled about using it, but for this single purpose it should be useful to help figure out if ground ball pitchers tend to be better fielders even after adjusting for additional opportunities.
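Written out, the adjusted dependent variable looks like the sketch below. Whether GB% enters as percentage points or as a fraction is not specified above; this example assumes percentage points, and the pitcher is hypothetical.

```python
def fielding_runs_per_gb_unit(fielding_runs, innings, gb_pct):
    """Fielding runs divided by (IP / 150 * GB%), the adjusted dependent variable.

    gb_pct is assumed to be in percentage points (e.g. 45.0 for 45%).
    """
    if innings <= 0 or gb_pct <= 0:
        raise ValueError("innings and gb_pct must be positive")
    return fielding_runs / (innings / 150 * gb_pct)


# Hypothetical pitcher: 8 fielding runs over 600 IP with a 50% ground ball rate.
adjusted = fielding_runs_per_gb_unit(8, 600, 50.0)
print(f"{adjusted:.3f} fielding runs per (150 IP x GB point)")
```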

As it turns out, the same variables are significant in the new model, meaning that even after controlling for the number of opportunities GB pitchers and soft tossers are generally stronger fielders. The impact of one extra point of GB% is approximately equivalent to losing 0.25 mph off the average pitch speed; however, since pitch speed has a pretty small coefficient we wouldn’t expect either of these things to have a large impact on pitcher fielding.

This was a lot of math for not a huge effect, so here’s a quick summary of what I found in case I lost you:

• Harder throwers contribute less on defense even after controlling for having fewer defensive opportunities due to strikeouts. Ground ball pitchers contribute more than other pitchers even if you control for having more balls they can make plays on.
• The differences here are likely to be very small and fairly noisy (especially if you remember that the DRS numbers themselves are a bit wonky), meaning that, while they apply in broad terms, there will be lots and lots of exceptions to the rule.
• Handedness and role (i.e. starter/reliever) have no significant impact on fielding contribution.

All told, then, we shouldn’t be too surprised Buehrle is a great fielder, given that he doesn’t throw very hard. On the other hand, though, there are plenty of other soft tossers who are minus fielders (Freddy Garcia, for instance), so it’s not as though Buehrle was bound to be good at this. To me, that just makes him a little bit quirkier and reminds me of why I’ll have a soft spot for him above-and-beyond what he got just for being a great hurler for the Sox.

# Picking a Pitch and the Pace of the Game

Here’s a short post to answer a straightforward question: do pitchers that throw more pitches pitch more slowly? If it’s not clear, the idea is that a pitcher who throws several pitches frequently will take longer because the catcher has to spend more time calling the pitch, perhaps with a corresponding increase in how often the pitcher shakes off the catcher.

To make a quick pass at this, I pulled FanGraphs data on how often each pitcher threw fastballs, sliders, curveballs, changeups, cutters, splitters, and knucklers, using data from 2009–13 on all pitchers with at least 200 innings. (See the data here. There are well-documented issues with the categorizations, but for a small question like this they are good enough.) The statistic used for how quickly the pitcher worked was the appropriately named Pace, which measures the number of seconds between pitches thrown.

To easily test the hypothesis, we need a single number to measure how even the pitcher’s pitch mix is, which we believe to be linked to the complexity of the decision they need to make. There are many ways to do this, but I decided to go with the Herfindahl-Hirschman Index, which is usually used to measure market concentration in economics. It’s computed by squaring the percentage share of each pitch and adding them together, so higher values mean things are more concentrated. (The theoretical max is 10,000.) As an example, Mariano Rivera threw 88.9% cutters and 11.1% fastballs over the time period we’re examining, so his HHI was $88.9^{2} + 11.1^{2} = 8026$. David Price threw 66.7% fastballs, 5.8% sliders, 6.6% cutters, 10.6% curveballs, and 10.4% changeups, leading to an HHI of 4746. (See additional discussion below.) If you’re curious, the most and least concentrated repertoires split by role are in a table at the bottom of the post.
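The HHI calculation is a one-liner. The sketch below reproduces the Rivera figure from the percentages quoted above and adds two reference points (a single-pitch pitcher and an even five-pitch mix) that are hypothetical.

```python
def hhi(shares_pct):
    """Herfindahl-Hirschman Index: sum of squared percentage shares."""
    return sum(share ** 2 for share in shares_pct)


rivera = [88.9, 11.1]          # cutter, fastball (from the text)
one_pitch = [100.0]            # hypothetical: theoretical maximum
even_five = [20.0] * 5         # hypothetical: five pitches thrown equally

print(f"Rivera HHI:           {hhi(rivera):.0f}")
print(f"Single-pitch maximum: {hhi(one_pitch):.0f}")
print(f"Even five-pitch mix:  {hhi(even_five):.0f}")
```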

As an aside, I find two people on those leader/trailer lists most interesting. The first is Yu Darvish, who’s surrounded by junkballers; it’s pretty cool that he has such amazing stuff and still throws 4.5 pitches with some regularity. The second is Bartolo Colon, who has, according to this metric, less variety in his pitch selection over the last five years than the two knuckleballers in the sample. He’s somehow a junkballer but with only one pitch, which is a pretty #Mets thing to be.

Back to business: after computing HHIs, I split the sample into 99 relievers and 208 starters, defined as pitchers who had at least 80% of their innings come in the respective role. I enforced the starter/reliever split because a) relievers have substantially less pitch diversity (unweighted mean HHI of 4928 vs. 4154 for starters, highly significant) and b) they pitch substantially slower, possibly due to pitching more with men on base and in higher leverage situations (unweighted mean Pace of 23.75 vs. 21.24, a 12% difference that’s also highly significant).

So, how does this HHI match up with pitching pace for these two groups? Pretty poorly. The correlation for starters is -0.11, which is the direction we’d expect but a very small correlation (and one that’s not statistically significant at p = 0.1, to the limited extent that statistical significance matters here). For relievers, it’s actually 0.11, which runs against our expectation but is also statistically and practically no different from 0. Overall, there doesn’t seem to be any real link, but if you want to gaze at the entrails, I’ve put scatterplots at the bottom as well.
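For a rough sense of why correlations of this size don't clear the bar at these sample sizes, here is a sketch using the Fisher z-transform with a normal approximation. The post doesn't say which test was actually used, so treat this as one reasonable approximation and not a reproduction; the sample sizes are the 208 starters and 99 relievers mentioned above.

```python
from math import atanh, sqrt
from statistics import NormalDist


def correlation_p_value(r, n):
    """Approximate two-sided p-value for H0: rho = 0 via the Fisher z-transform."""
    if n <= 3 or not -1 < r < 1:
        raise ValueError("need n > 3 and -1 < r < 1")
    z = atanh(r) * sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


p_starters = correlation_p_value(-0.11, 208)
p_relievers = correlation_p_value(0.11, 99)
print(f"starters:  p = {p_starters:.2f}")
print(f"relievers: p = {p_relievers:.2f}")
```

Both approximate p-values come out above 0.1, with the smaller reliever sample further from significance than the starters.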

One important note: a couple weeks back, Chris Teeter at Beyond the Box Score took a crack at the same question, though using a slightly different method. Unsurprisingly, he found the same thing. If I’d seen the article before I’d had this mostly typed up, I might not have gone through with it, but as it stands, it’s always nice to find corroboration for a result.

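Relief Pitchers with Most Diverse Stuff, 2009–13

| Rank | Name | FB% | SL% | CT% | CB% | CH% | SF% | KN% | HHI |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Sean Marshall | 25.6 | 18.3 | 17.7 | 38.0 | 0.5 | 0.0 | 0.0 | 2748 |
| 2 | Brandon Lyon | 43.8 | 18.3 | 14.8 | 18.7 | 4.4 | 0.0 | 0.0 | 2841 |
| 3 | D.J. Carrasco | 32.5 | 11.2 | 39.6 | 14.8 | 2.0 | 0.0 | 0.0 | 2973 |
| 4 | Alfredo Aceves | 46.5 | 0.0 | 17.9 | 19.8 | 13.5 | 2.3 | 0.0 | 3062 |
| 5 | Logan Ondrusek | 41.5 | 2.0 | 30.7 | 20.0 | 0.0 | 5.8 | 0.0 | 3102 |

Relief Pitchers with Least Diverse Stuff, 2009–13

| Rank | Name | FB% | SL% | CT% | CB% | CH% | SF% | KN% | HHI |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Kenley Jansen | 91.4 | 7.8 | 0.0 | 0.2 | 0.6 | 0.0 | 0.0 | 8415 |
| 2 | Mariano Rivera | 11.1 | 0.0 | 88.9 | 0.0 | 0.0 | 0.0 | 0.0 | 8026 |
| 3 | Ronald Belisario | 85.4 | 12.7 | 0.0 | 0.0 | 0.0 | 1.9 | 0.0 | 7458 |
| 4 | Matt Thornton | 84.1 | 12.5 | 3.3 | 0.0 | 0.1 | 0.0 | 0.0 | 7240 |
| 5 | Ernesto Frieri | 82.9 | 5.6 | 0.0 | 10.4 | 1.1 | 0.0 | 0.0 | 7013 |

Starting Pitchers with Most Diverse Stuff, 2009–13

| Rank | Name | FB% | SL% | CT% | CB% | CH% | SF% | KN% | HHI |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Shaun Marcum | 36.6 | 9.3 | 17.6 | 12.4 | 24.1 | 0.0 | 0.0 | 2470 |
| 2 | Freddy Garcia | 35.4 | 26.6 | 0.0 | 7.9 | 13.0 | 17.1 | 0.0 | 2485 |
| 3 | Bronson Arroyo | 42.6 | 20.6 | 5.1 | 14.2 | 17.6 | 0.0 | 0.0 | 2777 |
| 4 | Yu Darvish | 42.6 | 23.3 | 16.5 | 11.2 | 1.2 | 5.1 | 0.0 | 2783 |
| 5 | Mike Leake | 43.5 | 11.8 | 23.4 | 9.9 | 11.6 | 0.0 | 0.0 | 2812 |

Starting Pitchers with Least Diverse Stuff, 2009–13

| Rank | Name | FB% | SL% | CT% | CB% | CH% | SF% | KN% | HHI |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Bartolo Colon | 86.2 | 9.1 | 0.2 | 0.0 | 4.6 | 0.0 | 0.0 | 7534 |
| 2 | Tim Wakefield | 10.5 | 0.0 | 0.0 | 3.7 | 0.0 | 0.0 | 85.8 | 7486 |
| 3 | R.A. Dickey | 16.8 | 0.0 | 0.0 | 0.2 | 1.5 | 0.0 | 81.5 | 6927 |
| 4 | Justin Masterson | 78.4 | 20.3 | 0.0 | 0.0 | 1.3 | 0.0 | 0.0 | 6560 |
| 5 | Aaron Cook | 79.7 | 9.7 | 2.8 | 7.6 | 0.4 | 0.0 | 0.0 | 6512 |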

Boring methodological footnote: There’s one primary conceptual problem with using HHI, and that’s that in certain situations it gives a counterintuitive result for this application. For instance, under our line of reasoning we would think that, ceteris paribus, a pitcher who throws a fastball 90% of the time and a change 10% of the time would have a noticeably easier decision to make than one who throws a fastball 90% of the time and a change and slider 5% each. However, the HHI barely separates the two (8,200 for the former and 8,150 for the latter), because it is driven almost entirely by the share of the most common pitch, which makes sense in the context of market concentration, but not in this scenario. (The same issue holds for the Gini coefficient, for that matter.) There’s a very high correlation between HHI and the frequency of a pitcher’s most common pitch, though, and using the latter doesn’t change any of the conclusions of the post.

# Is There a Hit-by-Pitch Hangover?

One of the things I’ve been curious about recently and have on my list of research questions is what the ramifications of a hit-by-pitch are in terms of injury risk—basically, how much of the value of an HBP does the batter give back through the increased injury risk? Today, though, I’m going to look at something vaguely similar but much simpler: Is an HBP associated with an immediate decrease in player productivity?

To assess this, I looked at how players performed in the plate appearance immediately following their HBP in the same game. (This obviously ignores players who are injured by their HBP and leave the game, but I’m looking for something subtler here.) To evaluate performance, I used wOBA, a rate stat that encapsulates a batter’s overall offensive contributions. There are, however, two obvious effects (and probably other more subtle ones) that mean we can’t only look at the post-HBP wOBA and compare it to league average.

The first is that, ceteris paribus, we expect that a pitcher will do worse the more times he sees a given batter (the so-called “trips through the order penalty”). Since in this context we will never include a batter’s first PA of a game because it couldn’t be preceded by an HBP, we need to adjust for this. The second adjustment is simple selection bias—not every batter has the same likelihood of being hit by a pitch, and if the average batter getting hit by a pitch is better or worse than the overall average batter, we will get a biased estimate of the effect of the HBP. If you don’t care about how I adjusted for this, skip to the next bold text.

I attempted to take those factors into account by computing the expected wOBA as follows. Using Retrosheet play-by-play data for 2004–2012 (the last year I had on hand), for each player with at least 350 PA in a season, I computed their wOBA over all PA that were not that player’s first PA in a given game. (I put the 350 PA condition in to make sure my average wasn’t swayed by low PA players with extreme wOBA values.) I then computed the average wOBA of those players weighted by the number of HBP they had and compared it to the actual post-HBP wOBA put up by this sample of players.
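The weighting step can be sketched as follows. The player-seasons are invented for illustration, and the function name is a placeholder; each wOBA is assumed to already exclude the first PA of each game, as described above.

```python
def hbp_weighted_woba(players):
    """Average non-first-PA wOBA across players, weighted by each player's HBP count.

    players: list of (woba_excluding_first_pa, hbp_count) pairs.
    """
    total_hbp = sum(hbp for _, hbp in players)
    if total_hbp == 0:
        raise ValueError("no HBP in sample")
    return sum(woba * hbp for woba, hbp in players) / total_hbp


# Hypothetical player-seasons: (wOBA after the first PA of a game, HBP count).
sample = [(0.360, 20), (0.330, 5), (0.310, 2)]
expected = hbp_weighted_woba(sample)
print(f"expected post-HBP wOBA: {expected:.4f}")
```

Batters who get hit more often count for more in the baseline, which is what corrects for the selection bias described earlier.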

To get a sense of how likely or unlikely any discrepancy would be, I also ran a simulation where I chose random HBPs and then pulled a random plate appearance from the hit batter until I had the same number of post-HBP PA as actually occurred in my nine year sample, then computed the post-HBP wOBA in that simulated world. I ran 1000 simulations and so have some sense of how unlikely the observed post-HBP performance is under the null hypothesis that there’s no difference between post-HBP performance and other performance.
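A stripped-down version of that kind of simulation might look like the sketch below. Everything in it is hypothetical: the two players, the per-PA wOBA values (0 for an out, larger values for better outcomes), the observed figure, and the number of PA per simulated sample. It also assumes that hit batters are drawn in proportion to their HBP counts, which is one reading of "chose random HBPs".

```python
import random


def simulate_post_hbp_woba(players, n_pa, rng):
    """One simulated sample: draw hit batters in proportion to their HBP
    counts, then draw a random PA (by its wOBA value) from each."""
    batters = rng.choices(players, weights=[p["hbp"] for p in players], k=n_pa)
    return sum(rng.choice(b["pa_values"]) for b in batters) / n_pa


def percentile_of(observed, simulated):
    """Percentage of simulated values at or below the observed value."""
    return 100 * sum(s <= observed for s in simulated) / len(simulated)


# Hypothetical players: HBP count plus the wOBA value of each of their PA.
players = [
    {"hbp": 20,
     "pa_values": [0.0] * 62 + [0.7] * 10 + [0.9] * 20 + [1.3] * 5 + [2.0] * 3},
    {"hbp": 5,
     "pa_values": [0.0] * 68 + [0.7] * 8 + [0.9] * 18 + [1.3] * 4 + [2.0] * 2},
]

rng = random.Random(2013)  # fixed seed so the sketch is repeatable
simulated = [simulate_post_hbp_woba(players, n_pa=300, rng=rng)
             for _ in range(1000)]
observed = 0.340  # hypothetical observed post-HBP wOBA
pct = percentile_of(observed, simulated)
print(f"observed value sits at percentile {pct:.0f} of the simulated outcomes")
```

The percentile plays the role of a one-sided p-value: a value near the middle of the simulated distribution is what we'd expect if an HBP changes nothing.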

To be honest, though, those adjustments don’t make me super confident that I’ve covered all the necessary bases to find a clean effect: the numbers are still a bit wonky, and this is not such a simple thing to examine that I’m confident I’ve gotten all the noise out. For instance, it doesn’t filter out park or pitcher effects (i.e. selection bias due to facing a worse pitcher, or a pitcher having an off day), both of which play a meaningful role in these performances and probably lead to additional selection biases I don’t control for.

With all those caveats out of the way, what do we see? In the data, we have an expected post-HBP wOBA of .3464 and an actual post-HBP wOBA of .3423, for an observed difference of about 4 points of wOBA, which is a small but non-negligible difference. However, it’s in the 24th percentile of outcomes according to the simulation, which indicates there’s a hefty chance that it’s randomness. (Though league average wOBA changed noticeably over the time period I examined, I did some sensitivities and am fairly confident those changes aren’t covering up a real result.)

The main thing (beyond the aforementioned haziness in this analysis) that makes me believe there might be an effect is that the post-walk effect is actually a 2.7 point (i.e. 0.0027) increase in wOBA. If we think that boost is due to pitcher wildness then we would expect the same thing to pop up for the post-HBP plate appearances, and the absence of such an increase suggests that there is a hangover effect. However, to conclude from that that there is a post-HBP swoon seems to be an unreasonably baroque chain of logic given the rest of the evidence, so I’m content to let it go for now.

The main takeaway from all of this is that there’s an observed decrease in expected performance after an HBP, but it’s not particularly large and doesn’t seem likely to have any predictive value. I’m open to the idea that a more sophisticated simulator that includes pitcher and park effects could help detect this effect, but I expect that even if the post-HBP hangover is a real thing, it doesn’t have a major impact.

# Do High Sock Players Get “Hosed” by the Umpires?

I was reading one of Baseball Prospectus’s collections this morning and came across an interesting story. It’s a part of baseball lore that Willie Mays started his career on a brutal cold streak (though one punctuated by a long home run off Warren Spahn). Apparently, manager Leo Durocher told Mays toward the end of the slump that he needed to pull his pants up because the pant knees were below Mays’s actual knees, which was costing him strikes. Mays got two hits the day after the change and never looked back.

To me, this is a pretty great story and (to the extent it’s true) a nice example of the attention to detail that experienced athletes and managers are capable of. However, it prompted another question: do uniform details actually affect the way that umpires call the game?

Assessing where a player belts his pants is hard, however, so at this point I’ll have to leave that question on the shelf. What is slightly easier is looking at which hitters wear their socks high and which cover their socks with their baseball pants. The idea is that by clearly delineating the strike zone, the batter will get fairer calls on balls near the bottom of the strike zone than he might otherwise. This isn’t a novel idea (besides the similarity to what Durocher said, it’s also been suggested here, here, and in the comments here), but I wasn’t able to find any studies looking at this. (Two minor league teams in the 1950s did try this with their whole uniforms instead of just the socks, however. The experiments appear to have been short-lived.)

There are basically two ways of looking at the hypothesis: the first is that it will be a straightforward benefit/detriment to the player to hike his socks because the umpire will change his definition of the bottom of the zone; this is what most of the links I cited above would suggest, though they didn’t agree on which direction. I’m somewhat skeptical of this, unless we think that the umpires have a persistent bias for or against certain players and that that bias would be resolved by the player changing how he wears his socks. The second interpretation is that it will make the umpire’s calls more precise, meaning simply that borderline pitches are called more consistently, but that it won’t actually affect where the umpire thinks the bottom of the zone is.

At first blush, this seems like the sort of thing that Pitch F/X would be perfectly suited to, as it gives oodles of information about nearly every pitch thrown in the majors in the last several years. However, it doesn’t include a variable for the hosiery of the batter, so to do a broader study we need additional data. After doing some research and asking around, I wasn’t able to find a good database of players that consistently wear high socks, much less a game-by-game list, which basically ruled out a large-scale Pitch F/X study.

However, I got a very useful suggestion from Paul Lukas, who runs the excellent Uni Watch site. He pointed out that a number of organizations require their minor leaguers to wear high socks and only give the option of covered hose to the major leaguers, providing a natural means of comparison between the two types of players. This will allow us to very broadly test the hypothesis that there is a single direction change in how low strikes are called.

I say very broadly because minor league Pitch F/X data aren’t publicly available, so we’re left with extremely aggregate data. I used data from Minor League Central, which has called strikes and balls for each batter. In theory, if the socks lead to more or fewer calls for the batter at the bottom of the zone, that will show up in the aggregate data and the four high-socked teams (Omaha, Durham, Indianapolis, and Scranton/Wilkes-Barre) will have a different percentage of pitches taken go for strikes. (I found those teams by looking at a sample of clips from the 2013 season; their AA affiliates also require high socks.) Now, there are a lot of things that could be confounding factors in this analysis:

1. Players on other teams are allowed to wear their socks high, so this isn’t a straight high socks/no high socks comparison, but rather an all high socks/some high socks comparison. (There’s also a very limited amount of non-compliance on the all socks side, as based on the clips I could find it appears that major leaguers on rehab aren’t bound by the same rules; look at some Derek Jeter highlights with Scranton if you’re curious.)
2. AAA umpires are prone to more or different errors than major league umpires.
3. Which pitches are taken is a function of the team makeup and these teams might take more or fewer balls for reasons unrelated to their hose.
4. This only affects borderline low pitches, and so it will only make up a small fraction of the overall numbers we observe and the impact will be smothered.

I’m inclined to downplay the first and last issues, because if those are enough to suppress the entire difference over the course of a whole season then the practical significance of the change is pretty small. (Furthermore, for #1, from my research it didn’t look like there were many teams with a substantial number of optional socks-showers. Please take that with a grain of salt.)

I don’t really have anything to say about the second point, because it has to do with extrapolation, and for now I’d be fine just looking at AAA. I don’t even have that level of brushoff response for the third point, except to wave my hands and say that I hope it doesn’t matter: these figures reflect pitches thrown by the rest of the league, so they will hopefully converge around league average.

So, having substantially caveated my results…what are they? As it turns out, the percentage of pitches the stylish high sock teams took that went for strikes was 30.83% and the equivalent figure for the sartorially challenged was…30.83%. With more than 300,000 pitches thrown in AAA last year, you need to go to the seventh decimal place of the fraction to see a difference. (If this near equality seems off to you, it does to me as well. I checked my figures a couple of ways, but I (obviously) can’t rule out an error here.)

What this says to me is that it’s pretty unlikely that this ends up mattering, unless there is an effect and it’s exactly cancelled out by the confounding factors listed above (or others I failed to consider). That can’t be ruled out as a possibility, nor can data quality issues, but I’m comfortable saying that the likeliest possibility by a decent margin is that socks don’t lead to more or fewer strikes being called against the batter. (Regardless, I’m open to suggestions for why the effect might be suppressed or analysis based on more granular data I either don’t have access to or couldn’t find.)

What about the accuracy question, i.e. is the bottom of the strike zone called more consistently or correctly for higher-socked players? Due to the lack of nicely collected data, I couldn’t take a broad approach to answering this, but I do want to record an attempt I made regardless. David Wright is known for wearing high socks in day games but covering his hosiery at night, which gives us a natural experiment we can look at for results.

I spent some amount of time looking at the 2013 Pitch F/X data for his day/night splits on taken low pitches and comparing those to the same splits for the Mets as a whole, trying a few different logistic regression models as well as just looking at the contingency tables to see if anything jumped out, and nothing really did in terms of either greater accuracy or precision. I didn’t find any cuts of the data that yielded a sufficiently clean comparison or sample size that I was confident in the results. Since this is a messy use of these data in the first place (it relies on unreliable estimates of the lower edge of a given batter’s strike zone, for instance), I’m going to characterize the analysis as incomplete for now. Given a more rigorous list of which players wear high socks and when, though, I’d love to redo this with more data.

Overall, though, there isn’t any clear evidence that the socks do influence the strike zone. I will say, though, that this seems like something that a curious team could test by randomly having players (presumably on their minor league teams) wear the socks high and doing this analysis with cleaner data. It might be so silly as to not be worth a shot, but if this is something that can affect the strike zone at all then it could be worthwhile to implement in the long run—if it can partially negate pitch framing, for instance, then that could be quite a big deal.

White Sox backup catcher Adrian Nieto has done some unusual things in the last few days. To start with, he made the team. That doesn’t sound like much, but as a Rule 5 draft pick, it’s a bit more meaningful than it might be otherwise, and it’s somewhat unusual because he was jumping from A ball to the majors as a catcher. (Sox GM Rick Hahn said he didn’t know of anyone who’d done it in the last 5+ years.)

Secondly, he pinch ran today against the Twins, which is an activity not usually associated with catchers (even young ones). This probably says more about the Sox bench, as he pinch ran for Paul Konerko, who is the worst baserunner by BsR among big league regulars this decade by a hefty margin. Still: a catcher pinch running! How often does this happen?

More frequently than I thought, as it turns out; there were 1530 instances of a catcher pinch running from 1974 to 2013, or roughly 38 times a year. This is about 4% of all pinch running appearances over that time, so it’s not super common, but it’s not unheard of either. (My source for this is the Lahman database, which is why I have the date cutoff. For transparency’s sake, I called a player a catcher if he played catcher in at least half of his appearances in a given year.)
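For anyone who wants to replicate the catcher rule in the parenthetical above, here is a rough sketch. It assumes rows shaped like the Lahman Appearances table, with `G_all`, `G_c`, and `G_pr` columns; the helper names and the sample rows are made up for illustration, and this is not the code behind the counts quoted above.

```python
def is_catcher(row):
    """Rule from the text: caught in at least half of that year's appearances."""
    games = int(row["G_all"])
    return games > 0 and int(row["G_c"]) / games >= 0.5


def catcher_pinch_run_games(rows):
    """Total pinch-running games (G_pr) logged by player-seasons classed as catchers."""
    return sum(int(row["G_pr"]) for row in rows if is_catcher(row))


# Made-up player-seasons, purely to exercise the rule.
sample_rows = [
    {"playerID": "catcher01", "G_all": "6", "G_c": "3", "G_pr": "1"},    # exactly half: counts
    {"playerID": "infield01", "G_all": "150", "G_c": "2", "G_pr": "4"},  # not a catcher
    {"playerID": "catcher02", "G_all": "90", "G_c": "85", "G_pr": "2"},  # clearly a catcher
]

print(catcher_pinch_run_games(sample_rows))
```

Because the rule is applied per player-season, a player can count as a catcher one year and not the next.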

If you connect the dots, though, you’ll realize that Nieto is a catcher who made his major league debut as a pinch runner. How often does that happen? As it turns out, just five times previously since 1974 (cross-referencing Retrosheet with Lahman):

• John Wathan, Royals; May 26, 1976. Wathan entered for pinch hitter Tony Solaita, who had pinch hit for starter Bob Stinson. He came around to score on two hits (though he failed to make it home from third after a flyball to right), but he also grounded into a double play with the bases loaded in the 9th. The Royals lost in extra innings, but he lasted 10 years with them, racking up 5 rWAR.
• Juan Espino, Yankees; June 25, 1982. Espino pinch ran for starter Butch Wynegar with the Yankees up 11-3 in the 7th and was forced at second immediately. He racked up -0.4 rWAR in 49 games spread across four seasons, all with the Yanks.
• Doug Davis, Angels; July 8, 1988. This one’s sort of cheating, as Davis entered for third baseman Jack Howell after a hit by pitch and stayed in the game at the hot corner; he scored that time around, then made two outs further up. According to the criteria I threw out earlier, though, he counts, as three of the six games he played in that year were at catcher (four of seven lifetime).
• Gregg Zaun, Orioles; June 24, 1995. Zaun entered for starter Chris Hoiles with the O’s down 3-2 in the 7th. He moved to second on a groundout, then third on a groundout, then scored the tying run on a Brady Anderson home run. Zaun had a successful career as a journeyman, playing for 9 teams in 16 years and averaging less than 1 rWAR per year.
• Andy Stewart, Royals; September 6, 1997. Ran for starter Mike Macfarlane in the 8th and was immediately wiped out on a double play. Stewart only played 5 games in the bigs lifetime.

So, just by scoring a run, Nieto didn’t necessarily have a more successful debut than this cohort. However, as a Sox fan I’m hoping (perhaps unreasonably) that he has a bit better career than Davis, Stewart, and Espino–and hey, if he’s a good backup for 10 or more years, that’s just gravy.

One of my favorite things about baseball is the number of quirky things like this that happen, and while this one wasn’t unique, it was pretty close. When you have low expectations for a team (like this year’s White Sox), you just hope the history they make isn’t too embarrassing.

# The Joy of the Internet, Pt. 2

I wrote one of these posts a while back about trying to figure out which game Bunk and McNulty attend in a Season 3 episode of The Wire. This time, I’m curious about a different game, and we have a bit less information to go on, so it took a bit more digging to find.

The intro to the Drake song “Connect” features the call of a home run being hit. Given that it probably required getting the express written consent of MLB for this sample, my guess is that he got it recorded by an announcer in the studio (as he implies around the 10:30 mark of this video). Still, does it match any games we have on record?

To start, I’m going to assume that this is a major league game, though there’s of course no way of knowing for sure. From the song, all we get is the count, the fact that it was a home run, the direction of the home run, and the name of the outfielder.  The first three are easy to hear, but the fourth is a bit tricky—a few lyrics sites (including the description of the video I linked) list it as “Molina,” but that can’t be the case, as none of the Molinas who’ve played in the bigs played the outfield.

RapGenius, however, lists it as “Revere,” and I’m going to go with that, since Ben Revere is an active major league center fielder and it seems likely that Drake would have sampled a recent game. So, can we find a game that matches all these parameters?

I first checked for only games Revere has played against the Blue Jays, since Drake is from Toronto and the RapGenius notes say (without a source) that the call is from a Jays game. A quick check of Revere’s game logs against the Jays, though, says that he’s never been on the field for a 3-1 homer by a Jay.

What about against any other team? Since checking this by hand wasn’t going to fly (har har), I turned to play-by-play data, available from the always-amazing Retrosheet. With the help of some code from possibly the nerdiest book I own, I was able to filter every play since Revere has joined the league to find only home runs hit to center when Revere was in center and the count was 3-1.
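A rough sketch of that kind of filter, written in Python rather than whatever the book's code uses. It assumes the play-by-play files have already been parsed into one dict per play, with field names in the style of Chadwick's `cwevent` output (`EVENT_CD`, `BALLS_CT`, `STRIKES_CT`, `POS8_FLD_ID`, `BATTEDBALL_LOC_TX`). Those names, the home run event code, the "starts with 8 means center field" check, and the player IDs in the sample rows are all assumptions to verify against your own data.

```python
HOME_RUN_EVENT_CD = "23"  # assumed event code for a home run in the parsed files


def hit_to_center(row):
    # Hypothetical direction check: a location code starting with "8" is treated as center field.
    return row.get("BATTEDBALL_LOC_TX", "").startswith("8")


def matching_home_runs(rows, center_fielder_id, balls="3", strikes="1"):
    """Yield home runs to center on the given count with the given player in center field."""
    for row in rows:
        if (
            row["EVENT_CD"] == HOME_RUN_EVENT_CD
            and row["BALLS_CT"] == balls
            and row["STRIKES_CT"] == strikes
            and row["POS8_FLD_ID"] == center_fielder_id
            and hit_to_center(row)
        ):
            yield row


# Made-up plays with placeholder fielder IDs, just to show the filter at work.
sample_plays = [
    {"EVENT_CD": "23", "BALLS_CT": "3", "STRIKES_CT": "1", "POS8_FLD_ID": "cf000001", "BATTEDBALL_LOC_TX": "8"},
    {"EVENT_CD": "23", "BALLS_CT": "0", "STRIKES_CT": "0", "POS8_FLD_ID": "cf000001", "BATTEDBALL_LOC_TX": "8"},
    {"EVENT_CD": "20", "BALLS_CT": "3", "STRIKES_CT": "1", "POS8_FLD_ID": "cf000001", "BATTEDBALL_LOC_TX": "8"},
    {"EVENT_CD": "23", "BALLS_CT": "3", "STRIKES_CT": "1", "POS8_FLD_ID": "cf000002", "BATTEDBALL_LOC_TX": "8"},
]

matches = list(matching_home_runs(sample_plays, "cf000001"))
print(len(matches))
```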

Somewhat magically, there was only one: a first inning shot by Carlos Gomez against the Twins in 2011. The video is here, for reference. I managed to find the Twins’ TV call via MLB.TV, and the Brewers’ team did the MLB.com video, and (unsurprisingly) neither call fits the sample, though I didn’t go looking for the radio call. Still, the home run is such that it wouldn’t be surprising if either one of the radio calls matched what Drake used, or if it was close and Drake had it rerecorded in such a way that preserved the details of the play.

So, probably through dumb luck, Drake managed to pick a unique play to sample for his track. But even though it’s a baseball sample, I still click back to “Hold On, We’re Going Home” damn near every time I listen to the album.

# Throne of Games (Most Played, Specifically)

I was trawling for some stats on hockey-reference (whence most of the hockey facts in this post) the other day and ran into something unexpected: Bill Guerin’s 2000-01 season. Specifically, Guerin led the league with 85 games played. Which wouldn’t have seemed so odd, except for the fact that the season is 82 games long.

How to explain this? It turns out there are two unusual things happening here. Perhaps obviously, Guerin was traded midseason, and the receiving team had games in hand on the trading team. Thus, Guerin finished with three games more than the “max” possible.

Now, is this the most anyone’s racked up? Like all good questions, the answer to that is “it depends.” Two players—Bob Kudelski in 93-94 and Jimmy Carson in 92-93—played 86 games, but those were during the short span of the 1990s when each team played 84 games in a season, so while they played more games than Guerin, Guerin played in more games relative to his team. (A couple of other players have played 84 since the switch to 82 games, among them everyone’s favorite Vogue intern, Sean Avery.)

What about going back farther? The season was 80 games from 1974–75 to 1991–92, and one player in that time managed to rack up 83: the unknown-to-me Brad Marsh, in 1981-82, who tops Guerin at least on a percentage level. Going back to the 76- and 78-game era from 1968-74, we find someone else who tops Guerin and Marsh, specifically Ross Lonsberry, who racked up 82 games (4 over the team maximum) with the Kings and Flyers in 1971–72. (Note that Lonsberry and Marsh don’t have game logs listed at hockey-reference, so I can’t verify if there was any particularly funny business going on.) I couldn’t find anybody who did that during the 70 game seasons of the Original Six era, and given how silly this investigation is to begin with, I’m content to leave it at that.

What if we go to other sports? This would be tricky in football, and I expect it would require being traded on a bye week. Indeed, nobody has played more than the max games at least since the league went to a 14 game schedule according to the results at pro-football-reference.

In baseball, it certainly seems possible to get over the max, but actually clearing this out of the data is tricky for the following two reasons:

• Tiebreaker games are counted as regular season games. Maury Wills holds the raw record for most games played with 165 after playing in a three game playoff for the Dodgers in 1962.
• Ties that were replayed. I started running into this a lot in some of the older data: games would be called after a certain number of innings with the score tied due to darkness or rain or some unexplained reason, and the stats would be counted, but the game wouldn’t count in the standings. Baseball is weird like that, and no matter how frustrating this can be as a researcher, it was one of the things that attracted me to the sport in the first place.

So, those are my excuses if you find any errors in what I’m about to present; I used FanGraphs and baseball-reference to spot candidates. I believe there have only been a few cases of baseball players playing more than the scheduled number of games when none of the games fell into the two problem categories mentioned above. The most recent is Todd Zeile, who, while he didn’t play in a tied game, nevertheless benefited from one. In 1996, he was traded from the Phillies to the Orioles after the O’s had stumbled into a tie, thus giving him 163 games played, though they all counted.

Possibly more impressive is Willie Montanez, who played with the Giants and Braves in 1976. He racked up 163 games with no ties, and, unlike Zeile, he missed several opportunities to take it even farther: one game before being traded, one game during the trade, and two games after it. (He was only able to make it to 163 because the Braves had several games in hand on the Giants at the time of the trade.)

The only other player to achieve this feat in the 162 game era is Frank Taveras, who in 1979 played in 164 games; however, one of those was a tie, meaning that according to my twisted system he only gets credit for 163. He, like Montanez, missed an opportunity, as he had one game off after getting traded.

Those are the only three in the 162-game era. While I don’t want to bother looking in-depth at every year of the 154-game era due to the volume of cases to filter, one particular player stands out. Ralph Kiner managed to put up 158 games with only one tie in 1953, making him by my count the only baseball player to play three meaningful games more than his team did in baseball since 1901.

Now, I’ve sort of buried the lede here, because it turns out that the NBA has the real winners in this category. This isn’t surprising, as the greater number of days off between games means it’s easier for teams to get out of whack and it’s more likely that one player will play in every game. Thus, a whole host of players have played more than 82 games, led by Walt Bellamy, who put up 88 in 1968-69. While one player got to 87 since, and a few more to 86 and 85, Bellamy stands alone atop the leaderboard in this particular category. (That fact made it into at least one of his obituaries.)

Since Bellamy is the only person I’ve run across to get 6 extra games in a season and nobody from any of the other sports managed even 5, I’m inclined to say that he’s the modern, cross-sport holder of this nearly meaningless record for most games played adjusted for season length.
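To make "adjusted for season length" concrete, here is a small sketch that ranks the seasons mentioned above by games played as a share of the schedule. The figures are the ones quoted in this post (Kiner's tie is already subtracted); the helper names are only for illustration.

```python
# (player, season, games that counted, scheduled team games), all taken from the text above
seasons = [
    ("Bill Guerin", "2000-01", 85, 82),
    ("Bob Kudelski", "1993-94", 86, 84),
    ("Jimmy Carson", "1992-93", 86, 84),
    ("Brad Marsh", "1981-82", 83, 80),
    ("Ross Lonsberry", "1971-72", 82, 78),
    ("Ralph Kiner", "1953", 157, 154),  # 158 games played, minus one tie
    ("Walt Bellamy", "1968-69", 88, 82),
]


def extra_games(played, scheduled):
    """Games played beyond the team's schedule."""
    return played - scheduled


def share_of_schedule(played, scheduled):
    """Games played as a fraction of the team's schedule."""
    return played / scheduled


ranked = sorted(seasons, key=lambda s: share_of_schedule(s[2], s[3]), reverse=True)
for player, season, played, scheduled in ranked:
    print(f"{player} ({season}): +{extra_games(played, scheduled)} games, "
          f"{share_of_schedule(played, scheduled):.1%} of the schedule")
```

Sorting by share rather than by raw extra games is what puts Marsh ahead of Guerin even though both were three games over.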

Ending on a tangent: one of the things I like about sports records in general, and the sillier ones in particular, is trying to figure out when they are likely to fall. For instance, Cy Young won 511 games playing a sport so different from contemporary baseball that, barring a massive structural change, nobody can come within 100 games of that record. On the other hand, with strikeouts and tolerance for strikeouts at an all-time high, several hitter-side strikeout records are in serious danger (and have been broken repeatedly over the last 15 years).

This one seems a little harder to predict, because there are factors pointed in different directions. On the one hand, players are theoretically in better shape than ever, meaning that they are more likely to be able to make it through the season, and being able to play every game is a basic prerequisite for playing more than every game. On the other, the sports are a lot more organized, which would intuitively seem to decrease the ease of moving to a team with meaningful games in hand on one’s prior employer. Anecdotally, I would also guess that teams are less likely to let players play through a minor injury (hurting the chances). The real wild card is the frequency of in-season trades—I honestly have no rigorous idea of which direction that’s trending.

So, do I think someone can take Bellamy’s throne? I think it’s unlikely, due to the organizational factors laid out above, but I’ll still hold out hope that someone can do it, or at least that new players will join the bizarre fraternity of men playing more games than their teams.

# Uncertainty and Pitching Statistics

One of the things that I occasionally get frustrated by in sports statistics is the focus on estimates without presenting the associated uncertainty. While small sample size is often bandied about as an explanation for unusual results, one of the first things presented in statistics courses is the notion of a confidence interval. The simplest explanation of a confidence interval is that of a margin of error—you take the data and the degree of certainty you want, and it will give you a range covering likely values of the parameter you are interested in. It tacitly includes the sample size and gives you an implicit indication of how trustworthy the results are.

The most common version of this is the 95% confidence interval, which, based on some data, gives a range that will contain the actual value 95% of the time. For instance, say we poll a random sample of 100 people and ask them if they are right-handed. If 90 are right handed, the math gives us a 95% CI of (0.820, 0.948). We can draw additional samples and get more intervals; if we were to continue doing this, 95% of such intervals will contain the true percentage we are looking for. (How the math behind this works is a topic for another time, and indeed, I’m trying to wave away as much of it as possible in this post.)
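The post doesn't say which formula produced that interval. One method that gives the same numbers to three decimal places is the Wilson score interval with a continuity correction, so the sketch below uses it. Treat that choice as an assumption rather than the method actually used here.

```python
from math import sqrt
from statistics import NormalDist


def wilson_cc_interval(successes, n, confidence=0.95):
    """Wilson score interval with continuity correction for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    p = successes / n
    denom = 2 * (n + z ** 2)
    lower = (2 * n * p + z ** 2 - 1
             - z * sqrt(z ** 2 - 2 - 1 / n + 4 * p * (n * (1 - p) + 1))) / denom
    upper = (2 * n * p + z ** 2 + 1
             + z * sqrt(z ** 2 + 2 - 1 / n + 4 * p * (n * (1 - p) - 1))) / denom
    return max(0.0, lower), min(1.0, upper)


low, high = wilson_cc_interval(90, 100)
print(f"({low:.3f}, {high:.3f})")  # the right-handedness example: 90 of 100
```

Note that the interval is not centered on 0.90; it reaches further toward 0.5 than away from it.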

One big caveat I want to mention before I get into my application of this principle is that there are a lot of assumptions that go into producing these mathematical estimates that don’t hold strictly in baseball. For instance, we assume that our data are a random sample of a single, well-defined population. However, if we use pitcher data from a given year, we know that the batters they face won’t be random, nor will the circumstances they face them under. Furthermore, any extrapolation of this interval is a bit trickier, because confidence intervals are usually employed in estimating parameters that are comparatively stable. In baseball, by contrast, a player’s talent level will change from year to year, and since we usually estimate something using a single year’s worth of data, to interpret our factors we have to take into account not only new random results but also a change in the underlying parameters.

(Hopefully all of that made sense, but if it didn’t and you’re still reading, just try to treat the numbers below as the margin of error on the figures we’re looking at, and realize that some of our interpretations need to be a bit looser than is ideal.)

For this post, I wanted to look at how much margin of error is in FIP, which is one of the more common sabermetric stats to evaluate pitchers. It stands for Fielding Independent Pitching, and is based only on walks, strikeouts, and home runs—all events that don’t depend on the defense (hence the name). It’s also scaled so that the numbers are comparable to ERA. For more on FIP, see the Fangraphs page here.

One of the reasons I was prompted to start with FIP is that a common modification of the stat is to render it as xFIP (x for Expected). xFIP recognizes that FIP can be comparatively volatile because it depends highly on the number of home runs a pitcher gives up, which, as rare events, can bounce around a lot even in a medium size sample with no change in talent. (They also partially depend on park factors.) xFIP replaces the HR component of FIP with the expected number of HR they would have given up if they had allowed the same number of flyballs but had a league average home run to fly ball ratio.
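For reference, here is a minimal sketch of the two stats using the commonly published formula, FIP = (13·HR + 3·(BB + HBP) − 2·K) / IP + constant. The hit-by-pitch term defaults to zero to match the description above. The constant, the league HR/FB rate, and the pitcher's line (including the 200 IP) are placeholder values, not real league figures.

```python
FIP_CONSTANT = 3.10      # placeholder; the real constant is set per season to line FIP up with ERA
LEAGUE_HR_PER_FB = 0.10  # placeholder league-average home run to fly ball ratio


def fip(hr, bb, k, ip, hbp=0, constant=FIP_CONSTANT):
    """Fielding Independent Pitching from home runs, walks, strikeouts, and innings."""
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


def xfip(fb, bb, k, ip, hbp=0, league_hr_per_fb=LEAGUE_HR_PER_FB, constant=FIP_CONSTANT):
    """Same as FIP, but with home runs replaced by fly balls times the league HR/FB rate."""
    expected_hr = fb * league_hr_per_fb
    return fip(expected_hr, bb, k, ip, hbp, constant)


# Made-up line: 200 IP, 20 HR on 250 fly balls, 50 BB, 250 K.
actual = fip(hr=20, bb=50, k=250, ip=200)
expected = xfip(fb=250, bb=50, k=250, ip=200)
print(f"FIP {actual:.2f}, xFIP {expected:.2f}")
```

This made-up pitcher allowed fewer home runs than a league-average HR/FB rate would predict, so his xFIP comes out higher than his FIP.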

Since xFIP already embeds the idea that FIP is volatile, I was curious as to how volatile FIP actually is, and how much of that volatility is taken care of by xFIP. To do this, I decided to simulate a large number of seasons for a set of pitchers to get an estimate for what an actual distribution of a pitcher’s FIP given an estimated talent level is, then look at how wide a range of results we see in the simulated seasons to get a sense for how volatile FIP is—effectively rerunning seasons with pitchers whose talent level won’t change, but whose luck will.

To provide an example, say we have a pitcher who faces 800 batters, with a line of 20 HR, 250 fly balls (FB), 50 BB, and 250 K. We then assume that, if that pitcher were to face another 800 batters, each has a 250/800 chance of striking out, a 50/800 chance of walking, a 250/800 chance of hitting a fly ball, and a 20/250 chance of each fly ball being a HR. Plugging those into some random numbers, we will get a new line for a player with the same underlying talent—maybe it’ll be 256 K, 45 BB, and 246 FB, of which 24 were HR. From these values, we recompute the FIP. Do this 10,000 times, and we get an idea for how much FIP can bounce around.

For my sample of pitchers to test, I took every pitcher season with at least 50 IP since 2002, the first year for which the number of fly balls was available. I then computed 10,000 FIPs for each pitcher season and took the 97.5th percentile and 2.5th percentile, which give the spread that the middle 95% of the data fall in—in other words, our confidence interval.

(Nitty-gritty aside: One methodological detail that’s mostly important for replication purposes is that pitchers that gave up 0 HR in the relevant season were treated as having given up 0.5 HR; otherwise, there’s not actually any variation on that component. The 0.5 is somewhat arbitrary but, in my experience, is a standard small sample correction for things like odds ratios and chi-squared tests.)
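The resampling procedure described above can be sketched as follows. This is an illustration rather than the code behind the results: it holds innings pitched fixed at a made-up 200, leaves hit-by-pitches out of the FIP formula, uses a placeholder FIP constant, and runs fewer simulations than the 10,000 in the text to keep it quick.

```python
import random
from statistics import quantiles

FIP_CONSTANT = 3.10  # placeholder value


def fip(hr, bb, k, ip, constant=FIP_CONSTANT):
    """Standard FIP formula with hit-by-pitches left out (a simplification)."""
    return (13 * hr + 3 * bb - 2 * k) / ip + constant


def simulate_fips(bf, hr, fb, bb, k, ip, n_sims=10_000, seed=0):
    """Replay a season n_sims times with the same per-batter rates; return the simulated FIPs."""
    rng = random.Random(seed)
    hr = hr if hr > 0 else 0.5  # zero-HR correction from the aside above
    hr_per_fb = hr / fb
    outcomes = ("K", "BB", "FB", "other")
    weights = (k, bb, fb, bf - k - bb - fb)
    results = []
    for _ in range(n_sims):
        draws = rng.choices(outcomes, weights=weights, k=bf)  # one outcome per batter faced
        new_fb = draws.count("FB")
        new_hr = sum(rng.random() < hr_per_fb for _ in range(new_fb))  # each fly ball may leave the park
        results.append(fip(new_hr, draws.count("BB"), draws.count("K"), ip))
    return results


def interval_and_margin(fips):
    """Middle 95% of the simulated FIPs, plus half the interval's width."""
    cuts = quantiles(fips, n=40)  # cut points every 2.5%
    low, high = cuts[0], cuts[-1]
    return low, high, (high - low) / 2


# The example pitcher from the text: 800 batters, 20 HR, 250 FB, 50 BB, 250 K.
sims = simulate_fips(bf=800, hr=20, fb=250, bb=50, k=250, ip=200, n_sims=2000)
low, high, margin = interval_and_margin(sims)
print(f"95% interval: ({low:.2f}, {high:.2f}), margin of error: {margin:.2f}")
```

Holding IP fixed is the biggest simplification. A fuller version would let outs, and therefore innings, vary with the simulated outcomes.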

One thing to realize is that these confidence intervals needn’t be symmetric, and in fact they basically never are—the portion of the confidence interval above the pitcher’s actual FIP is almost always larger than the portion below. For instance, in 2011 Bartolo Colon had an actual FIP of 3.83, but his confidence interval is (3.09, 4.64), and the gap from 3.83 to 4.64 is larger than the gap from 3.09 to 3.83. The reasons for this aren’t terribly important without going into details of the binomial distribution, and anyhow, the asymmetry of the interval is rarely very large, so I’m going to use half the length of the interval as my metric for volatility (the margin of error, as it were); for Colon, that’s (4.64 – 3.09) / 2 = 0.775.

So, how big are these intervals? To me, at least, they are surprisingly large. I put some plots below, but even for the pitchers with the most IP, our margin of error is around 0.5 runs, which is pretty substantial (roughly half a standard deviation in FIP, for reference). For pitchers with only about 150 IP, it’s in the 0.8 range, which is about a standard deviation in FIP. A 0.8 gap in FIP is nothing to sneeze at—it’s the difference between 2013 Clayton Kershaw and 2013 Zack Greinke, or between 2013 Zack Greinke and 2013 Scott Feldman. (Side note: Clayton Kershaw is really damned good.)

As a side note, I was concerned when I first got these numbers that the intervals are too wide and overestimate the volatility. Because we can’t repeat seasons, I can’t think of a good way to test volatility, but I did look at how many times a pitcher’s FIP confidence interval contained his actual FIP from the next year. There are some selection issues with this measure (as a pitcher has to post 50 IP in consecutive years to be counted), but about 71% of follow-up season FIPs fall into the previous season’s CI. This may be a bit surprising, as our CI is supposed to include the actual value 95% of the time, but given the amount of volatility in baseball performance due to changes in skill levels, I would expect to see that the intervals diverge from actual values fairly frequently. Though this doesn’t confirm that my estimated intervals aren’t too wide, the magnitude of difference suggests to me it’s unlikely that that is our problem.

Given how sample sizes work, it’s unsurprising that the margin of error decreases substantially as IP increases. Unfortunately, there’s no neat function to get volatility from IP, as it depends strongly on the values of the FIP components as well. If we wanted to, we could construct a model of some sort, but a model whose inputs come from simulations seemed to me to be straying a bit far from the real world.

As I only want to see a rule of thumb, I picked a couple of round IP cutoffs and computed the average margin of error for every pitcher within 15 IP of that cutoff. The 15 IP is arbitrary, but it’s not a huge amount for a starting pitcher (2–3 starts) and ensures we can get a substantial number of pitchers included in each interval. The average FIP margin of error for pitchers within 15 IP of the cutoffs is presented below; beneath that are scatterplots comparing IP to margin of error.

Mean Margin of Error for Pitchers by Innings Pitched

| Approximate IP | FIP Margin of Error | Number of Pitchers |
| --- | --- | --- |
| 65 | 1.16 | 1747 |
| 100 | 0.99 | 428 |
| 150 | 0.81 | 300 |
| 200 | 0.66 | 532 |
| 250 | 0.54 | 37 |

Note that due to construction I didn’t include anyone with less than 50 IP, and the most innings pitched in my sample is 266, so these cutoffs span the range of the data. I also looked at the median values, and there is no substantive difference.
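The windowed averaging behind the table above can be sketched in a few lines. The function name, the inclusive treatment of the 15 IP boundary, and the sample pitcher seasons are all assumptions made for illustration.

```python
from statistics import mean


def mean_margin_near(pitcher_seasons, cutoff, window=15):
    """Average margin of error for seasons within `window` IP of `cutoff` (boundary included)."""
    nearby = [margin for ip, margin in pitcher_seasons if abs(ip - cutoff) <= window]
    if not nearby:
        return None, 0
    return mean(nearby), len(nearby)


# Made-up (innings pitched, margin of error) pairs.
sample_seasons = [(52, 1.30), (70, 1.10), (80, 1.05), (95, 1.00), (160, 0.80), (210, 0.65)]

for cutoff in (65, 100, 150, 200, 250):
    avg, count = mean_margin_near(sample_seasons, cutoff)
    label = "n/a" if avg is None else f"{avg:.2f}"
    print(f"{cutoff} IP: mean margin {label} across {count} seasons")
```

With cutoffs more than 30 IP apart, no season lands in two windows. Closer cutoffs would count some seasons twice.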

This post has been fairly exploratory in nature, but I wanted to answer one specific question: given that the purpose of xFIP is to stabilize FIP, how much of FIP’s volatility is removed by using xFIP as an ERA estimator instead?

This can be evaluated a few different ways. First, the mean xFIP margin of error in my sample is about 0.54, while the mean FIP margin of error is 0.97; that difference is highly significant. This means there is actually a difference between the two, but looking at the average absolute difference of 0.43 is pretty meaningless: the same absolute reduction can’t apply to everyone, since a pitcher with a small FIP margin of error can’t end up with a negative one. Thus, we instead look at the percentage difference, which gives us the figure that 43% of the volatility in FIP is removed when using xFIP instead. (The median number is 45%, for reference.)

Finally, here is the above table showing average margins of error by IP, but this time with xFIP as well; note that the differences are all in the 42-48% range.

Mean Margin of Error for Pitchers by Innings Pitched

| Approximate IP | FIP Margin of Error | xFIP Margin of Error | Number of Pitchers |
| --- | --- | --- | --- |
| 65 | 1.16 | 0.67 | 1747 |
| 100 | 0.99 | 0.53 | 428 |
| 150 | 0.81 | 0.43 | 300 |
| 200 | 0.66 | 0.36 | 532 |
| 250 | 0.54 | 0.31 | 37 |
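As a quick check on the percentage differences, the reduction in each row of the table above is (FIP margin − xFIP margin) / FIP margin. A short sketch using the table's own numbers:

```python
# (approximate IP, FIP margin of error, xFIP margin of error), copied from the table above
rows = [
    (65, 1.16, 0.67),
    (100, 0.99, 0.53),
    (150, 0.81, 0.43),
    (200, 0.66, 0.36),
    (250, 0.54, 0.31),
]


def percent_reduction(fip_margin, xfip_margin):
    """Share of the FIP margin of error that disappears when switching to xFIP."""
    return (fip_margin - xfip_margin) / fip_margin


reductions = [percent_reduction(f, x) for _, f, x in rows]
for (ip, _, _), reduction in zip(rows, reductions):
    print(f"{ip} IP: {reduction:.1%} smaller")
```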

Thus, we see that about 45% of the FIP volatility is stripped away by using xFIP. I’m sort of burying the lede here, but if you want a firm takeaway from this post, there it is.

I want to conclude this somewhat wonkish piece by clarifying a couple of things. First, these numbers largely apply to season-level data; career FIP stats will be much more stable, though the utility of using a rate stat over an entire career may be limited depending on the situation.

Second, this volatility is not something that is unique to FIP; it could be applied to basically any of the stats that we bandy about on a daily basis. I chose to look at FIP partially for its simplicity and partially because people have already looked into its instability (hence xFIP). In the future, I’d like to apply this to other stats as well: SIERA comes to mind as something directly comparable to FIP, and since Fangraphs’ WAR is computed using FIP, my estimates in this piece can be applied to those numbers as well.

Third, the diminished volatility of xFIP isn’t necessarily a reason to prefer that particular stat. If a pitcher has an established track record of consistently allowing more/fewer HR on fly balls than the average pitcher, that information is important and should be considered. One alternative is to use the pitcher’s career HR/FB in lieu of league average, which gives some of the benefits of a larger sample size while also considering the pitcher’s true talent, though that’s a bit more involved in terms of aggregating data.

Since I got to rambling and this post is long on caveats relative to substance, here’s the tl;dr:

• Even if you think FIP estimates a pitcher’s true talent level accurately, random variation means that there’s a lot of volatility in the statistic.
• If you want a rough estimate for how much volatility there is, see the tables above.
• Using xFIP instead of FIP shrinks the margin of error by about 45%.
• This is not an indictment of FIP as a stat, but rather a reminder that a lot of weird stuff can happen in a baseball season, especially for pitchers.

# Principals of Hitter Categorization

(Note: The apparent typo in the title is deliberate.)

In my experience with introductory statistics classes, both ones I’ve taken and ones I’ve heard about, they typically have two primary phases. The second involves hypothesis testing and regression, which entail trying to evaluate the statistical evidence regarding well-formulated questions. (Well, in an ideal world the questions are well-formulated. Not always the case, as I bitched about on Twitter recently.) This is the more challenging, mathematically sophisticated part of the course, and for those reasons it’s probably the one that people don’t remember quite so well.

What’s the first part? It tends to involve lots of summary statistics and plotting—means, scatterplots, interquartile ranges, all of that good stuff that one does to try to get a handle on what’s going on in the data. Ideally, some intuition regarding stats and data is getting taught here, but that (at least in my experience) is pretty hard to teach in a class. Because this part is more introductory and less complicated, I think this portion of statistics—which is called exploratory data analysis, though there are some aspects of the definition I’m glossing over—can get short shrift when people discuss cool stuff one can do with statistics (though data visualization is an important counterpoint here).

A slightly more complex technique one can do as part of exploratory data analysis is principal component analysis (PCA), which is a way of redefining a data set’s variables based on the correlations present therein. While a technical explanation can be found elsewhere, the basic gist is that PCA allows us to combine variables that are related within the data so that we can pack as much explanatory power as possible into them.
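To give a feel for what "combining related variables" means, here is a small sketch that finds the first principal component of three made-up, correlated variables. It works from the correlation matrix and uses power iteration so that it needs nothing outside the standard library; a real analysis would normally lean on a statistics package instead. The data, the variable names, and the helper names are all invented for illustration.

```python
import math
import random


def standardize(values):
    """Rescale a column to mean 0 and sample standard deviation 1."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    return [(v - mean) / sd for v in values]


def correlation_matrix(columns):
    """Pairwise correlations between the columns."""
    z = [standardize(col) for col in columns]
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]


def first_component(matrix, iterations=200):
    """Power iteration: largest eigenvalue of the matrix and its unit-length eigenvector."""
    v = [1.0 / (i + 1) for i in range(len(matrix))]  # arbitrary non-zero starting vector
    for _ in range(iterations):
        w = [sum(m * x for m, x in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(x * sum(m * y for m, y in zip(row, v)) for x, row in zip(v, matrix))
    return eigenvalue, v


# Three made-up variables driven by one hidden factor plus noise.
rng = random.Random(1)
hidden = [rng.gauss(0, 1) for _ in range(200)]
var_a = [45 - 5 * t + rng.gauss(0, 2) for t in hidden]
var_b = [35 + 5 * t + rng.gauss(0, 2) for t in hidden]
var_c = [10 + 3 * t + rng.gauss(0, 2) for t in hidden]

corr = correlation_matrix([var_a, var_b, var_c])
eigenvalue, loadings = first_component(corr)
print("share of variance explained:", round(eigenvalue / len(corr), 2))
print("loadings:", [round(x, 2) for x in loadings])
```

Because the three variables share a single hidden driver, one component captures most of the variation, and the loadings show which variables move together and which move in opposite directions. The overall sign of the loadings is arbitrary.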

One classic application of this is to athletes’ scores in the decathlon in the Olympics (see example here). There are 10 events, which can be clustered into groups of similar events like the 100 meters and 400 meters and the shot put and discus. If we want to describe the two most important factors contributing to an athlete’s success, we might subjectively guess something like “running ability” and “throwing skill.” PCA can use the data to give us numerical definitions of the two most important factors determining the variation in the data, and we can explore interpretations of those factors in terms of our intuition about the event.

So, what if we take this idea and apply it to baseball hitting data? This would allow us to derive some new factors that explain a lot of the variation in hitting, and by using those factors judiciously we can compare different batters. This idea is not terribly novel (here are examples of some previous work), but I haven’t seen anyone take the approach I have here. For this post, I’m focused more on what I will call hitting style, i.e. I’d like to set aside similarity based on more traditional results (e.g. home runs, which is the sort of similarity Baseball-Reference uses) in favor of lower order data, namely a batter’s batted ball profile (e.g. line drive percentage and home run to fly ball ratio). However, the next step is certainly to see how these components correlate with traditional measures of power, for instance Isolated Slugging (ISO).

So, I pulled career-level data from FanGraphs for all batters with at least 1000 PA since 2002 (when batted ball data began being collected) on the following categories: line drive rate (LD%), ground ball rate (GB%), outfield fly ball rate (FB%), infield fly ball rate (IFFB%), home run/fly ball ratio (HR/FB), walk rate (BB%), and strikeout rate (K%). (See report here.) (I considered using infield hit rate as well, but it doesn’t fit in with the rest of these things; it’s more about speed and less about hitting, after all.)

I then ran the PCA on these data in R, and here are the first two components, i.e. the two weightings that together explain as much of the data as possible. (Things get a bit harder to interpret when you add a third dimension.) All data are normalized, so that coefficients are comparable, and it’s most helpful to focus on the signs and relative magnitudes—if one variable is weighted 0.6 and the other -0.3, the takeaway is that the first is twice as important for the component as the second and pushes that component in the opposite direction.

Weights for First Two Principal Components

| Stat | PC1 | PC2 |
|---|---|---|
| LD% | -0.030 | 0.676 |
| GB% | -0.459 | 0.084 |
| FB% | 0.526 | 0.093 |
| IFFB% | -0.067 | -0.671 |
| HR/FB | 0.459 | -0.137 |
| BB% | 0.375 | 0.205 |
| K% | 0.394 | -0.126 |

The first two components explain 39% and 22%, respectively, of the overall variation in our data. (The next two explain 16% and 10%, respectively, so they are still important.) This means, basically, that we can explain about 60% of a given batter’s batted ball profile with only these two parameters. (I have all seven components with their importance in a table at the bottom of the post. It’s also worth noting that, as the later components explain less variation, their variance decreases and players are clustered close together on that dimension.)
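Here's a minimal sketch of the mechanics in Python with NumPy, for anyone who wants to see what "normalize, then run PCA" amounts to. It is not the R code behind this post: the data are random stand-ins (not the FanGraphs numbers), and the variable names are made up for illustration. It standardizes each column, takes a singular value decomposition, and reads off the weights and the proportion of variance each component explains.

```python
import numpy as np

# Hypothetical stand-in data: 200 fake "batters" with 7 made-up rate stats.
# These are random numbers, NOT the FanGraphs data used in the post.
rng = np.random.default_rng(0)
columns = ["LD%", "GB%", "FB%", "IFFB%", "HR/FB", "BB%", "K%"]
raw = rng.normal(size=(200, len(columns)))
raw[:, 2] -= 0.8 * raw[:, 1]  # make the fake FB% move against the fake GB%

# Normalize each column to mean 0 and standard deviation 1.
z = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

# PCA via the singular value decomposition of the normalized data.
# Rows of vt are the component weights; s**2 is proportional to variance.
u, s, vt = np.linalg.svd(z, full_matrices=False)
weights = vt.T                    # column j holds the weights for PC(j+1)
explained = s**2 / np.sum(s**2)   # proportion of variance per component
scores = z @ weights              # each batter's coordinates on the PCs

for name, w1, w2 in zip(columns, weights[:, 0], weights[:, 1]):
    print(f"{name:6s} {w1:7.3f} {w2:7.3f}")
print("Proportion of variance:", np.round(explained, 3))
```

The sign of each component is arbitrary (flipping every weight in a column describes the same axis), which is why it makes sense to read the weights in terms of which stats push together and which push apart.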

Arguably the whole point of this exercise is to come up with a reasonable interpretation for these components, so it’s worth it for you to take a look at the values and the interplay between them. I would describe the two components (which we should really think of as axes) as follows:

1. The first is a continuum: slap hitters who make a lot of contact, don’t walk much, and hit mostly ground balls with few fly balls and few home runs are on the negative end, and big boppers (three true outcomes guys) sit on the positive end, as they walk a lot, strike out a lot, and hit more fly balls. This interpretation is borne out by the players with the largest-magnitude values for this component (found below). For lack of a better term, let’s call this component BSF, for “Big Stick Factor.”
2. The second measures, basically, what some people might call “line drive power.” It measures people’s propensity to hit the ball hard, as it sets line drives against infield flies. It also rewards guys with good batting eyes, since it sets walk rate against strikeout rate. I think of this as assessing an old-fashioned view of what makes a good hitter: lots of contact and line drives, with less upper cutting and thus fewer infield flies. Let’s call it LDP, for “Line Drive Power.” (I’m open to suggestions on both names.)

Here are some tables showing the top and bottom 10 for both BSF and LDP:

Extreme Values for BSF

| Rank | Name | PC1 |
|---|---|---|
| 1 | Russell Branyan | 5.338 |
| 2 | Barry Bonds | 5.257 |
| 4 | Jack Cust | 4.535 |
| 5 | Ryan Howard | 4.296 |
| 6 | Jim Thome | 4.278 |
| 7 | Jason Giambi | 4.237 |
| 8 | Frank Thomas | 4.206 |
| 9 | Jim Edmonds | 4.114 |
| 10 | Mark Reynolds | 3.890 |
| 633 | Aaron Miles | -3.312 |
| 634 | Cesar Izturis | -3.397 |
| 635 | Einar Diaz | -3.518 |
| 636 | Ichiro Suzuki | -3.523 |
| 637 | Rey Sanchez | -3.893 |
| 638 | Luis Castillo | -4.013 |
| 639 | Juan Pierre | -4.267 |
| 640 | Wilson Valdez | -4.270 |
| 641 | Ben Revere | -5.095 |
| 642 | Joey Gathright | -5.164 |

Extreme Values for LDP

| Rank | Name | PC2 |
|---|---|---|
| 1 | Cory Sullivan | 4.292 |
| 2 | Matt Carpenter | 4.052 |
| 3 | Joey Votto | 3.779 |
| 4 | Joe Mauer | 3.255 |
| 6 | Todd Helton | 3.065 |
| 7 | Julio Franco | 2.933 |
| 8 | Jason Castro | 2.780 |
| 9 | Mark Loretta | 2.772 |
| 10 | Alex Avila | 2.747 |
| 633 | Alexi Casilla | -2.482 |
| 634 | Rocco Baldelli | -2.619 |
| 635 | Mark Trumbo | -2.810 |
| 636 | Nolan Reimold | -2.932 |
| 637 | Marcus Thames | -3.013 |
| 638 | Tony Batista | -3.016 |
| 639 | Scott Hairston | -3.041 |
| 640 | Eric Byrnes | -3.198 |
| 641 | Jayson Nix | -3.408 |
| 642 | Jeff Mathis | -3.668 |

These actually map pretty closely onto what some of our preexisting ideas might have been: the guys with the highest BSF are some of the archetypal three true outcomes players, while the guys with the high LDP are guys we think of as being good hitters with “doubles power,” as it were. It’s also interesting to note that these are not entirely correlated with hitter quality, as there are some mediocre players near the top of each list (though most of the players at the bottom aren’t too great). That suggests to me that this actually did a pretty decent job of capturing style, rather than just quality (though obviously it’s easier to observe someone’s style when they actually have strengths).

Now, another thing about this is that while we would think that BSF and LDP are correlated based on my qualitative descriptions, by construction there’s zero correlation between the two sets of values, so these are actually largely independent stats. Consider the plot below of BSF vs. LDP:

And this plot, which isolates some of the more extreme values:
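On the zero-correlation claim above, here is a short derivation of why it holds by construction. The notation is mine and not from the analysis itself: let $Z$ be the normalized data matrix (one row per player, each column with mean zero), $S = Z^\top Z / (n-1)$ its covariance matrix, and $w_1, w_2$ the weight vectors for PC1 and PC2, which are eigenvectors of $S$ with eigenvalues $\lambda_1$ and $\lambda_2$. The BSF and LDP values are then $Z w_1$ and $Z w_2$, and

$$
\operatorname{Cov}(Z w_1, Z w_2) = \frac{(Z w_1)^\top (Z w_2)}{n-1} = w_1^\top S\, w_2 = \lambda_2\, w_1^\top w_2 = 0 ,
$$

because eigenvectors of a symmetric matrix can be chosen to be orthogonal, so $w_1^\top w_2 = 0$. Strictly speaking this shows the two are uncorrelated, which rules out a linear relationship but not every possible kind of dependence.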

One final thing for this post: given that we have plotted these like coordinates, we can use the standard measure of distance between two points as a measure of similarity. For this, I’m going to change tacks slightly and use only the first two components. The two players most like each other in this sample form a slightly unlikely pair: Marlon Byrd, with coordinates (-0.756, 0.395), and Carlos Ruiz (-0.755, 0.397).

As you see below, if you look at their batted ball profiles, they don’t appear to be hugely similar. I spent a decent amount of time playing around with this; if you increase the number of components used from two to three or more, the similar players look much more similar in terms of these statistics. However, that gets away from the point of PCA, which is to abstract away from the data a bit. Thus, these pairs of similar players are players who have very similar amounts of BSF and LDP, rather than players who have the most similar statistics overall.

Comparison of Ruiz and Byrd

| Name | LD% | GB% | FB% | IFFB% | HR/FB | BB% | K% |
|---|---|---|---|---|---|---|---|
| Carlos Ruiz | 0.198 | 0.455 | 0.255 | 0.092 | 0.074 | 0.098 | 0.111 |
| Marlon Byrd | 0.206 | 0.471 | 0.241 | 0.082 | 0.093 | 0.064 | 0.180 |

Another pair that’s approximately as close as Ruiz and Byrd is Mark Teahen (-0.420,-0.491) and Akinori Iwamura (-0.421,-0.490), with the third place pair being Yorvit Torrealba (-1.919, -0.500) and Eric Young (-1.909, -0.497), who are seven times farther apart than the first two pairs.
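To make the distance measure concrete, here's a small Python sketch that computes the ordinary straight-line (Euclidean) distance for the three pairs, using the (BSF, LDP) coordinates quoted above. Those coordinates are rounded to three decimals, so at this scale the results are only approximate and won't exactly match numbers computed from the unrounded values.

```python
import math

# (BSF, LDP) coordinates as quoted in the post, rounded to three decimals.
pairs = {
    ("Marlon Byrd", "Carlos Ruiz"): ((-0.756, 0.395), (-0.755, 0.397)),
    ("Mark Teahen", "Akinori Iwamura"): ((-0.420, -0.491), (-0.421, -0.490)),
    ("Yorvit Torrealba", "Eric Young"): ((-1.919, -0.500), (-1.909, -0.497)),
}

# Straight-line distance between the two points in each pair.
distances = {names: math.dist(a, b) for names, (a, b) in pairs.items()}

for (p1, p2), d in distances.items():
    print(f"{p1} / {p2}: {d:.4f}")
```

With the rounded inputs, the first two pairs come out at roughly 0.002 and 0.001, and the third at roughly 0.010, so the gap between the third pair and the first two is clear even though rounding can blur the exact ratio and the ordering of the two closest pairs.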

Who stands out as outliers? It’s not altogether surprising if you look at the labelled charts above, though not all of them are labelled. (Also, be wary of the scale: the graph is a bit squished, so many players are farther apart numerically than they appear visually.) Joey Gathright turns out to be by far the most unusual player in our data. The distance to his closest comp, Einar Diaz, is more than 1000x the distance from Ruiz to Byrd, more than thirteen times the average distance to a player’s nearest neighbor, and more than eleven standard deviations above that average nearest-neighbor distance.
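The outlier measure can be sketched the same way: find each player's nearest neighbor, then see how far that distance sits from the average nearest-neighbor distance. The players and coordinates below are invented purely for illustration, and using the sample standard deviation is an assumption on my part rather than something taken from the original calculation.

```python
import math
import statistics

# Hypothetical (BSF, LDP) coordinates, invented for illustration only.
players = {
    "A": (0.0, 0.0),
    "B": (0.1, 0.0),
    "C": (0.0, 0.1),
    "D": (0.1, 0.1),
    "E": (0.2, 0.1),
    "F": (3.0, 3.0),  # deliberately far from everyone else
}

def nearest_neighbor(name):
    """Return (distance, neighbor) for the closest other player."""
    here = players[name]
    return min(
        (math.dist(here, there), other)
        for other, there in players.items()
        if other != name
    )

nearest = {name: nearest_neighbor(name) for name in players}
dists = [d for d, _ in nearest.values()]
mean_d = statistics.mean(dists)
sd_d = statistics.stdev(dists)  # sample standard deviation (an assumption)

# List players from most to least isolated.
for name, (d, other) in sorted(nearest.items(), key=lambda kv: -kv[1][0]):
    z = (d - mean_d) / sd_d
    print(f"{name}: nearest is {other} at {d:.3f} ({z:+.2f} SD from mean)")
```

In a toy set this small, the one far-flung point drags the mean and standard deviation around a lot, so the SD figures here mean much less than they would with several hundred players.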

In this case, though, having a unique style doesn’t appear to be beneficial. You’ll note Gathright is at the bottom of the BSF list, and he’s pretty far down the LDP list as well, meaning that he somehow stumbled into a seven year career despite having no power of any sort. Given that he posted an extremely pedestrian 0.77 bWAR per 150 games (meaning about half as valuable as an average player), hit just one home run in 452 games, and had the 13th lowest slugging percentage of any qualifying non-pitcher since 1990, we probably shouldn’t be surprised that there’s nobody who’s quite like him.

The rest of the players on the outliers list are the ones you’d expect, guys with extreme values for one or both statistics: Joey Votto, Barry Bonds, Cory Sullivan, Matt Carpenter, and Mark Reynolds. Votto is the second biggest outlier, and he’s less than two thirds as far from his nearest neighbor (Todd Helton) as Gathright is from his. Two things to notice here:

• To reiterate what I just said about Gathright, style doesn’t necessarily correlate with results. Cory Sullivan hit a lot of line drives (28.2%, the largest value in my sample—the mean is 20.1%) and popped out infrequently (3%, the mean is 10.1%). His closest comps are Matt Carpenter and Joe Mauer, which is pretty good company. And yet, he finished as a replacement level player with no power. Baseball is weird.
• Many of the most extreme outliers are players where we are missing a big chunk of their careers, either because they haven’t actually had them yet or because the data are unavailable. Given that there’s some research indicating that various power-related statistics change with age, I suspect we’ll see some regression to the mean for guys like Votto and Carpenter. (For instance, I imagine Bonds’s profile would look quite different if it included the first 16 years of his career.)

This chart shows the three tightest pairs of players and the six biggest outliers:

This is a bit of a lengthy post without necessarily an obvious point, but, as I said at the beginning, exploratory data analysis can be plenty interesting on its own, and I think this turned into a cool way of classifying hitters based on certain styles. An obvious extension is to find some way to merge both results and styles into one PCA analysis (essentially combining what I did with the Bill James/BR Similarity Score mentioned above), but I suspect that’s a big question, and one for another time.

If you’re curious, here’s a link to a public Google Doc with my principal components, raw data, and nearest distances and neighbors, and below is the promised table of PCA breakdown:

Weights and Explanatory Power of Principal Components

| | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 |
|---|---|---|---|---|---|---|---|
| LD% | -0.030 | 0.676 | -0.299 | -0.043 | 0.629 | 0.105 | -0.210 |
| GB% | -0.459 | 0.084 | 0.593 | 0.086 | -0.044 | 0.020 | -0.648 |
| FB% | 0.526 | 0.093 | -0.288 | -0.226 | -0.434 | -0.014 | -0.626 |
| IFFB% | -0.067 | -0.671 | -0.373 | 0.247 | 0.442 | -0.071 | -0.379 |
| HR/FB | 0.459 | -0.137 | 0.347 | 0.113 | 0.214 | 0.769 | -0.000 |
| BB% | 0.375 | 0.205 | 0.156 | 0.808 | -0.012 | -0.373 | -0.000 |
| K% | 0.394 | -0.126 | 0.437 | -0.461 | 0.415 | -0.503 | 0.000 |
| Proportion of Variance | 0.394 | 0.218 | 0.163 | 0.102 | 0.069 | 0.053 | 0.000 |
| Cumulative Proportion | 0.394 | 0.612 | 0.775 | 0.877 | 0.947 | 1.000 | 1.000 |