# The Quality of Postseason Play

Summary: I look at averages for hitters and pitchers in the postseason to see how their quality (relative to league average) has changed over time. Unsurprisingly, the gap between postseason and regular season average pitchers is larger than the comparable gap for hitters. The trend over time for pitchers is expected, with a decrease in quality relative to league average from the 1900s to mid-1970s and a slight increase since then that appears to be linked with the increased usage of relievers. The trend for hitters is more confusing, with a dip from 1950 to approximately 1985 and an increase since then. Overall, however, the average quality of both batters and pitchers in the postseason relative to league average is as high as it has been in the expansion era.

Quality of play in the postseason is a common trope of baseball discussion. Between concerns about optics (you want casual fans to watch high quality baseball) and rewarding the best teams, there was a certain amount of handwringing about the number of teams with comparatively poor records making it into the playoffs (e.g., the Giants and Royals made up the only World Series pairing ever in which neither team won 90 games). This prompted me to wonder about the quality of the average players in the postseason and how that’s changed over time with the many changes in the game—increased competitive balance, different workloads for pitchers, changes in the run environment, etc.

For pitchers, I looked at weighted league-adjusted RA9, which I computed as follows:

1. For each pitcher in the postseason, compute their Runs Allowed per 9 IP during the regular season. Lower is better, obviously.
2. Take the average for each pitcher, weighted by the number of batters faced.
3. Divide that average by the major league average RA9 that year.

You can think of this as the expected result you would get if you chose a random plate appearance during the playoffs and looked at the pitcher’s RA9. Four caveats here:

1. By using RA9, this is a combined pitching/defense metric that really measures how much the average playoff team is suppressing runs relative to league average.
2. This doesn’t adjust for park factors, largely because I thought that adjustment was more trouble than it was worth. I’m pretty sure the only effect that this has on aggregate is injecting some noise, though I’m not positive.
3. I considered using projected RA9 instead of actual RA9, but after playing around with the historical Marcel projections at Baseball Heat Maps, I didn’t see any meaningful differences on aggregate.
4. For simplicity’s sake, I used major league average rather than individual league average, which could influence some of the numbers in the pre-interleague play era.
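The three-step calculation above can be sketched in a few lines of Python; the function and the toy numbers here are purely illustrative (the real inputs come from the Lahman/FanGraphs data described below):

```python
# Weighted league-adjusted RA9 for a set of postseason pitchers.
# Inputs are hypothetical; real values come from regular-season stats.

def weighted_adjusted_ra9(pitchers, league_ra9):
    """pitchers: list of (regular-season RA9, postseason batters faced)."""
    total_bf = sum(bf for _, bf in pitchers)
    # Step 2: average RA9 weighted by postseason batters faced.
    weighted = sum(ra9 * bf for ra9, bf in pitchers) / total_bf
    # Step 3: divide by major league average RA9 that year.
    return weighted / league_ra9  # below 1.0 means better than average

# Toy example: two pitchers, league RA9 of 4.50.
sample = [(3.00, 120), (4.50, 60)]  # (RA9, batters faced)
ratio = weighted_adjusted_ra9(sample, 4.50)
```

As in the text, you can read the result as the expected league-relative RA9 of the pitcher on the mound for a randomly chosen playoff plate appearance.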

When I plot that number over time, I get the following graph. The black dots are observed values, and the ugly blue line is a smoothed rolling estimate (using LOESS). (The gray is the confidence interval for the LOESS estimate.)

While I wouldn’t put too much weight in the LOESS estimate (these numbers should be subject to a good deal of randomness), it’s pretty easy to come up with a basic explanation of why the curve looks the way it does. For the first seventy years of that chart, the top pitchers pitched ever smaller shares of the overall innings (except for an uptick in the 1960s), ceding those innings to lesser starters and dropping the average quality. However, starting in the 1970s, relievers have thrown ever larger portions of innings (a trend covered in this FiveThirtyEight piece), and since relievers are typically more effective on a rate basis than starters, that’s a reasonable explanation for the shape of the overall pitcher trend.

What about hitters? I did the same calculations for them, using wOBA instead of RA9 and excluding pitchers from both postseason and league average calculations. (Specifically, I used the static version of wOBA that doesn’t have different coefficients each year. The coefficients used are the ones in The Book.) Again, this includes no park adjustments and rolls the two leagues together for the league average calculation. Here’s what the chart looks like:
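For reference, the static wOBA calculation can be sketched as follows. The coefficients below are the ones commonly attributed to The Book; treat them (and the stat line) as assumptions for illustration, not values verified against this post’s data:

```python
# Static wOBA with fixed linear-weight coefficients (the values below
# are the ones commonly quoted from The Book; treat as an assumption).

WEIGHTS = {"nibb": 0.72, "hbp": 0.75, "single": 0.90,
           "double": 1.24, "triple": 1.56, "hr": 1.95}

def static_woba(nibb, hbp, single, double, triple, hr, pa):
    """Non-intentional walks, HBP, hit types, and plate appearances."""
    num = (WEIGHTS["nibb"] * nibb + WEIGHTS["hbp"] * hbp
           + WEIGHTS["single"] * single + WEIGHTS["double"] * double
           + WEIGHTS["triple"] * triple + WEIGHTS["hr"] * hr)
    return num / pa

# Hypothetical season line: 50 NIBB, 5 HBP, 100 1B, 30 2B, 3 3B, 20 HR, 600 PA.
woba = static_woba(50, 5, 100, 30, 3, 20, 600)
```

The postseason aggregate then weights each hitter’s regular-season wOBA by postseason plate appearances and divides by league average, exactly as with RA9 above.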

Now, for this one I have no good explanation for the trend curve. There’s a dip in batter quality starting around integration and a recovery starting around 1985. If you have ideas about why this might be happening, leave them in the comments or on Twitter. (It’s also quite possible that the LOESS estimate is picking up something that isn’t really there.)

What’s the upshot of all of this? This is an exploratory post, so there’s no major underlying point, but from the plots I’m inclined to conclude that, relative to average, the quality of the typical player (both batter and pitcher) in the playoffs is as good as it’s been since expansion. (To be clear, this mostly refers to the 8 team playoff era of 1995–2011; the last few years aren’t enough to conclude anything about letting two more wild cards in for a single game.) I suspect a reason for that is that, while the looser postseason restrictions have made it easier for flawed teams to make it into the playoffs, they’ve also made it harder for very good teams to be excluded because of bad luck, which lifts the overall quality, a point raised in this recent Baseball Prospectus article by Sam Miller.

• I used data from the Lahman database and Fangraphs for this article, which means there may be slight inconsistencies. For instance, there’s apparently an error in Lahman’s accounting for HBP in postseason games the last 5 years or so, which should have a negligible but non-zero effect on the results.
• I mentioned that the share of batters faced in the postseason by the top pitchers has decreased steadily over time. I assessed that using the Herfindahl-Hirschman index (which I also used in an old post about pitchers’ repertoires). The chart of the HHI for batters faced is included below. I cut the chart off at 1968 to exclude the divisional play era, which substantially decreased the level of concentration by doubling the number of postseason teams.

# Rookie Umpires and the Strike Zone

Summary: Based on a suggestion heard at Saberseminar, I use a few different means to examine how rookie umpires call the strike zone. The seven rookie umpires appear to consistently call more low strikes than the league as a whole, but some simple statistics suggest it’s unlikely they are actually moving the needle.

Red Sox manager John Farrell was one of the speakers at Saberseminar, which I attended last weekend. As I mentioned in my recap, he was asked about the reasons offense is down a hair this year (4.10 runs per team per game as I type this, down from 4.20 through this date (4.17 overall) in 2013). He mentioned a few things, but one that struck me was his suggestion that rookie umpires calling a larger “AAA strike zone” might have something to do with it.

Of course, that’s something we can examine using some empirical evidence. Using this Hardball Talk article as a guide, I identified the seven new umpires this year. (Note that they are new to being full-fledged umps, but had worked a number of games as substitutes over the last several years.) I then pulled umpire strike zone maps from the highly useful Baseball Heat Maps, which I’ve put below. Each map shows the comparison between the umpire* and league average, with yellow marking areas more likely to be called strikes and blue areas less likely to be called strikes by the umpire.

* I used the site’s settings to add in 20 pitches of regression toward the mean, meaning that the values displayed in the charts are suppressed a bit.

Jordan Baker:

Lance Barrett:

Cory Blaser:

Mike Estabrook:

Mike Muchlinski:

David Rackley:

D.J. Reyburn:

The common thread, to me, is that almost all of them call more pitches for strikes at the bottom of the zone, and most of them take away outside strikes for some batters. Unfortunately, these maps don’t adjust for the number of pitches thrown in each area, so it’s hard to get aggregate figures for how many strikes above or below average the umpires are generating. The two charts below, from Baseball Savant, are a little more informative; red dots mark the bars corresponding to rookie umps. (Labeling was done by hand in MS Paint, so there may be some error involved.)

The picture is now a bit murkier; just based on visual inspection, it looks like rookie umps call a few strikes more than average on pitches outside the zone, and maybe call a few extra balls on pitches in the zone, so we’d read that as nearly a wash, but maybe a bit on the strike side.

So, we’ve now looked at their strike zones adjusted for league average but not the number of pitches thrown and their strike zones adjusted for the relative frequencies of pitches but not seriously adjusted for league average. One more comparison, since I wasn’t able to find a net strikes leaderboard, is to use aggregate ball/strike data, which has accurate numbers but is unadjusted for a bunch of other stuff. Taking that information from Baseball Prospectus and subtracting balls in play from their strikes numbers, I find that rookie umps have witnessed in total about 20 strikes more than league average would suggest, though that’s not accounting for swinging vs. called or the location that pitches were thrown. (Those are substantial things to consider, and I wouldn’t necessarily expect them to even out in 30 or so games.)

At 0.12 runs per strike (a figure quoted by Baseball Info Solutions at the conference) that’s about 2.4 runs, which is about 0.4% of the gap between this year’s scoring and last year’s. (For what it’s worth, BIS showed the umpires who’d suppressed the most offense with their strike zones, and if I remember correctly, taking the max value and applying it to each rookie would be 50–60 total runs, which is still way less than the total change in offense.)

A different way of thinking about it is that the rookie umps have worked 155 games, so they’ve given up an extra strike every 8 or so games, or every 16 or so team-games. If the change in offense is 0.07 runs per team-game, that’s about one strike per game. So these calculations, heavily unadjusted, suggest that rookie umpires are unlikely to account for much of the decrease in scoring.
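The back-of-the-envelope arithmetic in the last two paragraphs is simple enough to lay out explicitly (all figures are the ones quoted in the text):

```python
# How much scoring can ~20 extra strikes plausibly explain?
extra_strikes = 20       # net extra strikes by rookie umps (from BP data)
runs_per_strike = 0.12   # figure quoted by Baseball Info Solutions
games_worked = 155       # games worked by rookie umps so far

extra_runs = extra_strikes * runs_per_strike      # total runs suppressed
strikes_per_game = extra_strikes / games_worked   # roughly one per 8 games
```

At these rates the rookie umps account for a couple of runs total, which is tiny next to the league-wide change in scoring.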

So, we have three different imperfect calculations, plus a hearsay back of the envelope plausibility analysis using BIS’s estimates, that each point to a very small effect from rookie umps. Moreover, rookie umps have worked 8.3% of all games and 8.7% of Red Sox games, so it seems like an odd thing for Farrell to pick up on. It’s possible that a more thorough analysis would reveal something big, but based on the data easily available I don’t think it’s true that rookie umpires are affecting offense with their strike zones.

# Do Platoon Splits Mess Up Projections?

Quick summary: I test the ZiPS and Marcel projection systems to see if their errors are larger for players with larger platoon splits. A first check says that they are not, though a more nuanced examination of the system remains to be conducted.

First, a couple housekeeping notes:

• I will be giving a short talk at Saberseminar, which is a baseball research conference held in Boston in 10 days! If you’re there, you should go—I’ll be talking about how the strike zone changes depending on where and when games are played. Right now I’m scheduled for late Sunday afternoon.
• Sorry for the lengthy gap between updates; work obligations plus some other commitments plus working on my talk have cut into my blogging time.

After the A’s went on their trading spree at last week’s trade deadline, there was much discussion about how they were going to intelligently deploy the rest of their roster to cover for the departure of Yoenis Cespedes. This is part of a larger pattern with the A’s, as they continue to be very successful with their platoons, wringing lots of value out of their depth. Obviously, when people have tried to determine the impact of this trade, they’ve been relying on projections for each of the individual players involved.

What prompted my specific question is that Jonny Gomes is one of those helping to fill Cespedes’s shoes, and Gomes has very large platoon splits. (His career OPS is .874 against left-handed pitchers and .723 against righties.) The question is what proportion of Gomes’s plate appearances the projection systems assume will be against right-handers; one might expect that if he is deployed more often against lefties than the system projects, he might beat the projections substantially.

Since Jonny Gomes in the second half of 2014 constitutes an extremely small sample, I decided to look at a bigger pool of players from the last few years and see if platoon splits correlated at all with a player beating (or missing) preseason projections. Specifically, I used the 2010, 2012, and 2013 ZiPS and Marcel projections (via the Baseball Projection Project, which doesn’t have 2011 ZiPS numbers).

A bit of background: ZiPS is the projection system developed by Dan Szymborski, and it’s one of the more widely used ones, if only because it’s available at FanGraphs and relatively easy to find there. Marcel is a very simple projection system developed by Tangotiger (it’s named after the monkey from Friends) that is sometimes used as a baseline for other projection systems. (More information on the two systems is available here.)

So, once I had the projections, I needed to come up with a measure of platoon tendencies. Since the available ZiPS projections only included one rate stat, batting average, I decided to use that as my measure of batting success. I computed platoon severity by taking the larger of a player’s BA against left-handers and BA against right-handers and dividing by the smaller of those two numbers. (As an example, Gomes’s BA against RHP is .222 and against LHP is .279, so his ratio is .279/.222 = 1.26.) My source for those data is FanGraphs.
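The severity measure is a one-line ratio; here it is as code, with the Gomes splits from the text as a check:

```python
# Platoon severity: larger BA split divided by smaller BA split.

def platoon_severity(ba_vs_lhp, ba_vs_rhp):
    hi, lo = max(ba_vs_lhp, ba_vs_rhp), min(ba_vs_lhp, ba_vs_rhp)
    return hi / lo

# Jonny Gomes's career splits from the text: .279 vs LHP, .222 vs RHP.
gomes = platoon_severity(0.279, 0.222)  # ~1.26
```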

I computed that severity for players with at least 500 PA against both left-handers and right-handers going into the season for which they were projected; for instance, for 2010 I would have used career data stopping at 2009. I then looked at their actual BA in the projected year, computed the deviation between that BA and the projected BA, and saw if there was any correlation between the deviation and the platoon ratio. (I actually used the absolute value of the deviation, so that magnitude was taken into account without worrying about direction.) Taking into account the availability of projections and requiring that players have at least 150 PA in the season where the deviation is measured, we have a sample size of 556 player seasons.

As it turns out, there isn’t any correlation between the two parameters. My hypothesis was that there’d be a positive correlation, but the correlation is -0.026 for Marcel projections and -0.047 for ZiPS projections, neither of which is practically or statistically significantly different from 0. The scatter plots for the two projection systems are below:
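The correlation check itself can be sketched as follows; the severity values and BA pairs below are invented for illustration, not drawn from the actual 556-player sample:

```python
# Correlate platoon severity with the absolute projection miss.
# All data below are made up for illustration.

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

severity = [1.05, 1.26, 1.10, 1.40, 1.18]
# |actual BA - projected BA|, magnitude only, as in the text.
abs_miss = [abs(a - p) for a, p in
            [(0.270, 0.265), (0.250, 0.262), (0.300, 0.281),
             (0.230, 0.244), (0.280, 0.277)]]
r = pearson_r(severity, abs_miss)
```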

Now, there are a number of shortcomings to the approach I’ve taken:

• It only looks at two projection systems; it’s possible this problem arises for other systems.
• It only looks at batting average due to data availability issues, when wOBA, OPS, and wRC+ are better, less luck-dependent measures of offensive productivity.
• Perhaps most substantially, we would expect the projection to be wrong if the player has a large platoon split and faces a different percentage of LHP/RHP during the season in question than he has in his career previously. I didn’t filter on that (I was having issues collecting those data in an efficient format), but I intend to come back to it.

So, if you’re looking for a takeaway, it’s that large platoon-split players on the whole do not appear to be poorly projected (for BA by ZiPS and Marcel), but it’s still possible that those with a large change in circumstances might differ from their projections.

# Picking a Pitch and the Pace of the Game

Here’s a short post to answer a straightforward question: do pitchers who throw a wider variety of pitches work more slowly? If it’s not clear, the idea is that a pitcher who throws several pitch types frequently will take longer because the catcher has to spend more time calling the pitch, perhaps with a corresponding increase in how often the pitcher shakes off the catcher.

To make a quick pass at this, I pulled FanGraphs data on how often each pitcher threw fastballs, sliders, curveballs, changeups, cutters, splitters, and knucklers, using data from 2009–13 on all pitchers with at least 200 innings. (See the data here. There are well-documented issues with the categorizations, but for a small question like this they are good enough.) The statistic used for how quickly the pitcher worked was the appropriately named Pace, which measures the number of seconds between pitches thrown.

To easily test the hypothesis, we need a single number to measure how even the pitcher’s pitch mix is, which we believe to be linked to the complexity of the decision they need to make. There are many ways to do this, but I decided to go with the Herfindahl-Hirschman Index, which is usually used to measure market concentration in economics. It’s computed by squaring the percentage share of each pitch and adding them together, so higher values mean things are more concentrated. (The theoretical max is 10,000.) As an example, Mariano Rivera threw 88.9% cutters and 11.1% fastballs over the time period we’re examining, so his HHI was $88.9^{2} + 11.1^{2} = 8026$. David Price threw 66.7% fastballs, 5.8% sliders, 6.6% cutters, 10.6% curveballs, and 10.4% changeups, leading to an HHI of 4746. (See additional discussion below.) If you’re curious, the most and least concentrated repertoires split by role are in a table at the bottom of the post.
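The HHI computation is a one-liner; here it is with the two pitch mixes quoted above as checks:

```python
# Herfindahl-Hirschman Index over a pitch mix (shares in percent).
# Theoretical max is 10,000 (one pitch thrown 100% of the time).

def hhi(shares):
    return sum(s ** 2 for s in shares)

rivera = hhi([88.9, 11.1])                 # cutter, fastball
price = hhi([66.7, 5.8, 6.6, 10.6, 10.4])  # FB, SL, CT, CB, CH
```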

As an aside, I find two people on those leader/trailer lists most interesting. The first is Yu Darvish, who’s surrounded by junkballers—it’s pretty cool that he has such amazing stuff and still throws 4.5 pitches with some regularity. The second is that Bartolo Colon has, according to this metric, less variety in his pitch selection over the last five years than the two knuckleballers in the sample. He’s somehow a junkballer but with only one pitch, which is a pretty #Mets thing to be.

Back to business: after computing HHIs, I split the sample into 99 relievers and 208 starters, defined as pitchers who had at least 80% of their innings come in the respective role. I enforced the starter/reliever split because a) relievers have substantially less pitch diversity (unweighted mean HHI of 4928 vs. 4154 for starters, highly significant) and b) they pitch substantially slower, possibly due to pitching more with men on base and in higher leverage situations (unweighted mean Pace of 23.75 vs. 21.24, a 12% difference that’s also highly significant).

So, how does this HHI match up with pitching pace for these two groups? Pretty poorly. The correlation for starters is -0.11, which is the direction we’d expect but a very small correlation (and one that’s not statistically significant at p = 0.1, to the limited extent that statistical significance matters here). For relievers, it’s actually 0.11, which runs against our expectation but is also statistically and practically no different from 0. Overall, there doesn’t seem to be any real link, but if you want to gaze at the entrails, I’ve put scatterplots at the bottom as well.

One important note: a couple weeks back, Chris Teeter at Beyond the Box Score took a crack at the same question, though using a slightly different method. Unsurprisingly, he found the same thing. If I’d seen the article before I’d had this mostly typed up, I might not have gone through with it, but as it stands, it’s always nice to find corroboration for a result.

Relief Pitchers with Most Diverse Stuff, 2009–13

| # | Name | FB% | SL% | CT% | CB% | CH% | SF% | KN% | HHI |
|---|----------------|-----:|-----:|-----:|-----:|-----:|----:|----:|-----:|
| 1 | Sean Marshall | 25.6 | 18.3 | 17.7 | 38.0 | 0.5 | 0.0 | 0.0 | 2748 |
| 2 | Brandon Lyon | 43.8 | 18.3 | 14.8 | 18.7 | 4.4 | 0.0 | 0.0 | 2841 |
| 3 | D.J. Carrasco | 32.5 | 11.2 | 39.6 | 14.8 | 2.0 | 0.0 | 0.0 | 2973 |
| 4 | Alfredo Aceves | 46.5 | 0.0 | 17.9 | 19.8 | 13.5 | 2.3 | 0.0 | 3062 |
| 5 | Logan Ondrusek | 41.5 | 2.0 | 30.7 | 20.0 | 0.0 | 5.8 | 0.0 | 3102 |

Relief Pitchers with Least Diverse Stuff, 2009–13

| # | Name | FB% | SL% | CT% | CB% | CH% | SF% | KN% | HHI |
|---|------------------|-----:|-----:|-----:|-----:|----:|----:|----:|-----:|
| 1 | Kenley Jansen | 91.4 | 7.8 | 0.0 | 0.2 | 0.6 | 0.0 | 0.0 | 8415 |
| 2 | Mariano Rivera | 11.1 | 0.0 | 88.9 | 0.0 | 0.0 | 0.0 | 0.0 | 8026 |
| 3 | Ronald Belisario | 85.4 | 12.7 | 0.0 | 0.0 | 0.0 | 1.9 | 0.0 | 7458 |
| 4 | Matt Thornton | 84.1 | 12.5 | 3.3 | 0.0 | 0.1 | 0.0 | 0.0 | 7240 |
| 5 | Ernesto Frieri | 82.9 | 5.6 | 0.0 | 10.4 | 1.1 | 0.0 | 0.0 | 7013 |

Starting Pitchers with Most Diverse Stuff, 2009–13

| # | Name | FB% | SL% | CT% | CB% | CH% | SF% | KN% | HHI |
|---|----------------|-----:|-----:|-----:|-----:|-----:|-----:|----:|-----:|
| 1 | Shaun Marcum | 36.6 | 9.3 | 17.6 | 12.4 | 24.1 | 0.0 | 0.0 | 2470 |
| 2 | Freddy Garcia | 35.4 | 26.6 | 0.0 | 7.9 | 13.0 | 17.1 | 0.0 | 2485 |
| 3 | Bronson Arroyo | 42.6 | 20.6 | 5.1 | 14.2 | 17.6 | 0.0 | 0.0 | 2777 |
| 4 | Yu Darvish | 42.6 | 23.3 | 16.5 | 11.2 | 1.2 | 5.1 | 0.0 | 2783 |
| 5 | Mike Leake | 43.5 | 11.8 | 23.4 | 9.9 | 11.6 | 0.0 | 0.0 | 2812 |

Starting Pitchers with Least Diverse Stuff, 2009–13

| # | Name | FB% | SL% | CT% | CB% | CH% | SF% | KN% | HHI |
|---|------------------|-----:|-----:|----:|----:|----:|----:|-----:|-----:|
| 1 | Bartolo Colon | 86.2 | 9.1 | 0.2 | 0.0 | 4.6 | 0.0 | 0.0 | 7534 |
| 2 | Tim Wakefield | 10.5 | 0.0 | 0.0 | 3.7 | 0.0 | 0.0 | 85.8 | 7486 |
| 3 | R.A. Dickey | 16.8 | 0.0 | 0.0 | 0.2 | 1.5 | 0.0 | 81.5 | 6927 |
| 4 | Justin Masterson | 78.4 | 20.3 | 0.0 | 0.0 | 1.3 | 0.0 | 0.0 | 6560 |
| 5 | Aaron Cook | 79.7 | 9.7 | 2.8 | 7.6 | 0.4 | 0.0 | 0.0 | 6512 |

Boring methodological footnote: There’s one primary conceptual problem with using HHI, and that’s that in certain situations it gives a counterintuitive result for this application. For instance, under our line of reasoning we would think that, ceteris paribus, a pitcher who throws a fastball 90% of the time and a change 10% of the time would have a meaningfully easier decision to make than one who throws a fastball 90% of the time and a change and slider 5% each. However, the HHI barely separates the two (8,200 vs. 8,150), treating the extra pitch type as a negligible change in concentration—which makes sense in the context of market concentration, but not in this scenario. (The same issue holds for the Gini coefficient, for that matter.) There’s a very high correlation between HHI and the frequency of a pitcher’s most common pitch, though, and using the latter doesn’t change any of the conclusions of the post.
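Working the footnote’s hypothetical example through numerically, alongside the simpler most-common-pitch alternative:

```python
# HHI for the footnote's two hypothetical repertoires, plus the simpler
# "share of most common pitch" metric mentioned as an alternative.

def hhi(shares):
    return sum(s ** 2 for s in shares)

two_pitch = hhi([90, 10])      # fastball/change
three_pitch = hhi([90, 5, 5])  # fastball/change/slider
# The most-common-pitch metric cannot distinguish them at all:
most_common = (max([90, 10]), max([90, 5, 5]))
```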

# Is There a Hit-by-Pitch Hangover?

One of the things I’ve been curious about recently and have on my list of research questions is what the ramifications of a hit-by-pitch are in terms of injury risk—basically, how much of the value of an HBP does the batter give back through the increased injury risk? Today, though, I’m going to look at something vaguely similar but much simpler: Is an HBP associated with an immediate decrease in player productivity?

To assess this, I looked at how players performed in the plate appearance immediately following their HBP in the same game. (This obviously ignores players who are injured by their HBP and leave the game, but I’m looking for something subtler here.) To evaluate performance, I used wOBA, a rate stat that encapsulates a batter’s overall offensive contributions. There are, however, two obvious effects (and probably other more subtle ones) that mean we can’t only look at the post-HBP wOBA and compare it to league average.

The first is that, ceteris paribus, we expect that a pitcher will do worse the more times he sees a given batter (the so-called “trips through the order penalty”). Since in this context we will never include a batter’s first PA of a game because it couldn’t be preceded by an HBP, we need to adjust for this. The second adjustment is simple selection bias—not every batter has the same likelihood of being hit by a pitch, and if the average batter getting hit by a pitch is better or worse than the overall average batter, we will get a biased estimate of the effect of the HBP. If you don’t care about how I adjusted for this, skip to the next bold text.

I attempted to take those factors into account by computing the expected wOBA as follows. Using Retrosheet play-by-play data for 2004–2012 (the last year I had on hand), for each player with at least 350 PA in a season, I computed their wOBA over all PA that were not that player’s first PA in a given game. (I put the 350 PA condition in to make sure my average wasn’t swayed by low PA players with extreme wOBA values.) I then computed the average wOBA of those players weighted by the number of HBP they had and compared it to the actual post-HBP wOBA put up by this sample of players.

To get a sense of how likely or unlikely any discrepancy would be, I also ran a simulation where I chose random HBPs and then pulled a random plate appearance from the hit batter until I had the same number of post-HBP PA as actually occurred in my nine year sample, then computed the post-HBP wOBA in that simulated world. I ran 1000 simulations and so have some sense of how unlikely the observed post-HBP performance is under the null hypothesis that there’s no difference between post-HBP performance and other performance.
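The simulation loop can be sketched as follows; the data structures and the tiny sample below are hypothetical stand-ins for the Retrosheet-derived pools described above:

```python
import random

# Null-hypothesis simulation: for each observed post-HBP PA, draw a
# random PA outcome (its wOBA value) from the same batter's pool of
# non-first PA, then recompute the aggregate "post-HBP" wOBA.

def simulate_once(post_hbp_counts, batter_pa_woba, rng):
    """post_hbp_counts: {batter: number of post-HBP PA}
       batter_pa_woba: {batter: wOBA values, one per non-first PA}"""
    total, n = 0.0, 0
    for batter, count in post_hbp_counts.items():
        for _ in range(count):
            total += rng.choice(batter_pa_woba[batter])
            n += 1
    return total / n

rng = random.Random(0)
# Hypothetical two-batter sample with per-PA wOBA event values.
counts = {"A": 2, "B": 1}
pa_pool = {"A": [0.0, 0.9, 0.0, 0.36], "B": [0.7, 0.0]}
sims = [simulate_once(counts, pa_pool, rng) for _ in range(1000)]
```

The observed post-HBP wOBA is then compared against the distribution of the 1,000 simulated values to get a percentile.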

To be honest, though, those adjustments don’t make me super confident that I’ve covered all the necessary bases to find a clean effect—the numbers are still a bit wonky, and this is not such a simple thing to examine that I’m confident I’ve gotten all the noise out. For instance, it doesn’t filter out park or pitcher effects (i.e., selection bias due to facing a worse pitcher, or a pitcher having an off day), both of which play a meaningful role in performance and probably lead to additional selection biases I don’t control for.

With all those caveats out of the way, what do we see? In the data, we have an expected post-HBP wOBA of .3464 and an actual post-HBP wOBA of .3423, for an observed difference of about 4 points of wOBA, which is a small but non-negligible difference. However, it’s in the 24th percentile of outcomes according to the simulation, which indicates there’s a hefty chance that it’s randomness. (Though league average wOBA changed noticeably over the time period I examined, I did some sensitivities and am fairly confident those changes aren’t covering up a real result.)

The main thing (beyond the aforementioned haziness in this analysis) that makes me believe there might be an effect is that the post-walk effect is actually a 2.7 point (i.e. 0.0027) increase in wOBA. If we think that boost is due to pitcher wildness then we would expect the same thing to pop up for the post-HBP plate appearances, and the absence of such an increase suggests that there is a hangover effect. However, to conclude from that that there is a post-HBP swoon seems to be an unreasonably baroque chain of logic given the rest of the evidence, so I’m content to let it go for now.

The main takeaway from all of this is that there’s an observed decrease in expected performance after an HBP, but it’s not particularly large and doesn’t seem likely to have any predictive value. I’m open to the idea that a more sophisticated simulator that includes pitcher and park effects could help detect this effect, but I expect that even if the post-HBP hangover is a real thing, it doesn’t have a major impact.

# Do High Sock Players Get “Hosed” by the Umpires?

I was reading one of Baseball Prospectus’s collections this morning and came across an interesting story. It’s a part of baseball lore that Willie Mays started his career on a brutal cold streak (though one punctuated by a long home run off Warren Spahn). Apparently, manager Leo Durocher told Mays toward the end of the slump that he needed to pull his pants up because the pant knees were below Mays’s actual knees, which was costing him strikes. Mays got two hits the day after the change and never looked back.

To me, this is a pretty great story and (to the extent it’s true) a nice example of the attention to detail that experienced athletes and managers are capable of. However, it prompted another question: do uniform details actually affect the way that umpires call the game?

Assessing where a player belts his pants is hard, however, so at this point I’ll have to leave that question on the shelf. What is slightly easier is looking at which hitters wear their socks high and which cover their socks with their baseball pants. The idea is that by clearly delineating the strike zone, the batter will get fairer calls on balls near the bottom of the strike zone than he might otherwise. This isn’t a novel idea—besides the similarity to what Durocher said, it’s also been suggested here, here, and in the comments here—but I wasn’t able to find any studies looking at this. (Two minor league teams in the 1950s did try this with their whole uniforms instead of just the socks, however. The experiments appear to have been short-lived.)

There are basically two ways of looking at the hypothesis: the first is that it will be a straightforward benefit/detriment to the player to hike his socks because the umpire will change his definition of the bottom of the zone; this is what most of the links I cited above would suggest, though they didn’t agree on which direction. I’m somewhat skeptical of this, unless we think that the umpires have a persistent bias for or against certain players and that that bias would be resolved by the player changing how he wears his socks. The second interpretation is that it will make the umpire’s calls more precise, meaning simply that borderline pitches are called more consistently, but that it won’t actually affect where the umpire thinks the bottom of the zone is.

At first blush, this seems like the sort of thing that Pitch F/X would be perfectly suited to, as it gives oodles of information about nearly every pitch thrown in the majors in the last several years. However, it doesn’t include a variable for the hosiery of the batter, so to do a broader study we need additional data. After doing some research and asking around, I wasn’t able to find a good database of players that consistently wear high socks, much less a game-by-game list, which basically ruled out a large-scale Pitch F/X study.

However, I got a very useful suggestion from Paul Lukas, who runs the excellent Uni Watch site. He pointed out that a number of organizations require their minor leaguers to wear high socks and only give the option of covered hose to the major leaguers, providing a natural means of comparison between the two types of players. This will allow us to very broadly test the hypothesis that there is a single direction change in how low strikes are called.

I say very broadly because minor league Pitch F/X data aren’t publicly available, so we’re left with extremely aggregate data. I used data from Minor League Central, which has called strikes and balls for each batter. In theory, if the socks lead to more or fewer calls for the batter at the bottom of the zone, that will show up in the aggregate data and the four high-socked teams (Omaha, Durham, Indianapolis, and Scranton/Wilkes-Barre) will have a different percentage of pitches taken go for strikes. (I found those teams by looking at a sample of clips from the 2013 season; their AA affiliates also require high socks.)  Now, there are a lot of things that could be confounding factors in this analysis:

1. Players on other teams are allowed to wear their socks high, so this isn’t a straight high socks/no high socks comparison, but rather an all high socks/some high socks comparison. (There’s also a very limited amount of non-compliance on the all socks side, as based on the clips I could find it appears that major leaguers on rehab aren’t bound by the same rules; look at some Derek Jeter highlights with Scranton if you’re curious.)
2. AAA umpires are prone to more or different errors than major league umpires.
3. Which pitches are taken is a function of the team makeup and these teams might take more or fewer balls for reasons unrelated to their hose.
4. This only affects borderline low pitches, and so it will only make up a small fraction of the overall numbers we observe and the impact will be smothered.

I’m inclined to downplay the first and last issues, because if those are enough to suppress the entire difference over the course of a whole season then the practical significance of the change is pretty small. (Furthermore, for #1, from my research it didn’t look like there were many teams with a substantial number of players opting to show their socks. Please take that with a grain of salt.)

I don’t really have anything to say about the second point, because it has to do with extrapolation, and for now I’d be fine just looking at AAA. I don’t even have that level of brush-off response for the third point, except to wave my hands and say that I hope it doesn’t matter given that these reflect pitches thrown by the rest of the league, so they will hopefully converge around league average.

So, having substantially caveated my results…what are they? As it turns out, the percentage of pitches the stylish high sock teams took that went for strikes was 30.83% and the equivalent figure for the sartorially challenged was…30.83%. With more than 300,000 pitches thrown in AAA last year, you need to go to the seventh decimal place of the fraction to see a difference. (If this near equality seems off to you, it does to me as well. I checked my figures a couple of ways, but I (obviously) can’t rule out an error here.)
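
For what it’s worth, the comparison above boils down to a two-proportion test on called-strike rates. Here’s a minimal sketch; the pitch counts below are hypothetical placeholders, not the actual Minor League Central totals.

```python
import math

def two_prop_ztest(k1, n1, k2, n2):
    """z-statistic for H0: the two called-strike rates are equal (pooled SE)."""
    p1, p2 = k1 / n1, k2 / n2
    p_pool = (k1 + k2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Hypothetical counts: called strikes / taken pitches for the high-sock
# teams vs. the rest of AAA (both near the 30.83% figure in the text)
z = two_prop_ztest(12_332, 40_000, 80_158, 260_000)
```

With the two observed rates essentially identical, any such test will return a z-statistic near zero.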

What this says to me is that it’s pretty unlikely that this ends up mattering, unless there is an effect and it’s exactly cancelled out by the confounding factors listed above (or others I failed to consider). That can’t be ruled out as a possibility, nor can data quality issues, but I’m comfortable saying that the likeliest possibility by a decent margin is that socks don’t lead to more or fewer strikes being called against the batter. (Regardless, I’m open to suggestions for why the effect might be suppressed or analysis based on more granular data I either don’t have access to or couldn’t find.)

What about the accuracy question, i.e. is the bottom of the strike zone called more consistently or correctly for higher-socked players? Due to the lack of nicely collected data, I couldn’t take a broad approach to answering this, but I do want to record an attempt I made regardless. David Wright is known for wearing high socks in day games but covering his hosiery at night, which gives us a natural experiment we can look at for results.

I spent some amount of time looking at the 2013 Pitch F/X data for his day/night splits on taken low pitches and comparing those to the same splits for the Mets as a whole, trying a few different logistic regression models as well as just looking at the contingency tables to see if anything jumped out, and nothing really did in terms of either greater accuracy or precision. I didn’t find any cuts of the data that yielded a sufficiently clean comparison or sample size that I was confident in the results. Since this is a messy use of these data in the first place (it relies on unreliable estimates of the lower edge of a given batter’s strike zone, for instance), I’m going to characterize the analysis as incomplete for now. Given a more rigorous list of which players wear high socks and when, though, I’d love to redo this with more data.

Overall, though, there isn’t any clear evidence that the socks do influence the strike zone. I will say, though, that this seems like something that a curious team could test by randomly having players (presumably on their minor league teams) wear the socks high and doing this analysis with cleaner data. It might be so silly as to not be worth a shot, but if this is something that can affect the strike zone at all then it could be worthwhile to implement in the long run—if it can partially negate pitch framing, for instance, then that could be quite a big deal.

# Wear Down, Chicago Bears?

I watched the NFC Championship game the weekend before last via a moderately sketchy British stream. It used the Joe Buck/Troy Aikman feed, but whenever that went to commercials they had their own British commentary team whose level of insight, I think it’s fair to say, was probably a notch below what you’d get if you picked three thoughtful-looking guys at random out of an American sports bar. (To be fair, that’s arguably true of most of the American NFL studio crews as well.)

When discussing Marshawn Lynch, one of them brought out the old chestnut that big running backs wear down the defense and thus are likely to get big chunks of yardage toward the end of games, citing Jerome Bettis as an example of this. This is accepted as conventional wisdom when discussing football strategy, but I’ve never actually seen proof of this one way or another, and I couldn’t find any analysis of this before typing up this post.

The hypothesis I want to examine is that bigger running backs are more successful late in games than smaller running backs. All of those terms are tricky to define, so here’s what I’m going with:

• Bigger running backs are determined by weight, BMI, or both. I’m using Pro Football Reference data for this, which has some limitations in that it’s not dynamic, but I haven’t heard of any source that has any dynamic information on player size.
• Late in games is the simplest thing to define: fourth quarter and overtime.
• More successful is going to be measured in terms of yards per carry, compared to the YPC in the first three quarters to account for the baseline differences between big and small backs. The correlation between BMI and YPC is -0.29, which is highly significant (p = 0.0001). The low R squared (about 0.1) says that BMI explains only about 10% of the variation in YPC, which isn’t much but does indicate a meaningful connection. There’s a plot below of BMI vs. YPC with the trend line added; the effect looks close to monotonic to me, meaning that getting bigger will, on average, hurt YPC. (Assuming, of course, that the player is big enough to actually be an NFL back.)
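
As a sketch of that correlation check: the -0.29 figure came from the real data, while the arrays below are invented stand-ins purely for illustration.

```python
from scipy.stats import pearsonr

# Invented illustrative values, not the actual Pro Football Reference data
bmi = [28.5, 28.9, 29.1, 29.8, 30.4, 30.9, 31.2, 32.0]
ypc = [4.8, 4.7, 4.6, 4.4, 4.3, 4.1, 4.0, 3.9]

# Pearson correlation; squaring r gives the share of YPC variance
# "explained" by BMI in a simple linear sense
r, p = pearsonr(bmi, ypc)
print(f"r = {r:.2f}, R^2 = {r * r:.2f}, p = {p:.4f}")
```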

My data set consisted of career-level data split into 4th quarter/OT and 1st-3rd quarters, which I subset to only include carries occurring while the game was within 14 points (a cut popular with writers like Bill Barnwell—see about halfway down this post, for example) to attempt to remove huge blowouts, which may affect data integrity. My timeframe was 1999 to the present, which is when PFR has play-by-play data in its database. I then subset the list of running backs to only those with at least 50 carries in the first three quarters and in the fourth quarter and overtime (166 in all). (I looked at different carry cutoffs, and they don’t change any of my conclusions.)

Before I dive into my conclusions, I want to preemptively bring up a big issue with this, which is that it uses only aggregate-level data. This involves pairing up data from different games or even different years, which raises two problems immediately. The first is that we’re not directly testing the hypothesis; I think it is closer in spirit to interpret it as “if a big running back gets lots of carries early on, his/his team’s YPC will increase in the fourth quarter,” which can only be tested with game-level data. I’m not entirely sure what metrics to look at, as there are a lot of confounds, but it’s going in the bucket of ideas for research.

The second is that, beyond having to look at this effect indirectly, we might actually have biases altering the perceived effect, as when a player runs ineffectively in the first part of the game, he will probably get fewer carries at the end—partially because he is probably running against a good defense, and partially because his team is likely to be behind and thus passing more. This means that it’s likely that more of the fourth quarter carries come when a runner is having a good day, possibly biasing our data.

Finally, it’s possible that the way that big running backs wear the defense down is that they soften it up so that other running backs do better in the fourth quarter. This is going to be impossible to detect with aggregate data, and if this effect is actually present it will bias against finding a result using aggregate data, as it will be a lurking variable inflating the fourth quarter totals for smaller running backs.

Now, I’m not sure that any of these issues will necessarily ruin the results I get with the aggregate data, but they are caveats to be mentioned. I am planning on redoing some of this analysis with play-by-play level data, but those data are rather messy and I’m a little scared of the small sample sizes that come with looking at one quarter at a time, so I think presenting results using aggregated data still adds something to the conversation.

Enough equivocating, let’s get to some numbers. Below is a plot of fourth quarter YPC versus early game YPC; the line is the identity, meaning that points above the line are better in the fourth. The unweighted mean of the difference (Q4 YPC – Q1–3 YPC) is -0.14, with the median equal to -0.15, so by the regular measures a typical running back is less effective in the 4th quarter (on aggregate in moderately close games). (A paired t-test shows this difference is significant, with p < 0.01.)
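
The significance check above is just a paired t-test on the per-back differences. A minimal sketch, with invented YPC values rather than the real 166-back sample:

```python
from scipy.stats import ttest_rel

# Invented per-back YPC values for illustration
early  = [4.5, 4.2, 4.8, 3.9, 4.4, 4.1, 4.6, 4.0]  # quarters 1-3
fourth = [4.3, 4.1, 4.5, 3.8, 4.3, 4.0, 4.4, 3.9]  # quarter 4 + OT

# Paired test because each back contributes one value to each column
t, p = ttest_rel(fourth, early)
print(f"t = {t:.2f}, p = {p:.4f}")  # negative t means worse in the fourth
```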

A couple of individual observations jump out here, and if you’re curious, here’s who they are:

• The guy in the top right, who’s very consistent and very good? Jamaal Charles. His YPC increases by about 0.01 yards in the fourth quarter, the second-smallest change in the data (Chester Taylor has the smallest, a drop of about 0.001 yards).
• The outlier in the bottom right, meaning a major dropoff, is Darren Sproles, who has the highest early game YPC of any back in the sample.
• The outlier in the top center with a major increase is Jerious Norwood.
• The back on the left with the lowest early game YPC in our sample is Mike Cloud, whom I had never heard of. He’s the only guy below 3 YPC for the first three quarters.

A simple linear model gives us a best fit line of (Predicted Q4 YPC) = 1.78 + 0.54 * (Prior Quarters YPC), with an R squared of 0.12. That’s less predictive than I thought it would be, which suggests that there’s a lot of chance in these data and/or there is a lurking factor explaining the divergence. (It’s also possible this isn’t actually a linear effect.)
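
A fit like that can be reproduced with an ordinary least-squares line. The arrays below are illustrative stand-ins, not the real data, so the fitted coefficients won’t match the ones quoted above.

```python
import numpy as np

# Illustrative stand-ins for (prior-quarters YPC, Q4 YPC) pairs
x = np.array([3.8, 4.0, 4.2, 4.4, 4.6, 4.8, 5.0])
y = np.array([3.9, 4.1, 4.0, 4.3, 4.2, 4.5, 4.4])

# Degree-1 polynomial fit = simple linear regression
slope, intercept = np.polyfit(x, y, 1)
pred = intercept + slope * x
# R^2 = 1 - SS_residual / SS_total
r2 = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"Q4 YPC = {intercept:.2f} + {slope:.2f} * prior YPC, R^2 = {r2:.2f}")
```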

However, that lurking variable doesn’t appear to be running back size. Below is a plot showing running back BMI vs. (Q4 YPC – Q1–3 YPC); there doesn’t seem to be a real relationship. The plot below it shows the difference against fourth quarter carries (the horizontal line is the average value of -0.13), which somewhat suggests an effect that shrinks as sample size increases, though these data are non-normal, so it’s not an easy thing to assess at a glance.

That intuition is borne out if we look at the correlation between the two, with an estimate of 0.02 that is not close to significant (p = 0.78). Using weight and height instead of BMI gives us larger apparent effects, but they’re still not significant (r = 0.08 with p = 0.29 for weight, r = 0.10 with p = 0.21 for height). Throwing these variables into the regression to predict Q4 YPC from previous YPC also produces no effect close to significant, though I don’t put much stock in that because I don’t think much of that model to begin with.

Our talking head, though, mentioned Lynch and Bettis by name. Do we see anything for them? Unsurprisingly, we don’t—Bettis has a net improvement of 0.35 YPC, with Lynch actually falling off by 0.46 YPC, though both of these are within one standard deviation of the average effect, so they don’t really mean much.

On a more general scale, it doesn’t seem like a change in YPC in the fourth quarter can be attributed to running back size. My hunch is that this is accurate, and that “big running backs make it easier to run later in the game” is one of those things that people repeat because it sounds reasonable. However, given all of the data issues I outlined earlier, I can’t conclude that with any confidence, and all we can say for sure is that it doesn’t show up in an obvious manner (though at some point I’d love to pick at the play by play data). At the very least, though, I think that’s reason for skepticism next time some ex-jock on TV mentions this.

# Man U and Second Halves

During today’s Aston Villa-Manchester United match, Iain Dowie (the color commentator) mentioned that United’s form is improving and that they are historically a stronger team in the second half of the season, meaning that they may be able to put this season’s troubles behind them and make a run at either the title or a Champions League spot. I didn’t get a chance to record the exact statement, but I decided to check up on it regardless.

I pulled data from the last ten completed Premier League seasons (via statto.com) to evaluate whether there’s any evidence that this is the case. What I chose to focus on was simply the number of first half and second half points for United, with first half and second half defined by number of games played (first 19 vs. last 19). One obvious problem with looking at this so simply is strength of schedule considerations. However, the Premier League, by virtue of playing a double round robin, is pretty close to having a balanced schedule—there is a small amount of difference in the teams one might play, and there are issues involving home and away, rest, and matches in other competitions, but I expect that’s random from year to year.

So, going ahead with this, has Man U actually produced better results in the second half of the season? Well, in the last 10 seasons (2003-04 – 2012-13), they had more points in the second half 4 times, and they did worse in the second half the other 6. (Full results are in the table at the bottom of the post.) The differences here aren’t huge—only a couple of points—but not only is there no statistically significant effect, there isn’t even a hint of an effect. Iain Dowie thus appears to be blowing smoke and gets to be the most recent commentator to aggravate me by spouting facts without support. (The aggravation in this case is compounded by the fact that this “fact” was wrong.)

I’ll close with two oddities in the data. The first is that there are 20 teams that have been in the Premiership for at least 5 of the last 10 years, and exactly one has a significant result at the 5% level for the difference between first half and second half. (Award yourself a cookie if you guessed Birmingham City.) This seems like a textbook example of multiplicity to me.

The second, for the next time you want to throw a real stumper at someone, is that there is one team in the last 16 years (all I could easily pull data for) that had the same goal difference and number of points in the two halves of the season. That team is 2002-03 Birmingham City; I have to imagine that finishing 13th with 48 points and a -8 goal difference is about as dull as a season can get, though they did win both their Derby matches (good for them, no good for this Villa supporter).

Manchester United Results by Half, 2003–2012

| Year | First Half Points | Second Half Points | Total Points | First Half Goal Difference | Second Half Goal Difference | Total Goal Difference |
|------|------------------:|-------------------:|-------------:|---------------------------:|----------------------------:|----------------------:|
| 2003 | 46 | 29 | 75 | 25 | 4 | 29 |
| 2004 | 37 | 40 | 77 | 17 | 15 | 32 |
| 2005 | 41 | 42 | 83 | 20 | 18 | 38 |
| 2006 | 47 | 42 | 89 | 31 | 25 | 56 |
| 2007 | 45 | 42 | 87 | 27 | 31 | 58 |
| 2008 | 41 | 49 | 90 | 22 | 22 | 44 |
| 2009 | 40 | 45 | 85 | 22 | 36 | 58 |
| 2010 | 41 | 39 | 80 | 23 | 18 | 41 |
| 2011 | 45 | 44 | 89 | 32 | 24 | 56 |
| 2012 | 46 | 43 | 89 | 20 | 23 | 43 |
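
The halves comparison in the table reduces to a simple sign test. Using the points columns above:

```python
from math import comb

# Points columns from the table above (seasons 2003-04 through 2012-13)
first_half  = [46, 37, 41, 47, 45, 41, 40, 41, 45, 46]
second_half = [29, 40, 42, 42, 42, 49, 45, 39, 44, 43]

# Count seasons where United took more points in the second half
better = sum(s > f for f, s in zip(first_half, second_half))

# Two-sided exact binomial p-value: sum the probabilities of all outcomes
# at least as unlikely as the observed one under a 50/50 null
n = len(first_half)
pmf = [comb(n, k) * 0.5 ** n for k in range(n + 1)]
p_two = sum(p for k, p in enumerate(pmf) if p <= pmf[better] + 1e-12)

print(better, round(p_two, 3))  # 4 seasons out of 10; nowhere near significant
```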

# Clutch Play and Break Points in Tennis

As a sentimental Roger Federer fan, the last few years have been a little rough, as it’s hard to sustain much hope watching him run into the Nadal/Djokovic buzzsaw again and again (with help from Murray, Tsonga, Del Potro, et al., of course). Though it’s become clear in the last year or so that the wizardry isn’t there anymore, the “struggles”* he’s dealt with since early 2008 are pretty frequently linked to an inability to win the big points.

*Those six years of “struggles,” by the way, arguably surpass the entire career of someone like Andy Roddick. Food for thought.

Tennis may be the sport with the most discourse about “momentum,” “nerves,” “mental strength,” etc. This is in some sense reasonable, as it’s the most prominent sport that leaves an athlete out there by himself with no additional help–even a golfer gets a caddy. Still, there’s an awful lot of rhetoric floating around there about “clutch” players that is rarely, if ever, backed up. (These posts are exceptions, and related to what I do below, though I have some misgivings about their chosen methods.)

The idea of a “clutch” player is that they should raise their game when it counts. In tennis, one easy way of looking at that is to look at break points. So, who steps their game up when playing break points?

Using data that the ATP provides, I was able to pull year-end summary stats for top men’s players from 1991 to the present, which I then aggregated to get career level stats for every man included in the data. Each list only includes some arbitrary number of players, rather than everyone on tour—this causes some complications, which I’ll address later.

I then computed the fraction of break points won and divided by the fraction of non-break point points won for both service points and return points, then averaged the two ratios. This figure gives you the approximate factor that a player ups his game for a break point. Let’s call it clutch ratio, or CR for short.

This is a weird metric, and one that took me some iteration to come up with. I settled on this as a way to incorporate both service and return “clutchness” into one number. It’s split and then averaged to counter the fact that most people in our sample (the top players) will be playing more break points as a returner than a server.
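
Concretely, the metric can be computed like this (the win fractions in the example are invented):

```python
def clutch_ratio(serve_bp, serve_other, return_bp, return_other):
    """Average of (break-point win rate / non-break-point win rate)
    on serve and on return. Inputs are fractions of points won in
    each situation."""
    return (serve_bp / serve_other + return_bp / return_other) / 2

# Invented example: a bit better on serve break points,
# a bit worse on return break points
cr = clutch_ratio(0.66, 0.64, 0.40, 0.41)
print(round(cr, 3))
```

A value above 1 means the player wins break points at a higher rate than his other points; averaging the two ratios keeps the (more frequent) return break points from dominating the number.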

The first interesting thing we see is that the average value of this stat is just a little more than one—roughly 1.015 (i.e. the average player is about 1.5% better in clutch situations), with a reasonably symmetric distribution if you look at the histogram. (As the chart below demonstrates, this hasn’t changed much over time, and indeed the correlation with time is near 0 and insignificant. And I have no idea what happened in 2004 such that everyone somehow did worse that year.) This average value, to me, suggests that we are dealing at least to some extent with adverse selection issues having to do with looking at more successful players. (This could be controlled for with more granular data, so if you know where I can find those, please holler.)

Still, CR, even if it doesn’t perfectly capture clutch (as it focuses on only one issue, only captures the top players and lacks granularity), does at least stab at the question of who raises their game. First, though, I want to specify some things we might expect to see if a) clutch play exists and b) this is a good way to measure it:

• This should be somewhat consistent throughout a career, i.e. a clutch player one year should be clutch again the next. This is pretty self-explanatory, but just to make clear: a player isn’t “clutch” if his improvement isn’t sustained; he’s just lucky. The absence of this consistency is one of the reasons the consensus among baseball folk is that there’s no variation in clutch hitting.
• We’d like to see some connection between success and clutchness, or between having a reputation for being clutch and having a high CR. This is tricky and I want to be careful of circularity, but it would be quite puzzling if the clutchest players we found were journeymen like, I dunno, Igor Andreev, Fabrice Santoro, and Ivo Karlovic.
• As players get older, they get more clutch. This is preeeeeeeeeeetty much pure speculation, but if clutch is a matter of calming down/experience/whatever, that would be one way for it to manifest.

We can tackle these in reverse order. First, there appears to be no improvement year-over-year in a player’s clutch ratio. If we limit to seasons with at least 50 matches played, the probability that a player had a higher clutch ratio in year t+1 than he did in year t is…47.6%. So, no year-to-year improvement, and actually a slight decline in clutch play. That’s fine; it just means clutch is not a skill someone develops. (The flip side is that it could be that younger players are more confident, though I’m highly skeptical of that. Still, the problem with evaluating these intangibles is that their narratives are really easily flipped.)

Now, the relationship between success and CR. Let’s first go with a reductive measure of success: what fraction of games a player won. Looking at either a season basis (50 match minimum, 1006 observations) or career basis (200 match minimum, 152 observations), we see tiny, insignificant correlations between these two figures. Are these huge datasets? No, but the total absence of any effect suggests there’s really no link here between player quality and clutch, assuming my chosen metrics are coherent. (I would have liked to try this with year end rankings, but I couldn’t find them in a convenient format.)

What if we take a more qualitative approach and just look at the most and least clutch players, as well as some well-regarded players? The tables below show some results in that direction.

Best Clutch Ratios

| Rank | Name | Clutch Ratio |
|------|------|-------------:|
| 1 | Jo-Wilfried Tsonga | 1.08 |
| 2 | Kenneth Carlsen | 1.07 |
| 3 | Alexander Volkov | 1.06 |
| 4 | Goran Ivanisevic | 1.05 |
| 5 | Juan Martin Del Potro | 1.05 |
| 6 | Robin Soderling | 1.05 |
| 7 | Jan-Michael Gambill | 1.04 |
| 8 | Nicolas Kiefer | 1.04 |
| 9 | Paul Haarhuis | 1.04 |
| 10 | Fabio Fognini | 1.04 |

Worst Clutch Ratios

| Rank | Name | Clutch Ratio |
|------|------|-------------:|
| 1 | Mariano Zabaleta | 0.97 |
| 2 | Andrea Gaudenzi | 0.97 |
| 3 | Robby Ginepri | 0.98 |
| 4 | Juan Carlos Ferrero | 0.98 |
| 5 | Jonas Bjorkman | 0.98 |
| 6 | Juan Ignacio Chela | 0.98 |
| 7 | Gaston Gaudio | 0.98 |
| 8 | Arnaud Clement | 0.98 |
| 9 | Thomas Enqvist | 0.99 |
| 10 | Younes El Aynaoui | 0.99 |

See any pattern to this? I’ll cop to not recognizing many of the names, but if there’s a pattern I can see it’s that a number of the guys at the top of the list are real big hitters (I would put Tsonga, Soderling, Del Potro, and Ivanisevic in that bucket, at least). Otherwise, it’s not clear that we’re seeing the guys you would expect to be the most clutch players (journeyman Volkov at #3?), nor do I see anything meaningful in the list of least clutch players.

Unfortunately, I didn’t have a really strong prior about who should be at the top of these lists, except perhaps the most successful players—who, as we’ve already established, aren’t the most clutch. The only list of clutch players I could find was a BleacherReport article that used as its “methodology” their performance in majors and deciding sets, and their list doesn’t match with these at all.

Since these lists are missing a lot of big names, I’ve put a few of them in the list below.

Clutch Ratios of Notable Names

| Overall Rank (of 152) | Name | Clutch Ratio |
|----------------------:|------|-------------:|
| 18 | Pete Sampras | 1.03 |
| 21 | Novak Djokovic | 1.03 |
| 26 | Tomas Berdych | 1.03 |
| 71 | Andy Roddick | 1.01 |
| 74 | Andre Agassi | 1.01 |
| 92 | Lleyton Hewitt | 1.01 |
| 122 | Marat Safin | 1.00 |
| 128 | Roger Federer | 1.00 |

In terms of relative rankings, I guess this makes some sense—Nadal and Djokovic are renowned for being battlers, Safin is a headcase, and Federer is “weak in big points,” they say. Still, these are very small differences, and while over a career 1-2% adds up, I think it’s foolish to conclude anything from this list.

Our results thus far give us some odd ideas about who’s clutch, which is a cause for concern, but we haven’t tested the most important aspect of our theory: that this metric should be consistent year over year. To check this, I took every pair of consecutive years in which a player played at least 50 matches and looked at the clutch ratios in years 1 and 2. We would expect there to be some correlation here if, in fact, this stat captures something intrinsic about a player.

As it turns out, we get a correlation of 0.038 here, which is both small and insignificant. Thus, this metric suggests that players are not intrinsically better or worse in break point situations (or at least, it’s not visible in the data as a whole).
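
That consistency check is just a correlation over consecutive-season pairs. A sketch, with invented clutch ratio pairs standing in for the real player-seasons:

```python
import numpy as np

# Invented (year t, year t+1) clutch ratios for qualifying player-seasons
cr_year1 = [1.02, 0.99, 1.01, 1.00, 1.03, 0.98, 1.01, 1.00]
cr_year2 = [0.99, 1.01, 1.00, 1.02, 1.00, 1.01, 0.99, 1.02]

# Off-diagonal entry of the 2x2 correlation matrix is the
# year-over-year correlation
r = np.corrcoef(cr_year1, cr_year2)[0, 1]
print(f"year-over-year r = {r:.3f}")
```

A correlation near zero here, as in the real data, is what you’d expect if clutch ratio is mostly noise rather than a persistent skill.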

What conclusions can we draw from this? Here we run into a common issue with concepts like clutch that are difficult to quantify—when you get no result, is the reason that nothing’s there or that the metric is crappy? In this case, while I don’t think the metric is outstanding, I don’t see any major issues with it other than a lack of granularity. Thus, I’m inclined to believe that in the grand scheme of things, players don’t really step their games up on break point.

Does this mean that clutch isn’t a thing in tennis? Well, no. There are a lot of other possible clutch metrics, some of which are going to be supremely handicapped by sample size issues (Grand Slam performance, e.g.). All told, I certainly won’t write off the idea that clutch is a thing in tennis, but I would want to see significantly more granular data before I formed an opinion one way or another.