Is There a Hit-by-Pitch Hangover?

One of the things I’ve been curious about recently and have on my list of research questions is what the ramifications of a hit-by-pitch are in terms of injury risk—basically, how much of the value of an HBP does the batter give back through the increased injury risk? Today, though, I’m going to look at something vaguely similar but much simpler: Is an HBP associated with an immediate decrease in player productivity?

To assess this, I looked at how players performed in the plate appearance immediately following their HBP in the same game. (This obviously ignores players who are injured by their HBP and leave the game, but I’m looking for something subtler here.) To evaluate performance, I used wOBA, a rate stat that encapsulates a batter’s overall offensive contributions. There are, however, two obvious effects (and probably other more subtle ones) that mean we can’t only look at the post-HBP wOBA and compare it to league average.

The first is that, ceteris paribus, we expect that a pitcher will do worse the more times he sees a given batter (the so-called “trips through the order penalty”). Since in this context we will never include a batter’s first PA of a game because it couldn’t be preceded by an HBP, we need to adjust for this. The second adjustment is simple selection bias—not every batter has the same likelihood of being hit by a pitch, and if the average batter getting hit by a pitch is better or worse than the overall average batter, we will get a biased estimate of the effect of the HBP. If you don’t care about how I adjusted for this, skip to the next bold text.

I attempted to take those factors into account by computing the expected wOBA as follows. Using Retrosheet play-by-play data for 2004–2012 (the last year I had on hand), for each player with at least 350 PA in a season, I computed their wOBA over all PA that were not that player’s first PA in a given game. (I put the 350 PA condition in to make sure my average wasn’t swayed by low PA players with extreme wOBA values.) I then computed the average wOBA of those players weighted by the number of HBP they had and compared it to the actual post-HBP wOBA put up by this sample of players.

To get a sense of how likely or unlikely any discrepancy would be, I also ran a simulation: I chose random HBPs and then pulled a random plate appearance from the hit batter until I had the same number of post-HBP PA as actually occurred in my nine-year sample, then computed the post-HBP wOBA in that simulated world. I ran 1000 simulations, which gives some sense of how unlikely the observed post-HBP performance is under the null hypothesis that there’s no difference between post-HBP performance and other performance.
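
For concreteness, here’s a minimal sketch of that null simulation in R. It assumes the play-by-play data have already been boiled down to a table of non-first plate appearances; the column names are illustrative rather than the actual Retrosheet-derived layout, and averaging per-PA wOBA weights stands in for the full wOBA calculation.

```r
# Sketch of the null simulation: for each real post-HBP PA, draw a random PA
# from the same batter and average the wOBA values. Columns `batter`, `woba`,
# and `post_hbp` are hypothetical names for data derived from Retrosheet.
simulate_null_woba <- function(pa, n_sims = 1000) {
  hbp_batters <- pa$batter[pa$post_hbp]          # one entry per real post-HBP PA
  replicate(n_sims, {
    drawn <- vapply(hbp_batters, function(b) {
      pool <- pa$woba[pa$batter == b]            # all of this batter's non-first PA
      pool[sample.int(length(pool), 1)]          # pull one at random
    }, numeric(1))
    mean(drawn)                                  # simulated post-HBP wOBA
  })
}

# Where does the observed value fall in the null distribution?
# sims <- simulate_null_woba(pa)
# mean(sims <= observed_post_hbp_woba)
```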

To be honest, though, those adjustments don’t make me super confident that I’ve covered all the necessary bases to find a clean effect; the numbers are still a bit wonky, and this is not such a simple thing to examine that I’m confident I’ve gotten all the noise out. For instance, my method doesn’t filter out park or pitcher effects (i.e. selection bias due to facing a worse pitcher, or a pitcher having an off day), both of which play a meaningful role in these performances and probably introduce additional selection biases I don’t control for.

With all those caveats out of the way, what do we see? In the data, we have an expected post-HBP wOBA of .3464 and an actual post-HBP wOBA of .3423, for an observed difference of about 4 points of wOBA, which is a small but non-negligible difference. However, it’s in the 24th percentile of outcomes according to the simulation, which indicates there’s a hefty chance that it’s just randomness. (Though league average wOBA changed noticeably over the time period I examined, I ran some sensitivity checks and am fairly confident those changes aren’t covering up a real result.)

The main thing (beyond the aforementioned haziness in this analysis) that makes me believe there might be an effect is that the post-walk effect is actually a 2.7 point (i.e. 0.0027) increase in wOBA. If we think that boost is due to pitcher wildness then we would expect the same thing to pop up for the post-HBP plate appearances, and the absence of such an increase suggests that there is a hangover effect. However, to conclude from that that there is a post-HBP swoon seems to be an unreasonably baroque chain of logic given the rest of the evidence, so I’m content to let it go for now.

The main takeaway from all of this is that there’s an observed decrease in expected performance after an HBP, but it’s not particularly large and doesn’t seem likely to have any predictive value. I’m open to the idea that a more sophisticated simulator that includes pitcher and park effects could help detect this effect, but I expect that even if the post-HBP hangover is a real thing, it doesn’t have a major impact.

Leaf-ed Behind by Analytics

As you may have heard, there’s been a whole hullabaloo recently in the hockey world about the Toronto Maple Leafs. Specifically, they had a good run last year and in the beginning of this season that the more numerically-inclined NHL people believed was due to an unsustainably high shooting percentage that covered up their very weak possession metrics. Accordingly, the stats folk predicted substantial regression, which was met with derision by many Leafs fans and most of the team’s brass. The Leafs have played very poorly since that hot streak and have been eliminated from the playoffs; just a few weeks back, they had an 84% chance of making it. (See Sports Club Standings for the fancy chart.)

Unsurprisingly, this has led to much saying of “I told you so” by the stats folk and a lot of grumbling about the many flaws of the current Leafs administration. Deadspin has a great write-up of the whole situation, but one part in particular stood out. This is a quotation from the Leafs’ general manager, Dave Nonis:

“We’re constantly trying to find solid uses for [our analytics budget],” Nonis said. “The last six, seven years, we’ve had a significant dollar amount in our budget for analytics and most of those years we didn’t use it. We couldn’t find a system or a group we felt we could rely on to help us make reasonable decisions.”

[…]

“People run with these stats like they’re something we should pay attention to and make decisions on, and as of right now, very few of them are worth anything to us,” he said at one point during the panel, blaming media and fans for overhyping the analytics currently available.

This represents a mind-boggling lack of imagination on their part. Let’s say they honestly don’t think there’s a good system currently out there that could help them—that’s entirely irrelevant. They should drop the cash and try to build a system from scratch if they don’t like what’s out there.

There are four factors that determine how good the analysis of a given problem is going to be: 1) the analysts’ knowledge of the problem, 2) their knowledge of the tools needed to solve the problem (basically, stats and critical thinking), 3) the availability of the data, and 4) the amount of time the analysts have to work on the problem. People who know about both hockey and data are available in spades; I imagine you can find a few people in every university statistics department and financial firm in Canada that could rise to the task, to name only two places these people might cluster. (They might not know about hockey stats, but the “advanced” hockey stats aren’t terribly complex, so I have faith that anyone who knows both stats and hockey can figure out their metrics.)

For #3: the data aren’t great for hockey, but they exist and will get better with a minimal investment in infrastructure. As for #4, analysts’ having sufficient time is the most important factor in progress and the hardest one to substitute for; conveniently, time is an easy thing for the team to buy (via salary, which they even get a discount on because of the non-monetary benefits of working in hockey). If they post some jobs at a decent salary, they basically have their pick of statistically-oriented hockey fans. If a team gets a couple of smart people and has them working 40-60 hours a week thinking about hockey and bouncing ideas off of each other, they’re going to get some worthwhile stuff no matter what.

Let’s say that budget is $200,000 per year, or a fraction of the minimum player salary. At that level, one good idea from the wonks and they’ve paid for themselves many times over. Even if they don’t find a grand unified theory of hockey, they can help with more discrete analyses and provide a slightly different perspective on decisions, and they’re so low cost that it’s hard to see how they’d hurt a team. (After all, if the team thinks the new ideas are garbage it can ignore them—it’s what was happening in the first place, so no harm done.) The only way Toronto’s decision makes sense is if they think that analytics not only are currently useless but can’t become useful in the next decade or so, and it’s hard to believe that anyone really thinks that way. (The alternative is that they’re scared that the analysts would con the current brass into a faulty decision, but given their skepticism that seems like an unlikely problem.)

Is this perspective a bit self-serving? Yeah, to the extent that I like sports and data and I’d like to work for a team eventually. Regardless, it seems to me that the only ways to justify the Leafs’ attitude are penny-pinching and the belief that non-traditional stats are useless, and if either of those is the case, something has gone very wrong in Toronto.

Do High Sock Players Get “Hosed” by the Umpires?

I was reading one of Baseball Prospectus’s collections this morning and came across an interesting story. It’s a part of baseball lore that Willie Mays started his career on a brutal cold streak (though one punctuated by a long home run off Warren Spahn). Apparently, manager Leo Durocher told Mays toward the end of the slump that he needed to pull his pants up because the pant knees were below Mays’s actual knees, which was costing him strikes. Mays got two hits the day after the change and never looked back.

To me, this is a pretty great story and (to the extent it’s true) a nice example of the attention to detail that experienced athletes and managers are capable of. However, it prompted another question: do uniform details actually affect the way that umpires call the game?

Assessing where a player belts his pants is hard, however, so at this point I’ll have to leave that question on the shelf. What is slightly easier is looking at which hitters wear their socks high and which cover their socks with their baseball pants. The idea is that by clearly delineating the strike zone, the batter will get fairer calls on balls near the bottom of the strike zone than he might otherwise. This isn’t a novel idea—besides the similarity to what Durocher said, it’s also been suggested here, here, and in the comments here—but I wasn’t able to find any studies looking at this. (Two minor league teams in the 1950s did try this with their whole uniforms instead of just the socks, however. The experiments appear to have been short-lived.)

There are basically two ways of looking at the hypothesis: the first is that it will be a straightforward benefit/detriment to the player to hike his socks because the umpire will change his definition of the bottom of the zone; this is what most of the links I cited above would suggest, though they didn’t agree on which direction. I’m somewhat skeptical of this, unless we think that the umpires have a persistent bias for or against certain players and that that bias would be resolved by the player changing how he wears his socks. The second interpretation is that it will make the umpire’s calls more precise, meaning simply that borderline pitches are called more consistently, but that it won’t actually affect where the umpire thinks the bottom of the zone is.

At first blush, this seems like the sort of thing that Pitch F/X would be perfectly suited to, as it gives oodles of information about nearly every pitch thrown in the majors in the last several years. However, it doesn’t include a variable for the hosiery of the batter, so to do a broader study we need additional data. After doing some research and asking around, I wasn’t able to find a good database of players that consistently wear high socks, much less a game-by-game list, which basically ruled out a large-scale Pitch F/X study.

However, I got a very useful suggestion from Paul Lukas, who runs the excellent Uni Watch site. He pointed out that a number of organizations require their minor leaguers to wear high socks and only give the option of covered hose to the major leaguers, providing a natural means of comparison between the two types of players. This will allow us to very broadly test the hypothesis that there is a single direction change in how low strikes are called.

I say very broadly because minor league Pitch F/X data aren’t publicly available, so we’re left with extremely aggregate data. I used data from Minor League Central, which has called strikes and balls for each batter. In theory, if the socks lead to more or fewer calls for the batter at the bottom of the zone, that will show up in the aggregate data and the four high-socked teams (Omaha, Durham, Indianapolis, and Scranton/Wilkes-Barre) will have a different percentage of pitches taken go for strikes. (I found those teams by looking at a sample of clips from the 2013 season; their AA affiliates also require high socks.)  Now, there are a lot of things that could be confounding factors in this analysis:

  1. Players on other teams are allowed to wear their socks high, so this isn’t a straight high socks/no high socks comparison, but rather an all high socks/some high socks comparison. (There’s also a very limited amount of non-compliance on the all socks side, as based on the clips I could find it appears that major leaguers on rehab aren’t bound by the same rules; look at some Derek Jeter highlights with Scranton if you’re curious.)
  2. AAA umpires are prone to more or different errors than major league umpires.
  3. Which pitches are taken is a function of the team makeup and these teams might take more or fewer balls for reasons unrelated to their hose.
  4. This only affects borderline low pitches, and so it will only make up a small fraction of the overall numbers we observe and the impact will be smothered.

I’m inclined to downplay the first and last issues, because if those are enough to suppress the entire difference over the course of a whole season then the practical significance of the change is pretty small. (Furthermore, for #1, from my research it didn’t look like there were many teams with a substantial number of optional socks-showers. Please take that with a grain of salt.)

I don’t really have anything to say about the second point, because it has to do with extrapolation, and for now I’d be fine just looking at AAA. I don’t even have that level of brushoff response for the third point, except to wave my hands and say that I hope it doesn’t matter, given that these numbers reflect pitches thrown by the rest of the league and so will hopefully converge around league average.

So, having substantially caveated my results…what are they? As it turns out, the percentage of pitches the stylish high sock teams took that went for strikes was 30.83% and the equivalent figure for the sartorially challenged was…30.83%. With more than 300,000 pitches thrown in AAA last year, you need to go to the seventh decimal place of the fraction to see a difference. (If this near equality seems off to you, it does to me as well. I checked my figures a couple of ways, but I (obviously) can’t rule out an error here.)
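
If the two percentages had actually differed, a two-proportion test would be the natural way to ask whether the gap is bigger than chance. Here is what that looks like in R; the counts below are placeholders chosen only to illustrate the mechanics, not the real totals from Minor League Central.

```r
# Hypothetical called-strike and taken-pitch totals for the two groups;
# substitute the real figures before drawing any conclusions.
strikes <- c(high_sock = 12332, other = 80158)
taken   <- c(high_sock = 40000, other = 260000)

prop.test(strikes, taken)   # two-sample test for equality of proportions
```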

What this says to me is that it’s pretty unlikely that this ends up mattering, unless there is an effect and it’s exactly cancelled out by the confounding factors listed above (or others I failed to consider). That can’t be ruled out as a possibility, nor can data quality issues, but I’m comfortable saying that the likeliest possibility by a decent margin is that socks don’t lead to more or fewer strikes being called against the batter. (Regardless, I’m open to suggestions for why the effect might be suppressed or analysis based on more granular data I either don’t have access to or couldn’t find.)

What about the accuracy question, i.e. is the bottom of the strike zone called more consistently or correctly for higher-socked players? Due to the lack of nicely collected data, I couldn’t take a broad approach to answering this, but I do want to record an attempt I made regardless. David Wright is known for wearing high socks in day games but covering his hosiery at night, which gives us a natural experiment we can look at for results.

I spent some time looking at the 2013 Pitch F/X data for his day/night splits on taken low pitches and comparing those to the same splits for the Mets as a whole. I tried a few different logistic regression models and also just looked at the contingency tables, and nothing really jumped out in terms of either greater accuracy or precision. I didn’t find any cuts of the data that yielded a sufficiently clean comparison or sample size that I was confident in the results. Since this is a messy use of these data in the first place (it relies on unreliable estimates of the lower edge of a given batter’s strike zone, for instance), I’m going to characterize the analysis as incomplete for now. Given a more rigorous list of which players wear high socks and when, though, I’d love to redo this with more data.
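
For anyone who wants to try this with better data, the basic model I had in mind looks something like the sketch below; the data frame and column names are hypothetical, standing in for a set of taken pitches near the bottom of the zone.

```r
# `low_takes`: one row per taken low pitch, with hypothetical columns
#   called_strike (1/0), day_game (1 = day, i.e. high socks for Wright),
#   and wright (1 = Wright, 0 = other Mets, serving as a control group).
fit <- glm(called_strike ~ day_game * wright,
           data = low_takes, family = binomial)
summary(fit)
# The day_game:wright interaction is the term of interest: does the
# day/night difference in called strikes differ for Wright vs. his teammates?
```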

Overall, though, there isn’t any clear evidence that the socks do influence the strike zone. I will say, though, that this seems like something that a curious team could test by randomly having players (presumably on their minor league teams) wear the socks high and doing this analysis with cleaner data. It might be so silly as to not be worth a shot, but if this is something that can affect the strike zone at all then it could be worthwhile to implement in the long run—if it can partially negate pitch framing, for instance, then that could be quite a big deal.

Adrian Nieto’s Unusual Day

White Sox backup catcher Adrian Nieto has done some unusual things in the last few days. To start with, he made the team. That doesn’t sound like much, but as a Rule 5 draft pick, it’s a bit more meaningful than it might be otherwise, and it’s somewhat unusual because he was jumping from A ball to the majors as a catcher. (Sox GM Rick Hahn said he didn’t know of anyone who’d done it in the last 5+ years.)

Secondly, he pinch ran today against the Twins, which is an activity not usually associated with catchers (even young ones). This probably says more about the Sox bench, as he pinch ran for Paul Konerko, who is the worst baserunner by BsR among big league regulars this decade by a hefty margin. Still: a catcher pinch running! How often does this happen?

More frequently than I thought, as it turns out; there were 1530 instances of a catcher pinch running from 1974 to 2013, or roughly 38 times a year. This is about 4% of all pinch running appearances over that time, so it’s not super common, but it’s not unheard of either. (My source for this is the Lahman database, which is why I have the date cutoff. For transparency’s sake, I called a player a catcher if he played catcher in at least half of his appearances in a given year.)
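
For reference, the catcher definition above is easy to reproduce from the Lahman Appearances table; the sketch below flags catcher-seasons and tallies pinch-running games as a rough way to re-derive the count, with the debut cross-referencing against Retrosheet being a separate step.

```r
library(Lahman)

# A player-season counts as a "catcher" if at least half of his games came
# at catcher; G_all, G_c, and G_pr are columns in the Appearances table.
catchers <- subset(Appearances,
                   yearID >= 1974 & yearID <= 2013 &
                     G_all > 0 & G_c >= 0.5 * G_all)
sum(catchers$G_pr)   # games in which one of these catchers appeared as a pinch runner
```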

If you connect the dots, though, you’ll realize that Nieto is a catcher who made his major league debut as a pinch runner. How often does that happen? As it turns out, just five times previously since 1974 (cross-referencing Retrosheet with Lahman):

  • John Wathan, Royals; May 26, 1976. Wathan entered for pinch hitter Tony Solaita, who had pinch hit for starter Bob Stinson. He came around to score on two hits (though he failed to make it home from third after a flyball to right), but he also grounded into a double play with the bases loaded in the 9th. The Royals lost in extra innings, but he lasted 10 years with them, racking up 5 rWAR.
  • Juan Espino, Yankees; June 25, 1982. Espino pinch ran for starter Butch Wynegar with the Yankees up 11-3 in the 7th and was forced at second immediately. He racked up -0.4 rWAR in 49 games spread across four seasons, all with the Yanks.
  • Doug Davis, Angels; July 8, 1988. This one’s sort of cheating, as Davis entered for third baseman Jack Howell after a hit by pitch and stayed in the game at the hot corner; he scored that time around, then made two outs later in the game. According to the criteria I threw out earlier, though, he counts, as three of the six games he played in that year were at catcher (four of seven lifetime).
  • Gregg Zaun, Orioles; June 24, 1995. Zaun entered for starter Chris Hoiles with the O’s down 3-2 in the 7th. He moved to second on a groundout, then third on a groundout, then scored the tying run on a Brady Anderson home run. Zaun had a successful career as a journeyman, playing for 9 teams in 16 years and averaging less than 1 rWAR per year.
  • Andy Stewart, Royals; September 6, 1997. Ran for starter Mike McFarlane in the 8th and was immediately wiped out on a double play. Stewart only played 5 games in the bigs lifetime.

So, just by scoring a run, Nieto didn’t necessarily have a more successful debut than this cohort. However, as a Sox fan I’m hoping (perhaps unreasonably) that he has a somewhat better career than Davis, Stewart, and Espino; and hey, if he’s a good backup for 10 or more years, that’s just gravy.

One of my favorite things about baseball is the number of quirky things like this that happen, and while this one wasn’t unique, it was pretty close. When you have low expectations for a team (like this year’s White Sox), you just hope the history they make isn’t too embarrassing.

Brackets, Preferences, and the Limits of Data

As you may have heard, it’s March Madness time. If I had to guess, I’d wager that more people make specific, empirically testable predictions this week than any other week of the year. They may be derived without regard to the quality of the teams (the mascot bracket, e.g.), or they might be fairly advanced projections based on as much relevant data as are easily available (Nate Silver’s bracket, for one), but either way we’re talking about probably billions of predictions. (At 63 picks per bracket, we “only” need about 16 million brackets to get to a billion picks, and that doesn’t count all the gambling.)

What compels people to do all of this? Some people do it to win money; if you’re in a small pool, it’s actually feasible that you could win a little scratch. Other people do it because it’s part of their job (Nate Silver, again), or because there might be additional extrinsic benefits (I’d throw the President in that category). This is really a trick question, though: people do it to have fun. More precisely, and to borrow the language of introductory economics, they maximize utility.

The intuitive definition of utility can be viewed as pretty circular (it both explains and is defined by people’s decisions), but it’s useful as a way of encapsulating the notion that people do things for reasons that can’t really be quantified. The notion of unquantifiability, especially unquantifiable preferences, is something people sometimes overlook when discussing the best uses of data. Yelp can tell you which restaurant has the best ratings, but if you hate the food the rating doesn’t do you much good.*

One of the things I don’t like about the proliferation of places letting you simulate the bracket and encouraging you to use that analysis is that it disregards utility. They presume that your interests are either to get the most games correct or (for some of the more sophisticated ones) to win your pool. What that’s missing is that some of us have strongly ingrained preferences that dictate our utility, and that that’s okay. My ideal, when selecting a bracket, is to make it so I have as high a probability as possible of rooting for the winner of a game.

For instance, I don’t think I’ve picked Duke to make it past the Sweet Sixteen in the last 10 or more years. If they get upset before then, my joy in seeing them lose well outweighs the damage to my bracket, especially since most people will have them advancing farther than I do. On the other hand, if I pick them to lose in the first round**, it will just make the sting worse when they win. I’m hedging my emotions, pure and simple.***

This is an extreme example of my rule of thumb when picking teams that I have strong preferences for, which is to have teams I really like/dislike go one round more/less than I would predict to be likely. This reduces the probability that my heart will be abandoned by my bracket. As a pretty passive NCAA fan, I don’t apply this to too many teams besides Duke (and occasionally Illinois, where I’m from) on an annual basis, but I will happily use it with a specific player (Aaron Craft, on the negative side) or team (Wichita State, on the positive side) that is temporarily more charming or loathsome than normal. (This general approach applies to fantasy, as well: I’ve played in a half dozen or so fantasy football leagues over the years, and I’ve yet to have a Packer on my team.)

However, with the way the bracket is structured, this doesn’t necessarily torpedo your chances. Duke has a reasonable shot of doing well, and it’s not super likely that a 12th-seeded midmajor is going to make a run, but my preferred scenarios are not so unlikely that they’re not worth submitting to whichever bracket challenge I’m participating in. This extends how long my bracket stays viable enough that I still care about it, and thus increases the amount of time I enjoy watching the tournament. (At least, I tell myself that. My picks have crashed and burned in the Sweet Sixteen the last couple of years.)

Another wrinkle to this, of course, is that for games I have little or no prior preference in, simply making the pick makes me root for the team I selected. If it’s, say, Washington against Nebraska, I will happily pick the team in the bracket I think is more likely to win and then pull hard for the team. (I’m not immune to wanting my predictions to be valid.) So, the weaker my preferences are, the more I hew toward the pure prediction strategy. Is this capricious? Maybe, but so is sport in general.

I try not to be too normative in my assessments of sports fandom (though I’m skeptical of people who have multiple highly differing brackets), and if your competitive impulses overwhelm your disdain for Duke, that’s just fine. But if you’re like me, pick based on utility. By definition, it’ll be more fun.

* To be fair, my restaurant preferences aren’t unquantifiable, and the same is true for many other tastes. My point is that following everyone else’s numbers won’t necessarily yield you the best strategy for you.

** Meaning the round of 64. I’m not happy with the NCAA for making the decision that led to this footnote.

*** Incidentally, this is one reason I’m a poor poker player. I don’t enjoy playing in the optimal manner enough to actually do it. Thankfully, I recognize this well enough to not play for real stakes, which amusingly makes me play even less optimally from a winnings perspective.

Valuing Goalie Shootout Performance (Again)

I wrote this article a few months ago about goalie shootout performance and concluded two things:

  • Goalies are not interchangeable with respect to the shootout, i.e. there is skill involved in goalie performance.
  • An extra percentage point in shootout save percentage is worth about 0.002 standings points per game. This comes from some admittedly sketchy calculations based on long-term NHL performance, and it’s not something I think is necessarily super accurate.

I’m bringing this up because a couple of other articles have been written about this recently: one by Tom Tango and one much longer one by Michael Lopez. One of the comments over there, from Eric T., mentioned wanting a better sense of the practical significance of the differences in skill, given that Lopez offers an estimate that the difference between the best and worst goalies is worth about 3 standings points per year.

That’s something I was trying to do in the previous post linked above, and the comment prompted me to try to redo it. I made some simple assumptions that align with the ones Lopez made in his follow-up post:

  • Each shot has equal probability of being saved (i.e. shooter quality doesn’t matter, only goalie quality). This probably reduces the volatility in my estimates, but since a goalie should end up facing a representative sample of shooters, I’m not too concerned.
  • The goalie’s team has an equal probability of converting each shot. This, again, probably reduces the variance, but it makes modelling things much simpler, and I think it makes it easier to isolate the effect that goalie performance has on shootout winning percentage.

Given these assumptions, we can compute an exact probability that team 1 wins given its per-shot scoring probability p_1 (i.e., one minus the opposing goalie’s save percentage) and team 2’s p_2. If you don’t care about the math, skip ahead to the table. Let’s call P_{i,j} the probability that team i scores j times in the first three rounds of the shootout:

P_{i,j} = {3 \choose j} p_i^j(1-p_i)^{3-j}

P(\text{Team 1 Wins} \mid p_1, p_2) = \sum_{j=1}^3 \sum_{k=0}^{j-1} P_{1,j} \cdot P_{2,k} + \left( \sum_{j=0}^3 P_{1,j} \cdot P_{2,j} \right) \frac{p_1(1-p_2)}{1-(p_1 p_2+(1-p_1)(1-p_2))}

The first term on the right-hand side is just the sum of the probabilities of the ways that team 1 can win within the first three rounds, e.g. 2 goals for and 1 allowed, or 3 goals for and none allowed. The second term is the sum of all the ways they can win if the first three rounds end in a tie (including 0-0), which can be expressed easily as the sum of a geometric series.
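
Translated into R, the formula looks like the sketch below. The example call at the bottom assumes a league-average shootout save percentage of roughly .67, which is my own illustrative assumption rather than a number from the post.

```r
# Probability that a team with per-shot scoring probability p scores j times
# in the first three rounds.
p_round <- function(p, j) choose(3, j) * p^j * (1 - p)^(3 - j)

# Probability team 1 wins, where p1 and p2 are the two teams' per-shot
# scoring probabilities (one minus the opposing goalie's save percentage).
p_team1_wins <- function(p1, p2) {
  # team 1 outscores team 2 over the first three rounds
  lead <- sum(sapply(1:3, function(j)
    p_round(p1, j) * sum(sapply(0:(j - 1), function(k) p_round(p2, k)))))
  # tied after three rounds (including 0-0), then win the sudden-death series
  tie <- sum(sapply(0:3, function(j) p_round(p1, j) * p_round(p2, j)))
  sudden_death <- p1 * (1 - p2) / (1 - (p1 * p2 + (1 - p1) * (1 - p2)))
  lead + tie * sudden_death
}

# A goalie 6 percentage points better than an assumed .67 league average,
# facing an average opponent: returns roughly 0.58, in line with the +6 row
# of the table below.
p_team1_wins(p1 = 1 - 0.67, p2 = 1 - 0.73)
```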

Ultimately, we don’t really care about the formula so much as the results, so here’s a table and a plot showing the winning percentage for goalies whose shootout save percentage is a given number of percentage points below or above league average when facing a league-average goalie:

Percentage Points Above/Below League Average   Calculated Winning Percentage
-20 26.12
-19 27.14
-18 28.18
-17 29.24
-16 30.31
-15 31.41
-14 32.52
-13 33.66
-12 34.81
-11 35.98
-10 37.17
-9 38.37
-8 39.60
-7 40.84
-6 42.10
-5 43.37
-4 44.67
-3 45.98
-2 47.30
-1 48.64
0 50.00
1 51.37
2 52.76
3 54.16
4 55.58
5 57.01
6 58.45
7 59.91
8 61.38
9 62.86
10 64.35
11 65.85
12 67.37
13 68.89
14 70.42
15 71.96
16 73.51
17 75.06
18 76.62
19 78.19
20 79.76

We would expect most of these figures to be close to league average, so if we borrow Tom Tango’s results (see the link above), we can figure the most and least talented goalies are roughly 6 percentage points away from the mean. The difference in calculated winning percentage between +6 and -6 is about 16 percentage points (0.16), meaning the best goalies are likely to win about sixteen shootouts per hundred more than the worst goalies, assuming both face average competition.

Multiplying this by 13.2%, the historical frequency of games reaching a shootout, gives an estimated benefit of only about 0.02 standings points per game from switching from the worst shootout goalie to the best. For a goalie making 50 starts, that’s only about 1 point added to the team, and that’s assuming the maximum possible impact.

Similarly, moving up this curve by one percentage point appears to be worth about 1.35 wins per hundred shootouts; multiplying that by 13.2% gives a value of 0.0018 standings points per game, which is almost exactly what I got when I did this empirically in the prior post. That leads me to believe the earlier estimate is a lot stronger than I initially thought.

There are obviously a lot of assumptions in play here, including the ones going into my probabilities and Tango’s estimates of true talent, and I’m open to the idea that one or another of them is suppressing the importance of this skill. Overall, though, I’m largely inclined to stick with my prior conclusion: for a difference in shootout performance to be enough to make one goalie better overall than another, it has to be fairly substantial, and the difference in actual save percentage has to be correspondingly small.

The Joy of the Internet, Pt. 2

I wrote one of these posts a while back about trying to figure out which game Bunk and McNulty attend in a Season 3 episode of The Wire. This time, I’m curious about a different game, and we have a bit less information to go on, so it took a bit more digging to find.

The intro to the Drake song “Connect” features the call of a home run being hit. Given that using an actual broadcast clip would probably have required the express written consent of MLB, my guess is that he had the call recorded by an announcer in the studio (as he implies around the 10:30 mark of this video). Still, does it match any games we have on record?

To start, I’m going to assume that this is a major league game, though there’s of course no way of knowing for sure. From the song, all we get is the count, the fact that it was a home run, the direction of the home run, and the name of the outfielder.  The first three are easy to hear, but the fourth is a bit tricky—a few lyrics sites (including the description of the video I linked) list it as “Molina,” but that can’t be the case, as none of the Molinas who’ve played in the bigs played the outfield.

RapGenius, however, lists it as “Revere,” and I’m going to go with that, since Ben Revere is an active major league center fielder and it seems likely that Drake would have sampled a recent game. So, can we find a game that matches all these parameters?

I first checked for only games Revere has played against the Blue Jays, since Drake is from Toronto and the RapGenius notes say (without a source) that the call is from a Jays game. A quick check of Revere’s game logs against the Jays, though, says that he’s never been on the field for a 3-1 homer by a Jay.

What about against any other team? Since checking this by hand wasn’t going to fly (har har), I turned to play-by-play data, available from the always-amazing Retrosheet. With the help of some code from possibly the nerdiest book I own, I was able to filter every play since Revere has joined the league to find only home runs hit to center when Revere was in center and the count was 3-1.
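
I won’t reproduce the book’s code, but the filter itself is conceptually simple. A rough version is below, assuming the Retrosheet event files have been parsed into a data frame with Chadwick-style column names (which may not match your setup exactly), and with Revere’s Retrosheet ID written from memory rather than looked up.

```r
# EVENT_CD 23 is a home run; BALLS_CT/STRIKES_CT give the count; POS8_FLD_ID
# is the center fielder's Retrosheet ID. "reveb001" is my guess at Ben
# Revere's ID; verify it before trusting this. Confirming that the homer went
# to center still requires the hit location / event text, which isn't shown.
candidates <- subset(events,
                     EVENT_CD == 23 & BALLS_CT == 3 & STRIKES_CT == 1 &
                       POS8_FLD_ID == "reveb001")
```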

Somewhat magically, there was only one: a first inning shot by Carlos Gomez against the Twins in 2011. The video is here, for reference. I managed to find the Twins’ TV call via MLB.TV, and the Brewers’ broadcast team did the call on the MLB.com video, and (unsurprisingly) neither fits the sample, though I didn’t go looking for the radio calls. Still, the home run is such that it wouldn’t be surprising if either one of the radio calls matched what Drake used, or if it was close and Drake had it rerecorded in such a way that preserved the details of the play.

So, probably through dumb luck, Drake managed to pick a unique play to sample for his track. But even though it’s a baseball sample, I still click back to “Hold On, We’re Going Home” damn near every time I listen to the album.

Throne of Games (Most Played, Specifically)

I was trawling for some stats on hockey-reference (whence most of the hockey facts in this post) the other day and ran into something unexpected: Bill Guerin’s 2000-01 season. Specifically, Guerin led the league with 85 games played. Which wouldn’t have seemed so odd, except for the fact that the season is 82 games long.

How to explain this? It turns out there are two unusual things happening here. Perhaps obviously, Guerin was traded midseason, and the receiving team had games in hand on the trading team. Thus, Guerin finished with three games more than the “max” possible.

Now, is this the most anyone’s racked up? Like all good questions, the answer to that is “it depends.” Two players—Bob Kudelski in 93-94 and Jimmy Carson in 92-93—played 86 games, but those were during the short span of the 1990s when each team played 84 games in a season, so while they played more games than Guerin, Guerin played in more games relative to his team. (A couple of other players have played 84 since the switch to 82 games, among them everyone’s favorite Vogue intern, Sean Avery.)

What about going back farther? The season was 80 games from 1974–75 to 1991–92, and one player in that time managed to rack up 83: the unknown-to-me Brad Marsh, in 1981-82, who tops Guerin at least on a percentage level. Going back to the 76- and 78-game era from 1968-74, we find someone else who tops Guerin and Marsh, specifically Ross Lonsberry, who racked up 82 games (4 over the team maximum) with the Kings and Flyers in 1971–72. (Note that Lonsberry and Marsh don’t have game logs listed at hockey-reference, so I can’t verify if there was any particularly funny business going on.) I couldn’t find anybody who did that during the 70 game seasons of the Original Six era, and given how silly this investigation is to begin with, I’m content to leave it at that.

What if we go to other sports? This would be tricky in football, and I expect it would require being traded on a bye week. Indeed, according to the results at pro-football-reference, nobody has played more than the max games, at least since the league went to a 14-game schedule.

In baseball, it certainly seems possible to get over the max, but actually teasing this out of the data is tricky for the following two reasons (a rough first-pass query is sketched after the list):

  • Tiebreaker games are counted as regular season games. Maury Wills holds the raw record for most games played with 165 after playing in a three game playoff for the Dodgers in 1962.
  • Ties that were replayed. I started running into this a lot in some of the older data: games would be called after a certain number of innings with the score tied due to darkness or rain or some unexplained reason, and the stats would be counted, but the game wouldn’t count in the standings. Baseball is weird like that, and no matter how frustrating this can be as a researcher, it was one of the things that attracted me to the sport in the first place.
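
Here is the rough first pass I mentioned, using the Lahman database rather than the FanGraphs and Baseball-Reference searches described below; it only spots candidates, and the tiebreaker and replayed-tie cases still have to be weeded out by hand.

```r
library(Lahman)

# Sum games across stints for each player-season and flag anyone above the
# 162-game schedule; 1961 is used as a rough start of the 162-game era.
games <- aggregate(G ~ playerID + yearID, data = Batting, FUN = sum)
subset(games, yearID >= 1961 & G > 162)
```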

So, those are my excuses if you find any errors in what I’m about to present; I used FanGraphs and baseball-reference to spot candidates. I believe there have only been a few cases of baseball players playing more than the scheduled number of games when none of the games fell into the two problem categories mentioned above. The most recent is Todd Zeile, who, while he didn’t play in a tied game, nevertheless benefited from one. In 1996, he was traded from the Phillies to the Orioles after the O’s had stumbled into a tie, thus giving him 163 games played, all of which counted.

Possibly more impressive is Willie Montanez, who played with the Giants and Braves in 1976. He racked up 163 games with no ties, but arguably the most impressive part is that, unlike Zeile, Montanez missed several opportunities to take it even farther. He missed one game before being traded, then one game during the trade, and then two games after he was traded. (He was only able to make it to 163 because the Braves had several games in hand on the Giants at the time of the trade.)

The only other player to achieve this feat in the 162 game era is Frank Taveras, who in 1979 played in 164 games; however, one of those was a tie, meaning that according to my twisted system he only gets credit for 163. He, like Montanez, missed an opportunity, as he had one game off after getting traded.

Those are the only three in the 162-game era. While I don’t want to bother looking in-depth at every year of the 154-game era due to the volume of cases to filter, one particular player stands out. Ralph Kiner managed to put up 158 games with only one tie in 1953, making him by my count the only baseball player since 1901 to play three meaningful games more than his team did.

Now, I’ve sort of buried the lede here, because it turns out that the NBA has the real winners in this category. This isn’t surprising, as the greater number of days off between games means it’s easier for teams’ schedules to get out of whack and it’s more likely that one player will play in every game. Thus, a whole host of players have played more than 82 games, led by Walt Bellamy, who put up 88 in 1968-69. While one player got to 87 since, and a few more to 86 and 85, Bellamy stands alone atop the leaderboard in this particular category. (That fact made it into at least one of his obituaries.)

Since Bellamy is the only person I’ve run across to get 6 extra games in a season and nobody from any of the other sports managed even 5, I’m inclined to say that he’s the modern, cross-sport holder of this nearly meaningless record for most games played adjusted for season length.

Ending on a tangent: one of the things I like about sports records in general, and the sillier ones in particular, is trying to figure out when they are likely to fall. For instance, Cy Young won 511 games playing a sport so different from contemporary baseball that, barring a massive structural change, nobody can come within 100 games of that record. On the other hand, with strikeouts and tolerance for strikeouts at an all-time high, several hitter-side strikeout records are in serious danger (and have been broken repeatedly over the last 15 years).

This one seems a little harder to predict, because there are factors pointed in different directions. On the one hand, players are theoretically in better shape than ever, meaning that they are more likely to be able to make it through the season, and being able to play every game is a basic prerequisite for playing more than every game. On the other, the sports are a lot more organized, which would intuitively seem to decrease the ease of moving to a team with meaningful games in hand on one’s prior employer. Anecdotally, I would also guess that teams are less likely to let players play through a minor injury (hurting the chances). The real wild card is the frequency of in-season trades—I honestly have no rigorous idea of which direction that’s trending.

So, do I think someone can take Bellamy’s throne? I think it’s unlikely, due to the organizational factors laid out above, but I’ll still hold out hope that someone can do it, or at least that new players will keep joining the bizarre fraternity of men who have played more games than their teams.

Uncertainty and Pitching Statistics

One of the things that I occasionally get frustrated by in sports statistics is the focus on estimates without presenting the associated uncertainty. While small sample size is often bandied about as an explanation for unusual results, one of the first things presented in statistics courses is the notion of a confidence interval. The simplest explanation of a confidence interval is that of a margin of error—you take the data and the degree of certainty you want, and it will give you a range covering likely values of the parameter you are interested in. It tacitly includes the sample size and gives you an implicit indication of how trustworthy the results are.

The most common version of this is the 95% confidence interval, which, based on some data, gives a range that will contain the actual value 95% of the time. For instance, say we poll a random sample of 100 people and ask them if they are right-handed. If 90 are right-handed, the math gives us a 95% CI of (0.820, 0.948). We can draw additional samples and get more intervals; if we were to continue doing this, 95% of such intervals will contain the true percentage we are looking for. (How the math behind this works is a topic for another time, and indeed, I’m trying to wave away as much of it as possible in this post.)
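
As a quick illustration of the handedness example, R’s built-in functions produce this sort of interval. The exact endpoints differ slightly depending on the method, and both land close to the numbers quoted above.

```r
# 90 right-handers out of a sample of 100
prop.test(x = 90, n = 100)$conf.int    # Wilson score interval with continuity correction
binom.test(x = 90, n = 100)$conf.int   # exact (Clopper-Pearson) interval
```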

One big caveat I want to mention before I get into my application of this principle is that there are a lot of assumptions that go into producing these mathematical estimates that don’t hold strictly in baseball. For instance, we assume that our data are a random sample of a single, well-defined population. However, if we use pitcher data from a given year, we know that the batters they face won’t be random, nor will the circumstances they face them under. Furthermore, any extrapolation of this interval is a bit trickier, because confidence intervals are usually employed in estimating parameters that are comparatively stable. In baseball, by contrast, a player’s talent level will change from year to year, and since we usually estimate something using a single year’s worth of data, to interpret our factors we have to take into account not only new random results but also a change in the underlying parameters.

(Hopefully all of that made sense, but if it didn’t and you’re still reading, just try to treat the numbers below as the margin of error on the figures we’re looking at, and realize that some of our interpretations need to be a bit looser than is ideal.)

For this post, I wanted to look at how much margin of error is in FIP, which is one of the more common sabermetric stats to evaluate pitchers. It stands for Fielding Independent Pitching, and is based only on walks, strikeouts, and home runs—all events that don’t depend on the defense (hence the name). It’s also scaled so that the numbers are comparable to ERA. For more on FIP, see the Fangraphs page here.

One of the reasons I was prompted to start with FIP is that a common modification of the stat is to render it as xFIP (x for Expected). xFIP recognizes that FIP can be comparatively volatile because it depends highly on the number of home runs a pitcher gives up, which, as rare events, can bounce around a lot even in a medium size sample with no change in talent. (They also partially depend on park factors.) xFIP replaces the HR component of FIP with the expected number of HR they would have given up if they had allowed the same number of flyballs but had a league average home run to fly ball ratio.

Since xFIP already embeds the idea that FIP is volatile, I was curious how volatile FIP actually is, and how much of that volatility is taken care of by xFIP. To do this, I decided to simulate a large number of seasons for a set of pitchers to estimate the distribution of a pitcher’s FIP given a fixed talent level, then look at how wide a range of results shows up in the simulated seasons; effectively, I’m rerunning seasons with pitchers whose talent level won’t change, but whose luck will.

To provide an example, say we have a pitcher who faces 800 batters, with a line of 20 HR, 250 fly balls (FB), 50 BB, and 250 K. We then assume that, if that pitcher were to face another 800 batters, each has a 250/800 chance of striking out, a 50/800 chance of walking, a 250/800 chance of hitting a fly ball, and a 20/250 chance of each fly ball being a HR. Plugging those probabilities into a random number generator, we get a new line for a player with the same underlying talent—maybe it’ll be 256 K, 45 BB, and 246 FB, of which 24 were HR. From these values, we recompute the FIP. Do this 10,000 times, and we get an idea of how much FIP can bounce around.

For my sample of pitchers to test, I took every pitcher season with at least 50 IP since 2002, the first year for which the number of fly balls was available. I then computed 10,000 FIPs for each pitcher season and took the 97.5th percentile and 2.5th percentile, which give the spread that the middle 95% of the data fall in—in other words, our confidence interval.

(Nitty-gritty aside: One methodological detail that’s mostly important for replication purposes is that pitchers that gave up 0 HR in the relevant season were treated as having given up 0.5 HR; otherwise, there’s not actually any variation on that component. The 0.5 is somewhat arbitrary but, in my experience, is a standard small sample correction for things like odds ratios and chi-squared tests.)
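
Here is a compact sketch of that resampling procedure for a single pitcher, using the worked example above. The FIP constant of 3.10 and the innings-pitched value are my own placeholder assumptions (the constant varies by year, and IP is held fixed at the observed value), and HBP are ignored for simplicity.

```r
# Resample a pitcher's season n_sims times, holding the observed rates
# (the "talent") fixed, and return the distribution of simulated FIPs.
simulate_fip <- function(tbf, k, bb, fb, hr, ip, n_sims = 10000,
                         fip_constant = 3.10) {
  hr_eff <- if (hr == 0) 0.5 else hr   # the small-sample correction from the post
  replicate(n_sims, {
    # each batter faced strikes out, walks, hits a fly ball, or does something else
    counts <- rmultinom(1, tbf, c(k, bb, fb, tbf - k - bb - fb) / tbf)
    k_sim <- counts[1]; bb_sim <- counts[2]; fb_sim <- counts[3]
    hr_sim <- rbinom(1, fb_sim, hr_eff / fb)      # HR as a share of fly balls
    (13 * hr_sim + 3 * bb_sim - 2 * k_sim) / ip + fip_constant
  })
}

# The example line from the text (800 TBF, 250 K, 50 BB, 250 FB, 20 HR);
# ip = 195 is an illustrative guess, not a figure from the post.
fips <- simulate_fip(tbf = 800, k = 250, bb = 50, fb = 250, hr = 20, ip = 195)
quantile(fips, c(0.025, 0.975))   # the 95% interval used as the margin of error
```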

One thing to realize is that these confidence intervals needn’t be symmetric, and in fact they basically never are—the portion of the confidence interval above the pitcher’s actual FIP is almost always larger than the portion below. For instance, in 2011 Bartolo Colon had an actual FIP of 3.83, but his confidence interval is (3.09, 4.64), and the gap from 3.83 to 4.64 is larger than the gap from 3.09 to 3.83. The reasons for this aren’t terribly important without going into details of the binomial distribution, and anyhow, the asymmetry of the interval is rarely very large, so I’m going to use half the length of the interval as my metric for volatility (the margin of error, as it were); for Colon, that’s (4.64 – 3.09) / 2 = 0.775.

So, how big are these intervals? To me, at least, they are surprisingly large. I put some plots below, but even for the pitchers with the most IP, our margin of error is around 0.5 runs, which is pretty substantial (roughly half a standard deviation in FIP, for reference). For pitchers with only about 150 IP, it’s in the 0.8 range, which is about a standard deviation in FIP. A 0.8 gap in FIP is nothing to sneeze at—it’s the difference between 2013 Clayton Kershaw and 2013 Zack Greinke, or between 2013 Zack Greinke and 2013 Scott Feldman. (Side note: Clayton Kershaw is really damned good.)

As a side note, I was concerned when I first got these numbers that the intervals are too wide and overestimate the volatility. Because we can’t repeat seasons, I can’t think of a good way to test volatility, but I did look at how many times a pitcher’s FIP confidence interval contained his actual FIP from the next year. There are some selection issues with this measure (as a pitcher has to post 50 IP in consecutive years to be counted), but about 71% of follow-up season FIPs fall into the previous season’s CI. This may be a bit surprising, as our CI is supposed to include the actual value 95% of the time, but given the amount of volatility in baseball performance due to changes in skill levels, I would expect to see that the intervals diverge from actual values fairly frequently. Though this doesn’t confirm that my estimated intervals aren’t too wide, the magnitude of difference suggests to me it’s unlikely that that is our problem.

Given how sample sizes work, it’s unsurprising that the margin of error decreases substantially as IP increases. Unfortunately, there’s no neat function to get volatility from IP, as it depends strongly on the values of the FIP components as well. If we wanted to, we could construct a model of some sort, but a model whose inputs come from simulations seemed to me to be straying a bit far from the real world.

As I only want a rough rule of thumb, I picked a couple of round IP cutoffs and computed the average margin of error for every pitcher within 15 IP of that cutoff. The 15 IP is arbitrary, but it’s not a huge amount for a starting pitcher (2–3 starts) and ensures we can get a substantial number of pitchers included in each interval. The average FIP margin of error for pitchers within 15 IP of the cutoffs is presented below; beneath that are scatterplots comparing IP to margin of error.

Mean Margin of Error for Pitchers by Innings Pitched
Approximate IP FIP Margin of Error Number of Pitchers
65 1.16 1747
100 0.99 428
150 0.81 300
200 0.66 532
250 0.54 37

(Scatterplots: FIP margin of error vs. IP and xFIP margin of error vs. IP.)

Note that due to construction I didn’t include anyone with less than 50 IP, and the most innings pitched in my sample is 266, so these cutoffs span the range of the data. I also looked at the median values, and there is no substantive difference.

This post has been fairly exploratory in nature, but I wanted to answer one specific question: given that the purpose of xFIP is to stabilize FIP, how much of FIP’s volatility is removed by using xFIP as an ERA estimator instead?

This can be evaluated a few different ways. First, the mean xFIP margin of error in my sample is about 0.54, while the mean FIP margin of error is 0.97; that difference is highly significant. This means there is actually a difference between the two, but looking at the average absolute difference of 0.43 is pretty meaningless—obviously a pitcher with an FIP margin of error of 0.5 can’t have a negative margin of error. Thus, we instead look at the percentage difference, which gives us the figure that 43% of the volatility in FIP is removed when using xFIP instead. (The median number is 45%, for reference.)

Finally, here is the above table showing average margins of error by IP, but this time with xFIP as well; note that the differences are all in the 42-48% range.

Mean Margin of Error for Pitchers by Innings Pitched
Approximate IP FIP Margin of Error xFIP Margin of Error Number of Pitchers
65 1.16 0.67 1747
100 0.99 0.53 428
150 0.81 0.43 300
200 0.66 0.36 532
250 0.54 0.31 37

Thus, we see that about 45% of the FIP volatility is stripped away by using xFIP. I’m sort of burying the lede here, but if you want a firm takeaway from this post, there it is.

I want to conclude this somewhat wonkish piece by clarifying a couple of things. First, these numbers largely apply to season-level data; career FIP stats will be much more stable, though the utility of using a rate stat over an entire career may be limited depending on the situation.

Second, this volatility is not something that is unique to FIP—it could be applied to basically any of the stats that we bandy about on a daily basis. I chose to look at FIP partially for its simplicity and partially because people have already looked into its instability (hence xFIP). In the future, I’d like to apply this to other stats as well: SIERA, for instance, comes to mind as something directly comparable to FIP, and since FanGraphs’ WAR is computed using FIP, my estimates in this piece can be applied to those numbers as well.

Third, the diminished volatility of xFIP isn’t necessarily a reason to prefer that particular stat. If a pitcher has an established track record of consistently allowing more/fewer HR on fly balls than the average pitcher, that information is important and should be considered. One alternative is to use the pitcher’s career HR/FB in lieu of league average, which gives some of the benefits of a larger sample size while also considering the pitcher’s true talent, though that’s a bit more involved in terms of aggregating data.

Since I got to rambling and this post is long on caveats relative to substance, here’s the tl;dr:

  • Even if you think FIP estimates a pitcher’s true talent level accurately, random variation means that there’s a lot of volatility in the statistic.
  • If you want a rough estimate for how much volatility there is, see the tables above.
  • Using xFIP instead of FIP shrinks the margin of error by about 45%.
  • This is not an indictment of FIP as a stat, but rather a reminder that a lot of weird stuff can happen in a baseball season, especially for pitchers.

Principals of Hitter Categorization

(Note: The apparent typo in the title is deliberate.)

In my experience with introductory statistics classes, both ones I’ve taken and ones I’ve heard about, they typically have two primary phases. The second involves hypothesis testing and regression, which entail trying to evaluate the statistical evidence regarding well-formulated questions. (Well, in an ideal world the questions are well-formulated. Not always the case, as I bitched about on Twitter recently.) This is the more challenging, mathematically sophisticated part of the course, and for those reasons it’s probably the one that people don’t remember quite so well.

What’s the first part? It tends to involve lots of summary statistics and plotting—means, scatterplots, interquartile ranges, all of that good stuff that one does to try to get a handle on what’s going on in the data. Ideally, some intuition regarding stats and data is getting taught here, but that (at least in my experience) is pretty hard to teach in a class. Because this part is more introductory and less complicated, I think this portion of statistics—which is called exploratory data analysis, though there are some aspects of the definition I’m glossing over—can get short shrift when people discuss cool stuff one can do with statistics (though data visualization is an important counterpoint here).

A slightly more complex technique one can do as part of exploratory data analysis is principal component analysis (PCA), which is a way of redefining a data set’s variables based on the correlations present therein. While a technical explanation can be found elsewhere, the basic gist is that PCA allows us to combine variables that are related within the data so that we can pack as much explanatory power as possible into them.

One classic application of this is to athletes’ scores in the decathlon in the Olympics (see example here). There are 10 events, which can be clustered into groups of similar events like the 100 meters and 400 meters and the shot put and discus. If we want to describe the two most important factors contributing to an athlete’s success, we might subjectively guess something like “running ability” and “throwing skill.” PCA can use the data to give us numerical definitions of the two most important factors determining the variation in the data, and we can explore interpretations of those factors in terms of our intuition about the event.

So, what if we take this idea and apply it to baseball hitting data? This would allow us to derive some new factors that explain a lot of the variation in hitting, and by using those factors judiciously we can compare different batters. This idea is not terribly novel (here are examples of some previous work), but I haven’t seen anyone taking quite the approach I use here. For this post, I’m focused more on what I will call hitting style, i.e. I’d like to set aside similarity based on more traditional results (e.g. home runs, which is the sort of similarity Baseball-Reference uses) in favor of lower-order data, namely a batter’s batted ball profile (e.g. line drive percentage and home run to fly ball ratio). However, the next step is certainly to see how these components correlate with traditional measures of power, for instance Isolated Slugging (ISO).

So, I pulled career-level data from FanGraphs for all batters with at least 1000 PA since 2002 (when batted ball data began being collected) on the following categories: line drive rate (LD%), ground ball rate (GB%), outfield fly ball rate (FB%), infield fly ball rate (IFFB%), home run/fly ball ratio (HR/FB), walk rate (BB%), and strikeout rate (K%). (See report here.) (I considered using infield hit rate as well, but it doesn’t fit in with the rest of these things; it’s more about speed and less about hitting, after all.)

I then ran the PCA on these data in R, and here are the first two components, i.e. the two weightings that together explain as much of the variation in the data as possible. (Things get a bit harder to interpret when you add a third dimension.) All data are normalized so that the coefficients are comparable, and it’s most helpful to focus on the signs and relative magnitudes: if one variable is weighted 0.6 and another -0.3, the takeaway is that the first is twice as important to the component as the second and pushes the component in the opposite direction.
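If you want to reproduce something like the table that follows, here's a minimal sketch of the R workflow. It assumes the FanGraphs report has been exported to a CSV called batted_ball.csv, and the column names below are placeholders of my own rather than the exact FanGraphs export headers.

  # One row per batter, with the seven rates as numeric columns
  hitters <- read.csv("batted_ball.csv", stringsAsFactors = FALSE)
  vars <- c("LD_pct", "GB_pct", "FB_pct", "IFFB_pct", "HR_FB", "BB_pct", "K_pct")

  # Center and scale each variable so the component weights are comparable
  pca <- prcomp(hitters[, vars], center = TRUE, scale. = TRUE)

  summary(pca)                     # proportion of variance explained by each component
  round(pca$rotation[, 1:2], 3)    # the weights shown in the table below
  head(pca$x[, 1:2])               # each batter's scores on the first two components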

Weights for First Two Principal Components
Variable    PC1     PC2
LD%       -0.030   0.676
GB%       -0.459   0.084
FB%        0.526   0.093
IFFB%     -0.067  -0.671
HR/FB      0.459  -0.137
BB%        0.375   0.205
K%         0.394  -0.126

The first two components explain 39% and 22%, respectively, of the overall variation in our data. (The next two explain 16% and 10%, respectively, so they are still important.) This means, basically, that we can explain about 60% of the variation in a given batter’s batted ball profile with only these two parameters. (I have all seven components with their importance in a table at the bottom of the post. It’s also worth noting that, because the later components explain less variation, player scores on those dimensions have less spread and players are clustered close together on them.)

Arguably the whole point of this exercise is to come up with a reasonable interpretation for these components, so it’s worth it for you to take a look at the values and the interplay between them. I would describe the two components (which we should really think of as axes) as follows:

  1. The first is a continuum: slap hitters (guys who make a lot of contact, don’t walk much, and hit mostly ground balls, with few fly balls and few home runs) sit at the negative end, while big boppers, the three true outcomes guys, sit at the positive end, as they walk a lot, strike out a lot, and hit more fly balls. This interpretation is borne out by the players with the largest magnitude values for this component (found below). For lack of a better term, let’s call this component BSF, for “Big Stick Factor.”
  2. The second measures, basically, what some people might call “line drive power.” It puts line drives and infield flies at opposite ends, so it captures a propensity to square the ball up rather than pop it up. It also rewards guys with good batting eyes, since walk rate enters positively and strikeout rate negatively. I think of it as assessing an old-fashioned view of what makes a good hitter: lots of contact and line drives, with less uppercutting and thus fewer pop-ups. Let’s call it LDP, for “Line Drive Power.” (I’m open to suggestions on both names.)

Here are some tables showing the top and bottom 10 for both BSF and LDP:

Extreme Values for BSF
Rank Name PC1
1 Russell Branyan 5.338
2 Barry Bonds 5.257
3 Adam Dunn 4.768
4 Jack Cust 4.535
5 Ryan Howard 4.296
6 Jim Thome 4.278
7 Jason Giambi 4.237
8 Frank Thomas 4.206
9 Jim Edmonds 4.114
10 Mark Reynolds 3.890
633 Aaron Miles -3.312
634 Cesar Izturis -3.397
635 Einar Diaz -3.518
636 Ichiro Suzuki -3.523
637 Rey Sanchez -3.893
638 Luis Castillo -4.013
639 Juan Pierre -4.267
640 Wilson Valdez -4.270
641 Ben Revere -5.095
642 Joey Gathright -5.164
Extreme Values for LDP
Rank Name PC2
1 Cory Sullivan 4.292
2 Matt Carpenter 4.052
3 Joey Votto 3.779
4 Joe Mauer 3.255
5 Ruben Tejada 3.079
6 Todd Helton 3.065
7 Julio Franco 2.933
8 Jason Castro 2.780
9 Mark Loretta 2.772
10 Alex Avila 2.747
633 Alexi Casilla -2.482
634 Rocco Baldelli -2.619
635 Mark Trumbo -2.810
636 Nolan Reimold -2.932
637 Marcus Thames -3.013
638 Tony Batista -3.016
639 Scott Hairston -3.041
640 Eric Byrnes -3.198
641 Jayson Nix -3.408
642 Jeff Mathis -3.668

These actually map pretty closely onto what some of our preexisting ideas might have been: the guys with the highest BSF are some of the archetypal three true outcomes players, while the guys with high LDP are guys we think of as good hitters with “doubles power,” as it were. It’s also interesting to note that these are not entirely correlated with hitter quality, as there are some mediocre players near the top of each list (though most of the players at the bottom aren’t too great). That suggests to me that this actually did a pretty decent job of capturing style, rather than just quality (though obviously it’s easier to observe someone’s style when they actually have strengths).

Now, another thing to note: while my qualitative descriptions might suggest that BSF and LDP are related, by construction there is zero correlation between the two sets of scores, so these really are two separate axes of hitting style. Consider the plot below of BSF vs. LDP:

PCA Cloud

And this plot, which isolates some of the more extreme values:

Big Values

One final thing for this post: given that we have plotted these like coordinates, we can use the standard measure of distance between two points as a measure of similarity. For this, I’m going to change tacks slightly and use only the first two components. The two players most like each other in this sample form a slightly unlikely pair: Marlon Byrd, with coordinates (-0.756, 0.395), and Carlos Ruiz (-0.755, 0.397).

As you can see from their batted ball profiles below, the two don’t appear to be hugely similar. I spent a decent amount of time playing around with this; if you increase the number of components used from two to three or more, the similar players look much more similar in terms of these statistics. However, that gets away from the point of PCA, which is to abstract away from the raw data a bit. Thus, these pairs of similar players are players who have very similar amounts of BSF and LDP, rather than players who have the most similar statistics overall.

Comparison of Ruiz and Byrd
Name LD% GB% FB% IFFB% HR/FB BB% K%
Carlos Ruiz 0.198 0.455 0.255 0.092 0.074 0.098 0.111
Marlon Byrd 0.206 0.471 0.241 0.082 0.093 0.064 0.180

Another pair that’s approximately as close as Ruiz and Byrd is Mark Teahen (-0.420,-0.491) and Akinori Iwamura (-0.421,-0.490), with the third place pair being Yorvit Torrealba (-1.919, -0.500) and Eric Young (-1.909, -0.497), who are seven times farther apart than the first two pairs.

Which players stand out as outliers? The answer is not altogether surprising if you look at the labelled charts above, though not all of them are labelled. (Also, be wary of the scale: the graph is a bit squished, so many players are farther apart numerically than they appear visually.) Joey Gathright turns out to be by far the most unusual player in our data: the distance to his closest comp, Einar Diaz, is more than 1000x the distance from Ruiz to Byrd, more than thirteen times the average distance to a player’s nearest neighbor, and more than eleven standard deviations above that average nearest neighbor distance.

In this case, though, having a unique style doesn’t appear to be beneficial. You’ll note Gathright is at the bottom of the BSF list, and he’s pretty far down the LDP list as well, meaning that he somehow stumbled into a seven-year career despite having no power of any sort. Given that he posted an extremely pedestrian 0.77 bWAR per 150 games (roughly half as valuable as an average player), hit just one home run in 452 games, and had the 13th lowest slugging percentage of any qualifying non-pitcher since 1990, we probably shouldn’t be surprised that there’s nobody who’s quite like him.

The rest of the players on the outliers list are the ones you’d expect, guys with extreme values for one or both statistics: Joey Votto, Barry Bonds, Cory Sullivan, Matt Carpenter, and Mark Reynolds. Votto is the second biggest outlier, and he’s less than two thirds as far from his nearest neighbor (Todd Helton) as Gathright is from his. Two things to notice here:

  • To reiterate what I just said about Gathright, style doesn’t necessarily correlate with results. Cory Sullivan hit a lot of line drives (28.2%, the largest value in my sample—the mean is 20.1%) and popped out infrequently (3%, the mean is 10.1%). His closest comps are Matt Carpenter and Joe Mauer, which is pretty good company. And yet, he finished as a replacement level player with no power. Baseball is weird.
  • Many of the most extreme outliers are players where we are missing a big chunk of their careers, either because they haven’t actually had them yet or because the data are unavailable. Given that there’s some research indicating that various power-related statistics change with age, I suspect we’ll see some regression to the mean for guys like Votto and Carpenter. (For instance, I imagine Bonds’s profile would look quite different if it included the first 16 years of his career.)

This chart shows the three tightest pairs of players and the six biggest outliers:

New Comps

This is a bit of a lengthy post without an obvious single point, but, as I said at the beginning, exploratory data analysis can be plenty interesting on its own, and I think this turned into a cool way of classifying hitters based on certain styles. An obvious extension is to find some way to merge both results and styles into one PCA (essentially combining what I did with the Bill James/BR Similarity Score mentioned above), but I suspect that’s a big question, and one for another time.

If you’re curious, here’s a link to a public Google Doc with my principal components, raw data, and nearest distances and neighbors, and below is the promised table of PCA breakdown:

Weights and Explanatory Power of Principal Components
                        PC1     PC2     PC3     PC4     PC5     PC6     PC7
LD%                    -0.030   0.676  -0.299  -0.043   0.629   0.105  -0.210
GB%                    -0.459   0.084   0.593   0.086  -0.044   0.020  -0.648
FB%                     0.526   0.093  -0.288  -0.226  -0.434  -0.014  -0.626
IFFB%                  -0.067  -0.671  -0.373   0.247   0.442  -0.071  -0.379
HR/FB                   0.459  -0.137   0.347   0.113   0.214   0.769  -0.000
BB%                     0.375   0.205   0.156   0.808  -0.012  -0.373  -0.000
K%                      0.394  -0.126   0.437  -0.461   0.415  -0.503   0.000
Proportion of Variance  0.394   0.218   0.163   0.102   0.069   0.053   0.000
Cumulative Proportion   0.394   0.612   0.775   0.877   0.947   1.000   1.000