The looming data war between front offices and fans

Going back to Henry Chadwick's invention of the box score in the 1850s, statistical summaries have been integral to telling stories about baseball. In the latter half of the 19th century, box scores were a way to explain the narrative of a game to an eager public without TV or photographs, in a time when the only access to sports was at the stadium. Now we are bombarded with a multitude of avenues through which to enjoy baseball, but the role of data is fundamentally the same.

To illustrate my point, consider the following scenario: Let's say you come to me and ask how the game went yesterday because you didn't get the chance to see it. I could say something like, "The Royals beat the A's, 9-8." It might be a factual statement, but it wouldn't be an especially interesting one. A better way to tell the story would be to explain who played, who scored the runs, and when: the fundamental components of a box score. A still better summary of the game might highlight some of the unexpected happenings, like the way that the Royals exposed Jon Lester's inability to stop the running game to the tune of seven steals. A richer description yet might wrap the occurrences of the game up into historical narratives and longer-term trends, noting for example that despite nearly matching the single-season record for innings caught (and presumably suffering under the burden of tremendous fatigue), Sal Perez was able to knock in the walkoff single in the bottom of the 12th inning. All of these details come from data, and help to transform the rote happenings of sport into a story worth listening to.

In the present day, we are on the verge of a data deluge. With nearly every at-bat-level event recorded and preserved going back decades, the modern baseball fan is treated to a cornucopia of additional statistics at a still finer level of analysis: each individual pitch. The output of PITCHf/x has proven invaluable to writing about and (for me) enjoying baseball, not to mention my own research. Soon, we will track not only the path of the ball, but also its speed off the bat, and perhaps the motions of every player on the field (thanks to Statcast).

That information, too, is just a prelude to a host of new technologies. COMMANDf/x, as yet unreleased, traces the motion of the catcher's glove in order to determine the pitcher's command. BIOf/x watches the pitcher's splayed limbs for signs of injury, fatigue, or new mechanics. There are suggestions of still other methods, yet to be developed or applied widely, but with potentially transformative impacts on our ability to enjoy the game. Everywhere we look, new data is being generated, fresh technologies leveraged, novel observations made.

Imagine if, going back to the above scenario, I could tell you that for that fateful pitch Perez struck, Josh Donaldson reacted 0.2 seconds more slowly than he usually does, and that Jason Hammel's mechanics had been ever-so-slightly off for the past three weeks. These details enrich the observed happenings; they do not diminish them (as some columnists have argued). Every additional layer of data enhances one's appreciation for the depth and complexity of the game.

But there is also a countervailing trend to this surge, and it comes, paradoxically, from some of the same people who once clamored the loudest for more and better data. Even as these new forms of inference develop, the statistically minded arms of front offices (staffed now with dozens of alumni of BP and similar sites) are quietly seeking to prevent us from gaining access to them. As Tom Tippett explained at last year's Saberseminar, disseminating additional data to the public risks allowing that same public to learn interesting things about baseball (oh no!). Because published analytical discoveries are equally available to the most- and least-advanced front offices, any work done, for example, here at BP, decreases the competitive advantage of the most sophisticated teams.

So we arrive at a moment in time in which the incentives of front offices and baseball fans are profoundly misaligned. Baseball fans want and deserve the richer, fuller stories that the various novel F/X systems might provide. Front offices, by contrast, respond altogether rationally to the incentives of winning, which favor obfuscation, nigh on cloak-and-dagger levels of secrecy, and the quick hiring of any analyst who threatens to make meaningful progress on intriguing unsolved questions in baseball.

It's hard to blame the front offices for this behavior. They are paid to win, and because of the nature of competitive sport, any advantage in understanding baseball shared with rival teams is no advantage at all. At the same time, we as fans are left in something of a bind. Having complained for years on end that front offices were failing to think optimally about everything from batting orders to talent acquisition, we got our wish, and now perhaps would like them to behave a little bit less optimally, at least with respect to these new technologies.

I don't know how this impasse will resolve itself. The commissioner's office will break the stalemate eventually, and so far seems to have tentatively leaned toward liberating some amount of the Statcast data, which bodes well for the future. Yet there is no guarantee that this release will happen, and the commissioner has been frustratingly vague as to what output of Statcast's cameras will be made public and when.

If the front offices prove more persuasive than the fans, we may see only infrequent summaries of Statcast's insight. Lest you think this possibility remote, consider that COMMANDf/x and HITf/x have been in use for years, while the public at large has acquired only the slightest glimpse of their power (just enough to make us hungry for more). We may never see BIOf/x information revealed to the public. The complexities are multiplied by the addition of third parties who quite reasonably seek to monetize data collection. With the interests of the teams and the data providers allied against the sabermetrically inclined fan, we run the real risk of losing access to the next generation of interesting baseball data. I don't know how Henry Chadwick would feel, but this is a possibility that leaves me melancholy.

We ought not throw our hands in the air and give up, of course. In my experience in the world of sabermetrics, I have been endlessly impressed with the friendliness, tenacity, and resourcefulness of the fans and researchers. As a community, those who love baseball, and by this I mean you, are perpetually ingenious, and we will surely find some way to remedy our lack of data.

One option is simply to collect the data ourselves. (Here, we could draw inspiration from our fellow fans of other sports, who have been successfully crowdsourcing manual data collection for years.) Many of the novel applications developed rely on camera tracking of the players on the field, using computers to analyze the video. The core technology in use, called computer vision, is freely and easily available. Since MLB can't very well deny us video of the games, it would be next to impossible to stop the community from harvesting and scrutinizing the same video. The resulting data might be a pale imitation of what the teams have access to, but it might be better than nothing.

I hope the situation won't develop in such a way as to require this level of investment from the community. I know from having made an attempt to collect bat cracks that manual data collection is a painstaking, error-prone, and difficult process. It would be vastly better if the data were made freely available for all to enjoy.

The end of Henry Chadwick's story finds him having popularized baseball to such a tremendous extent that, despite never having played, managed, or GM'd, he came to be known as the Father of Baseball, and was posthumously inducted into the Hall of Fame. There's a lesson in Chadwick's life, and it's that data is not the enemy of organized sports, but rather its confederate. Data allows the writer (and the fan) to make the happenings on the field come alive, aids her in explaining why all of the minute details of an at-bat are intriguing, and helps to illuminate the arcane aspects of game theory and Nash equilibria that inhabit nearly every decision made by every player.

Speaking of all of those arcane details, I started at BP with a piece last February on understanding pitch variation using entropy, and I was immediately shocked at the degree to which my deranged and opaque ideas were accepted. Coming from a background in academic science, I expected that my articles would be dismissed as hopelessly esoteric, and I would skulk back to my blog to write in obscurity. Instead, you embraced them, which speaks to the hunger for new ways of telling the stories of our favorite pastime. Emboldened by your praise, I attempted ever more devious and preposterous schemes. That I developed an audience for my ramblings is testament to the skill and faith of my editors, Ben Lindbergh and Sam Miller, my collaborators and friends in sabermetrics, and your willingness to entertain my precarious proposals.

Now, I have the great sadness of announcing that this will be my last regularly scheduled article at BP. I'm moving over to FiveThirtyEight, where I will write a weekly column. I still hope to contribute sporadically to BP, but my primary space will be there, and so this piece must stand as my farewell.

I called my column "Moonshot" because I wanted it to be ambitious and directed at big, data-driven steps forward in our understanding of baseball. Once in a while, I think I even accomplished that goal. But beyond that, I hope that my writing helped galvanize and invigorate your appreciation for baseball. The game is rich with seemingly unbounded detail, and data, whether it be from the box scores of the 19th century or the camera tracking systems of the 21st, helps us organize, codify, and tell beautiful stories about the games.