Hockey Analytics Needs More Theory, Not Just More Data

A number of new companies are trying to position themselves as data tracking solutions for NHL teams. The best known of these is likely SportVU, which uses a system of cameras and sensors to track player and puck movement.

More recently, a smaller company called Sportlogiq has come onto the scene with a slightly different solution. For those of you not familiar with Sportlogiq, they specialise in tweeting the same photo of Jeff Petry every day:

[Image: photo of Jeff Petry]

Occasionally they take a break from their, frankly, slightly creepy obsession with the under-rated Habs defenceman, and instead focus on their side-mission as "an analytics company that specializes in microstat tracking and data visualization."

They've recently uploaded a video showing how their technology works, and it's pretty cool.

I like what they've shown off here; it's a really neat use of technology. But I take issue with one of the central claims that CEO Craig Buntin makes.

"None of this is subjective."

About half-way through the video, Buntin says that we should, "Keep in mind that none of this is subjective." You should not keep that in mind, because there's actually quite a bit about what they're doing that is subjective.

I'm going to dig into this in a fair amount of detail, because these are issues that apply broadly to most hockey stats writing (and, for that matter, to a lot of writing about Big Data™).

Let's take a look at part of the screen Buntin shows (this data seems to describe Jack Eichel):

Buntin says that this describes a player's "overall playing ability." Even from this screenshot, it's pretty evident that there are a lot of subjective decisions being made here. Here are some I was able to pick out:

As you'll see in the next screen-shot, each of these bars is a combination of different stats. Deciding how to weight those stats in these bars is a subjective decision.
Deciding which stats to track at all is subjective.
What makes a stat indicative of a player's Competitiveness or Energy? That's subjective. And the same can be said for the more simple categories; what makes a particular stat a way to describe a player's Skating? Totally subjective.
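To make the weighting point concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the two players, the normalised stat values, the stat names (`carry_ins`, `successful_dekes`, `top_speed`) and both sets of weights are made up, and this is not how Sportlogiq builds its bars. It only shows that the same inputs produce a different ranking once someone picks different weights.

```python
# Hypothetical, made-up numbers purely for illustration (not real tracking data).
# Each stat is assumed to be already normalised to a 0-1 scale.
players = {
    "Player A": {"carry_ins": 0.9, "successful_dekes": 0.4, "top_speed": 0.6},
    "Player B": {"carry_ins": 0.5, "successful_dekes": 0.9, "top_speed": 0.7},
}


def skating_score(stats, weights):
    """Weighted sum of stats; the weights are the analyst's choice."""
    return sum(weights[name] * stats[name] for name in weights)


def rank(weights):
    """Return player names sorted from highest to lowest composite score."""
    return sorted(
        players,
        key=lambda name: skating_score(players[name], weights),
        reverse=True,
    )


# Two equally defensible (and equally subjective) weightings.
weights_1 = {"carry_ins": 0.6, "successful_dekes": 0.2, "top_speed": 0.2}
weights_2 = {"carry_ins": 0.2, "successful_dekes": 0.6, "top_speed": 0.2}

print(rank(weights_1))  # Player A ranks first
print(rank(weights_2))  # Player B ranks first
```

Neither weighting is wrong, exactly. But someone had to choose one, and that choice decides who looks like the better skater.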

Speaking of Skating, let's take a look at that section in a bit more detail.

Plenty of subjectivity here too:

What determines which stats are included here? Subjective.
Who decides what qualifies a deke as "successful"? Subjective.
What makes the carry/dump-in ratio a good descriptor of a player's skating? Subjective.

To draw this out a bit more clearly, let's look at a stat that a lot of people seem to like: zone entries.

There is a lot of subjectivity in how we measure zone entries. For example, what counts as a dump-in? Do we include dump-and-changes? How about a play where the forward in the neutral zone chips the puck off the boards around a defenceman and his team-mate is able to recover the puck before it goes behind the goal line? Those decisions will give us different results, and those different results will lead us to different interpretations of player skill.
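Here is a rough sketch of how much those definitions can matter. The entry log, the labels (`dump_and_change`, `chip_and_recover`) and the function name are all hypothetical, invented for this example and not taken from any real tracking project. The same eight entries give very different carry-in to dump-in ratios depending on what we decide counts as what.

```python
# Hypothetical entry log for one player; every label is made up for illustration.
entries = [
    "carry", "carry", "carry",
    "dump",
    "dump_and_change", "dump_and_change",
    "chip_and_recover", "chip_and_recover",
]


def carry_to_dump_ratio(entries, carry_types, dump_types):
    """Ratio of entries counted as carries to entries counted as dump-ins.

    Entries whose label is in neither set are ignored entirely.
    """
    carries = sum(1 for entry in entries if entry in carry_types)
    dumps = sum(1 for entry in entries if entry in dump_types)
    if dumps == 0:
        return float("inf")  # no dump-ins under this definition
    return carries / dumps


# Definition 1: anything that is not a clean carry counts as a dump-in.
ratio_1 = carry_to_dump_ratio(
    entries,
    carry_types={"carry"},
    dump_types={"dump", "dump_and_change", "chip_and_recover"},
)

# Definition 2: ignore dump-and-changes, and treat recovered chips as carries.
ratio_2 = carry_to_dump_ratio(
    entries,
    carry_types={"carry", "chip_and_recover"},
    dump_types={"dump"},
)

print(ratio_1)  # 3 carries / 5 dump-ins
print(ratio_2)  # 5 carries / 1 dump-in
```

Same player, same shifts, and the ratio moves from 0.6 to 5.0 just because we changed the definitions.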

This may sound like I'm being pedantic, but these kinds of issues are really important. A computer is not "objective" in any meaningful sense of the term. A computer simply does what a human tells it to do. It reflects the biases and subjective judgements that are built into it by the people who program its algorithms.

Choosing what to measure, how to measure it, how to value it, and so forth - those are all important decisions that can have huge impacts on what lessons you draw from the output the computer gives you.

We need to be honest about this, because not all data is good or valuable. Just because you have a number doesn't mean you have an insight. I often return to this paragraph from economist Paul Krugman, which explains why data needs to be backed up by theory:

But you can’t be an effective fox just by letting the data speak for itself — because it never does. You use data to inform your analysis, you let it tell you that your pet hypothesis is wrong, but data are never a substitute for hard thinking. If you think the data are speaking for themselves, what you’re really doing is implicit theorizing, which is a really bad idea (because you can’t test your assumptions if you don’t even know what you’re assuming.)

This is every bit as true for hockey as it is for economics. Let's say, as in the screenshot above, you've determined that Jack Eichel's carry-in to dump-in ratio is 26. OK. So now what? Does that matter? Why? What is it telling us? You can't just look at the number and go "Aha! 26! Most players are worse than that! Therefore Eichel is good!" You need to know what specifically it is you're measuring and why and what it's supposed to tell you.

And those kinds of judgements are very subjective. They need to be argued and defended, not assumed.

Good Teams Will Need Good Theorists
I've been saying this for a while, and I like the sound of my own voice so I'll repeat it here: The more data that teams have available to them, the more they'll need people who are good at theory, not numbers, to get any value out of the mountain of numbers that tracking technologies make available. Someone who can figure out what is actually worth looking at will be worth a lot more to a team than someone who can do a lot of calculations.

I strongly believe that a lot of teams are going to waste their time looking at statistics that, in the long run, will turn out to be meaningless.

Let's return to zone entries. Right now we talk a lot about the importance of zone entries, and it's one of the advancements made by hockey bloggers that seems to have reached a lot of NHL teams. But no one really cared about zone entries until Eric Tulsky and Geoff Detweiler of Broad Street Hockey began to look at them back in April 2011.

It may seem obvious now, but back then no one was tracking zone entries and no one seemed to have any idea why you might want to. We had no idea how entries related to Corsi or goal scoring or anything else.

But someone had an idea that other people hadn't had: "This aspect of the game looks important, let's break it down in more detail." They made decisions about how to track that data and what to do with it afterwards that led to a specific set of results, and making other decisions likely would have led to very different results.

This means that it's important for analytics people to be well versed in numbers and in hockey. If you don't know what you ought to be looking for, you're probably never going to find it.

So the teams that benefit the most from player tracking technology aren't going to be the ones whose analysts have the most skill with SQL or R or MATLAB. They will be the teams that have people who can figure out what's worth looking at to begin with.

It will be the teams with the best theorists. It will be the teams that are able to make the wisest subjective decisions, the ones that know that data never speaks for itself and are able to figure out how to make it say something useful.
