Shot rates are the backbone of a large portion of hockey analytics. A lot of the work done to better understand the game relies on the fact that shot rates are repeatable and have a large impact on future winning. With that in mind, I wanted to explore how shot rates (both for and against) are distributed in the NHL, focussing on the 2015/2016 season. A lot of the inspiration for this was provided by this article from Micah Blake McCurdy, where he explores the relationship between shot suppression and generation for individual skaters. What he finds is that offense and defense (here defined by Corsi For / 60 minutes and Corsi Against / 60 minutes) are largely unrelated. In this piece, I use similar techniques and methodology, but at the team level as opposed to the individual level, to see if the same pattern persists.
However, we begin with a simpler question -- how are single-game shot attempt results (for and against) distributed?
All figures are 5v5 and adjusted for both score and venue (via Corsica.hockey). I used single-game data for every team, looking at home team results only (the away team's results are simply the mirror image: its shot attempts for are the home team's shot attempts against). This means that every regular season NHL game is represented exactly once in this dataset (1230 games overall).
Distributions of Shot Generation and Suppression
Intuitively, one would think that shot generation and suppression are distributed normally (otherwise referred to as the bell curve). This would imply the majority of single-game shot attempt results are observed around the average, with fewer results seen the farther you deviate from the mean (of course, 'around' is dependent on the standard deviation of the distribution). I created histograms for Corsi For / 60 minutes (CF60) and Corsi Against / 60 minutes (CA60) to see if this is accurate, shown below:
A histogram is essentially an estimate of the distribution for a given variable. You simply divide the range of values you've observed into a series of non-overlapping intervals (called bins) and count how many results fall into each bin. In this case, I plotted the density of each bin as opposed to the raw count, but that doesn't change the shape of the plot. The dark blue curve on each plot represents what a normal distribution would look like, with the same mean and standard deviation as the data for each metric.
It can be seen from the histograms that the distributions of CF60 and CA60 broadly follow the normal distribution, though not exactly. The histograms aren't quite symmetric the way we'd expect a normal distribution to be - they extend further to the right than they do to the left. This is not enough to really conclude anything -- histograms are a good starting point, but the shape of the plot is heavily impacted by the bin width chosen.
Quick aside, actually. You may have noticed that the distributions for CF60 and CA60 are slightly different in range and shape. To see this more clearly, I've overlaid their histograms on top of one another:
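To make the binning and density scaling concrete, here is a minimal sketch in plain Python. It is not the code behind the figures in this piece: the `density_histogram` helper is hypothetical, and the CF60 values are synthetic draws (a mean of 55 and standard deviation of 10 are made-up numbers), so only the mechanics carry over.

```python
import random
from statistics import NormalDist, mean, stdev

def density_histogram(values, n_bins=10):
    """Return (bin_edges, densities) scaled so the bars have total area 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp so the maximum value lands in the last bin rather than past it.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    # Density = count / (sample size * bin width).
    densities = [c / (len(values) * width) for c in counts]
    return edges, densities

# Synthetic stand-in for per-game CF60 values (NOT real data).
rng = random.Random(0)
cf60 = [rng.gauss(55, 10) for _ in range(1230)]

edges, densities = density_histogram(cf60, n_bins=20)
bin_width = edges[1] - edges[0]
total_area = sum(d * bin_width for d in densities)

# Normal curve with the same mean and standard deviation as the sample.
fitted = NormalDist(mean(cf60), stdev(cf60))

for left, d in zip(edges, densities):
    mid = left + bin_width / 2
    print(f"{mid:6.1f}  hist={d:.4f}  normal={fitted.pdf(mid):.4f}")
```

Dividing each count by the sample size times the bin width is what turns frequencies into densities. The bar heights change but their relative sizes do not, which is why the shape of the plot is the same either way.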
It can be seen by the eye that CA60 has a heavier right tail, and that more of the CF60 data is concentrated in the middle of the graph. On its face, this seems odd. After all, every shot attempt for is also a shot attempt against for the opponent. So why do these not mirror one another?
The answer goes back to the fact that we're looking at home teams only. The CF60 histogram shows what home teams generated, while the CA60 histogram shows what away teams generated against them, so the two would only match if home and away teams generated shots in exactly the same way. Anyways, back to the distributions.
If we want to evaluate whether these quantities follow a normal distribution, another graphical test we could use is called a Q-Q plot (the Q stands for quantile). Essentially, this plots the quantiles of the distribution we think the data follows (in this case, normal) against the quantiles we've actually observed in the data. If the data matches what we think the distribution is, the points should fall on a straight line (or close to it).
Both CF60 and CA60 exhibit deviation from the straight line we'd expect towards the tails of the distribution. The concave shape of both plots indicates that there is some positive skew -- that is, the right tail is longer than the left tail (this was also observed in the histograms). For reference, here's what we'd expect a true normal distribution to look like on a Q-Q plot. Note that only a few data points deviate from the line at all.
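For readers who want to see what goes into a Q-Q plot, the sketch below builds the points by hand. It is an illustration under assumptions rather than the method used for the plots here: `qq_points` and `pearson_r` are hypothetical helpers, both samples are synthetic, and the `(i + 0.5) / n` plotting position is just one common convention.

```python
import random
from statistics import NormalDist, mean

def qq_points(values):
    """Pair each sorted observation with the matching standard normal quantile."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()
    # (i + 0.5) / n is one common plotting position; other conventions exist.
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

def pearson_r(pairs):
    """Pearson correlation of (x, y) pairs; values near 1 mean a near-straight line."""
    xs, ys = zip(*pairs)
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Synthetic samples (NOT real data): one normal, one deliberately right-skewed.
rng = random.Random(1)
normal_sample = [rng.gauss(55, 10) for _ in range(1230)]
skewed_sample = [rng.lognormvariate(4.0, 0.5) for _ in range(1230)]

r_normal = pearson_r(qq_points(normal_sample))
r_skewed = pearson_r(qq_points(skewed_sample))
print(f"straightness of Q-Q line, normal sample: {r_normal:.4f}")
print(f"straightness of Q-Q line, skewed sample: {r_skewed:.4f}")
```

The correlation of the Q-Q points is used here only as a rough numeric stand-in for "how straight is the line". The plot itself is more informative, since it shows where the departure happens (in this case, the tails).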
So if we're looking at it based on these figures, the intuition I mentioned at the start of this section isn't quite correct. CF60 and CA60 at the team level aren't actually distributed normally (at least not for 2015/2016). That said, I think that a normal distribution is actually a fairly reasonable approximation for these metrics - both CF60 and CA60 roughly follow a normal distribution, and as far as approximations go, it's not totally off-base. In the real world, it's rare to get a distribution that is unequivocally normal. In many cases, you'll get weird eccentricities that arise in the tails, simply because the data is not as dense there and anomalous results stand out more.
Now at this point, you're probably annoyed I made you read 800 words and a bunch of graphs just to tell you that:
1) The intuitive idea we have about how CF60 and CA60 are distributed is not exactly correct
2) Despite that, we can probably still use that intuition in approximating how the quantities are actually distributed
But that's how math goes sometimes. Not every result is earth-shattering. But I promise it gets (slightly) more interesting from here. Another thing we can look at with this data is the relationship between CF60 and CA60. This is mentioned in the HockeyViz article I linked at the top of this piece -- in this case, we'll look at it on the team level, as opposed to individual.
Relationship Between CF60 and CA60 (Team Level)
So the first thing I tend to do whenever I want to see the relationship between two variables is plot it. It's super simple, but there's no real need to get fancy here.
Well, that's somewhat unremarkable, right? It basically seems like an amorphous blob -- there's no apparent strong relation between the two quantities. Personally, I find the scatterplot a little busy and difficult to read, given the number of data points on it. To make a more reader-friendly version, I made a kernel density estimate of the data. In a sense, you can think of this as a smoothed, multi-dimensional histogram -- an estimate of the distribution across more than one quantity.
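As a rough idea of what a two-dimensional kernel density estimate does under the hood, here is a small sketch that places a Gaussian bump on every observation and averages them. The `gaussian_kde_2d` helper is hypothetical, the bandwidth uses Scott's rule of thumb (an assumption, since the bandwidth used for the figure is not stated), and the data are synthetic.

```python
import math
import random
from statistics import mean, stdev

def gaussian_kde_2d(xs, ys):
    """Build a 2D Gaussian kernel density estimate using a product kernel."""
    n = len(xs)
    # Scott's rule of thumb in two dimensions: h = sigma * n ** (-1/6).
    hx = stdev(xs) * n ** (-1 / 6)
    hy = stdev(ys) * n ** (-1 / 6)
    norm = 1.0 / (n * 2.0 * math.pi * hx * hy)

    def density(x, y):
        total = 0.0
        for xi, yi in zip(xs, ys):
            zx = (x - xi) / hx
            zy = (y - yi) / hy
            total += math.exp(-0.5 * (zx * zx + zy * zy))
        return norm * total

    return density

# Synthetic, uncorrelated stand-ins for per-game CF60 and CA60 (NOT real data).
rng = random.Random(2)
cf60 = [rng.gauss(55, 10) for _ in range(500)]
ca60 = [rng.gauss(55, 10) for _ in range(500)]

kde = gaussian_kde_2d(cf60, ca60)
at_centre = kde(mean(cf60), mean(ca60))
at_corner = kde(mean(cf60) + 3 * stdev(cf60), mean(ca60) + 3 * stdev(ca60))
print(f"density at the centre: {at_centre:.6f}")
print(f"density three SDs out on both axes: {at_corner:.6f}")
```

Evaluating `kde` over a grid of (CF60, CA60) values and colouring each cell by its density gives a heat map of the scatterplot.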
I've coloured the figure similar to a heat map, so warm colours indicate areas of high density, and cool colours are areas of low density. From this, there is once again no strong pattern to the figure (which is expected -- the kernel density estimate is a different way of viewing the scatterplot). The correlation between CF60 and CA60 is -0.2, which is a weak negative correlation. This lines up with our intuition - the more you have the puck (and are shooting), the less the opponents can do the same. However, this effect is more muted than one might otherwise think.
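To put a correlation of -0.2 in perspective, squaring it gives 0.04, so in a linear sense only about 4% of the variation in one metric is associated with the other. The sketch below computes a Pearson correlation by hand on synthetic data generated to have a true correlation of -0.2. The `pearson_r` helper and all of the numbers are illustrative assumptions, not the real game data.

```python
import math
import random
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Synthetic CF60/CA60 pairs with a true correlation of -0.2 (NOT real data).
rng = random.Random(3)
target_r = -0.2
n_games = 1230
cf60, ca60 = [], []
for _ in range(n_games):
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    cf60.append(55 + 10 * z1)
    # Mixing z1 into the second variable induces the target correlation.
    ca60.append(55 + 10 * (target_r * z1 + math.sqrt(1 - target_r ** 2) * z2))

r = pearson_r(cf60, ca60)
print(f"r = {r:.3f}, r^2 = {r * r:.3f}")
```

Plotting these synthetic pairs produces much the same amorphous blob: a weak negative tilt that is hard to pick out by eye.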
For context, if offense and defense were totally unrelated, the kernel density estimate would be roughly circular (with each axis scaled to its metric's spread), centred around the mean for CF60 and CA60. On the other hand, if they were perfectly (negatively) correlated, the kernel density estimate would be a straight diagonal line from the top left of the graph to the bottom right. McCurdy details this in his piece as well, which, if it's not clear by now, is highly recommended.
This is similar to the result McCurdy got when he looked at it. Essentially, this means that the patterns McCurdy observed at the individual level persist when aggregated up to the team level -- it's something we'd expect, but in my opinion it's still important to confirm.
The implication of this is that offense and defense are less related than one may think. They seem to be separate skills at both the individual and team level. If they weren't, it's less likely we'd see teams like the Dallas Stars (great offense, bad defense) and the New Jersey Devils (bad offense, great defense).
As always, let me know your thoughts below.