Salaries and Performance in the National Hockey League (NHL)

By looking at a third league modeled on the North American system, we can get a better understanding of the three variables we have used to explain win percentage: salaries, lagged win percentage, and fixed effects.

We follow the same steps as we did for the NBA and MLB.

# As usual, we begin by loading the packages we will need

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf

# Now we load the data

Hockey=pd.read_excel("../../Data/Week 5/NHL pay and performance.xlsx")

Hockey.describe()

Hockey.info()

We can see that we have 301 observations in total covering the seasons 2009 to 2018. This is somewhat more than we had in the NBA case, but much less than in the MLB case. We can now look at the changes in total salary spending across the seasons:

Sumsal = Hockey.groupby(['season'])['salaries'].sum().reset_index().rename(columns={'salaries':'allsal'})
Sumsal

Salary inflation has not been as dramatic in the NHL as in the other leagues we have looked at, but salaries have still increased by more than one third in a decade, which is very unlikely to be caused by improving player quality on average. As with the other leagues, the main drivers of increasing salaries have been increasing team revenues and the capacity of the players to bargain for higher wages.
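The growth in total salaries can be computed directly from a table shaped like Sumsal. The sketch below uses made-up season totals (the frame `sumsal_demo` and its numbers are hypothetical, not the NHL data) to show one way of getting both the overall growth and the season-on-season change:

```python
import pandas as pd

# Hypothetical season totals (made-up numbers, not the NHL data)
sumsal_demo = pd.DataFrame({
    "season": [2009, 2010, 2011],
    "allsal": [100.0, 110.0, 125.0],
})

# Overall growth between the first and last season in the table
first = sumsal_demo["allsal"].iloc[0]
last = sumsal_demo["allsal"].iloc[-1]
growth = last / first - 1

# Season-on-season proportional change (the first season has no predecessor, so it is NaN)
sumsal_demo["pct_change"] = sumsal_demo["allsal"].pct_change()

print(f"Overall growth: {growth:.1%}")
print(sumsal_demo)
```

Applied to the real Sumsal table, the same two steps would show whether the increase over the decade exceeds one third.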

As before, we use pd.merge() to add the aggregate salaries for each season to our original dataframe:

Hockey = pd.merge(Hockey, Sumsal, on=['season'], how='left')
display(Hockey)

We can now create a variable which we call 'relsal', which measures the share of a team's salary spend in the total spending of all teams in that season:

Hockey['relsal']= Hockey['salaries']/Hockey['allsal']
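A quick sanity check on a share variable like relsal is that it sums to one within each season. The sketch below uses a small made-up frame (`demo`, with hypothetical teams and salaries) and builds the season total with groupby().transform() instead of pd.merge(); this is an alternative route to the same variable, not the method used above:

```python
import pandas as pd
import numpy as np

# Hypothetical data: three teams over two seasons (made-up salaries)
demo = pd.DataFrame({
    "season": [2009, 2009, 2009, 2010, 2010, 2010],
    "Team": ["A", "B", "C", "A", "B", "C"],
    "salaries": [50.0, 60.0, 40.0, 55.0, 70.0, 45.0],
})

# transform('sum') broadcasts each season's total back onto every row of that season
demo["allsal"] = demo.groupby("season")["salaries"].transform("sum")
demo["relsal"] = demo["salaries"] / demo["allsal"]

# The shares within each season should add up to 1
shares = demo.groupby("season")["relsal"].sum()
print(shares)
print(np.allclose(shares, 1.0))
```

Because the shares sum to one, the average relsal in any season is simply one divided by the number of teams in that season.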

Before running a regression, it makes sense to look at the relationship between salaries and win percentage on a chart. To do this we use sns.regplot(). Since our argument is that higher relative salaries mean better players, which in turn leads to more wins, we put relsal on the x axis and wpc on the y axis.

sns.regplot(x="relsal", y="wpc", data = Hockey, ci=False)

Self Test

Re-run the regplot with smaller dots so that there is no overlap.

Note that the values of relsal on the x axis tend to vary between around 0.02 and 0.045, while wpc on the y axis tends to vary between around .3 and .7. The variation in relsal is much more like the NBA, while the variation in wpc is more like MLB.

As with all the other leagues it is clear from the data that there is a positive correlation between relsal and wpc, as shown by the regression line which regplot adds to the scatter diagram. We now run a regression using smf.ols() in order to derive the coefficients of the regression and other diagnostic statistics.

wpcsal1_lm = smf.ols(formula = 'wpc ~ relsal', data=Hockey).fit()
print(wpcsal1_lm.summary())

For the NHL we find that the coefficient of relsal is quite similar to that of the NBA. In fact, the regression looks quite similar in terms of the intercept (this is the value of win percentage if relsal were zero) and the R-squared.
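To get a feel for the size of the relsal coefficient, it helps to translate it into win percentage. The sketch below takes the coefficient of about 10.8 that the text quotes for this simple regression and the approximate range of relsal read off the chart (0.02 to 0.045); the variable names are illustrative only:

```python
# relsal coefficient from the simple regression, as quoted in the text
coef_relsal = 10.8

# Approximate range of relsal seen on the regplot
relsal_low, relsal_high = 0.02, 0.045

# Change in expected wpc from moving across the whole observed range of relsal
spread = coef_relsal * (relsal_high - relsal_low)

# Change in expected wpc for one extra percentage point of league salary share
per_point = coef_relsal * 0.01

print(round(spread, 3), round(per_point, 3))
```

On these numbers, moving from the bottom to the top of the spending range is associated with a difference of roughly 0.27 in win percentage.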

Self Test

Create a subset of the data which includes only the 2018 season and run the regression of wpc on relsal. How do the results compare to the results above?

Let's see how the addition of the lagged dependent variable changes our relsal estimate.

# first we sort the values; sort_values() returns a new DataFrame, so we assign the result back

Hockey = Hockey.sort_values(by=['Team','season'], ascending=True)

# this will allow us to inspect all rows in the data

pd.set_option('display.max_rows', 400)
Hockey

# now we create the lagged dependent variable

Hockey['wpc_lag'] = Hockey.groupby('Team')['wpc'].shift(1)
Hockey
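The lag only means "last season's win percentage" if the rows are in season order within each team when shift(1) is applied. The toy example below (a hypothetical frame `toy` with made-up values, deliberately out of order) sketches the sort-then-shift pattern; note that sort_values() returns a new DataFrame, so the result has to be assigned:

```python
import pandas as pd

# Hypothetical toy data: two teams, with seasons deliberately out of order
toy = pd.DataFrame({
    "Team": ["A", "B", "A", "B", "A", "B"],
    "season": [2011, 2010, 2009, 2009, 2010, 2011],
    "wpc": [0.60, 0.45, 0.50, 0.40, 0.55, 0.48],
})

# Assign the sorted result back; calling sort_values() alone leaves toy unchanged
toy = toy.sort_values(by=["Team", "season"]).reset_index(drop=True)

# Within each team, shift(1) moves wpc down one row, which is one season after sorting
toy["wpc_lag"] = toy.groupby("Team")["wpc"].shift(1)

print(toy)
```

Each team's first season has no predecessor, so its wpc_lag is NaN, and those rows are dropped automatically when the lag enters a regression formula.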

We now run our regression again, but adding wpc_lag into the regression equation:

wpcsal2_lm = smf.ols(formula = 'wpc ~ wpc_lag + relsal', data=Hockey).fit()
print(wpcsal2_lm.summary())

Once again, adding the lagged dependent variable is justified both by the statistical significance of the variable and by the increase in R-squared (whether adjusted or not). However, the impact on our main variable of interest, relsal, is relatively small. Its coefficient has fallen from 10.8 to 8.4. This suggests that there was some omitted variable bias, but it is not as great as in the NBA case, while in the MLB case the coefficient of relsal was much smaller to begin with.
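Omitted variable bias can be illustrated with a small simulation. The sketch below is generic and uses simulated data, not the NHL data: `x` plays the role of a regressor such as relsal, `z` the role of an omitted factor correlated with it, and the helper `ols` is a hypothetical convenience function built on NumPy least squares rather than smf.ols():

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated data: z affects y and is correlated with x
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)
y = 1.0 * x + 2.0 * z + rng.normal(size=n)   # true coefficient on x is 1.0

def ols(y, *cols):
    """Least-squares coefficients, with an intercept in position 0."""
    X = np.column_stack([np.ones(len(y)), *cols])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

short = ols(y, x)      # z omitted: the x coefficient absorbs part of z's effect
full = ols(y, x, z)    # z included: the x coefficient is close to its true value

print(round(short[1], 3), round(full[1], 3))
```

In this setup the coefficient on x is overstated when z is left out and falls back towards 1.0 once z is included, which is the same direction of change seen for relsal when wpc_lag is added.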

Let's now see what changes if we include fixed effects:

wpcsal3_lm = smf.ols(formula = 'wpc ~ wpc_lag + relsal + C(Team)', data=Hockey).fit()
print(wpcsal3_lm.summary())

Here we find ten fixed effects that are statistically significant. The fixed effects add considerably to the R-squared of the regression, and only marginally reduce the coefficient of relsal. However, the most striking impact of the fixed effects is to reduce the coefficient of the lagged dependent variable to the point where it is statistically insignificant. This is in contrast to what we found in all three of the other leagues. Because it is statistically insignificant in this version, and since we want to keep the fixed effects, we can drop the lagged dependent variable, which we do in this regression:

wpcsal4_lm = smf.ols(formula = 'wpc ~ relsal + C(Team)', data=Hockey).fit()
print(wpcsal4_lm.summary())

This model has a very simple interpretation: wpc = 0.256 + 8.76 x relsal + fixed effects. If we ignore fixed effects, we can identify the expected win percentage for low, average and high relative spending:

print(0.256 + 8.76*0.02)
print(0.256 + 8.76*0.0325)
print(0.256 + 8.76*0.045)
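The same calculation can be wrapped in a small function that also accepts a team fixed effect. This is only a sketch: `predicted_wpc` is a hypothetical helper, the intercept and slope are the values quoted in the text, and the fixed effect of -0.05 used at the end is an illustrative number, not one read from the regression output:

```python
def predicted_wpc(relsal, team_effect=0.0, intercept=0.256, slope=8.76):
    """Expected win percentage: intercept + slope * relsal + team fixed effect."""
    return intercept + slope * relsal + team_effect

# Low, average and high relative spending, ignoring fixed effects
for label, share in [("low", 0.02), ("average", 0.0325), ("high", 0.045)]:
    print(label, round(predicted_wpc(share), 4))

# Hypothetical fixed effect of -0.05 (illustrative, not taken from the regression output)
print(round(predicted_wpc(0.0325, team_effect=-0.05), 4))
```

For a particular team, the team_effect argument would be replaced by that team's C(Team) coefficient from the regression summary.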

Self Test

Based on the fixed effects regression, calculate the win percentage of:

(a) The Calgary Flames, assuming the value of relsal for the team is 0.03

(b) The Edmonton Oilers, assuming the value of relsal for the team is 0.04

(c) The Montreal Canadiens, assuming the value of relsal for the team is 0.05

Looking at the three predicted values above, we can see that the numbers are slightly skewed: the performance levels are higher than one might expect at each level of spending, and this most likely reflects the impact of the fixed effects. Most of the fixed effects are negative, which would bring teams down to a lower level of performance. This suggests that the teams that are able to dominate do so (or at least did so during this era) because of factors other than wage spending.
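To make concrete what the team fixed effects are, the sketch below builds team dummy variables by hand on simulated data and estimates the regression with NumPy least squares. Everything here is an assumption for illustration: the frame `panel`, the three teams, the true slope of 8.0 and the team shifts are made up, and the dummy-variable construction is a stand-in for what C(Team) does inside the formula, with the first team as the reference category:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Simulated panel (made-up numbers): 3 teams with 200 observations each
teams = np.repeat(["A", "B", "C"], 200)
true_effect = {"A": 0.00, "B": -0.05, "C": 0.04}   # team-specific intercept shifts
relsal = rng.uniform(0.02, 0.045, size=teams.size)
shift = pd.Series(teams).map(true_effect).to_numpy()
wpc = 0.25 + 8.0 * relsal + shift + rng.normal(0, 0.01, size=teams.size)

panel = pd.DataFrame({"Team": teams, "relsal": relsal, "wpc": wpc})

# One dummy column per team except the first, which becomes the reference category
dummies = pd.get_dummies(panel["Team"], drop_first=True).astype(float)

# Design matrix: intercept, relsal, then the team dummies
X = np.column_stack([
    np.ones(len(panel)),
    panel["relsal"].to_numpy(),
    dummies.to_numpy(),
])
coef, *_ = np.linalg.lstsq(X, panel["wpc"].to_numpy(), rcond=None)

names = ["Intercept", "relsal"] + [f"Team[{t}]" for t in dummies.columns]
estimates = pd.Series(coef, index=names)
print(estimates.round(3))
```

Each team coefficient measures that team's shift in win percentage relative to the reference team at the same relsal, so a set of mostly negative fixed effects means the reference team sits near the top of the league.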

Conclusion

This week we have looked at four different leagues and used salary data to assess the impact of wage spending on team performance. In every case we found it had a significant impact, but that impact varied depending on the league. The league system also mattered, as we saw when contrasting the cases of the EPL with the NBA, MLB and NHL.

We also introduced two issues which should always be considered when running regressions: omitted variable bias and heterogeneity.

Finally, we should mention one issue which arises in the context of this type of exercise. The data we study here is "observational", meaning that we collect the data based on what actually happened, during events over which we had no control. This raises the question, "How would outcomes have been different if some particular variable had had a different value?" The regression coefficients produced some answers for us, but how can we be sure that there was not some other factor which we have omitted, which was what really mattered? We can't be sure.

Scientists in laboratories typically don't have this problem - they use "experimental data" which they create in a controlled environment, so that they can control all observable factors. You can use that kind of data to measure the aerobic capacity of an athlete but, since you can't directly control the game, you can never use it to analyze game outcomes.

Some experimental scientists would go so far as to say that we can infer nothing from observational data. This is the logic of a phrase you may have encountered: "correlation is not causation". We have observed correlations in the data using regression analysis, but that does not prove that the links were causal (you could go so far as to argue that win percentage causes salaries to increase, and not vice versa).

We think that is too pessimistic a view. Observational data is certainly more challenging to work with, but it is possible for us to gain insight into the underlying relationships through careful study. It is important to be aware of the pitfalls, and it is important to focus on the logical coherence of the analysis rather than just running regressions. Another way to say this is that one should always have in mind a theory that one is trying to test, and be willing to discard that theory if the data renders it untenable. With careful thought and attention to details, it is possible to generate results which can enhance our understanding.