import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Import both the NHL_Team_Stats and NHL_Team_R_Stats datasets from the Week 2 assignment into Python.
NHL_Team_Stats=pd.read_csv("../../Data/Week 4/NHL_Team_Stats.csv")
NHL_Team_R_Stats=pd.read_csv("../../Data/Week 4/NHL_Team_R_Stats.csv")
NHL_Team_Stats.head()
import statsmodels.formula.api as sm
At the end of the Week 2 assignment, we observed a linear relationship between total goals for and winning percentage. Let's run a regression where winning percentage is the dependent variable and total goals for is the explanatory variable.
reg1 = sm.ols(formula = 'win_pct ~ goals_for', data= NHL_Team_R_Stats).fit()
After we run a regression, we can use the "summary()" method to obtain a number of statistics from our regression model.
print(reg1.summary())
From the result table, we can see that the dependent variable is winning percentage ("win_pct") and there are 181 observations in this regression. The independent variable is the number of goals for the team ("goals_for"). An intercept is also included in the regression.
The estimated coefficient on goals_for is 0.003. This means that an additional goal scored by the team is associated with an increase of 0.003 in the team's winning percentage. The estimate on the intercept is -0.1781, which taken literally would mean that a team that scored no goals at all would have a winning percentage of -0.1781. As we know, a winning percentage cannot be negative. We get a negative estimate on the intercept because the intercept is an extrapolation beyond the range of our data: no team in our sample recorded zero total goals, so the model is never fit to that region.
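As a quick sanity check on this interpretation, we can pull the estimates out of the fitted model and form a prediction by hand. The params attribute is a standard statsmodels results attribute; the 250 goals below is purely an illustrative value, not a number taken from the dataset.

# Extract the estimated intercept and slope from the fitted model
b0 = reg1.params['Intercept']
b1 = reg1.params['goals_for']
# Predicted winning percentage for a team with, say, 250 total goals
print(b0 + b1 * 250)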
The t-statistic is defined as the estimated coefficient divided by its standard error. If the estimated coefficient is large relative to its standard error, then it is likely to be different from zero.
The p-value is defined as the probability, under the null hypothesis, of obtaining a result at least as extreme as the one actually observed, in this case, the t-statistic from the regression analysis. Comparing the t-statistic with the Student's t distribution, if 95% of the t distribution is closer to the mean than the t-statistic, we have a p-value of 0.05, which is also referred to as a 5% significance level. A p-value of no more than 0.05 (5%) is generally accepted as grounds for rejecting the null hypothesis; in that case we say that the estimated coefficient is statistically significant at the 5% level.
In this regression, the p-value on the goals_for variable is 0.000, which suggests that the estimate is statistically significant even at the 1% level.
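We can check the definition of the t-statistic directly: dividing each coefficient by its standard error reproduces the t-statistics that summary() reports. The params, bse, tvalues, and pvalues attributes used below are standard statsmodels results attributes.

# t-statistic = estimated coefficient / standard error
print(reg1.params / reg1.bse)
# These should match the values reported by summary()
print(reg1.tvalues)
print(reg1.pvalues)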
R-squared measures the goodness of fit of the model. The R-squared of a regression is the fraction of the variation in the dependent variable that is accounted for by the independent variables. R-squared is always between 0 and 1; the larger the R-squared, the more of the variation is accounted for by the regression model.
In this regression, the R-squared is 0.591, which means that approximately 59.1% of the variation in the winning percentage is accounted for by the model.
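To make the "fraction of variation accounted for" definition concrete, we can recompute R-squared from the residuals; resid and rsquared are standard statsmodels results attributes.

# R-squared = 1 - (residual sum of squares) / (total sum of squares)
y = NHL_Team_R_Stats['win_pct']
ss_res = (reg1.resid ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
print(1 - ss_res / ss_tot)  # manual calculation
print(reg1.rsquared)        # value reported by statsmodels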
import seaborn as sns
sns.lmplot(x='goals_against', y='win_pct', data=NHL_Team_R_Stats)
plt.xlabel('Total Goals against')
plt.ylabel('Winning Percentage')
plt.title("Relationship between Goals against and Winning Percentage", fontsize=20)
NHL_Team_R_Stats['goals_against'].corr(NHL_Team_R_Stats['win_pct'])
reg2 = sm.ols(formula = 'win_pct ~ goals_against', data= NHL_Team_R_Stats).fit()
print(reg2.summary())
#Your Code Here
Oftentimes, the outcome variable of interest is affected by multiple factors. We can specify a regression equation where the outcome is a function of more than one explanatory variable.
Let's run a linear regression where winning percentage is a function of both average number of goals for per game and average number of goals against per game.
reg4 = sm.ols(formula = 'win_pct ~ avg_gf+avg_ga', data= NHL_Team_R_Stats).fit()
print(reg4.summary())
Interpret the coefficients.
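One simple sketch to support the interpretation is to print the estimates from reg4 directly; each slope is the estimated change in winning percentage from a one-unit change in that variable, holding the other variable constant.

# Each slope: change in win_pct per one-unit change in that variable,
# holding the other explanatory variable constant
print(reg4.params)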
In the regressions above, we focused on quantitative explanatory variables. We can include categorical variables as explanatory variables as well.
Essentially, when we incorporate a categorical variable, we first transform it into dummy variable(s) that take the value of either 0 or 1. We then include the dummy variable(s) in our regression.
Let's consider the dataset that includes both regular-season and playoff games. In this dataset, the variable "type" captures whether a game is a regular-season game or a playoff game: type=2 means regular-season competition, while type=3 means a playoff game.
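As an illustration of this 0/1 coding applied to "type", pandas' get_dummies function displays the dummy columns explicitly. This is for inspection only; statsmodels performs the encoding automatically in the regression below, omitting one level as the baseline.

# One 0/1 column per level of "type"; a regression includes all but one
# of these columns, the omitted level serving as the baseline
print(pd.get_dummies(NHL_Team_Stats['type'], prefix='type').head())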
We will first convert the variable "type" into a categorical variable.
NHL_Team_Stats['type']=NHL_Team_Stats['type'].astype(object)
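An equivalent route, if you would rather not change the column's dtype, is to wrap the variable in patsy's C() operator inside the formula; this is a sketch that should produce the same fit as reg5 below.

# C(type) tells statsmodels/patsy to treat type as categorical
reg5_alt = sm.ols(formula='win_pct ~ avg_gf + C(type)', data=NHL_Team_Stats).fit()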
Now we can run a regression where winning percentage is a function of average goals for and the type of competition.
reg5 = sm.ols(formula = 'win_pct ~ avg_gf+type', data= NHL_Team_Stats).fit()
print(reg5.summary())
A dummy variable that equals 1 if type=3 (playoff) and 0 if type=2 (regular season) is included in the regression; the regular season serves as the baseline category.
Interpretation: holding average goals for per game constant, the winning percentage in playoff games is 0.0160 (1.6 percentage points) lower than the winning percentage in regular-season games.
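In the summary output this dummy appears under a patsy-generated label, assumed here to be "type[T.3]"; check the printed table for the exact name. The playoff effect can then be read off the params directly.

# Coefficient on the playoff dummy (label 'type[T.3]' is assumed from
# patsy's naming convention; verify against the summary output)
print(reg5.params['type[T.3]'])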
Run a regression where winning percentage is a function of average goals for and average goals against, controlling for the type of competition.
Interpret the coefficients.
#Your Code Here
What if the impact of an independent variable depends on the value of another variable? We can use interaction terms to allow the impact of one variable to differ across the levels of another, categorical, variable.
Let's consider the possibility that average goals for may have a different impact on winning percentage depending on the type of game. We can run a regression of winning percentage on average goals for, the type of game, and the interaction between average goals for and type.
# In a patsy formula, avg_gf*type expands to avg_gf + type + avg_gf:type
reg7 = sm.ols(formula = 'win_pct ~ avg_gf*type', data= NHL_Team_Stats).fit()
print(reg7.summary())
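As a sketch of how to read the interaction: the coefficient on avg_gf is the slope for the baseline category (regular season), and the interaction term shifts that slope for playoff games. The labels below are the patsy-generated names assumed from the formula; verify them against the summary output.

# Slope of avg_gf for regular-season games (baseline category)
slope_regular = reg7.params['avg_gf']
# Playoff slope = baseline slope + interaction coefficient
# (interaction label 'avg_gf:type[T.3]' assumed from patsy's naming)
slope_playoff = slope_regular + reg7.params['avg_gf:type[T.3]']
print(slope_regular, slope_playoff)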
Interpretations
#Your Code Here
#Your Code Here
#Your Code Here
#Your Code Here
#Your Code Here
#Your Code Here