import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Import both the NHL_Team_Stats and NHL_Team_R_Stats datasets from the Week 2 assignment into Python.
NHL_Team_Stats=pd.read_csv("../../Data/Week 4/NHL_Team_Stats.csv")
NHL_Team_R_Stats=pd.read_csv("../../Data/Week 4/NHL_Team_R_Stats.csv")
NHL_Team_Stats.head()
import statsmodels.formula.api as sm
At the end of the Week 2 assignment, we observed a linear relationship between total goals for and winning percentage. Let's run a regression where winning percentage is the dependent variable and total goals for is the explanatory variable.
reg1 = sm.ols(formula = 'win_pct ~ goals_for', data= NHL_Team_R_Stats).fit()
After we run a regression, we can use the "summary()" method to obtain a number of statistics from our regression model.
print(reg1.summary())
From the result table, we can see that the dependent variable is winning percentage ("win_pct") and there are 181 observations in this regression. The independent variable is the number of goals for the team ("goals_for"). An intercept is also included in the regression.
The estimated coefficient on goals_for is 0.003. This means that an additional goal scored by the team is associated with an increase of 0.003 in the team's winning percentage. The estimate on the intercept is -0.1781, which taken literally would mean that a team that scored no goals at all would have a winning percentage of -0.1781. As we know, a winning percentage cannot be negative. We get a negative estimate on the intercept because the intercept is an extrapolation beyond the range of our data: no team in our sample recorded zero total goals, so the model is never fit to that region.
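As a quick sanity check on this interpretation, we can pull the estimates out of the fitted model and form a prediction by hand. The params attribute is a standard statsmodels results attribute; the 250 goals below is purely an illustrative value, not a number taken from the dataset.

# Extract the estimated intercept and slope from the fitted model
b0 = reg1.params['Intercept']
b1 = reg1.params['goals_for']
# Predicted winning percentage for a team with, say, 250 total goals
print(b0 + b1 * 250)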
The t-statistic is defined as the estimated coefficient divided by its standard error. If the estimated coefficient is large relative to its standard error, then it is likely to be different from zero.
The p-value is defined as the probability, under the null hypothesis, of obtaining a result at least as extreme as the one actually observed, in this case, the t-statistic from the regression analysis. Comparing the t-statistic with the Student's t distribution, if 95% of the t distribution is closer to the mean than the t-statistic, we have a p-value of 0.05, which is also referred to as a 5% significance level. A p-value of no more than 0.05 (5%) is generally accepted as grounds for rejecting the null hypothesis; in that case we say that the estimated coefficient is statistically significant at the 5% level.
In this regression, the p-value on the goals_for variable is 0.000, which suggests that the estimate is statistically significant even at the 1% level.
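We can check the definition of the t-statistic directly: dividing each coefficient by its standard error reproduces the t-statistics that summary() reports. The params, bse, tvalues, and pvalues attributes used below are standard statsmodels results attributes.

# t-statistic = estimated coefficient / standard error
print(reg1.params / reg1.bse)
# These should match the values reported by summary()
print(reg1.tvalues)
print(reg1.pvalues)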
R-squared measures the goodness of fit of the model. The R-squared of a regression is the fraction of the variation in the dependent variable that is accounted for by the independent variables. R-squared is always between 0 and 1; the larger the R-squared, the more of the variation is accounted for by the regression model.
In this regression, the R-squared is 0.591, which means that approximately 59.1% of the variation in the winning percentage is accounted for by the model.
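To make the "fraction of variation accounted for" definition concrete, we can recompute R-squared from the residuals; resid and rsquared are standard statsmodels results attributes.

# R-squared = 1 - (residual sum of squares) / (total sum of squares)
y = NHL_Team_R_Stats['win_pct']
ss_res = (reg1.resid ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
print(1 - ss_res / ss_tot)  # manual calculation
print(reg1.rsquared)        # value reported by statsmodels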
import seaborn as sns
sns.lmplot(x='goals_against', y='win_pct', data=NHL_Team_R_Stats)
plt.xlabel('Total Goals against')
plt.ylabel('Winning Percentage')
plt.title("Relationship between Goals against and Winning Percentage", fontsize=20)
NHL_Team_R_Stats['goals_against'].corr(NHL_Team_R_Stats['win_pct'])
reg2 = sm.ols(formula = 'win_pct ~ goals_against', data= NHL_Team_R_Stats).fit()
print(reg2.summary())
#Your Code Here
Oftentimes, the outcome variable of interest is affected by multiple factors. We can specify a regression equation where the outcome is a function of more than one explanatory variable.
Let's run a linear regression where winning percentage is a function of both average number of goals for per game and average number of goals against per game.
reg4 = sm.ols(formula = 'win_pct ~ avg_gf+avg_ga', data= NHL_Team_R_Stats).fit()
print(reg4.summary())
Interpret the coefficients.
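One simple sketch to support the interpretation is to print the estimates from reg4 directly; each slope is the estimated change in winning percentage from a one-unit change in that variable, holding the other variable constant.

# Each slope: change in win_pct per one-unit change in that variable,
# holding the other explanatory variable constant
print(reg4.params)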
In the regressions above, we focused on quantitative explanatory variables. We can include categorical variables as explanatory variables as well.
Essentially, when we incorporate a categorical variable, we first transform it into dummy variable(s) that take the value of either 0 or 1. We then include the dummy variable(s) in our regression.
Let's consider the dataset that includes both regular-season and playoff games. In this dataset, the variable "type" captures whether a game is a regular-season game or a playoff game: type=2 means regular-season competition, while type=3 means a playoff game.
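As an illustration of this 0/1 coding applied to "type", pandas' get_dummies function displays the dummy columns explicitly. This is for inspection only; statsmodels performs the encoding automatically in the regression below, omitting one level as the baseline.

# One 0/1 column per level of "type"; a regression includes all but one
# of these columns, the omitted level serving as the baseline
print(pd.get_dummies(NHL_Team_Stats['type'], prefix='type').head())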
We will first convert the variable "type" into a categorical variable.
NHL_Team_Stats['type']=NHL_Team_Stats['type'].astype(object)
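An equivalent route, if you would rather not change the column's dtype, is to wrap the variable in patsy's C() operator inside the formula; this is a sketch that should produce the same fit as reg5 below.

# C(type) tells statsmodels/patsy to treat type as categorical
reg5_alt = sm.ols(formula='win_pct ~ avg_gf + C(type)', data=NHL_Team_Stats).fit()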
Now we can run a regression where winning percentage is a function of average goals for and the type of competition.
reg5 = sm.ols(formula = 'win_pct ~ avg_gf+type', data= NHL_Team_Stats).fit()
print(reg5.summary())
A dummy variable that equals 1 if type=3 (playoff) and 0 if type=2 (regular season) is included in the regression; the regular season serves as the baseline category.
Interpretation: holding average goals for per game constant, the winning percentage in playoff games is 0.0160 (1.6 percentage points) lower than the winning percentage in regular-season games.
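In the summary output this dummy appears under a patsy-generated label, assumed here to be "type[T.3]"; check the printed table for the exact name. The playoff effect can then be read off the params directly.

# Coefficient on the playoff dummy (label 'type[T.3]' is assumed from
# patsy's naming convention; verify against the summary output)
print(reg5.params['type[T.3]'])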
Run a regression where winning percentage is a function of average goals for and average goals against, controlling for the type of competition.
Interpret the coefficients.
#Your Code Here
What if the impact of an independent variable depends on the value of another variable? We can use interaction terms to allow the impact of one variable to differ across the levels of another, categorical, variable.
Let's consider the possibility that average goals for may have a different impact on winning percentage depending on the type of game. We can run a regression of winning percentage on average goals for, the type of game, and the interaction between average goals for and type.
# In a patsy formula, avg_gf*type expands to avg_gf + type + avg_gf:type
reg7 = sm.ols(formula = 'win_pct ~ avg_gf*type', data= NHL_Team_Stats).fit()
print(reg7.summary())
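As a sketch of how to read the interaction: the coefficient on avg_gf is the slope for the baseline category (regular season), and the interaction term shifts that slope for playoff games. The labels below are the patsy-generated names assumed from the formula; verify them against the summary output.

# Slope of avg_gf for regular-season games (baseline category)
slope_regular = reg7.params['avg_gf']
# Playoff slope = baseline slope + interaction coefficient
# (interaction label 'avg_gf:type[T.3]' assumed from patsy's naming)
slope_playoff = slope_regular + reg7.params['avg_gf:type[T.3]']
print(slope_regular, slope_playoff)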
Interpretations
#Your Code Here
#Your Code Here
#Your Code Here
#Your Code Here
#Your Code Here
#Your Code Here