Linear Regression¶
Find the ‘best-fit’ line to better understand the relationship between 2 variables $$ y = \textcolor{hotpink}{ \underbrace{a+bx}\text{Systematic Component} } + \textcolor{orange}{ \underbrace{\epsilon}\text{Error Term} } $$ | Term | Meaning | | | ---------- | -------------------------------------- | ------------------------------------------------------------ | | \(y\) | Output | | | \(a\) | Vertical Intercept | What is the value of \(y\), if \(x=0\) | | \(b\) | Slope | What is the change on \(y\), for every unit increase of \(x\) | | \(X\) | Input | | | \(\epsilon\) | Error | We need \(E[\epsilon] = 0\) |
OLS¶
Ordinary Least Squares
We try to find parameters that minimize the square of errors. $$ \min \sum_{i=1}^n \epsilon_i^2 \ \epsilon = y-(a+bx) $$
\(R^2\)¶
% of variation in \(Y\) explained by \(X\)
It shows the relative contribution of systematic component to values of \(y\) $$ 0 \le R^2 \le 1 \ \text{Good Fit} \propto R^2 $$
Higher the \(R^2\), better the fit.
Statistical Significance¶
How confident that \(x\) has an impact on \(y\). (Doesn’t necessarily mean causal impact)
Using a null hypothesis test
- we take
- \(H_0: b = 0\)
- \(H_1: b \ne 0\)
- \(t \text{-statistics} = \frac{\hat b}{\epsilon}\)
- Using normal distribution table, find \(P(Z > t \text{-statistics})\)
- Usually
- Confidence interval = 95%
- Level of significance \(\alpha = 0.05\)
- If \(p < \alpha\), \(b\) is statistically significant at \(\alpha\)
statsmodes
¶
regression = sm.ols(
formula = "Salary ~ batsman + bowler + batsman*bowler",
data = IPLPlayer,
missing = "drop"
).fit()
regression.summary()
batsman*bowler
m.eans all-rounder (batsman and bowler)