
Linear Regression

Find the 'best-fit' line to better understand the relationship between two variables.

$$
y = \textcolor{hotpink}{ \underbrace{a+bx}_\text{Systematic Component} } + \textcolor{orange}{ \underbrace{\epsilon}_\text{Error Term} }
$$

| Term | Meaning | Notes |
| ---------- | -------------------------------------- | ------------------------------------------------------------ |
| \(y\) | Output | |
| \(a\) | Vertical intercept | The value of \(y\) when \(x=0\) |
| \(b\) | Slope | The change in \(y\) for every unit increase in \(x\) |
| \(x\) | Input | |
| \(\epsilon\) | Error | We need \(E[\epsilon] = 0\) |

OLS

Ordinary Least Squares

We try to find the parameters \(a\) and \(b\) that minimize the sum of squared errors.

$$
\begin{aligned}
& \min_{a, b} \sum_{i=1}^n \epsilon_i^2 \\
& \epsilon_i = y_i-(a+bx_i)
\end{aligned}
$$
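For a single input variable, this minimization has a well-known closed-form solution: \(\hat b = \frac{\sum (x_i - \bar x)(y_i - \bar y)}{\sum (x_i - \bar x)^2}\) and \(\hat a = \bar y - \hat b \bar x\). The sketch below applies it in plain Python. The function name `fit_ols` and the sample data are made up for illustration.

```python
def fit_ols(xs, ys):
    """Return (a, b) minimizing the sum of squared errors of y = a + b*x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Co-variation of x and y, and variation of x, around their means
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = y_mean - b * x_mean  # intercept
    return a, b


# Illustrative data lying exactly on y = 1 + 2x, so the fit recovers a=1, b=2
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
a, b = fit_ols(xs, ys)
print(a, b)
```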

\(R^2\)

The percentage of variation in \(y\) explained by \(x\).

It shows the relative contribution of the systematic component to the values of \(y\).

$$
\begin{aligned}
& 0 \le R^2 \le 1 \\
& \text{Good Fit} \propto R^2
\end{aligned}
$$

The higher the \(R^2\), the better the fit.
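One common way to compute \(R^2\) is \(1 - \frac{SS_\text{res}}{SS_\text{tot}}\), where \(SS_\text{res}\) is the sum of squared errors of the fitted line and \(SS_\text{tot}\) is the total variation of \(y\) around its mean. A minimal sketch, with a hypothetical helper `r_squared` and made-up data:

```python
def r_squared(xs, ys):
    """Fit y = a + b*x by least squares and return R^2 = 1 - SS_res / SS_tot."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    b = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
    a = y_mean - b * x_mean
    # Variation left unexplained by the line (error term)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    # Total variation of y around its mean
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Illustrative data: the line explains part, but not all, of the variation
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r2 = r_squared(xs, ys)
print(round(r2, 3))
```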

Statistical Significance

How confident we are that \(x\) has an impact on \(y\). (This does not necessarily mean a causal impact.)

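Using a null hypothesis test:

  1. We take
     - \(H_0: b = 0\)
     - \(H_1: b \ne 0\)
  2. \(t\text{-statistic} = \dfrac{\hat b}{\text{SE}(\hat b)}\), where \(\text{SE}(\hat b)\) is the standard error of the estimated slope
  3. Using a normal distribution table, find the two-sided p-value \(p = 2\,P(Z > |t\text{-statistic}|)\) (for small samples, use the \(t\) distribution with \(n-2\) degrees of freedom instead)
  4. Usually
     - Confidence level = 95%
     - Level of significance \(\alpha = 0.05\)
  5. If \(p < \alpha\), \(b\) is statistically significant at \(\alpha\)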
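The test can be sketched in plain Python as follows. This is an illustration under stated assumptions, not a replacement for a statistics library: the standard error uses the usual simple-regression formula \(\text{SE}(\hat b) = \sqrt{\frac{SS_\text{res}/(n-2)}{\sum (x_i - \bar x)^2}}\), the p-value uses the normal approximation from the notes, and the helper name `slope_significance` and the data are made up.

```python
import math
from statistics import NormalDist


def slope_significance(xs, ys):
    """Return (b, se_b, t_stat, p_value) for H0: b = 0 in y = a + b*x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sxx
    a = y_mean - b * x_mean
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    # Standard error of the slope: residual variance (n - 2 df) over Sxx
    se_b = math.sqrt((ss_res / (n - 2)) / sxx)
    t_stat = b / se_b
    # Two-sided p-value from the standard normal distribution.
    # Assumption: normal approximation; with few observations a
    # t distribution with n - 2 degrees of freedom gives a larger p-value.
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return b, se_b, t_stat, p_value


# Illustrative data only (far too few points for the approximation in practice)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b, se_b, t_stat, p_value = slope_significance(xs, ys)
alpha = 0.05
print(t_stat, p_value, p_value < alpha)
```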

statsmodels

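import statsmodels.formula.api as smf

# Fit an OLS model from a formula; rows with missing values are dropped
regression = smf.ols(
    formula="Salary ~ batsman + bowler + batsman*bowler",
    data=IPLPlayer,  # data set containing Salary, batsman, and bowler
    missing="drop",
).fit()

# Coefficients, standard errors, t-statistics, p-values, and R-squared
regression.summary()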

The interaction term batsman*bowler represents an all-rounder (a player who is both a batsman and a bowler).
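To see why the product picks out all-rounders, assume (the notes do not say so explicitly) that batsman and bowler are 0/1 indicator variables. Their product is then 1 only when both are 1. In the formula syntax, `batsman*bowler` expands to `batsman + bowler + batsman:bowler`, where `batsman:bowler` is this product. The player records below are hypothetical.

```python
# Hypothetical players with assumed 0/1 role indicators
players = [
    {"name": "A", "batsman": 1, "bowler": 0},  # pure batsman
    {"name": "B", "batsman": 0, "bowler": 1},  # pure bowler
    {"name": "C", "batsman": 1, "bowler": 1},  # all-rounder
]

# The interaction column is the element-wise product of the two indicators
interaction = [p["batsman"] * p["bowler"] for p in players]

for p, value in zip(players, interaction):
    print(p["name"], value)
```

Under this assumption, the coefficient on the interaction measures how an all-rounder's salary differs from the sum of the separate batsman and bowler effects.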
