Statistics¶
Statistical concepts such as parameter estimation, bias, and variance help us reason about generalization, over-fitting, and under-fitting
IID: Independent & identically distributed
Estimation Types¶
Estimation Type | Regression Output | Classification Output |
---|---|---|
Point | \(E(y \vert X)\) | \(E(c_i \vert X)\) |
Probabilistic | \(E(y \vert X)\), \(\sigma^2(y \vert X)\) | \(P(c_i \vert X)\) (this is the likelihood, not the model confidence!) |
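To make the distinction concrete, here is a minimal sketch (the class scores are made up) contrasting a point estimate with the probabilistic likelihood output of a classifier:

```python
# Minimal sketch (assumed scores): point vs probabilistic output for a 3-class classifier.
import numpy as np

logits = np.array([2.0, 0.5, -1.0])              # raw scores for classes c_0, c_1, c_2

# Probabilistic output: likelihood P(c_i | X) via softmax
likelihood = np.exp(logits) / np.exp(logits).sum()

# Point output: collapse the distribution to the single most likely class
point_estimate = int(np.argmax(likelihood))

print("P(c_i | X):", np.round(likelihood, 3))    # ≈ [0.786 0.175 0.039]
print("point estimate (class index):", point_estimate)
```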
Likelihood vs Confidence¶
- Likelihood is the probability of the classification being of a certain class
    - Unreliable if the input is unlike anything seen in training
- Confidence is the model’s confidence that the likelihood is correct
Bias & Variance¶
 | Expected deviation |
---|---|
Bias | from the true value |
Variance | caused by any particular sample |
Function Estimation¶
Estimation of the relationship between input & target variables, i.e., predict a variable \(y\) given input \(x\):
$$ y = f(x) + \epsilon $$
where \(\epsilon\) is the Bayes error
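A minimal numpy sketch with assumed synthetic data (a quadratic true \(f\) and Gaussian noise) illustrating function estimation and the irreducible Bayes error:

```python
# Minimal sketch (assumed data): estimate f in y = f(x) + eps from noisy samples.
# The noise eps is the irreducible (Bayes) error; no estimator of f can remove it.
import numpy as np

def f_true(x):
    return 1.0 + 2.0 * x - 1.5 * x**2                  # assumed true function

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)
y = f_true(x) + rng.normal(scale=0.1, size=x.size)     # eps ~ N(0, 0.1^2) = Bayes error

# Function estimation: fit a degree-2 polynomial as f_hat
coeffs = np.polyfit(x, y, deg=2)
f_hat = np.polyval(coeffs, x)

mse = np.mean((y - f_hat) ** 2)
print(f"training MSE ≈ {mse:.4f} (cannot go below Var(eps) = 0.01 on average)")
```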
Statistical Learning Theory¶
Helps us understand generalization performance when we only observe the training set, through assumptions about the training and test sets:
- i.i.d.
    - Training & test data arise from the same process
    - Observations in each data set are independent
    - Training set and test set are identically distributed
    - \(X\) values are fixed in repeated sampling
- No specification bias
    - We need to use the correct functional form, one that is theoretically consistent
- Unbiasedness
    - Independent variables should not be correlated with each other
    - If \(\vert \text{correlation} \vert > 0.5\) between 2 independent variables, we drop one of them
- High DOF
    - Degrees of freedom \(= n - k\), where
        - \(n =\) number of observations
        - \(k =\) number of independent variables
    - DOF \(\to 0\) leads to overfitting
- High coefficient of variation in \(X\)
    - We need more variation in the values of \(X\)
    - For example, the Indian stock market is very volatile, but the UAE's is not, so it is hard to use the latter as an independent variable. Similarly, we cannot use the exchange rate in the UAE as a regressor, as it is pegged to the US dollar
- No variable interaction
    - Variable interaction: the effect of \(x_i\) on \(y\) depends on \(x_j\)
    - Solution: add interaction terms (see the sketch at the end of this section)
- No collinearity
    - No multicollinearity
- Homoskedasticity
    - Constant variance: \(\sigma^2 (y_i \vert x_i) = \text{constant} \quad \forall i\)
- There is no measurement error \(\delta_i\) in \(X\) or \(y\)
    - \(X_\text{measured} = X_\text{true}\)
    - \(y_\text{measured} = y_\text{true}\)
    - \(E(\delta_i)=0\)
    - \(\text{var}(\delta_i \vert x_i) = \sigma^2 (\delta_i \vert x_i) = \text{constant} \quad \forall i\)
    - \(\text{Cov}(\delta_i, x_i) = 0, \quad \text{Cov}(\delta_i, u_i) = 0\)
    - If there is measurement error, we need to perform a correction
- If there is autocorrelation in a time series, we have to incorporate the lagged value of the dependent variable as an explanatory variable of itself
- The variance of the distribution of potential outcomes (the range of the distribution) stays the same over time
    - \(\sigma^2 (x) = \sigma^2(x-\bar x)\); otherwise the variable is volatile and hard to predict, we cannot use OLS, and we have to use weighted regression instead
    - If the variance decreases over time, the value of \(y\) is more reliable as training data
    - If the variance increases over time, the value of \(y\) is less reliable as training data
    - We use volatility modelling (modelling the variance) to predict the pattern in the variance
However, statistical learning theory is rarely used in practice with deep learning, as
- the bounds are loose
- it is difficult to determine the capacity of deep learning algorithms
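Referring back to the variable-interaction assumption above, here is a minimal sketch with assumed synthetic data (made-up coefficients and Gaussian noise) showing how an interaction term \(x_1 x_2\) is added to the design matrix:

```python
# Minimal sketch (assumed data): adding an interaction term x1*x2 to a linear
# model, so the effect of x1 on y can depend on the level of x2.
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 0.5 * x2 + 1.5 * x1 * x2 + rng.normal(scale=0.3, size=n)

# Design matrices without and with the interaction term
X_plain = np.column_stack([np.ones(n), x1, x2])
X_inter = np.column_stack([np.ones(n), x1, x2, x1 * x2])

beta_plain, *_ = np.linalg.lstsq(X_plain, y, rcond=None)
beta_inter, *_ = np.linalg.lstsq(X_inter, y, rcond=None)

print("without interaction:", np.round(beta_plain, 2))  # cannot capture the x1*x2 effect
print("with interaction:   ", np.round(beta_inter, 2))  # ≈ [1.0, 2.0, 0.5, 1.5]
```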
Input Error¶
For higher-order models, errors in \(x\) will look like heteroskedasticity
Attenuation Bias¶
High measurement error \(\delta\) and random noise \(u\) cause our estimated coefficients to be lower in magnitude than the true coefficients
Hence, for a straight-line model, error in \(x\) will bias the OLS estimate of the slope towards zero $$ \begin{aligned} \lim_{n \to \infty} \hat \beta &= \beta \times \text{SNR} \\ \text{Signal-Noise Ratio: SNR} &= \dfrac{\sigma^2_x}{\sigma^2_x \textcolor{hotpink}{+ \sigma^2_u + \sigma^2_\delta}} \end{aligned} $$
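A minimal simulation sketch (assumed synthetic data, covering only the measurement-error term \(\sigma^2_\delta\)) showing the OLS slope shrinking towards \(\beta \times \text{SNR}\):

```python
# Minimal sketch (assumed data): measurement error in x shrinks the OLS slope
# towards zero by the factor SNR = var(x) / (var(x) + var(measurement error)).
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
beta = 2.0
sigma_x, sigma_delta = 1.0, 0.5           # signal and measurement-error std devs

x_true = rng.normal(scale=sigma_x, size=n)
y = beta * x_true + rng.normal(scale=0.3, size=n)
x_measured = x_true + rng.normal(scale=sigma_delta, size=n)   # error-in-variables

cov = np.cov(x_measured, y)
slope_hat = cov[0, 1] / cov[0, 0]
snr = sigma_x**2 / (sigma_x**2 + sigma_delta**2)

print(f"OLS slope on measured x: {slope_hat:.3f}")
print(f"beta * SNR             : {beta * snr:.3f}")   # ≈ 2 * 0.8 = 1.6
```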
Errors-in-Measurement Correction¶
This can be applied to
- any learning algorithm
- regressors or response variable(s)
Let’s say the true values of a regressor variable \(X_1\) were measured as \(X_1^*\) with measurement error \(\delta_1\), where \(\delta_1 \not\sim N(0, 1)\). Here, we cannot ignore the error.
Step 1: Measurement Error¶
Use an appropriate distribution to model the measurement error. The error is not necessarily random.
For example, we may assume that the error follows a skew-normal distribution with variance \(\sigma^2_{X_1}\) signifying the uncertainty.
Step 2: Measurement¶
Model the relationship between the error and the measured value.
For example, if we assume that the error is additive: \(X_1^* = X_1 + \delta_1\)
Step 3: Model¶
Since \(X_1^*\) is what we have, but we want the mapping with \(X_1\), we substitute \(X_1 = X_1^* - \delta_1\) into the model (see the example below).
Example¶
Example: Modelling with linear regression using a regressor with measurement error $$ \begin{aligned} \hat y &= \theta_0 + \theta_1 X_1 \\ &= \theta_0 + \theta_1 (X_1^* - \delta_1) \\ &= \theta_0 + \theta_1 X_1^* \textcolor{hotpink}{- \theta_1 \delta_1} \end{aligned} $$
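A minimal sketch of one possible correction (a method-of-moments / reliability-ratio adjustment, not necessarily the only approach), assuming an additive error with known variance and made-up synthetic data:

```python
# Minimal sketch (assumed setup): correcting an OLS slope for an additive
# measurement error X1* = X1 + delta1 whose variance is assumed known (Step 1).
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
theta0, theta1 = 1.0, 2.0
var_delta = 0.25                                   # assumed known error variance

x1_true = rng.normal(size=n)
y = theta0 + theta1 * x1_true + rng.normal(scale=0.3, size=n)
x1_star = x1_true + rng.normal(scale=np.sqrt(var_delta), size=n)   # measured values

# Naive fit on the mismeasured regressor (attenuated towards zero)
cov = np.cov(x1_star, y)
naive_slope = cov[0, 1] / cov[0, 0]

# Correction: divide by the reliability ratio var(X1) / var(X1*)
reliability = (np.var(x1_star) - var_delta) / np.var(x1_star)
corrected_slope = naive_slope / reliability

print(f"naive slope    : {naive_slope:.3f}")       # ≈ theta1 * 0.8 = 1.6
print(f"corrected slope: {corrected_slope:.3f}")   # ≈ theta1 = 2.0
```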
IIA¶
Independence of Irrelevant Alternatives
The IIA property is the result of assuming that errors are independent of each other in a classification task
The probability of \(y = j\) relative to \(y = k\) is not affected by the existence or the properties of other classes
$$ \begin{aligned} p(y=j \vert x, z) &= \dfrac{ \exp(\beta_j x + \textcolor{hotpink}{\gamma_j z}) }{ \sum_{k=1}^K \exp(\beta_k x + \textcolor{hotpink}{\gamma_k z} ) } \\ \implies p(y=j \vert x) &= \int \dfrac{ \exp(\beta_j x + \textcolor{hotpink}{\gamma_j z}) }{ \sum_{k=1}^K \exp(\beta_k x + \textcolor{hotpink}{\gamma_k z} ) } f(z) \, dz \end{aligned} $$ where \(\gamma\) is the effect of the other classes (substitutes/complementary goods)
For IIA, \(\gamma=0\)
The IIA property is desirable for well-specified models:
- The error for one alternative provides no information about the error for another alternative. A well-specified model should have this property, such that the unobserved portion of utility is essentially “white noise”.
- However, when a model omits important unobserved variables that explain individual choice patterns, the errors can become correlated across alternatives.
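A minimal sketch of the IIA property under a plain softmax (multinomial logit with \(\gamma = 0\)); the class scores are made up:

```python
# Minimal sketch (assumed scores): under a plain softmax (gamma = 0), the ratio
# p(y=j)/p(y=k) is unaffected by adding or removing other alternatives (IIA).
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

scores_3 = np.array([1.0, 0.2, -0.5])          # alternatives: j, k, m
scores_2 = scores_3[:2]                         # drop alternative m

p3 = softmax(scores_3)
p2 = softmax(scores_2)

print("ratio p(j)/p(k) with 3 classes:", p3[0] / p3[1])
print("ratio p(j)/p(k) with 2 classes:", p2[0] / p2[1])   # identical: exp(1.0 - 0.2)
```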
Heteroskedasticity¶
Causes of Heteroskedasticity¶
- Misspecified model
- The output is \(\bar y\), but the sample size is different for each calculated mean (see the sketch after this list)
    - \(s_{\bar y} = \sigma_y/ \sqrt{n}\)
    - Eg: average income vs years of college
- Variance/standard error is relative to \(y\)
    - Eg: the precision of a tool is relative to the observed value, such as a weighing scale
- Variance has been experimentally determined for each \(y\) value
- Some distributions naturally have variance that is a function of the
    - Mean: Poisson
    - Mean & variance: Gamma
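A minimal sketch of the group-mean cause (assumed synthetic data): when each observed \(y\) is a group mean, groups with smaller \(n\) have larger spread \(\sigma_y / \sqrt{n}\), so residual variance is not constant.

```python
# Minimal sketch (assumed data): group means from groups of different sizes have
# different standard errors sigma_y / sqrt(n), producing heteroskedasticity.
import numpy as np

rng = np.random.default_rng(4)
group_sizes = np.array([5, 20, 80, 320])
sigma_y = 10.0

for n in group_sizes:
    # 1000 simulated groups of size n; keep only their means
    means = rng.normal(loc=50.0, scale=sigma_y, size=(1000, n)).mean(axis=1)
    print(f"n = {n:4d}  sd of group means ≈ {means.std():5.2f} "
          f"(theory: {sigma_y / np.sqrt(n):.2f})")
```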
Statistical Test¶
- Sort residuals \(u_i\) wrt corresponding \(\vert y_i \vert\)
- Divide residuals into \(g\) subgroups
- Test to see if sub-groups share same variance
- \(H_0:\) all groups have same variance
Test | Notes | Distribution of statistic | Null Hypothesis | Formula |
---|---|---|---|---|
Bartlett | Assumes normal distribution (sensitive to deviations from normality) | \(\chi^2\) distributed with \((g-1)\) DOF | \(g\) sub-groups have equal variance | \(\dfrac{(n-g) \ln s^2_\text{pool} - \sum\limits_{j=1}^g (n_j - 1) \ln s^2_j }{ 1 + \Big[ 1/[3(g-1)] \Big] \left[ \Big( \sum\limits_{j=1}^g \dfrac{1}{(n_j - 1)} \Big) - \dfrac{1}{n-g} \right] }\) |
Brown-Forsythe/ Modified Levene | Compares deviations from the median; robust to deviations from normality, but has lower power; requires \(n_j > 25 \quad \forall j\) | \(t\) distribution with DOF \(= n-g\) | Constant variance | \(\dfrac{\vert \bar d_1 - \bar d_2 \vert}{s_\text{pool} \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}\) where \(d_{ij}=\vert x_{ij} - \text{med}_j \vert\) |
White Test | Perform linear regression of \(u_i^2\) on \(x\) and test \(nR^2\) as \(\chi^2_{k-1}\) (see the sketch below) | | | |
Breusch-Pagan | Variation of the White test where \(x\) is replaced with any variable of interest | | | |
Park | Perform linear regression of \(\ln \vert u_i^2 \vert\) vs \(\ln \vert x \vert\) and test whether the slope is significantly different from 0 | | | |
where
- \(n=\) total number of data points
- \(g=\) number of subgroups
- \(n_j=\) sample size of the \(j\)th sub-group
- \(s^2_j =\) variance of the \(j\)th sub-group
- \(s^2_\text{pool} = \dfrac{1}{n-g} \sum\limits_{j=1}^g (n_j - 1) s^2_j\)
- \(\text{med}_j =\) median of the \(j\)th sub-group
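A minimal sketch of the White/Breusch-Pagan style check from the table above (assumed synthetic data with variance growing in \(x\)): regress \(u_i^2\) on the regressors and compare \(nR^2\) to a \(\chi^2\) distribution.

```python
# Minimal sketch (assumed data): auxiliary regression of squared residuals on the
# regressors; n*R^2 is compared against a chi-square under H0 (homoskedasticity).
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 500
x = rng.uniform(1, 10, size=n)
y = 3.0 + 2.0 * x + rng.normal(scale=0.5 * x, size=n)    # variance grows with x

# Main fit and residuals
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
u = y - X @ beta

# Auxiliary regression of u^2 on x (and x^2, White-style)
Z = np.column_stack([np.ones(n), x, x**2])
gamma, *_ = np.linalg.lstsq(Z, u**2, rcond=None)
fitted = Z @ gamma
r2 = 1 - np.sum((u**2 - fitted) ** 2) / np.sum((u**2 - np.mean(u**2)) ** 2)

lm_stat = n * r2                          # ~ chi^2 with (cols of Z - 1) DOF under H0
p_value = stats.chi2.sf(lm_stat, df=Z.shape[1] - 1)
print(f"n*R^2 = {lm_stat:.1f}, p = {p_value:.2g}")   # small p ⇒ heteroskedasticity
```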
Correcting¶
Dependence of variance on \(y_i\) | Solution |
---|---|
Known | Weighted regression; data transformation |
Unknown | GMM (generalized method of moments) estimation |
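A minimal sketch of weighted (least-squares) regression for the "known" case, with assumed synthetic data whose per-observation standard deviation is known:

```python
# Minimal sketch (assumed data): when the variance of each observation is known,
# weight each point by 1/variance (weighted least squares).
import numpy as np

rng = np.random.default_rng(6)
n = 400
x = rng.uniform(1, 10, size=n)
sigma_i = 0.3 * x                                  # known per-observation std dev
y = 1.0 + 2.0 * x + rng.normal(scale=sigma_i)

X = np.column_stack([np.ones(n), x])
w = 1.0 / sigma_i**2                               # weights = 1 / variance

# WLS: solve (X' W X) beta = X' W y
XtW = X.T * w
beta_wls = np.linalg.solve(XtW @ X, XtW @ y)

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print("OLS:", np.round(beta_ols, 3), " WLS:", np.round(beta_wls, 3))
```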
Collinearity¶
Two variables are correlated
- Can be inspected through the correlation matrix of the two variables
Implication¶
- Adding/removing predictor variables changes the estimated effect of the vars (for eg: regression coefficients)
- Standard errors of coefficients become larger
- Individual regression coefficients may not be significant, even if the overall model is significant
- Some regression coefficients may be significantly different than expected (even opposing sign)
- There can be multiple solutions for \(\beta\)
- Both variables will be insignificant if both are included in the regression model
- Dropping one will likely make the other significant
- Hence we can’t remove two (or more) supposedly insignificant predictors simultaneously: significance depends on which other predictors are included
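A minimal sketch of these implications with assumed synthetic data (two nearly identical predictors): standard errors blow up and the coefficients become unstable, even though the overall fit is fine.

```python
# Minimal sketch (assumed data): with two highly correlated predictors, the
# coefficient standard errors inflate and estimates become unstable.
import numpy as np

rng = np.random.default_rng(7)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)           # r(x1, x2) ≈ 0.999
y = 1.0 + 1.0 * x1 + 1.0 * x2 + rng.normal(scale=1.0, size=n)

def ols_with_se(X, y):
    """OLS estimates and their standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se

X_both = np.column_stack([np.ones(n), x1, x2])
X_one = np.column_stack([np.ones(n), x1])

b, se = ols_with_se(X_both, y)
print("both predictors: beta =", np.round(b, 2), " se =", np.round(se, 2))  # huge s.e.
b, se = ols_with_se(X_one, y)
print("x1 only:         beta =", np.round(b, 2), " se =", np.round(se, 2))  # x1 absorbs both effects
```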
Causes¶
Cause | ||
---|---|---|
No data: Inappropriate sampling | We only sample regions where predictors are correlated | |
Inappropriate model | If range of predictors is small: \(r(x, x^2) \ne 0\) | |
True Population | Collinearity indeed exists in the true population (for eg, height and weight) |
Multicollinearity¶
Collinearity among 3 or more variables, which can exist even when no single pairwise correlation looks alarming
eg: with \(x_3 \approx x_1 + x_2\) and \(r(x_1, x_2) = 0\), every pairwise correlation is at most \(\approx 0.71\), yet the three variables are (nearly) linearly dependent (see the sketch below)
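A minimal sketch of this situation with assumed synthetic data:

```python
# Minimal sketch (assumed data): three regressors that are (nearly) linearly
# dependent even though no single pairwise correlation looks alarming.
import numpy as np

rng = np.random.default_rng(8)
n = 1000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.01, size=n)    # almost exactly x1 + x2

X = np.column_stack([x1, x2, x3])
print(np.round(np.corrcoef(X, rowvar=False), 2))
# pairwise correlations ≈ [[1, 0, 0.71], [0, 1, 0.71], [0.71, 0.71, 1]],
# yet x3 is (almost) an exact linear combination of x1 and x2:
print(round(float(np.corrcoef(x3, x1 + x2)[0, 1]), 4))   # ≈ 1.0
```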
Detection¶
Method | Details |
---|---|
Correlation matrix | |
VIF (Variance Inflation Factor) | How much the variance of the \(j\)th model coefficient is inflated compared to the case of no collinearity \(\text{VIF}(\hat \beta_j) = \dfrac{1}{1 - R^2_{x_j \vert x_{j'} }} = (\tilde X^T \tilde X)_{jj}^{-1}, \quad j' \in [0, k] - \{ j \}\) \(R^2_{x_j \vert x_{j'}}\) is the \(R^2\) when \(x_j\) is regressed against all other predictor variables \(1/\text{VIF}_{x_j \vert x_{j'}}=\) “tolerance” \(\text{VIF}_{x_j \vert x_{j'}} \ge 4 \implies\) Investigate \(\text{VIF}_{x_j \vert x_{j'}} \ge 10 \implies\) Act Mean VIF over all \(j\) substantially \(> 1 \implies\) Problematic |
Eigensystem analysis | Find the eigenvalues of the correlation matrix, i.e. \(\tilde X^T \tilde X\) If all eigenvalues are of about the same magnitude, there is no multicollinearity Else calculate the condition number \(\kappa = \lambda_\max/\lambda_\min\) If \(\kappa > 100 \implies\) problem |
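A minimal sketch computing VIF and the condition number from scratch, on the same kind of assumed near-collinear synthetic data as above:

```python
# Minimal sketch (assumed data): detect multicollinearity with VIF and with the
# condition number of the correlation matrix.
import numpy as np

rng = np.random.default_rng(9)
n = 1000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.05, size=n)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF of column j: regress x_j on the remaining columns, return 1/(1-R^2)."""
    others = np.delete(X, j, axis=1)
    Z = np.column_stack([np.ones(len(X)), others])
    beta, *_ = np.linalg.lstsq(Z, X[:, j], rcond=None)
    resid = X[:, j] - Z @ beta
    r2 = 1 - resid.var() / X[:, j].var()
    return 1.0 / (1.0 - r2)

print("VIFs:", [round(vif(X, j), 1) for j in range(X.shape[1])])      # all >> 10 ⇒ act

corr = np.corrcoef(X, rowvar=False)
eigvals = np.linalg.eigvalsh(corr)
print("condition number:", round(eigvals.max() / eigvals.min(), 1))   # >> 100 ⇒ problem
```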
Solution¶
- Derive theoretical constraints relating input vars: helps simplify model; can be linear/non-linear
- If we only care about prediction, restrict the scope of the model to interpolation only, i.e., new inputs should fall within the range of predictor variables that exhibit the same pattern of multicollinearity
- Drop problematic variables, ie ones with highest VIF
- Collect more data that breaks pattern of multicollinearity
- Measure coefficients in separate experiment (then fix those coefficients)
- Regularization: Even for perfect multicollinearity, the ridge regression solution will always exist
- PCA
    - Separates the high SE of coefficients due to multicollinearity into components with low SE and high SE; you’d only include the low-SE components
    - Helps identify unknown linear constraints
    - Limitation: cannot help with non-linear relationships
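A minimal sketch of the regularization solution (assumed synthetic data with a perfectly duplicated regressor and an arbitrary penalty \(\lambda = 1\)): the ridge solution exists even though the OLS normal equations are singular.

```python
# Minimal sketch (assumed data): with (near-)perfect multicollinearity the OLS
# normal equations are singular, but the ridge solution always exists.
import numpy as np

rng = np.random.default_rng(10)
n = 300
x1 = rng.normal(size=n)
x2 = x1.copy()                                     # perfectly collinear duplicate
y = 1.0 + 3.0 * x1 + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x1, x2])
lam = 1.0                                          # assumed ridge penalty

# Ridge: beta = (X'X + lam*I)^{-1} X'y  (intercept penalised too, for brevity)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print("ridge:", np.round(beta_ridge, 2))           # splits the x effect ≈ [1, 1.5, 1.5]

# OLS normal equations are singular here: X'X is rank-deficient
print("rank of X'X:", np.linalg.matrix_rank(X.T @ X), "of", X.shape[1])
```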