Hypothesis Testing¶
Objective¶
Hypothesis testing addresses the following question:
Is the estimated value sufficiently close to the stated (hypothesized) value?
Hypothesis¶
- Simple Hypothesis
    - One-tailed (single direction)
    - e.g., \(\beta_2 > 0.3\)
- Composite Hypothesis
    - Two-tailed (bi-directional)
    - Useful when the direction is not known in advance
    - e.g., \(\beta_2 \ne 0.3\)
Terminology¶
Null Hypothesis \((H_0)\)¶
Your initial statement
Alternative Hypothesis \((H_1)\)¶
[Usually] complement of your initial statement
Also called the Maintained Hypothesis. It acts as the fallback in case the null hypothesis is rejected.
In that case, the parameter is taken to be the value estimated from the sample
Number of Hypotheses¶
If \(n =\) number of independent variables, the number of hypotheses is \(2n + 1\)
The \(+1\) is due to the intercept (constant)
Steps¶
- Formulate the hypotheses
- Determine whether the test is one-tailed or two-tailed
- Construct a \(100(1-\alpha) \ \%\) confidence interval for \(\beta_2\)
- Determine the critical values
- Determine the decision rule to accept/reject the null hypothesis
- Compare the estimated value with the critical region
- Conclusion
    - If the hypothesized value lies within the confidence interval (i.e., the test statistic lies outside the critical region), accept the null hypothesis
    - If the hypothesized value lies outside the confidence interval (i.e., the test statistic falls in the critical region), reject the null hypothesis
        - accept the alternative hypothesis
        - \(\beta_2\) is taken to be the sample estimate
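A minimal sketch of these steps in Python, assuming simulated data, a hypothesized value of 0.3, and `statsmodels`/`scipy` (all illustrative choices, not part of the original notes):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 1.0 + 0.5 * x + rng.normal(scale=0.3, size=50)   # simulated data (assumption)

X = sm.add_constant(x)                # add the intercept column
model = sm.OLS(y, X).fit()
beta2_hat, se = model.params[1], model.bse[1]

beta2_0 = 0.3                         # hypothesized value: H0: beta2 = 0.3
alpha = 0.05
t_stat = (beta2_hat - beta2_0) / se
t_crit = stats.t.ppf(1 - alpha / 2, df=model.df_resid)   # two-tailed critical value
reject = abs(t_stat) > t_crit
print(f"t = {t_stat:.2f}, critical = {t_crit:.2f}, reject H0: {reject}")
```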
Confidence Interval¶
- \(\alpha =\) level of significance
- \((1-\alpha) =\) Confidence coefficient
Construct the confidence interval using the \(t\) distribution with \((n-2)\) degrees of freedom. The 2 is because we estimate 2 unknowns (intercept and slope) in the simple regression.
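For instance, a sketch of the interval \(\hat\beta_2 \pm t_{\alpha/2,\,n-2} \cdot \text{se}(\hat\beta_2)\); the estimate, standard error, and \(n\) below are illustrative assumptions:

```python
from scipy import stats
beta2_hat, se, n, alpha = 0.52, 0.08, 50, 0.05   # illustrative values (assumptions)
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)    # two-tailed t critical value, n-2 df
ci = (beta2_hat - t_crit * se, beta2_hat + t_crit * se)
print(ci)   # 100(1-alpha)% confidence interval for beta_2
```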
Level of Significance¶
Tolerance level for error
This is
- probability of committing type 1 error
- probability of rejecting a true null hypothesis purely because of sampling chance
Field | Conventional \(\alpha\) | Conventional \((1 - \alpha) \%\) |
---|---|---|
Pure Sciences | 0.01 | 99% |
Social Sciences | 0.05 | 95% |
Psychology | 0.10 | 90% |
Normal Distribution¶
- \(95 \%\) of values lie within \(\approx 1.96\) standard deviations of the mean
- \(2.5 \%\) of values lie beyond \(1.96\) standard deviations on the left side
- \(2.5 \%\) of values lie beyond \(1.96\) standard deviations on the right side
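A quick numerical check of these cut-offs with `scipy` (an illustrative sketch):

```python
from scipy import stats
print(stats.norm.ppf(0.975))                         # ≈ 1.96: two-tailed 5% cut-off
print(stats.norm.cdf(1.96) - stats.norm.cdf(-1.96))  # ≈ 0.95: mass within ±1.96 SD
```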
Errors¶
| | Type 1 | Type 2 |
|---|---|---|
| Error of | Rejecting a true null hypothesis | Accepting a false null hypothesis |
| Meaning | False Positive | False Negative |
| Measured by | \(\alpha\) (Level of Significance) | \(\beta\) |
| Happens when | Sample is not a good representation of the population | |
Statistical Equivalence¶
An estimate of 0.5 can be statistically equivalent to 0, or not; it depends on the context (in particular, on the standard error)
p-Value¶
Observed level of significance
\(\text{p value} \le \alpha \implies\) Reject null hypothesis
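As a sketch, computing the two-tailed p-value for an observed \(t\) statistic (the numbers are illustrative assumptions):

```python
from scipy import stats
t_stat, df, alpha = 2.4, 48, 0.05                  # illustrative values (assumptions)
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))   # two-tailed p-value
print(p_value, p_value <= alpha)                   # reject H0 if p-value <= alpha
```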
\(t=2\) Rule¶
For degrees of freedom \(\ge 20\)
2 Tailed¶
\(H_0: \beta_2 = 0, \quad H_1: \beta_2 \ne 0\)
If \(|t| > 2 \implies p \le 0.05 \implies\) reject \(H_0\)
1 Tailed¶
\(H_0: \beta_2 = 0, H_1: \beta_2 > 0\) or \(H_0: \beta_2 = 0, H_1: \beta_2 < 0\)
If \(|t| > 1.73 \implies p \le 0.05 \implies\) reject \(H_0\)
Why \(t\) distribution?¶
The \(t\) distribution is similar to the \(z\) (standard normal) distribution but has heavier tails; it converges to \(z\) as the sample size grows
- For small samples, we use \(t\) dist
- For large sample, we use \(z\) dist
Decision Regions and Interpretation¶
\(\alpha\)
- level of significance
- size of critical region
Confidence level = \((1-\alpha) \times 100 \%\)
The entire distribution is divided into 2 regions
- Critical Region: region of rejection of \(H_0\); it is decided based on \(H_1\)
- Acceptance Region: region of acceptance of \(H_0\)
- Rejection of \(H_0\) does not imply that \(\beta\) is substantially different from 0, only that it is statistically distinguishable from 0. Therefore, to assess the magnitude of \(\beta\), confidence intervals are more useful than \(p\)-values
- Rejection of \(H_0\) does not mean that \(x\) has a significant causal effect on \(y\). Statistical significance \(\centernot \implies\) scientific, real-world significance. The most important variables are not those with the smallest p-values.
- The \(t\)-test can be thought of as checking whether adding \(x_j\) really improves predictions in a model that contains the other specified predictors
- A 95% CI \([\text{LL}, \text{UL}] \centernot \implies \text{Pr}(\beta \in [\text{LL, UL}]) = 0.95\)
- Correct interpretation: a 95% CI for \(\beta\) means that if we estimate our model on many independent random samples drawn from the same population and construct \(\text{CI}_m = [\text{LL}_m, \text{UL}_m]\) on each sample, then 95% of these \(\{ \text{CI}_m \}\) will contain \(\beta\) (see the simulation sketch below)
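A minimal simulation sketch of this repeated-sampling interpretation, assuming a simple linear data-generating process (all numbers below are illustrative); the empirical coverage should come out near 0.95:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
beta, n, alpha, hits, reps = 0.5, 50, 0.05, 0, 1000
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
for _ in range(reps):                               # many independent samples
    x = rng.normal(size=n)
    y = 1.0 + beta * x + rng.normal(size=n)
    sxx = np.sum((x - x.mean()) ** 2)
    bhat = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    resid = y - (y.mean() - bhat * x.mean()) - bhat * x
    se = np.sqrt(resid @ resid / (n - 2) / sxx)     # se of the slope estimate
    hits += int(bhat - t_crit * se <= beta <= bhat + t_crit * se)
print(hits / reps)                                  # ≈ 0.95 coverage
```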
P-Value¶
The p-value is not the conditional probability of \(H_0\) being true given the data. It is the probability of observing data at least as extreme as the observed data set, assuming \(H_0\) is true (without incorporating prior knowledge) $$ \begin{aligned} p \text{-value} &\ne P(H_0 = \text{True} \mid D) \\ p \text{-value} &= P(D \mid H_0 = \text{True}) \\ &= \text{Pr}(\vert t \vert \ge \vert t(\hat \beta) \vert \mid H_0) \end{aligned} $$
where \(D\) is the data
When \(P(H_1) < 0.1\), we may need the p-value to be much smaller than the conventional threshold of \(\alpha = 0.05\) in order to "confidently" reject \(H_0\)
- For example, confidently concluding that a coin is biased would require a large number of outcomes favoring one side
Many hypothesis tests are only valid for large samples, as they are based on the asymptotic properties of test statistics. Hence, bootstrapping can be used to obtain more accurate p-value estimates in small samples
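As a sketch, a bootstrap p-value for a one-sample test of the mean, assuming a small skewed sample and a re-centering resampling scheme (both are illustrative assumptions, not a prescription from these notes):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=1.0, size=30)      # small, skewed sample (assumption)
mu0 = 1.2                                    # H0: population mean = 1.2
t_obs = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(len(x)))

x0 = x - x.mean() + mu0                      # impose H0 by re-centering the sample
t_boot = []
for _ in range(5000):
    xb = rng.choice(x0, size=len(x0), replace=True)
    t_boot.append((xb.mean() - mu0) / (xb.std(ddof=1) / np.sqrt(len(xb))))
p_boot = np.mean(np.abs(t_boot) >= abs(t_obs))   # two-tailed bootstrap p-value
print(p_boot)
```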
Population Mean¶
\(\sigma^2\) | \(n\) | Test Statistic/Probability Distribution |
---|---|---|
known | any | \(z_c = \frac{\bar x - \mu_0}{\sigma/\sqrt n}\) |
unknown | \(>30\) | \(z_c = \frac{\bar x - \mu_0}{s/ \sqrt n}\) |
unknown | \(\le 30\) | \(t_c = \frac{\bar x - \mu_0}{s / \sqrt n}\) |
Critical Region¶
| | Left-Tailed | Two-Tailed | Right-Tailed |
|---|---|---|---|
| \(H_1\) | \(\mu < \mu_0\) | \(\mu \ne \mu_0\) | \(\mu > \mu_0\) |
| p-value | \(F(z_c)\), or tail area from the \(t\) table | \(2[ F(-\vert z_c \vert) ]\), or twice the tail area from the \(t\) table | \(F(-z_c)\), or tail area from the \(t\) table |
| Decision rule | Accept \(H_1\) if \(z_c \le -z_\alpha\), or \(t_c \le -t_{(n-1), \alpha}\), or \(p\text{-value} \le \alpha\); else accept \(H_0\) | Accept \(H_1\) if \(z_c \le -z_{\alpha/2}\) or \(z_c \ge +z_{\alpha/2}\), or \(t_c \le -t_{(n-1), (\alpha/2)}\) or \(t_c \ge +t_{(n-1), (\alpha/2)}\), or \(p\text{-value} \le \alpha\); else accept \(H_0\) | Accept \(H_1\) if \(z_c \ge +z_\alpha\), or \(t_c \ge +t_{(n-1), \alpha}\), or \(p\text{-value} \le \alpha\); else accept \(H_0\) |
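A sketch of the unknown-variance case using `scipy.stats.ttest_1samp` (the sample values and \(\mu_0\) below are illustrative assumptions):

```python
import numpy as np
from scipy import stats

x = np.array([12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9])  # sample (assumption)
mu0 = 12.0                                                       # H0: mu = 12.0
res = stats.ttest_1samp(x, popmean=mu0)                          # two-tailed by default
print(res.statistic, res.pvalue)
p_right = stats.ttest_1samp(x, popmean=mu0, alternative="greater").pvalue
print(p_right)                                                   # right-tailed p-value
```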
Proportion¶
Critical Region¶
| | Left-Tailed | Two-Tailed | Right-Tailed |
|---|---|---|---|
| \(H_1\) | \(p < p_0\) | \(p \ne p_0\) | \(p > p_0\) |
| p-value | \(F(z_c)\) | \(2[ F(-\vert z_c \vert) ]\) | \(F(-z_c)\) |
| Decision rule | Accept \(H_1\) if \(z_c \le -z_\alpha\) or \(p\text{-value} \le \alpha\); else accept \(H_0\) | Accept \(H_1\) if \(z_c \le -z_{\alpha/2}\) or \(z_c \ge +z_{\alpha/2}\), or \(p\text{-value} \le \alpha\); else accept \(H_0\) | Accept \(H_1\) if \(z_c \ge +z_\alpha\) or \(p\text{-value} \le \alpha\); else accept \(H_0\) |
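A sketch using `statsmodels`' `proportions_ztest` (the counts and \(p_0\) below are illustrative assumptions):

```python
from statsmodels.stats.proportion import proportions_ztest
# H0: p = 0.5 vs H1: p != 0.5, with 62 successes in 100 trials (illustrative numbers)
z_c, p_value = proportions_ztest(count=62, nobs=100, value=0.5, alternative="two-sided")
print(z_c, p_value)
```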
Variance/SD¶
Critical Region¶
| | Left-Tailed | Two-Tailed | Right-Tailed |
|---|---|---|---|
| \(H_1\) | \(\sigma^2 < \sigma^2_0\) | \(\sigma^2 \ne \sigma^2_0\) | \(\sigma^2 > \sigma^2_0\) |
| p-value | \(F_{\chi^2}(\chi^2_c)\) (left-tail area from the \(\chi^2\) table) | \(2 \min \{ F_{\chi^2}(\chi^2_c), \ 1 - F_{\chi^2}(\chi^2_c) \}\) | \(1 - F_{\chi^2}(\chi^2_c)\) (right-tail area from the \(\chi^2\) table) |
| Decision rule | Accept \(H_1\) if \(\chi_c^2 \le \chi^2_{(n-1), (1-\alpha)}\) or \(p\text{-value} \le \alpha\); else accept \(H_0\) | Accept \(H_1\) if \(\chi_c^2 \le \chi^2_{(n-1), (1-\alpha/2)}\) or \(\chi_c^2 \ge \chi^2_{(n-1), (\alpha/2)}\), or \(p\text{-value} \le \alpha\); else accept \(H_0\) | Accept \(H_1\) if \(\chi_c^2 \ge \chi^2_{(n-1), \alpha}\) or \(p\text{-value} \le \alpha\); else accept \(H_0\) |
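A sketch that computes \(\chi_c^2 = (n-1)s^2/\sigma_0^2\) directly, since the test statistic is simple to form by hand (the sample and \(\sigma_0^2\) below are illustrative assumptions):

```python
import numpy as np
from scipy import stats

x = np.array([2.1, 1.9, 2.4, 2.0, 2.2, 1.8, 2.3, 2.1])  # sample (assumption)
sigma0_sq = 0.04                                         # H0: sigma^2 = 0.04
n = len(x)
chi2_c = (n - 1) * x.var(ddof=1) / sigma0_sq             # test statistic
p_right = 1 - stats.chi2.cdf(chi2_c, df=n - 1)           # right-tailed p-value
print(chi2_c, p_right)
```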
Errors¶
| | \(H_0\) is true | \(H_0\) is false | \(H_0\) asks the wrong question |
|---|---|---|---|
| Reject \(H_0\) | Type 1 Error \(= \alpha\) | Correct | Type 3 Error: right answer to the wrong question |
| Accept \(H_0\) | Correct | Type 2 Error \(= \beta\) | |
Type 1 error is alright, but Type 2 error is dangerous
- \(\alpha\) = P(reject \(H_0\) | \(H_0\) is true)
- \(\beta\) = P(accept \(H_0\) | \(H_0\) is false)
Power of Test¶
The greater the power of a test, the better: power \((= 1 - \beta)\) is the probability of correctly rejecting \(H_0\) when it is false
Information Content of Statistical (Non)Significance¶
Statistical result is informative only when it has the potential to substantially change our beliefs. The discrepancy between a prior and a posterior distribution thus provides a basic measure of the informativeness of a statistical result.
Using this measure, non-significant results are often more informative than significant results in scenarios common in empirical economics.
Hence, the null need not always be \(H_0: \beta = 0\); it can be the previously known value. This can be implemented in ridge regression by shrinking toward the prior known value instead of toward zero
- Beliefs on the causal effect of a policy intervention are usually better described by a continuous distribution rather than a distribution with significant probability mass at point zero.
When \(P(H_0)\) is low, statistical significance often carries little information; non-significance is highly informative, because in this case non-significance is more "surprising" and induces a larger change in the posterior belief $$ \underbrace{1 - \dfrac{p(\beta \mid R=0)}{p(\beta)}}_{\text{INS}} = \dfrac{P(R=1)}{P(R=0)} \times \underbrace{1 - \dfrac{p(\beta \mid R=1)}{p(\beta)}}_{\text{IS}} $$ where
- \(R=H_0 \text{ rejected}\) at given significance level
- \(P(R = 1)\) is the prior probability of rejection of the null
- \(P(R = 1) = \int P(R = 1 \mid \beta) \cdot p(\beta) \, d\beta\)
- \(\text{INS}\) = Informativeness of non-significance
- \(\text{IS}\) = Informativeness of significance
Takeaways¶
- Non-significance is more informative than significance as long as \(P(R = 1) > 0.5\)
- As \(n\) increases and the prior mass \(p(\beta=0)\) decreases, \(P(R=1)\) increases
- Thus, as datasets get larger, and because there are rarely reasons to put substantial prior mass on \(\beta=0\), non-significant results will be more informative in empirical studies in economics
- When \(n\) is very large, without prior probability mass at the point null, significance carries no information
Statistical Significance Filter¶
Publication Bias
Only the extreme, statistically significant results of studies make it through to publication, and hence published findings are not a representative sample of all empirical findings.
When the power of the test is low, the null hypothesis is false but fails to be rejected with probability \(\beta = 1 - \text{power}\)
Low power leads to high exaggeration ratios: if the estimate is statistically significant, it must be at least some factor \(a\) (the exaggeration ratio) larger than the true effect size
Type \(S\) error probability: the probability that a statistically significant estimate has the wrong sign
Multiple Testing¶
Multiple comparisons
If you perform multiple hypothesis tests, the probability of at least one producing a statistically-significant result at the significance level \(\alpha\) due to chance, is necessarily greater than \(\alpha\)
Multiple hypothesis fallacy: akin to claiming a "strategy" for rolling dice that maximizes the outcome, after keeping only the best of many rolls
When testing \(m\) hypotheses
| | Meaning | Preferred for | Formula |
|---|---|---|---|
| FWER (Family-Wise Error Rate): joint Type 1 error | Probability of \(\ge 1\) false positive among all tested hypotheses | Single-choice decisions: only one car produced, only one investment strategy chosen | \(1 - (1-\alpha)^m\) |
| FPR (False Positive Rate) / FDR (False Discovery Rate) | Expected proportion of false positives among rejected null hypotheses (the claimed discoveries) | FDR is preferred for large-quantity decisions: many cars produced | |
FWER¶
$$ \begin{aligned} \text{FWER} &= P(\ge 1 \text{ FP}) \\ &= 1 - P(0 \text{ FP}) \\ &= 1 - P(\text{not falsely rejecting any null hypothesis}) \\ &= 1 - P\left(\cap_{j=1}^m \text{not falsely rejecting } H_{0j}\right) \end{aligned} $$ where \(m\) is the number of tests conducted (i.e., model specifications tested)
Assuming tests are independent $$ \begin{aligned} \text{FWER} & \approx 1 - \prod_{j=1}^m P (\text{not falsely rejecting } H_{0j}) \\ & \approx 1 - \prod_{j=1}^m (1-\alpha) \\ & \approx 1 - (1-\alpha)^m \end{aligned} $$
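For example, with \(\alpha = 0.05\) and \(m = 20\) independent tests, the chance of at least one false positive is already about 64%:

```python
alpha, m = 0.05, 20
fwer = 1 - (1 - alpha) ** m   # ≈ 0.64: probability of at least one false positive
print(fwer)
```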
Corrections¶
| | Correction | Correction Factor, where \(p_{(k), \text{adjusted}} = \text{Correction Factor} \times p_{(k)}\) | Disadvantages |
|---|---|---|---|
| FWER | Bonferroni | \(m\) | Very strict in avoiding false positives; may lead to false negatives |
| FWER | Holm | \(m - (k-1)\) | Very strict in avoiding false positives; may lead to false negatives |
| FDR | Benjamini-Hochberg | \(m \times \dfrac{1}{k}\) | |
| FDR | Benjamini-Yekutieli | \(m \times \dfrac{1}{k} \times \sum\limits_{j=1}^m \dfrac{1}{j}\) | |

\(k=\) index of the hypothesis in the list of \(p\)-values sorted in ascending order
Bonferroni Correction¶
Bounds the FWER below \(\alpha\) by setting the significance threshold for each individual test to \(\alpha/m\)
$$ 1 - \left(1 - \dfrac{\alpha}{m} \right)^m \le \alpha $$ It is conservative: it controls the FWER without requiring the tests to be independent, and hence often over-corrects.
For large \(m\), it leads to a significant loss of power, i.e., a higher probability of false negatives
Holm Correction¶
- Compute \(p\)-values \(p_1, p_2, \dots, p_m\) for the \(m\) null hypotheses \(H_{01}, H_{02}, \dots, H_{0m}\)
- Order the \(m\) \(p\)-values in ascending order of magnitude to obtain \(p = \{ p_{(1)}, p_{(2)}, \dots, p_{(m)} \}\) such that \(p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}\)
- Adjust the \(p\)-values using the correction factor \(m - (k-1)\), where \(k=\) index in the sorted list
- Reject \(H_{0k} \ \forall k\) that satisfy \(p_{k, \text{adjusted}} \le \alpha\)
```python
from statsmodels.stats.multitest import multipletests
p_values = [0.01, 0.04, 0.03, 0.20]  # raw p-values (illustrative)
method = "holm"  # alternatives: "bonferroni", "fdr_bh", "fdr_by"
# multipletests returns (reject, pvals_corrected, alphacSidak, alphacBonf)
reject, pvals_corrected, _, _ = multipletests(p_values, alpha=0.05, method=method)
```