Sampling¶
Used when it is not feasible to analyze the entire population
Estimation: Using the sample to estimate population parameter(s)
Population v Sample¶
Property | Population | Sample |
---|---|---|
Definition | comprises all units that share the characteristic under study | a subset of the population, selected so that it is representative of the entire population |
Size | \(N\) | \(n\) |
Mean | \(\mu\) | \(\bar x = \dfrac {\sum_{i=1}^n x_i}{n}\) |
Variance | \(\sigma^2\) | \(s^2 = \dfrac {\sum_{i=1}^n (x_i-\bar x)^2}{n \textcolor{hotpink}{-1}}\) |
Standard Deviation | \(\sigma\) | \(s\) |
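As a small illustration of the sample column above, the following sketch computes \(\bar x\), \(s^2\) and \(s\) by hand and cross-checks them against Python's `statistics` module. The data values are made up for the example.

```python
import math
import statistics

# Hypothetical sample (illustrative values only)
sample = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(sample)

# Sample mean: sum of observations divided by n
x_bar = sum(sample) / n

# Sample variance: squared deviations from x_bar, divided by n - 1
s2 = sum((x - x_bar) ** 2 for x in sample) / (n - 1)

# Sample standard deviation
s = math.sqrt(s2)

# statistics.variance / stdev use the n - 1 divisor,
# statistics.pvariance / pstdev use the population divisor
print(x_bar, s2, s)
print(statistics.variance(sample), statistics.pvariance(sample))
```

If the eight values were the whole population rather than a sample, `statistics.pvariance` (divisor \(N\)) would be the appropriate call.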
Relations¶
Bessel’s Correction¶
$$
\begin{aligned}
\text{Var}(x) &= E[x^2] - (E[x])^2 \\
\implies E[x^2] &= \sigma^2 + \mu^2 \\
\\
\text{Var}(\bar x) &= E[\bar x^2] - (E[\bar x])^2 \\
\implies E[\bar x^2] &= \dfrac{\sigma^2}{n} + \mu^2 \\
\\
E[s^2_\text{uncorrected}] &= E \left[ \dfrac{1}{n} \sum_{i=1}^n x_i^2 \right] - E[\bar x^2] \\
&= (\sigma^2 + \mu^2) - \left( \dfrac{\sigma^2}{n} + \mu^2 \right) \\
&= \sigma^2 - \dfrac{\sigma^2}{n} \\
\\
\implies \sigma^2 &= E[s^2_\text{uncorrected}] + \text{Bias} \\
&= E[s^2_\text{uncorrected}] + \dfrac{\sigma^2}{n} \\
\implies \sigma^2 &= E[s^2_\text{uncorrected}] \times \dfrac{n}{\text{DOF}} \\
&= E[s^2_\text{uncorrected}] \times \underbrace{\dfrac{n}{n-1} }_{\mathclap{\text{Bessel's Correction}}}
\end{aligned}
$$
Reasoning
- Degrees of freedom: One degree of freedom is lost because \(\bar x\) is estimated from the same sample and used in place of \(\mu\)
- Bias correction: With a small sample size, the less probable elements tend not to show up, so the uncorrected sample dispersion underestimates the population dispersion
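The \(\dfrac{n}{n-1}\) factor can be checked exactly on a tiny example. The sketch below assumes i.i.d. sampling (with replacement) from a made-up population, enumerates every possible sample of size \(n\), and averages the uncorrected and corrected variances using exact fractions. The helper `variance` and the population values are illustrative, not part of the original notes.

```python
from fractions import Fraction
from itertools import product


def variance(values, ddof=0):
    """Variance with divisor len(values) - ddof, computed exactly."""
    n = len(values)
    mean = Fraction(sum(values), n)
    return sum((Fraction(v) - mean) ** 2 for v in values) / (n - ddof)


population = [1, 2, 3, 4, 5, 6]  # hypothetical population (faces of a die)
n = 3                            # sample size

# Population variance uses the divisor N
sigma2 = variance(population)

# Every equally likely i.i.d. sample of size n (sampling with replacement)
samples = list(product(population, repeat=n))

# Expected value of each estimator = average over all possible samples
mean_uncorrected = sum(variance(s, ddof=0) for s in samples) / len(samples)
mean_corrected = sum(variance(s, ddof=1) for s in samples) / len(samples)

print("population variance      :", sigma2)
print("E[uncorrected s^2]       :", mean_uncorrected)
print("E[corrected s^2] (n - 1) :", mean_corrected)
```

Under these assumptions the uncorrected average falls short of \(\sigma^2\) by the factor \(\dfrac{n-1}{n}\), while the corrected average matches \(\sigma^2\).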
Sample vs Population Standard Deviation¶
For Different Distributions¶
The higher the skew of the population distribution, the larger the sample size required for the sample SD to approximate the population SD
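A rough simulation can illustrate this. The sketch below compares one symmetric population (standard normal) with one skewed population (exponential with rate 1), both with a true SD of 1, and measures how far the sample SD typically lands from the true SD at the same sample size. The distributions, sample size, trial count and the helper `mean_abs_sd_error` are all choices made for this example.

```python
import random
import statistics


def mean_abs_sd_error(draw, true_sd, n, trials, rng):
    """Average |sample SD - true SD| over repeated samples of size n."""
    total = 0.0
    for _ in range(trials):
        sample = [draw(rng) for _ in range(n)]
        total += abs(statistics.stdev(sample) - true_sd)
    return total / trials


rng = random.Random(0)  # fixed seed so the run is repeatable
n, trials = 30, 2000

# Symmetric population: normal with mean 0 and SD 1
err_normal = mean_abs_sd_error(lambda r: r.gauss(0.0, 1.0), 1.0, n, trials, rng)

# Skewed population: exponential with rate 1 (SD 1, skewness 2)
err_expo = mean_abs_sd_error(lambda r: r.expovariate(1.0), 1.0, n, trials, rng)

print("mean |s - sigma|, normal     :", err_normal)
print("mean |s - sigma|, exponential:", err_expo)
```

Increasing `n` for the exponential case until its error matches the normal case gives a feel for how much larger the sample has to be for this particular pair of distributions.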
For Different Population Sizes¶
How well the sample SD approximates the population SD does not depend on the population size
Interval Estimation¶
Confidence level \(= 1- \alpha\)
The most common choice is a \(95\%\) confidence interval (\(\alpha = 0.05\))
Population mean¶
\(\sigma^2\) | \(n\) | Test statistic | Confidence interval for \(\mu\) |
---|---|---|---|
known | any | \(z = \dfrac {\bar x - \mu}{\sigma / \sqrt n}\) | \(\bar x \pm z_{\alpha/2} \cdot \dfrac \sigma {\sqrt n}\) |
unknown | \(>30\) | \(z = \dfrac {\bar x - \mu}{s/ \sqrt n}\) | \(\bar x \pm z_{\alpha/2} \cdot \dfrac s {\sqrt n}\) |
unknown | \(\le 30\) | \(t = \dfrac {\bar x - \mu}{s / \sqrt n}\) | \(\bar x \pm t_{n-1, \alpha/2} \cdot \dfrac s {\sqrt n}\), with \(n-1\) degrees of freedom |
where
- \(n\) is sample size
- \(w\) is the margin of error, i.e. the distance from \(\mu\) = \(\frac{\text{interval width}}{2}\)
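As a sketch of the known-\(\sigma\) row, the code below computes a \(z\)-based interval with the standard library's `NormalDist`. It also inverts the margin of error \(w = z_{\alpha/2} \cdot \sigma / \sqrt n\) to get the sample size needed for a target \(w\); that inversion is an assumption about how \(w\) is meant to be used, and the function names and numbers are hypothetical.

```python
import math
from statistics import NormalDist


def z_interval(x_bar, sigma, n, alpha=0.05):
    """Confidence interval for the mean when sigma is known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # critical value z_{alpha/2}
    w = z * sigma / math.sqrt(n)             # margin of error (half-width)
    return x_bar - w, x_bar + w, w


def required_sample_size(sigma, w, alpha=0.05):
    """Smallest n whose margin of error is at most w (assumed use of w)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil((z * sigma / w) ** 2)


# Hypothetical numbers: sample mean 50, known sigma 10, n = 100
low, high, w = z_interval(x_bar=50.0, sigma=10.0, n=100)
print(f"95% CI: ({low:.2f}, {high:.2f}), margin of error {w:.2f}")

# Sample size needed to shrink the margin of error to 1.0
n_needed = required_sample_size(sigma=10.0, w=1.0)
print("n needed for w = 1.0:", n_needed)
```

The \(t\)-based row is not covered here because the standard library has no Student's \(t\) quantile function.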
Proportion¶
Population Variance / SD¶
Inequalities¶
Let \(x\) be a random variable whose observations satisfy \(x_i \in [a, b]\)
Consider
- sample size \(n\)
- \(\epsilon > 0\)
Hoeffding’s Inequality¶
where
- \(\mu\) is the population parameter (a mean or proportion) and \(\hat \mu\) is its sample estimate
- \(n>0\)
- \(\epsilon > 0\)
- \(B\) is the number of ‘bins’
Notes
- We want \(P (\vert \hat \mu - \mu \vert > \epsilon)\) to be low
- Even though \(P (\vert \hat \mu - \mu \vert > \epsilon)\) depends on \(\mu\), the bound is independent of \(\mu\)
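The inequality itself does not appear in the text above. A commonly stated form, assumed here, for the mean of \(n\) independent observations bounded in \([a, b]\) is \(P (\vert \hat \mu - \mu \vert > \epsilon) \le 2 \exp \left( \dfrac{-2 n \epsilon^2}{(b-a)^2} \right)\), with the right-hand side multiplied by \(B\) when a union bound is taken over \(B\) bins. The sketch below evaluates that assumed bound and inverts it for \(n\). The function names are hypothetical.

```python
import math


def hoeffding_bound(n, eps, a=0.0, b=1.0, bins=1):
    """Assumed form: 2 * B * exp(-2 n eps^2 / (b - a)^2), capped at 1."""
    if n <= 0 or eps <= 0 or b <= a or bins < 1:
        raise ValueError("need n > 0, eps > 0, b > a and bins >= 1")
    bound = 2 * bins * math.exp(-2 * n * eps ** 2 / (b - a) ** 2)
    return min(1.0, bound)  # a probability bound above 1 carries no information


def samples_needed(eps, delta, a=0.0, b=1.0, bins=1):
    """Smallest n for which the assumed bound is at most delta."""
    return math.ceil((b - a) ** 2 * math.log(2 * bins / delta) / (2 * eps ** 2))


# Values in [0, 1], tolerance 0.05, a single bin
p = hoeffding_bound(n=1000, eps=0.05)
print("bound for n = 1000:", p)

# Sample size that pushes the bound down to 5%
n_min = samples_needed(eps=0.05, delta=0.05)
print("n needed for bound <= 0.05:", n_min)
```

Neither function takes \(\mu\) as an argument, which matches the note that the bound is independent of \(\mu\).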
Vapnik-Chervonenkis Inequality¶
where \(m_h(n)\) is the growth function, which satisfies \(m_h(n) \le 2^n\)