Estimation¶
| | \(\theta\) is thought of as | Maximize |
|---|---|---|
| Frequentist | Unknown constant | Likelihood \(p(D \vert \hat \theta)\) |
| Bayesian | Unknown random variable with PDF | Posterior \(p(\hat \theta \vert D)\) |
MLE¶
Maximum Likelihood Estimation
- Predict a probability distribution \(\hat p(y \vert x)\)
- Get the likelihood of \(\hat p\) wrt the data
- Update \(\hat p\) to increase the likelihood
- Go to step 1
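A minimal sketch of this loop (gradient ascent on a Gaussian log-likelihood; the data and the model \(\hat p = N(\mu, 1)\) are illustrative, with no conditioning on \(x\) for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(4.0, 1.0, size=500)   # observed data (illustrative)

mu = 0.0                             # current model: p_hat = N(mu, 1)
for _ in range(100):
    # likelihood of p_hat w.r.t. the data, via its log
    log_lik = np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (y - mu) ** 2)
    # update p_hat in the direction that increases the log-likelihood
    mu += 0.001 * np.sum(y - mu)     # gradient of log_lik w.r.t. mu

print(mu, log_lik)  # mu converges towards the sample mean, the MLE
```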
MLE as Minimizing KL Divergence¶
Minimizing the KL divergence between the true distribution \(p\) and our model \(\hat p\) is equivalent to maximizing the likelihood of \(\hat p\)

$$
\begin{aligned}
\arg \min_{\hat p} D_\text{KL}(p \Vert \hat p)
&= \arg \min_{\hat p} E_{(x, y) \sim p} \ln \dfrac{p(x, y)}{\hat p(x, y)} \\
&= \arg \min_{\hat p} \underbrace{E_{(x, y) \sim p} \ln p(x, y)}_{\text{constant w.r.t. } \hat p} - E_{(x, y) \sim p} \ln \hat p(x, y) \\
&= \arg \max_{\hat p} E_{(x, y) \sim p} \ln \hat p(x, y)
\end{aligned}
$$

Since we do not know \(p\), we estimate \(E_{(x, y) \sim p} \ln \hat p(x, y)\) using Monte-Carlo estimation and the Law of Large Numbers

$$
E_{(x, y) \sim p} \ln \hat p(x, y) \approx \dfrac{1}{n} \sum_{i=1}^n \ln \hat p(x_i, y_i) = \dfrac{1}{n} \ln L(\hat p), \quad n \to \infty
$$

Hence \(\arg \min_{\hat p} D_\text{KL}(p \Vert \hat p) \approx \arg \max_{\hat p} \ln L(\hat p) = \arg \max_{\hat p} L(\hat p)\)
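A minimal numerical check of this equivalence (assuming a Gaussian \(p\) and a one-parameter Gaussian family for \(\hat p\); all values are illustrative): the \(\hat p\) that maximizes the Monte-Carlo estimate of \(E \ln \hat p\) is the member of the family closest to \(p\).

```python
import numpy as np

rng = np.random.default_rng(0)

# True distribution p: N(0, 1); model family p_hat: N(mu, 1)
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)

def avg_log_likelihood(mu, x):
    # Monte-Carlo estimate of E_p[ln p_hat(x)] for p_hat = N(mu, 1)
    return np.mean(-0.5 * np.log(2 * np.pi) - 0.5 * (x - mu) ** 2)

# The mu maximizing the average log-likelihood is close to the true mean (0),
# i.e. the p_hat closest to p in KL divergence
mus = np.linspace(-2, 2, 81)
best_mu = mus[np.argmax([avg_log_likelihood(m, samples) for m in mus])]
print(best_mu)  # ~0.0
```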
Likelihood¶
Probability of observing data \(x\) according to pdf \(p(x)\)
Estimation¶
Choose the distribution \(p(x)\) that maximizes the (log) likelihood of the observed \(x\)
The sketch below shows MLE for a single observed point
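A minimal sketch (assuming a normal likelihood with known variance; the observed value is illustrative): for a single observed point, the likelihood is maximized by centering the distribution on that point.

```python
import numpy as np
from scipy.stats import norm

x_obs = 2.5  # single observed data point (illustrative)

# Evaluate the log-likelihood ln N(x_obs | mu, 1) over a grid of candidate means
mus = np.linspace(-5, 10, 1501)
log_lik = norm.logpdf(x_obs, loc=mus, scale=1.0)

mu_mle = mus[np.argmax(log_lik)]
print(mu_mle)  # ~2.5: the MLE places the mean at the observed point
```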
MLE for Regression¶
If we assume that the errors are normally distributed, and we want unbiased predictions, then \(u_i \sim N(0, \sigma^2_i)\)

$$
\begin{aligned}
\mathcal{L} &= P(u_1, u_2, \dots, u_n \vert \hat \theta) \\
&= \prod_i^n P(u_i) \\
&= \prod_i^n \dfrac{1}{\sqrt{2 \pi \sigma^2_i}} \times \exp \left[ \dfrac{-1}{2} \left( \dfrac{u_i - \mu_u}{\sigma_i} \right)^2 \right] \\
&= \prod_i^n \dfrac{1}{\sqrt{2 \pi \sigma^2_i}} \times \exp \left[ \dfrac{-1}{2} \left( \dfrac{u_i}{\sigma_i} \right)^2 \right] && (\mu_u = 0)
\end{aligned}
$$
$$
\begin{aligned}
\ln \mathcal{L} &= \dfrac{-1}{2} \Big[ \sum_i^n \ln (2 \pi \sigma^2_i) + \chi^2 \Big] \\
\implies -2 \ln \mathcal{L} &= \sum_i^n \ln (2 \pi \sigma^2_i) + \chi^2 \\
&= n \ln (2 \pi) + n \ln (\text{MSE}) + \sum_i^n \ln w_i + n \\
&\approx n \ln (\text{MSE})
\end{aligned}
$$

Only the \(n \ln (\text{MSE})\) term depends on the fitted parameters, so maximizing the likelihood is equivalent to minimizing the MSE.
Optimization¶
By setting the derivative to 0, we can also use this to derive the matrix form of the normal equations
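A minimal sketch of this result (illustrative data; the normal-equations solution is the maximizer of the Gaussian likelihood):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: y = 2 + 3x + noise
n = 200
x = rng.uniform(-1, 1, n)
X = np.column_stack([np.ones(n), x])          # design matrix with intercept
y = 2 + 3 * x + rng.normal(0, 0.5, n)

# Maximizing the Gaussian likelihood <=> minimizing MSE;
# setting the derivative to zero gives the normal equations X'X beta = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

print(beta_hat)  # ~[2, 3]
```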
Note¶
$$
\begin{aligned}
E[\chi^2] &= n - p \\
\implies E[-2 \ln \mathcal{L}] &= \sum_i^n \ln (2 \pi \sigma^2_i) + (n - p)
\end{aligned}
$$
MLE for Classification¶
Assume that \(y_i \sim \text{Bernoulli}(p_i)\)

$$
\begin{aligned}
\mathcal{L} &= \prod_i^n p_i^{y_i} (1 - p_i)^{1 - y_i} \\
\implies \ln \mathcal{L} &= \sum_i^n \Big[ y_i \ln p_i + (1 - y_i) \ln (1 - p_i) \Big]
\end{aligned}
$$

Maximizing this log-likelihood is equivalent to minimizing the binary cross-entropy loss.
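A minimal sketch (labels and predicted probabilities are illustrative): the Bernoulli log-likelihood is the negative of the (unnormalized) binary cross-entropy.

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0])                 # observed labels (illustrative)
p_hat = np.array([0.9, 0.2, 0.7, 0.6, 0.1])   # predicted probabilities

# Bernoulli log-likelihood = negative binary cross-entropy (unnormalized)
log_lik = np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
print(log_lik)
```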
M-Estimation¶
Minimize some other loss function in place of the negative log-likelihood, such as MAE, Huber loss, etc
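A minimal sketch of an M-estimator (a Huber-loss location estimate; the data and tuning constant are illustrative):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 95), np.full(5, 20.0)])  # data with outliers

def huber(r, delta=1.345):
    # Huber loss: quadratic near 0, linear in the tails
    return np.where(np.abs(r) <= delta,
                    0.5 * r**2,
                    delta * (np.abs(r) - 0.5 * delta))

# M-estimate of location: minimize the summed Huber loss of the residuals
result = minimize_scalar(lambda mu: np.sum(huber(x - mu)))
print(result.x)  # near 0, more robust to the outliers than the plain mean
```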
Bayesian¶
When is it justified?
- If the prior is valid: better than MLE
- If the prior is irrelevant: it just acts as a computational catalyst
- \(D = y \vert x\)
- \(P(\hat f = f)\) is the prior belief, representing our understanding of the
  - Distribution of model parameters
  - Distribution of residuals
Usually we need to solve Bayes’ equation numerically using MCMC (Markov Chain Monte Carlo) sampling. The result is a set of points drawn from the posterior distribution, which we then summarize.
Disadvantage: We need to calculate a lot of probabilities -> Computationally-expensive
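A minimal sketch of the idea (a random-walk Metropolis sampler for the posterior of a normal mean; the prior, likelihood, and proposal width are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(3.0, 1.0, size=50)            # illustrative observations

def log_posterior(mu):
    log_prior = -0.5 * (mu / 10.0) ** 2          # wide N(0, 10^2) prior on mu
    log_lik = -0.5 * np.sum((data - mu) ** 2)    # N(mu, 1) likelihood
    return log_prior + log_lik

# Random-walk Metropolis: propose a move, accept with probability min(1, ratio)
samples, mu = [], 0.0
for _ in range(5000):
    proposal = mu + rng.normal(0, 0.5)
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(mu):
        mu = proposal
    samples.append(mu)

posterior = np.array(samples[1000:])             # discard burn-in
print(posterior.mean(), np.quantile(posterior, [0.025, 0.975]))
```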
| Hypothesis | | |
|---|---|---|
| Maximum Likelihood | Model that best explains the training data | \(h_\text{ML} = \underset{h_i \in H}{\arg \max} \ P(D \vert h_i)\) |
| Maximum A Posteriori Probability | Model that is most probable given the training data | \(h_\text{MAP} = \underset{h_i \in H}{\arg \max} \ P(h_i \vert D) = \underset{h_i \in H}{\arg \max} \ P(D \vert h_i) \, P(h_i)\) |
Steps¶
- Pick prior distribution
- Calculate likelihood function, similar to MLE
- Calculate posterior distribution, usually numerically
- Summarize posterior distribution
  - MAP estimate
  - Credible interval
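A minimal sketch of these steps (a grid approximation of the posterior for a Bernoulli parameter; the prior, data, and grid are illustrative):

```python
import numpy as np

# Step 1: prior over theta on a grid (uniform, i.e. an uninformative prior)
theta = np.linspace(0.001, 0.999, 999)
prior = np.ones_like(theta)

# Step 2: likelihood of observing 7 successes in 10 Bernoulli trials
k, n = 7, 10
likelihood = theta**k * (1 - theta) ** (n - k)

# Step 3: posterior, computed numerically and normalized over the grid
posterior = prior * likelihood
posterior /= posterior.sum()

# Step 4: summarize the posterior with a MAP estimate and a 95% credible interval
map_estimate = theta[np.argmax(posterior)]
cdf = np.cumsum(posterior)
ci = np.interp([0.025, 0.975], cdf, theta)
print(map_estimate, ci)   # MAP ~0.7
```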
Prior Distribution¶
\(P(\theta) = P(\theta \vert I)\), where \(I=\) all info we have before data collection
| Prior | |
|---|---|
| Uninformative/Objective/Baseline | If we have no prior knowledge, then \(P(\theta \vert I) =\) constant. Hence \(P(\hat \theta \vert D) \propto P(D \vert \hat \theta)\), so we might as well perform MLE instead. e.g. uniform distribution over the expected range of possible values |
| Informative/Substantive | Based on previous data, experiments, knowledge. One can assume that the prior for each parameter is independent of the others, \(P(\theta) = P(\beta) \, P(\sigma^2)\), but usually a joint distribution is required. Setting the prior to a delta function fixes the parameter regardless of the data (never done in practice, as it defeats the purpose of collecting data) |
Reparametrization¶
Helps make prior assignment easy
For eg:

$$
\begin{aligned}
\hat y_i &= \beta_0 + \beta_1 x_i \\
\implies \hat y_i &= \beta_0' + \beta_1 (x_i - \bar x) \\
\\
\text{Hence } \beta_0 &= \hat y_i \vert (x_i = 0) \\
\implies \beta_0' &= \hat y_i \vert (x_i = \bar x)
\end{aligned}
$$
- This makes it easier to specify the prior for the intercept
- We can assume that \(\beta_0'\)’s prior is independent of \(\beta_1\)’s prior
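A minimal sketch (illustrative data): after centering \(x\), the intercept becomes the predicted value at \(\bar x\), which is usually much easier to put a prior on.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(50, 100, 200)                  # predictor far from 0 (illustrative)
y = 10 + 0.5 * x + rng.normal(0, 2, 200)

def fit_line(x, y):
    # Ordinary least squares for y = b0 + b1 * x
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.solve(X.T @ X, X.T @ y)

b0, b1 = fit_line(x, y)                        # intercept = prediction at x = 0
b0_prime, _ = fit_line(x - x.mean(), y)        # intercept = prediction at x = x_bar

print(b0, b0_prime, y.mean())                  # b0_prime ~ mean of y
```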
Conjugate Priors¶
Special priors for which, given a particular likelihood distribution, the posterior distribution has an analytical solution (in the same family as the prior)
For eg:
- iid normal errors:
  - the conjugate prior for \(\beta\) is normal
  - the conjugate prior for \(\sigma^2_u\) is inverse gamma
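A minimal sketch of a conjugate update (normal prior on a mean with known error variance, a standard conjugate pair; all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(5.0, 2.0, size=30)   # observations with known sigma = 2

# Prior: mu ~ N(0, 10^2); likelihood: x_i ~ N(mu, 2^2)
prior_mean, prior_var = 0.0, 10.0**2
sigma2 = 2.0**2
n = data.size

# Conjugacy: normal prior * normal likelihood -> normal posterior, in closed form
post_var = 1.0 / (1.0 / prior_var + n / sigma2)
post_mean = post_var * (prior_mean / prior_var + data.sum() / sigma2)

print(post_mean, np.sqrt(post_var))    # posterior concentrates near the sample mean
```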
Posterior Distribution¶
The output of Bayesian estimation is not a set of best-fit parameters; it is the posterior distribution: a probability distribution for each parameter and for \(\sigma_e\)
We need to summarize the distribution
- Best estimate: a summary statistic such as
  - mode (Maximum A Posteriori)
  - mean
  - median
  - etc
- Credible interval: quantiles of the posterior
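A minimal sketch of summarizing posterior samples (e.g. MCMC draws; here the draws are simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
posterior_samples = rng.normal(1.5, 0.3, size=10_000)  # stand-in for MCMC draws

# Best estimates
post_mean = posterior_samples.mean()
post_median = np.median(posterior_samples)

# 95% credible interval from the quantiles of the draws
ci_low, ci_high = np.quantile(posterior_samples, [0.025, 0.975])
print(post_mean, post_median, (ci_low, ci_high))
```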
Posterior distribution describes how much the data has changed our prior beliefs
Bernstein-von Mises Theorem¶
For very large \(n\), the posterior distribution becomes independent of the prior distribution, as long as the prior probability is not exactly \(0\) or \(1\) (i.e., the prior does not completely rule out or fix any value)
The posterior tends towards a normal distribution centered at the MLE (assuming iid data)
Problems¶
- Computationally-expensive
- Choosing appropriate prior
- Is it reasonable to treat every parameter as a random variable?
MLE + Bayesian¶
Relationship¶
MLE = Bayesian with Jeffreys Prior
Jeffreys Prior¶
It is improper, as it integrates to \(\infty\) rather than \(1\)
The resulting posterior distribution is \(t\)-distributed about the MLE parameter estimates
MAP as Regularized Regression¶
Taking \(-\ln\) of Bayes’ equation

$$
\begin{aligned}
- \ln p(\hat \theta \vert D) &= - \ln L - \ln p(\hat \theta) + \underbrace{\ln p(D)}_{\text{constant}} \\
\min \{ - \ln p(\hat \theta \vert D) \} &= \min \{ - \ln L - \ln p(\hat \theta) + \cancel{\ln p(D)} \} \\
&= \min \left\{ \chi^2 + \sum \left( \dfrac{\hat \beta - \mu_{\beta}}{\sigma_\beta} \right)^2 \right\} \\
&= \text{Regularized Regression}
\end{aligned}
$$
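A minimal sketch of this equivalence (a zero-centered normal prior on the coefficients gives ridge regression; the data and prior width are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 1.0, n)

sigma2_u = 1.0        # residual variance (assumed known here)
sigma2_beta = 0.5     # prior variance of the coefficients, beta ~ N(0, sigma2_beta)

# Minimizing chi^2 + sum(beta^2 / sigma2_beta) gives the ridge (MAP) solution
lam = sigma2_u / sigma2_beta
beta_map = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta_map)       # shrunk towards the prior mean of 0
```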
Likelihood¶
import numpy as np


def ll(X, y, pred):
    # Gaussian log-likelihood of a regression fit,
    # with the error variance replaced by its MLE (the MSE)
    mse = np.mean((y - pred) ** 2)
    n = float(X.shape[0])
    n_2 = n / 2
    return -n_2 * np.log(2 * np.pi) - n_2 * np.log(mse) - n_2


def aic(X, y, pred):
    # Akaike Information Criterion: -2 ln L + 2p, where p = number of parameters
    p = X.shape[1]
    return -2 * ll(X, y, pred) + 2 * p


print(aic(X, y, pred))  # assumes X, y, pred come from a previously fitted model