Ensemble Learning¶
Reduce variance of models
Why do they work
- Statistical: Average predictions
- Computational: Average local optima
- Representational: Extend hypothesis space
Advantages¶
- Improved accuracy
- Reduced variance
- Noisy, uninformative signals tend to average out and have little effect
Disadvantages¶
- Not interpretable
- Not well-suited to unstructured data (images, audio)
- Computationally expensive
Steps¶
- Divide the dataset into subsets
- Fit a model (usually a decision tree) to each subset
- Aggregate the results
Stability of Classifier¶
Unstable models (e.g., decision trees) can change substantially when a single training point is added or removed; stable models are largely unaffected.
Ensembling is therefore most useful for unstable, high-variance models.
Learning Techniques¶
|  | Single | Bagging (Bootstrap aggregation) | Boosting | Blending/Stacking |
|---|---|---|---|---|
| Training sequence | N/A | Parallel | Sequential | Parallel/Sequential |
| Model |  |  | Forward stage-wise fit of an adaptive additive model (adaptive basis functions): \(\hat f = \sum_{m=1}^{M} \alpha_m \hat f_m\) |  |
| Individual learners |  | Overfitting | Underfitting; slightly better than average |  |
| No. of learners | 1 | \(n\) | \(n\) | \(n\) |
| Training | Complete training data | Random sampling with replacement; aggregate the results at the end | Random sampling with replacement over weighted data; misclassified points get a boosted probability of being picked again |  |
| Preferred for |  | Linear data | Non-linear data |  |
| Example |  | Random forest | XGBoost |  |
| Comment |  | Only effective for low-bias, high-variance models | Only effective if the misclassification rate of individual classifiers is < 0.5 |  |
| Advantages |  | Fast (parallel training) |  |  |
| Disadvantages |  |  | Overfitting; slow to train |  |
Bagging¶
“Wisdom of the crowds”
The number of incorrect votes of a bagged classifier behaves like a binomial distribution: if the individual classifiers err (roughly independently) with probability < 0.5, majority voting drives the ensemble error down as more classifiers are added.
Bagging a good classifier can improve predictive accuracy, but bagging a bad one hurts:

$$ \text{MSE}' = \dfrac{1}{k} \text{MSE} + \dfrac{k-1}{k} C $$

where \(k\) is the number of bagged classifiers and \(C\) is the covariance between their predictions.
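A minimal sketch of bagging for regression, assuming scikit-learn decision trees as the high-variance base learners and synthetic 1-D data; `bagged_predict` is an illustrative helper, not a library function:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))                 # synthetic 1-D data
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

def bagged_predict(X_train, y_train, X_test, k=50):
    """Train k deep trees on bootstrap samples and average their predictions."""
    n = len(X_train)
    preds = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)              # bootstrap: sample with replacement
        tree = DecisionTreeRegressor()                # low-bias, high-variance base learner
        tree.fit(X_train[idx], y_train[idx])
        preds.append(tree.predict(X_test))
    return np.mean(preds, axis=0)                     # averaging reduces variance

X_test = np.linspace(-3, 3, 50).reshape(-1, 1)
y_bagged = bagged_predict(X, y, X_test)
```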
Classification¶
|  | \(\hat y\) |
|---|---|
| Majority/Hard vote | \(\hat y = \text{mode}(\hat y_1, \dots, \hat y_n)\) (most common predicted class across the \(n\) classifiers) |
| Soft voting | \(\hat y = \arg\max_c \frac{1}{n} \sum_{b=1}^{n} \hat p_b(c)\) (class with the highest average predicted probability) |
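A sketch of the two voting rules, assuming a list of already-fitted scikit-learn-style classifiers exposing `predict`/`predict_proba` and integer class labels \(0, \dots, K-1\); `hard_vote` and `soft_vote` are hypothetical helpers:

```python
import numpy as np

def hard_vote(classifiers, X):
    # Majority vote: the most frequent predicted label per sample
    votes = np.stack([clf.predict(X) for clf in classifiers]).astype(int)  # (n_clf, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)

def soft_vote(classifiers, X):
    # Average the predicted class probabilities, then pick the most probable class
    avg_proba = np.mean([clf.predict_proba(X) for clf in classifiers], axis=0)
    return avg_proba.argmax(axis=1)
```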
Training Speed¶
Boosting is not necessarily slower than bagging just because boosting is sequential and bagging is parallel: boosting may converge in just 10 iterations, whereas bagging may need 1000 classifiers.
Random Forest¶
Bagging of trees with reduced correlation between them, achieved by randomly selecting only \(m \ll p\) of the input variables as split candidates at each split
Usually \(m = \sqrt{p}\)
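A minimal scikit-learn sketch on synthetic data; `max_features="sqrt"` is how the \(m = \sqrt{p}\) rule is expressed there:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rf = RandomForestClassifier(
    n_estimators=200,      # number of bagged trees
    max_features="sqrt",   # m = sqrt(p) candidate features considered at each split
    random_state=0,
)
rf.fit(X, y)
```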
Boosting¶
\(\lambda\) is the learning rate
Regression¶
- Set \(\hat f(x) = 0\), so that the residuals are \(u_i = y_i \quad \forall i\)
- For \(b = 1, \dots, B\):
    - Fit a regression model \(\hat f_b(x)\) to the residuals \((x_i, u_i)\) to obtain predictions \(\hat y_b\)
    - Update \(\hat y\) with a shrunken version of \(\hat f_b\): \(\hat y = \hat y + \lambda \hat y_b\)
    - Update the residuals: \(u_i = u_i - \lambda \hat y_b\)
- Output: \(\hat y = \sum_{b=1}^B \lambda \hat y_b\)
In each iteration, we fit the model to the residuals: observations that did not fit well (large \(u_i\)) effectively become more important in the next iteration.
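A from-scratch sketch of the steps above, assuming shallow scikit-learn regression trees as base learners and synthetic data; `boost_regression` is an illustrative helper and `lam` plays the role of \(\lambda\):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_regression(X, y, B=100, lam=0.1, depth=2):
    u = np.asarray(y, dtype=float).copy()     # residuals: u_i = y_i since f_hat = 0
    learners = []
    for _ in range(B):
        f_b = DecisionTreeRegressor(max_depth=depth).fit(X, u)   # fit to residuals
        y_b = f_b.predict(X)
        u = u - lam * y_b                     # update residuals with the shrunken fit
        learners.append(f_b)
    def predict(X_new):                       # y_hat = sum_b lambda * f_b(x)
        return lam * sum(f.predict(X_new) for f in learners)
    return predict

# Usage on synthetic data
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)
predict = boost_regression(X, y)
```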
Classification¶
**Ada**ptive **Boost**ing
The boosted classifier is a weighted sum of individual classifiers, with weights proportional to each classifier’s accuracy on the training set (good classifiers get more weight)
In AdaBoost, if an individual classifier has accuracy < 50%, we flip the sign of its predictions and turn it into a classifier with accuracy > 50%. This is achieved by making \(\alpha_b\) < 0 so that the classifier enters negatively into the final hypothesis.
In each iteration, we re-weight the training observations so that points misclassified in the previous round have their weights increased relative to correctly classified points. Hence, successive classifiers focus more on the misclassified points.
Steps¶
- Let \(y \in \{ -1, 1 \}\)
- Initialize the weights: \(w_i = 1/n \quad \forall i\)
- For \(b = 1, \dots, B\):
    - Fit a classifier \(\hat f_b\) to the training data by minimizing the weighted error \(\epsilon_b = \dfrac{\sum_{i=1}^n w_i \, \mathbb{1}(\hat y_b \ne y_i)}{\sum_{i=1}^n w_i}\)
    - Let \(\alpha_b = \log \big( (1-\epsilon_b)/\epsilon_b \big)\) (negative when the weighted error \(\epsilon_b > 0.5\))
    - Compute \(L_i = \exp \Big( \alpha_b \, \mathbb{1}(\hat y_b \ne y_i) \Big)\)
    - Update the weights: \(w_i = w_i \cdot L_i\)
- Output: \(\hat y = \text{sign} \Big( \sum_{b=1}^B \alpha_b \, \hat y_b \Big)\)
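A from-scratch sketch of these steps, assuming decision stumps (depth-1 scikit-learn trees) as weak learners and labels in \(\{-1, +1\}\); `adaboost` is an illustrative helper:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, B=50):
    """AdaBoost with decision stumps; y must take values in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                          # w_i = 1/n
    stumps, alphas = [], []
    for _ in range(B):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)             # fit by minimizing the weighted error
        miss = (stump.predict(X) != y).astype(float)
        eps = np.sum(w * miss) / np.sum(w)           # weighted error eps_b
        eps = np.clip(eps, 1e-10, 1 - 1e-10)         # clipped for numerical stability
        alpha = np.log((1 - eps) / eps)              # alpha_b; negative if eps > 0.5
        w = w * np.exp(alpha * miss)                 # up-weight misclassified points
        stumps.append(stump)
        alphas.append(alpha)
    def predict(X_new):                              # sign of the weighted vote
        scores = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
        return np.sign(scores)
    return predict
```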
Optimization¶
Instead of doing a global minimization, the boosting strategy follows a forward stage-wise procedure by adding basis functions one by one
Stage-wise vs step-wise¶
|  | Stage-wise | Step-wise |
|---|---|---|
| Coefficients updated at each step | One | All |
| Optimality | Worse | Better |
| Computation cost | Low | High |
Additive Boosting¶
\(\hat f = \sum_{l} \alpha_l \hat f_l\), where each \(\hat f_l\) is linear/non-linear
FSAM¶
Forward Stage-wise Additive Modeling
At each iteration, learner \(l\) is fit with the previous learners held fixed:

$$ (\alpha_l, \hat f_l) = \arg\min_{\alpha, f} \sum_{i=1}^n L \Big( y_i, \; \sum_{m=1}^{l-1} \alpha_m \hat f_m (x_i) + \alpha f(x_i) \Big) $$
Limitations¶
- Greedy, local search: we may miss a better solution
- May overfit
- Optimization is computationally expensive
Gradient Boosting¶
Performs functional optimization of the cost function: functional gradient descent with approximate gradients

$$
\begin{aligned}
\hat f_e(x) & \leftarrow \hat f_{e-1}(x) - \eta \, \nabla_f J(f) \\
\hat f_0(x) &= 0
\end{aligned}
$$

Works with any differentiable loss function.
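A minimal sketch of functional gradient descent with small trees as the approximate gradient fitters; `grad_boost`, `squared_neg_grad`, and `absolute_neg_grad` are illustrative names, and the negative-gradient functions show how different differentiable losses plug in:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def squared_neg_grad(y, F):
    # Negative gradient of 0.5 * (y - F)^2 w.r.t. F: the ordinary residual
    return y - F

def absolute_neg_grad(y, F):
    # (Sub)gradient-based negative gradient of |y - F| w.r.t. F
    return np.sign(y - F)

def grad_boost(X, y, neg_grad, B=100, eta=0.1, depth=2):
    """Functional gradient descent: start from f_0 = 0, then repeatedly fit a
    small tree to the negative gradient of the loss and take a shrunken step eta."""
    F = np.zeros(len(y))                                   # f_0(x) = 0
    trees = []
    for _ in range(B):
        h = DecisionTreeRegressor(max_depth=depth).fit(X, neg_grad(y, F))
        F += eta * h.predict(X)                            # f_b = f_{b-1} + eta * h
        trees.append(h)
    def predict(X_new):
        return eta * sum(t.predict(X_new) for t in trees)
    return predict

# Usage on synthetic data, with either loss
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)
predict = grad_boost(X, y, neg_grad=squared_neg_grad)      # or absolute_neg_grad
```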