
Ensemble Learning

Reduce variance of models

Why do they work

  • Statistical: Average predictions
  • Computational: Average local optima
  • Representational: Extend hypothesis space


Advantages

  • Improved accuracy
  • Reduced variance
  • Noisy, uninformative signals tend to average out and have little effect

Disadvantages

  • Not interpretable
  • Do not work well with unstructured data (images, audio)
  • Computationally expensive

Steps

  1. Divide dataset into subsets
  2. For each subset, apply a model
    • This model is usually a decision tree
  3. Aggregate the results (see the sketch below)
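
A minimal sketch of these steps in Python, assuming scikit-learn and NumPy are available; the toy dataset, the number of models `B`, and the hard-vote aggregation are illustrative choices, not a prescribed implementation.

```python
# Bagging-style sketch: bootstrap subsets -> fit a decision tree on each -> aggregate by vote
import numpy as np
from sklearn.datasets import make_classification  # toy data for illustration
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.default_rng(0)

B = 25  # number of subsets / base models (hypothetical choice)
trees = []
for _ in range(B):
    idx = rng.integers(0, len(X), size=len(X))                  # 1. sample a subset with replacement
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))  # 2. fit a model (decision tree) on it

votes = np.stack([t.predict(X) for t in trees])  # 3. aggregate the results:
y_hat = (votes.mean(axis=0) > 0.5).astype(int)   #    majority vote for {0, 1} labels
print("Training accuracy of the bagged trees:", (y_hat == y).mean())
```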

Stability of Classifier

For unstable models, the fitted model changes significantly when a new training point is added, so the model has to be re-fit

For stable models, this is not required

Learning Techniques

| | Single | Bagging (Bootstrap Aggregation) | Boosting | Blending/Stacking |
|---|---|---|---|---|
| Training sequence | N/A | Parallel | Sequential: forward stage-wise algorithm to fit an adaptive additive model (adaptive basis functions) \(\hat f = \sum_{m=1}^{M} \alpha_m \hat f_m\) | Parallel/Sequential |
| Individual learners | - | Overfitting | Underfitting (slightly better than average) | - |
| No. of learners | 1 | \(n\) | \(n\) | \(n\) |
| Training | Complete training | Random sampling with replacement; aggregate the results at the end | Random sampling with replacement over weighted data; only pass over the mis-classified points by boosting their probability of being picked again | - |
| Preferred for | - | Linear data | Non-linear data | - |
| Example | - | Random Forest | XGBoost | - |
| Comment | - | Only effective for low-bias, high-variance models | Only effective if the misclassification rate of individual classifiers is < 0.5 | - |
| Advantages | - | Fast (parallel training) | - | - |
| Disadvantages | - | - | Overfitting; slow to train | - |

Bagging

“Wisdom of the crowds”

Bagged classifier’s misclassification rate behaves like a binomial distribution

Bagging a good classifier can improve predictive accuracy, but bagging a bad one hurts

$$
\text{MSE}' = \dfrac{1}{k} \text{MSE} + \dfrac{k-1}{k} C
$$

where \(C =\) covariance between the bagged classifiers
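
For intuition, plugging in purely hypothetical numbers, say \(k = 10\) bagged classifiers with individual \(\text{MSE} = 1.0\) and average covariance \(C = 0.2\):

$$
\text{MSE}' = \dfrac{1}{10}(1.0) + \dfrac{9}{10}(0.2) = 0.1 + 0.18 = 0.28
$$

The gain therefore shrinks as the classifiers become more correlated, which is what Random Forest addresses by decorrelating the trees.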

Classification

\(\hat y_i\) is obtained by either of the following (sketched below)

  • Majority/Hard voting
  • Soft voting (averaging the predicted class probabilities)
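
A sketch of both schemes, assuming an already-fitted list of classifiers exposing scikit-learn-style `predict` / `predict_proba` methods; the function names are illustrative.

```python
import numpy as np

def hard_vote(classifiers, X):
    """Majority vote over predicted class labels (binary {0, 1} labels assumed)."""
    votes = np.stack([clf.predict(X) for clf in classifiers])  # (n_models, n_samples)
    return (votes.mean(axis=0) > 0.5).astype(int)

def soft_vote(classifiers, X):
    """Average the predicted class probabilities, then take the most probable class."""
    probas = np.stack([clf.predict_proba(X) for clf in classifiers])  # (n_models, n_samples, n_classes)
    return probas.mean(axis=0).argmax(axis=1)
```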

Training Speed

It cannot be said that boosting is slower than bagging just because it is sequential while bagging is parallel.

Boosting may finish in just 10 iterations, whereas bagging may need 1000 classifiers.

Random Forest

Bagging with reduced correlation between the sampled trees, through random selection of \(m \ll p\) input variables as candidates for each split

Usually \(m = \sqrt{p}\), where \(p\) is the total number of input variables
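
A minimal sketch with scikit-learn (assumed available); the dataset and the number of trees are illustrative, and `max_features="sqrt"` corresponds to the \(m = \sqrt{p}\) rule above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=25, random_state=0)  # p = 25 predictors

rf = RandomForestClassifier(
    n_estimators=200,     # number of bagged trees (illustrative)
    max_features="sqrt",  # m = sqrt(p) candidate features per split -> decorrelates the trees
    random_state=0,
).fit(X, y)
print("Training accuracy:", rf.score(X, y))
```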

Boosting

\(\lambda\) is the learning rate

Regression

  1. Set \(\hat f(x) = 0 \implies u_i = y_i \quad \forall i\)

  2. For \(b=1, \dots, B:\)

  3. Fit a regression model \(\hat f_b(x)\) to the training data, using the residuals \(u_i\) as the response, to obtain \(\hat y_b\)

  4. Update \(\hat y\) with a shrunken version of \(\hat f_b\): \(\hat y = \hat y + \lambda \hat y_b\)

  5. Update the residuals: \(u_i = u_i - \lambda \hat y_b\)

  6. Output: \(\hat y = \sum_{b=1}^B \lambda \hat y_b\)

In each iteration, we fit the model to the residuals: this re-weights the training data so that observations that did not fit well (large \(u_i\)) become more important in the next iteration.
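
A from-scratch sketch of these steps, using shallow scikit-learn regression trees as the base learner; the toy data, \(B\), \(\lambda\), and the tree depth are all illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_regression(X, y, B=100, lam=0.1, max_depth=2):
    u = y.astype(float).copy()     # step 1: f_hat = 0  =>  u_i = y_i
    learners = []
    for _ in range(B):             # step 2: for b = 1, ..., B
        f_b = DecisionTreeRegressor(max_depth=max_depth).fit(X, u)  # step 3: fit to the residuals
        learners.append(f_b)
        u -= lam * f_b.predict(X)  # steps 4-5: shrink by lambda, update the residuals
    return learners

def boosted_predict(learners, X, lam=0.1):
    return lam * np.sum([f.predict(X) for f in learners], axis=0)  # step 6: y_hat = sum_b lambda * f_b(x)

# Toy usage
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)
learners = boost_regression(X, y)
print("Training MSE:", np.mean((boosted_predict(learners, X) - y) ** 2))
```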

Classification

**Ada**ptive **Boost**ing

The boosted classifier is a weighted sum of individual classifiers, with weights proportional to each classifier’s accuracy on the training set (good classifiers get more weight)

In AdaBoost, if an individual classifier has accuracy < 50%, we flip the sign of its predictions and turn it into a classifier with accuracy > 50%. This is achieved by making \(\alpha_b\) < 0 so that the classifier enters negatively into the final hypothesis.

In each iteration, we re-weight the obs in the training data such that misclassified points in the previous round see their weights increase compared to correctly classified points. Hence, successive classifiers focus more on misclassified points.

Steps

  1. Let \(y \in \{ -1, 1 \}\)

  2. Let \(w_i = 1/n \quad \forall i\)

  3. For \(b= 1, \dots, B\)

  4. Fit a classifier \(\hat f_b\) to the training data by minimizing the weighted error

    \(\dfrac{\sum_{i=1}^n w_i (\hat y_b \ne y_i)}{\sum_{i=1}^n w_i}\)

  5. Let \(\alpha_b = \log \vert (1-\epsilon_b)/\epsilon_b \vert\) where \(\epsilon_b\) is the weighted error of \(\hat f_b (x)\)

  6. \(L_i = \exp \Big( \alpha_b (\hat y_b \ne y_i) \Big)\)

  7. Update \(w_i\)

    \(w_i = w_i \cdot L_i\)

  8. Output: \(\hat y = \text{sign} (\sum_{b=1}^B \alpha_b \cdot \hat y_b)\)
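
A minimal sketch of these steps, using depth-1 scikit-learn trees (stumps) as the weak learner; labels are assumed to be in \(\{-1, +1\}\), and the clipping of \(\epsilon_b\) is a numerical guard rather than part of the algorithm above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, B=50):
    n = len(y)
    w = np.full(n, 1 / n)                    # step 2: w_i = 1/n
    stumps, alphas = [], []
    for _ in range(B):                       # step 3: for b = 1, ..., B
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)  # step 4
        pred = stump.predict(X)
        eps = np.clip(w[pred != y].sum() / w.sum(), 1e-10, 1 - 1e-10)  # weighted error
        alpha = np.log((1 - eps) / eps)      # step 5
        w = w * np.exp(alpha * (pred != y))  # steps 6-7: up-weight the mis-classified points
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    agg = sum(a * s.predict(X) for a, s in zip(alphas, stumps))  # step 8: alpha-weighted sum
    return np.sign(agg)
```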

Optimization

Instead of doing a global minimization, the boosting strategy follows a forward stage-wise procedure by adding basis functions one by one

Stage-wise vs step-wise

| | Stage-wise | Step-wise |
|---|---|---|
| Coefficients updated at each step | One | All |
| Optimality | Worse | Better |
| Computation cost | Low | High |

Additive Boosting

\[ \hat f = \sum_{l=1}^L \alpha_l \hat f_l(x; \theta_l) \]

where \(\hat f_l\) can be linear or non-linear

FSAM

Forward Stage-wise Additive Modeling

At each iteration for learner \(l\)

\[ \arg \min \limits_{\alpha_l, \theta_l} \sum_{i=1}^n \mathcal L \Big( y_i, \underbrace{\hat f_{[\small {1, l-1}]}(x_i)}_{\mathclap{\text{Constant}}} + \alpha_l \hat f_{l}(x_i, \theta_l) \Big) \]

Limitations

  • Greedy local search: we may miss a better solution
  • May overfit
  • Optimization is computationally expensive

Gradient Boosting

Performs functional optimization of the cost function: functional gradient descent with approximate gradients

$$
\begin{aligned}
\hat f_e(x) & \leftarrow \hat f_{e-1}(x) - \eta \nabla_f J(f) \\
\hat f_0(x) &= 0
\end{aligned}
$$

Works with any differentiable loss function
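
A sketch of this idea under stated assumptions: at every step a shallow scikit-learn regression tree (an illustrative choice of weak learner) is fit to the negative gradient of the loss at the current predictions; the squared and absolute losses shown are just two examples of differentiable losses.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def neg_gradient(y, f, loss="squared"):
    if loss == "squared":   # J = 1/2 (y - f)^2  ->  -dJ/df = y - f (the residual)
        return y - f
    if loss == "absolute":  # J = |y - f|        ->  -dJ/df = sign(y - f)
        return np.sign(y - f)
    raise ValueError(f"unknown loss: {loss}")

def gradient_boost(X, y, B=100, eta=0.1, loss="squared"):
    f = np.zeros(len(y))              # f_0(x) = 0
    learners = []
    for _ in range(B):
        g = neg_gradient(y, f, loss)  # approximate -grad_f J(f) at the training points
        h = DecisionTreeRegressor(max_depth=2).fit(X, g)
        f += eta * h.predict(X)       # f_e <- f_{e-1} - eta * grad_f J(f)
        learners.append(h)
    return learners
```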
