Optimization¶
Optimization is the process of finding a set of parameters \(\theta\) that minimizes the loss function for the given data and algorithm
IDK¶
Always test the optimization procedure on a problem with a known solution
Training Process¶
```mermaid
flowchart LR
    i[Initialize θ] --> j[Calculate cost function]
    j --> a{Acceptable?}
    a -->|Yes| stop([Stop])
    a -->|No| change[Update θ]
    change --> j
```
Steps¶
- Forward pass
- Backward pass
- Weights update
Optimization Parameters¶
Objective Function¶
An objective function is guaranteed to have a unique global minimum if it is
- differentiable
- strictly convex (plain convexity only guarantees that every local minimum is a global minimum)
Hard Constraints and Bounds¶
Useful if you know the governing differential equation of the underlying system, which can then be enforced exactly: $$ \text{DE} = 0 $$ Refer to PINNs (Physics-Informed Neural Networks) for more information
When it is not possible to use a discontinuous hard constraint/bound (such as \(\beta \ge k\)), you can add a barrier function to the cost function $$ J' = J + B $$ where \(B\) can be
- Exponential barrier: \(\pm \exp \{ m (\beta - k) \}\)
- where \(m=\) barrier coefficient
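A minimal sketch of this idea, assuming a toy quadratic cost and the lower bound \(\beta \ge k\); the sign/coefficient choice below is one variant of the \(\pm\) form above, and all names are illustrative:

```python
import numpy as np

def cost(beta):
    """Illustrative quadratic cost J(beta)."""
    return (beta - 2.0) ** 2

def exp_barrier(beta, k=1.5, m=10.0):
    """Exponential barrier B that blows up as beta falls below k (m = barrier coefficient)."""
    return np.exp(-m * (beta - k))

def penalized_cost(beta, k=1.5, m=10.0):
    """J' = J + B: softly enforces beta >= k during unconstrained optimization."""
    return cost(beta) + exp_barrier(beta, k, m)
```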
Soft Constraints: Regularization¶
Encourages (but does not guarantee) certain parameters to end up within a preferred range, by penalizing deviation from prior/preferred values
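As a sketch, a soft constraint can be implemented as an extra penalty term on the cost; the quadratic penalty and the weight `lam` below are illustrative choices:

```python
import numpy as np

def regularized_cost(J, theta, theta_prior, lam=0.01):
    """J' = J + lam * ||theta - theta_prior||^2.
    Encourages (but does not guarantee) theta to stay near the preferred values."""
    return J + lam * np.sum((theta - theta_prior) ** 2)
```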
Weights Initialization Algorithm¶
- Zero (bad)
- Random
- Glorot (Xavier)
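A sketch of the three listed schemes for one dense layer's weight matrix; the Glorot (Xavier) variant shown is the uniform form with limit \(\sqrt{6 / (\text{fan-in} + \text{fan-out})}\):

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 256, 128

w_zero = np.zeros((fan_in, fan_out))                      # zero init: every neuron gets identical updates (bad)
w_random = rng.normal(0.0, 0.01, size=(fan_in, fan_out))  # small random normal values

limit = np.sqrt(6.0 / (fan_in + fan_out))                 # Glorot/Xavier uniform limit
w_glorot = rng.uniform(-limit, limit, size=(fan_in, fan_out))
```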
Optimization Algorithms¶
Batch Size¶
Optimizer | Meaning | Comment | Gradient-Free | Weight Update Rule \(w_{t+1}\) | Advantages | Disadvantages |
---|---|---|---|---|---|---|
BGD (Batch Gradient Descent) | Update weights after viewing the entire dataset: \(n\) sample points | | ❌ | | Guaranteed convergence to a local minimum | Computationally expensive for large datasets; prone to getting stuck at non-optimal local minima for non-convex cost functions |
SGD (Stochastic Gradient Descent) | Update weights after viewing every sample point | | ❌ | \(w_t - \eta g(w_t)\) | Cheaper computation; faster updates; randomization helps escape "shallow" local minima | May not converge to the global minimum for non-convex cost functions; noisy/oscillating/erratic convergence |
MBGD (Mini-Batch Gradient Descent) | Update weights after viewing \(b\) sample points, where \(b < n\) (usually \(b = 32\)) | Middle ground between BGD and SGD; generalizes better than Adam | ❌ | | | |
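A NumPy sketch of the update loop shared by all three variants; `grad` is an assumed function returning the gradient of the cost for the given samples, and the batch size `b` selects BGD (\(b = n\)), SGD (\(b = 1\)), or MBGD (\(1 < b < n\)):

```python
import numpy as np

def train(w, X, y, grad, eta=0.01, b=32, epochs=10, seed=0):
    """(Mini-)batch gradient descent: w <- w - eta * g(w), with g estimated on b samples."""
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)                  # reshuffle samples each epoch
        for start in range(0, n, b):
            batch = idx[start:start + b]
            w = w - eta * grad(w, X[batch], y[batch])
    return w
```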
Speed Up Training¶
- Subsetting
- Feature-scaling
- Pruning
- Good weight initialization
- Good activation functions
- Transfer learning: Re-use parts of pre-trained network
- Using mini-batch updates
- Learning rate scheduling
- Faster optimization algorithm
- Use GPU/TPU
Subsetting¶
- Sample Size
    - Mini-Batch
    - Stochastic
- Input Features
With either approach, you can
- simply drop the unused subset, or
- use bagging, with each sub-model trained on its own subset
Feature Scaling¶
Helps speed up gradient descent by making it easier for the algorithm to reach the minimum
Get every feature into approximately the \(-1 \le x_i \le 1\) range
At least try to get within \(-3 \le x_i \le 3\) or \(-\frac13 \le x_i \le \frac13\)
Standardization¶
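A minimal sketch of standardization, assuming `X_train` is an (n_samples, n_features) array; the statistics are computed on the training set and reused for validation/test data:

```python
import numpy as np

X_train = np.random.default_rng(0).normal(50, 10, size=(1000, 3))  # illustrative data

mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)
X_scaled = (X_train - mu) / sigma   # each feature now has ~zero mean and unit variance
```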
Learning Rate¶
Convex Function¶
A convex function is one where $$ \begin{aligned} f(\alpha x + \beta y) &\le \alpha f(x) + \beta f(y) \\ \alpha + \beta &= 1; \quad \alpha, \beta \ge 0 \end{aligned} $$
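A quick numerical sanity check of this definition on an illustrative convex function, \(f(x) = x^2\):

```python
import numpy as np

f = lambda x: x ** 2
rng = np.random.default_rng(0)

for _ in range(1000):
    x, y = rng.normal(size=2)
    alpha = rng.uniform()
    beta = 1.0 - alpha
    # The inequality must hold for every pair of points and every convex combination
    assert f(alpha * x + beta * y) <= alpha * f(x) + beta * f(y) + 1e-12
```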
Robust Optimization¶
Limitations¶
- Parameters must be independent
- Cannot handle equality constraints
- Hard to estimate min and max value of parameter
- Method is extremely conservative
Batch Size¶
Resources¶
Let \(b=\) batch size $$ \begin{aligned} \text{Memory required per update} &\propto b \\ \text{Updates per epoch} &\propto \dfrac{1}{b} \end{aligned} $$
- Larger batch size means larger memory required to train a single batch at one time
- Larger batch size means fewer updates per epoch
Generalization¶
The following is only an empirical observation $$ \text{Generalization} \propto \dfrac{1}{b} $$ The noise from a smaller batch size helps escape suboptimal local minima
Learning Rate¶
Should be scaled according to batch size $$ \text{LR}' = \text{LR} \times (b/32) $$
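A sketch of this linear scaling rule, assuming the base learning rate was tuned at batch size 32:

```python
def scaled_lr(base_lr, batch_size, base_batch_size=32):
    """Linear scaling rule: LR' = LR * (b / 32)."""
    return base_lr * batch_size / base_batch_size

print(scaled_lr(1e-3, 256))  # 0.008
```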
Batching¶
When training a neural network, we usually divide the data into mini-batches and go through them one by one. The network predicts the labels for a batch, which are used to compute the loss with respect to the actual targets. Next, we perform a backward pass to compute gradients and update the model weights in the negative direction of those gradients. Mini-batches are used because
- the full dataset does not fit in memory
- convergence is faster due to stochasticity
Approaches to obtain gradient¶
- Exact: Use matrix differential calculus, Jacobians, Kronecker products, & vectorization
- Approximation
    - Pretend everything is a scalar
    - use the typical chain rule
    - rearrange/transpose matrices/vectors to make the sizes work
    - verify the result numerically (see the sketch below)
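A sketch of the numerical verification step: compare an analytical gradient against central finite differences (the cost function here is illustrative):

```python
import numpy as np

def J(w):
    return np.sum(w ** 2)      # illustrative cost

def grad_J(w):
    return 2 * w               # analytical gradient we want to verify

def numerical_grad(f, w, h=1e-6):
    """Central-difference approximation of the gradient of f at w."""
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = h
        g[i] = (f(w + e) - f(w - e)) / (2 * h)
    return g

w = np.random.default_rng(0).normal(size=5)
assert np.allclose(grad_J(w), numerical_grad(J, w), atol=1e-4)
```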
Initialization¶
Initialization is very important: weights don't move "that much" during training, so they often end up much closer to their initial values than to the "final" values that would be reached when optimizing from a different initial point
If you initialize all the weights to 0, the ANN will not learn anything: all the gradients will be the same, so all the parameter updates will be the same and symmetry is never broken
Kaiming Normal Initialization: based on the central limit theorem, we want the distribution of activations at every layer to stay approximately \(N(0, 1)\)
Poor initialization can lead to vanishing or exploding gradients
The choice of \(\sigma^2\) will affect
- Magnitude/Norm of forward activations
- Magnitude/Norm of gradients
\(\sigma^2\) | Norms |
---|---|
Too low | Vanishing |
Optimal | Right |
Too high | Exploding |
Here \(n=\) number of input neurons (fan-in) of the layer $$ \sigma = \sqrt{\dfrac{\text{Gain}}{n}} $$
Activation | Gain | Comment |
---|---|---|
Linear | \(1\) | |
Sigmoid | \(1\) | |
Tanh | \(5/3\) | |
ReLU | \(\sqrt{2}\) | ReLU sets roughly half the components of the activations to 0, so we need twice the variance to achieve the same output variance |
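A sketch that samples a weight matrix using the table's convention \(\sigma = \sqrt{\text{Gain}/n}\), with \(n\) taken as the layer's fan-in (function and variable names are illustrative):

```python
import numpy as np

def init_weights(n_in, n_out, gain, seed=0):
    """Sample W ~ N(0, sigma^2) with sigma = sqrt(gain / n_in)."""
    sigma = np.sqrt(gain / n_in)
    return np.random.default_rng(seed).normal(0.0, sigma, size=(n_in, n_out))

W_relu = init_weights(512, 256, gain=np.sqrt(2))  # ReLU gain from the table
W_tanh = init_weights(512, 256, gain=5 / 3)       # Tanh gain from the table
```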
Even when trained successfully, the effects/scales present at initialization persist throughout training
Solution¶
Normalization just before activation function
| | Batch Normalization | Layer Normalization | Group Normalization |
|---|---|---|---|
| Normalize activations of | each feature, across all inputs of a mini-batch at a layer | each input, across all features at a layer | |
| \(\hat w_{i}\) | \(\tilde w_i = \dfrac{w_{i} - E_b[w_{i}]}{\sqrt{\sigma^2_b(w_{i}) + \epsilon}}\), then \(\gamma_i \tilde w_i + \beta_i\): allows the network to learn the optimal distribution for the activation function, and/or undo batch norm | \(\dfrac{w_{i} - E_i[w_{i}]}{\sqrt{\sigma^2_i(w_{i}) + \epsilon}}\) | |
| Advantages | Improves gradient flow through the network; allows higher learning rates; reduces strong dependence on initialization; acts as a form of regularization due to stochastic inter-dependence of samples, slightly reducing the need for dropout | | |
| Limitation | Inter-dependence of training samples causes unwanted effects (Soln: below) | Harder to train a standard FCN to low loss, because the relative sizes between activations are lost | |
Where
- \(i=\) layer number
- \(\epsilon={10}^{-5}\)
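A NumPy sketch of the two normalizations from the table, for pre-activations `w` of shape (batch, features); `gamma` and `beta` are the learnable scale and shift of batch norm:

```python
import numpy as np

def batch_norm(w, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then rescale and shift."""
    mean = w.mean(axis=0)                      # E_b[w_i], per feature
    var = w.var(axis=0)                        # sigma^2_b(w_i), per feature
    w_hat = (w - mean) / np.sqrt(var + eps)
    return gamma * w_hat + beta

def layer_norm(w, eps=1e-5):
    """Normalize each input (row) over all of its features."""
    mean = w.mean(axis=1, keepdims=True)
    var = w.var(axis=1, keepdims=True)
    return (w - mean) / np.sqrt(var + eps)
```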
Soln¶
- Training: Compute running average of mean \(\hat \mu_{i+1}\) & variance \(\hat \sigma^2_{i+1}\) for all features at each layer
- Inference: Normalize by these quantities
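A sketch of this fix as an exponential moving average of the batch statistics; the momentum value is an illustrative choice:

```python
import numpy as np

class RunningStats:
    """Track running mean/variance during training; normalize with them at inference."""

    def __init__(self, n_features, momentum=0.9):
        self.mean = np.zeros(n_features)
        self.var = np.ones(n_features)
        self.momentum = momentum

    def update(self, batch):
        """Training: blend the current batch's statistics into the running estimates."""
        self.mean = self.momentum * self.mean + (1 - self.momentum) * batch.mean(axis=0)
        self.var = self.momentum * self.var + (1 - self.momentum) * batch.var(axis=0)

    def normalize(self, x, eps=1e-5):
        """Inference: normalize by the running estimates instead of batch statistics."""
        return (x - self.mean) / np.sqrt(self.var + eps)
```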
Stopping Criteria¶
Use an "or" combination of the following
- \(J(\theta) \le\) Cost threshold
- \(\vert J(\theta)_e - J(\theta)_{e-1} \vert \le\) Convergence threshold
    - where \(e=\) epoch
    - possibly on a moving average of the previous 5 epochs?
- Evaluation metric \(\le\) Evaluation threshold
    - This may be different from the cost function
    - e.g., MSE for the cost; MAPE for evaluation
- \(n_{\text{iter}} \ge\) Iteration threshold
- Time taken \(\ge\) Duration threshold
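A sketch of combining these criteria with `or`; all thresholds are illustrative, `history` is the cost per epoch, and `metric` is the separate evaluation metric:

```python
import time

def should_stop(history, metric, start_time,
                cost_thresh=1e-3, conv_thresh=1e-6, metric_thresh=0.05,
                max_epochs=500, max_seconds=3600):
    """Return True as soon as any stopping criterion is met."""
    if history[-1] <= cost_thresh:
        return True
    if len(history) > 1 and abs(history[-1] - history[-2]) <= conv_thresh:
        return True
    if metric <= metric_thresh:                 # e.g. MAPE on a validation set
        return True
    if len(history) >= max_epochs:
        return True
    if time.time() - start_time >= max_seconds:
        return True
    return False
```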
IDK¶
For each epoch, you can subsample the training set and then create batches
- Cheaper epochs
- More stochastic
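A sketch of per-epoch subsampling followed by batching; the subset fraction is an illustrative choice:

```python
import numpy as np

def epoch_batches(n, batch_size=32, subsample_frac=0.5, seed=0):
    """Yield mini-batches of indices for one epoch, drawn from a fresh random subset."""
    rng = np.random.default_rng(seed)
    subset = rng.choice(n, size=int(n * subsample_frac), replace=False)
    rng.shuffle(subset)
    for start in range(0, len(subset), batch_size):
        yield subset[start:start + batch_size]
```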
Hyperparameters¶
- Batch size
- Initialization
- Optimizer algorithm
- Learning rate
- No of epochs