Optimization Algorithms¶
Need not be just gradient-based
- The algorithms from AI course can also be used
- Even brute force is fine for small datasets: since it evaluates every candidate exhaustively and uses no gradients or approximations, it is robust to noisy data
Gradient-based algorithms are preferred for large datasets
Optimizer | Meaning | Comment | Gradient-Free | Weight Update Rule \(w_{t+1}\) | Advantages | Disadvantages |
---|---|---|---|---|---|---|
L-BFGS | Quasi-Newton method: approximates the Newton step \(H^{-1} g\) | Not commonly used in deep learning | ❌ | \(w_{t+1} = w_t - \eta H^{-1} g(w_t)\) | Full step (\(\eta = 1\)) optimizes a quadratic function in one step<br>No learning-rate hyperparameter | Computationally expensive: the Newton step cannot be solved efficiently, even with automatic differentiation<br>For non-convex optimization, it is unclear whether the Newton direction is even desirable<br>Does not work well with mini-batch training |
Nelder-Mead | Simplex method | | ✅ | | | |
Gradient Descent | Step in the direction of steepest descent | Generalizes better than Adam | ❌ | \(w_t - \eta g(w_t)\) | | |
GD + Momentum | “Averaging” of step directions; incorporates “global” structure, similar in spirit to BFGS | | ❌ | \(w_t - \eta u_{t+1}\)<br>\(u_{t+1} = \beta u_t + (1-\beta) g(w_t); \; u_0 = 0\)<br>\(\beta \in [0, 1):\) momentum averaging parameter | Smoothens out steps | Can introduce oscillation/non-descent behavior |
GD + Unbiased Momentum | Momentum with bias correction | Dividing by \(1-\beta^{t+1}\) unbiases the update | ❌ | \(w_t - \dfrac{\eta u_{t+1}}{1 - \beta^{t+1}}\) | Ensures updates have equal expected magnitude across all iterations | Sometimes you want the initial steps to be smaller than the later steps |
NAG (Nesterov Accelerated Gradient)<br>GD + Nesterov Momentum | Lookahead gradient from the momentum step | | ❌ | \(w_t - \eta u_{t+1}\)<br>\(u_{t+1} = \beta u_t + (1-\beta) g(w_t \textcolor{hotpink}{- \eta u_t}); \; u_0 = 0\) | | |
AdaDelta | | Note: the rule shown is the classical heavy-ball (and Nesterov) form of momentum rather than AdaDelta's RMS-based rule | ❌ | \(w_t + v_{t+1}\)<br>\(v_{t+1} = \rho v_t - \eta g(w_t)\) or \(v_{t+1} = \rho v_t - \eta g(w_t + \rho v_t)\) | | |
AdaGrad | Shrinks the step size for each parameter based on how much progress that parameter has already made<br>The per-parameter step size can only decrease | | ❌ | \(w_{i, t+1} = w_{i, t} - \dfrac{\eta}{\epsilon + \sqrt{v_{i, t+1}}} g(w_{i, t})\)<br>\(v_{i, t+1} = v_{i, t} + g(w_{i, t})^2\)<br>\(\epsilon > 0\) | | Effective learning rate tends to zero after a long time |
RMSProp | Keeps an exponentially decaying memory of previous squared gradients<br>The per-parameter step size can increase or decrease | | ❌ | \(w_{t+1} = w_{t} - \dfrac{\eta}{\epsilon + \sqrt{v_{t+1}}} g(w_{t})\)<br>\(v_{t+1} = \beta v_{t} + (1-\beta) g(w_t)^2\)<br>\(\epsilon > 0, \beta \in [0, 1]\) | | |
Adam (Adaptive Moment Estimation) | Essentially RMSProp with bias-corrected momentum<br>Scales the update for each parameter differently | | ❌ | \(w_{t+1} = w_{t} - \dfrac{\eta}{\epsilon + \sqrt{\hat v_{t+1}}} \hat m_{t+1}\)<br>\(\hat m_{t+1} = \dfrac{m_{t+1}}{1-{\beta_1}^{t+1}}\)<br>\(m_{t+1} = \beta_1 m_t + (1-\beta_1) g(w_t)\)<br>\(\hat v_{t+1} = \dfrac{v_{t+1}}{1-{\beta_2}^{t+1}}\)<br>\(v_{t+1} = \beta_2 v_t + (1-\beta_2) g(w_t)^2\)<br>\(\epsilon > 0; \beta_1, \beta_2 \in [0, 1]\) | | |
AdamW | Adam with decoupled weight decay instead of L2 regularization | Seems to consistently outperform Adam in both the error achieved and the training time | ❌ | | | |
Rule of thumb: recommended learning rate \(\eta = 3 \times 10^{-4}\)
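A minimal NumPy sketch of the Adam update from the table, as a sanity check; the function name, the toy quadratic objective, and the default hyperparameters are illustrative assumptions rather than anything from the notes:

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=3e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update, following the table's formulas (t is 0-indexed)."""
    m = beta1 * m + (1 - beta1) * g          # first-moment estimate
    v = beta2 * v + (1 - beta2) * g**2       # second-moment estimate
    m_hat = m / (1 - beta1 ** (t + 1))       # bias correction
    v_hat = v / (1 - beta2 ** (t + 1))
    w = w - lr * m_hat / (eps + np.sqrt(v_hat))
    return w, m, v

# Toy usage on f(w) = ||w||^2 (gradient = 2w) -- purely illustrative
w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1000):
    g = 2 * w
    w, m, v = adam_step(w, g, m, v, t)
```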
Newton’s Method¶
Integrates more “global” structure into the optimization by scaling the gradient with the inverse of the Hessian: \(w_{t+1} = w_t - \eta H^{-1} g(w_t)\). This is equivalent to approximating the function as a quadratic via a second-order Taylor expansion and then solving for its optimum.
- \(\eta=1:\) Full step
- Otherwise (\(\eta < 1\)): damped Newton method
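A minimal sketch of a (damped) Newton step in NumPy, assuming we can evaluate the gradient and Hessian directly; the quadratic test function and the function names are illustrative assumptions:

```python
import numpy as np

def newton_step(w, grad, hess, eta=1.0):
    """One (damped) Newton update: w - eta * H^{-1} g, computed via a linear solve."""
    g = grad(w)
    H = hess(w)
    return w - eta * np.linalg.solve(H, g)

# Illustrative quadratic f(w) = 0.5 w^T A w - b^T w, minimized at A^{-1} b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda w: A @ w - b
hess = lambda w: A

w = np.zeros(2)
w = newton_step(w, grad, hess, eta=1.0)  # full step: exact minimum in one step for a quadratic
assert np.allclose(A @ w, b)
```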
Brute-Force Regression¶
Imports¶
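The import cell is not reproduced here; the later cells rely on the usual aliases, so something like the following is assumed:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```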
Generate Synthetic Data¶
\(y = mx + c\)
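The data-generation cell is not shown. A minimal sketch consistent with the rest of the section: the true parameters \(m = 2, c = 3\) match the best row of the results table below, while the (small) noise level and the `(start, stop, step)` search ranges `m_hat` / `c_hat` are illustrative assumptions:

```python
rng = np.random.default_rng(0)

# Ground-truth line y = 2x + 3 with a little Gaussian noise
x = np.linspace(-5, 5, 50)
y = 2 * x + 3 + rng.normal(scale=0.01, size=x.shape)

# Brute-force search grids for m and c, each as (start, stop, step)
m_hat = (0.0, 4.0, 0.1)
c_hat = (0.0, 6.0, 0.1)
```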
m_hat_range = np.arange(m_hat[0], m_hat[1]+m_hat[2], m_hat[2])
c_hat_range = np.arange(c_hat[0], c_hat[1]+c_hat[2], c_hat[2])
data = (
np.column_stack(
(
x,
y
),
)
)
m, c = (
np.meshgrid(
m_hat_range,
c_hat_range
)
)
params = (
np.column_stack(
(
m.flatten(),
c.flatten()
),
)
)
df = pd.DataFrame(
    data = np.column_stack(
        (
            # repeat each (x, y) pair once per candidate (m, c) so that every
            # parameter combination sees the full dataset (tiling both arrays
            # would mispair rows instead of forming the full cross join)
            np.repeat(data, params.shape[0], axis=0),
            np.tile(params, (data.shape[0], 1))
        )
    ),
    columns = ["x", "y", "m", "c"]
)
Forward Pass¶
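The forward-pass cell is not reproduced; under the model \(y = mx + c\) it would compute a prediction for every `(x, m, c)` row, along the lines of (the column name `y_pred` is an assumption):

```python
df["y_pred"] = df["m"] * df["x"] + df["c"]
```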
Backward Pass¶
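Likewise, the "backward pass" here only needs the per-row loss that the results cell averages; assuming squared error:

```python
df["loss"] = (df["y"] - df["y_pred"]) ** 2
```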
Results¶
results = (
df
.groupby(["m", "c"])
["loss"]
.mean()
.reset_index()
.rename(columns = {
"loss": "cost"
})
)
Loss Landscape¶
fig, ax = plt.subplots()
cs = ax.tricontourf(
    results["m"],
    results["c"],
    results["cost"].pow(0.1),  # 10th root compresses the range so the minimum basin stays visible
    levels = 1_000,
    cmap = "Reds_r"
)
fig.colorbar(cs)
ax.set_xlabel("m")
ax.set_ylabel("c")
plt.show()
Optimal Parameters¶
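The cell producing the table below is not shown; presumably something like:

```python
results.nsmallest(5, "cost").round(2)
```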
m | c | cost |
---|---|---|
2.0 | 3.0 | 0.00 |
2.0 | 2.9 | 0.01 |
2.0 | 3.1 | 0.01 |
2.0 | 2.8 | 0.04 |
2.0 | 3.2 | 0.04 |
Gradient Descent¶
Similar to trial and error
- Start with some \(\theta\) vector
- Keep changing \(\theta_0, \theta_1, \dots, \theta_n\) using the derivative of the cost function, until the minimum of \(J(\theta)\) is reached
  - Update all parameters simultaneously
\[ \theta_{\text{new}} = \theta_{\text{prev}} - \eta \ {\nabla J} \]
Term | Meaning |
---|---|
\(\theta_{\text{new}}\) | Coefficients from the current iteration (output of the current iteration) |
\(\theta_{\text{prev}}\) | Coefficients from the previous iteration (output of the previous iteration) |
\(\eta\) | Learning Rate |
\(\nabla J\) | Gradient vector of \(J (\theta)\) |
Gradients of the Loss Function¶
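The notes do not spell the gradients out. Assuming the mean-squared-error cost implicitly used in the brute-force section, \(J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \big( h_\theta(x^{(i)}) - y^{(i)} \big)^2\), each partial derivative is

\[ \frac{\partial J}{\partial \theta_j} = \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \]

For the single-feature line \(y = mx + c\) this gives \(\frac{\partial J}{\partial m} = \frac{1}{n}\sum (\hat y - y)\, x\) and \(\frac{\partial J}{\partial c} = \frac{1}{n}\sum (\hat y - y)\). A minimal sketch using these gradients on the synthetic `x`, `y` assumed earlier (the learning rate and iteration count are illustrative):

```python
m_gd, c_gd = 0.0, 0.0   # initial guess
eta = 0.05              # learning rate (illustrative)

for _ in range(2_000):
    error = (m_gd * x + c_gd) - y      # residuals of the current fit
    m_gd -= eta * (error * x).mean()   # dJ/dm
    c_gd -= eta * error.mean()         # dJ/dc

print(m_gd, c_gd)  # approaches the brute-force optimum (about 2 and 3)
```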
Learning Rate \(\eta\)¶
\(0 < \eta < 1\)
- Large value may overshoot the minimum and fail to converge (oscillation/divergence)
- Small value leads to slow convergence (more iterations and training time)

Rule of thumb: recommended learning rate \(\eta = 3 \times 10^{-4}\)
Can be
- constant
- decay (see the sketch below)
  - Step: \(\alpha_t = \alpha_{t-a}/2\) (halve every \(a\) epochs)
  - Exponential: \(\alpha_t = \alpha_0 e^{-kt}\)
  - 1/t: \(\alpha_t = \alpha_0/(1+kt)\)
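A minimal sketch of the three decay schedules; the function names and default constants (`a`, `k`) are illustrative assumptions:

```python
import numpy as np

def step_decay(alpha0, t, a=10):
    """Halve the learning rate every `a` epochs."""
    return alpha0 * 0.5 ** (t // a)

def exponential_decay(alpha0, t, k=0.1):
    """alpha_t = alpha_0 * exp(-k * t)"""
    return alpha0 * np.exp(-k * t)

def one_over_t_decay(alpha0, t, k=0.1):
    """alpha_t = alpha_0 / (1 + k * t)"""
    return alpha0 / (1 + k * t)

# Example: learning rate at a few epochs for each schedule
for t in (0, 10, 20, 30):
    print(t, step_decay(3e-4, t), exponential_decay(3e-4, t), one_over_t_decay(3e-4, t))
```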
Iterative vs Normal Equation¶
Criterion | Iterative | Normal Equation |
---|---|---|
\(\alpha\) not required | ❌ | ✅ |
Feature scaling not required | ❌ | ✅ |
Time Complexity | \(O(kn^2)\) | \(O(n^3)\) |
Performance | Fast even for large \(n\) | Slow if \(n > 10^4\) |
Compatibility | Works for all algorithms | Doesn’t work for classification |
Number of features | Works for any number of features | Doesn't work when \(X^TX\) is non-invertible (e.g. redundant features) |
Stopping criterion | No natural stopping criterion (must choose a tolerance or iteration budget) | Not applicable (closed-form) |
Convergence | Slow | Exact solution in one solve |
Global Optimal guaranteed | ❌ | ✅ |
Loss Function | Should be differentiable (twice-differentiable for Newton-type methods) | Must be squared error (least squares) |
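For reference, the closed-form solution the table compares against is the least-squares normal equation:

\[ \theta = (X^T X)^{-1} X^T y \]

Solving it requires factorizing or inverting the \(n \times n\) matrix \(X^T X\), which is where the \(O(n^3)\) cost in the table comes from.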
Gradient-based methods find the minimum of a function by repeatedly moving in the direction in which the function decreases most steeply, i.e. along the negative gradient.