Optimization Algorithms¶
Need not be just gradient-based
- The algorithms from the AI course can also be used
- Even brute force is fine for small datasets; since it evaluates every candidate exhaustively rather than relying on approximations or gradients, it is robust to noisy data
Gradient-based algorithms are preferred for large datasets
Optimizer | Meaning | Comment | Gradient-Free | Weight Update Rule \(w_{t+1}\) | Advantages | Disadvantages |
---|---|---|---|---|---|---|
L-BFGS | Quasi-Newton method | Not commonly used | ❌ | \(w_{t+1} = w_t - \eta H^{-1} g(w_t)\) | Full step optimizes quadratic functions in one step<br>No hyperparameters (learning rate) | Computationally expensive: the Newton step cannot be solved efficiently, even with automatic differentiation<br>For non-convex optimization, it is unclear whether the Newton direction is even desirable<br>Does not work well for mini-batch training |
Nelder-Mead | Simplex method | | ✅ | | | |
Gradient Descent | | Generalizes better than Adam | ❌ | \(w_t - \eta g(w_t)\) | | |
GD + Momentum | "Averaging" of step directions; "global" structure similar to BFGS | | ❌ | \(w_t - \eta u_{t+1}\)<br>\(u_{t+1} = \beta u_t + (1-\beta) g(w_t); \ u_0 = 0\)<br>\(\beta \in [0, 1):\) momentum averaging parameter | Smooths out steps | Can introduce oscillation/non-descent behavior |
GD + Unbiased Momentum | | | ❌ | \(w_t - \dfrac{\eta u_{t+1}}{1 - \beta^{t+1}}\)<br>Dividing by \(1-\beta^{t+1}\) unbiases the update | Ensures updates have equal expected magnitude across all iterations | Sometimes you want the initial steps to be smaller than the later ones |
NAG (Nesterov Accelerated Gradient)<br>GD + Nesterov Momentum | Lookahead gradient from the momentum step | | ❌ | \(w_t - \eta u_{t+1}\)<br>\(u_{t+1} = \beta u_t + (1-\beta) g(w_t \textcolor{hotpink}{- \eta u_t}); \ u_0 = 0\) | | |
AdaDelta | | | ❌ | \(w_t + v_{t+1}\)<br>\(v_{t+1} = \rho v_t - \eta g(w_t)\) or \(v_{t+1} = \rho v_t - \eta g(w_t + \rho v_t)\) | | |
AdaGrad | Scales down the step size of each parameter based on how much that parameter has already been updated<br>Can only decrease the effective learning rate | | ❌ | \(w_{i, t+1} = w_{i, t} - \dfrac{\eta}{\epsilon + \sqrt{v_{i, t+1}}} g(w_{i, t})\)<br>\(v_{i, t+1} = v_{i, t} + g(w_{i, t})^2\)<br>\(\epsilon > 0\) | | Effective learning rate tends to zero after a long time |
RMSProp | Keeps an exponentially-decaying memory of previous squared gradients<br>Effective learning rate can increase or decrease | | ❌ | \(w_{t+1} = w_{t} - \dfrac{\eta}{\epsilon + \sqrt{v_{t+1}}} g(w_{t})\)<br>\(v_{t+1} = \beta v_{t} + (1-\beta) g(w_t)^2\)<br>\(\epsilon > 0, \beta \in [0, 1]\) | | |
Adam (Adaptive Moment Estimation) | Combines momentum with RMSProp-style adaptive scaling and bias correction<br>Scales the updates for each parameter differently | | ❌ | \(w_{t+1} = w_{t} - \dfrac{\eta}{\epsilon + \sqrt{\hat v_{t+1}}} \hat m_{t+1}\)<br>\(\hat m_{t+1} = \dfrac{m_{t+1}}{1-{\beta_1}^{t+1}}; \ m_{t+1} = \beta_1 m_t + (1-\beta_1) g(w_t)\)<br>\(\hat v_{t+1} = \dfrac{v_{t+1}}{1-{\beta_2}^{t+1}}; \ v_{t+1} = \beta_2 v_t + (1-\beta_2) g(w_t)^2\)<br>\(\epsilon > 0; \beta_1, \beta_2 \in [0, 1]\) | | |
AdamW | Adam with decoupled weight decay rather than L2 regularization | AdamW seems to consistently outperform Adam in terms of both the error achieved and the training time | ❌ | | | |
Rule of thumb: recommended learning rate \(\eta = 3 \times 10^{-4}\)
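To make the table's update rules concrete, here is a minimal NumPy sketch of the momentum and Adam steps; the function names, default hyperparameters, and statefulness convention are illustrative rather than taken from any library:

```python
import numpy as np

def momentum_step(w, u, g, eta=3e-4, beta=0.9):
    """One GD + momentum update: u <- beta*u + (1-beta)*g, then w <- w - eta*u."""
    u = beta * u + (1 - beta) * g
    return w - eta * u, u

def adam_step(w, m, v, g, t, eta=3e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with bias-corrected first (m) and second (v) moments; t is the 0-based step index."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** (t + 1))  # bias correction, as in the table
    v_hat = v / (1 - beta2 ** (t + 1))
    return w - eta * m_hat / (eps + np.sqrt(v_hat)), m, v
```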
Newton’s Method¶
Integrates more "global" structure into the optimization by scaling gradients according to the inverse of the Hessian: \(w_{t+1} = w_t - \eta H^{-1} g(w_t)\). Equivalent to approximating the function as a quadratic via a second-order Taylor expansion and then jumping to the optimum of that quadratic.

- Full step: \(\eta = 1\)
- Otherwise (\(\eta < 1\)): damped Newton method
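A minimal sketch of a (damped) Newton step on a toy quadratic, assuming the gradient and Hessian are available in closed form (the matrix `A` and vector `b` are made up for illustration):

```python
import numpy as np

def newton_step(w, grad, hess, eta=1.0):
    """One Newton update w - eta * H^{-1} g; eta = 1 is the full step, eta < 1 the damped variant."""
    return w - eta * np.linalg.solve(hess(w), grad(w))  # solve H d = g instead of inverting H

# Toy quadratic f(w) = 0.5 w^T A w - b^T w, whose minimizer satisfies A w = b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
w = newton_step(np.zeros(2), grad=lambda w: A @ w - b, hess=lambda w: A)
print(np.allclose(A @ w, b))  # True: the full step minimizes a quadratic in one iteration
```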
Brute-Force Regression¶
Imports¶
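The imports cell is not shown; judging from the `np`, `pd`, and `plt` usage in the cells below, it is presumably:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```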
Generate Synthetic Data¶
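The data-generation cell is not shown either. The grid search below expects `x`, `y` and the `(start, stop, step)` tuples `m_hat`, `c_hat`; since the brute-force optimum comes out at \(m = 2, c = 3\) with zero cost on a 0.1-spaced grid, a plausible reconstruction is a noiseless line (the exact ranges are assumptions):

```python
# Noiseless line y = 2x + 3, consistent with the zero cost at m = 2, c = 3 in the results below
x = np.linspace(-5, 5, 50)
y = 2 * x + 3

# Candidate (start, stop, step) ranges for the slope m and intercept c
m_hat = (0, 4, 0.1)
c_hat = (1, 5, 0.1)
```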
# Candidate values for the slope and intercept, from start to stop (inclusive) in the given step
m_hat_range = np.arange(m_hat[0], m_hat[1]+m_hat[2], m_hat[2])
c_hat_range = np.arange(c_hat[0], c_hat[1]+c_hat[2], c_hat[2])
data = (
np.column_stack(
(
x,
y
),
)
)
# Grid of every candidate (m, c) combination
m, c = (
np.meshgrid(
m_hat_range,
c_hat_range
)
)
# Flatten the grid into a (num_candidates, 2) array of (m, c) pairs
params = (
np.column_stack(
(
m.flatten(),
c.flatten()
),
)
)
# Long-format dataframe: one row per (observation, candidate) pair
df = pd.DataFrame(
data = np.column_stack(
(
np.tile(data, (params.shape[0], 1)),
np.repeat(params, data.shape[0], axis = 0)  # repeat (not tile) so each (m, c) is paired with every (x, y)
)
),
columns = ["x", "y", "m", "c"]
)
Forward Pass¶
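The forward-pass cell is not shown; given that a `loss` column is averaged into `cost` below and the reported costs match squared errors, it presumably computes the prediction of each candidate line and its per-row squared error:

```python
# Prediction of each candidate line at each x, and the per-row squared error
df["y_pred"] = df["m"] * df["x"] + df["c"]
df["loss"] = (df["y"] - df["y_pred"]) ** 2
```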
Backward Pass¶
Results¶
# Cost of each candidate line = mean loss over all observations
results = (
df
.groupby(["m", "c"])
["loss"]
.mean()
.reset_index()
.rename(columns = {
"loss": "cost"
})
)
Loss Landscape¶
fig, ax = plt.subplots()

# Filled contours of the cost surface; the 0.1 power compresses the range so the minimum stands out
cs = ax.tricontourf(
    results["m"],
    results["c"],
    results["cost"].pow(0.1),
    levels = 1_000,
    cmap = "Reds_r"
)
fig.colorbar(cs)
plt.show()
Optimal Parameters¶
m | c | cost |
---|---|---|
2.0 | 3.0 | 0.00 |
2.0 | 2.9 | 0.01 |
2.0 | 3.1 | 0.01 |
2.0 | 2.8 | 0.04 |
2.0 | 3.2 | 0.04 |
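The table above can presumably be reproduced from `results` with something like:

```python
results.nsmallest(5, "cost").round(2)
```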
Gradient Descent¶
Conceptually similar to trial and error, except each "trial" is guided by the gradient
- Start with some initial \(\theta\) vector
- Keep updating \(\theta_0, \theta_1, \dots, \theta_n\) simultaneously using the derivative of the cost function, until the minimum of \(J(\theta)\) is reached
\[ \theta_{\text{new}} = \theta_{\text{prev}} - \eta \ {\nabla J} \]
Meaning | |
---|---|
\(\theta_{\text{new}}\) | Coefficients obtained from current iteration (Output of current iteration) |
\(\theta_{\text{prev}}\) | Coefficients obtained from previous iteration (Output of previous iteration) |
\(\eta\) | Learning Rate |
\(\nabla J\) | Gradient vector of \(J (\theta)\) |
Gradients of the Loss Function¶
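The gradient expressions are not written out here; assuming the usual squared-error cost with hypothesis \(h_\theta(x) = \theta^T x\), as in the linear-regression example above, they are:

\[
J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
\qquad
\frac{\partial J}{\partial \theta_j} = \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
\]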
Learning Rate \(\eta\)¶
\(0 < \eta < 1\)
- Too large a value may overshoot the minimum, causing oscillation or divergence
- Too small a value makes convergence slow (more iterations and training time)
Rule of thumb: recommended learning rate \(\eta = 3 \times 10^{-4}\)
Can be

- constant
- decay (a few common schedules, sketched in code below)
    - Step: \(\eta_t = \eta_{t-1}/2\)
        - Decay by half every few epochs
    - Exponential: \(\eta_t = \eta_0 e^{-kt}\)
    - 1/t: \(\eta_t = \eta_0/(1+kt)\)
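A minimal sketch of these schedules as plain Python functions; the names and the halving interval are illustrative:

```python
import numpy as np

def step_decay(eta_prev, epoch, every=10):
    """Halve the learning rate every `every` epochs, otherwise keep it unchanged."""
    return eta_prev / 2 if epoch > 0 and epoch % every == 0 else eta_prev

def exponential_decay(eta0, k, t):
    """eta_t = eta_0 * exp(-k t)"""
    return eta0 * np.exp(-k * t)

def one_over_t_decay(eta0, k, t):
    """eta_t = eta_0 / (1 + k t)"""
    return eta0 / (1 + k * t)
```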
Iterative vs Normal Equation¶
 | Iterative | Normal Equation |
---|---|---|
Learning rate \(\eta\) not required | ❌ | ✅ |
Feature scaling not required | ❌ | ✅ |
Time Complexity | \(O(kn^2)\) | \(O(n^3)\) |
Performance | Fast even for large \(n\) | Slow if \(n > 10^4\) |
Compatibility | Works for all algorithms | Doesn't work for classification (least-squares regression only) |
No. of features | Works for any number of features | Doesn't work when \(X^T X\) is non-invertible |
Stop criteria | Must be chosen (tolerance / max iterations) | Not needed (closed-form solution) |
Convergence | Slow (iterative) | Immediate (single solve) |
Global optimum guaranteed | ❌ | ✅ |
Loss Function | Should be differentiable | Squared error only |

(\(n\) = number of features, \(k\) = number of iterations)
Gradient-based methods find the minimum of a function by repeatedly moving in the direction in which the function decreases most steeply.
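To make the comparison concrete, here is a small sketch fitting the same synthetic line with both approaches, using the `x` and `y` from the synthetic-data sketch above (the learning rate and iteration count are illustrative):

```python
import numpy as np

# Design matrix with a bias column, so theta = [c, m]
X = np.column_stack((np.ones_like(x), x))

# Normal equation: solve (X^T X) theta = X^T y in one shot
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on the mean squared error
theta = np.zeros(2)
eta = 0.01
for _ in range(5_000):
    grad = X.T @ (X @ theta - y) / len(y)  # gradient of J(theta) = (1/2n) ||X theta - y||^2
    theta -= eta * grad

print(theta_normal)  # ~[3, 2] for the noiseless line above
print(theta)         # converges to the same values
```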