Optimization Algorithms¶
Optimizer | Meaning | Comment | Gradient-Free | Weight Update Rule \(w_{t+1}\) | Advantages | Disadvantages |
---|---|---|---|---|---|---|
L-BFGS | Quasi-Newton method | Not commonly used | ❌ | \(w_{t+1} = w_t - \eta H^{-1} g(w_t)\) | A full step optimizes quadratic functions in a single iteration | The Newton step cannot be solved efficiently, even with automatic differentiation. For non-convex optimization, it is also unclear whether the Newton direction is even desirable |
Nelder-Mead | Simplex method | | ✅ | | | |
Gradient Descent | | Generalizes better than Adam | ❌ | \(w_t - \eta g(w_t)\) | | |
GD + Momentum | “Averaging” of step directions | Incorporates “global” structure, similar in spirit to BFGS; see the first code sketch after the table | ❌ | \(w_t - \eta u_{t+1}\) \(u_{t+1} = \beta u_t + (1-\beta) g(w_t); u_0 = 0\) \(\beta \in [0, 1):\) momentum averaging parameter | Smooths out steps | Can introduce oscillation/non-descent behavior |
GD + Unbiased Momentum | | | ❌ | \(w_t - \dfrac{\eta u_{t+1}}{1 - \beta^{t+1}}\) Dividing by \(1-\beta^{t+1}\) unbiases the update | Ensures updates have equal expected magnitude across all iterations | Sometimes the initial steps should be smaller than the later ones |
GD + Nesterov Momentum | | | ❌ | \(w_t - \eta u_{t+1}\) \(u_{t+1} = \beta u_t + (1-\beta) g(w_t \textcolor{hotpink}{- \eta u_t}); u_0 = 0\) | | |
AdaDelta | | | ❌ | \(w_t + v_{t+1}\) \(v_{t+1} = \rho v_t - \eta g(w_t)\) or \(v_{t+1} = \rho v_t - \eta g(w_t + \rho v_t)\) | | |
AdaGrad | | Decreases the step size for each parameter, based on how much progress that parameter has already made; the per-parameter scale can only decrease | ❌ | \(w_{i, t+1} = w_{i, t} - \dfrac{\eta}{\epsilon + \sqrt{v_{i, t+1}}} g(w_{i, t})\) \(v_{i, t+1} = v_{i, t} + g(w_{i, t})^2\) \(\epsilon > 0\) | | |
RMSProp | | Keeps an exponentially decaying memory of previous squared gradients; the per-parameter scale can increase or decrease | ❌ | \(w_{t+1} = w_{t} - \dfrac{\eta}{\epsilon + \sqrt{v_{t+1}}} g(w_{t})\) \(v_{t+1} = \beta v_{t} + (1-\beta) g(w_t)^2\) \(\epsilon > 0, \beta \in [0, 1]\) | | |
Adam (Adaptive Moment Estimation) | | Scales the updates for each parameter differently; see the second code sketch after the table | ❌ | \(w_{t+1} = w_{t} - \dfrac{\eta}{\epsilon + \sqrt{\hat v_{t+1}}} \hat m_{t+1}\) \(\hat m_{t+1} = \dfrac{m_{t+1}}{1-{\beta_1}^{t+1}}\) \(m_{t+1} = \beta_1 m_t + (1-\beta_1) g(w_t)\) \(\hat v_{t+1} = \dfrac{v_{t+1}}{1-{\beta_2}^{t+1}}\) \(v_{t+1} = \beta_2 v_t + (1-\beta_2) g(w_t)^2\) \(\epsilon > 0; \beta_1, \beta_2 \in [0, 1]\) | | |
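
Below is a minimal NumPy sketch of the gradient-descent family of update rules from the table (plain GD, momentum, unbiased momentum, Nesterov). The quadratic objective, the `grad` function, and the hyperparameter values are illustrative assumptions, not part of the notes.

```python
import numpy as np

# Illustrative quadratic objective f(w) = 0.5 w^T A w - b^T w (an assumed example)
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])

def grad(w):
    return A @ w - b

def gd(w, eta=0.1, steps=100):
    """Plain gradient descent: w_{t+1} = w_t - eta * g(w_t)."""
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

def gd_momentum(w, eta=0.1, beta=0.9, steps=100, nesterov=False, unbias=False):
    """Momentum variants from the table (u_0 = 0)."""
    u = np.zeros_like(w)
    for t in range(steps):
        # Nesterov evaluates the gradient at the look-ahead point w - eta * u
        g = grad(w - eta * u) if nesterov else grad(w)
        u = beta * u + (1 - beta) * g                        # u_{t+1} = beta u_t + (1 - beta) g
        step = u / (1 - beta ** (t + 1)) if unbias else u    # optional bias correction
        w = w - eta * step
    return w

w0 = np.zeros(2)
print(gd(w0), gd_momentum(w0), gd_momentum(w0, nesterov=True), gd_momentum(w0, unbias=True))
```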
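
And a second sketch of the adaptive, per-parameter scaling rules (AdaGrad, RMSProp, Adam), using the same assumed quadratic gradient; the `eps`, `beta1`, and `beta2` values are typical defaults chosen for illustration.

```python
import numpy as np

# Same illustrative quadratic objective as in the previous sketch (an assumption)
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])

def grad(w):
    return A @ w - b

def adagrad(w, eta=0.5, eps=1e-8, steps=100):
    """AdaGrad: accumulated squared gradients, so the per-parameter scale only shrinks."""
    v = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        v = v + g ** 2                          # v_{t+1} = v_t + g(w_t)^2
        w = w - eta / (eps + np.sqrt(v)) * g
    return w

def rmsprop(w, eta=0.1, beta=0.9, eps=1e-8, steps=100):
    """RMSProp: exponential moving average of squared gradients (scale can grow or shrink)."""
    v = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        v = beta * v + (1 - beta) * g ** 2
        w = w - eta / (eps + np.sqrt(v)) * g
    return w

def adam(w, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    """Adam: bias-corrected first and second moment estimates."""
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(steps):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g         # first moment (mean of gradients)
        v = beta2 * v + (1 - beta2) * g ** 2    # second moment (mean of squared gradients)
        m_hat = m / (1 - beta1 ** (t + 1))      # bias correction
        v_hat = v / (1 - beta2 ** (t + 1))
        w = w - eta / (eps + np.sqrt(v_hat)) * m_hat
    return w

w0 = np.zeros(2)
print(adagrad(w0), rmsprop(w0), adam(w0))
```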
Newton’s Method¶
Integrates more “global” structure into the optimization method by scaling gradients according to the inverse of the Hessian: \(w_{t+1} = w_t - \eta H^{-1} g(w_t)\). This is equivalent to approximating the function as a quadratic via a second-order Taylor expansion, then solving for the optimum of that approximation.
- \(\eta=1:\) Full step
- Otherwise: damped Newton method
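
A small sketch of the (damped) Newton step on the same assumed quadratic as in the sketches above; with \(\eta = 1\) a single full step reaches the minimizer, since for a quadratic the Hessian is just the constant matrix `A`.

```python
import numpy as np

# Same illustrative quadratic f(w) = 0.5 w^T A w - b^T w as in the sketches above
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])

def grad(w):
    return A @ w - b

def newton_step(w, eta=1.0):
    """One (damped) Newton step: w - eta * H^{-1} g(w).
    Solve H d = g rather than forming H^{-1} explicitly."""
    H = A                                   # Hessian of the quadratic objective
    d = np.linalg.solve(H, grad(w))
    return w - eta * d

w = np.zeros(2)
print(newton_step(w, eta=1.0))              # full step: exact minimizer in one iteration
print(np.linalg.solve(A, b))                # closed-form minimizer, for comparison
```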