
Optimization Algorithms

| Optimizer | Meaning | Comment | Gradient-Free | Weight Update Rule \(w_{t+1}\) | Advantages | Disadvantages |
| --- | --- | --- | --- | --- | --- | --- |
| L-BFGS | Newton’s method | Not commonly used | ❌ | \(w_{t+1} = w_t - \eta H^{-1} g(w_t)\) | Full step optimizes quadratic functions in one step | Can’t efficiently solve for the Newton step, even using automatic differentiation<br>For non-convex optimization, it is unclear whether we even want to use the Newton direction |
| Nelder-Mead | Simplex method | | ✅ | | | |
| Gradient Descent | | Generalizes better than Adam | ❌ | \(w_t - \eta g(w_t)\) | | |
| GD + Momentum | “Averaging” of step directions | “Global” structure similar to BFGS (see the sketch after the table) | ❌ | \(w_t - \eta u_{t}\)<br>\(u_{t+1} = \beta u_t + (1-\beta) g(w_t); \; u_0 = 0\)<br>\(\beta \in [0, 1):\) momentum averaging parameter | Smooths out steps | Can introduce oscillation/non-descent behavior |
| GD + Unbiased Momentum | | Dividing by \(1-\beta^{t+1}\) unbiases the update | ❌ | \(w_t - \dfrac{\eta u_{t}}{1 - \beta^{t+1}}\) | Ensures updates have equal expected magnitude across all iterations | Sometimes you want the initial steps to be smaller than the later ones |
| GD + Nesterov Momentum | | | ❌ | \(w_t - \eta u_{t\textcolor{hotpink}{-1}}\)<br>\(u_{t+1} = \beta u_t + (1-\beta) g(w_t \textcolor{hotpink}{- \eta u_t}); \; u_0 = 0\) | | |
| AdaDelta | | | ❌ | \(w_t + v_{t+1}\)<br>\(v_{t+1} = \rho v_t - \eta g(w_t)\)<br>or<br>\(v_{t+1} = \rho v_t - \eta g(w_t + \rho v_t)\) | | |
| AdaGrad | Decreases the step size for each parameter, based on how much progress that parameter has made | Can only decrease the moment | ❌ | \(w_{i, t+1} = w_{i, t} - \dfrac{\eta}{\epsilon + \sqrt{v_{i, t+1}}}\, g(w_{i, t})\)<br>\(v_{i, t+1} = v_{i, t} + g(w_{i, t})^2\)<br>\(\epsilon > 0\) | | |
| RMSProp | Keeps a memory of previous gradients | Can increase/decrease the moment | ❌ | \(w_{t+1} = w_{t} - \dfrac{\eta}{\epsilon + \sqrt{v_{t+1}}}\, g(w_{t})\)<br>\(v_{t+1} = \beta v_{t} + (1-\beta) g(w_t)^2\)<br>\(\epsilon > 0, \beta \in [0, 1]\) | | |
| Adam (Adaptive Moment Estimation) | Scales the updates for each parameter differently | See the sketch after the table | ❌ | \(w_{t+1} = w_{t} - \dfrac{\eta}{\epsilon + \sqrt{\hat v_{t+1}}}\, \hat m_{t+1}\)<br>\(m_{t+1} = \beta_1 m_t + (1-\beta_1) g(w_t)\)<br>\(v_{t+1} = \beta_2 v_t + (1-\beta_2) g(w_t)^2\)<br>\(\hat m_{t+1} = \dfrac{m_{t+1}}{1-\beta_1^{t+1}}\)<br>\(\hat v_{t+1} = \dfrac{v_{t+1}}{1-\beta_2^{t+1}}\)<br>\(\epsilon > 0; \; \beta_1, \beta_2 \in [0, 1]\) | | |
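The momentum rows above differ only in where the gradient is evaluated and whether the running average \(u_t\) is bias-corrected. Below is a minimal NumPy sketch under those update rules; the function and parameter names are illustrative, not taken from the notes, and with `beta=0` it reduces to plain gradient descent.

```python
import numpy as np

def gd_with_momentum(grad, w0, eta=0.01, beta=0.9, steps=100,
                     unbiased=False, nesterov=False):
    """Gradient descent with plain, unbiased, or Nesterov momentum.

    grad : callable returning g(w); hyperparameter defaults are illustrative.
    """
    w = np.asarray(w0, dtype=float)
    u = np.zeros_like(w)                                     # u_0 = 0
    for t in range(steps):
        # Nesterov evaluates the gradient at the look-ahead point w - eta*u
        g = grad(w - eta * u) if nesterov else grad(w)
        u = beta * u + (1 - beta) * g                        # momentum averaging
        step = u / (1 - beta ** (t + 1)) if unbiased else u  # optional unbiasing
        w = w - eta * step
    return w

# Example: minimize f(w) = 0.5 * ||w||^2, whose gradient is simply w
w_opt = gd_with_momentum(lambda w: w, w0=[5.0, -3.0], steps=200)
```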

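The adaptive methods differ in how the per-parameter second moment \(v\) is accumulated and whether a bias-corrected first moment is also kept. A sketch under the same assumptions, using commonly quoted hyperparameter defaults rather than anything prescribed above:

```python
import numpy as np

def adagrad(grad, w0, eta=0.01, eps=1e-8, steps=1000):
    """AdaGrad: v only accumulates, so the per-parameter step size only shrinks."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        v = v + g ** 2                        # v_{t+1} = v_t + g(w_t)^2
        w = w - eta * g / (eps + np.sqrt(v))
    return w

def rmsprop(grad, w0, eta=0.01, beta=0.9, eps=1e-8, steps=1000):
    """RMSProp: exponential average of g^2, so the moment can rise or fall."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        v = beta * v + (1 - beta) * g ** 2    # v_{t+1}
        w = w - eta * g / (eps + np.sqrt(v))
    return w

def adam(grad, w0, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Adam: bias-corrected first and second moments scale each parameter."""
    w = np.asarray(w0, dtype=float)
    m, v = np.zeros_like(w), np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g              # first moment m_{t+1}
        v = beta2 * v + (1 - beta2) * g ** 2         # second moment v_{t+1}
        m_hat = m / (1 - beta1 ** t)                 # unbias first moment
        v_hat = v / (1 - beta2 ** t)                 # unbias second moment
        w = w - eta * m_hat / (eps + np.sqrt(v_hat))
    return w
```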
Newton’s Method

Newton’s method integrates more “global” structure into the optimization by scaling gradients according to the inverse of the Hessian. It is equivalent to approximating the function as a quadratic using its second-order Taylor expansion, then solving for the optimal solution in closed form.

  • \(\eta=1:\) Full Newton step
  • Otherwise: Damped Newton method
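To make the quadratic-model view concrete, here is a minimal sketch of a single (damped) Newton update, assuming caller-supplied gradient and Hessian functions (illustrative names); the linear system is solved rather than inverting \(H\) explicitly.

```python
import numpy as np

def newton_step(w, grad, hess, eta=1.0):
    """One (damped) Newton update: w - eta * H^{-1} g.

    eta = 1 is the full step; eta < 1 gives the damped Newton method.
    """
    g = grad(w)
    H = hess(w)
    return w - eta * np.linalg.solve(H, g)   # avoid forming H^{-1}

# On a quadratic f(w) = 0.5 w^T A w - b^T w, a full step lands on the optimum
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
w_next = newton_step(np.zeros(2), grad=lambda w: A @ w - b, hess=lambda w: A)
# w_next solves A w = b exactly, matching the "one step on quadratics" claim
```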