Optimization Algorithms¶
Optimizer | Meaning | Comment | Gradient-Free | Weight Update Rule \(w_{t+1}\) | Advantages | Disadvantages |
---|---|---|---|---|---|---|
L-BFGS | Quasi-Newton method | Not commonly used | ❌ | \(w_{t+1} = w_t - \eta H^{-1} g(w_t)\) | A full step optimizes quadratic functions in a single iteration | The Newton step cannot be solved efficiently, even with automatic differentiation. For non-convex optimization, it is also unclear whether the Newton direction is even desirable |
Nelder-Mead | Simplex method | | ✅ | | | |
Gradient Descent | | Generalizes better than Adam | ❌ | \(w_t - \eta g(w_t)\) | | |
GD + Momentum | “Averaging” of step directions | Incorporates “global” structure, similar in spirit to BFGS; see the first code sketch after the table | ❌ | \(w_t - \eta u_{t+1}\) \(u_{t+1} = \beta u_t + (1-\beta) g(w_t); u_0 = 0\) \(\beta \in [0, 1):\) momentum averaging parameter | Smooths out steps | Can introduce oscillation/non-descent behavior |
GD + Unbiased Momentum | | | ❌ | \(w_t - \dfrac{\eta u_{t+1}}{1 - \beta^{t+1}}\) Dividing by \(1-\beta^{t+1}\) unbiases the update | Ensures updates have equal expected magnitude across all iterations | Sometimes the initial steps should be smaller than the later ones |
GD + Nesterov Momentum | | | ❌ | \(w_t - \eta u_{t+1}\) \(u_{t+1} = \beta u_t + (1-\beta) g(w_t \textcolor{hotpink}{- \eta u_t}); u_0 = 0\) | | |
AdaDelta | | | ❌ | \(w_t + v_{t+1}\) \(v_{t+1} = \rho v_t - \eta g(w_t)\) or \(v_{t+1} = \rho v_t - \eta g(w_t + \rho v_t)\) | | |
AdaGrad | | Decreases the step size for each parameter, based on how much progress that parameter has already made; the per-parameter scale can only decrease | ❌ | \(w_{i, t+1} = w_{i, t} - \dfrac{\eta}{\epsilon + \sqrt{v_{i, t+1}}} g(w_{i, t})\) \(v_{i, t+1} = v_{i, t} + g(w_{i, t})^2\) \(\epsilon > 0\) | | |
RMSProp | | Keeps an exponentially decaying memory of previous squared gradients; the per-parameter scale can increase or decrease | ❌ | \(w_{t+1} = w_{t} - \dfrac{\eta}{\epsilon + \sqrt{v_{t+1}}} g(w_{t})\) \(v_{t+1} = \beta v_{t} + (1-\beta) g(w_t)^2\) \(\epsilon > 0, \beta \in [0, 1]\) | | |
Adam (Adaptive Moment Estimation) | | Scales the updates for each parameter differently; see the second code sketch after the table | ❌ | \(w_{t+1} = w_{t} - \dfrac{\eta}{\epsilon + \sqrt{\hat v_{t+1}}} \hat m_{t+1}\) \(\hat m_{t+1} = \dfrac{m_{t+1}}{1-{\beta_1}^{t+1}}\) \(m_{t+1} = \beta_1 m_t + (1-\beta_1) g(w_t)\) \(\hat v_{t+1} = \dfrac{v_{t+1}}{1-{\beta_2}^{t+1}}\) \(v_{t+1} = \beta_2 v_t + (1-\beta_2) g(w_t)^2\) \(\epsilon > 0; \beta_1, \beta_2 \in [0, 1]\) | | |
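
Below is a minimal NumPy sketch of the gradient-descent family of update rules from the table (plain GD, momentum, unbiased momentum, Nesterov). The quadratic objective, the `grad` function, and the hyperparameter values are illustrative assumptions, not part of the notes.

```python
import numpy as np

# Illustrative quadratic objective f(w) = 0.5 w^T A w - b^T w (an assumed example)
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])

def grad(w):
    return A @ w - b

def gd(w, eta=0.1, steps=100):
    """Plain gradient descent: w_{t+1} = w_t - eta * g(w_t)."""
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

def gd_momentum(w, eta=0.1, beta=0.9, steps=100, nesterov=False, unbias=False):
    """Momentum variants from the table (u_0 = 0)."""
    u = np.zeros_like(w)
    for t in range(steps):
        # Nesterov evaluates the gradient at the look-ahead point w - eta * u
        g = grad(w - eta * u) if nesterov else grad(w)
        u = beta * u + (1 - beta) * g                        # u_{t+1} = beta u_t + (1 - beta) g
        step = u / (1 - beta ** (t + 1)) if unbias else u    # optional bias correction
        w = w - eta * step
    return w

w0 = np.zeros(2)
print(gd(w0), gd_momentum(w0), gd_momentum(w0, nesterov=True), gd_momentum(w0, unbias=True))
```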
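
And a second sketch of the adaptive, per-parameter scaling rules (AdaGrad, RMSProp, Adam), using the same assumed quadratic gradient; the `eps`, `beta1`, and `beta2` values are typical defaults chosen for illustration.

```python
import numpy as np

# Same illustrative quadratic objective as in the previous sketch (an assumption)
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])

def grad(w):
    return A @ w - b

def adagrad(w, eta=0.5, eps=1e-8, steps=100):
    """AdaGrad: accumulated squared gradients, so the per-parameter scale only shrinks."""
    v = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        v = v + g ** 2                          # v_{t+1} = v_t + g(w_t)^2
        w = w - eta / (eps + np.sqrt(v)) * g
    return w

def rmsprop(w, eta=0.1, beta=0.9, eps=1e-8, steps=100):
    """RMSProp: exponential moving average of squared gradients (scale can grow or shrink)."""
    v = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        v = beta * v + (1 - beta) * g ** 2
        w = w - eta / (eps + np.sqrt(v)) * g
    return w

def adam(w, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    """Adam: bias-corrected first and second moment estimates."""
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(steps):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g         # first moment (mean of gradients)
        v = beta2 * v + (1 - beta2) * g ** 2    # second moment (mean of squared gradients)
        m_hat = m / (1 - beta1 ** (t + 1))      # bias correction
        v_hat = v / (1 - beta2 ** (t + 1))
        w = w - eta / (eps + np.sqrt(v_hat)) * m_hat
    return w

w0 = np.zeros(2)
print(adagrad(w0), rmsprop(w0), adam(w0))
```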
Newton’s Method¶
Integrates more “global” structure into the optimization method by scaling gradients according to the inverse of the Hessian: \(w_{t+1} = w_t - \eta H^{-1} g(w_t)\). This is equivalent to approximating the function as a quadratic via a second-order Taylor expansion, then solving for the optimum of that approximation.
- \(\eta=1:\) Full step
- Otherwise: damped Newton method
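
A small sketch of the (damped) Newton step on the same assumed quadratic as in the sketches above; with \(\eta = 1\) a single full step reaches the minimizer, since for a quadratic the Hessian is just the constant matrix `A`.

```python
import numpy as np

# Same illustrative quadratic f(w) = 0.5 w^T A w - b^T w as in the sketches above
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])

def grad(w):
    return A @ w - b

def newton_step(w, eta=1.0):
    """One (damped) Newton step: w - eta * H^{-1} g(w).
    Solve H d = g rather than forming H^{-1} explicitly."""
    H = A                                   # Hessian of the quadratic objective
    d = np.linalg.solve(H, grad(w))
    return w - eta * d

w = np.zeros(2)
print(newton_step(w, eta=1.0))              # full step: exact minimizer in one iteration
print(np.linalg.solve(A, b))                # closed-form minimizer, for comparison
```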