# Gradient Estimation
## Types
Method | Definition | Error | Disadvantage | Accurate estimate | Low computational cost | Easy to write
---|---|---|---|---|---|---
Numerical (forward difference) | \(\dfrac{f(x + \epsilon) - f(x)}{\epsilon}\) for small \(\epsilon\) | \(O(\epsilon)\) | Less accurate | ❌ | ❌ | ✅
Numerical (central difference) | \(\dfrac{f(x + \epsilon) - f(x - \epsilon)}{2\epsilon}\) for small \(\epsilon\) | \(O(\epsilon^2)\) | Still suffers from numerical error | ❌ | ❌ | ✅
Manual analytic | Derive the gradient by hand via sum, product, chain rules | Exact | Tedious, error-prone | ✅ | ✅ | ❌
Backpropagation | Run backward operations on the same forward graph | Exact | Must cache intermediate results | ✅ | ✅ | ✅
Forward mode automatic | Output: computational graph. Define \(\dot v_i = \dfrac{\partial v_i}{\partial x_j}\), where \(v_i\) is an intermediate result | Exact | \(n\) forward passes required to get the gradient w.r.t. each of \(n\) inputs | ✅ | ❌ | ✅
Reverse mode automatic | Output: computational graph. Define the adjoint \(\bar v_i = \dfrac{\partial y}{\partial v_i}\), where \(v_i\) is an intermediate result, and the partial adjoint \(\overline{v_{k \to i}} = \bar v_i \dfrac{\partial v_i}{\partial v_k}\), so that \(\bar v_k = \sum_i \overline{v_{k \to i}}\) | Exact | Must store intermediate results | ✅ | ✅ | ✅
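To illustrate the error column, here is a minimal sketch (not from the original notes; the test function \(\sin\), the point \(x = 1\), and the step sizes are arbitrary choices) showing that the forward-difference error shrinks roughly linearly in \(\epsilon\) while the central-difference error shrinks quadratically:

```python
# Minimal sketch: compare forward- vs central-difference error against the true
# derivative of sin at x = 1. Function and step sizes are arbitrary choices.
import numpy as np

f = np.sin
x = 1.0
df_true = np.cos(x)  # exact derivative of sin

for eps in (1e-2, 1e-3, 1e-4):
    fwd = (f(x + eps) - f(x)) / eps              # forward difference, error O(eps)
    ctr = (f(x + eps) - f(x - eps)) / (2 * eps)  # central difference, error O(eps^2)
    print(f"eps={eps:.0e}  fwd_err={abs(fwd - df_true):.2e}  ctr_err={abs(ctr - df_true):.2e}")
```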
## Numerical gradient checking
Always use the analytic gradient, but check the implementation with a numerical gradient
\[ \delta^T \nabla_x f(x) = \dfrac{f(x + \epsilon \delta) - f(x - \epsilon \delta)}{2 \epsilon} + O(\epsilon^2) \]
Pick \(\delta\) from unit ball
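A minimal sketch of this check (my own toy function and names, not from the notes): compare \(\delta^T \nabla_x f(x)\) computed from the analytic gradient against the central difference along a random unit direction \(\delta\).

```python
# Minimal sketch of a numerical gradient check: the analytic gradient of a toy
# function is compared against the central-difference directional derivative.
import numpy as np

def f(x):                       # example function: f(x) = ||x||^2 / 2
    return 0.5 * np.dot(x, x)

def grad_f(x):                  # its analytic gradient, the thing being checked
    return x

rng = np.random.default_rng(0)
x = rng.normal(size=5)

delta = rng.normal(size=5)
delta /= np.linalg.norm(delta)  # pick delta from the unit ball (unit sphere here)

eps = 1e-5
lhs = delta @ grad_f(x)                                       # delta^T grad f(x)
rhs = (f(x + eps * delta) - f(x - eps * delta)) / (2 * eps)   # central difference
assert abs(lhs - rhs) < 1e-6, (lhs, rhs)
```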
## Backpropagation
### Steps
- Forward pass
    - Compute the result of each operation
    - Save intermediates required for gradient computation
- Calculate loss
- Backward pass
    - Compute the gradient of the loss function wrt inputs, applying the chain rule recursively (see the sketch below)
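A minimal sketch of the two passes on a toy scalar computation \(y = wx + b\), \(\text{loss} = (y - t)^2\) (the function, values, and names are my own, not from the notes): the forward pass caches the intermediates, the backward pass applies the chain rule.

```python
# Minimal sketch of forward + backward passes on a toy scalar computation.
def forward(w, x, b, t):
    y = w * x + b                 # forward: compute the operation's result
    loss = (y - t) ** 2           # calculate loss
    cache = (w, x, y, t)          # save intermediates needed for gradients
    return loss, cache

def backward(cache):
    w, x, y, t = cache
    dloss_dy = 2 * (y - t)        # d/dy of (y - t)^2
    dloss_dw = dloss_dy * x       # chain rule: dL/dw = dL/dy * dy/dw
    dloss_db = dloss_dy * 1.0     # dy/db = 1
    dloss_dx = dloss_dy * w       # dy/dx = w
    return dloss_dw, dloss_db, dloss_dx

loss, cache = forward(w=2.0, x=3.0, b=1.0, t=5.0)
print(backward(cache))            # (12.0, 4.0, 8.0)
```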
### Gradient flow notes
- Gradients accumulate at branches
    - Gradients are accumulated because the same node may be referenced multiple times in the forward pass
    - Clear gradients before the backward pass
- Sum and broadcasting are duals of each other in the forward and backward passes: the backward pass of a broadcast is a sum over the broadcast axes, and vice versa
- Gradient patterns of common ops (sketch below)
    - add/sum: gradient distributor (passes the upstream gradient unchanged to every input)
    - max: gradient router (only the argmax input receives the upstream gradient)
        - ties (multiple inputs attaining the max) are rare in practice
    - mul: gradient swapper (each input receives the upstream gradient scaled by the other input)
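A minimal hand-worked sketch of these patterns (my own numbers, not from the notes): with an upstream gradient of 1, add distributes it, max routes it to the larger input, mul swaps the inputs, and a value feeding two branches accumulates both contributions.

```python
# Minimal sketch of the local gradient patterns listed above.
a, b, upstream = 3.0, 5.0, 1.0

# add: both inputs receive the upstream gradient unchanged
grad_a_add, grad_b_add = upstream, upstream

# max: only the larger input receives the upstream gradient
grad_a_max = upstream if a > b else 0.0
grad_b_max = upstream if b >= a else 0.0

# mul: each input receives the upstream gradient scaled by the *other* input
grad_a_mul, grad_b_mul = upstream * b, upstream * a

# branching: x feeds two multiplications, so its gradient is the sum of both
# contributions: y = x * a + x * b  =>  dy/dx = a + b
x = 2.0
grad_x = upstream * a + upstream * b

print(grad_a_mul, grad_b_mul, grad_x)   # 5.0 3.0 8.0
```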
### Summary
Basically just chain rule + intelligent caching of intermediate results
Computationally cheap, but large storage is required for caching intermediate results
Each layer needs to be able to perform a vector-Jacobian product: multiply the “incoming backward gradient” by its local derivatives (Jacobian)
\[ \begin{aligned} \dfrac{\partial J}{\partial \theta_l} &= \dfrac{\partial J}{\partial y_L} \left( \prod_{i=l+1}^{L} \dfrac{\partial y_i}{\partial y_{i-1}} \right) \dfrac{\partial y_l}{\partial \theta_l} \end{aligned} \]
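A minimal sketch of layer-wise VJPs for a two-layer linear network with loss \(J = \tfrac{1}{2}\lVert y_2 \rVert^2\) (my own toy setup; the shapes, loss, and names are arbitrary): each layer multiplies the incoming backward gradient by \(W_i^T\) and peels off its parameter gradient, and one entry is verified against a central-difference estimate.

```python
# Minimal sketch: per-layer vector-Jacobian products for y2 = W2 @ (W1 @ x).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))

# forward pass, caching intermediates
y1 = W1 @ x
y2 = W2 @ y1
J = 0.5 * np.sum(y2 ** 2)         # toy loss J = ||y2||^2 / 2

# backward pass: propagate dJ/dy_i and peel off parameter gradients layer by layer
g = y2                            # dJ/dy2 for this loss
dW2 = np.outer(g, y1)             # dJ/dW2 = (dJ/dy2) y1^T
g = W2.T @ g                      # VJP: dJ/dy1 = W2^T dJ/dy2
dW1 = np.outer(g, x)              # dJ/dW1 = (dJ/dy1) x^T

# check one entry of dW1 against a central-difference estimate
eps = 1e-6
W1p, W1m = W1.copy(), W1.copy()
W1p[0, 0] += eps
W1m[0, 0] -= eps
num = (0.5 * np.sum((W2 @ (W1p @ x)) ** 2)
       - 0.5 * np.sum((W2 @ (W1m @ x)) ** 2)) / (2 * eps)
assert abs(num - dW1[0, 0]) < 1e-4
```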