Transformers
Spatial Transformer
Differentiable (and hence learnable) way of cropping images, allowing attention models to attend to arbitrary regions of the input
- Function mapping pixel coordinates \((x_t, y_t)\) of the output to pixel coordinates \((x_s, y_s)\) of the input (e.g. an affine transform with learnable parameters)
- Repeat for all pixels in output to get a sampling grid
- Use bilinear interpolation to compute each output pixel from the input at its sampled (generally non-integer) coordinates
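
The three steps above can be sketched in NumPy. This is an illustrative sketch, not a reference implementation: it assumes the coordinate mapping is a 2x3 affine matrix `theta`, that coordinates are normalised to \([-1, 1]\), and that the image is a single-channel 2-D array of at least 2x2 pixels. The function names `affine_grid` and `bilinear_sample` are chosen here for illustration.

```python
import numpy as np

def affine_grid(theta, out_h, out_w):
    """Map every output pixel (x_t, y_t) to a source location (x_s, y_s).

    Assumption: theta is a 2x3 affine matrix acting on normalised
    coordinates in [-1, 1].
    """
    ys, xs = np.meshgrid(np.linspace(-1.0, 1.0, out_h),
                         np.linspace(-1.0, 1.0, out_w), indexing="ij")
    target = np.stack([xs, ys, np.ones_like(xs)], axis=-1)  # (out_h, out_w, 3)
    return target @ theta.T                                  # (out_h, out_w, 2)

def bilinear_sample(image, grid):
    """Sample a 2-D image at the (x_s, y_s) locations stored in grid."""
    h, w = image.shape
    # Convert normalised coordinates to pixel coordinates, clamped to the image.
    x = np.clip((grid[..., 0] + 1.0) * (w - 1) / 2.0, 0, w - 1)
    y = np.clip((grid[..., 1] + 1.0) * (h - 1) / 2.0, 0, h - 1)
    # Top-left neighbour of each sample point; the other three follow from it.
    x0 = np.clip(np.floor(x).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, h - 2)
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0  # fractional offsets act as interpolation weights
    return ((1 - wy) * (1 - wx) * image[y0, x0] + (1 - wy) * wx * image[y0, x1]
            + wy * (1 - wx) * image[y1, x0] + wy * wx * image[y1, x1])

image = np.arange(16, dtype=float).reshape(4, 4)

# Identity transform: the output reproduces the input.
identity = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
same = bilinear_sample(image, affine_grid(identity, 4, 4))

# Scaling by 0.5 samples only the central region, i.e. a crop.
zoom = np.array([[0.5, 0.0, 0.0],
                 [0.0, 0.5, 0.0]])
crop = bilinear_sample(image, affine_grid(zoom, 2, 2))
```

Every step is built from additions and multiplications of `theta` and the pixel values, so in a framework with automatic differentiation the same computation would let gradients flow back to `theta`; NumPy itself does not compute those gradients.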