Worst-First Backpropagation¶
Backpropagation is expensive, so only run the backward pass on the top-k highest-loss examples in each batch (the examples the model currently gets most wrong).
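One way to read this note (a hedged sketch, not an established API: the helper name, the cross-entropy loss, and the two-pass structure are my assumptions) is to rank examples with a gradient-free forward pass and then run the full forward/backward only on the k worst ones:

```python
import torch
import torch.nn.functional as F

def worst_first_step(model, optimizer, inputs, labels, k):
    """Run the expensive backward pass only on the k highest-loss examples."""
    # Cheap, gradient-free pass to rank examples by how badly the model does on them
    with torch.no_grad():
        losses = F.cross_entropy(model(inputs), labels, reduction="none")
    worst = torch.topk(losses, k=min(k, losses.numel())).indices

    # Full forward + backward restricted to the selected worst-k examples
    optimizer.zero_grad()
    loss = F.cross_entropy(model(inputs[worst]), labels[worst])
    loss.backward()
    optimizer.step()
```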
Gradient Accumulation¶
- Use a small batch size
- Save the gradients at each batch
- Update network weights once every couple of batches
Purpose
- Helps to imitate a larger batch size
- Useful for large, GPU-memory-intensive architectures
Notes
- Some network architectures have batch-specific operations. For instance, batch normalization is performed on a batch level and therefore may yield slightly different results when using the same effective batch size with and without gradient accumulation
- It is important to also update the weights after the final batch, so that the last few batches are not discarded but still used for optimizing the network
Performance Improvement¶
Evaluation Frequency¶
- train_eval_every and dev_eval_every control how often the model is evaluated on the training and dev sets
- Evaluating too frequently slows training down; a value of around 10 is a good default
1. Consider using another learning rate schedule¶
The learning rate (schedule) you choose has a large impact on the speed of convergence as well as the generalization performance of your model.
Cyclical Learning Rates and the 1Cycle learning rate schedule are both methods introduced by Leslie N. Smith (here and here), and then popularised by fast.ai's Jeremy Howard and Sylvain Gugger (here and here). Essentially, the 1Cycle schedule ramps the learning rate up from a minimum to a maximum and then back down again over the course of training.
Sylvain writes:
[1cycle consists of] two steps of equal lengths, one going from a lower learning rate to a higher one, then going back to the minimum. The maximum should be the value picked with the Learning Rate Finder, and the lower one can be ten times lower. Then, the length of this cycle should be slightly less than the total number of epochs, and, in the last part of training, we should allow the learning rate to decrease more than the minimum, by several orders of magnitude.
In the best case this schedule achieves a massive speed-up – what Smith calls Superconvergence – as compared to conventional learning rate schedules. Using the 1Cycle policy he needs ~10x fewer training iterations of a ResNet-56 on ImageNet to match the performance of the original paper, for instance. The schedule seems to perform robustly well across common architectures and optimizers.
PyTorch implements both of these methods as torch.optim.lr_scheduler.CyclicLR and torch.optim.lr_scheduler.OneCycleLR; see the documentation.
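As a minimal sketch (the model, data, and loop sizes below are placeholders), OneCycleLR is constructed from the optimizer plus the planned number of steps and is stepped once per batch:

```python
import torch
import torch.nn as nn

model = nn.Linear(32, 2)                      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.1, steps_per_epoch=100, epochs=10
)

for epoch in range(10):
    for _ in range(100):                      # stand-in for iterating a DataLoader
        optimizer.zero_grad()
        loss = model(torch.randn(16, 32)).sum()
        loss.backward()
        optimizer.step()
        scheduler.step()                      # note: stepped per batch, not per epoch
```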
One drawback of these schedulers is that they introduce a number of additional hyperparameters. This post and this repo offer a nice overview of how good hyperparameters can be found, including an implementation of the Learning Rate Finder mentioned above.
Why does this work? It doesn't seem entirely clear, but one possible explanation is that regularly increasing the learning rate helps the optimizer traverse saddle points in the loss landscape more quickly.
2. Use multiple workers and pinned memory in DataLoader¶
When using torch.utils.data.DataLoader, set num_workers > 0, rather than the default value of 0, and pin_memory=True, rather than the default value of False. Details of this are explained here.
Szymon Migacz achieves a 2x speed-up for a single training epoch by using four workers and pinned memory.
A common rule of thumb for choosing the number of workers is to set it to four times the number of available GPUs; both larger and smaller values tend to slow things down.
Note that increasing num_workers will increase your CPU memory consumption.
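Putting this together, a DataLoader configured along these lines might look like the following (the dataset is a random stand-in):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 3, 32, 32), torch.randint(0, 10, (1000,)))

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,     # e.g. roughly 4x the number of GPUs, per the rule of thumb above
    pin_memory=True,   # allows faster (and asynchronous) host-to-GPU copies
)
```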
3. Max out the batch size¶
This is a somewhat contentious point. Generally, however, it seems like using the largest batch size your GPU memory permits will accelerate your training (see NVIDIA's Szymon Migacz, for instance). Note that you will also have to adjust other hyperparameters, such as the learning rate, if you modify the batch size. A rule of thumb here is to double the learning rate as you double the batch size.
OpenAI has a nice empirical paper on the number of convergence steps needed for different batch sizes. Daniel Huynh runs some experiments with different batch sizes (also using the 1Cycle policy discussed above) where he achieves a 4x speed-up by going from batch size 64 to 512.
One of the downsides of using large batch sizes, however, is that they might lead to solutions that generalize worse than those trained with smaller batches.
4. Use Automatic Mixed Precision (AMP)¶
The release of PyTorch 1.6 brought a native implementation of Automatic Mixed Precision training to PyTorch. The main idea is that certain operations can be run faster, and without a loss of accuracy, in half precision (FP16) rather than in the single precision (FP32) used elsewhere. AMP then automatically decides which operations should be executed in which format. This allows both for faster training and a smaller memory footprint.
In the best case, the usage of AMP would look something like this:
import torch

# Creates the gradient scaler once at the beginning of training
scaler = torch.cuda.amp.GradScaler()

for data, label in data_iter:
    optimizer.zero_grad()
    # Casts operations to mixed precision
    with torch.cuda.amp.autocast():
        output = model(data)
        loss = loss_fn(output, label)   # loss_fn: your loss criterion (assumed here)
    # Scales the loss, and calls backward()
    # to create scaled gradients
    scaler.scale(loss).backward()
    # Unscales gradients and calls
    # or skips optimizer.step()
    scaler.step(optimizer)
    # Updates the scale for next iteration
    scaler.update()
Benchmarking a number of common language and vision models on NVIDIA V100 GPUs, Huang and colleagues find that using AMP over regular FP32 training yields roughly 2x – but up to 5.5x – training speed-ups.
Currently, only CUDA ops can be autocast in this way. See the documentation here for more details on this and other limitations.
u/SVPERBlA points out that you can squeeze out some additional performance (~ 20%) from AMP on NVIDIA Tensor Core GPUs if you convert your tensors to the Channels Last memory format. Refer to this section in the NVIDIA docs for an explanation of the speedup and more about NCHW versus NHWC tensor formats.
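A small sketch of what that conversion might look like (placeholder layer and input; both the model and its inputs need to be converted):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 64, kernel_size=3).cuda().to(memory_format=torch.channels_last)
inputs = torch.randn(8, 3, 224, 224, device="cuda").to(memory_format=torch.channels_last)

with torch.cuda.amp.autocast():
    out = model(inputs)   # convolutions can now pick NHWC-friendly Tensor Core kernels
```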
5. Consider using another optimizer¶
AdamW is Adam with weight decay (rather than L2-regularization), which was popularized by fast.ai and is now available natively in PyTorch as torch.optim.AdamW. AdamW seems to consistently outperform Adam in terms of both the error achieved and the training time. See this excellent blog post on why using weight decay instead of L2-regularization makes a difference for Adam.
Both Adam and AdamW work well with the 1Cycle policy described above.
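Swapping it in is a one-line change (placeholder model; the lr and weight_decay values are just examples, not recommendations):

```python
import torch

model = torch.nn.Linear(128, 10)   # placeholder model

# Decoupled weight decay via AdamW instead of L2 regularization via Adam
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)
```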
There are also a few not-yet-native optimizers that have received a lot of attention recently, most notably LARS (pip installable implementation) and LAMB.
NVIDIA's APEX implements fused versions of a number of common optimizers such as Adam. Compared to the PyTorch implementation of Adam, this implementation avoids a number of passes to and from GPU memory, yielding speed-ups in the range of 5%.
6. Turn on cuDNN benchmarking¶
If your model architecture remains fixed and your input size stays constant, setting torch.backends.cudnn.benchmark = True might be beneficial (docs). This enables the cuDNN autotuner, which will benchmark a number of different ways of computing convolutions in cuDNN and then use the fastest method from then on.
For a rough reference on the type of speed-up you can expect from this, Szymon Migacz achieves a speed-up of 70% on a forward pass for a convolution and a 27% speed-up for a forward + backward pass of the same convolution.
One caveat here is that this autotuning might become very slow if you max out the batch size as mentioned above.
7. Beware of frequently transferring data between CPUs and GPUs¶
Beware of frequently transferring tensors from a GPU to a CPU using tensor.cpu() and vice versa using tensor.cuda(), as these are relatively expensive. The same applies for .item() and .numpy() – use .detach() instead.
If you are creating a new tensor, you can also directly assign it to your GPU using the keyword argument device=torch.device('cuda:0').
If you do need to transfer data, using .to(non_blocking=True) might be useful as long as you don't have any synchronization points after the transfer.
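A short sketch of both points (the device name and shapes are arbitrary; the non_blocking copy assumes the source tensor lives in pinned memory, e.g. via pin_memory=True as in tip 2):

```python
import torch

device = torch.device("cuda:0")

# Create tensors directly on the GPU instead of creating them on the CPU and moving them
weights = torch.zeros(64, 128, device=device)

# Asynchronous host-to-device copy; only helps if no synchronization point follows
inputs = torch.randn(16, 128).pin_memory()
inputs = inputs.to(device, non_blocking=True)
```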
If you really have to, you might want to give Santosh Gupta's SpeedTorch a try, although it doesn't seem entirely clear when this actually does/doesn't provide speed-ups.
8. Use gradient/activation checkpointing¶
Quoting directly from the documentation:
Checkpointing works by trading compute for memory. Rather than storing all intermediate activations of the entire computation graph for computing backward, the checkpointed part does not save intermediate activations, and instead recomputes them in backward pass. It can be applied on any part of a model.
Specifically, in the forward pass, function will run in torch.no_grad() manner, i.e., not storing the intermediate activations. Instead, the forward pass saves the inputs tuple and the function parameter. In the backwards pass, the saved inputs and function is retrieved, and the forward pass is computed on function again, now tracking the intermediate activations, and then the gradients are calculated using these activation values.
So while this might slightly increase your run time for a given batch size, you'll significantly reduce your memory footprint. This in turn will allow you to further increase the batch size you're using, allowing for better GPU utilization.
While checkpointing is implemented natively as torch.utils.checkpoint (docs), it does seem to take some thought and effort to implement properly. Priya Goyal has a good tutorial demonstrating some of the key aspects of checkpointing.
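As a rough sketch (with a hypothetical deep sequential model), torch.utils.checkpoint.checkpoint_sequential splits the model into segments and only stores activations at segment boundaries, recomputing the rest during the backward pass:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Hypothetical deep stack of blocks
model = nn.Sequential(*[nn.Sequential(nn.Linear(256, 256), nn.ReLU()) for _ in range(16)])
inputs = torch.randn(32, 256, requires_grad=True)

out = checkpoint_sequential(model, 4, inputs)   # 4 checkpointed segments
out.sum().backward()
```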
9. Use gradient accumulation¶
Another approach to increasing the batch size is to accumulate gradients across multiple .backward() passes before calling optimizer.step().
Following a post by Hugging Face's Thomas Wolf, gradient accumulation can be implemented as follows:
model.zero_grad()                                   # Reset gradients tensors
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                     # Forward pass
    loss = loss_function(predictions, labels)       # Compute loss function
    loss = loss / accumulation_steps                # Normalize our loss (if averaged)
    loss.backward()                                 # Backward pass
    if (i + 1) % accumulation_steps == 0:           # Wait for several backward steps
        optimizer.step()                            # Now we can do an optimizer step
        model.zero_grad()                           # Reset gradients tensors
        if (i + 1) % evaluation_steps == 0:         # Evaluate the model when we...
            evaluate_model()                        # ...have no gradients accumulated
This method was developed mainly to circumvent GPU memory limitations, and I'm not entirely clear on the trade-offs of the additional .backward() passes. This discussion on the fastai forum seems to suggest that it can in fact accelerate training, so it's probably worth a try.
10. Use Distributed Data Parallel for multi-GPU training¶
Methods to accelerate distributed training probably warrant their own post, but one simple one is to use torch.nn.parallel.DistributedDataParallel rather than torch.nn.DataParallel. By doing so, each GPU is driven by its own dedicated process, which avoids the GIL issues of DataParallel.
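A bare-bones sketch of the single-node setup (assuming the script is launched with torchrun, which sets LOCAL_RANK and the other rendezvous environment variables; the model and data are placeholders):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dist.init_process_group(backend="nccl")            # one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(32, 2).cuda()              # placeholder model
model = DDP(model, device_ids=[local_rank])

dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
sampler = DistributedSampler(dataset)              # each process gets its own shard
loader = DataLoader(dataset, batch_size=64, sampler=sampler)
```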
In general, I can strongly recommend reading the documentation on distributed training.
11. Set gradients to None rather than 0¶
Use .zero_grad(set_to_none=True) rather than .zero_grad().
Doing so will let the memory allocator handle the gradients rather than actively setting them to 0. This will yield a modest speed-up, as they say in the documentation, so don't expect any miracles.
Watch out, doing this is not side-effect free! Check the docs for the details on this.
12. Use .as_tensor() rather than .tensor()¶
torch.tensor() always copies data. If you have a numpy array that you want to convert, use torch.as_tensor() or torch.from_numpy() to avoid copying the data.
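A quick illustration of the difference:

```python
import numpy as np
import torch

arr = np.ones((3, 3), dtype=np.float32)

copied = torch.tensor(arr)          # always copies
shared = torch.from_numpy(arr)      # shares memory with the numpy array
also_shared = torch.as_tensor(arr)  # no copy when dtype/device already match

arr[0, 0] = 5.0
print(copied[0, 0].item(), shared[0, 0].item(), also_shared[0, 0].item())  # 1.0 5.0 5.0
```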
13. Turn on debugging tools only when actually needed¶
PyTorch offers a number of useful debugging tools such as the autograd profiler, gradcheck, and anomaly detection. Make sure to use them to better understand your model when needed, but also to turn them off when you don't need them, as they will slow down your training.
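For instance (placeholder model and input), scope these tools to the steps you are actually investigating rather than leaving them on for the whole run:

```python
import torch

model = torch.nn.Linear(32, 2)      # placeholder model
inputs = torch.randn(16, 32)

# Anomaly detection: enable only while hunting down a NaN/Inf in the backward pass
with torch.autograd.detect_anomaly():
    model(inputs).sum().backward()

# Profiling: wrap only the iterations you want to inspect
with torch.autograd.profiler.profile() as prof:
    model(inputs)
print(prof.key_averages().table(sort_by="cpu_time_total"))
```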
14. Use gradient clipping¶
Originally used to avoid exploding gradients in RNNs, there is both some empirical evidence as well as some theoretical support that clipping gradients (roughly speaking: gradient = min(gradient, threshold)) accelerates convergence.
Hugging Face's Transformer implementation is a really clean example of how to use gradient clipping as well as some of the other methods such as AMP mentioned in this post.
In PyTorch this can be done using torch.nn.utils.clip_grad_norm_ (documentation).
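The call goes between loss.backward() and optimizer.step(), for example (placeholder model and data; max_norm=1.0 is just a commonly used value, not a recommendation):

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(32, 2)                  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

inputs, targets = torch.randn(16, 32), torch.randn(16, 2)

optimizer.zero_grad()
loss = F.mse_loss(model(inputs), targets)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip global grad norm
optimizer.step()
```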
It's not entirely clear to me which models benefit how much from gradient clipping, but it seems to be robustly useful for RNNs, Transformer-based architectures and ResNets, across a range of different optimizers.
15. Turn off bias before BatchNorm¶
This is a very simple one: turn off the bias of layers before BatchNormalization layers. For a 2-D convolutional layer, this can be done by setting the bias keyword to False: torch.nn.Conv2d(..., bias=False, ...). (Here's a reminder why this makes sense.)
You will save some parameters; I would, however, expect the speed-up from this to be relatively small compared to some of the other methods mentioned here.
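A typical conv–BN block would then look like this (channel counts are arbitrary):

```python
import torch.nn as nn

# The BatchNorm layer's own learnable shift makes the preceding conv bias redundant
block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)
```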
17. Use input and batch normalization¶
You're probably already doing this but you might want to double-check:
- Are you normalizing your input?
- Are you using batch-normalization?
And here's a reminder of why you probably should.
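For image inputs, normalization is often handled in the data pipeline, e.g. with torchvision (the mean/std values below are the commonly used ImageNet statistics; substitute your own dataset's statistics where appropriate):

```python
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.ToTensor(),                                   # scales pixels to [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],         # per-channel mean
                         std=[0.229, 0.224, 0.225]),         # per-channel std
])
```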
Bonus tip from the comments: Use JIT to fuse point-wise operations.¶
If you have adjacent point-wise operations, you can use PyTorch JIT to combine them into one FusionGroup, which can then be launched on a single kernel rather than the multiple kernels that would have been used by default. You'll also save some memory reads and writes.
Szymon Migacz shows how you can use the @torch.jit.script decorator to fuse the operations in a GELU, for instance:
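(A sketch of what such a scripted GELU might look like; the exact formulation in Migacz's example may differ.)

```python
import torch

@torch.jit.script
def fused_gelu(x):
    # erf-based GELU: several point-wise ops that TorchScript can fuse into one kernel
    return x * 0.5 * (1.0 + torch.erf(x / 1.41421))
```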
In this case, fusing the operations leads to a 5x speed-up for the execution of fused_gelu as compared to the unfused version.
See also this post for an example of how Torchscript can be used to accelerate an RNN.
Hat tip to u/Patient_Atmosphere45 for the suggestion.