Boost PyTorch Performance with Multi-GPU and Accelerate Library

Hugging Face's Accelerate library simplifies running PyTorch models on multi-GPU setups with mixed-precision and DeepSpeed integration.

Introduction

Running deep learning models on multiple GPUs or machines can be complex, but Hugging Face’s Accelerate library makes it much easier. Designed for PyTorch users, Accelerate streamlines device management, allowing you to scale models from single-GPU to multi-GPU setups without major code changes. Whether you’re leveraging multi-CPU configurations, mixed-precision training, or integrating DeepSpeed, this library simplifies the entire process. In this article, we explore how Accelerate enhances PyTorch workflows and makes distributed machine learning more accessible.

What is Accelerate?

Accelerate is a library that helps simplify the process of running machine learning models on multiple GPUs or machines. It allows users to keep their original code intact while making it easier to scale the model across different devices. This tool helps users avoid complex setup processes by automating many steps, like managing devices and distributing tasks, making it simpler for anyone to work with powerful machine learning setups.

Created by Hugging Face, Accelerate takes your PyTorch code, which is usually written for a single GPU, and turns it into code that works on multiple GPUs. And the best part? It doesn't matter whether you're working on one machine or many. The library simplifies distributed machine learning while letting you keep your original PyTorch code intact, so scaling your models up to more devices doesn't require drastic changes.

The reason Accelerate exists is that modern deep learning models are getting more complex, and so is the data used to train them. As AI keeps pushing the limits of what's possible, training these models demands more powerful hardware, like GPUs. But running models across several GPUs can be a real headache: traditional ways of scaling PyTorch to multiple GPUs often mean making complex changes to your code or learning a whole new API.

What makes Accelerate stand out is that it gives you an easy way to scale up your PyTorch code without losing control over the important details. You can still write regular PyTorch code and run it on multiple GPUs, whether on one machine or many. And here's the cool part: the same code runs in both distributed and non-distributed setups without any tweaks to the main logic. That's a big deal compared to traditional PyTorch distributed launches, which require significant changes to switch between setups.

Thanks to Accelerate’s simplicity, developers can spend more time focusing on their models and less time dealing with all the complicated infrastructure stuff.

Read more about the Accelerate library and its capabilities in the official Hugging Face Accelerate Documentation.

Code changes to use Accelerate

If you’re working with general PyTorch code, chances are you’re writing your own training loop for the model. So, here’s how a typical PyTorch training loop might go:

Import libraries: This is the part where you load all the necessary libraries and modules for your task. This will include PyTorch, of course, but also any other libraries needed for things like data processing or model evaluation.

Set device: This is where you decide which hardware you want to run your model on, such as a GPU or CPU. This is a key step because you need to make sure your model is running on the right hardware. If you’re using a GPU, for example, you’ll need to point your model and data to that device.

Point model to device: After setting up the model, you explicitly assign it to the device (such as GPU). This is how you ensure that your computations happen on the right hardware.

Choose optimizer: Now, you define which optimizer you want to use. The optimizer is responsible for adjusting the model’s weights during training. A popular choice is the Adam optimizer, which works well for most deep learning tasks.

Load dataset using DataLoader: The DataLoader in PyTorch helps load and batch your datasets so that you can feed data to your model in small batches during training.

Train model in loop (one round per epoch): This is where the magic happens! Your model will loop through the data, train on each batch, calculate the outputs, figure out the loss, and adjust its weights. The loop looks like this (a code sketch of the whole thing follows this list):

Point source data and targets to device: Your input data and labels need to be sent to the right device, whether it’s the CPU or GPU.

Zero the network gradients: Gradients build up during backpropagation, so you need to clear them before each new training step.

Calculate output from model: You’ll then pass your data through the model to get the predicted output.

Calculate loss: The loss function (like cross-entropy for classification tasks) will tell you how far off the model’s predictions are from the real labels.

Backpropagate the gradient and step the optimizer: Calling backward() on the loss computes the gradients via backpropagation, and the optimizer then uses those gradients to update the model's weights and reduce the loss.

In addition to these main steps, you might also have other things going on, like preparing the data or testing the model on test data, depending on your specific task.
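Put together, the loop described above looks roughly like the following minimal sketch. The model, data, and hyperparameters here are placeholders (a tiny classifier and synthetic tensors) used only to show the structure:

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Set device: use the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder model and synthetic data, just to illustrate the loop
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

dataset = TensorDataset(torch.randn(256, 128), torch.randint(0, 10, (256,)))
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

for epoch in range(3):
    for inputs, targets in dataloader:
        # Point source data and targets to the device
        inputs, targets = inputs.to(device), targets.to(device)

        optimizer.zero_grad()             # zero the network gradients
        outputs = model(inputs)           # calculate output from the model
        loss = loss_fn(outputs, targets)  # calculate loss
        loss.backward()                   # backpropagate the gradients
        optimizer.step()                  # update the weights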

Now, when you check out the Accelerate GitHub repository, you’ll see how the code changes compared to regular PyTorch. These changes are visually shown with color-coded highlights: green for new lines, and red for removed ones.

At first glance, these changes might not look like they’re simplifying things all that much. But if you pay attention to the red lines (which are the ones that got removed), you’ll notice that a lot of the complicated device management code, like explicitly telling the code which device to use, is no longer needed. This means your code is a lot cleaner, and you can focus on the core training process without dealing with all the messy device management stuff. Accelerate makes it easier to scale your code to work on multiple GPUs or in distributed environments.

Here’s a breakdown of what’s happening in the code changes:

Import the Accelerator: You'll start by importing the Accelerator class from the accelerate library at the beginning of your script.

Use the accelerator as the device: Instead of manually managing devices like CPU or GPU, Accelerate uses the Accelerator object to take care of that for you.

Instantiate the model without specifying a device: You don’t need to manually assign the model to a device (whether it’s GPU or CPU). The Accelerator handles that automatically.

Set up the model, optimizer, and data to be used by Accelerate: Now you simply pass your model, optimizer, and dataloaders to the Accelerator's prepare() method, and it wraps them so they run on whatever hardware your configuration selects.

No need to point source data and targets to the device: One of the cool things about Accelerate is that it automatically sends your data to the right device, so you don’t have to manually set the device for every batch.

Accelerator handles the backpropagation step: Instead of calling loss.backward() yourself, you call accelerator.backward(loss), which takes care of details like gradient scaling for mixed precision and works the same way in distributed setups (a full sketch of the adapted loop follows below).

This whole process reduces a lot of the repetitive code you’d typically write. And even though it simplifies things, you still get to keep control over the key aspects of the training loop. By handling device management and all the extra steps for you, Accelerate allows you to focus on what matters most—developing and training your model.
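Here's a minimal sketch of the same toy loop adapted to Accelerate. The model and data are still placeholders; the changes to notice are the Accelerator object, the prepare() call, and accelerator.backward():

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()  # picks the right device(s) for you

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

dataset = TensorDataset(torch.randn(256, 128), torch.randint(0, 10, (256,)))
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

# Let Accelerate wrap the model, optimizer, and dataloader for the current setup
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for epoch in range(3):
    for inputs, targets in dataloader:
        # No .to(device) calls needed: batches already arrive on the right device
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)
        accelerator.backward(loss)  # replaces loss.backward()
        optimizer.step()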

For more detailed information on implementing these code changes, check out the official Accelerate Documentation guide.

Single-GPU

The code provided above is designed for running on a single GPU, which means it assumes all the calculations will be done by one processing unit—usually a Graphics Processing Unit (GPU). This setup is perfect for smaller models or datasets, where the complexity and resource demands don’t require multiple GPUs. But, here’s the thing: as the need to scale models grows—especially in deep learning tasks involving large datasets or complex models—switching from a single GPU to a multi-GPU setup becomes pretty much necessary.

In a blog post by the Accelerate team over at Hugging Face, they compare the traditional way of scaling PyTorch code to multi-GPU systems with how things work using the Accelerate library. When you go the traditional route, you need to make pretty detailed changes to your original code, which adds complexity and increases the chances of errors popping up. The multi-GPU setup with the traditional method is a lot more code-heavy. You need extra lines of code to manage how tasks get distributed across the GPUs. It’s a bit of a pain, but here’s how the code changes look:


import os
from torch.utils.data import DistributedSampler
from torch.nn.parallel import DistributedDataParallel

# Each process learns which GPU it owns from the launcher's environment variable
local_rank = int(os.environ.get("LOCAL_RANK", -1))
device = torch.device("cuda", local_rank)

# Move the model to its GPU, then wrap it so gradients stay in sync across processes
model = model.to(device)
model = DistributedDataParallel(model, device_ids=[local_rank])

# Give each process its own shard of the dataset
sampler = DistributedSampler(dataset)
data = torch.utils.data.DataLoader(dataset, sampler=sampler)

# Inside the training loop: reshuffle the shards at the start of each epoch
sampler.set_epoch(epoch)

Each of these lines plays a role in setting up a multi-GPU run the traditional way. The DistributedSampler makes sure each process gets its own slice of the data, while DistributedDataParallel replicates the model on every GPU and keeps the copies in sync by averaging gradients during training. The local_rank variable tells each process which GPU it should use.

However, once you add all these lines, your code will no longer work with just a single GPU. That’s a big drawback, right? The code now becomes too tailored for a multi-GPU setup, and if you want to switch back to using just one GPU, you’ll need to make a bunch of changes. This is where Accelerate really shines.

With Accelerate, you can keep your PyTorch code exactly the same for both single and multi-GPU setups, and you don't need to make any special adjustments for each case. The same code you use for a single GPU will just work on multiple GPUs too, without any extra hassle. This simplifies everything and takes away the headache of managing separate code paths for different types of hardware.
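For example, assuming you've saved an Accelerate-based training script as train.py (the file name is just a placeholder), the very same file can be launched either way:

$ python ./train.py               # runs as a normal single-process job on one GPU or the CPU
$ accelerate launch ./train.py    # runs distributed across the GPUs you configured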

To learn more about optimizing PyTorch for Single-GPU setups, you can visit the PyTorch Official Tutorials.

Running Accelerate

The Accelerate GitHub repository gives you a solid set of examples showing how to run the library in different scenarios. To get started with Accelerate, first launch a Jupyter Notebook, which is a convenient way to run Python code interactively. Once your notebook is set up, follow these steps to install the libraries you'll need:

pip install accelerate
pip install datasets
pip install transformers
pip install scipy
pip install scikit-learn

Once you’ve got the necessary dependencies installed, head on over to the examples directory, where you’ll find some sample scripts. For example, Hugging Face has provided a Natural Language Processing (NLP) example, which makes a lot of sense since Hugging Face has always been all about simplifying NLP. So, if you’re looking to dive into NLP tasks, this is a pretty good starting point.

Next, in the examples directory, you’ll run this Python script:

cd examples

python ./nlp_example.py

This script fine-tunes the BERT transformer model in its base configuration, using the GLUE MRPC dataset. If you’re wondering, the GLUE MRPC dataset is a widely recognized benchmark for determining if two sentences are paraphrases of one another. The model gets trained to understand sentence similarity, which is super important for many NLP applications.
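If you'd like to peek at the data yourself, the datasets library can load MRPC in a couple of lines; a quick sketch:

from datasets import load_dataset

# GLUE MRPC: sentence pairs labeled as paraphrases (1) or not (0)
raw = load_dataset("glue", "mrpc")
print(raw["train"][0])    # shows the sentence1, sentence2, label, and idx fields
print(raw["train"].num_rows)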

While the model is being trained in this example, it’ll output an accuracy of about 85% and an F1 score just under 90%. Now, if you’re not familiar with the F1 score, it’s a handy metric that combines precision and recall. It’s especially useful when you’re working with imbalanced datasets. So, seeing an F1 score near 90% is pretty solid—it shows the fine-tuned BERT model does a great job with this NLP task.
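As a refresher, the F1 score is the harmonic mean of precision and recall, so it's only high when both are high. A toy calculation (the numbers below are made up for illustration, not the script's actual output):

precision, recall = 0.88, 0.91   # hypothetical values, for illustration only
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))              # about 0.895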

By running through this example, you'll get a feel for how Accelerate handles a complete fine-tuning run end to end before you scale up to multiple GPUs.

To get started with running Accelerate on your machine, check out the Hugging Face Accelerate Documentation for a comprehensive guide.

Multi-GPU

When you’re working with multi-GPU setups, that’s where the real magic of the Accelerate library shines. One of the best things about Accelerate is that it lets you use the same code you wrote for training on a single GPU, and it’ll just run on multiple GPUs without you having to make a ton of changes. It makes scaling your models much easier because you don’t get stuck dealing with the headache of manually tweaking the code to fit different hardware setups.

Now, if you want to run your script in a multi-GPU setup with Accelerate, here’s what you need to do. First things first, make sure you’ve got the necessary libraries installed. You can do this by running these commands:


$ pip install accelerate
$ pip install datasets
$ pip install transformers
$ pip install scipy
$ pip install scikit-learn

Once that’s all done, you can move on to the configuration step. Run the following command to set things up:


$ accelerate config

When you run that command, you’ll be prompted to configure your environment. Here’s what you’ll need to fill in:

  • Compute environment: You'll specify where the code is running, for example this machine (0) or a cloud provider such as AWS via Amazon SageMaker (1).
  • Machine type: Choose your hardware, like multi-CPU (1), multi-GPU (2), or TPU (3).
  • Multi-node training: Decide if you’re training on one machine or across multiple machines.
  • DeepSpeed integration: You can choose to use DeepSpeed, which is a library that helps optimize distributed training.
  • FullyShardedDataParallel: This is another distributed training approach you can opt for.
  • GPU count: Tell it how many GPUs you want to use for training. For instance, if you’re using two GPUs on the same machine, you’d select 2.

So, if you’re using a machine with two GPUs, you’ll configure it like this:


How many GPU(s) should be used for distributed training?: [1]: 2


Do you wish to use FP16 or BF16 (mixed precision)?: [NO/fp16/bf16]: no

After the configuration, you’re all set to launch your script with this command:


$ accelerate launch ./nlp_example.py

This will kick off the training process, and Accelerate will automatically take care of distributing the tasks across the GPUs for you. If you want to double-check that both GPUs are being used, you can run this command in the terminal to see the GPU usage:


$ nvidia-smi

This will show you how much each GPU is being utilized, so you'll know that both are actively working during the training. By using Accelerate, setting up for multi-GPU training becomes way easier because it abstracts away a lot of the technical complexity.
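As a side note, if you'd rather skip the interactive questionnaire for a one-off run, accelerate launch also accepts the key settings as flags. Here's a sketch assuming two GPUs on a single machine (check accelerate launch --help for the flags available in your version):

$ accelerate launch --multi_gpu --num_processes 2 --mixed_precision no ./nlp_example.py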

To dive deeper into multi-GPU setups with Accelerate, explore the detailed Hugging Face Multi-GPU Guide for additional configurations and best practices.

More features

So, as we mentioned earlier with the configuration steps, there’s a lot more to the Accelerate library than meets the eye. The setup we talked about is just scratching the surface, and there are several other cool features that make managing distributed machine learning tasks a whole lot easier. These extra features are designed to help you optimize and scale your models with a lot less hassle. Let’s take a look at some of them:

  • A range of arguments for the launched script: Accelerate gives you the flexibility to tweak a variety of settings for the scripts you run. This means you can adjust things to fit your needs, making the library more adaptable to different environments and use cases. If you’re looking for examples of how to fine-tune things, you can find a whole bunch of them on the Accelerate GitHub repository.
  • Multi-CPU support: In addition to supporting multi-GPU setups, Accelerate also lets you take advantage of multiple CPU cores. This is especially handy if you’re working with machines that don’t have GPUs or if you prefer training on CPUs. It’s great for running large models, even when your hardware isn’t super high-end.
  • Multi-GPU across several machines: One of the most powerful features of Accelerate is its ability to train models across multiple machines, not just GPUs. This is perfect for large-scale training when a single machine just doesn’t cut it. The best part? Accelerate makes it super easy to manage all those machines, so you can focus more on building your model rather than stressing over infrastructure.
  • Launcher from .ipynb Jupyter notebooks: If you're working in a Jupyter notebook, you'll love this. Accelerate lets you launch your scripts directly from there, making it so much easier to play around with your models in real time. You can change parameters, observe the results instantly, and keep everything within the notebook interface, with no need to switch back and forth (there's a short sketch of this after the list).
  • Mixed-precision floating point support: If you’re aiming for speed and efficiency, mixed-precision training is a real game changer. This technique uses both 16-bit and 32-bit floating-point numbers, which reduces memory usage and boosts performance without sacrificing accuracy. Accelerate has built-in support for this, making it a fantastic choice for large models or multi-GPU training.
  • DeepSpeed integration: Accelerate works seamlessly with DeepSpeed, which is an optimization library that supercharges the performance of deep learning models, especially when you’re working with very large-scale tasks. DeepSpeed helps with advanced optimization tricks, like model parallelism and gradient accumulation, so you get faster training with less resource consumption.
  • Multi-CPU with MPI (Message Passing Interface): For those of you working with more advanced multi-CPU setups, Accelerate supports MPI, which is widely used in high-performance computing. This allows multiple CPUs to communicate efficiently, so you can scale your models even further without needing GPUs.


For more details on advanced features of Accelerate, check out the full documentation on Hugging Face Accelerate Features.

Computer vision example

So, if you thought Accelerate was just for natural language processing (NLP), think again! There's also a handy example designed specifically for computer vision tasks. It follows a similar structure to the NLP example, but instead of text data it teaches a model to recognize images: you train a ResNet50 network on the Oxford-IIIT Pet Dataset, a well-known image classification benchmark, to classify pictures of different pet breeds. It's a fun way to see how Accelerate can be used for image-related machine learning tasks.

To get this computer vision example up and running in a Jupyter notebook, just follow these simple steps:

First, install all the necessary dependencies. Just run these commands in your terminal or directly in your Jupyter notebook:


$ pip install accelerate
$ pip install datasets
$ pip install transformers
$ pip install scipy
$ pip install scikit-learn
$ pip install timm
$ pip install torchvision

These commands will set you up with Accelerate, the datasets library, transformers, and other essential packages like scipy, scikit-learn, timm, and torchvision. These are all the building blocks you'll need to work with machine learning models and image data.

Once those dependencies are installed, go ahead and navigate to the examples directory. Then, download the Oxford-IIIT Pet Dataset with these commands:


$ cd examples
$ wget https://www.robots.ox.ac.uk/~vgg/data/pets/data/images.tar.gz
$ tar -xzf images.tar.gz

The dataset contains images of various pet breeds, which will be used to train your ResNet50 model.

Now that you’ve downloaded and extracted the dataset, it’s time to run the computer vision example script with the following command:


$ python ./cv_example.py --data_dir images

This will kick off the training process using the ResNet50 network. The model will start learning to classify images based on the pet breeds in the dataset. Thanks to Accelerate, this whole process is pretty smooth, and you can leverage distributed training, even if you’re working with just a local machine that has GPUs or other hardware setups.
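For a rough idea of what happens internally, building the ResNet50 with timm and handing it to Accelerate looks something like the sketch below. This is an illustration under assumptions (stand-in data instead of the real images), not the exact contents of cv_example.py:

import timm
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

# The Oxford-IIIT Pet dataset has 37 breed classes
model = timm.create_model("resnet50", pretrained=True, num_classes=37)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

# Stand-in data: the real script would load the downloaded pet images instead
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 37, (8,))
loader = DataLoader(TensorDataset(images, labels), batch_size=4)

# Accelerate wraps everything for whatever hardware you configured
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)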

This example is a great illustration of how Accelerate simplifies machine learning workflows. Whether you’re working on image classification or other tasks, it retains the flexibility needed for scaling and optimizing your models across different setups.

For more insights into how Accelerate simplifies computer vision tasks, take a look at the examples in the Accelerate GitHub repository.

Conclusion

In conclusion, Hugging Face's Accelerate library is a game-changer for PyTorch users looking to simplify multi-GPU and multi-CPU setups. By abstracting device management and reducing the complexity of distributed machine learning, Accelerate allows you to scale your models efficiently with minimal code changes. Whether you're working with GPUs or leveraging mixed-precision training, this tool makes it easier to integrate cutting-edge technologies like DeepSpeed. As the demand for powerful deep learning models grows, tools like Accelerate will continue to evolve, helping developers achieve greater performance and flexibility with less effort. For more on how to leverage multi-GPU configurations and optimize your PyTorch workflow, Accelerate is the solution you need.

This website uses cookies so that we can provide you with the best user experience possible. Cookie information is stored in your browser and performs functions such as recognising you when you return to our website and helping our team to understand which sections of the website you find most interesting and useful.