Introduction
Efficiently managing GPU memory is crucial for optimizing performance in PyTorch, especially when working with large models and datasets. By leveraging techniques like data parallelism and model parallelism, you can distribute workloads across multiple GPUs, speeding up training and inference times. Additionally, practices such as using torch.no_grad(), emptying the CUDA cache, and utilizing 16-bit precision help to reduce memory overhead and prevent out-of-memory errors. In this article, we’ll walk you through the best practices for optimizing GPU memory and utilizing multi-GPU setups to boost your PyTorch performance.
What Is Multi-GPU Usage in PyTorch?
This solution focuses on optimizing the use of multiple GPUs in deep learning tasks. It includes methods for distributing workloads across GPUs to speed up training and inference. By using techniques like data parallelism and model parallelism, and automating GPU selection, it helps prevent memory issues and out-of-memory errors. The goal is to make the most of GPU resources to enhance performance and ensure efficient model training.
Moving tensors between the CPU and GPUs
Every tensor in PyTorch has a to() function that allows you to move the tensor to a specific device, like the CPU or a particular GPU. This function accepts a torch.device object as input, and you can initialize it with either of the following options:
cpu for using the CPU, or cuda:0 for placing the tensor on GPU number 0.
By default, when you create a tensor, it starts off on the CPU. But you can easily move it to the GPU by calling the to() function. To check if a GPU is available, you can use torch.cuda.is_available(), which gives you a true/false response based on whether CUDA-enabled GPUs are available.
Here’s an example:
if torch.cuda.is_available():
    dev = "cuda:0"
else:
    dev = "cpu"

device = torch.device(dev)
a = torch.zeros(4, 3)  # Initialize a tensor of zeros
a = a.to(device)       # Move the tensor to the selected device (CPU or GPU)
Alternatively, you can pass a device index directly to the to() function. Note that it's the runtime check above (building a torch.device from torch.cuda.is_available()) that makes your code device-agnostic; hardcoding a GPU index assumes a GPU is present. For instance:
a = a.to(0)  # Move tensor 'a' to GPU 0
cuda() function
Another way to transfer tensors to GPUs is using the cuda(n) function, where n specifies the index of the GPU. If you use cuda() without an argument, it will put the tensor on GPU 0 by default. You can also use the to() and cuda() methods provided by the torch.nn.Module class to move the entire neural network to a specific device. When using these methods on a neural network, you don’t need to assign the returned value; just call the function directly. For example:
clf = myNetwork()
clf.to(torch.device("cuda:0"))  # Move the network to GPU 0
# or
clf = clf.cuda() # Equivalent to the previous line
Automatic selection of GPU
While it's helpful to manually choose which GPU a tensor should go to, operations often involve many tensors, and we want new tensors to be created on the right device automatically to avoid unnecessary transfers between devices, which slow things down. PyTorch gives us a way to automate this. One handy tool is the Tensor.get_device() method. It works on GPU tensors and returns the index of the GPU where the tensor currently resides. You can use it to figure out where a tensor is located and ensure any new tensor is created on the same device. Here's an example:
a = t1.get_device()              # Get the device index of tensor 't1'
b = torch.zeros(t1.shape).to(a)  # Create tensor 'b' on the same device as 't1'
You can also use the cuda(n) function to place tensors directly on a specific GPU. By default, calling cuda() without an argument puts the tensor on the current GPU (GPU 0 to start with), but you can change that default with:
torch.cuda.set_device(0) # Set the default GPU to 0
# or
torch.cuda.set_device(1) # Set the default GPU to 1, or any other number
If an operation involves two tensors on the same device, the resulting tensor will also be placed on that device. But if the tensors are on different devices, you’ll get an error. So, it’s crucial to make sure that all tensors involved in an operation are on the same device before you perform it.
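A minimal sketch of that rule (x and y are illustrative tensors, not from the original text; the snippet falls back to the CPU when no GPU is present):

```python
import torch

# Pick a device once; everything below also works on a CPU-only machine
dev = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

x = torch.ones(3)          # Created on the CPU by default
y = torch.ones(3).to(dev)  # May live on a GPU

# Adding a CPU tensor to a GPU tensor raises a RuntimeError, so move
# both operands to the same device first:
z = x.to(dev) + y
print(z.device.type == dev.type)  # True: the result stays on that device
```

The result of an operation always lives on the device its operands share, so moving inputs once up front avoids both errors and hidden transfers.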
new_* functions
In PyTorch version 1.0, a set of new_* functions was introduced to help create new tensors that share the same data type and device as the tensor they're called on. For example:
ones = torch.ones((2,)).cuda(0)  # Create a tensor of ones of size (2,) on GPU 0
newOnes = ones.new_ones((3, 4))  # New tensor of ones with shape (3, 4), same dtype and device as 'ones'
randTensor = torch.randn(2, 4)   # Created on the CPU by default — plain factory functions do not inherit a device
These functions are great for keeping your tensors device-agnostic, especially when working with multiple GPUs or handling large datasets. There’s a detailed list of new_* functions in the PyTorch documentation, so if you want to dive deeper into the specifics of creating tensors and managing memory across devices, that’s a great resource to check out.
Read more about managing GPU memory and tensor placement in the PyTorch CUDA documentation.
cuda() function
So, if you want to move tensors to GPUs in PyTorch, one easy way is by using the cuda(n) function. Here, n is the index of the GPU you want to move your tensor to. If you don’t provide an argument to cuda(), it’ll just default to GPU 0. This is super helpful if you have more than one GPU available for processing. It ensures that your tensor lands on the right GPU automatically.
Now, PyTorch doesn’t stop there. It also gives you the to() and cuda() methods, which you can use within the torch.nn.Module class to move your whole neural network (or model) to a specific device, like a GPU. The cool thing about the to() method is that when you use it on an nn.Module object, you don’t have to assign the returned value back to the object, because the method changes the model in place.
Let’s say you want to move your model, myNetwork(), to GPU 0. You’d do it like this:
clf = myNetwork()
clf.to(torch.device("cuda:0"))  # Move the model to GPU 0
Or you could use the cuda() method instead, which is basically the same thing:
clf = clf.cuda() # Equivalent to the previous line
This whole approach is great because it makes handling your model’s device placement super easy. You don’t have to manually move each tensor around when you’re dealing with big models or when you’re shifting the whole network to a GPU for training or inference. It just simplifies everything!
Read more about managing tensor operations across multiple GPUs and using the cuda() function in the PyTorch CUDA documentation.
Automatic selection of GPU
So, here’s the thing: when you’re working with PyTorch, picking which GPU a tensor goes to can give you a lot of control and help you optimize your setup. But, if you’re dealing with large models or datasets, manually choosing which GPU to assign each tensor can get pretty exhausting and, honestly, not the most efficient way to go about it. That’s when it’s much better to let PyTorch handle things automatically for you. It makes sure your tensors are placed on the right device without you having to micromanage them, which means less work for you and a smoother process overall.
You see, PyTorch has some built-in functionality to help with device placement. A super useful tool for this is the Tensor.get_device() method, which works on GPU tensors. It returns the index of the GPU where the tensor is located, so you can not only figure out where a tensor is, but also place any new tensors on the right device without doing it manually.
Let’s look at an example to make this clearer:
# Ensuring tensor b is on the same device as t1
a = t1.get_device()              # Get the device index of t1
b = torch.zeros(t1.shape).to(a)  # Create tensor b on the same device as t1
Here, what’s happening is that a = t1.get_device() grabs the device index of tensor t1, and then we create a new tensor b on the same device by using the .to() method. This means no more worrying about moving tensors around manually—PyTorch does the heavy lifting for you.
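A related shortcut: most factory functions accept a device= argument, and the *_like helpers copy the device automatically. A minimal sketch (t1 here is just a stand-in tensor, and everything stays on the CPU when no GPU is present):

```python
import torch

dev = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
t1 = torch.randn(2, 3, device=dev)  # Create directly on the target device

b = torch.zeros(t1.shape, device=t1.device)  # Same device as t1, no transfer
c = torch.zeros_like(t1)                     # Same shape, dtype, and device as t1

print(b.device == t1.device and c.device == t1.device)  # True
```

Creating tensors on the right device up front is cheaper than creating them on the CPU and moving them afterwards.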
Another option you've got is torch.cuda.set_device(n), which changes the current default GPU so that calls like cuda() with no argument place tensors there. For example:
torch.cuda.set_device(0) # Set the current device to GPU 0
# or alternatively
torch.cuda.set_device(1) # Set the current device to GPU 1
The cool thing here is that if you perform an operation between two tensors on the same device, the resulting tensor will also end up on that same device. But—just a heads up—if the tensors are on different devices, you’ll get an error. PyTorch needs the tensors to be on the same device to operate correctly.
All of this is pretty handy, right? It makes memory management easier and keeps things running smoothly, especially in multi-GPU setups. Plus, it helps you avoid the hassle of manually managing devices, making sure everything stays where it’s supposed to and avoiding unnecessary data transfers between devices.
For more information on efficiently managing GPU usage and automatic device selection, check out the PyTorch CUDA documentation.
new_* functions
In PyTorch, the new_* functions, introduced in version 1.0, are super handy when you need to create new tensors based on another tensor's properties, like its data type and the device it's placed on. They make it easy to keep new tensors consistent with an existing tensor's dtype and device, which helps ensure correctness in tensor operations that span different devices.
Let’s take the new_ones() function as an example. This function creates a new tensor, filled with ones, while keeping the same data type and device as the tensor it’s called on. This is especially useful when you need to create tensors that should be compatible with others in terms of shape, device, and type. Here’s how you can use it:
ones = torch.ones((2,)).cuda(0)  # Create a tensor of ones of size (2,) on GPU 0
newOnes = ones.new_ones((3, 4))  # Create a new tensor of ones of size (3, 4) on the same device as 'ones'
In this example, ones is a tensor of ones created on GPU 0. Then, by using new_ones(), we create newOnes, which is a new tensor of ones with a size of (3,4), and it lives on the same GPU (GPU 0) as the original ones tensor.
PyTorch also has other new_* functions like new_zeros(), new_full(), and new_empty(). These allow you to create tensors filled with zeros, a specific value, or uninitialized values—while making sure they’re placed on the same device as the tensor they’re based on. These functions are especially helpful in multi-device setups and when your tensors are involved in complex operations that need them to be on the same device.
For example:
randTensor = torch.randn(2, 4)  # Create a random tensor of size (2, 4) — on the CPU by default, since randn is not a new_* function
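The other new_* helpers mentioned above follow the same rule; a minimal sketch, which runs on the CPU when no GPU is present:

```python
import torch

base = torch.ones((2,), dtype=torch.float64)
if torch.cuda.is_available():
    base = base.cuda(0)  # Use GPU 0 when one is available

zeros = base.new_zeros((3, 4))       # Zeros, same dtype and device as 'base'
sevens = base.new_full((2, 2), 7.0)  # Filled with 7.0, same dtype and device
scratch = base.new_empty((5,))       # Uninitialized, same dtype and device

print(zeros.dtype, zeros.device)  # Matches 'base': torch.float64 on base's device
```

Note that new_empty() returns uninitialized memory, so its values are arbitrary until you write to them.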
These new_* functions are pretty powerful when it comes to avoiding mistakes in device placement and ensuring that your new tensors share the same properties as the original tensor. And if you want to dig deeper, there’s a detailed list of all the new_* functions in the PyTorch documentation.
For more details on efficient tensor management and initialization in PyTorch, visit the PyTorch Tensor Documentation.
Using Multiple GPUs
When you’re working with large models or datasets in PyTorch, using multiple GPUs can really speed things up. There are two main ways to use multiple GPUs: Data Parallelism and Model Parallelism.
Data Parallelism
Data Parallelism is probably the most common way to split up work across multiple GPUs in PyTorch. Basically, this method takes a big batch of data and splits it into smaller mini-batches, which are then processed at the same time on different GPUs. After each GPU works on its chunk, the results are gathered together and combined on one device—usually the device that originally held the data.
In PyTorch, you can implement Data Parallelism using the nn.DataParallel class. This class helps to manage splitting the data and processing it on multiple GPUs while keeping everything synced up. Here’s how you might use it:
parallel_net = nn.DataParallel(myNet, device_ids=[0, 1, 2])
predictions = parallel_net(inputs) # Forward pass on multi-GPUs
loss = loss_function(predictions, labels) # Compute the loss
loss.mean().backward() # Average GPU losses + backward pass
optimizer.step() # Update the model
In this example, myNet is the neural network you're working with, and device_ids=[0, 1, 2] means the model will be replicated across GPUs 0, 1, and 2. After the forward pass, the predictions are computed in parallel on these GPUs, and the loss is calculated and propagated back through the network.
But here’s the thing: Even though the data is split across multiple GPUs, it still needs to be loaded onto a single GPU to start with. You also need to make sure the DataParallel object is on that same GPU. Here’s how to handle that:
input = input.to(0) # Move the input tensor to GPU 0
parallel_net = parallel_net.to(0) # Make sure the DataParallel object is on GPU 0
This way, both the model and the data are on the same GPU for the initial processing. Essentially, the nn.DataParallel class works by breaking the input data into smaller chunks, copying the neural network to the available GPUs, doing the forward pass, and then collecting the results back on the original GPU.
Now, one challenge with Data Parallelism is that it can lead to one GPU doing more work than the others, which isn’t ideal. To fix this, you can do a couple of things. First, you could calculate the loss during the forward pass. This way, the loss calculation is parallelized too. Another option is to implement a parallel loss function layer to optimize how the workload is split. Implementing this parallel loss function layer might be a bit tricky, but it could help if you’re really looking to squeeze out more performance.
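The first trick can be sketched as a small wrapper module; NetWithLoss and the linear stand-in model are illustrative names, not part of the original code, and the snippet falls back to a single device when multiple GPUs aren't available:

```python
import torch
import torch.nn as nn

class NetWithLoss(nn.Module):
    """Hypothetical wrapper: computes the loss inside forward so that,
    under nn.DataParallel, each GPU evaluates the loss for its own chunk."""
    def __init__(self, net):
        super().__init__()
        self.net = net
        self.criterion = nn.CrossEntropyLoss(reduction="none")

    def forward(self, inputs, labels):
        outputs = self.net(inputs)
        return self.criterion(outputs, labels)  # Per-sample losses, gathered later

net = nn.Linear(10, 3)                  # Stand-in for a real model
wrapped = NetWithLoss(net)
if torch.cuda.is_available() and torch.cuda.device_count() > 1:
    wrapped = nn.DataParallel(wrapped)  # Loss is now computed per GPU

inputs = torch.randn(8, 10)
labels = torch.randint(0, 3, (8,))
loss = wrapped(inputs, labels).mean()   # Average the gathered per-sample losses
loss.backward()
```

Because forward returns per-sample losses, the main GPU gathers one small loss tensor per device instead of full prediction tensors, which both balances the work and trims transfer overhead.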
Model Parallelism
Model Parallelism is another way to split up the workload across multiple GPUs. Unlike Data Parallelism, where the data gets split up and processed at the same time, Model Parallelism divides the model itself into smaller pieces, or subnetworks, and places each one on a different GPU. This approach works great when the model is too big to fit into the memory of a single GPU.
However, there’s a catch. Model Parallelism tends to be slower than Data Parallelism because the subnetworks are dependent on each other. This means each GPU has to wait for data from another GPU, which can slow things down. Still, the big win here is that you can train models that would be too large for just one GPU.
Here’s a diagram showing the basic idea:
[Subnet 1] —> [Subnet 2] (with wait times during forward and backward passes)
So yeah, while Model Parallelism might be a bit slower in terms of processing speed, it’s still a game changer when you need to work with models that are too large to fit on just one GPU.
Model Parallelism with Dependencies
Implementing Model Parallelism in PyTorch isn’t too complicated as long as you remember two important things:
- The input and the network need to be on the same device to avoid unnecessary device transfers.
- PyTorch's to() and cuda() functions support autograd, so gradients can be passed between GPUs during the backward pass.
Here’s an example of how you can set up Model Parallelism in PyTorch with two subnetworks placed on different GPUs:
class model_parallel(nn.Module):
    def __init__(self):
        super().__init__()
        self.sub_network1 = ...
        self.sub_network2 = ...
        self.sub_network1.cuda(0)  # Place the first sub-network on GPU 0
        self.sub_network2.cuda(1)  # Place the second sub-network on GPU 1

    def forward(self, x):
        x = x.cuda(0)             # Move input to GPU 0
        x = self.sub_network1(x)  # Process input through the first sub-network
        x = x.cuda(1)             # Transfer output to GPU 1
        x = self.sub_network2(x)  # Process it through the second sub-network
        return x
In this example, model_parallel defines two subnetworks: sub_network1 and sub_network2. sub_network1 is placed on GPU 0, and sub_network2 is placed on GPU 1. During the forward pass, the input tensor is first moved to GPU 0, where it’s processed by sub_network1. Then, the output is moved to GPU 1, where it’s processed by sub_network2.
Since PyTorch’s autograd system is handling things, the gradients from sub_network2 will automatically be sent back to sub_network1 during the backward pass, making sure the model is trained properly across multiple GPUs. This approach lets you take full advantage of multiple GPUs, even if the model is too big to fit on one.
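To see the whole pattern end to end, here's a hedged, runnable sketch with stand-in linear layers (the layer shapes are illustrative); it falls back to a single CPU when two GPUs aren't available:

```python
import torch
import torch.nn as nn

# Pick two devices: two GPUs when available, otherwise the CPU twice
if torch.cuda.device_count() >= 2:
    dev0, dev1 = torch.device("cuda:0"), torch.device("cuda:1")
else:
    dev0 = dev1 = torch.device("cpu")

class ModelParallel(nn.Module):
    def __init__(self):
        super().__init__()
        self.sub_network1 = nn.Linear(8, 16).to(dev0)  # First half on device 0
        self.sub_network2 = nn.Linear(16, 4).to(dev1)  # Second half on device 1

    def forward(self, x):
        x = self.sub_network1(x.to(dev0))  # Process on device 0
        x = self.sub_network2(x.to(dev1))  # Hand off to device 1
        return x

model = ModelParallel()
out = model(torch.randn(5, 8))
out.sum().backward()  # Autograd routes gradients back across devices
print(model.sub_network1.weight.grad is not None)  # True
```

The only device-specific code is the pair of to() calls; autograd handles the cross-device backward pass on its own.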
To learn more about optimizing multi-GPU workflows in deep learning, check out the PyTorch Distributed Data Parallel (DDP) Tutorial.
Data Parallelism
Data Parallelism in PyTorch is a great way to split up the work when you need to process a ton of data, especially if you’ve got a few GPUs lying around. The idea is to distribute the workload across multiple GPUs, which speeds up the whole process, especially when you’re dealing with big datasets. This technique is all about splitting your data into smaller chunks, running them in parallel across several GPUs, and then merging the results. It’s super handy for making the most of your GPU resources.
To use Data Parallelism in PyTorch, you set it up with the nn.DataParallel class. This class takes care of splitting your data and running the job on multiple GPUs. You just need to pass in your neural network (nn.Module object) and a list of GPU IDs that the data will be split across. Here’s a simple example of how to get it going:
parallel_net = nn.DataParallel(myNet, device_ids=[0, 1, 2])
In this case, myNet is your neural network, and device_ids=[0, 1, 2] tells PyTorch to spread the workload across GPUs 0, 1, and 2. This way, your model can handle bigger batches of data, which speeds up training a lot.
Once you’ve got your DataParallel object set up, you can treat it just like a regular nn.Module object. For example, during the forward pass, you just call it like this:
predictions = parallel_net(inputs) # Forward pass on multi-GPUs
Now, the model is processing input data across the GPUs. After that, you can compute the loss and do the backward pass like you normally would:
loss = loss_function(predictions, labels) # Compute loss function
loss.mean().backward() # Average GPU losses + backward pass
optimizer.step() # Update the model
However, here’s something to keep in mind. Even though your data is split across multiple GPUs, it has to start on a single GPU. You also need to make sure the DataParallel object is on the correct GPU, just like you would with any regular nn.Module. Here’s how you make sure the model and input data are on the same device:
input = input.to(0) # Move the input tensor to GPU 0
parallel_net = parallel_net.to(0) # Ensure the DataParallel object is on GPU 0
This is super important to make sure everything syncs up properly when training. The nn.DataParallel class works by taking your input data, splitting it into smaller batches, making copies of your neural network on all the GPUs, doing the forward pass on each GPU, and then collecting everything back on the original GPU.
Here’s a quick overview of how it all works:
- [Input Data] → [Split into smaller batches] → [Replicate Network on GPUs] → [Forward pass on each GPU] → [Gather results on original GPU]
Now, one issue with Data Parallelism is that it can lead to one GPU doing more work than the others, which can mess with performance. This usually happens because the main GPU is the one collecting the results from all the other GPUs, making it take on more work.
To avoid this, you can use a couple of tricks:
- Compute the loss during the forward pass: This ensures that the loss calculation is parallelized too, so the workload gets distributed a bit more evenly across the GPUs.
- Implement a parallel loss function layer: This would spread the loss computation across the GPUs instead of leaving it all on the main one.
To explore more about leveraging Data Parallelism in deep learning, check out the PyTorch Data Parallelism Tutorial.
Model Parallelism
Model parallelism is a handy trick in deep learning, especially when your neural network is just too big for one GPU to handle. The idea is to split the network into smaller subnetworks and distribute them across multiple GPUs. This way, you can work with massive models that wouldn’t fit into a single GPU’s memory.
But here’s the catch—model parallelism is usually slower than data parallelism. Why? Well, when you break up a single neural network and spread it across GPUs, the GPUs have to communicate with each other. During the forward pass, one subnetwork might have to wait for data from another, and during the backward pass, the gradients need to be shared between GPUs. These dependencies can slow things down because the GPUs aren’t running totally independently like they would in data parallelism. But even with the slowdowns, model parallelism is still a winner when your model is too big to fit into one GPU. It allows you to work with larger models that would otherwise be impossible.
For example, imagine this: Subnet 2 has to wait for the output from Subnet 1 during the forward pass. Then, Subnet 1 has to wait for Subnet 2’s gradients during the backward pass. See how that can slow down the process? But that’s the price you pay for handling bigger models.
Model Parallelism with Dependencies
Implementing model parallelism in PyTorch is pretty straightforward, as long as you remember two key things:
- The input and the network need to be on the same device—this helps avoid unnecessary device transfers.
- PyTorch's to() and cuda() functions support autograd, meaning gradients can be transferred between GPUs during the backward pass, helping backpropagation work across devices.
Now, let’s take a look at how to implement this in code:
class model_parallel(nn.Module):
    def __init__(self):
        super().__init__()
        self.sub_network1 = ...
        self.sub_network2 = ...
        self.sub_network1.cuda(0)  # Move sub-network 1 to GPU 0
        self.sub_network2.cuda(1)  # Move sub-network 2 to GPU 1

    def forward(self, x):
        x = x.cuda(0)             # Move input to GPU 0
        x = self.sub_network1(x)  # Process input through sub-network 1
        x = x.cuda(1)             # Move output of sub-network 1 to GPU 1
        x = self.sub_network2(x)  # Process through sub-network 2
        return x

Here's what's happening:
- In the __init__ method, we assign sub_network1 to GPU 0 and sub_network2 to GPU 1.
- During the forward pass, the input first goes to GPU 0 to be processed by sub_network1. Then, the output moves over to GPU 1, where it's processed by sub_network2.
