# How to train an LLM with multiple GPUs?

## Basic Premise

Hugging Face provides accelerate, a library that allows you to run PyTorch code on any distributed configuration.

You can set up accelerate and configure your GPUs from the command line:

```bash
$ accelerate config
```

The next main feature of Accelerate is the Accelerator class, which adapts your PyTorch code to run on different distributed setups.

You only need to add a few lines of code to your training script to enable it to run on multiple GPUs or TPUs:

```diff
+ from accelerate import Accelerator
+ accelerator = Accelerator()

+ device = accelerator.device
+ model, optimizer, training_dataloader, scheduler = accelerator.prepare(
+     model, optimizer, training_dataloader, scheduler
+ )

  for batch in training_dataloader:
      optimizer.zero_grad()
      inputs, targets = batch
-     inputs = inputs.to(device)
-     targets = targets.to(device)
      outputs = model(inputs)
      loss = loss_function(outputs, targets)
-     loss.backward()
+     accelerator.backward(loss)
      optimizer.step()
      scheduler.step()
```
1. Import and instantiate the Accelerator class at the beginning of your training script. The Accelerator class initializes everything necessary for distributed training, and it automatically detects your training environment (a single machine with a GPU, a machine with several GPUs, several machines with multiple GPUs, a TPU, etc.) based on how the code was launched.

   ```python
   from accelerate import Accelerator

   accelerator = Accelerator()
   ```

2. Remove calls like .cuda() on your model and input data. The Accelerator class automatically places these objects on the appropriate device for you.

   ```python
   device = accelerator.device
   ```

3. Pass all relevant PyTorch objects for training (optimizer, model, dataloader(s), learning rate scheduler) to the prepare() method as soon as they're created. This method wraps the model in a container optimized for your distributed setup, uses Accelerate's version of the optimizer and scheduler, and creates a sharded version of your dataloader for distribution across GPUs or TPUs.

   ```python
   model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
       model, optimizer, train_dataloader, lr_scheduler
   )
   ```

4. Replace loss.backward() with accelerator.backward(loss) to use the correct backward() method for your training setup. A complete sketch combining these four steps follows this list.

   ```python
   accelerator.backward(loss)
   ```
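To make the steps concrete, here is a minimal, self-contained sketch of an Accelerate training loop. The tiny model, the random tensors standing in for a tokenized dataset, the main() wrapper, and all hyperparameters are placeholders for illustration only, not part of the original recipe; substitute your own LLM, data, and loss.

```python
# Minimal Accelerate training loop sketch. The toy model, random data,
# and hyperparameters below are illustrative placeholders.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator


def main():
    accelerator = Accelerator()

    # Placeholder data: swap in your tokenized dataset.
    features = torch.randn(256, 128)
    labels = torch.randint(0, 10, (256,))
    train_dataloader = DataLoader(TensorDataset(features, labels), batch_size=32)

    # Placeholder model: swap in your LLM.
    model = torch.nn.Sequential(
        torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10)
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100)
    loss_function = torch.nn.CrossEntropyLoss()

    # Step 3: let Accelerate move everything to the right device(s).
    model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
        model, optimizer, train_dataloader, lr_scheduler
    )

    model.train()
    for epoch in range(3):
        for batch in train_dataloader:
            optimizer.zero_grad()
            inputs, targets = batch              # already on the correct device
            outputs = model(inputs)
            loss = loss_function(outputs, targets)
            accelerator.backward(loss)           # step 4: replaces loss.backward()
            optimizer.step()
            lr_scheduler.step()


if __name__ == "__main__":
    main()
```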

## Sharding training across multiple GPUs

As long as you have followed the steps above and kept everything inside a main() function, you can now shard your training across multiple GPUs.

Specify your configuration with accelerate config, then launch your training script with accelerate launch.
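For example, assuming the sketch above is saved as train.py (a hypothetical filename) on a machine with two GPUs, the workflow might look like the following; --num_processes can be omitted if it was already set during accelerate config:

```bash
$ accelerate config                              # answer the prompts once per machine
$ accelerate launch --num_processes 2 train.py
```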

## Sharding the model itself across multiple GPUs

To make this work, we need to enable Fully Sharded Data Parallel (FSDP), which is implemented in PyTorch and exposed through Accelerate.

Set up FSDP as follows:

```python
from accelerate import Accelerator, FullyShardedDataParallelPlugin
from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig

# Gather full (unflattened) model and optimizer state dicts on every rank when saving.
fsdp_plugin = FullyShardedDataParallelPlugin(
    state_dict_config=FullStateDictConfig(offload_to_cpu=False, rank0_only=False),
    optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=False, rank0_only=False),
)

accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
```

And run with Accelerate.
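As a sketch of the launch step, assuming again that the training script is saved as train.py: selecting FSDP when running accelerate config is enough for accelerate launch to pick it up, or you can keep the explicit FullyShardedDataParallelPlugin shown above.

```bash
$ accelerate config          # select FSDP as the distributed type
$ accelerate launch train.py
```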

## See Also

## References

This project uses code from the following source: