This repository provides a comprehensive guide and practical examples for training deep learning models using PyTorch across various parallelism strategies. Whether you are working on single-GPU training or scaling to multi-GPU setups with Distributed Data Parallel (DDP) or Fully Sharded Data Parallel (FSDP), these examples will guide you through the process.
- Foundational concepts of deep learning and PyTorch.
- HPC Environment Setup:
- Using SLURM for job scheduling: Submitting and managing training jobs.
- Loading necessary modules: Configuring PyTorch and CUDA on an HPC cluster (a sample job script covering both steps is sketched below).
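
For reference, a minimal SLURM batch script combining job submission and module loading might look like the sketch below. The partition, module names, environment path, and script name are placeholders; the exact values depend on your cluster.

```bash
#!/bin/bash
#SBATCH --job-name=pytorch-train      # job name shown in the queue
#SBATCH --partition=gpu               # placeholder: your cluster's GPU partition
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1                  # request one GPU
#SBATCH --time=01:00:00
#SBATCH --output=train_%j.log         # %j expands to the SLURM job ID

# Load the toolchain; exact module names vary between clusters.
module purge
module load cuda
module load python

# Activate a Python environment with PyTorch installed (placeholder path).
source ~/envs/pytorch/bin/activate

# Launch the training script (placeholder name).
python train.py
```

Submit it with `sbatch` and monitor it with `squeue`/`scancel` as usual.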
- Efficiently training models on a single GPU.
- Optimizations:
- DALI: Efficient data loading using NVIDIA Data Loading Library.
- AMP: Automatic Mixed Precision for faster training with reduced memory consumption (see the sketch below).
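
As a quick illustration of the AMP pattern, the sketch below wraps a toy training step in `torch.cuda.amp.autocast` and `GradScaler` (recent PyTorch releases expose the same API under `torch.amp`). The model, data, and hyperparameters are placeholders, not the repository's actual examples.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy model, optimizer, and loss -- placeholders for a real training setup.
model = nn.Linear(1024, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

# GradScaler rescales the loss so FP16 gradients do not underflow.
scaler = torch.cuda.amp.GradScaler(enabled=device.type == "cuda")

for step in range(10):
    inputs = torch.randn(32, 1024, device=device)
    targets = torch.randint(0, 10, (32,), device=device)

    optimizer.zero_grad()

    # autocast runs eligible ops in half precision on the GPU.
    with torch.cuda.amp.autocast(enabled=device.type == "cuda"):
        outputs = model(inputs)
        loss = criterion(outputs, targets)

    # Scale the loss, backpropagate, then step and update the scaler.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```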
- Scaling models across multiple GPUs using `torch.nn.DataParallel` (see the sketch after this list).
- Key Considerations:
- Understanding inter-GPU communication overhead.
- Differences between DP and DDP, and why DDP generally performs better.
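
A minimal sketch of the `DataParallel` approach is shown below; the model and input are placeholders. The wrapper replicates the model on every visible GPU, scatters each batch along dimension 0, and gathers the outputs on GPU 0, which is where much of the inter-GPU communication overhead comes from.

```python
import torch
import torch.nn as nn

# Placeholder model; in practice this would be a real network.
model = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 10))

if torch.cuda.device_count() > 1:
    # DataParallel replicates the model on every visible GPU, splits each
    # batch along dim 0, and gathers the outputs back on GPU 0.
    model = nn.DataParallel(model)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

inputs = torch.randn(64, 1024, device=device)   # batch is scattered across GPUs
outputs = model(inputs)                          # outputs gathered on GPU 0
print(outputs.shape)
```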
- Leveraging `torch.nn.parallel.DistributedDataParallel` for efficient multi-GPU training (see the sketch after this list).
- Setting up process groups and distributed samplers.
- Advantages of DDP Over DP:
- Lower communication overhead.
- Better scalability across multiple nodes.
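
The sketch below puts these pieces together: it initializes a process group, shards the data with `DistributedSampler`, and wraps the model in DDP. It assumes the script is launched with `torchrun` (or an equivalent launcher) so that `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` are set; the dataset, model, and hyperparameters are placeholders.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    # torchrun (or srun with the right env vars) sets RANK, LOCAL_RANK, WORLD_SIZE.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder dataset and model.
    dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(dataset)        # each rank gets a distinct shard
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    model = nn.Linear(32, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])  # gradients all-reduced across ranks

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)                 # reshuffle shards each epoch
        for inputs, targets in loader:
            inputs = inputs.cuda(local_rank)
            targets = targets.cuda(local_rank)
            optimizer.zero_grad()
            loss = criterion(model(inputs), targets)
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

On a single node this could be launched with, for example, `torchrun --nproc_per_node=4 ddp_train.py` (the script name is a placeholder).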
- Memory-efficient training of large models using Fully Sharded Data Parallel (FSDP); see the sketch below.
- Fine-tuning large-scale models like CodeLlama.
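
A minimal FSDP sketch is shown below; it assumes the same `torchrun`-style launch as the DDP example, and the model is a placeholder. A real large-model run, such as fine-tuning CodeLlama, would typically add an auto-wrap policy, mixed precision, and activation checkpointing on top of this pattern.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def main():
    # Same launcher assumptions as the DDP example: torchrun sets the env vars.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model. FSDP shards its parameters, gradients, and optimizer
    # state across ranks instead of keeping a full replica on every GPU.
    model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))
    model = FSDP(model.cuda(local_rank))

    # Create the optimizer after wrapping so it sees the sharded parameters.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    inputs = torch.randn(8, 1024, device=f"cuda:{local_rank}")
    loss = model(inputs).sum()
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```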
- Running PyTorch training using NVIDIA Enroot and NGC Containers on HPC.
- Topics Covered:
- Importing and running NGC PyTorch containers with Enroot (an example workflow is sketched after this list).
- Running single and multi-GPU PyTorch workloads inside containers.
- Using SLURM to launch containerized PyTorch jobs on GPU clusters.
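
As a rough sketch of that workflow, the commands below import an NGC PyTorch image with Enroot and run a training script inside it. The image tag, container name, and mount paths are placeholders, and some clusters expose containers through the Pyxis SLURM plugin instead of direct `enroot` calls.

```bash
# Import an NGC PyTorch image into a local Enroot squashfs file
# (the tag 24.01-py3 is a placeholder -- pick a current one from NGC).
enroot import docker://nvcr.io#nvidia/pytorch:24.01-py3

# Create a container filesystem from the imported image.
enroot create --name pytorch nvidia+pytorch+24.01-py3.sqsh

# Start the container writable, mounting the current directory so the
# training code is visible inside, and run the (placeholder) script.
enroot start --rw --mount "$PWD:/workspace/project" pytorch \
    python /workspace/project/train.py

# On clusters with the Pyxis plugin, the image can be used directly from
# SLURM instead (flag support depends on the site's installation), e.g.:
# srun --gres=gpu:1 --container-image=nvcr.io#nvidia/pytorch:24.01-py3 python train.py
```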
- PyTorch Documentation
- PyTorch with Examples
- Data Parallel
- Distributed Data Parallel (DDP)
- Hugging Face Transformers
- PyTorch FSDP Tutorial
- NVIDIA Enroot
- NGC PyTorch Containers
- If you are already familiar with deep learning in PyTorch and with HPC environments, you can skip 01. Introduction to Deep Learning and go directly to 02. Single-GPU Training.