This project simulates a multi-GPU environment using only a single GPU. By dynamically managing memory with PyTorch and PyTorch Lightning, it lets users experiment with distributed deep learning training methods without needing multiple physical GPUs.
- Objective: To simulate a multi-GPU training environment on a single GPU by dynamically dividing its memory allocation.
- Technologies Used: Python, PyTorch, PyTorch Lightning, NVIDIA's NVML library, TensorBoard.
- Outcome: Users can test distributed training techniques and understand multi-GPU training behaviors without needing multiple GPUs.
- Simulated Multi-GPU Environment: Emulates the behavior of multiple GPUs using only a single GPU, allowing for distributed training simulations.
- Dynamic Memory Management: Manages GPU memory dynamically to simulate the usage patterns of multiple GPUs.
- Real-time Monitoring: Uses NVIDIA's NVML library to monitor and log GPU memory usage in real time.
- Comprehensive Visualization: Provides visualizations of simulated GPU memory usage over time to better understand the distribution of memory load.
- Dynamic Memory Allocation: The project allocates memory dynamically across several simulated GPUs by splitting the memory usage of a single GPU. It uses a combination of Python and NVIDIA's NVML library to manage and monitor these allocations (see the first sketch after this list).
- Training Simulation: The project runs deep learning training jobs that mimic distributed training across the simulated GPUs. The jobs are managed using PyTorch Lightning, which simplifies the model training process and provides a structure for managing different training tasks (see the second sketch below).
- Memory Monitoring and Logging: As training proceeds, the memory usage of each simulated GPU is logged in real time. This data is then visualized to provide insights into the memory distribution and usage patterns (see the third sketch below).
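As a rough illustration of the first item, the sketch below carves a single GPU's memory into equal per-"GPU" buffers. The chunk count, reserved fraction, and function name are illustrative assumptions, not the project's actual code:

```python
import torch

NUM_SIMULATED_GPUS = 4   # hypothetical chunk count, not the project's setting
RESERVED_FRACTION = 0.8  # leave headroom for the CUDA context and fragmentation

def allocate_chunks(device: str = "cuda:0"):
    """Carve the physical GPU's memory into equal simulated-GPU buffers."""
    total = torch.cuda.get_device_properties(device).total_memory
    chunk_bytes = int(total * RESERVED_FRACTION) // NUM_SIMULATED_GPUS
    # One float32 buffer stands in for each simulated GPU's memory footprint.
    return [
        torch.empty(chunk_bytes // 4, dtype=torch.float32, device=device)
        for _ in range(NUM_SIMULATED_GPUS)
    ]
```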
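The training-simulation item might look roughly like the following PyTorch Lightning stub. The tiny model, random data, and loop over simulated "ranks" are placeholders rather than the project's real training code:

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self.layer(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

def run_simulated_jobs(num_simulated_gpus: int = 4):
    data = TensorDataset(torch.randn(256, 32), torch.randint(0, 2, (256,)))
    for rank in range(num_simulated_gpus):
        # Every simulated "rank" actually runs on the same physical GPU.
        trainer = pl.Trainer(max_epochs=1, accelerator="gpu", devices=1,
                             enable_progress_bar=False)
        trainer.fit(TinyModel(), DataLoader(data, batch_size=32))
```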
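Finally, the monitoring item can be sketched with NVIDIA's NVML Python bindings (`pynvml`). The CSV log format, sampling interval, and duration here are assumptions, not necessarily what `monitor_gpu.py` does:

```python
import time
import pynvml

def log_gpu_memory(logfile: str = "gpu_memory_usage.log",
                   interval_s: float = 1.0,
                   duration_s: float = 60.0):
    """Append `timestamp,used_bytes,total_bytes` lines for GPU 0."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # the single physical GPU
    deadline = time.time() + duration_s
    try:
        with open(logfile, "a") as f:
            while time.time() < deadline:
                mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
                f.write(f"{time.time():.1f},{mem.used},{mem.total}\n")
                f.flush()
                time.sleep(interval_s)
    finally:
        pynvml.nvmlShutdown()
```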
- Accessibility: Not everyone has access to a multi-GPU setup. By simulating multiple GPUs on a single device, we can provide a similar experience to those who only have access to one GPU.
- Cost-Effectiveness: Buying and maintaining multiple GPUs can be expensive. This simulation allows users to experiment with distributed learning without the additional hardware cost.
- Educational Purposes: It is a great tool for learning and teaching distributed training methods, helping users understand how multi-GPU setups work without needing the physical hardware.
- Python 3.8 or higher: Ensure you have Python installed. You can download it from Python's official website.
- CUDA Toolkit: A compatible CUDA toolkit for your GPU (version 12.1 used in this project). Download it from NVIDIA's official site.
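You can quickly verify both prerequisites from a terminal (this assumes the CUDA toolkit's `nvcc` is on your PATH):

```bash
python --version   # should report 3.8 or higher
nvcc --version     # should report CUDA 12.1 for this project's pinned wheels
```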
- Clone the repository:

  ```bash
  git clone https://github.com/Mattjesc/Simulating-Multi-GPU-Distributed-Deep-Learning.git
  cd Simulating-Multi-GPU-Distributed-Deep-Learning
  ```
- Create a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  ```
- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```
- Run the main script:

  ```bash
  python Complete_Run.py
  ```

  This will initiate the training process and start simulating GPU memory usage.
- View GPU Memory Usage: After the training completes, the GPU memory usage logs will be saved in `gpu_memory_usage.log`, and a plot of this usage over time will be generated as `gpu_memory_usage.png`.
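For reference, a plot like `gpu_memory_usage.png` can be produced with a few lines of matplotlib. This sketch assumes the `timestamp,used,total` CSV format from the monitoring sketch above, which may differ from the project's actual log layout:

```python
import matplotlib.pyplot as plt

times, used_gb = [], []
with open("gpu_memory_usage.log") as f:
    for line in f:
        t, used, _total = line.strip().split(",")
        times.append(float(t))
        used_gb.append(int(used) / 1e9)  # bytes -> GB

plt.plot([t - times[0] for t in times], used_gb)
plt.xlabel("Elapsed time (s)")
plt.ylabel("GPU memory used (GB)")
plt.title("Simulated multi-GPU memory usage")
plt.savefig("gpu_memory_usage.png")
```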
- `cnn_model.py`: Defines the structure of the Convolutional Neural Network (CNN) used for training.
- `data_module.py`: Manages data loading, transformation, and preparation using PyTorch Lightning's `DataModule` (a sketch of this pattern follows this list).
- `train.py`: Contains functions to train the model on each simulated GPU "chunk".
- `monitor_gpu.py`: Logs GPU memory usage dynamically to simulate multi-GPU training.
- `plot_gpu.py`: Visualizes the GPU memory usage data captured during training.
- `main.py`: The main script that coordinates training and monitoring activities.
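As a sketch of the `DataModule` pattern that `data_module.py` uses, the example below wires a torchvision dataset into PyTorch Lightning. CIFAR-10, the transform, and the batch size are illustrative assumptions, not necessarily the project's choices:

```python
import pytorch_lightning as pl
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

class ImageDataModule(pl.LightningDataModule):
    def __init__(self, batch_size: int = 64):
        super().__init__()
        self.batch_size = batch_size
        self.transform = transforms.ToTensor()

    def prepare_data(self):
        # Download once, on a single process.
        datasets.CIFAR10("data", train=True, download=True)

    def setup(self, stage=None):
        self.train_set = datasets.CIFAR10("data", train=True,
                                          transform=self.transform)

    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size=self.batch_size,
                          shuffle=True, num_workers=2)
```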
- Operating System: The project should work on any OS that supports Python and CUDA. In practice that means Windows or Linux; recent CUDA releases, including 12.1, no longer support macOS.
- CUDA Version: This project is configured for CUDA version 12.1 (`+cu121`). If using a different version, update the `torch` and `torchvision` library versions in the `requirements.txt` file accordingly. Refer to the PyTorch Get Started page for compatibility details; the example command below shows the typical cu121 install pattern.
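For example, at the time of writing, PyTorch's Get Started page documents installing CUDA 12.1 builds with an index URL like this (verify the exact command and current versions there):

```bash
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
```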
- Simulated Environment: This project simulates multiple GPUs using only a single GPU by dynamically managing memory. It does not achieve the actual performance of a multi-GPU setup.
- Variability in Results: Performance and results can vary based on system configuration, GPU model, available memory, system load, and other running processes.
- External Factors: Factors such as GPU thermal throttling, driver versions, and system-level power management can affect the simulation and its results.