update readme
KeitaW committed Mar 12, 2024
1 parent a2981f0 commit 8405a03
Showing 1 changed file with 38 additions and 2 deletions.
3.test_cases/16.pytorch-cpu-ddp/README.md
@@ -1,6 +1,6 @@
# PyTorch DDP on CPU <!-- omit in toc -->

-This test case is intended to provide simplest possible distributed training example on CPU using [PyTorch DDP](https://pytorch.org/tutorials/beginner/ddp_series_theory.html).
+This test case is intended to provide a simple distributed training example on CPU using [PyTorch DDP](https://pytorch.org/tutorials/beginner/ddp_series_theory.html).
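
The training entry point in this test case is `ddp.py`. As a rough, hypothetical sketch only (not the repository's actual script, which may differ), a minimal CPU DDP program launched by `torchrun` could look like this, using the `gloo` backend since no GPUs are involved:

```python
# Hypothetical minimal CPU DDP script; the real ddp.py may differ.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT;
    # gloo is the collective backend available on CPU-only nodes.
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()

    model = DDP(torch.nn.Linear(20, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(50):
        inputs = torch.randn(32, 20)  # synthetic batch of size 32
        labels = torch.randn(32, 1)
        optimizer.zero_grad()
        loss_fn(model(inputs), labels).backward()  # grads all-reduced across ranks
        optimizer.step()
        print(f"[RANK {rank}] Epoch {epoch} | Batchsize: 32")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```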

## 1. Preparation

@@ -20,5 +20,41 @@ Submit DDP training job with:
sbatch 1.train.sbatch
```

-Output of the training job can be found in `logs` directory.
+Output of the training job can be found in the `logs` directory:

```bash
# cat logs/cpu-ddp_xxx.out
Node IP: 10.1.96.108
[2024-03-12 08:22:45,549] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
[2024-03-12 08:22:45,549] torch.distributed.run: [WARNING]
[2024-03-12 08:22:45,549] torch.distributed.run: [WARNING] *****************************************
[2024-03-12 08:22:45,549] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-03-12 08:22:45,549] torch.distributed.run: [WARNING] *****************************************
[2024-03-12 08:22:45,549] torch.distributed.launcher.api: [INFO] Starting elastic_operator with launch configs:
[2024-03-12 08:22:45,549] torch.distributed.launcher.api: [INFO] entrypoint : ddp.py
[2024-03-12 08:22:45,549] torch.distributed.launcher.api: [INFO] min_nodes : 2
[2024-03-12 08:22:45,549] torch.distributed.launcher.api: [INFO] max_nodes : 2
[2024-03-12 08:22:45,549] torch.distributed.launcher.api: [INFO] nproc_per_node : 4
[2024-03-12 08:22:45,549] torch.distributed.launcher.api: [INFO] run_id : 5982
[2024-03-12 08:22:45,549] torch.distributed.launcher.api: [INFO] rdzv_backend : c10d
[2024-03-12 08:22:45,549] torch.distributed.launcher.api: [INFO] rdzv_endpoint : 10.1.96.108:29500
[2024-03-12 08:22:45,549] torch.distributed.launcher.api: [INFO] rdzv_configs : {'timeout': 900}
[2024-03-12 08:22:45,549] torch.distributed.launcher.api: [INFO] max_restarts : 0
[2024-03-12 08:22:45,549] torch.distributed.launcher.api: [INFO] monitor_interval : 5
[2024-03-12 08:22:45,549] torch.distributed.launcher.api: [INFO] log_dir : None
[2024-03-12 08:22:45,549] torch.distributed.launcher.api: [INFO] metrics_cfg : {}
[2024-03-12 08:22:45,549] torch.distributed.launcher.api: [INFO]
[2024-03-12 08:22:45,552] torch.distributed.elastic.agent.server.local_elastic_agent: [INFO] log directory set to: /tmp/torchelastic_9g50nxjq/5982_tflt1tcd
[2024-03-12 08:22:45,552] torch.distributed.elastic.agent.server.api: [INFO] [default] starting workers for entrypoint: python
...
[RANK 3] Epoch 49 | Batchsize: 32 | Steps: 8
[RANK 5] Epoch 49 | Batchsize: 32 | Steps: 8
[RANK 4] Epoch 49 | Batchsize: 32 | Steps: 8
[2024-03-12 08:22:56,574] torch.distributed.elastic.agent.server.api: [INFO] [default] worker group successfully finished. Waiting 300 seconds for other agents to finish.
[2024-03-12 08:22:56,574] torch.distributed.elastic.agent.server.api: [INFO] Local worker group finished (WorkerState.SUCCEEDED). Waiting 300 seconds for other agents to finish
[2024-03-12 08:22:56,575] torch.distributed.elastic.agent.server.api: [INFO] [default] worker group successfully finished. Waiting 300 seconds for other agents to finish.
[2024-03-12 08:22:56,575] torch.distributed.elastic.agent.server.api: [INFO] Local worker group finished (WorkerState.SUCCEEDED). Waiting 300 seconds for other agents to finish
[2024-03-12 08:22:56,575] torch.distributed.elastic.agent.server.api: [INFO] Done waiting for other agents. Elapsed: 0.0010929107666015625 seconds
[2024-03-12 08:22:56,575] torch.distributed.elastic.agent.server.api: [INFO] Done waiting for other agents. Elapsed: 0.0005395412445068359 seconds
```
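
The launch configuration in this log (two nodes, four workers per node, `c10d` rendezvous on port 29500) suggests a `torchrun`-based submission. As a hedged sketch only — the actual `1.train.sbatch` in the repository may differ, and `head_node_ip` here is an assumed helper variable — such a script could look like:

```bash
#!/bin/bash
#SBATCH --job-name=cpu-ddp
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --output=logs/cpu-ddp_%j.out

# Hypothetical sketch consistent with the log above; the real
# 1.train.sbatch may differ. Resolve the first node's IP to use
# as the c10d rendezvous endpoint.
head_node_ip=$(srun --nodes=1 --ntasks=1 hostname --ip-address)
echo "Node IP: ${head_node_ip}"

srun torchrun \
    --nnodes 2 \
    --nproc_per_node 4 \
    --rdzv_id "${SLURM_JOB_ID}" \
    --rdzv_backend c10d \
    --rdzv_endpoint "${head_node_ip}:29500" \
    ddp.py
```

The `c10d` rendezvous backend lets all ranks discover each other through the head node without an external key-value store, which keeps the setup dependency-free on a plain CPU cluster.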
