
[Bug]: TP with external_launcher is not working with vLLM version 0.8.0 and above #15895

Closed
toslali-ibm opened this issue Apr 1, 2025 · 7 comments · Fixed by #15980
Labels: bug (Something isn't working)

Comments

toslali-ibm commented Apr 1, 2025

Your current environment

The output of `examples/offline_inference/torchrun_example.py`
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 59.10it/s, est. speed input: 384.52 toks/s, output: 946.41 toks/s]
Prompt: 'Hello, my name is', Generated text: ' Vicky and I am living in the Philippines and I am a mother of 2'
Prompt: 'The president of the United States is', Generated text: " pressuring, so I wouldn't worry too much.\nThe president of the United"
Prompt: 'The capital of France is', Generated text: ' destroying itself.\n French people could care less about its name than about the future'
Prompt: 'The future of AI is', Generated text: " far from bright.  It's going to go much, much faster than us"
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 51.83it/s, est. speed input: 337.24 toks/s, output: 830.03 toks/s]
Prompt: 'Hello, my name is', Generated text: " Magicky, I'm an in Germany city. have am about big, three"
Prompt: 'The president of the United States is', Generated text: " calling, but it'm't be too much about\nIs President of the United"
Prompt: 'The capital of France is', Generated text: ' notorious itself.  battles Prime are have less about it great, the a French'
Prompt: 'The future of AI is', Generated text: ' in from clear, Scientists If is already to be down more much faster than we'

🐛 Describe the bug

When I run the script with `torchrun --nproc-per-node=2 torchrun_example.py`, the ranks produce different outputs (vLLM == 0.8.0 and onward). When I try it with 0.7.3, it works.
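For context, here is a minimal sketch of the kind of script being run (paraphrased, not the exact upstream `examples/offline_inference/torchrun_example.py`; the model name and sampling parameters are illustrative). Each torchrun rank builds its own engine with the external_launcher backend and is expected to produce identical generations.

```python
# Minimal sketch of a torchrun + external_launcher setup; paraphrased, not the
# exact upstream example. Model name and sampling parameters are illustrative.
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Each rank constructs the same engine; torchrun provides the process group,
# and vLLM's external_launcher backend reuses it instead of spawning workers.
llm = LLM(
    model="facebook/opt-125m",
    tensor_parallel_size=2,
    distributed_executor_backend="external_launcher",
)

outputs = llm.generate(prompts, SamplingParams(temperature=0.8, top_p=0.95))
for output in outputs:
    # Every rank prints its own generations; the bug is that they differ.
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")
```

Launched with `torchrun --nproc-per-node=2 torchrun_example.py`, both ranks should print the same text.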

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

CC @youkaichao

@youkaichao (Member)

That seems to be true; I can reproduce it on the main branch.

This also produces different results:

VLLM_USE_FLASHINFER_SAMPLER=0 VLLM_ENABLE_V1_MULTIPROCESSING=0 torchrun --nproc-per-node=2 torchrun_example.py

VLLM_USE_V1=0 torchrun --nproc-per-node=2 torchrun_example.py also produces different results.

@toslali-ibm can you try to bisect to find which commit is responsible? Following https://blog.vllm.ai/2025/01/10/dev-experience.html, you can find wheels for all commits.

@toslali-ibm (Author)

> That seems to be true; I can reproduce it on the main branch.
>
> This also produces different results:
>
> VLLM_USE_FLASHINFER_SAMPLER=0 VLLM_ENABLE_V1_MULTIPROCESSING=0 torchrun --nproc-per-node=2 torchrun_example.py
>
> VLLM_USE_V1=0 torchrun --nproc-per-node=2 torchrun_example.py also produces different results.
>
> @toslali-ibm can you try to bisect to find which commit is responsible? Following https://blog.vllm.ai/2025/01/10/dev-experience.html, you can find wheels for all commits.

I am able to get identical generations with vLLM 0.7.3. I will try the wheels to identify which commit broke this behavior.

@toslali-ibm (Author) commented Apr 2, 2025

I think I found the breaking commit.

`pip uninstall vllm -y; pip install https://wheels.vllm.ai/05fb6718f0d80519fcb8011ccd841fc7f37db3c1/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl; torchrun --nproc-per-node=2 torchrun_example.py` gives identical generations.

vs.

`pip uninstall vllm -y; pip install https://wheels.vllm.ai/cc10281498fc2a6eb804274dcf22e6cb766f7aa7/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl; torchrun --nproc-per-node=2 torchrun_example.py` gives different generations.

CC @youkaichao

@youkaichao (Member)

But that commit sets a fixed random seed, right? Why would that produce different results? 🤔

@WoosukKwon (Collaborator)

@toslali-ibm @youkaichao #14274 sets the seed to None, which means there is no global seed.
One needs to explicitly set a seed for determinism.
I got the same results from the workers when the seed is set.

Prompt: 'Hello, my name is', Generated text: ' Joel, my dad is my friend and we are in a relationship. I am'
Prompt: 'The president of the United States is', Generated text: ' speaking out against the release of some State Department documents which show the Russians were involved'
Prompt: 'The capital of France is', Generated text: ' known as the “Proud French capital”. What is this city'
Prompt: 'The future of AI is', Generated text: ' literally in danger of being taken by any other company.\nAgreed. '
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 37.84it/s, est. speed input: 246.06 toks/s, output: 605.66 toks/s]
Prompt: 'Hello, my name is', Generated text: ' Joel, my dad is my friend and we are in a relationship. I am'
Prompt: 'The president of the United States is', Generated text: ' speaking out against the release of some State Department documents which show the Russians were involved'
Prompt: 'The capital of France is', Generated text: ' known as the “Proud French capital”. What is this city'
Prompt: 'The future of AI is', Generated text: ' literally in danger of being taken by any other company.\nAgreed. '
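For reference, a minimal sketch of the explicit-seed change described in the comment above; the model name and seed value are illustrative, not taken from the issue:

```python
# Sketch: pass an explicit seed so that every torchrun rank samples identically.
from vllm import LLM

llm = LLM(
    model="facebook/opt-125m",
    tensor_parallel_size=2,
    distributed_executor_backend="external_launcher",
    seed=0,  # without this, vLLM >= 0.8.0 leaves the global seed unset
)
```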

@WoosukKwon (Collaborator)

@toslali-ibm @youkaichao Please see https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/reproduciblity.py
I think we should update torchrun_example.py and check whether the seed is set when initializing ParallelConfig.
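A hypothetical sketch of the kind of check suggested here; this is not the patch that closed the issue, and the helper name and call site are assumptions:

```python
# Hypothetical validation helper: warn when the external launcher is used
# without an explicit seed, since sampling would then diverge across ranks.
import warnings
from typing import Optional

def warn_if_unseeded_external_launcher(distributed_executor_backend: Optional[str],
                                       seed: Optional[int]) -> None:
    if distributed_executor_backend == "external_launcher" and seed is None:
        warnings.warn(
            "external_launcher is used with seed=None; each rank will sample "
            "independently and generations will differ across ranks. Set an "
            "explicit seed to make all ranks produce identical outputs."
        )
```

This mirrors the behaviour change from #14274: with no global seed, each process ends up seeding itself independently.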

@youkaichao (Member)

@WoosukKwon thanks for the reminder! Indeed, I can see that cc10281 adds a seed to tests/distributed/test_torchrun_example.py, but not to torchrun_example.py.
