
[Bug]: TP with external_launcher is not working with vLLM version 0.8.0 and above #15895

Closed
toslali-ibm opened this issue Apr 1, 2025 · 7 comments · Fixed by #15980
Labels: bug (Something isn't working)

Comments

toslali-ibm commented Apr 1, 2025

Your current environment

The output of `examples/offline_inference/torchrun_example.py`
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 59.10it/s, est. speed input: 384.52 toks/s, output: 946.41 toks/s]
Prompt: 'Hello, my name is', Generated text: ' Vicky and I am living in the Philippines and I am a mother of 2'
Prompt: 'The president of the United States is', Generated text: " pressuring, so I wouldn't worry too much.\nThe president of the United"
Prompt: 'The capital of France is', Generated text: ' destroying itself.\n French people could care less about its name than about the future'
Prompt: 'The future of AI is', Generated text: " far from bright.  It's going to go much, much faster than us"
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 51.83it/s, est. speed input: 337.24 toks/s, output: 830.03 toks/s]
Prompt: 'Hello, my name is', Generated text: " Magicky, I'm an in Germany city. have am about big, three"
Prompt: 'The president of the United States is', Generated text: " calling, but it'm't be too much about\nIs President of the United"
Prompt: 'The capital of France is', Generated text: ' notorious itself.  battles Prime are have less about it great, the a French'
Prompt: 'The future of AI is', Generated text: ' in from clear, Scientists If is already to be down more much faster than we'

🐛 Describe the bug

When I run the script with `torchrun --nproc-per-node=2 torchrun_example.py`, the ranks produce different outputs (vLLM == 0.8.0 and onward). When I try it with 0.7.3, it works.
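For context, here is a minimal sketch of the kind of script being run (paraphrased, not the exact upstream `examples/offline_inference/torchrun_example.py`; the model name and sampling parameters are illustrative). Each torchrun rank builds its own engine with the external_launcher backend and is expected to produce identical generations.

```python
# Minimal sketch of a torchrun + external_launcher setup; paraphrased, not the
# exact upstream example. Model name and sampling parameters are illustrative.
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Each rank constructs the same engine; torchrun provides the process group,
# and vLLM's external_launcher backend reuses it instead of spawning workers.
llm = LLM(
    model="facebook/opt-125m",
    tensor_parallel_size=2,
    distributed_executor_backend="external_launcher",
)

outputs = llm.generate(prompts, SamplingParams(temperature=0.8, top_p=0.95))
for output in outputs:
    # Every rank prints its own generations; the bug is that they differ.
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")
```

Launched with `torchrun --nproc-per-node=2 torchrun_example.py`, both ranks should print the same text.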

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

CC @youkaichao

@youkaichao (Member)

That seems to be true; I can reproduce it on the main branch.

This also produces different results:

VLLM_USE_FLASHINFER_SAMPLER=0 VLLM_ENABLE_V1_MULTIPROCESSING=0 torchrun --nproc-per-node=2 torchrun_example.py

VLLM_USE_V1=0 torchrun --nproc-per-node=2 torchrun_example.py also produces different results.

@toslali-ibm can you try to bisect to find which commit is responsible? Following https://blog.vllm.ai/2025/01/10/dev-experience.html, you can find wheels for all commits.

@toslali-ibm (Author)

> That seems to be true; I can reproduce it on the main branch.
>
> This also produces different results:
>
> VLLM_USE_FLASHINFER_SAMPLER=0 VLLM_ENABLE_V1_MULTIPROCESSING=0 torchrun --nproc-per-node=2 torchrun_example.py
>
> VLLM_USE_V1=0 torchrun --nproc-per-node=2 torchrun_example.py also produces different results.
>
> @toslali-ibm can you try to bisect to find which commit is responsible? Following https://blog.vllm.ai/2025/01/10/dev-experience.html, you can find wheels for all commits.

I am able to get identical generations with vLLM 0.7.3. I will try the wheels to identify which commit broke this behavior.

@toslali-ibm (Author) commented Apr 2, 2025

I think I found the breaking commit.

`pip uninstall vllm -y; pip install https://wheels.vllm.ai/05fb6718f0d80519fcb8011ccd841fc7f37db3c1/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl; torchrun --nproc-per-node=2 torchrun_example.py` gives identical generations.

vs.

`pip uninstall vllm -y; pip install https://wheels.vllm.ai/cc10281498fc2a6eb804274dcf22e6cb766f7aa7/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl; torchrun --nproc-per-node=2 torchrun_example.py` gives different generations.

CC @youkaichao

@youkaichao (Member)

But that commit sets a fixed random seed, right? Why would that produce different results? 🤔

@WoosukKwon (Collaborator)

@toslali-ibm @youkaichao #14274 sets the seed to None, which means there is no global seed.
One needs to explicitly set a seed for determinism.
I got the same results from the workers when the seed is set.

Prompt: 'Hello, my name is', Generated text: ' Joel, my dad is my friend and we are in a relationship. I am'
Prompt: 'The president of the United States is', Generated text: ' speaking out against the release of some State Department documents which show the Russians were involved'
Prompt: 'The capital of France is', Generated text: ' known as the “Proud French capital”. What is this city'
Prompt: 'The future of AI is', Generated text: ' literally in danger of being taken by any other company.\nAgreed. '
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 37.84it/s, est. speed input: 246.06 toks/s, output: 605.66 toks/s]
Prompt: 'Hello, my name is', Generated text: ' Joel, my dad is my friend and we are in a relationship. I am'
Prompt: 'The president of the United States is', Generated text: ' speaking out against the release of some State Department documents which show the Russians were involved'
Prompt: 'The capital of France is', Generated text: ' known as the “Proud French capital”. What is this city'
Prompt: 'The future of AI is', Generated text: ' literally in danger of being taken by any other company.\nAgreed. '
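For reference, a minimal sketch of the explicit-seed change described in the comment above; the model name and seed value are illustrative, not taken from the issue:

```python
# Sketch: pass an explicit seed so that every torchrun rank samples identically.
from vllm import LLM

llm = LLM(
    model="facebook/opt-125m",
    tensor_parallel_size=2,
    distributed_executor_backend="external_launcher",
    seed=0,  # without this, vLLM >= 0.8.0 leaves the global seed unset
)
```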

@WoosukKwon (Collaborator)

@toslali-ibm @youkaichao Please see https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/reproduciblity.py
I think we should update torchrun_example.py and check whether the seed is set when initializing ParallelConfig.
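A hypothetical sketch of the kind of check suggested here; this is not the patch that closed the issue, and the helper name and call site are assumptions:

```python
# Hypothetical validation helper: warn when the external launcher is used
# without an explicit seed, since sampling would then diverge across ranks.
import warnings
from typing import Optional

def warn_if_unseeded_external_launcher(distributed_executor_backend: Optional[str],
                                       seed: Optional[int]) -> None:
    if distributed_executor_backend == "external_launcher" and seed is None:
        warnings.warn(
            "external_launcher is used with seed=None; each rank will sample "
            "independently and generations will differ across ranks. Set an "
            "explicit seed to make all ranks produce identical outputs."
        )
```

This mirrors the behaviour change from #14274: with no global seed, each process ends up seeding itself independently.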

@youkaichao (Member)

@WoosukKwon thanks for the reminder! Indeed, I can see that cc10281 adds a seed to tests/distributed/test_torchrun_example.py, but not to torchrun_example.py.
