
[Model] PP support for embedding models and update docs #9090

Merged
merged 23 commits from embedding-pp into main on Oct 6, 2024

Conversation

@DarkLight1337 DarkLight1337 (Member) commented Oct 5, 2024

I found that PP actually didn't work for embedding models because it wasn't implemented in the embedding model runner. I've updated the embedding model runner to support PP and cleaned up the existing code in the models so that embedding and CausalLM models can share their weight-loading logic.

I have also updated the Supported Models page (again) with additional models.
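
To make the weight-sharing idea concrete, here is a minimal, hypothetical sketch in plain PyTorch (these are stand-in classes, not vLLM's actual ones): the embedding model wraps the same decoder backbone that the CausalLM model uses and delegates weight loading to it, so the mapping from checkpoint names to parameters lives in one place. The PP changes themselves live in the embedding model runner and are not shown here.

from typing import Iterable, Tuple

import torch
import torch.nn as nn


class DecoderBackbone(nn.Module):
    # Stand-in for the shared transformer stack (e.g. a Llama-style decoder).
    def __init__(self, hidden_size: int = 64, vocab_size: int = 128) -> None:
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
        self.layers = nn.ModuleList(
            nn.Linear(hidden_size, hidden_size) for _ in range(2))

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.embed_tokens(input_ids)
        for layer in self.layers:
            hidden = layer(hidden)
        return hidden

    def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]) -> None:
        # The single place that knows how checkpoint names map onto parameters.
        params = dict(self.named_parameters())
        for name, tensor in weights:
            if name in params:
                params[name].data.copy_(tensor)


class EmbeddingModel(nn.Module):
    # Pooling head on top of the shared backbone; reuses its weight loading.
    def __init__(self) -> None:
        super().__init__()
        self.model = DecoderBackbone()

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.model(input_ids).mean(dim=1)  # simple mean pooling

    def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]) -> None:
        self.model.load_weights(weights)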


github-actions bot commented Oct 5, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

Comment on lines +112 to +116
if (self.observability_config is not None
        and self.observability_config.collect_model_forward_time):
    model_forward_start = torch.cuda.Event(enable_timing=True)
    model_forward_end = torch.cuda.Event(enable_timing=True)
    model_forward_start.record()
@DarkLight1337 DarkLight1337 (Member Author) commented Oct 5, 2024

Copied from GPUModelRunner. I don't really like how we're repeating this code though...
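
For what it's worth, one hypothetical way to factor that repeated block into a shared helper (just a sketch, not something this PR adds) would be a small context manager around the CUDA events:

from contextlib import contextmanager
from typing import Iterator, Optional

import torch


@contextmanager
def cuda_forward_timer(enabled: bool) -> Iterator[Optional[dict]]:
    # Records model forward time with CUDA events when enabled; yields a dict
    # that receives the elapsed time in milliseconds once the block exits.
    if not enabled or not torch.cuda.is_available():
        yield None
        return
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    result: dict = {}
    try:
        yield result
    finally:
        end.record()
        torch.cuda.synchronize()
        result["forward_time_ms"] = start.elapsed_time(end)

# usage sketch:
#     with cuda_forward_timer(collect_forward_time) as timing:
#         hidden_states = model(...)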

@DarkLight1337 (Member Author)

@zhuzilin can you help check whether Qwen2.5-Math-RM-72B can be used in a PP setting? The model is too big for me to test.

@DarkLight1337 DarkLight1337 added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) on Oct 5, 2024
@youkaichao (Member)

@DarkLight1337 you can test it with dummy weights and manually change the hidden size in the config to make the model smaller
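
For reference, one way to follow this suggestion is to save a shrunken copy of the config locally and load it with dummy (randomly initialized) weights. This is a rough sketch; the engine arguments and config fields are assumptions based on a recent vLLM and the Qwen2 config layout, and may need adjusting:

from transformers import AutoConfig, AutoTokenizer
from vllm import LLM

model_id = "Qwen/Qwen2.5-Math-RM-72B"
tiny_dir = "./qwen2.5-math-rm-tiny"

# Shrink the model so it fits on a small test machine.
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
config.hidden_size = 512
config.intermediate_size = 1024
config.num_hidden_layers = 4
config.num_attention_heads = 8
config.num_key_value_heads = 8
config.save_pretrained(tiny_dir)
AutoTokenizer.from_pretrained(model_id, trust_remote_code=True).save_pretrained(tiny_dir)

# Load with dummy weights under pipeline parallelism (no 72B download needed).
llm = LLM(
    model=tiny_dir,
    load_format="dummy",
    pipeline_parallel_size=2,
    trust_remote_code=True,
)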

@youkaichao (Member)

hand it over to @andoorve for review

@andoorve andoorve (Contributor) left a comment

Added one minor comment, otherwise looks straightforward to me!

"baichuan-inc/Baichuan-7B": PPTestSettings.fast(trust_remote_code=True),
"baichuan-inc/Baichuan2-13B-Chat": PPTestSettings.fast(trust_remote_code=True), # noqa: E501
"bigscience/bloomz-1b1": PPTestSettings.fast(),
"THUDM/chatglm3-6b": PPTestSettings.fast(trust_remote_code=True),
"CohereForAI/c4ai-command-r-v01": PPTestSettings.fast(tp_base=2, trust_remote_code=True), # noqa: E501
# TODO: Test on larger GPU
# "databricks/dbrx-instruct": PPTestSettings.fast(),
"databricks/dbrx-instruct": PPTestSettings.fast(tp_base=4),
Contributor

This hardcoded value of 4 will only work on 8-GPU machines. It might be a bit confusing; comments should be added so it can be adjusted as necessary.
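
(As a side note, a hypothetical guard, not part of this PR, could make such a test skip cleanly on machines with fewer GPUs than the tp_base x pp_size product requires:

import pytest
import torch


def require_gpus(num_required: int) -> None:
    # Skip the current test when fewer than num_required GPUs are visible.
    available = torch.cuda.device_count()
    if available < num_required:
        pytest.skip(f"need {num_required} GPUs, found {available}")
)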

Contributor

We might not need all of them either; maybe a few as examples are sufficient, with some comments. Otherwise the list needs to be kept up to date.

Member Author

I have added some comments in the code. See if it works for you.

Contributor

The comments look good to me, but would it be possible to run this test with tp_base of 8? I.e. does this test automatically work with 2 nodes? This doesn't need to block but just something to think about.

Member Author

It requires 16 GPUs to run. What is the setup you used to run those models? I have a comment indicating that tp_base is just an indication of the model size and may have to be adjusted further.

@youkaichao youkaichao (Member) left a comment

please address comments from @andoorve

@DarkLight1337 DarkLight1337 changed the title from "[Model] PP support for embedding models" to "[Model] PP support for embedding models and update docs" on Oct 6, 2024
@ywang96 ywang96 (Member) left a comment

Small nit

docs/source/models/supported_models.rst (review comment resolved)
DarkLight1337 and others added 2 commits October 6, 2024 11:43
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
@DarkLight1337 DarkLight1337 merged commit b22b798 into main Oct 6, 2024
58 checks passed
@DarkLight1337 DarkLight1337 deleted the embedding-pp branch October 6, 2024 08:35
liuyanyi pushed a commit to liuyanyi/vllm that referenced this pull request Oct 6, 2024
…#9090)

Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
@gshtras gshtras (Contributor) commented Oct 7, 2024

Not yet sure why, but this introduces a regression on ROCm:
python benchmarks/benchmark_latency.py --model meta-llama/Llama-2-70b-chat-hf -tp 8
crashes during weight loading
