[Model] PP support for embedding models and update docs #9090
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
if (self.observability_config is not None
        and self.observability_config.collect_model_forward_time):
    model_forward_start = torch.cuda.Event(enable_timing=True)
    model_forward_end = torch.cuda.Event(enable_timing=True)
    model_forward_start.record()
Copied from GPUModelRunner. I don't really like how we're repeating this code though...
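Since this block is duplicated from GPUModelRunner, one option would be a small shared helper. Below is a minimal sketch, assuming a hypothetical cuda_forward_timer context manager (not vLLM's actual API):

```python
# Hypothetical helper showing one way the duplicated event-timing block
# could be shared between GPUModelRunner and the embedding model runner.
import contextlib

import torch


@contextlib.contextmanager
def cuda_forward_timer(enabled: bool):
    """Yield a (start, end) CUDA event pair, recording around the block."""
    if not enabled:
        yield None
        return
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    try:
        yield (start, end)
    finally:
        # After torch.cuda.synchronize(), the forward time would be
        # start.elapsed_time(end).
        end.record()
```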
@zhuzilin can you help check whether Qwen2.5-Math-RM-72B can be used in a PP setting? The model is too big for me to test.
@DarkLight1337 you can test it with dummy weights and manually change the hidden size in the config to make the model smaller.
Handing it over to @andoorve for review.
Added one minor comment, otherwise looks straightforward to me!
"baichuan-inc/Baichuan-7B": PPTestSettings.fast(trust_remote_code=True), | ||
"baichuan-inc/Baichuan2-13B-Chat": PPTestSettings.fast(trust_remote_code=True), # noqa: E501 | ||
"bigscience/bloomz-1b1": PPTestSettings.fast(), | ||
"THUDM/chatglm3-6b": PPTestSettings.fast(trust_remote_code=True), | ||
"CohereForAI/c4ai-command-r-v01": PPTestSettings.fast(tp_base=2, trust_remote_code=True), # noqa: E501 | ||
# TODO: Test on larger GPU | ||
# "databricks/dbrx-instruct": PPTestSettings.fast(), | ||
"databricks/dbrx-instruct": PPTestSettings.fast(tp_base=4), |
This hardcoded value of 4 will only work on machines with 8 GPUs. It might be a bit confusing, so comments should be added explaining how to adjust it as necessary.
We might not need all of them either. A few models as examples, with some comments, should be sufficient; otherwise the list needs to be kept up to date.
I have added some comments in the code. See if that works for you.
The comments look good to me, but would it be possible to run this test with tp_base of 8? I.e., does this test automatically work with 2 nodes? This doesn't need to block, but it's just something to think about.
It requires 16 GPUs to run. What setup did you use to run those models? I have added a comment in the code indicating that tp_base is only an indication of the model size and may have to be adjusted further.
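For reference, a quick sketch of the GPU-count arithmetic in this thread; required_gpus is a hypothetical illustration, and pp_size=2 is assumed from the discussion:

```python
# Sketch of the GPU-count arithmetic behind this thread; the parameter
# names mirror the test settings, but the function itself is hypothetical.
def required_gpus(tp_base: int, pp_size: int = 2) -> int:
    """Total GPUs for a run with tp_base tensor-parallel ranks per stage."""
    return tp_base * pp_size


# tp_base=4 fits on an 8-GPU machine; tp_base=8 needs 16 GPUs,
# i.e. two 8-GPU nodes.
assert required_gpus(4) == 8
assert required_gpus(8) == 16
```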
Please address the comments from @andoorve.
Force-pushed from af3cacc to 6e49835.
Small nit
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Not yet sure why, but this introduces a regression on ROCm:
I found that PP actually didn't work for embedding models because it was not implemented in the embedding model runner. I've updated the embedding model runner to support PP and cleaned up the existing code in the models so that embedding and CausalLM models can share the weight-loading logic.
I have also updated the Supported Models page (again) with additional models.
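To illustrate the shared weight-loading pattern described above, here is a minimal sketch; the class names are hypothetical and this is not the actual vLLM code:

```python
# Hypothetical sketch of the weight-loading pattern: the backbone owns
# load_weights, and the embedding wrapper delegates to it instead of
# duplicating the loading loop in each model class.
from typing import Iterable, Tuple

import torch
from torch import nn


class Backbone(nn.Module):
    """Transformer trunk shared by the CausalLM and embedding variants."""

    def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]):
        params = dict(self.named_parameters())
        for name, tensor in weights:
            # Parameters for layers assigned to other PP ranks are absent
            # from this module, so their weights are simply skipped.
            if name in params:
                params[name].data.copy_(tensor)


class EmbeddingModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = Backbone()

    def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]):
        # Delegate to the shared backbone rather than re-implementing it.
        self.model.load_weights(weights)
```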