[Model] Make llama3.2 support multiple and interleaved images #9095

Open
wants to merge 1 commit into main
Conversation

xiangxu-google (Contributor)

FILL IN THE PR DESCRIPTION HERE

FIX #xxxx (link existing issues this PR will resolve)

BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE


PR Checklist

Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain code quality and improves the efficiency of the review process.

PR Title and Classification

Only specific types of PRs will be reviewed. The PR title should be prefixed appropriately to indicate the type of change. Please use one of the following prefixes:

  • [Bugfix] for bug fixes.
  • [CI/Build] for build or continuous integration improvements.
  • [Doc] for documentation fixes and improvements.
  • [Model] for adding a new model or improving an existing model. Model name should appear in the title.
  • [Frontend] for changes to the vLLM frontend (e.g., OpenAI API server, LLM class, etc.).
  • [Kernel] for changes affecting CUDA kernels or other compute kernels.
  • [Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler, etc.).
  • [Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
  • [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

  • We adhere to the Google Python style guide and Google C++ style guide.
  • Pass all linter checks. Please use format.sh to format your code.
  • The code needs to be well documented so that future contributors can easily understand it.
  • Include sufficient tests to ensure the project stays correct and robust. This includes both unit tests and integration tests.
  • Please add documentation to docs/source/ if the PR modifies user-facing behavior of vLLM. This helps vLLM users understand and utilize the new features or changes.

Adding or changing kernels

Each custom kernel needs a schema and one or more implementations to be registered with PyTorch.

  • Make sure custom ops are registered following PyTorch guidelines: Custom C++ and CUDA Operators and The Custom Operators Manual.
  • Custom operations that return Tensors require meta-functions. Meta-functions should be implemented and registered in Python so that dynamic dims can be handled automatically. See the above documents for a description of meta-functions.
  • Use torch.library.opcheck() to test the function registration and meta-function for any registered ops (see the sketch after this list). See tests/kernels for examples.
  • When changing the C++ signature of an existing op, the schema must be updated to reflect the changes.
  • If a new custom type is needed, see the following document: Custom Class Support in PT2.
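
For illustration, here is a minimal, hedged sketch of registering a custom op with a Python meta-function (fake implementation) and validating it with torch.library.opcheck(), assuming a recent PyTorch (2.4+) where torch.library.custom_op is available. The op name mylib::scale and its body are invented for this example and are not part of vLLM:

import torch

@torch.library.custom_op("mylib::scale", mutates_args=())
def scale(x: torch.Tensor, factor: float) -> torch.Tensor:
    # Eager implementation of the illustrative op.
    return x * factor

@scale.register_fake
def _(x: torch.Tensor, factor: float) -> torch.Tensor:
    # Meta-function: describes only the output shape/dtype, so dynamic dims
    # can be handled automatically under torch.compile.
    return torch.empty_like(x)

# Checks the schema, registration, and fake implementation of the op.
torch.library.opcheck(scale, (torch.randn(4), 2.0))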

Notes for Large Changes

Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag the PR with rfc-required and may not review it.

What to Expect for the Reviews

The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:

  • After the PR is submitted, the PR will be assigned to a reviewer. Every reviewer will pick up the PRs based on their expertise and availability.
  • After the PR is assigned, the reviewer will provide a status update every 2-3 days. If the PR is not reviewed within 7 days, please feel free to ping the reviewer or the vLLM team.
  • After the review, the reviewer will put an action-required label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.
  • Please respond to all comments within a reasonable time frame. If a comment isn't clear or you disagree with a suggestion, feel free to ask for clarification or discuss the suggestion.

Thank You

Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!

github-actions bot commented Oct 5, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@heheda12345 (Collaborator) left a comment

Can you add some unit tests for multiple images & interleaved images?

vllm/model_executor/models/mllama.py (two outdated review threads, resolved)
vllm/model_executor/models/mllama.py
attn_metadata,
attn_type=AttentionType.ENCODER_DECODER)
if attention_mask is not None:
if len(kv_cache.shape) == 3:
Collaborator

Why do we need this check?

Contributor Author

It's for skipping the KV-cache write during the initial profiling run.

Collaborator

@ywang96 Is there a better way to know whether we are in the profile run?

@ywang96 (Member) Oct 5, 2024

Not really from within the forward pass, so this check is okay imo. Let's add a NOTE here to indicate the purpose of this check.
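
For context, here is a toy illustration of the rank-based check being discussed (not the PR's actual code; the assumption that the profiling run passes a placeholder, e.g. empty, KV-cache tensor is based on the author's explanation above):

import torch

def should_write_kv_cache(kv_cache: torch.Tensor) -> bool:
    # NOTE: during the initial memory-profiling run the layer only receives a
    # placeholder KV cache, so a rank check is used to skip the cache write.
    return len(kv_cache.shape) == 3

print(should_write_kv_cache(torch.empty(0)))         # profiling run -> False
print(should_write_kv_cache(torch.empty(2, 8, 16)))  # populated cache -> True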

vllm/model_executor/models/mllama.py (six more outdated review threads, resolved)
@heheda12345 (Collaborator) commented Oct 5, 2024

Based on a personal discussion with @xiangxu-google, we still need the following steps:

  1. verify the correctness
  2. change vllm/entrypoints/chat_utils.py to support OpenAI chat template API -> Hugging Face chat template API conversion
  3. add unit tests
  4. add mllama to examples/offline_inference_vision_language_multi_image.py and update the documentation to show that mllama now supports multiple images (a rough sketch of such an example follows below)
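
To make step 4 concrete, here is a rough, hedged sketch of what such an offline multi-image example could look like, modeled on vLLM's existing multi-image example script; the prompt string, image assets, and engine arguments (max_model_len, limit_mm_per_prompt, enforce_eager) are illustrative assumptions, not the final example code:

from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

# Engine arguments are illustrative; limit_mm_per_prompt raises the default
# one-image-per-prompt limit so two images can be passed.
llm = LLM(model=model_id,
          max_model_len=8192,
          enforce_eager=True,
          limit_mm_per_prompt={"image": 2})

# A single user turn with two interleaved <|image|> placeholders, following the
# chat-template output shown later in this thread.
prompt = ("<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
          "<|image|>Describe this image.<|image|>How does the second image differ?"
          "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n")

images = [ImageAsset("stop_sign").pil_image,
          ImageAsset("cherry_blossom").pil_image]

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": images}},
    SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)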

@ywang96 (Member) commented Oct 6, 2024

I talked to @heheda12345 offline about the following steps for this model:

  1. Let's focus on offline inference with llm.generate for this PR, and leave changes to vllm/entrypoints/chat_utils.py to a separate follow-up PR. I have verified that we can leverage the tokenizer to format the input prompt in the offline example for multiple images:
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image"}
    ]},
    {"role": "assistant", "content": [
        {"type": "text", "text": "This image is xxx"}
    ]},
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "How about this image"}
    ]}
]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
print(input_text)

Output:

<|begin_of_text|><|start_header_id|>user<|end_header_id|>

<|image|>Describe this image<|eot_id|><|start_header_id|>assistant<|end_header_id|>

This image is xxx<|eot_id|><|start_header_id|>user<|end_header_id|>

<|image|>How about this image<|eot_id|><|start_header_id|>assistant<|end_header_id|>
  2. We also need to add the logic to compute the cross-attention mask for each individual request (for this we can refer to get_cross_attention_token_mask from transformers). IMO this should be done in input_processor_for_mllama, and we can store the resulting mask in llm_inputs["encoder_multi_modal_data"]["per_request_cross_attention_mask"] to be retrieved later (a rough sketch of the idea follows).
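
To illustrate the per-request idea, here is a small, self-contained sketch of computing image-to-text spans from a prompt's token IDs. It is a hypothetical helper, not vLLM's or transformers' actual implementation, and the <|image|> token ID is an assumption; each image token opens a span that runs until the next image token, with the last image covering the remaining text, similar in spirit to what get_cross_attention_token_mask computes:

from typing import List

MLLAMA_IMAGE_TOKEN_ID = 128256  # assumed ID of <|image|> for Llama 3.2 Vision

def cross_attention_token_spans(
        input_ids: List[int],
        image_token_id: int = MLLAMA_IMAGE_TOKEN_ID) -> List[List[int]]:
    # Positions of every <|image|> token in the already-tokenized prompt.
    image_positions = [i for i, t in enumerate(input_ids) if t == image_token_id]
    spans = []
    for idx, start in enumerate(image_positions):
        # Text between this image token and the next one attends to this image;
        # the last image covers the rest of the sequence.
        end = (image_positions[idx + 1]
               if idx + 1 < len(image_positions) else len(input_ids))
        spans.append([start, end])
    return spans

# Toy example: two images interleaved with text tokens (made-up text IDs).
print(cross_attention_token_spans([128256, 1, 2, 3, 128256, 4, 5]))
# -> [[0, 4], [4, 7]]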

@xiangxu-google (Contributor Author) commented Oct 6, 2024

  2. We also need to add the logic to compute the cross-attention mask for each individual request...

Curious why we need the logic for each request? This PR already implements computing masks for a batch of requests.

@ywang96 (Member) commented Oct 6, 2024

  2. We also need to add the logic to compute the cross-attention mask for each individual request...

Curious why we need the logic for each request? This PR already implements computing masks for a batch of requests.

You're right, and it's really a matter of where we put that mask computation logic. IMO it's cleaner if we decouple the mask computation for each individual request from what's needed for the batch in the forward pass, since this is more or less still a CPU operation that we should put in input_processor_mllama. WDYT?

I'm also okay with leaving this logic as a whole for the batch inside forward for now if it's too much work to decouple :)

@mgoin (Collaborator) commented Oct 7, 2024

FYI I tried using this PR to evaluate Llama 3.2 Vision 11B on MMMU and ran into a CUDA error during attention:

CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES=0 lm_eval --model vllm-vlm --model_args pretrained=meta-llama/Llama-3.2-11B-Vision-Instruct,enforce_eager=True,max_num_seqs=4,max_model_len=10000 --tasks mmmu_val --batch_size 4
2024-10-07:14:09:39,030 INFO     [__main__.py:279] Verbosity set to INFO
2024-10-07:14:09:39,061 INFO     [__init__.py:449] `group` and `group_alias` keys in TaskConfigs are deprecated and will be removed in v0.4.5 of lm_eval. The new `tag` field will be used to allow for a shortcut to a group of tasks one does not wish to aggregate metrics across. `group`s which aggregate across subtasks must be only defined in a separate group config file, which will be the official way to create groups that support cross-task aggregation as in `mmlu`. Please see the v0.4.4 patch notes and our documentation: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#advanced-group-configs for more information.
2024-10-07:14:09:43,245 INFO     [__main__.py:376] Selected Tasks: ['mmmu_val']
2024-10-07:14:09:43,246 INFO     [evaluator.py:161] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
[['pretrained', 'meta-llama/Llama-3.2-11B-Vision-Instruct'], ['enforce_eager', 'True'], ['max_num_seqs', '4'], ['max_model_len', '10000']]
2024-10-07:14:09:43,246 INFO     [evaluator.py:198] Initializing vllm-vlm model, with arguments: {'pretrained': 'meta-llama/Llama-3.2-11B-Vision-Instruct', 'enforce_eager': True, 'max_num_seqs': 4, 'max_model_len': 10000}
[['pretrained', 'meta-llama/Llama-3.2-11B-Vision-Instruct'], ['enforce_eager', 'True'], ['max_num_seqs', '4'], ['max_model_len', '10000']]
[['image', '5']]
WARNING 10-07 14:09:48 arg_utils.py:907] Block Manager v2 does not support encoder-decoder models currently. Using Block Manager v1 as fallback.
WARNING 10-07 14:09:48 config.py:365] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 10-07 14:09:48 llm_engine.py:237] Initializing an LLM engine (v0.6.3.dev22+g97c00c6e) with config: model='meta-llama/Llama-3.2-11B-Vision-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.2-11B-Vision-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=10000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=1234, served_model_name=meta-llama/Llama-3.2-11B-Vision-Instruct, use_v2_block_manager=False, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=False, mm_processor_kwargs=None)
INFO 10-07 14:09:48 enc_dec_model_runner.py:141] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers.
INFO 10-07 14:09:48 selector.py:116] Using XFormers backend.
/home/mgoin/venvs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/home/mgoin/venvs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 10-07 14:09:49 model_runner.py:1049] Starting to load model meta-llama/Llama-3.2-11B-Vision-Instruct...
INFO 10-07 14:09:49 selector.py:116] Using XFormers backend.
INFO 10-07 14:09:49 weight_utils.py:242] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:00<00:03,  1.03it/s]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:02<00:03,  1.08s/it]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:03<00:02,  1.10s/it]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:04<00:01,  1.09s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:04<00:00,  1.16it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:04<00:00,  1.05it/s]

INFO 10-07 14:09:55 model_runner.py:1060] Loading model weights took 19.9073 GB
INFO 10-07 14:09:55 enc_dec_model_runner.py:301] Starting profile run for multi-modal models.
INFO 10-07 14:10:00 gpu_executor.py:122] # GPU blocks: 12935, # CPU blocks: 1638
2024-10-07:14:10:59,014 INFO     [evaluator.py:279] Setting fewshot random generator seed to 1234
2024-10-07:14:10:59,016 WARNING  [model.py:422] model.chat_template was called with the chat_template set to False or None. Therefore no chat template will be applied. Make sure this is an intended behavior.
2024-10-07:14:11:08,650 INFO     [evaluator.py:485] Running generate_until requests
Running generate_until requests with text+image input:   0%|                                                                                                                                                      | 0/900 [00:00<?, ?it/s]WARNING 10-07 14:11:08 preprocess.py:87] Falling back on <BOS> for decoder start token id because decoder start token id is not available.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/mgoin/venvs/vllm/bin/lm_eval", line 8, in <module>
[rank0]:     sys.exit(cli_evaluate())
[rank0]:   File "/home/mgoin/code/lm-evaluation-harness/lm_eval/__main__.py", line 382, in cli_evaluate
[rank0]:     results = evaluator.simple_evaluate(
[rank0]:   File "/home/mgoin/code/lm-evaluation-harness/lm_eval/utils.py", line 398, in _wrapper
[rank0]:     
[rank0]:   File "/home/mgoin/code/lm-evaluation-harness/lm_eval/evaluator.py", line 301, in simple_evaluate
[rank0]:     results = evaluate(
[rank0]:   File "/home/mgoin/code/lm-evaluation-harness/lm_eval/utils.py", line 398, in _wrapper
[rank0]:     
[rank0]:   File "/home/mgoin/code/lm-evaluation-harness/lm_eval/evaluator.py", line 496, in evaluate
[rank0]:     resps = getattr(lm, reqtype)(cloned_reqs)
[rank0]:   File "/home/mgoin/code/lm-evaluation-harness/lm_eval/models/vllm_vlms.py", line 266, in generate_until
[rank0]:     cont = self._model_generate(inputs, stop=until, generate=True, **kwargs)
[rank0]:   File "/home/mgoin/code/lm-evaluation-harness/lm_eval/models/vllm_vlms.py", line 123, in _model_generate
[rank0]:     outputs = self.model.generate(
[rank0]:   File "/home/mgoin/code/vllm/vllm/utils.py", line 1051, in inner
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/home/mgoin/code/vllm/vllm/entrypoints/llm.py", line 391, in generate
[rank0]:     outputs = self._run_engine(use_tqdm=use_tqdm)
[rank0]:   File "/home/mgoin/code/vllm/vllm/entrypoints/llm.py", line 899, in _run_engine
[rank0]:     step_outputs = self.llm_engine.step()
[rank0]:   File "/home/mgoin/code/vllm/vllm/engine/llm_engine.py", line 1404, in step
[rank0]:     outputs = self.model_executor.execute_model(
[rank0]:   File "/home/mgoin/code/vllm/vllm/executor/gpu_executor.py", line 130, in execute_model
[rank0]:     output = self.driver_worker.execute_model(execute_model_req)
[rank0]:   File "/home/mgoin/code/vllm/vllm/worker/worker_base.py", line 327, in execute_model
[rank0]:     output = self.model_runner.execute_model(
[rank0]:   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/mgoin/code/vllm/vllm/worker/enc_dec_model_runner.py", line 203, in execute_model
[rank0]:     hidden_or_intermediate_states = model_executable(
[rank0]:   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/mgoin/code/vllm/vllm/model_executor/models/mllama.py", line 1252, in forward
[rank0]:     outputs = self.language_model(
[rank0]:   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/mgoin/code/vllm/vllm/model_executor/models/mllama.py", line 956, in forward
[rank0]:     hidden_states = self.model(
[rank0]:   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/mgoin/code/vllm/vllm/model_executor/models/mllama.py", line 896, in forward
[rank0]:     hidden_states = decoder_layer(
[rank0]:   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/mgoin/code/vllm/vllm/model_executor/models/mllama.py", line 826, in forward
[rank0]:     hidden_states = self.cross_attn(
[rank0]:   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/mgoin/code/vllm/vllm/model_executor/models/mllama.py", line 725, in forward
[rank0]:     output = self.attention_with_mask(q, k, v, kv_cache,
[rank0]:   File "/home/mgoin/code/vllm/vllm/model_executor/models/mllama.py", line 775, in attention_with_mask
[rank0]:     output = F.scaled_dot_product_attention(q,
[rank0]: RuntimeError: CUDA error: invalid configuration argument
[rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Running generate_until requests with text+image input:   0%|                                                                                                                                                      | 0/900 [00:01<?, ?it/s]

@xiangxu-google (Contributor Author) commented Oct 7, 2024

FYI I tried using this PR to evaluate Llama 3.2 Vision 11B on MMMU and ran into a CUDA error during attention

What is your eval environment (torch version, vLLM version, GPU, CUDA version)? I can't reproduce the issue on an H100 with CUDA 12.4 and vLLM installed from source.
Since F.scaled_dot_product_attention is a torch API, users can only make sure the shapes of the input tensors are correct. This PR produces the correct input shapes with:

        q = q.transpose(0, 1).view(self.num_key_value_groups,
                                   self.num_local_key_value_heads, q_len,
                                   self.head_dim)
        k = k.transpose(0, 1).expand(self.num_key_value_groups,
                                     self.num_local_key_value_heads, kv_len,
                                     self.head_dim)
        v = v.transpose(0, 1).expand(self.num_key_value_groups,
                                     self.num_local_key_value_heads, kv_len,
                                     self.head_dim)
        attention_mask = attention_mask.view(1, 1, q_len, kv_len)
        output = F.scaled_dot_product_attention(q,
                                                k,
                                                v,
                                                attn_mask=attention_mask)
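
As a quick, standalone sanity check of those shapes (the sizes below are illustrative and not taken from the PR), F.scaled_dot_product_attention broadcasts the 4-D additive mask over the group and head dimensions and returns a tensor shaped like q:

import torch
import torch.nn.functional as F

num_key_value_groups, num_local_key_value_heads, head_dim = 4, 8, 128
q_len, kv_len = 16, 6404  # illustrative text length vs. number of vision tokens

q = torch.randn(num_key_value_groups, num_local_key_value_heads, q_len, head_dim)
k = torch.randn(num_key_value_groups, num_local_key_value_heads, kv_len, head_dim)
v = torch.randn(num_key_value_groups, num_local_key_value_heads, kv_len, head_dim)
# Additive float mask; zeros mean "no masking" here.
attention_mask = torch.zeros(1, 1, q_len, kv_len)

output = F.scaled_dot_product_attention(q, k, v, attn_mask=attention_mask)
print(output.shape)  # torch.Size([4, 8, 16, 128])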

@xiangxu-google (Contributor Author)

  2. We also need to add the logic to compute the cross-attention mask for each individual request...

Curious why we need the logic for each request? This PR already implements computing masks for a batch of requests.

You're right, and it's really a matter of where we put that mask computation logic. IMO it's cleaner if we decouple the mask computation for each individual request from what's needed for the batch in the forward pass, since this is more or less still a CPU operation that we should put in input_processor_mllama. WDYT?

I'm also okay with leaving this logic as a whole for the batch inside forward for now if it's too much work to decouple :)

Thanks for the suggestion! I need to verify the correctness of this implementation first for a controllable landing. We can move the CPU logic to input_processor_mllama in a separate refactoring PR later.
