-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactoring for maintainability #4
Refactoring for maintainability #4
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge). To run full CI, you can do one of these:
🚀 |
from .fused_moe import fused_topk, moe_align_block_size, try_get_optimal_moe_config | ||
|
||
|
||
def fused_moe_gptq( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you call this fused_moe_marlin
? We want to separate the naming of the kernel from the algorithm
hidden_size: int, intermediate_size: int, | ||
params_dtype: torch.dtype, **extra_weight_attrs): | ||
|
||
def create_weights( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: don't touch unchanged lines
@@ -386,7 +162,7 @@ def forward_tpu( | |||
class FusedMoE(torch.nn.Module): | |||
"""FusedMoE layer for MoE models. | |||
|
|||
This layer contains both MergedColumnParallel weights (gate_up_proj / | |||
This layer contains both MergedColumnParallel weights (gate_up_proj / |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: don't touch unchanged lines
@@ -377,6 +152,7 @@ def forward_tpu( | |||
topk_group: Optional[int], | |||
) -> torch.Tensor: | |||
from vllm.model_executor.layers.fused_moe.moe_pallas import fused_moe | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: don't touch unchanged lines
@@ -491,8 +267,8 @@ def weight_loader(self, | |||
else: | |||
# Input scales can be loaded directly and should be equal. | |||
if "input_scale" in weight_name: | |||
if param_data[expert_id] != 1 and (param_data[expert_id] - | |||
loaded_weight).abs() > 1e-5: | |||
if (param_data[expert_id] != 1 and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: don't touch unchanged lines
if param_data[expert_id] != 1 and (param_data[expert_id] - | ||
loaded_weight).abs() > 1e-5: | ||
if (param_data[expert_id] != 1 and | ||
(param_data[expert_id] - loaded_weight).abs() > 1e-5): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: don't touch unchanged lines
@@ -546,7 +322,8 @@ def forward(self, hidden_states: torch.Tensor, | |||
renormalize=self.renormalize, | |||
use_grouped_topk=self.use_grouped_topk, | |||
num_expert_group=self.num_expert_group, | |||
topk_group=self.topk_group) | |||
topk_group=self.topk_group, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: don't touch unchanged lines
ckpt_up_proj_name: str, | ||
num_experts: int) -> List[Tuple[str, str, int, int]]: | ||
|
||
cls, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: don't touch unchanged lines
num_expert_group: Optional[int] = None, | ||
topk_group: Optional[int] = None, | ||
) -> torch.Tensor: | ||
if layer.marlin_state == GPTQMarlinState.REPACK: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of this REPACK, please do the repacking in process_weights_after_loading
you can look at gptq_marlin.py
for an example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed this comment. Also deduplicated code with marlin_utils
Thanks for the PR! this looks really good. Other than spurious change nits, the key feedback is:
Other thing - I wonder if there is a better way to do the |
I think the change nits happened because of the formatter I am using on save. Will fix it rn. |
@robertgshaw2-neuralmagic resolved all formatting changes. Let me know if this is good to go! |
/ready |
tests/kernels/test_moe.py
Outdated
from vllm.model_executor.layers.fused_moe import fused_moe, single_marlin_moe | ||
from vllm.model_executor.layers.fused_moe.fused_moe_marlin import fused_moe_marlin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think it would be good to keep single_marlin_moe
in the same place as fused_moe_marlin
, even if the former is only used for testing
@@ -22,33 +22,49 @@ | |||
# limitations under the License. | |||
"""Inference-only Mixtral model.""" | |||
from typing import Iterable, List, Optional, Tuple | |||
|
|||
import re |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can you make sure to remove all unused imports?
@ElizaWszola thank you for the feedback! I have made the requested changes and also ran tests again. Hope things look good to merge now! Would love to help expedite work on supporting 8-bit quantized models as well (these are returning incorrect outputs on my end). Happy to chat sometime! |
This looks good in overall! Just two small remaining things:
|
@ElizaWszola - I think we can merge this + take it from here |
I'm merging this now. Thanks @DhruvaBansal00! |
magic_wand semi_structured_sparse_tensor_linear branch integrates 2:4 semi-structured sparsity into SparseTensor. This PR adds a new sparsity config for 2:4 sparsity to neuralmagic-vllm, using the SparseTensor 2:4 support. This PR also refactors the sparse linear method into a separate file, vllm/model_executor/layers/sparsity/sparse_w16a16_linear_method.py, which supports all sparsity formats.
magic_wand semi_structured_sparse_tensor_linear branch integrates 2:4 semi-structured sparsity into SparseTensor. This PR adds a new sparsity config for 2:4 sparsity to neuralmagic-vllm, using the SparseTensor 2:4 support. This PR also refactors the sparse linear method into a separate file, vllm/model_executor/layers/sparsity/sparse_w16a16_linear_method.py, which supports all sparsity formats.
Refactoring Marlin MoE implementation for maintainability and mirroring AWQ codepath