GPTQ Fused MoE class #8

Closed
wants to merge 82 commits

The diff below shows changes from 2 of the 82 commits.

82 commits
db1f07e
GPTQ Fused MoE class
ElizaWszola Sep 3, 2024
6753789
Add GPTQMarlinMoEMethod to gptq_marlin.py
ElizaWszola Sep 3, 2024
7df4014
Use FusedMoE layer for all loads
ElizaWszola Sep 4, 2024
c3dc249
Merge branch 'marlin-moe-8-bit' into gptq_fused_moe
ElizaWszola Sep 4, 2024
2fa03e5
Make sure that GPTQ runs through mixtral.py
ElizaWszola Sep 4, 2024
8a504d9
enforce float16A/scales for marlin moe
ElizaWszola Sep 4, 2024
689ea0a
Merge branch 'marlin-moe-8-bit' into gptq_fused_moe
ElizaWszola Sep 4, 2024
ec47561
cleanup
ElizaWszola Sep 4, 2024
2ad2e56
[MISC] Consolidate FP8 kv-cache tests (#8131)
comaniac Sep 4, 2024
d1dec64
[CI/Build][ROCm] Enabling LoRA tests on ROCm (#7369)
alexeykondrat Sep 4, 2024
561d6f8
[CI] Change test input in Gemma LoRA test (#8163)
WoosukKwon Sep 4, 2024
e02ce49
[Feature] OpenAI-Compatible Tools API + Streaming for Hermes & Mistra…
K-Mistele Sep 4, 2024
77d9e51
[MISC] Replace input token throughput with total token throughput (#8…
comaniac Sep 4, 2024
008cf88
[Neuron] Adding support for adding/ overriding neuron configuration a…
hbikki Sep 4, 2024
32e7db2
Bump version to v0.6.0 (#8166)
simon-mo Sep 4, 2024
e01c2be
[Doc] [Misc] Create CODE_OF_CONDUCT.md (#8161)
mmcelaney Sep 4, 2024
1afc931
[bugfix] >1.43 constraint for openai (#8169)
SolitaryThinker Sep 5, 2024
4624d98
[Misc] Clean up RoPE forward_native (#8076)
WoosukKwon Sep 5, 2024
ba262c4
[ci] Mark LoRA test as soft-fail (#8160)
khluu Sep 5, 2024
e39ebf5
[Core/Bugfix] Add query dtype as per FlashInfer API requirements. (#8…
elfiegg Sep 5, 2024
288a938
[Doc] Indicate more information about supported modalities (#8181)
DarkLight1337 Sep 5, 2024
8685ba1
Inclusion of InternVLChatModel In PP_SUPPORTED_MODELS(Pipeline Parall…
Manikandan-Thangaraj-ZS0321 Sep 5, 2024
9da25a8
[MODEL] Qwen Multimodal Support (Qwen-VL / Qwen-VL-Chat) (#8029)
alex-jw-brooks Sep 5, 2024
2ee4528
Move verify_marlin_supported to GPTQMarlinLinearMethod (#8165)
mgoin Sep 5, 2024
2febcf2
[Documentation][Spec Decode] Add documentation about lossless guarant…
sroy745 Sep 5, 2024
9f97b3b
update/fix weight loading to support tp
dsikka Sep 5, 2024
db3bf7c
[Core] Support load and unload LoRA in api server (#6566)
Jeffwan Sep 6, 2024
baa5467
[BugFix] Fix Granite model configuration (#8216)
njhill Sep 6, 2024
b841ac4
remove 8-bit stuff for now
ElizaWszola Sep 6, 2024
a245032
Merge branch 'gptq_fused_moe' of https://github.com/neuralmagic/vllm …
ElizaWszola Sep 6, 2024
9d8a80c
fix; update large model testing cases
dsikka Sep 6, 2024
e5cab71
[Frontend] Add --logprobs argument to `benchmark_serving.py` (#8191)
afeldman-nm Sep 6, 2024
315e22f
add hack to support unfused mixtral pathway for int8
dsikka Sep 6, 2024
de80783
[Misc] Use ray[adag] dependency instead of cuda (#7938)
ruisearch42 Sep 6, 2024
565cc43
fix install for tpu test
dsikka Sep 6, 2024
1447c97
[CI/Build] Increasing timeout for multiproc worker tests (#8203)
alexeykondrat Sep 6, 2024
9db52ea
[Kernel] [Triton] Memory optimization for awq_gemm and awq_dequantize…
rasmith Sep 6, 2024
23f3222
[Misc] Remove `SqueezeLLM` (#8220)
dsikka Sep 6, 2024
29f49cd
[Model] Allow loading from original Mistral format (#8168)
patrickvonplaten Sep 6, 2024
12dd715
[misc] [doc] [frontend] LLM torch profiler support (#7943)
SolitaryThinker Sep 7, 2024
41e95c5
[Bugfix] Fix Hermes tool call chat template bug (#8256)
K-Mistele Sep 7, 2024
2f707fc
[Model] Multi-input support for LLaVA (#8238)
DarkLight1337 Sep 7, 2024
795b662
Enable Random Prefix Caching in Serving Profiling Tool (benchmark_ser…
wschin Sep 7, 2024
ce2702a
[tpu][misc] fix typo (#8260)
youkaichao Sep 7, 2024
9f68e00
[Bugfix] Fix broken OpenAI tensorizer test (#8258)
DarkLight1337 Sep 7, 2024
e807125
[Model][VLM] Support multi-images inputs for InternVL2 models (#8201)
Isotr0py Sep 7, 2024
8886423
Move float16 typecast hack to gptq marlin moe method
ElizaWszola Sep 7, 2024
ab27497
Move output type conversion to gptq method as well
ElizaWszola Sep 7, 2024
36bf815
[Model][VLM] Decouple weight loading logic for `Paligemma` (#8269)
Isotr0py Sep 7, 2024
b962ee1
ppc64le: Dockerfile fixed, and a script for buildkite (#8026)
sumitd2 Sep 7, 2024
cfe712b
[CI/Build] Use python 3.12 in cuda image (#8133)
joerunde Sep 7, 2024
4ef41b8
[Bugfix] Fix async postprocessor in case of preemption (#8267)
alexm-redhat Sep 8, 2024
847e860
Enable 8-bit weights in Fused Marlin MoE
ElizaWszola Aug 30, 2024
430a9cb
fix rocm
ElizaWszola Aug 30, 2024
48047aa
bad paste
ElizaWszola Aug 30, 2024
bfc4fae
add test case; fix imports for tests
dsikka Aug 30, 2024
c5a2f62
fix to adapt custom_routin_function
dsikka Aug 30, 2024
2b308c4
Use select_experts to compute top_k tensors in fused moe
ElizaWszola Sep 2, 2024
71256d4
bring back fused_moe_marlin -> fused_marlin_moe
ElizaWszola Sep 3, 2024
7aa844c
GPTQ Fused MoE class
ElizaWszola Sep 3, 2024
0f7bec3
Add GPTQMarlinMoEMethod to gptq_marlin.py
ElizaWszola Sep 3, 2024
cb0001e
Use FusedMoE layer for all loads
ElizaWszola Sep 4, 2024
33090a3
Make sure that GPTQ runs through mixtral.py
ElizaWszola Sep 4, 2024
d479837
enforce float16A/scales for marlin moe
ElizaWszola Sep 4, 2024
8baaec6
remove large model
dsikka Sep 4, 2024
8fbc181
Cleanup, comments
ElizaWszola Sep 4, 2024
839915f
cleanup
ElizaWszola Sep 4, 2024
a5bc626
remove 8-bit stuff for now
ElizaWszola Sep 6, 2024
c573fa1
update/fix weight loading to support tp
dsikka Sep 5, 2024
a991d82
fix; update large model testing cases
dsikka Sep 6, 2024
d57804d
add hack to support unfused mixtral pathway for int8
dsikka Sep 6, 2024
96fa486
fix install for tpu test
dsikka Sep 6, 2024
1faab90
Move float16 typecast hack to gptq marlin moe method
ElizaWszola Sep 7, 2024
970e06a
Move output type conversion to gptq method as well
ElizaWszola Sep 7, 2024
fd0a4f2
typo fix; fix comment
dsikka Sep 9, 2024
3ac9273
Merge branch 'gptq_fused_moe' of https://github.com/neuralmagic/vllm …
ElizaWszola Sep 9, 2024
d51a2f4
Clarify comment, change how we process bias
ElizaWszola Sep 9, 2024
08287ef
[Bugfix] Streamed tool calls now more strictly follow OpenAI's format…
K-Mistele Sep 9, 2024
58fcc85
[Frontend] Add progress reporting to run_batch.py (#8060)
alugowski Sep 9, 2024
f9b4a2d
[Bugfix] Correct adapter usage for cohere and jamba (#8292)
vladislavkruglikov Sep 9, 2024
c7cb5c3
[Misc] GPTQ Activation Ordering (#8135)
kylesayrs Sep 9, 2024
12f05c5
Merge branch 'main' into gptq_fused_moe
dsikka Sep 9, 2024
3 changes: 2 additions & 1 deletion vllm/model_executor/layers/fused_moe/__init__.py
@@ -1,11 +1,12 @@
 from vllm.model_executor.layers.fused_moe.layer import (
-    FusedMoE, FusedMoEMethodBase, FusedMoeWeightScaleSupported)
+    FusedMoE, FusedMoEMethodBase, FusedMoeWeightScaleSupported, GPTQFusedMoE)
 from vllm.triton_utils import HAS_TRITON
 
 __all__ = [
     "FusedMoE",
     "FusedMoEMethodBase",
     "FusedMoeWeightScaleSupported",
+    "GPTQFusedMoE",
 ]
 
 if HAS_TRITON:
155 changes: 154 additions & 1 deletion vllm/model_executor/layers/fused_moe/layer.py
@@ -498,4 +498,157 @@ def _load_fp8_scale(self, param: torch.nn.Parameter,
            param_data[expert_id][idx] = loaded_weight
        # If we are in the row parallel case (down_proj)
        else:
            param_data[expert_id] = loaded_weight


class GPTQFusedMoE(torch.nn.Module):

Review comment:
I think GPTQMarlinMoEMethod looks good. We'd want to use FusedMoE as opposed to introducing GPTQFusedMoE - we can work on this tomorrow in making this happen.

"""GPTQFusedMoE layer for GPTQ MoE models.
This layer contains both MergedColumnParallel weights (gate_up_proj /
w13) and RowParallelLinear weights (down_proj/ w2).
Note: Mixtral uses w1, w2, and w3 for gate, up, and down_proj. We
copy that naming convention here and handle any remapping in the
load_weights function in each model implementation.
Args:
num_experts: Number of experts in the model
top_k: Number of experts selected for each token
hidden_size: Input hidden state size of the transformer
intermediate_size: Intermediate size of the experts
params_dtype: Data type for the parameters.
reduce_results: Whether to all all_reduce on the output of the layer
renomalize: Whether to renormalize the logits in the fused_moe kernel
quant_config: Quantization configure.
"""

    def __init__(
        self,
        num_experts: int,
        top_k: int,
        hidden_size: int,
        intermediate_size: int,
        params_dtype: Optional[torch.dtype] = None,
        reduce_results: bool = False,
        renormalize: bool = True,
        use_grouped_topk: bool = False,
        num_expert_group: Optional[int] = None,
        topk_group: Optional[int] = None,
        quant_config: Optional[QuantizationConfig] = None,
        tp_size: Optional[int] = None,
        prefix: str = "",
    ):
        super().__init__()

        if params_dtype is None:
            params_dtype = torch.get_default_dtype()

        self.tp_size = (tp_size if tp_size is not None else
                        get_tensor_model_parallel_world_size())
        self.top_k = top_k
        self.num_experts = num_experts
        self.intermediate_size = intermediate_size
        self.intermediate_size_per_partition = intermediate_size // self.tp_size
        self.reduce_results = reduce_results
        self.renormalize = renormalize
        assert (not use_grouped_topk and num_expert_group is None
                and topk_group is None)

        if quant_config is None:
            self.quant_method: Optional[
                QuantizeMethodBase] = UnquantizedFusedMoEMethod()
        else:
            self.quant_method = quant_config.get_quant_method(self, prefix)
        assert self.quant_method is not None

        self.quant_method.create_weights(
            layer=self,
            num_experts=num_experts,
            hidden_size=hidden_size,
            intermediate_size=self.intermediate_size_per_partition,
            params_dtype=params_dtype,
            weight_loader=self.weight_loader,
        )

    def weight_loader(self, param: torch.nn.Parameter,
                      loaded_weight: torch.Tensor, weight_name: str,
                      shard_id: str, expert_id: int) -> None:

        if ("_qweight" in weight_name or "_scales" in weight_name
                or "_qzeros" in weight_name):
            if "w13" in weight_name:
                shard_size = loaded_weight.size()[-1]
                if shard_id == "w1":
                    param.data[expert_id, :, :shard_size] = loaded_weight
                elif shard_id == "w2" or shard_id == "w3":
                    param.data[expert_id, :, shard_size:] = loaded_weight
                else:
                    raise ValueError(f"Invalid shard_id: {shard_id}: "
                                     "must be w1, w2, or w3.")
            elif "w2" in weight_name:
                param.data[expert_id][:] = loaded_weight
            else:
                raise ValueError(f"Invalid weight name: {weight_name}: "
                                 "must contain 'w13' or 'w2'.")
        elif "_g_idx" in weight_name:
            if "w13" not in weight_name and "w2" not in weight_name:
                raise ValueError(f"Invalid weight name: {weight_name}: "
                                 "must contain 'w13' or 'w2'.")
            param.data[expert_id] = loaded_weight
        else:
            raise ValueError(f"Invalid weight name: {weight_name}.")

    @staticmethod
    def select_experts(hidden_states: torch.Tensor,
                       router_logits: torch.Tensor,
                       top_k: int,
                       use_grouped_topk: bool,
                       renormalize: bool,
                       topk_group: Optional[int] = None,
                       num_expert_group: Optional[int] = None):
        assert (not use_grouped_topk and topk_group is None
                and num_expert_group is None)
        from vllm.model_executor.layers.fused_moe.fused_moe import fused_topk

        topk_weights, topk_ids = fused_topk(hidden_states=hidden_states,
                                            gating_output=router_logits,
                                            topk=top_k,
                                            renormalize=renormalize)

        return topk_weights, topk_ids

    def forward(self, hidden_states: torch.Tensor,
                router_logits: torch.Tensor):
        assert self.quant_method is not None

        # Matrix multiply.
        final_hidden_states = self.quant_method.apply(
            layer=self,
            x=hidden_states,
            router_logits=router_logits,
            top_k=self.top_k,
            renormalize=self.renormalize,
            use_grouped_topk=False,
            topk_group=False,
            num_expert_group=False)

        if self.reduce_results and self.tp_size > 1:
            final_hidden_states = tensor_model_parallel_all_reduce(
                final_hidden_states)

        return final_hidden_states

    @classmethod
    def make_expert_params_mapping(
            cls, ckpt_gate_proj_name: str, ckpt_down_proj_name: str,
            ckpt_up_proj_name: str,
            num_experts: int) -> List[Tuple[str, str, int, str]]:

        return [
            # (param_name, weight_name, expert_id, shard_id)
            ("experts.w13_" if weight_name
             in [ckpt_gate_proj_name, ckpt_up_proj_name] else "experts.w2_",
             f"experts.{expert_id}.{weight_name}.", expert_id, shard_id)
            for expert_id in range(num_experts) for shard_id, weight_name in [
                ("w1", ckpt_gate_proj_name),
                ("w2", ckpt_down_proj_name),
                ("w3", ckpt_up_proj_name),
            ]
        ]
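
For context, here is a minimal sketch of how a Mixtral-style MoE block could wrap the new layer, assuming a vLLM build that includes this branch. The import path relies on the __init__.py export above, and the constructor and forward(hidden_states, router_logits) signature come from the diff; the surrounding MoEBlock, its gate router, and tp_size=1 are illustrative assumptions rather than code from this PR.

import torch
from torch import nn

from vllm.model_executor.layers.fused_moe import GPTQFusedMoE


class MoEBlock(nn.Module):
    """Toy Mixtral-style MoE block built around GPTQFusedMoE (sketch)."""

    def __init__(self, hidden_size: int, intermediate_size: int,
                 num_experts: int, top_k: int, quant_config=None):
        super().__init__()
        # Router producing per-token expert logits.
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        # The fused expert layer added in this PR. tp_size=1 avoids needing
        # initialized tensor-parallel state for this sketch.
        self.experts = GPTQFusedMoE(num_experts=num_experts,
                                    top_k=top_k,
                                    hidden_size=hidden_size,
                                    intermediate_size=intermediate_size,
                                    reduce_results=False,
                                    renormalize=True,
                                    quant_config=quant_config,
                                    tp_size=1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # (num_tokens, hidden_size) -> (num_tokens, num_experts)
        router_logits = self.gate(hidden_states)
        # The layer selects top_k experts per token and applies them.
        return self.experts(hidden_states, router_logits)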
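
The weight_loader and make_expert_params_mapping above work together: the mapping tells a model's load_weights which per-expert checkpoint tensor feeds which fused parameter (w13_* for the merged gate/up projections, w2_* for down_proj) and with which shard id. A small sketch of what the mapping expands to; the checkpoint projection names passed in follow Mixtral's w1/w2/w3 convention and are otherwise just placeholders.

from vllm.model_executor.layers.fused_moe import GPTQFusedMoE

# Expand the mapping for a 2-expert model.
mapping = GPTQFusedMoE.make_expert_params_mapping(
    ckpt_gate_proj_name="w1",
    ckpt_down_proj_name="w2",
    ckpt_up_proj_name="w3",
    num_experts=2,
)

for param_prefix, ckpt_weight_name, expert_id, shard_id in mapping:
    # Produces tuples such as:
    #   ("experts.w13_", "experts.0.w1.", 0, "w1")  gate_proj -> first half of w13
    #   ("experts.w2_",  "experts.0.w2.", 0, "w2")  down_proj -> w2
    #   ("experts.w13_", "experts.0.w3.", 0, "w3")  up_proj   -> second half of w13
    print(param_prefix, ckpt_weight_name, expert_id, shard_id)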
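
The review comment attached to the class definition suggests keeping the existing FusedMoE layer and routing GPTQ MoE layers to the GPTQMarlinMoEMethod added to gptq_marlin.py in this PR's commits, rather than introducing a separate GPTQFusedMoE. The following self-contained sketch only illustrates that dispatch idea with stand-in classes; the real classes and the exact get_quant_method signature in gptq_marlin.py are not part of this diff, so treat every name ending in Stub as an assumption.

import torch


class FusedMoEStub(torch.nn.Module):
    """Stand-in for vllm's FusedMoE layer."""


class GPTQMarlinMoEMethodStub:
    """Stand-in for the GPTQMarlinMoEMethod added to gptq_marlin.py."""

    def __init__(self, quant_config):
        self.quant_config = quant_config


class GPTQMarlinLinearMethodStub:
    """Stand-in for the existing GPTQMarlinLinearMethod."""

    def __init__(self, quant_config):
        self.quant_config = quant_config


class GPTQMarlinConfigStub:
    """Only the quant-method dispatch suggested by the reviewer is sketched."""

    def get_quant_method(self, layer: torch.nn.Module, prefix: str = ""):
        if isinstance(layer, FusedMoEStub):
            # MoE layers get the fused-MoE-aware method ...
            return GPTQMarlinMoEMethodStub(self)
        # ... while other layers keep the existing linear method.
        return GPTQMarlinLinearMethodStub(self)


config = GPTQMarlinConfigStub()
assert isinstance(config.get_quant_method(FusedMoEStub()),
                  GPTQMarlinMoEMethodStub)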