[DeepseekR1] running with multi nodes #819

Closed

Conversation

xuechendi

No description provided.

xuechendi and others added 21 commits on February 3, 2025 at 15:52
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Fix topk_group not supported issue

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
1. Move block_fp8 padding into load_weight
2. Move the MoE fp8 linear out of the per-slice loop (see the sketch below)
3. Remove the permute and reshape
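
A minimal sketch of item 2, assuming a per-slice MoE loop; the function and tensor names here are illustrative, not the PR's actual code (`w_fp8` stands in for an fp8 weight):

```python
import torch

def moe_fp8_linear_before(x, w_fp8, scale, n_slices):
    # Before: the weight was dequantized inside every slice iteration,
    # repeating the same loop-invariant work n_slices times.
    outs = []
    for s in x.chunk(n_slices):
        w = w_fp8.to(torch.float32) * scale  # loop-invariant dequantization
        outs.append(s @ w.t())
    return torch.cat(outs)

def moe_fp8_linear_after(x, w_fp8, scale, n_slices):
    # After: dequantize once, outside the loop; the loop body only
    # performs the cheap per-slice matmul.
    w = w_fp8.to(torch.float32) * scale
    return torch.cat([s @ w.t() for s in x.chunk(n_slices)])
```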

---------

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Co-authored-by: root <root@g3lc-srv32-c03d-idc.idc9.habana-labs.com>
Add VLLM_MOE_N_SLICE in test script and fix warmup bucket

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>

This PR implements DeepSeek V3 support by performing matrix absorption on the fp8 weights.
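
As a rough illustration of the absorption idea, here is a minimal numpy sketch in fp32 (the PR applies this to the fp8 MLA weights; the shapes and names below are assumptions, not the PR's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d_head, d_latent, seq = 128, 64, 16             # illustrative sizes only

W_uk = rng.standard_normal((d_head, d_latent))  # latent -> key up-projection
q = rng.standard_normal((seq, d_head))          # queries
c_kv = rng.standard_normal((seq, d_latent))     # compressed KV cache entries

# Naive path: decompress full-size keys from the latent cache, then score.
k = c_kv @ W_uk.T                # (seq, d_head)
scores_naive = q @ k.T           # (seq, seq)

# Absorbed path: fold W_uk into the query side once; the full-size
# keys are never materialized.
q_absorbed = q @ W_uk            # (seq, d_latent)
scores_absorbed = q_absorbed @ c_kv.T

assert np.allclose(scores_naive, scores_absorbed)
```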

---------

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Enable with environment variables:

VLLM_EP_SIZE=4
VLLM_MOE_N_SLICE=1
gpu_util=0.8 => needed for bs=96; otherwise it triggers an OOM issue

The EP size is carved out of the TP size. For example, with TP = 8 and EP = 4, the TP used inside the MoE layers is reduced to 2 (see the sketch below).
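
A minimal sketch of that relationship, assuming EP evenly divides TP (variable names are illustrative, not the PR's actual code):

```python
import os

tp_size = 8                                         # tensor-parallel world size
ep_size = int(os.environ.get("VLLM_EP_SIZE", "1"))  # expert-parallel size

# EP is carved out of TP, so it must divide it evenly.
assert tp_size % ep_size == 0, "VLLM_EP_SIZE must divide the TP size"

moe_tp_size = tp_size // ep_size  # TP=8, EP=4 -> TP inside MoE layers is 2
print(f"MoE layers: TP={moe_tp_size} within each of {ep_size} expert groups")
```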

---------

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
…512 in value cache (HabanaAI#804)

Previously we could only allocate 1854 blocks within 29.2 GB; now we can allocate 3156 blocks. Performance-wise there is no visible regression, and we can push to a higher batch_size or a longer context length.
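
A back-of-the-envelope check on those numbers, assuming the same 29.2 GB cache budget in both cases:

```python
budget_mb = 29.2 * 1024
blocks_before, blocks_after = 1854, 3156

print(budget_mb / blocks_before)     # ~16.1 MB per block before
print(budget_mb / blocks_after)      # ~9.5 MB per block after
print(blocks_after / blocks_before)  # ~1.70x more KV-cache blocks
```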

---------

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>