[DeepseekR1] running with multi nodes #819

Closed

Conversation

xuechendi

No description provided.

xuechendi and others added 21 commits on February 3, 2025 at 15:52
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Fix topk_group not supported issue

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
1. Move block_fp8 padding into load_weight
2. Move the MoE fp8 linear out of the per-slice loop (see the sketch below)
3. Remove the permute and reshape
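
A minimal sketch of item 2, assuming a per-slice MoE loop; the function and tensor names here are illustrative, not the PR's actual code (`w_fp8` stands in for an fp8 weight):

```python
import torch

def moe_fp8_linear_before(x, w_fp8, scale, n_slices):
    # Before: the weight was dequantized inside every slice iteration,
    # repeating the same loop-invariant work n_slices times.
    outs = []
    for s in x.chunk(n_slices):
        w = w_fp8.to(torch.float32) * scale  # loop-invariant dequantization
        outs.append(s @ w.t())
    return torch.cat(outs)

def moe_fp8_linear_after(x, w_fp8, scale, n_slices):
    # After: dequantize once, outside the loop; the loop body only
    # performs the cheap per-slice matmul.
    w = w_fp8.to(torch.float32) * scale
    return torch.cat([s @ w.t() for s in x.chunk(n_slices)])
```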

---------

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Co-authored-by: root <root@g3lc-srv32-c03d-idc.idc9.habana-labs.com>
Add VLLM_MOE_N_SLICE in test script and fix warmup bucket

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>

This PR implements DeepSeek V3 support by performing matrix absorption on the fp8 weights.
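
As a rough illustration of the absorption idea, here is a minimal numpy sketch in fp32 (the PR applies this to the fp8 MLA weights; the shapes and names below are assumptions, not the PR's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d_head, d_latent, seq = 128, 64, 16             # illustrative sizes only

W_uk = rng.standard_normal((d_head, d_latent))  # latent -> key up-projection
q = rng.standard_normal((seq, d_head))          # queries
c_kv = rng.standard_normal((seq, d_latent))     # compressed KV cache entries

# Naive path: decompress full-size keys from the latent cache, then score.
k = c_kv @ W_uk.T                # (seq, d_head)
scores_naive = q @ k.T           # (seq, seq)

# Absorbed path: fold W_uk into the query side once; the full-size
# keys are never materialized.
q_absorbed = q @ W_uk            # (seq, d_latent)
scores_absorbed = q_absorbed @ c_kv.T

assert np.allclose(scores_naive, scores_absorbed)
```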

---------

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Enable with environment variables:

VLLM_EP_SIZE=4
VLLM_MOE_N_SLICE=1
gpu_util=0.8 => needed for bs=96; otherwise it triggers an OOM issue

The EP size is carved out of the TP size. For example, with TP = 8 and EP = 4, the TP used inside the MoE layers is reduced to 2 (see the sketch below).
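
A minimal sketch of that relationship, assuming EP evenly divides TP (variable names are illustrative, not the PR's actual code):

```python
import os

tp_size = 8                                         # tensor-parallel world size
ep_size = int(os.environ.get("VLLM_EP_SIZE", "1"))  # expert-parallel size

# EP is carved out of TP, so it must divide it evenly.
assert tp_size % ep_size == 0, "VLLM_EP_SIZE must divide the TP size"

moe_tp_size = tp_size // ep_size  # TP=8, EP=4 -> TP inside MoE layers is 2
print(f"MoE layers: TP={moe_tp_size} within each of {ep_size} expert groups")
```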

---------

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
…512 in value cache (HabanaAI#804)

Previously we could only allocate 1854 blocks within 29.2 GB; now we can allocate 3156 blocks. Performance-wise there is no visible regression, and we can push to a higher batch_size or a longer context length.
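
A back-of-the-envelope check on those numbers, assuming the same 29.2 GB cache budget in both cases:

```python
budget_mb = 29.2 * 1024
blocks_before, blocks_after = 1854, 3156

print(budget_mb / blocks_before)     # ~16.1 MB per block before
print(budget_mb / blocks_after)      # ~9.5 MB per block after
print(blocks_after / blocks_before)  # ~1.70x more KV-cache blocks
```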

---------

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>