Add MLA and YarnRope configs in DeepSeek v3 #1288
Conversation
LGTM! Just one comment about sequence length.
@@ -34,5 +34,13 @@ shared_experts: 1
routed_scaling_factor: 2.5
routed_score_func: "sigmoid"
routed_bias: True
rope_max_timescale: 10_000
max_target_length: 16384 # 4096 * 4
Do we need this in the model config? Or could we just pass it in the script during the run?
DeepSeek provides the default in its model config (https://huggingface.co/deepseek-ai/DeepSeek-V3-Base/blob/main/inference/model.py#L22), but yes, we can just provide it as an argument during training or inference. I will remove it.
I think those are default values. For instance, it has vocab_size: int = 102400, while DeepSeek v3's actual vocab size is 129280. Yeah, let's remove this from the model config.
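For context, MaxText accepts key=value overrides of base.yml settings on the command line, so the sequence length can be supplied at run time instead of in the model config. An illustrative override, mirroring the style of the test command in the Tests section below (the run_name here is just a placeholder):
python3 MaxText/train.py MaxText/configs/base.yml base_output_directory=/tmp/ run_name=deepseek_mla_test model_name=deepseek3-671b max_target_length=16384 steps=5 dataset_type=synthetic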
What's the sharding_tolerance in your cmd for the test?
sharding_tolerance is a float between 0.0 and 1.0 representing the allowed percentage of non-sharded parameters. As I tested on v5p-8, I had to reduce the number of layers (to 5) to fit the model and to increase sharding_tolerance; it's not needed for the full-scale model. Otherwise maxtext_utils.assert_params_sufficiently_sharded(state.params, mesh, config.sharding_tolerance) fails with "The vocab tensor(s) of shape [vocab, embed] (and transpose) are not sharded by stage".
I see. Makes sense.
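For readers unfamiliar with that check, here is a rough sketch of the kind of tolerance-based assertion described above. It is not the actual maxtext_utils implementation; the function name and the "fully replicated" test are illustrative assumptions.

# Rough sketch (not the actual MaxText code) of a tolerance-based sharding check.
# Assumes params is a pytree of jax.Array leaves whose .sharding can be inspected.
import jax

def check_params_sufficiently_sharded(params, tolerance):
    leaves = jax.tree_util.tree_leaves(params)
    total = sum(p.size for p in leaves)
    # Count a leaf as unsharded if it is fully replicated across the mesh.
    unsharded = sum(p.size for p in leaves if p.sharding.is_fully_replicated)
    frac = unsharded / max(total, 1)
    assert frac <= tolerance, (
        f"{frac:.2%} of parameter values are unsharded, "
        f"which exceeds sharding_tolerance={tolerance}"
    )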
Force-pushed from bfb982a to 4a82219 with commits:
- Fix MLA shape constraints
- Fix formatting
- Update deepseek v3 config
- Use dot product attention for MLA
- Update deepseek test config
- Fix MoE test format
Description
Add MLA and YarnRope configs in DeepSeek v3
Tests
Run at small scale with:
python3 MaxText/train.py MaxText/configs/base.yml \
  base_output_directory=/tmp/ run_name=deepseek_training per_device_batch_size=4 \
  enable_checkpointing=false async_checkpointing=false \
  model_name=deepseek3-671b ici_fsdp_parallelism=4 steps=5 \
  tokenizer_path=assets/tokenizer.mistral-v1 attention=dot_product \
  dtype=bfloat16 weight_dtype=bfloat16 dataset_type=synthetic \
  sparse_matmul=True megablox=True sharding_tolerance=0.3
Checklist
Before submitting this PR, please make sure (put X in square brackets):