Training Improvements: MultipackV2, Statistics, Mock Data #483


Draft · RobotSail wants to merge 4 commits into main

Conversation

RobotSail (Member) commented Apr 20, 2025

Adds a number of enhancements to improve training performance, provide clarity on training times, and enable more robust experimentation.

Multipack V2

Multipack V2 has been tested as a batch sampler and found to improve training throughput, particularly when training long-context models. However, it only supports padding-free models; models that are not padding-free would need to continue using Multipack V1.
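
For intuition, here is a minimal sketch of the packing idea behind Multipack: fill each batch with sample indices until a token budget (max batch len) is reached. The function name and greedy strategy are illustrative only; the actual MultipackDistributedSamplerV2 also balances load across distributed ranks and uses a smarter packing algorithm.

```python
# Minimal sketch of the multipack idea: pack variable-length samples into
# batches under a token budget. Illustrative only -- the PR's
# MultipackDistributedSamplerV2 additionally balances batches across
# distributed ranks and packs more efficiently.
def pack_batches(lengths: list[int], max_batch_len: int):
    """Yield lists of sample indices whose total length fits the budget."""
    # Visit samples longest-first so long sequences are not stranded.
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    batch, batch_tokens = [], 0
    for idx in order:
        if batch and batch_tokens + lengths[idx] > max_batch_len:
            yield batch
            batch, batch_tokens = [], 0
        batch.append(idx)
        batch_tokens += lengths[idx]
    if batch:
        yield batch

# Example: a 52k-token budget (the MBL used in the experiments below).
for batch in pack_batches([50_000, 1_200, 900, 30_000, 20_000], 52_000):
    print(batch)  # -> [0], [3, 4, 1], [2]
```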

Experimental setup

Constants:

GPUs: 8×A100
MBL (max batch len): 52k tokens
Distributed backend: FSDP
MSL (max sequence len): 50k tokens
Liger kernels: on
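
These constants map roughly onto config fields like the following. This is a best-effort sketch; the field names are assumptions about the TrainingArgs model, not verified against the repo.

```python
# Assumed mapping of the experiment constants onto TrainingArgs-style
# fields; names are best-effort assumptions, not the library's exact API.
config = dict(
    nproc_per_node=8,            # 8x A100
    max_batch_len=52_000,        # MBL
    distributed_backend="fsdp",
    max_seq_len=50_000,          # MSL
    use_liger=True,              # Liger kernels on
)
```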

Measured variables:

  • Time per epoch: the bias-corrected exponential moving average (EMA) of the elapsed time per step, multiplied by the total number of batches, with $\alpha = 0.9$
  • Throughput: zero-corrected (i.e., corrected for the zero initialization) EMA of per-step throughput with $\alpha = 0.9$; the estimator is sketched below the results table
  • Total tokens: the sum of all sample lengths in the processed dataset
  • Average token len: the mean sample length in the processed dataset
| Distributed Sampler | Dataset | Num samples | Time per epoch | Throughput | Total tokens | Average token len |
|---|---|---|---|---|---|---|
| Multipack V1 | 1.4 skills-v1 (baseline short-context) | 346,432 | 4.5 hours | 21.0 | 406,470,537 | 1,173.31 |
| Multipack V2 | 1.4 skills-v1 (baseline short-context) | 346,432 | 4.5 hours | 21.0 | 406,470,537 | 1,173.31 |
| Multipack V1 | 1.5 skills-v2 | 374,821 | 14.2 hours | 7.50 | 669,853,984 | 1,801.76 |
| Multipack V2 | 1.5 skills-v2 | 374,821 | 13.1 hours | 8.44 | 669,853,984 | 1,801.76 |
| Multipack V1 | 1.5 skill-v2-reduced | 163,561 | 5.8 hours | 7.87 | 295,960,957 | 1,809.48 |
| Multipack V2 | 1.5 skill-v2-reduced | 163,561 | 5.0 hours | 9.56 | 295,960,957 | 1,809.48 |
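
The moving-average estimates above follow the standard zero-initialization bias correction $\hat{m}_t = m_t / (1 - \alpha^t)$ (as used in Adam). A minimal sketch in Python, with illustrative values:

```python
ALPHA = 0.9  # EMA decay, matching the alpha used above

def update_ema(prev_ema: float, value: float, step: int) -> tuple[float, float]:
    """One EMA update plus bias correction for the zero initialization.

    prev_ema is the raw (uncorrected) EMA after step-1 updates; step is 1-based.
    Returns (raw_ema, bias_corrected_ema).
    """
    ema = ALPHA * prev_ema + (1 - ALPHA) * value
    corrected = ema / (1 - ALPHA**step)
    return ema, corrected

# Example: estimate time per epoch from per-step timings (values illustrative).
step_times = [12.0, 11.5, 11.8, 12.2]  # seconds per step
ema = 0.0
for step, t in enumerate(step_times, start=1):
    ema, est_step_time = update_ema(ema, t, step)

total_batches = 10_000  # illustrative
print(f"estimated epoch time: {est_step_time * total_batches / 3600:.2f} hours")
```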


Todos:

  • Clean up API
  • Add MIT license & attribution for MultipackDistributedSamplerV2
  • Add logic to verify that the model will be trained padding-free when MultipackDistributedSamplerV2 is selected (a sketch of such a guard follows below)
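
One possible shape for that guard, as a hedged sketch; the function and argument names here are hypothetical, not the library's actual API:

```python
# Hypothetical guard: reject Multipack V2 unless training is padding-free.
# Function and argument names are illustrative, not the library's actual API.
def validate_sampler_choice(sampler: str, use_padding_free: bool) -> None:
    if sampler == "multipack_v2" and not use_padding_free:
        raise ValueError(
            "MultipackDistributedSamplerV2 only supports padding-free "
            "training; enable padding-free mode or fall back to Multipack V1."
        )
```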

Signed-off-by: Oleg Silkin <97077423+RobotSail@users.noreply.github.com>
Signed-off-by: --global <97077423+RobotSail@users.noreply.github.com>
@RobotSail changed the title from "use correct field name for mock_data_len" to "Training Improvements: MultipackV2, Statistics, Mock Data, Gradient Checkpointing" on Apr 21, 2025
@RobotSail changed the title from "Training Improvements: MultipackV2, Statistics, Mock Data, Gradient Checkpointing" to "Training Improvements: MultipackV2, Statistics, Mock Data" on Apr 21, 2025
@RobotSail marked this pull request as draft on April 21, 2025 05:15
@@ -228,3 +229,9 @@ class TrainingArgs(BaseModel):
default=False,
description="Whether to use Liger kernels for training.",
)

# TODO(osilkin): Create a better API for this, should not merge into library this way
A contributor commented:
I know Fynn has been working on "SDK-ifying" the sampler specifically, @RobotSail; maybe we should sync on this with the training team.
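
One possible direction for that API, as a purely speculative sketch; the field name, values, and defaults below are not what the PR or library actually define:

```python
# Speculative sketch of a cleaner API for selecting the sampler via
# TrainingArgs; the field name and allowed values are illustrative only.
from pydantic import BaseModel, Field

class TrainingArgs(BaseModel):
    ...  # existing fields elided

    batch_sampler: str = Field(
        default="multipack_v1",
        description=(
            "Which batch sampler to use: 'multipack_v1' (default) or "
            "'multipack_v2' (padding-free models only)."
        ),
    )
```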
