Training Improvements: MultipackV2, Statistics, Mock Data #483


Draft · RobotSail wants to merge 4 commits into main

Conversation

RobotSail (Member) commented Apr 20, 2025

Adds a number of enhancements to improve training performance, provide clarity on training times, and enable more robust experimentation.

Multipack V2

Multipack V2 has been tested as a batch sampler and found to improve training throughput, particularly when training long-context models. However, it only supports padding-free models; models that are not padding-free would need to continue using Multipack V1.
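
For intuition, here is a minimal sketch of the packing idea behind Multipack: fill each batch with sample indices until a token budget (max batch len) is reached. The function name and greedy strategy are illustrative only; the actual MultipackDistributedSamplerV2 also balances load across distributed ranks and uses a smarter packing algorithm.

```python
# Minimal sketch of the multipack idea: pack variable-length samples into
# batches under a token budget. Illustrative only -- the PR's
# MultipackDistributedSamplerV2 additionally balances batches across
# distributed ranks and packs more efficiently.
def pack_batches(lengths: list[int], max_batch_len: int):
    """Yield lists of sample indices whose total length fits the budget."""
    # Visit samples longest-first so long sequences are not stranded.
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    batch, batch_tokens = [], 0
    for idx in order:
        if batch and batch_tokens + lengths[idx] > max_batch_len:
            yield batch
            batch, batch_tokens = [], 0
        batch.append(idx)
        batch_tokens += lengths[idx]
    if batch:
        yield batch

# Example: a 52k-token budget (the MBL used in the experiments below).
for batch in pack_batches([50_000, 1_200, 900, 30_000, 20_000], 52_000):
    print(batch)  # -> [0], [3, 4, 1], [2]
```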

Experimental setup

Constants:

GPUs: 8×A100
MBL (max batch len): 52k tokens
Distributed backend: FSDP
MSL (max sequence len): 50k tokens
Liger kernels: on
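
These constants map roughly onto config fields like the following. This is a best-effort sketch; the field names are assumptions about the TrainingArgs model, not verified against the repo.

```python
# Assumed mapping of the experiment constants onto TrainingArgs-style
# fields; names are best-effort assumptions, not the library's exact API.
config = dict(
    nproc_per_node=8,            # 8x A100
    max_batch_len=52_000,        # MBL
    distributed_backend="fsdp",
    max_seq_len=50_000,          # MSL
    use_liger=True,              # Liger kernels on
)
```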

Measured variables:

  • Time per epoch: the bias-corrected exponential moving average (EMA) of the elapsed time per step, multiplied by the total number of batches, with $\alpha = 0.9$
  • Throughput: zero-corrected (i.e., corrected for the zero initialization) EMA of per-step throughput with $\alpha = 0.9$; the estimator is sketched below the results table
  • Total tokens: the sum of all sample lengths in the processed dataset
  • Average token len: the mean sample length in the processed dataset
| Distributed Sampler | Dataset | Num samples | Time per epoch | Throughput | Total tokens | Average token len |
|---|---|---|---|---|---|---|
| Multipack V1 | 1.4 skills-v1 (baseline short-context) | 346,432 | 4.5 hours | 21.0 | 406,470,537 | 1,173.31 |
| Multipack V2 | 1.4 skills-v1 (baseline short-context) | 346,432 | 4.5 hours | 21.0 | 406,470,537 | 1,173.31 |
| Multipack V1 | 1.5 skills-v2 | 374,821 | 14.2 hours | 7.50 | 669,853,984 | 1,801.76 |
| Multipack V2 | 1.5 skills-v2 | 374,821 | 13.1 hours | 8.44 | 669,853,984 | 1,801.76 |
| Multipack V1 | 1.5 skill-v2-reduced | 163,561 | 5.8 hours | 7.87 | 295,960,957 | 1,809.48 |
| Multipack V2 | 1.5 skill-v2-reduced | 163,561 | 5.0 hours | 9.56 | 295,960,957 | 1,809.48 |
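
The moving-average estimates above follow the standard zero-initialization bias correction $\hat{m}_t = m_t / (1 - \alpha^t)$ (as used in Adam). A minimal sketch in Python, with illustrative values:

```python
ALPHA = 0.9  # EMA decay, matching the alpha used above

def update_ema(prev_ema: float, value: float, step: int) -> tuple[float, float]:
    """One EMA update plus bias correction for the zero initialization.

    prev_ema is the raw (uncorrected) EMA after step-1 updates; step is 1-based.
    Returns (raw_ema, bias_corrected_ema).
    """
    ema = ALPHA * prev_ema + (1 - ALPHA) * value
    corrected = ema / (1 - ALPHA**step)
    return ema, corrected

# Example: estimate time per epoch from per-step timings (values illustrative).
step_times = [12.0, 11.5, 11.8, 12.2]  # seconds per step
ema = 0.0
for step, t in enumerate(step_times, start=1):
    ema, est_step_time = update_ema(ema, t, step)

total_batches = 10_000  # illustrative
print(f"estimated epoch time: {est_step_time * total_batches / 3600:.2f} hours")
```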


Todos:

  • Clean up API
  • Add MIT license & attribution for MultipackDistributedSamplerV2
  • Add logic to verify that the model will be trained padding-free when MultipackDistributedSamplerV2 is selected (a sketch of such a guard follows below)
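
One possible shape for that guard, as a hedged sketch; the function and argument names here are hypothetical, not the library's actual API:

```python
# Hypothetical guard: reject Multipack V2 unless training is padding-free.
# Function and argument names are illustrative, not the library's actual API.
def validate_sampler_choice(sampler: str, use_padding_free: bool) -> None:
    if sampler == "multipack_v2" and not use_padding_free:
        raise ValueError(
            "MultipackDistributedSamplerV2 only supports padding-free "
            "training; enable padding-free mode or fall back to Multipack V1."
        )
```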

Signed-off-by: Oleg Silkin <97077423+RobotSail@users.noreply.github.com>
Signed-off-by: --global <97077423+RobotSail@users.noreply.github.com>
@RobotSail changed the title from "use correct field name for mock_data_len" to "Training Improvements: MultipackV2, Statistics, Mock Data, Gradient Checkpointing" on Apr 21, 2025
@RobotSail changed the title from "Training Improvements: MultipackV2, Statistics, Mock Data, Gradient Checkpointing" to "Training Improvements: MultipackV2, Statistics, Mock Data" on Apr 21, 2025
@RobotSail marked this pull request as draft on April 21, 2025 05:15
@@ -228,3 +229,9 @@ class TrainingArgs(BaseModel):
default=False,
description="Whether to use Liger kernels for training.",
)

# TODO(osilkin): Create a better API for this, should not merge into library this way
A contributor commented:
I know Fynn has been working on "SDK-ifying" the sampler specifically, @RobotSail; maybe we should sync on this with the training team.
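
One possible direction for that API, as a purely speculative sketch; the field name, values, and defaults below are not what the PR or library actually define:

```python
# Speculative sketch of a cleaner API for selecting the sampler via
# TrainingArgs; the field name and allowed values are illustrative only.
from pydantic import BaseModel, Field

class TrainingArgs(BaseModel):
    ...  # existing fields elided

    batch_sampler: str = Field(
        default="multipack_v1",
        description=(
            "Which batch sampler to use: 'multipack_v1' (default) or "
            "'multipack_v2' (padding-free models only)."
        ),
    )
```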
