Support of generate #261
base: main
Conversation
Next steps are to create tests and add docs.
There are 3 open questions:
As a side note, to support downstream evals during training, we do not necessarily need …
Let's keep the existing format and not use a new rule.
The …
I don't really see the point of …
I just added future parity with … However, I meant a different thing here: Fast-LLM's … That said, @tscholak is okay with using tuples as keys in …
We still need to support different global batch sizes for the same … For example, consider a training scenario with: …
This configuration uses … However, when performing evaluation during training, we cannot use the same … Additionally, …
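To make the constraint concrete, here is a minimal sketch with hypothetical field names (not Fast-LLM's actual config schema): even for the same model, training and in-training evaluation typically want different batch settings.

```python
# Hypothetical field names, for illustration only.
train_batch = dict(global_batch_size=512, micro_batch_size=16, sequential_micro_batches=4)
eval_batch = dict(global_batch_size=32, micro_batch_size=16)  # smaller, no gradient accumulation
```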
Yeah, you're right: …
The tuple and dot formats are basically the same; one is just the parsed form and the other the unparsed form. I don't think there is much use for the unparsed version outside of a runnable, because dicts are better in-code.
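To illustrate the two forms (hypothetical helper, not Fast-LLM's actual parsing code): the dotted "unparsed" key is convenient on a command line, while the tuple/nested "parsed" form is what code actually works with.

```python
# Hypothetical example: apply a dotted "unparsed" key as a nested dict update.
def set_by_dotted_key(config: dict, dotted_key: str, value) -> None:
    *parents, leaf = dotted_key.split(".")
    node = config
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value

config = {}
set_by_dotted_key(config, "batch.micro_batch_size", 16)
assert config == {"batch": {"micro_batch_size": 16}}

# The "parsed" equivalent is just the tuple key ("batch", "micro_batch_size").
parsed_key = tuple("batch.micro_batch_size".split("."))
```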
Again, … Note that sequential micro-batches are irrelevant for inference; they're just separate batches for all practical purposes.
Thanks for the details.
Still, if you pass a …
OK, I'll remove the batch-config-related parameters from the constructor.
I needed to downgrade …
Does this relate to #249?
Should be a separate issue, as … However, this branch is from the May 12 main and it builds docs on GitHub.
tests/test_gpt_generate.py (Outdated)

@pytest.fixture(scope="module")
def model_and_tokenizer():
    model = "HuggingFaceTB/SmolLM2-135M-Instruct"
How long does the test take to run? (Including the download time)
Around a minute with the download, and around 30 seconds if already downloaded.
That's too long even by slow-test standards. Can we find a smaller model for testing, or make a mock one?
Looks good, only need to address the documentation and slow test issues.
Concerning the doc, I pushed a tentative fix in #271, but we won't know if it works until we merge the PR.
Concerning the tests, I suggest you add one with a dummy checkpoint (ex. the one from test_checkpoint), and keep the existing one with a skip mark. These longer tests do make sense, and I'd like us to find a way to bring them back without disrupting existing workflows.
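One way to keep the long test without running it by default, as a sketch assuming plain pytest (the test name here is hypothetical; `model_and_tokenizer` is the fixture quoted above):

```python
import pytest

# A fast companion test can use a dummy checkpoint (e.g. the one from test_checkpoint).
# The existing full-model test is kept, but skipped by default because it downloads
# SmolLM2-135M-Instruct and takes ~30-60 s even when cached.
@pytest.mark.skip(reason="Slow: downloads and runs a real model; run manually when needed.")
def test_generate_smollm2(model_and_tokenizer):
    ...
```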
@@ -0,0 +1,77 @@
---
This won't be published because of #249. I think the problem is missing variables in Fast-LLM/.github/workflows/docs.yaml, line 59 in 3ac976b:
FLASH_ATTENTION_SKIP_CUDA_BUILD=TRUE FLASH_ATTENTION_FORCE_BUILD=TRUE pip install --no-build-isolation -e ".[CORE,OPTIONAL,DEV,DOCS]"
(like those in Fast-LLM/.github/workflows/docs.yaml, line 34 in 3ac976b:
FLASH_ATTENTION_SKIP_CUDA_BUILD=TRUE FLASH_ATTENTION_FORCE_BUILD=TRUE MAMBA_SKIP_CUDA_BUILD=TRUE MAMBA_FORCE_BUILD=TRUE CAUSAL_CONV1D_FORCE_BUILD=TRUE CAUSAL_CONV1D_SKIP_CUDA_BUILD=TRUE pip install --no-build-isolation -e ".[CORE,OPTIONAL,DEV,DOCS]")
LGTM!
We can postpone finding a smaller model for the test to another time. The current test should be kept but not run by default.
✨ Description
Adds support for `generate` and extends support for `forward`, without handling `cache`, `past_key_values`, `labels`, `attention` output, or `inputs_embeds`. `position_ids` are ignored and reconstructed from the attention mask. Currently, only data-parallel generation is supported.
Closes #217
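For reference, a common way to rebuild position IDs from an attention mask (the pattern used by many Hugging Face models; an illustrative sketch, not necessarily the exact code in this PR):

```python
import torch

# Left-padded batch: 0 marks padding, 1 marks real tokens.
attention_mask = torch.tensor([[0, 0, 1, 1, 1]])

# Cumulative sum gives 0-based positions for real tokens;
# padded positions get a dummy value (they are masked out anyway).
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)
# position_ids -> tensor([[1, 1, 0, 1, 2]])
```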
🔍 Type of change
Select all that apply:
📝 Changes
List the key changes introduced in this PR:
- Extended the `HuggingfacePreTrainedModel` interface to achieve feature parity with `FastLLMModel.from_pretrained`, and modified the `__init__` method to optionally accept `runner` and `config`.
- Extended `forward` to support `generate`, aligning behavior with Hugging Face (see the sketch below).
- Added `generate`.
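For context, the Hugging Face `generate` semantics the wrapper aligns with look like this. This is a plain `transformers` example using the same small model referenced in the tests, not Fast-LLM's wrapper API:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Decoder-only generation expects left padding and a pad token.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
inputs = tokenizer(["Hello, my name is", "The capital of France is"],
                   return_tensors="pt", padding=True)

# Greedy, batched generation; position IDs are derived from the attention mask.
output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
```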
Make sure the following tasks are completed before submitting the PR:
General
Testing