Feat: Output Rails Streaming #966

Pouyanpi · 2025-02-02T17:49:04Z

Description

Example:

from nemoguardrails import RailsConfig, LLMRails

config = RailsConfig.from_path("./examples/configs/abc_streaming")

rails = LLMRails(config)

messages = [{"role": "user", "content": "what can you do"}]

async for chunk in rails.stream_async(messages=messages):
    print(f"CHUNK: {chunk}")

or using chat cli

nemoguardrails chat --config="./examples/configs/abc_streaming"

Todo

finalize streaming config parameter names
test functioanlity on microservice ✅
docs: output rails streaming #976 by @mikemckiernan
SSE
~~add config example with Nemoguard NIMs?~~
test multi output rails with Nemoguard and self check output

Pouyanpi · 2025-02-02T17:49:53Z

requires #965

github-actions · 2025-02-02T17:50:23Z

Documentation preview

https://nvidia.github.io/NeMo-Guardrails/review/pr-966

mikemckiernan

Unsolicited feedback. LMK if you'd prefer that I not provide unsolicited feedback.

examples/configs/abc_streaming/config.yml

nemoguardrails/actions/actions.py

nemoguardrails/actions/llm/generation.py

nemoguardrails/actions/output_mapping.py

nemoguardrails/rails/llm/buffer.py

Pouyanpi · 2025-02-03T15:21:45Z

Thank you @mikemckiernan 👍🏻 just don't review guardrails/library changes here, their PR is #965 .

I'll push changes wrt review soon, thx!

examples/configs/abc_streaming/config.yml

cparisien

This is looking good to me. Let's make sure we are clear on the "hierarchy" of the "streaming: true" configs and the default behaviour.
Also I'd like to make sure I have a clear understanding of which rails will support this behaviour now, and which won't.

cparisien · 2025-02-03T19:47:16Z

nemoguardrails/rails/llm/config.py

+    look_back_size: int = Field(default=5, description="The look back size.")
+    window_size: int = Field(default=10, description="The window size.")


Elaborate a bit on the description. e.g., how the look back size is how many tokens/characters in total are sent to the rail each time as context. Also give the units -- is this counted in characters?

size is always number of tokens (words).

I need to change the param names and also default value. These were used for testing

window_size is number of tokens that we apply the output rails on, so when the streaming starts it corresponds to first #window_size tokens, then in the next iteration(s) we use #look_back_size tokens from the previous iteration(s) but still respect the window_size.

@trebedea need your expert intuition here, what do you think is a good default here (ignore the defaults I've set)

Basically, we have 2 parameters:

The size of the token group processed at one time: This defines how many tokens are in each block that the output rails operate on.

The amount of previous tokens reused in the next iteration: This defines the carryover or context from the previous block that is blended into the current processing batch.

decision:
chunk_size instead of window_size
context_size instead of look_back_size

chunk_size and context_size are both clear and abstract, communicating exactly what they do: one determines the size of the token block and the other determines how much of the previouss block is reused.

fyr, @mikemckiernan @cparisien

I like those two field names--and I'm extremely picky. FYI: https://nvidia.github.io/NeMo-Guardrails/review/pr-976/user-guides/configuration-guide.html#enabling-streaming-output

Thanks I'm good with these new names too.

cparisien · 2025-02-03T19:47:49Z

nemoguardrails/rails/llm/config.py

+    window_size: int = Field(default=10, description="The window size.")
+    stream_first: bool = Field(
+        default=True,
+        description="Prioritizes streaming chunks before applying output rails.",


Not clear what this means?

stream chunk of tokens first, then apply output rails on them (pros: lower perceived TTFT)

or

apply the output rails on that chunk and then stream (pros: if the chunk is blocked then we don't stream blocked content)

is it clear?

Thanks that's clear, I expect we'll learn from early users how they experience it.

nemoguardrails/rails/llm/config.py

nemoguardrails/rails/llm/llmrails.py

examples/configs/abc_streaming/config.yml

Pouyanpi · 2025-02-03T20:48:03Z

Also I'd like to make sure I have a clear understanding of which rails will support this behaviour now, and which won't.

All the output rails that are defined in the library
any user defined rail that uses @action

Pouyanpi · 2025-02-04T15:37:48Z

Some thoughts on default values:

chuk_size: 256
context_size: 64

A chunk of about 200–256 tokens tends to be large enough to encapsulate a meaningful “unit” of generated text. This provides the rails with enough information? At the same time, it isn’t so large that increases TTFT latency of first (if stream_first: False) and subsequent iterations.

Summary

Chunk Size (200 tokens):
- Pros: Captures a complete, meaningful unit of text for effective rail evaluation.
- Cons: May introduce slight delay if waiting for the full chunk (when stream_first: False).
Context Size (50 tokens):
- Pros: Ensures continuity across chunks with minimal impact on latency.
- Cons: Too small a context might miss cross-chunk violations, but 25% overlap is generally a good compromise.

Input Length	Chunk Size	Context Size	Rails Invocation
512	256	64	3
600	256	64	3
256	256	64	1
1024	256	64	5
1024	256	32	5
1024	256	32	5
1024	128	32	11
512	128	32	5

cc @cparisien

Pouyanpi · 2025-02-04T16:08:52Z

remove abc_streaming example from configs

This commit introduces streaming support for output rails in the nemoguardrails configuration. It includes the following changes: - Added `OutputRailsStreamingConfig` class to define streaming configuration for output rails. - Updated `OutputRails` class to include streaming configuration. - Modified `LLMGenerationActions` to handle streaming for output rails. - Enhanced `LLMRails` to run output rails in streaming mode. - Implemented `BufferStrategy` for buffering and processing streamed tokens - Updated `StreamingHandler` to support asynchronous iteration.

fix: fix test given new implementation add generate_chunk_str test

fix tests

- Replaced direct string replacement with a partial function `get_action_name_from_flow_id` to get action names from flow ids

update llm rails is blocked

- Added `_get_last_context_message` to retrieve the last context message. - Modified `_get_latest_user_message` to return an empty dict instead of None. - Updated `_prepare_params` to include context message and action params. - Enhanced `_prepare_params` to handle placeholders in action params. - Replaced `get_action_name_from_flow_id` with `get_action_details_from_flow_id`. - Updated `_run_output_rails_in_streaming` to use `get_action_details`.

- Removed `enabled` flag from `examples/configs/abc_streaming/config.yml` - Updated `generation.py` to check for output flows instead of the `enabled` flag - Fixed typo in `llmrails.py` and updated streaming handling logic

Pouyanpi · 2025-02-04T17:30:53Z

tested ✅

 ...
 rails:
  input:
    flows:
      - topic safety check input $model=topic_control
  output:
    flows:
      - content safety check output $model=content_safety
      - self check output
    chunk_size: 10
  dialog:
    single_call:
      enabled: False

streaming: True

Documents NVIDIA#966 Signed-off-by: Mike McKiernan <mmckiernan@nvidia.com>

Refer to NVIDIA#966. Signed-off-by: Mike McKiernan <mmckiernan@nvidia.com>

remove them gosh

Pouyanpi · 2025-02-05T11:11:23Z

in chat cli we have following behavior:

stream_first: True

stream_first: False

for the stream_async interface

stream_first: True

async for chunk in rails.stream_async(messages=messages):
   print(f"CHUNK:{chunk}")

CHUNK:I
....
CHUNK: Is
CHUNK: there
CHUNK: anything
CHUNK: specific
CHUNK: you
CHUNK: would
CHUNK: like
CHUNK: to
CHUNK: know?
CHUNK:{"event": "ABORT", "data": {"reason": "Blocked by self check output rails."}}

stream_first: False

async for chunk in rails.stream_async(messages=messages):
   print(f"CHUNK:{chunk}")
  
CHUNK:{"event": "ABORT", "data": {"reason": "Blocked by self check output rails."}}

**The SSE is meant for the microservice. **

what do you think? @cparisien @mikemckiernan

Pouyanpi · 2025-02-05T11:25:04Z

in chat cli we have following behavior:

stream_first: True

stream_first: False

Here I can remove the output rails name if it is client facing and they should not be aware. Or we can have a -debug mode that also includes the output rails name.

mikemckiernan · 2025-02-05T13:46:26Z

CHUNK:{"event": "ABORT", "data": {"reason": "Blocked by self check output rails."}}

**The SSE is meant for the microservice. **

what do you think? @cparisien @mikemckiernan

You read my mind--I meant to ask the question "How does the client know if the partial response was unsafe and how does the client check?" I have no experience in this area, but the concept of an event has appeal. (We can cover this offline...but does the client check each chunk if it's a dict or str? If dict, it's bad news?) No strong feeling on making the rail name optional. Couldn't the client filter it? Would they want to log it (maybe logged elsewhere...)?

Pouyanpi · 2025-02-05T17:14:34Z

Nice!

(We can cover this offline...but does the client check each chunk if it's a dict or str? If dict, it's bad news?)

it is a json str so not a dict but can be converted to one if needed.

No strong feeling on making the rail name optional. Couldn't the client filter it? Would they want to log it (maybe logged elsewhere...)?

Agreed! We include the name 👍🏻

@mikemckiernan

This commit introduces streaming support for output rails in the nemoguardrails configuration. It includes the following changes: - Added `OutputRailsStreamingConfig` class to define streaming configuration for output rails. - Updated `OutputRails` class to include streaming configuration. - Modified `LLMGenerationActions` to handle streaming for output rails. - Enhanced `LLMRails` to run output rails in streaming mode. - Implemented `BufferStrategy` for buffering and processing streamed tokens - Implemented RollingBuffer - Updated `StreamingHandler` to support asynchronous iteration. - Replaced direct string replacement with a partial function `get_action_name_from_flow_id` to get action names from flow ids - Added `_get_last_context_message` to retrieve the last context message. - Modified `_get_latest_user_message` to return an empty dict instead of None. - Updated `_prepare_params` to include context message and action params. - Enhanced `_prepare_params` to handle placeholders in action params. - Replaced `get_action_name_from_flow_id` with `get_action_details_from_flow_id`. - feat: handle ABORT SSE in streaming output

Documents NVIDIA#966 Signed-off-by: Mike McKiernan <mmckiernan@nvidia.com>

Documents #966 Signed-off-by: Mike McKiernan <mmckiernan@nvidia.com>

Documents NVIDIA#966 Signed-off-by: Mike McKiernan <mmckiernan@nvidia.com>

Documents #966 Signed-off-by: Mike McKiernan <mmckiernan@nvidia.com>

Pouyanpi mentioned this pull request Feb 3, 2025

Feat: Unified Output Mapping for Actions #965

Merged

Pouyanpi self-assigned this Feb 3, 2025

Pouyanpi added enhancement New feature or request documentation Improvements or additions to documentation labels Feb 3, 2025

mikemckiernan reviewed Feb 3, 2025

View reviewed changes

Pouyanpi force-pushed the feat/output-rails-streaming-mvp branch from 4bcd6c2 to e7866ef Compare February 3, 2025 16:03

mikemckiernan reviewed Feb 3, 2025

View reviewed changes

examples/configs/abc_streaming/config.yml Outdated Show resolved Hide resolved

cparisien reviewed Feb 3, 2025

View reviewed changes

Pouyanpi force-pushed the feat/output-rails-streaming-mvp branch from eb91c2d to ae19b67 Compare February 4, 2025 11:48

Pouyanpi added 15 commits February 4, 2025 18:17

test: add unit tests for BufferStrategy in llmrails

b546e7a

fix: fix test given new implementation add generate_chunk_str test

feat: add example config

404a2ec

test: add self-check output streaming tests

5f9fad8

fix tests

feat: add get_action_name_from_flow_id function

bcf429c

- Replaced direct string replacement with a partial function `get_action_name_from_flow_id` to get action names from flow ids

feat: update OutputRailsStreamingConfig with new fields

0b84ce7

refactor: refactor code and also integrate output_mapping decorator

88a0fde

update llm rails is blocked

refactor(logging): change log level from warning to info

0b25a38

refactor: move buffer module to rails.llm directory

987e09d

refactor: extract output mapping logic to separate module

586515f

fix _get_action_details_from_flow

96e16fd

test: add tests for output mapping functions

17a9c8f

feat: yield tokens instead of chunks in streaming output

95f5570

fix: apply review suggestions by mike

01805e4

Pouyanpi added 3 commits February 4, 2025 18:17

refactor: update streaming config and handling

6374198

- Removed `enabled` flag from `examples/configs/abc_streaming/config.yml` - Updated `generation.py` to check for output flows instead of the `enabled` flag - Fixed typo in `llmrails.py` and updated streaming handling logic

feat: add state parameter to LLMRails stream method

edc834d

fix: finalize name changes and defautl values

c2fb77f

Pouyanpi force-pushed the feat/output-rails-streaming-mvp branch from e5c8693 to 655a352 Compare February 4, 2025 17:29

mikemckiernan added a commit to mikemckiernan/NeMo-Guardrails that referenced this pull request Feb 4, 2025

docs: Output streaming

98c003b

Documents NVIDIA#966 Signed-off-by: Mike McKiernan <mmckiernan@nvidia.com>

mikemckiernan added a commit to mikemckiernan/NeMo-Guardrails that referenced this pull request Feb 4, 2025

Provide guidance for specifying values

a83c49d

Refer to NVIDIA#966. Signed-off-by: Mike McKiernan <mmckiernan@nvidia.com>

Pouyanpi added 2 commits February 5, 2025 10:39

fix supported prefixes

06ffbc2

remove them gosh

fix: remove debug print stmts

4346671

Pouyanpi force-pushed the feat/output-rails-streaming-mvp branch from dfbfe55 to 4346671 Compare February 5, 2025 09:39

feat: handle ABORT SSE in streaming output

9a8ae0d

chore: delete abc_streaming config

56a3560

cparisien approved these changes Feb 5, 2025

View reviewed changes

Pouyanpi merged commit 6dcbd87 into develop Feb 5, 2025
6 checks passed

mikemckiernan added a commit to mikemckiernan/NeMo-Guardrails that referenced this pull request Feb 7, 2025

docs: Output streaming

eeb801d

Documents NVIDIA#966 Signed-off-by: Mike McKiernan <mmckiernan@nvidia.com>

mikemckiernan added a commit that referenced this pull request Feb 7, 2025

docs: Output streaming (#976)

5ba7c57

Documents #966 Signed-off-by: Mike McKiernan <mmckiernan@nvidia.com>

mikemckiernan added a commit to mikemckiernan/NeMo-Guardrails that referenced this pull request Feb 7, 2025

docs: Output streaming (NVIDIA#976)

81da538

Documents NVIDIA#966 Signed-off-by: Mike McKiernan <mmckiernan@nvidia.com>

mikemckiernan mentioned this pull request Feb 7, 2025

docs: Output streaming (#976) #985

Merged

4 tasks

Pouyanpi pushed a commit that referenced this pull request Feb 7, 2025

docs: Output streaming (#976) (#985)

c0687c3

Documents #966 Signed-off-by: Mike McKiernan <mmckiernan@nvidia.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feat: Output Rails Streaming #966

Feat: Output Rails Streaming #966

Pouyanpi commented Feb 2, 2025 •

edited

Loading

Pouyanpi commented Feb 2, 2025

github-actions bot commented Feb 2, 2025

mikemckiernan left a comment

Pouyanpi commented Feb 3, 2025

cparisien left a comment

cparisien Feb 3, 2025

Pouyanpi Feb 3, 2025 •

edited

Loading

Pouyanpi Feb 4, 2025

Pouyanpi Feb 4, 2025

Pouyanpi Feb 4, 2025

mikemckiernan Feb 4, 2025

cparisien Feb 5, 2025

cparisien Feb 3, 2025

Pouyanpi Feb 3, 2025

cparisien Feb 5, 2025

Pouyanpi commented Feb 3, 2025

Pouyanpi commented Feb 4, 2025

Pouyanpi commented Feb 4, 2025

Pouyanpi commented Feb 4, 2025

Pouyanpi commented Feb 5, 2025 •

edited

Loading

Pouyanpi commented Feb 5, 2025

mikemckiernan commented Feb 5, 2025

Pouyanpi commented Feb 5, 2025 •

edited

Loading

		look_back_size: int = Field(default=5, description="The look back size.")
		window_size: int = Field(default=10, description="The window size.")

Feat: Output Rails Streaming #966

Feat: Output Rails Streaming #966

Conversation

Pouyanpi commented Feb 2, 2025 • edited Loading

Description

Pouyanpi commented Feb 2, 2025

github-actions bot commented Feb 2, 2025

Documentation preview

mikemckiernan left a comment

Choose a reason for hiding this comment

Pouyanpi commented Feb 3, 2025

cparisien left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Pouyanpi Feb 3, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Pouyanpi commented Feb 3, 2025

Pouyanpi commented Feb 4, 2025

Pouyanpi commented Feb 4, 2025

Pouyanpi commented Feb 4, 2025

Pouyanpi commented Feb 5, 2025 • edited Loading

Pouyanpi commented Feb 5, 2025

mikemckiernan commented Feb 5, 2025

Pouyanpi commented Feb 5, 2025 • edited Loading

Pouyanpi commented Feb 2, 2025 •

edited

Loading

Pouyanpi Feb 3, 2025 •

edited

Loading

Pouyanpi commented Feb 5, 2025 •

edited

Loading

Pouyanpi commented Feb 5, 2025 •

edited

Loading