
Feat: Output Rails Streaming #966

Merged
merged 22 commits into develop from feat/output-rails-streaming-mvp on Feb 5, 2025

Conversation

Pouyanpi
Collaborator

@Pouyanpi Pouyanpi commented Feb 2, 2025

Description

Example:

from nemoguardrails import RailsConfig, LLMRails

config = RailsConfig.from_path("./examples/configs/abc_streaming")
rails = LLMRails(config)

messages = [{"role": "user", "content": "what can you do"}]

async for chunk in rails.stream_async(messages=messages):
    print(f"CHUNK: {chunk}")

or using the chat CLI:

nemoguardrails chat --config="./examples/configs/abc_streaming"

Todo

  • finalize streaming config parameter names
  • test functionality on microservice ✅
  • docs: output rails streaming #976 by @mikemckiernan
  • SSE
  • add config example with Nemoguard NIMs?
  • test multi output rails with Nemoguard and self check output

@Pouyanpi
Collaborator Author

Pouyanpi commented Feb 2, 2025

requires #965


github-actions bot commented Feb 2, 2025

Documentation preview

https://nvidia.github.io/NeMo-Guardrails/review/pr-966

@Pouyanpi Pouyanpi self-assigned this Feb 3, 2025
@Pouyanpi Pouyanpi added enhancement New feature or request documentation Improvements or additions to documentation labels Feb 3, 2025
Member

@mikemckiernan mikemckiernan left a comment


Unsolicited feedback. LMK if you'd prefer that I not provide unsolicited feedback.

examples/configs/abc_streaming/config.yml (outdated, resolved)
nemoguardrails/actions/actions.py (outdated, resolved)
nemoguardrails/actions/llm/generation.py (outdated, resolved)
nemoguardrails/actions/output_mapping.py (outdated, resolved)
nemoguardrails/rails/llm/buffer.py (outdated, resolved)
nemoguardrails/rails/llm/buffer.py (outdated, resolved)
@Pouyanpi
Collaborator Author

Pouyanpi commented Feb 3, 2025

Thank you @mikemckiernan 👍🏻 Just don't review the guardrails/library changes here; their PR is #965.

I'll push changes addressing the review soon, thanks!

@Pouyanpi Pouyanpi force-pushed the feat/output-rails-streaming-mvp branch from 4bcd6c2 to e7866ef Compare February 3, 2025 16:03
Collaborator

@cparisien cparisien left a comment


This is looking good to me. Let's make sure we are clear on the "hierarchy" of the "streaming: true" configs and the default behaviour.
Also I'd like to make sure I have a clear understanding of which rails will support this behaviour now, and which won't.

Comment on lines 311 to 312
look_back_size: int = Field(default=5, description="The look back size.")
window_size: int = Field(default=10, description="The window size.")
Collaborator


Elaborate a bit on the description. e.g., how the look back size is how many tokens/characters in total are sent to the rail each time as context. Also give the units -- is this counted in characters?

Collaborator Author

@Pouyanpi Pouyanpi Feb 3, 2025


Size is always a number of tokens (words).

I need to change the param names and also the default values; these were used for testing.

window_size is the number of tokens we apply the output rails on: when streaming starts it corresponds to the first window_size tokens, then in the next iteration(s) we reuse look_back_size tokens from the previous iteration(s) while still respecting the window_size.

Collaborator Author


@trebedea need your expert intuition here, what do you think is a good default here (ignore the defaults I've set)

Collaborator Author


Basically, we have 2 parameters:

  1. The size of the token group processed at one time: This defines how many tokens are in each block that the output rails operate on.
  2. The amount of previous tokens reused in the next iteration: This defines the carryover or context from the previous block that is blended into the current processing batch.

decision:
chunk_size instead of window_size
context_size instead of look_back_size

chunk_size and context_size are both clear and abstract, communicating exactly what they do: one determines the size of the token block and the other determines how much of the previous block is reused.
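The carryover scheme described above can be sketched as a simple rolling window (illustrative only, not the actual `BufferStrategy` implementation; only the names `chunk_size` and `context_size` come from the decision above):

```python
# Sketch of the rolling chunking described above.
# chunk_size: tokens per block that the output rails operate on.
# context_size: tokens carried over from the previous block.
# Assumes chunk_size > context_size.

def rolling_chunks(tokens, chunk_size=256, context_size=64):
    """Yield overlapping token blocks; each block after the first
    reuses the last `context_size` tokens of the previous block."""
    stride = chunk_size - context_size
    start = 0
    while start < len(tokens):
        yield tokens[start:start + chunk_size]
        if start + chunk_size >= len(tokens):
            break
        start += stride

tokens = [f"t{i}" for i in range(512)]
blocks = list(rolling_chunks(tokens))
# 512 tokens with chunk_size=256, context_size=64 yields 3 blocks.
```

With these parameters, consecutive blocks share a 64-token overlap, so cross-chunk content is seen by the rails twice.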

Collaborator

Thanks I'm good with these new names too.

window_size: int = Field(default=10, description="The window size.")
stream_first: bool = Field(
    default=True,
    description="Prioritizes streaming chunks before applying output rails.",
)
Collaborator


Not clear what this means?

Collaborator Author


Stream the chunk of tokens first, then apply output rails on them (pro: lower perceived TTFT)

or

apply the output rails on that chunk and then stream it (pro: if the chunk is blocked, we don't stream blocked content).

Is it clear?
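The two orderings can be sketched with a toy async generator (illustrative only; `check_rails`, the chunk source, and the abort payload shape are stand-ins, not the actual NeMo Guardrails API):

```python
import asyncio

async def stream_with_rails(chunks, check_rails, stream_first=True):
    """Yield chunks, applying an output-rails check per chunk.

    stream_first=True:  yield the chunk, then check it (lower perceived
                        TTFT, but a blocked chunk has already been shown).
    stream_first=False: check the chunk first, so blocked content is
                        never streamed (higher perceived TTFT).
    """
    for chunk in chunks:
        if stream_first:
            yield chunk
            if not check_rails(chunk):
                yield '{"event": "ABORT"}'
                return
        else:
            if not check_rails(chunk):
                yield '{"event": "ABORT"}'
                return
            yield chunk

async def main():
    chunks = ["hello ", "world ", "BAD "]
    ok = lambda c: "BAD" not in c
    first = [c async for c in stream_with_rails(chunks, ok, stream_first=True)]
    safe = [c async for c in stream_with_rails(chunks, ok, stream_first=False)]
    return first, safe

first, safe = asyncio.run(main())
# With stream_first=True the bad chunk is streamed before the abort;
# with stream_first=False the abort arrives without the bad chunk.
```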

Collaborator


Thanks that's clear, I expect we'll learn from early users how they experience it.

nemoguardrails/rails/llm/config.py (outdated, resolved)
nemoguardrails/rails/llm/llmrails.py (outdated, resolved)
examples/configs/abc_streaming/config.yml (outdated, resolved)
@Pouyanpi
Collaborator Author

Pouyanpi commented Feb 3, 2025

Also I'd like to make sure I have a clear understanding of which rails will support this behaviour now, and which won't.

  • All the output rails that are defined in the library
  • any user-defined rail that uses @action

@Pouyanpi Pouyanpi force-pushed the feat/output-rails-streaming-mvp branch from eb91c2d to ae19b67 Compare February 4, 2025 11:48
@Pouyanpi
Collaborator Author

Pouyanpi commented Feb 4, 2025

Some thoughts on default values:

chunk_size: 256
context_size: 64

A chunk of about 200–256 tokens tends to be large enough to encapsulate a meaningful “unit” of generated text, which should give the rails enough information. At the same time, it isn’t so large that it increases the TTFT latency of the first iteration (if stream_first: False) and subsequent iterations.

Summary

  • Chunk Size (256 tokens):
    • Pros: Captures a complete, meaningful unit of text for effective rail evaluation.
    • Cons: May introduce a slight delay if waiting for the full chunk (when stream_first: False).
  • Context Size (64 tokens):
    • Pros: Ensures continuity across chunks with minimal impact on latency.
    • Cons: Too small a context might miss cross-chunk violations, but 25% overlap is generally a good compromise.
| Input Length | Chunk Size | Context Size | Rails Invocations |
|---|---|---|---|
| 512 | 256 | 64 | 3 |
| 600 | 256 | 64 | 3 |
| 256 | 256 | 64 | 1 |
| 1024 | 256 | 64 | 5 |
| 1024 | 256 | 32 | 5 |
| 1024 | 128 | 32 | 11 |
| 512 | 128 | 32 | 5 |
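The counts above follow from the chunking scheme: each invocation after the first advances by chunk_size - context_size tokens. A small helper (a sketch under that assumption, not the library's code) reproduces them:

```python
import math

def rails_invocations(input_len, chunk_size, context_size):
    """Number of output-rails invocations for a streamed response,
    assuming each iteration advances by chunk_size - context_size tokens."""
    if input_len <= chunk_size:
        return 1
    stride = chunk_size - context_size
    return math.ceil((input_len - chunk_size) / stride) + 1

for row in [(512, 256, 64), (600, 256, 64), (256, 256, 64),
            (1024, 256, 64), (1024, 256, 32),
            (1024, 128, 32), (512, 128, 32)]:
    print(row, "->", rails_invocations(*row))
```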

cc @cparisien

@Pouyanpi
Collaborator Author

Pouyanpi commented Feb 4, 2025

  • remove abc_streaming example from configs

This commit introduces streaming support for output rails in the
nemoguardrails configuration. It includes the following changes:
- Added `OutputRailsStreamingConfig` class to define streaming
  configuration for output rails.
- Updated `OutputRails` class to include streaming configuration.
- Modified `LLMGenerationActions` to handle streaming for output rails.
- Enhanced `LLMRails` to run output rails in streaming mode.
- Implemented `BufferStrategy` for buffering and processing streamed tokens
- Updated `StreamingHandler` to support asynchronous iteration.
fix: fix test given new implementation

add generate_chunk_str test
- Replaced direct string replacement with a partial function `get_action_name_from_flow_id` to get action names from flow ids
- Added `_get_last_context_message` to retrieve the last context message.
- Modified `_get_latest_user_message` to return an empty dict instead of None.
- Updated `_prepare_params` to include context message and action params.
- Enhanced `_prepare_params` to handle placeholders in action params.
- Replaced `get_action_name_from_flow_id` with `get_action_details_from_flow_id`.
- Updated `_run_output_rails_in_streaming` to use `get_action_details`.
- Removed `enabled` flag from `examples/configs/abc_streaming/config.yml`
- Updated `generation.py` to check for output flows instead of the
  `enabled` flag
- Fixed typo in `llmrails.py` and updated streaming handling logic
@Pouyanpi Pouyanpi force-pushed the feat/output-rails-streaming-mvp branch from e5c8693 to 655a352 Compare February 4, 2025 17:29
@Pouyanpi
Collaborator Author

Pouyanpi commented Feb 4, 2025

tested ✅

...
rails:
  input:
    flows:
      - topic safety check input $model=topic_control
  output:
    flows:
      - content safety check output $model=content_safety
      - self check output
    chunk_size: 10
  dialog:
    single_call:
      enabled: False

streaming: True

mikemckiernan added a commit to mikemckiernan/NeMo-Guardrails that referenced this pull request Feb 4, 2025
Documents NVIDIA#966

Signed-off-by: Mike McKiernan <mmckiernan@nvidia.com>
mikemckiernan added a commit to mikemckiernan/NeMo-Guardrails that referenced this pull request Feb 4, 2025
Refer to NVIDIA#966.

Signed-off-by: Mike McKiernan <mmckiernan@nvidia.com>
@Pouyanpi Pouyanpi force-pushed the feat/output-rails-streaming-mvp branch from dfbfe55 to 4346671 Compare February 5, 2025 09:39
@Pouyanpi
Collaborator Author

Pouyanpi commented Feb 5, 2025

In the chat CLI we have the following behavior:

stream_first: True

[screenshot]

stream_first: False

[screenshot]

For the stream_async interface:

stream_first: True

async for chunk in rails.stream_async(messages=messages):
    print(f"CHUNK:{chunk}")

CHUNK:I
....
CHUNK: Is
CHUNK: there
CHUNK: anything
CHUNK: specific
CHUNK: you
CHUNK: would
CHUNK: like
CHUNK: to
CHUNK: know?
CHUNK:{"event": "ABORT", "data": {"reason": "Blocked by self check output rails."}}

stream_first: False

async for chunk in rails.stream_async(messages=messages):
    print(f"CHUNK:{chunk}")
CHUNK:{"event": "ABORT", "data": {"reason": "Blocked by self check output rails."}}

**The SSE is meant for the microservice.**

what do you think? @cparisien @mikemckiernan

@Pouyanpi
Collaborator Author

Pouyanpi commented Feb 5, 2025

> in the chat CLI we have the following behavior:
>
> stream_first: True
>
> [screenshot]
>
> stream_first: False
>
> [screenshot]

Here I can remove the output rail name if it is client-facing and clients should not be aware of it. Or we can have a debug mode that also includes the output rail name.

@mikemckiernan
Member

CHUNK:{"event": "ABORT", "data": {"reason": "Blocked by self check output rails."}}

**The SSE is meant for the microservice.**

what do you think? @cparisien @mikemckiernan

You read my mind--I meant to ask the question "How does the client know if the partial response was unsafe and how does the client check?" I have no experience in this area, but the concept of an event has appeal. (We can cover this offline...but does the client check each chunk if it's a dict or str? If dict, it's bad news?) No strong feeling on making the rail name optional. Couldn't the client filter it? Would they want to log it (maybe logged elsewhere...)?

@Pouyanpi
Collaborator Author

Pouyanpi commented Feb 5, 2025

Nice!

(We can cover this offline...but does the client check each chunk if it's a dict or str? If dict, it's bad news?)

It is a JSON str, not a dict, but it can be converted to one if needed.
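Since the chunk is a JSON str rather than a dict, a client can distinguish text chunks from the abort event roughly like this (a sketch; the payload shape is taken from the example output earlier in the thread, and `parse_chunk` is a hypothetical client-side helper, not part of the library):

```python
import json

def parse_chunk(chunk: str):
    """Return ('abort', reason) for an ABORT event chunk,
    else ('text', chunk) for a normal streamed text chunk."""
    try:
        data = json.loads(chunk)
    except json.JSONDecodeError:
        # Plain text is not valid JSON, so it is a normal chunk.
        return ("text", chunk)
    if isinstance(data, dict) and data.get("event") == "ABORT":
        return ("abort", data.get("data", {}).get("reason"))
    return ("text", chunk)

print(parse_chunk("Hello"))
print(parse_chunk('{"event": "ABORT", '
                  '"data": {"reason": "Blocked by self check output rails."}}'))
```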

No strong feeling on making the rail name optional. Couldn't the client filter it? Would they want to log it (maybe logged elsewhere...)?

Agreed! We include the name 👍🏻


@Pouyanpi Pouyanpi merged commit 6dcbd87 into develop Feb 5, 2025
6 checks passed
Pouyanpi added a commit that referenced this pull request Feb 5, 2025
This commit introduces streaming support for output rails in the
nemoguardrails configuration. It includes the following changes:

- Added `OutputRailsStreamingConfig` class to define streaming
  configuration for output rails.
- Updated `OutputRails` class to include streaming configuration.
- Modified `LLMGenerationActions` to handle streaming for output rails.
- Enhanced `LLMRails` to run output rails in streaming mode.
- Implemented `BufferStrategy` for buffering and processing streamed tokens
- Implemented RollingBuffer
- Updated `StreamingHandler` to support asynchronous iteration.
- Replaced direct string replacement with a partial function `get_action_name_from_flow_id` to get action names from flow ids
- Added `_get_last_context_message` to retrieve the last context message.
- Modified `_get_latest_user_message` to return an empty dict instead of None.
- Updated `_prepare_params` to include context message and action params.
- Enhanced `_prepare_params` to handle placeholders in action params.
- Replaced `get_action_name_from_flow_id` with `get_action_details_from_flow_id`.
- feat: handle ABORT SSE in streaming output
mikemckiernan added a commit to mikemckiernan/NeMo-Guardrails that referenced this pull request Feb 7, 2025
Documents NVIDIA#966

Signed-off-by: Mike McKiernan <mmckiernan@nvidia.com>
mikemckiernan added a commit that referenced this pull request Feb 7, 2025
Documents #966

Signed-off-by: Mike McKiernan <mmckiernan@nvidia.com>
mikemckiernan added a commit to mikemckiernan/NeMo-Guardrails that referenced this pull request Feb 7, 2025
Documents NVIDIA#966

Signed-off-by: Mike McKiernan <mmckiernan@nvidia.com>
@mikemckiernan mikemckiernan mentioned this pull request Feb 7, 2025
4 tasks
Pouyanpi pushed a commit that referenced this pull request Feb 7, 2025
Documents #966

Signed-off-by: Mike McKiernan <mmckiernan@nvidia.com>
Labels
documentation Improvements or additions to documentation enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants