Feat: Output Rails Streaming #966
Conversation
requires #965
Documentation preview
Unsolicited feedback. LMK if you'd prefer that I not provide unsolicited feedback.
Thank you @mikemckiernan 👍🏻, just hold off on reviewing for now; I'll push changes addressing the review soon, thx!
Force-pushed from 4bcd6c2 to e7866ef.
This is looking good to me. Let's make sure we are clear on the "hierarchy" of the "streaming: true" configs and the default behaviour.
Also I'd like to make sure I have a clear understanding of which rails will support this behaviour now, and which won't.
nemoguardrails/rails/llm/config.py (outdated):

```python
look_back_size: int = Field(default=5, description="The look back size.")
window_size: int = Field(default=10, description="The window size.")
```
Elaborate a bit on the description, e.g., explain that the look back size is how many tokens/characters in total are sent to the rail each time as context. Also give the units -- is this counted in characters?
Size is always a number of tokens (words). I need to change the param names and the default values; these were used for testing.

`window_size` is the number of tokens that we apply the output rails on: when streaming starts it corresponds to the first `window_size` tokens, and in the next iteration(s) we reuse `look_back_size` tokens from the previous iteration(s) while still respecting the `window_size`.
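In other words, the chunking behaves like a sliding window with overlap. A minimal sketch of that logic (illustrative only; not the PR's actual `BufferStrategy` implementation):

```python
from typing import Iterator, List


def window_chunks(
    tokens: List[str], window_size: int = 10, look_back_size: int = 5
) -> Iterator[List[str]]:
    """Yield blocks of `window_size` tokens; after the first block, each new
    block re-includes the last `look_back_size` tokens of the previous one."""
    step = window_size - look_back_size
    start = 0
    while start < len(tokens):
        yield tokens[start : start + window_size]
        if start + window_size >= len(tokens):
            break
        start += step


tokens = "one two three four five six seven eight nine ten eleven twelve".split()
for chunk in window_chunks(tokens, window_size=6, look_back_size=2):
    print(chunk)
# ['one', ..., 'six'], then ['five', ..., 'ten'], then ['nine', ..., 'twelve']
```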
@trebedea, need your expert intuition here: what do you think is a good default (ignore the defaults I've set)?
Basically, we have 2 parameters:
- The size of the token group processed at one time: This defines how many tokens are in each block that the output rails operate on.
- The amount of previous tokens reused in the next iteration: This defines the carryover or context from the previous block that is blended into the current processing batch.
Decision:
- `chunk_size` instead of `window_size`
- `context_size` instead of `look_back_size`

`chunk_size` and `context_size` are both clear and abstract, communicating exactly what they do: one determines the size of the token block and the other determines how much of the previous block is reused.
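Applied to the fields quoted above, the rename would look roughly like this (a sketch only; the defaults shown are the placeholder values from the diff, not the final ones):

```python
from pydantic import BaseModel, Field


class OutputRailsStreamingConfig(BaseModel):
    """Sketch of the renamed fields discussed above (not the merged code)."""

    chunk_size: int = Field(
        default=10,
        description="Number of tokens in each block the output rails run on.",
    )
    context_size: int = Field(
        default=5,
        description="Number of tokens from the previous block reused as context.",
    )
```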
fyr, @mikemckiernan @cparisien
I like those two field names--and I'm extremely picky. FYI: https://nvidia.github.io/NeMo-Guardrails/review/pr-976/user-guides/configuration-guide.html#enabling-streaming-output
Thanks, I'm good with these new names too.
nemoguardrails/rails/llm/config.py
Outdated
window_size: int = Field(default=10, description="The window size.") | ||
stream_first: bool = Field( | ||
default=True, | ||
description="Prioritizes streaming chunks before applying output rails.", |
It's not clear what this means.
Stream the chunk of tokens first, then apply output rails on it (pro: lower perceived TTFT), or apply the output rails on the chunk and then stream it (pro: if the chunk is blocked, we don't stream blocked content). Is it clear?
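As a sketch of the two orderings (with `apply_output_rails` as a hypothetical stand-in for running the configured output rails on a chunk, not the library's actual API):

```python
import asyncio


async def apply_output_rails(chunk: str) -> bool:
    """Toy rail: allow the chunk unless it contains a flagged word."""
    return "forbidden" not in chunk


async def guarded_stream(chunks, stream_first: bool) -> None:
    for chunk in chunks:
        if stream_first:
            # Pro: lower perceived TTFT. Con: a blocked chunk has already
            # been streamed by the time the rails reject it.
            print(chunk, end="")
            if not await apply_output_rails(chunk):
                print('\n{"event": "ABORT", ...}')
                return
        else:
            # Pro: blocked content is never streamed. Con: higher TTFT,
            # since each chunk waits for the rails before being emitted.
            if not await apply_output_rails(chunk):
                print('{"event": "ABORT", ...}')
                return
            print(chunk, end="")


asyncio.run(guarded_stream(["Hello ", "world\n"], stream_first=True))
```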
Thanks, that's clear. I expect we'll learn from early users how they experience it.
Force-pushed from eb91c2d to ae19b67.
Some thoughts on default values: `chunk_size: 256`. A chunk of about 200–256 tokens tends to be large enough to encapsulate a meaningful "unit" of generated text, which should provide the rails with enough information. At the same time, it isn't so large that it noticeably increases the TTFT latency of the first chunk.
cc @cparisien
This commit introduces streaming support for output rails in the nemoguardrails configuration. It includes the following changes:
- Added `OutputRailsStreamingConfig` class to define streaming configuration for output rails.
- Updated `OutputRails` class to include streaming configuration.
- Modified `LLMGenerationActions` to handle streaming for output rails.
- Enhanced `LLMRails` to run output rails in streaming mode.
- Implemented `BufferStrategy` for buffering and processing streamed tokens.
- Updated `StreamingHandler` to support asynchronous iteration.
fix: fix tests given the new implementation; add `generate_chunk_str` test
- Replaced direct string replacement with a partial function `get_action_name_from_flow_id` to get action names from flow ids
update llm rails is blocked
- Added `_get_last_context_message` to retrieve the last context message.
- Modified `_get_latest_user_message` to return an empty dict instead of None.
- Updated `_prepare_params` to include context message and action params.
- Enhanced `_prepare_params` to handle placeholders in action params.
- Replaced `get_action_name_from_flow_id` with `get_action_details_from_flow_id`.
- Updated `_run_output_rails_in_streaming` to use `get_action_details`.
- Removed `enabled` flag from `examples/configs/abc_streaming/config.yml`.
- Updated `generation.py` to check for output flows instead of the `enabled` flag.
- Fixed typo in `llmrails.py` and updated streaming handling logic.
Force-pushed from e5c8693 to 655a352.
Tested ✅ with:

```yaml
rails:
  input:
    flows:
      - topic safety check input $model=topic_control
  output:
    flows:
      - content safety check output $model=content_safety
      - self check output
    chunk_size: 10
  dialog:
    single_call:
      enabled: False
streaming: True
```
Documents NVIDIA#966 Signed-off-by: Mike McKiernan <mmckiernan@nvidia.com>
Refer to NVIDIA#966. Signed-off-by: Mike McKiernan <mmckiernan@nvidia.com>
remove them gosh
Force-pushed from dfbfe55 to 4346671.
In the chat CLI we have the following behavior for:

```python
async for chunk in rails.stream_async(messages=messages):
    print(f"CHUNK:{chunk}")
```

One run produces:

```
CHUNK:I
....
CHUNK: Is
CHUNK: there
CHUNK: anything
CHUNK: specific
CHUNK: you
CHUNK: would
CHUNK: like
CHUNK: to
CHUNK: know?
CHUNK:{"event": "ABORT", "data": {"reason": "Blocked by self check output rails."}}
```

Another run produces only:

```
CHUNK:{"event": "ABORT", "data": {"reason": "Blocked by self check output rails."}}
```

**The SSE is meant for the microservice.** What do you think? @cparisien @mikemckiernan
You read my mind--I meant to ask the question "How does the client know if the partial response was unsafe, and how does the client check?" I have no experience in this area, but the concept of an event has appeal. (We can cover this offline...but does the client check each chunk to see if it's a dict or str? If dict, it's bad news?) No strong feeling on making the rail name optional. Couldn't the client filter it? Would they want to log it (maybe logged elsewhere...)?
Nice!
It is a JSON string, so not a dict, but it can be converted to one if needed.
Agreed! We'll include the name 👍🏻
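For clients consuming the stream, distinguishing ordinary text chunks from the ABORT event can be as simple as attempting a JSON parse. A hedged sketch (the event shape follows the example above; the helper name is made up):

```python
import json


def is_abort_event(chunk: str) -> bool:
    """Return True if a streamed chunk is the ABORT JSON event."""
    try:
        data = json.loads(chunk)
    except json.JSONDecodeError:
        return False  # ordinary text chunk
    return isinstance(data, dict) and data.get("event") == "ABORT"


chunk = '{"event": "ABORT", "data": {"reason": "Blocked by self check output rails."}}'
if is_abort_event(chunk):
    print("Stream blocked:", json.loads(chunk)["data"]["reason"])
```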
This commit introduces streaming support for output rails in the nemoguardrails configuration. It includes the following changes:
- Added `OutputRailsStreamingConfig` class to define streaming configuration for output rails.
- Updated `OutputRails` class to include streaming configuration.
- Modified `LLMGenerationActions` to handle streaming for output rails.
- Enhanced `LLMRails` to run output rails in streaming mode.
- Implemented `BufferStrategy` for buffering and processing streamed tokens.
- Implemented `RollingBuffer`.
- Updated `StreamingHandler` to support asynchronous iteration.
- Replaced direct string replacement with a partial function `get_action_name_from_flow_id` to get action names from flow ids.
- Added `_get_last_context_message` to retrieve the last context message.
- Modified `_get_latest_user_message` to return an empty dict instead of None.
- Updated `_prepare_params` to include context message and action params.
- Enhanced `_prepare_params` to handle placeholders in action params.
- Replaced `get_action_name_from_flow_id` with `get_action_details_from_flow_id`.
- feat: handle ABORT SSE in streaming output
Description
Example:
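(The original example content did not survive extraction; the following is a plausible minimal sketch based on the `stream_async` usage shown earlier in this thread.)

```python
import asyncio

from nemoguardrails import LLMRails, RailsConfig


async def main() -> None:
    # Load the streaming example config referenced by the CLI command below.
    config = RailsConfig.from_path("./examples/configs/abc_streaming")
    rails = LLMRails(config)
    messages = [{"role": "user", "content": "Hi there!"}]
    async for chunk in rails.stream_async(messages=messages):
        print(f"CHUNK:{chunk}")


asyncio.run(main())
```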
or using the chat CLI:

```
nemoguardrails chat --config="./examples/configs/abc_streaming"
```
Todo
- add config example with Nemoguard NIMs?