perf(weave): add pre-groupby conditions to heavy query filter in calls query #3781

Merged
merged 13 commits into master from griffin/heavy-filter-predicate
Feb 26, 2025

Conversation

gtarpenning
Member

@gtarpenning commented Feb 26, 2025

Description

Add additional filters to the calls query when querying by heavy fields. Because we must account for unmerged calls, every condition must be evaluated against individual call parts. For example, instead of adding inputs_dump.a = 'bob' we must write inputs_dump.a = 'bob' OR inputs_dump IS NULL to handle the case where we are looking at only an end call part (a minimal Python sketch of this wrapping follows the Branch query below).

Prod:

WITH filtered_calls AS (...)
SELECT
    calls_merged.id AS id,
    any(calls_merged.inputs_dump) AS inputs_dump
FROM calls_merged
WHERE calls_merged.project_id = 'project'
AND (calls_merged.id IN filtered_calls)
GROUP BY (calls_merged.project_id, calls_merged.id)
HAVING (
    JSON_VALUE(any(calls_merged.inputs_dump), 'a') = 'bob'
)

Branch:

WITH filtered_calls AS (...)
SELECT
    calls_merged.id AS id,
    any(calls_merged.inputs_dump) AS inputs_dump
FROM calls_merged
WHERE calls_merged.project_id = 'project'
AND (calls_merged.id IN filtered_calls)
AND ((JSON_VALUE(calls_merged.inputs_dump, 'a') = 'bob')
    OR calls_merged.inputs_dump IS NULL)
GROUP BY (calls_merged.project_id, calls_merged.id)
HAVING (
    JSON_VALUE(any(calls_merged.inputs_dump), 'a') = 'bob'
)
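
For illustration, here is a minimal Python sketch of the wrapping idea. The helper name wrap_heavy_condition_for_call_parts is hypothetical and not part of calls_query_builder.py; it only shows how a heavy-field condition is widened so it can run before the GROUP BY without dropping unmerged call parts:

def wrap_heavy_condition_for_call_parts(condition_sql: str, raw_column: str) -> str:
    """Widen a heavy-field condition so it never filters out an unmerged call part.

    A start row carries no output_dump and an end row carries no inputs_dump, so
    the raw column is NULL on the part that does not own the field. Allowing NULL
    keeps those rows available for the GROUP BY; the HAVING clause still enforces
    the real condition on the merged call afterwards.
    """
    return f"(({condition_sql}) OR {raw_column} IS NULL)"


# The inputs filter from the description above:
predicate = wrap_heavy_condition_for_call_parts(
    "JSON_VALUE(calls_merged.inputs_dump, 'a') = 'bob'",
    "calls_merged.inputs_dump",
)
# -> ((JSON_VALUE(calls_merged.inputs_dump, 'a') = 'bob') OR calls_merged.inputs_dump IS NULL)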

Testing

  • unit tests
  • end to end tests

Summary by CodeRabbit

  • New Features

    • Enhanced the system’s filtering and data aggregation capabilities to boost query performance and accuracy during high-demand operations.
  • Tests

    • Expanded automated test coverage with new tests for heavy condition aggregation and batch processing, ensuring consistent and reliable system behavior.


coderabbitai bot commented Feb 26, 2025


Walkthrough

This pull request introduces new test functions and updates to query builders. Two new test functions in the client trace tests and three in the trace server tests simulate heavy condition aggregations and batch querying. In the production code under weave/trace_server/calls_query_builder.py, methods generating SQL are updated to include a new parameter (use_agg_fn), and additional helper methods are added for predicate filtering and feedback identification. These changes enhance testing coverage and improve SQL generation logic.

Changes

File(s) and change summary:
  • tests/trace/test_client_trace.py, tests/trace_server/test_calls_query_builder.py: Added new test functions to validate heavy condition aggregation and batch query filtering in both client and server test suites.
  • weave/trace_server/calls_query_builder.py: Updated as_sql methods with a new use_agg_fn parameter; modified process_query_to_conditions; added is_feedback, make_call_parts_predicate_filters_sql, and is_heavy functions.
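
As a rough illustration of what the new use_agg_fn flag controls, the sketch below renders the same JSON field for the two stages of the query. This is a simplified, assumed signature based on the Prod/Branch queries above, not the actual as_sql method:

# Sketch only: with use_agg_fn=True the value is wrapped in any() so it can be
# used after GROUP BY (in HAVING); with use_agg_fn=False the raw column is used
# so the condition can run as a pre-groupby WHERE predicate on individual call parts.
def inputs_field_sql(json_path: str, use_agg_fn: bool = True) -> str:
    inner = "calls_merged.inputs_dump"
    if use_agg_fn:
        inner = f"any({inner})"
    return f"JSON_VALUE({inner}, '{json_path}')"


print(inputs_field_sql("a", use_agg_fn=True))   # JSON_VALUE(any(calls_merged.inputs_dump), 'a')
print(inputs_field_sql("a", use_agg_fn=False))  # JSON_VALUE(calls_merged.inputs_dump, 'a')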

Sequence Diagram(s)

sequenceDiagram
    participant Test
    participant QueryBuilder
    participant Field
    participant Condition
    participant SQLGen

    Test->>QueryBuilder: Initiate heavy condition query
    QueryBuilder->>Field: Call as_sql(use_agg_fn=True)
    Field-->>QueryBuilder: Return SQL snippet
    QueryBuilder->>Condition: Invoke is_feedback() check
    Condition-->>QueryBuilder: Return feedback status
    QueryBuilder->>SQLGen: Evaluate should_add_predicate_filters()
    SQLGen-->>QueryBuilder: Return decision on predicate filters
    alt Predicate filters required
       QueryBuilder->>SQLGen: Call make_call_parts_predicate_filters_sql()
       SQLGen-->>QueryBuilder: Return predicate filters SQL
    end
    QueryBuilder-->>Test: Return complete SQL query

Suggested reviewers

  • tssweeney
  • jamie-rasmussen

Poem

I'm a rabbit in a code-filled glade,
Hopping through tests that clearly made,
Heavy queries and filters in cheerful play,
SQL and conditions leading the way.
With bits and bytes, I joyfully bound,
Celebrating changes where logic is found!
🐇✨



@coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
weave/trace_server/calls_query_builder.py (2)

76-76: Clarify default usage of use_agg_fn.

The new boolean attribute defaults to True, which might not always be intuitive. Consider adding a docstring or inline comment explaining in which scenarios the aggregator function should be activated or skipped.


289-294: Use a streamlined approach to detect feedback fields.

You could simplify the method definition with the built-in any() function:

 def is_feedback(self) -> bool:
-    for field in self._get_consumed_fields():
-        if isinstance(field, CallsMergedFeedbackPayloadField):
-            return True
-    return False
+    return any(isinstance(field, CallsMergedFeedbackPayloadField)
+               for field in self._get_consumed_fields())
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f9b9486 and 2d8e43c.

📒 Files selected for processing (3)
  • tests/trace/test_client_trace.py (1 hunks)
  • tests/trace_server/test_calls_query_builder.py (2 hunks)
  • weave/trace_server/calls_query_builder.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)

**/*.py: Focus on pythonic code patterns.
Check for proper exception handling.
Verify type hints usage where applicable.
Look for potential performance improvements.
Don't comment on formatting if black/isort is configured.
Check for proper dependency injection patterns.
Verify proper async handling if applicable.

  • weave/trace_server/calls_query_builder.py
  • tests/trace/test_client_trace.py
  • tests/trace_server/test_calls_query_builder.py
⏰ Context from checks skipped due to timeout of 90000ms (40)
  • GitHub Check: Trace nox tests (3, 13, trace) (listed 10 times)
  • GitHub Check: Trace nox tests (3, 12, trace) (listed 10 times)
  • GitHub Check: Trace nox tests (3, 11, trace) (listed 10 times)
  • GitHub Check: Trace nox tests (3, 10, trace) (listed 10 times)
🔇 Additional comments (11)
tests/trace/test_client_trace.py (2)

3305-3377: Great implementation of the test verifying heavy condition aggregation!

This test effectively validates querying of call parts with heavy conditions. It tests both the initial state after call start and the post-end state, while also ensuring that data merging in the database doesn't affect query results. The conditional logic handling for SQLite client is particularly good, as it acknowledges the different behaviors between database backends.


3379-3477: Well-structured test for batch query handling with heavy conditions

The test creates a good simulation of a batch scenario with multiple calls and verifies both input and output field filtering. I appreciate the thorough validation of different query combinations and the special handling for Clickhouse's asynchronous merging behavior.

tests/trace_server/test_calls_query_builder.py (3)

737-796: Excellent test for SQL predicate filter generation

This test correctly validates that the query builder generates SQL with proper NULL handling for heavy conditions. The predicate filter ensures that queries work correctly even when dealing with unmerged call parts by including OR calls_merged.inputs_dump IS NULL in the condition.


798-868: Comprehensive test for multiple heavy conditions

Good test that verifies the behavior when dealing with multiple heavy conditions on both input and output fields. The SQL generation properly handles complex combinations by creating appropriate predicate filters for each condition while maintaining proper NULL handling.


870-921: Clear validation of OR condition between start and end fields

This test provides an important validation of a special case - ensuring that predicate filters aren't created when there's an OR between start and end fields. The docstring clearly explains the test's purpose, and the assertions verify the expected behavior where the conditions are properly moved to the HAVING clause instead.

weave/trace_server/calls_query_builder.py (6)

85-86: Consider honoring the cast parameter for non-aggregated calls.

When use_agg_fn is False, the method ignores the inbound cast parameter. If this is not the desired behavior, consider applying:

 if not self.use_agg_fn:
-    return clickhouse_cast(inner)
+    return clickhouse_cast(inner, cast)
 
 return clickhouse_cast(f"{self.agg_fn}({inner})", cast)

609-609: No issues detected for initialization.

Setting call_parts_predicate_filters_sql to an empty string at this point looks fine.


709-710: Verify the overlap of attributes_dump in both sets.

attributes_dump is declared as both start-only and end-only. This may lead to ambiguous or unexpected filtering logic. Ensure this duplication is intentional or adjust accordingly.


739-739: Parameter addition approved.

The new use_agg_fn parameter in process_query_to_conditions() gives fine-grained control over aggregation. Looks good to proceed.


925-974: Confirm handling of nested OR logic.

should_add_predicate_filters checks top-level OR operations for conflicting start/end fields. If nested or combined logic is possible, ensure those cases are correctly handled or confirm that this limited scope is sufficient.


976-1036: Reassess null allowance for start-only and end-only fields.

Permitting rows with NULL values for heavy fields in an OR-based condition can inflate partial matches. Validate that these broader matching rules meet performance and correctness needs.

@gtarpenning changed the title from "init with test" to "perf(weave): add pre-orderby to heavy query filter in calls query" Feb 26, 2025
@gtarpenning changed the title from "perf(weave): add pre-orderby to heavy query filter in calls query" to "perf(weave): add pre-orderby conditions to heavy query filter in calls query" Feb 26, 2025
@gtarpenning marked this pull request as ready for review February 26, 2025 17:15
@gtarpenning requested a review from a team as a code owner February 26, 2025 17:15
@gtarpenning changed the title from "perf(weave): add pre-orderby conditions to heavy query filter in calls query" to "perf(weave): add pre-groupby conditions to heavy query filter in calls query" Feb 26, 2025
@gtarpenning requested a review from a team as a code owner February 26, 2025 17:29
@@ -65,6 +65,9 @@ def as_sql(
def as_select_sql(self, pb: ParamBuilder, table_alias: str) -> str:
return f"{self.as_sql(pb, table_alias)} AS {self.field}"

def is_heavy(self) -> bool:
Collaborator

seems like this is a property of CallsMergedField. not sure why it needed to be hoisted up to QueryBuilderField

Collaborator

you resolved this conversation without any comment or code change?

Member Author

gdi cursor keeps adding that back, if you check my first commit I removed it. There is a linter error that Cursor is obsessed with

if len(having_conditions_sql) > 0:
    having_filter_sql = "HAVING " + combine_conditions(
        having_conditions_sql, "AND"
    )

if should_add_predicate_filters(self.query_conditions):
Collaborator

I would slightly adjust this to be more pluggable.

call_parts_predicate_filters_sql = build_call_parts_predicate_filter(self.query_conditions)

I think that the implementation of "should_add_predicate_filters" is coupled with the implementation of "make_call_parts_predicate_filters_sql" and therefore should be abstracted behind a common method. Looking at your algorithm, I can imagine some alternatives which would make this easier to modify.
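
One way the suggested abstraction could look, as a sketch; the signatures here are inferred from snippets in this PR (ParamBuilder, Condition, should_add_predicate_filters, make_call_parts_predicate_filters_sql) and assume the module's existing helpers rather than reproducing them:

def build_call_parts_predicate_filter(
    pb: ParamBuilder, table_alias: str, query_conditions: list[Condition]
) -> str:
    # Bundle the "should we add a filter?" check and the SQL construction behind
    # one call. Returns "" when no safe pre-groupby filter can be built, so the
    # caller only has to test for a non-empty string.
    if not should_add_predicate_filters(query_conditions):
        return ""
    heavy_non_feedback_conditions = [
        c for c in query_conditions if c.is_heavy() and not c.is_feedback()
    ]
    if not heavy_non_feedback_conditions:
        return ""
    return make_call_parts_predicate_filters_sql(
        pb, table_alias, heavy_non_feedback_conditions
    )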

if len(having_conditions_sql) > 0:
    having_filter_sql = "HAVING " + combine_conditions(
        having_conditions_sql, "AND"
    )

heavy_non_feedback_conditions = [
Collaborator

why not just have this inside the make_call_parts_predicate_filters_sql?

Comment on lines 68 to 70
def is_heavy(self) -> bool:
    return False


@@ -893,6 +926,80 @@ def process_calls_filter_to_conditions(
    return conditions


def make_call_parts_predicate_filters_sql(
    pb: ParamBuilder, table_alias: str, heavy_conditions: list[Condition]
) -> str:
Collaborator

I would add the comment here that the condition returned MUST account for unmatched parts and NEVER filter out more than the heavy filter would


Returns an empty string if:
1. There are no heavy conditions
2. There are OR operations between start-only and end-only fields, which would
Collaborator

I actually think #2 is safe, just not using this algorithm
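
To see why this case is awkward for the current algorithm, here is a small illustrative simulation; the two combining strategies shown are assumptions about the available options, not the exact implementation:

# Query: inputs.a = 'x' OR output.b = 'y' (a start-only field OR an end-only field).
# A start row carries only inputs_dump; an end row carries only output_dump.
start_row = {"inputs_dump": {"a": "x"}, "output_dump": None}
end_row = {"inputs_dump": None, "output_dump": {"b": "z"}}  # b != 'y'

def widened(value, cond):
    # The "(condition OR column IS NULL)" shape from the Branch query.
    return value is None or cond(value)

# AND-combining the widened per-field predicates over-filters: the end row is
# dropped before GROUP BY even though the merged call matches the query via its
# inputs, so the merged result would silently lose its end-part data.
keep_end_row = (
    widened(end_row["inputs_dump"], lambda v: v["a"] == "x")
    and widened(end_row["output_dump"], lambda v: v["b"] == "y")
)
assert keep_end_row is False

# OR-combining the widened predicates never filters anything: each part trivially
# satisfies the IS NULL branch of the field it does not carry. With this algorithm
# the safe choice is to skip the pre-groupby filter and rely on HAVING alone;
# a different construction might still filter this case safely, as noted above.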

@gtarpenning merged commit 61fbaec into master Feb 26, 2025
131 checks passed
@gtarpenning deleted the griffin/heavy-filter-predicate branch February 26, 2025 20:45
@github-actions bot locked and limited conversation to collaborators Feb 26, 2025