Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add more vstream metrics for vstream manager #17858

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

twthorn
Copy link
Contributor

@twthorn twthorn commented Feb 24, 2025

Description

Add more metrics for vstreams on vtgates.

Inspired partially by the tablet vstream metrics. Let me know if you think I should include any others. Once we are decided on the metrics list, I will put out docs PR.

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation: Document new metrics for vtgate vstream website#1948

Copy link
Contributor

vitess-bot bot commented Feb 24, 2025

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot vitess-bot bot added NeedsBackportReason If backport labels have been applied to a PR, a justification is required NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsWebsiteDocsUpdate What it says labels Feb 24, 2025
@github-actions github-actions bot added this to the v22.0.0 milestone Feb 24, 2025
Signed-off-by: twthorn <thomaswilliamthornton@gmail.com>
@twthorn twthorn force-pushed the vtgate-add-more-vstream-metrics branch from 6359a99 to f6f37d0 Compare February 24, 2025 23:00
@mattlord mattlord added Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: VReplication Component: Observability Pull requests that touch tracing/metrics/monitoring and removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels Feb 24, 2025
Signed-off-by: twthorn <thomaswilliamthornton@gmail.com>
Signed-off-by: twthorn <thomaswilliamthornton@gmail.com>
Copy link

codecov bot commented Feb 25, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 67.47%. Comparing base (81ce29c) to head (239367b).
Report is 9 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff            @@
##             main   #17858    +/-   ##
========================================
  Coverage   67.46%   67.47%            
========================================
  Files        1593     1594     +1     
  Lines      258885   259089   +204     
========================================
+ Hits       174652   174814   +162     
- Misses      84233    84275    +42     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this, @twthorn ! ❤️ I think that we should work out what we want these metrics to represent, and then we can refine from there. Hopefully all of my comments make sense?

Comment on lines 397 to 398
labels := []string{sgtid.Keyspace, sgtid.Shard, vs.tabletType.String()}
vs.vsm.vstreamsEndedWithErrors.Add(labels, 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had assumed that this was a count of vtgate vstreams rather than tablet streams. Is this really supposed to be a count of tablet streams (one per shard, per vtgate vstream)?

Copy link
Contributor Author

@twthorn twthorn Feb 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the most useful for observability is to have per shard per vtgate stream. For example, with the lag metric, if the vstream has multiple shards, we cannot accurately report lag (do we do the max, min, avg, etc.).

I believe the same is true

  1. errors - it saves operator the time to find in a log which stream has error
  2. active streams - if the vtgate is operating slow the vstream count is not enough granularity if those vstreams have hundreds of shards, we want to know how many shards this vtgate is handling
  3. events streamed - help operator detect which shard has disproportionately high load

I agree per tablet is maybe too granular, but per-shard would be very useful. I am going to update PR based on this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, makes sense. Thanks!

Signed-off-by: twthorn <thomaswilliamthornton@gmail.com>
@twthorn twthorn requested a review from mattlord February 26, 2025 18:59
Copy link
Contributor

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I only had some minor comments and suggestions. Please let me know what you think.

Thanks again, @twthorn !

@@ -378,11 +394,19 @@ func (vs *vstream) startOneStream(ctx context.Context, sgtid *binlogdatapb.Shard
vs.wg.Add(1)
go func() {
defer vs.wg.Done()

labelValues := []string{sgtid.Keyspace, sgtid.Shard, vs.tabletType.String()}
vs.vsm.vstreamsEndedWithErrors.Add(labelValues, 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any value in adding 0. Am I missing something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to initialize the counter to zero for this keyspace/shard/tablet type. This is a best practice. Metrics are hard to work with if they only sometimes exist. This is also seen in the unit test for metrics that we can't assert that there were zero errors for a vstream that worked fine (the metric is simply missing)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed to Reset to hopefully make it more clear, as Add zero does appear to be not useful

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is Reset what you really want? It's a counter and not a gauge. I assumed that it was meant to be a counter that spanned the life of the vtgate as the description is "Number of vstreams that ended with errors".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes good point, this could be called multiple times. So we will just keep with Add zero.

Signed-off-by: twthorn <thomaswilliamthornton@gmail.com>
Signed-off-by: twthorn <thomaswilliamthornton@gmail.com>
Signed-off-by: twthorn <thomaswilliamthornton@gmail.com>
@twthorn twthorn requested a review from mattlord February 27, 2025 20:22
Copy link
Contributor

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work on this, @twthorn ! ❤️ Much appreciated.

For docs, we should add these new metrics to the v22 docs here: https://vitess.io/docs/22.0/reference/vreplication/metrics/#vtgate-metrics

Do you mind opening a website/docs PR for that as well? I can help as needed.

@twthorn
Copy link
Contributor Author

twthorn commented Feb 27, 2025

Sure thing, opened here vitessio/website#1948

@twthorn twthorn requested a review from mattlord February 27, 2025 22:46
@mattlord mattlord removed the NeedsWebsiteDocsUpdate What it says label Feb 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Observability Pull requests that touch tracing/metrics/monitoring Component: VReplication Type: Enhancement Logical improvement (somewhere between a bug and feature)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature Request: VTGates should report more metrics on vstreams
2 participants