-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add more vstream metrics for vstream manager #17858
base: main
Are you sure you want to change the base?
Conversation
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
Signed-off-by: twthorn <thomaswilliamthornton@gmail.com>
6359a99
to
f6f37d0
Compare
Signed-off-by: twthorn <thomaswilliamthornton@gmail.com>
Signed-off-by: twthorn <thomaswilliamthornton@gmail.com>
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #17858 +/- ##
========================================
Coverage 67.46% 67.47%
========================================
Files 1593 1594 +1
Lines 258885 259089 +204
========================================
+ Hits 174652 174814 +162
- Misses 84233 84275 +42 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for working on this, @twthorn ! ❤️ I think that we should work out what we want these metrics to represent, and then we can refine from there. Hopefully all of my comments make sense?
go/vt/vtgate/vstream_manager.go
Outdated
labels := []string{sgtid.Keyspace, sgtid.Shard, vs.tabletType.String()} | ||
vs.vsm.vstreamsEndedWithErrors.Add(labels, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had assumed that this was a count of vtgate vstreams rather than tablet streams. Is this really supposed to be a count of tablet streams (one per shard, per vtgate vstream)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the most useful for observability is to have per shard per vtgate stream. For example, with the lag metric, if the vstream has multiple shards, we cannot accurately report lag (do we do the max, min, avg, etc.).
I believe the same is true
- errors - it saves operator the time to find in a log which stream has error
- active streams - if the vtgate is operating slow the vstream count is not enough granularity if those vstreams have hundreds of shards, we want to know how many shards this vtgate is handling
- events streamed - help operator detect which shard has disproportionately high load
I agree per tablet is maybe too granular, but per-shard would be very useful. I am going to update PR based on this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, makes sense. Thanks!
Signed-off-by: twthorn <thomaswilliamthornton@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I only had some minor comments and suggestions. Please let me know what you think.
Thanks again, @twthorn !
@@ -378,11 +394,19 @@ func (vs *vstream) startOneStream(ctx context.Context, sgtid *binlogdatapb.Shard | |||
vs.wg.Add(1) | |||
go func() { | |||
defer vs.wg.Done() | |||
|
|||
labelValues := []string{sgtid.Keyspace, sgtid.Shard, vs.tabletType.String()} | |||
vs.vsm.vstreamsEndedWithErrors.Add(labelValues, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any value in adding 0. Am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to initialize the counter to zero for this keyspace/shard/tablet type. This is a best practice. Metrics are hard to work with if they only sometimes exist. This is also seen in the unit test for metrics that we can't assert that there were zero errors for a vstream that worked fine (the metric is simply missing)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed to Reset to hopefully make it more clear, as Add zero does appear to be not useful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is Reset what you really want? It's a counter and not a gauge. I assumed that it was meant to be a counter that spanned the life of the vtgate as the description is "Number of vstreams that ended with errors"
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes good point, this could be called multiple times. So we will just keep with Add zero.
Signed-off-by: twthorn <thomaswilliamthornton@gmail.com>
Signed-off-by: twthorn <thomaswilliamthornton@gmail.com>
Signed-off-by: twthorn <thomaswilliamthornton@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work on this, @twthorn ! ❤️ Much appreciated.
For docs, we should add these new metrics to the v22 docs here: https://vitess.io/docs/22.0/reference/vreplication/metrics/#vtgate-metrics
Do you mind opening a website/docs PR for that as well? I can help as needed.
Sure thing, opened here vitessio/website#1948 |
Description
Add more metrics for vstreams on vtgates.
Inspired partially by the tablet vstream metrics. Let me know if you think I should include any others. Once we are decided on the metrics list, I will put out docs PR.
Related Issue(s)
Checklist