Add VTOrc recovery for mismatch in tablet type #17870
…played type doesn't match the tablet record Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
…ype doesn't match the tablet record Signed-off-by: Manan Gupta <manan@planetscale.com>
…sues Signed-off-by: Manan Gupta <manan@planetscale.com>
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #17870 +/- ##
==========================================
- Coverage 67.48% 67.46% -0.03%
==========================================
Files 1593 1594 +1
Lines 258891 259096 +205
==========================================
+ Hits 174725 174804 +79
- Misses 84166 84292 +126
@GuptaManan100 this is a cool PR 👍 I wonder if I suppose I'm thinking |
Hey @timvaillancourt, no, I think I would still want VTOrc to make the call, because changing a tablet's type is still a cluster event in some sense: changing the type to primary (with a higher timestamp) is effectively a tablet promotion (check the test I've added too), and it's better to let VTOrc handle it. Specifically, if we see a newer primary tablet, then VTOrc might choose to run the
How did this even happen? Did we run out the context clock?
Do you know whether a "context canceled ..." error was received? |
Description
This PR fixes the issue described in #17710.
As the issue describes, the problem happens when the tablet we are trying to promote via InitPrimary (it can also happen in PromoteReplica) times out when writing its tablet record to the topo-server.
If the topo-server write has succeeded, then the tablet is a primary according to the tablet record, but its internal state doesn't say that it is a primary. Therefore, it keeps advertising itself to the vtgates as a replica. This makes the vtgates think there is no primary tablet, so they don't know where to route queries.
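The mismatch described above can be pictured with a small Go sketch. It is only an illustration, not the code from this PR: the `TabletType`, `TabletView`, and `HasTypeMismatch` names and the tablet alias are hypothetical, and real Vitess tablet types and records are richer than this.

```go
package main

import "fmt"

// TabletType is a hypothetical, simplified stand-in for a tablet type.
type TabletType int

const (
	Replica TabletType = iota
	Primary
)

// String renders the type for log output.
func (t TabletType) String() string {
	if t == Primary {
		return "PRIMARY"
	}
	return "REPLICA"
}

// TabletView pairs the two sources of truth that can disagree
// after a timed-out topo-server write (hypothetical struct).
type TabletView struct {
	Alias         string
	RecordType    TabletType // type stored in the topo-server tablet record
	DisplayedType TabletType // type the tablet advertises to the vtgates
}

// HasTypeMismatch reports whether the tablet record and the
// type advertised by the tablet disagree.
func HasTypeMismatch(v TabletView) bool {
	return v.RecordType != v.DisplayedType
}

func main() {
	// The topo write succeeded, but the tablet's internal state was never updated.
	v := TabletView{Alias: "zone1-0000000101", RecordType: Primary, DisplayedType: Replica}
	if HasTypeMismatch(v) {
		fmt.Printf("%s: record says %v but tablet advertises %v\n",
			v.Alias, v.RecordType, v.DisplayedType)
	}
}
```

In this simplified picture, a component that reads both the tablet record and the advertised type can flag the disagreement and hand it off for recovery.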
Previously there was no way for VTOrc to detect this situation. It had fixed the MySQL-level settings by calling UndoDemotePrimary, but the tablet continued to publish itself as a Replica type.
This PR introduces 2 changes to fix this issue:
UndoDemotePrimary also checks that the internal state of the tablet is of type Primary. If not, it consults the tablet record and, if it finds a mismatch, promotes the tablet without changing the primary term start time.
Related Issue(s)
Checklist
Deployment Notes