
Can redisc recover from master-slave node failover? #51

Closed
ljfuyuan opened this issue Oct 12, 2023 · 6 comments

Comments

@ljfuyuan

Hello, Martin
I am a little confused about failover between primary and replica nodes. redisc relies on MOVED redirections to refresh, but when a primary node crashes, its replica is eventually promoted to primary, while redisc still only communicates with the crashed primary. That means it will never receive a MOVED response and cannot refresh, so the system will not be able to heal itself. Did I misunderstand something? Or is there a way to handle such situations and perform failover automatically?

@mna
Owner

mna commented Oct 12, 2023

Hello,

That's a great question. It does recover, because it communicates with all nodes in the cluster when refreshing its slots, so it should eventually get back on its feet. I just updated the consistency-checker program in https://github.com/mna/redisc/tree/master/ccheck; you can play with it and give it a shot. If you do find scenarios where it does not properly recover, I'm very interested in hearing about them so I can try to improve its resiliency!

What I tested was to set up the cluster, start the ccheck program with the -f flag provided (refresh the cluster slots mapping at startup), and then after a bit kill -9 one of the primary nodes to see it recover and use the replica in its place.

I think that, theoretically, if getting a connection from a pool fails repeatedly, it should force a refresh of the cluster mapping, but I haven't heard of any issues like that in the wild.

Thanks,
Martin

@ljfuyuan
Author

ljfuyuan commented Oct 13, 2023

Yes, as you said, when the bind operation is called, redisc gives priority to the previous connection to the master node. When a connection error is detected, it randomly selects a node again, giving it the opportunity to receive a MOVED redirection. Good job!

In addition, I encountered another problem. My service started before the Redis cluster was ready. During the small window before the cluster was ready (or before the StartupNodes could be reached), the getClusterSlots call triggered by a Do command returned empty slot information, and all nodes were then removed. Since the masters map in getNodeAddrs is only initialized once, the cluster can never be restored after that.

Note that my test environment is Docker, with 3 masters and 3 replicas launched, StartupNodes set to one of the host names (e.g. redis-node-1:6379), and Redis version 6.2.

It seems the easiest way to solve this problem is to modify how masters is initialized in getNodeAddrs.

That is, changing:

	// populate nodes lazily, only once
	if c.masters == nil {

to:

	if len(c.masters) == 0  {

That way, when the above situation occurs, we can at least always fall back to StartupNodes to recover.

@mna
Owner

mna commented Oct 13, 2023

Thanks for the info. It's weird that CLUSTER SLOTS can return a success with all empty slot information; I wonder if it's due to using Docker. It's of course highly recommended to call Cluster.Refresh at application startup so that it starts with a proper mapping, but in any case the fix you mention makes sense to me, instead of getting stuck with no available node. Would you like to contribute it as a PR?

@mna
Owner

mna commented Oct 13, 2023

Looking at

redisc/cluster.go

Lines 408 to 417 in 5aef5f5

	// populate nodes lazily, only once
	if c.masters == nil {
		c.masters = make(map[string]bool, len(c.StartupNodes))
		c.replicas = make(map[string]bool)
		// StartupNodes should be masters
		for _, n := range c.StartupNodes {
			c.masters[n] = true
		}
	}
a bit more, I think it must be a little more subtle. If len(c.masters) == 0, it should check whether there are replicas and, if so, use them as masters; if both are empty, then fall back to the StartupNodes. This way we don't lose possibly valid cluster information if for some reason all masters failed and only replicas remain, and the next refresh will use the replica addresses (now stored in c.masters) to get the updated cluster slots.

And the masters and replicas maps should be created only if they are nil.

@ljfuyuan
Author

I am glad to contribute to this project.

Yes, as you mentioned, we need to consider the case where only replica nodes remain. Give me some time to do more testing, and then I'll try to open a PR.

@mna
Owner

mna commented Oct 20, 2023

Closed via #52 .

@mna mna closed this as completed Oct 20, 2023