
Can redisc recover from master-slave node failover? #51

Closed
ljfuyuan opened this issue Oct 12, 2023 · 6 comments

Comments

@ljfuyuan

Hello, Martin
I am a little confused about failover between primary and replica nodes. redisc relies on MOVED redirections to refresh, but when a primary node crashes, its replica is eventually promoted to primary, while redisc still only communicates with the crashed primary. That means it will never receive a MOVED response and cannot refresh, so the system will not be able to heal itself. Did I misunderstand something? Or is there a way to handle such situations and perform failover automatically?

@mna
Owner

mna commented Oct 12, 2023

Hello,

That's a great question. It does recover, because it communicates with all nodes in the cluster when refreshing its slots, so it should eventually get back on its feet. I just updated the consistency-checker program in https://github.com/mna/redisc/tree/master/ccheck; you can play with it and give it a shot. If you do find scenarios where it does not properly recover, I'm very interested in hearing about them so I can try to improve its resiliency!

What I tested was to set up the cluster, start the ccheck program with the -f flag provided (refresh the cluster slots mapping at startup), and then after a bit kill -9 one of the primary nodes to see it recover and use the replica in its place.

I think that, theoretically, if getting a connection from a pool fails repeatedly, it should force a refresh of the cluster mapping, but I haven't heard of any issues like that in the wild.

Thanks,
Martin

@ljfuyuan
Author

ljfuyuan commented Oct 13, 2023

Yes, as you said, when the bind operation is called, redisc gives priority to the previous connection to the master node. When a connection error is detected, it randomly selects a node again, giving it the opportunity to receive a MOVED redirection. Good job!

In addition, I encountered another problem. My service started before the Redis cluster was ready. During the small window before the cluster was ready (or before the StartupNodes could be reached), the getClusterSlots call triggered by a Do command returned empty slot information, and all nodes were then removed. Since the masters map in getNodeAddrs is only initialized once, the cluster can never be restored after that.

Note that my test environment is Docker, with 3 masters and 3 replicas launched, StartupNodes set to one of the host names (e.g. redis-node-1:6379), and Redis version 6.2.

It seems the easiest way to solve this problem is to modify how masters is initialized in getNodeAddrs.

That is, changing:

	// populate nodes lazily, only once
	if c.masters == nil {

to:

	if len(c.masters) == 0  {

That way, when the above situation occurs, we can at least always fall back to StartupNodes to recover.

@mna
Owner

mna commented Oct 13, 2023

Thanks for the info. It's weird that CLUSTER SLOTS can return a success with all empty slot information; I wonder if it's due to using Docker. It's of course highly recommended to call Cluster.Refresh at application startup so that it starts with a proper mapping, but in any case the fix you mention makes sense to me, instead of getting stuck with no available node. Would you like to contribute it as a PR?

@mna
Owner

mna commented Oct 13, 2023

Looking at

redisc/cluster.go

Lines 408 to 417 in 5aef5f5

	// populate nodes lazily, only once
	if c.masters == nil {
		c.masters = make(map[string]bool, len(c.StartupNodes))
		c.replicas = make(map[string]bool)
		// StartupNodes should be masters
		for _, n := range c.StartupNodes {
			c.masters[n] = true
		}
	}
a bit more, I think it must be a little more subtle. If len(c.masters) == 0, it should check whether there are replicas and, if so, use them as masters; if both are empty, then fall back to the StartupNodes. This way we don't lose possibly valid cluster information if for some reason all masters failed and only replicas remain, and the next refresh will use the replica addresses (now stored in c.masters) to get the updated cluster slots.

And the masters and replicas maps should be created only if they are nil.

@ljfuyuan
Author

I am glad to contribute to this project.

Yes, as you mentioned, we need to consider the case where only replica nodes remain. Give me some time to do more testing, and then I'll try to open a PR.

@mna
Owner

mna commented Oct 20, 2023

Closed via #52 .

@mna mna closed this as completed Oct 20, 2023