cluster re-sync after connectivity loss #3748
Unanswered · WorkingClass asked this question in Q&A · 2 comments
- Aljoscha Weber:
Hey Victor, we ran into the same issue. Have you found any solution other than restarting the node?
- Victor (WorkingClass):
Hello Aljoscha,
We upgraded to the latest 3.1.x and increased the bandwidth between our two DCs. No magic.
Sorry for the late reply.
Thanks
Original question (WorkingClass):
Hello,
We run a CouchDB 3.x (git rev-parse HEAD: e83935c) cluster with 3 nodes.
[cluster]
q=4
n=3
placement = z1:2,z2:1
z1 and z2 are in different datacenters.
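For reference, when the inter-DC link is flaky I check which cluster nodes the Erlang VM still sees from a remsh; a minimal sketch with standard calls (the remsh path and the node names below are just our setup, adjust to yours):
%% e.g. /opt/couchdb/bin/remsh (path depends on the install)
%% Nodes this VM is currently connected to; expect both z1 nodes when healthy:
nodes().
%% Probe a specific peer; returns pong if reachable, pang otherwise:
net_adm:ping('couchdb@pbx1-z1.domain.com').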
Cause of the issue:
[error] 2021-09-08T03:35:35.775157Z couchdb@pbx1-z2.domain.com <0.27127.173> -------- ** Node 'couchdb@pbx1-z1.domain.com' not responding **
** Removing (timedout) connection **
[error] 2021-09-08T03:35:35.775288Z couchdb@pbx1-z2.domain.com <0.30996.2> -------- ** Node 'couchdb@pbx2-z1.domain.com' not responding **
** Removing (timedout) connection **
[error] 2021-09-08T03:35:35.781578Z couchdb@pbx1-z2.domain.com emulator -------- Error in process <0.1593.0> on node 'couchdb@pbx1-z2.domain.com' with exit value:
{{nocatch,{error,{nodedown,<<"progress not possible">>}}},[{fabric_view_changes,send_changes,6,[{file,"src/fabric_view_changes.erl"},{line,200}]},{fabric_view_changes,keep_sending_changes,8,[{file,"src/fabric_view_changes.erl"},{line,82}]},{fabric_view_changes,go,5,[{file,"src/fabric_view_changes.erl"},{line,43}]}]}
[warning] 2021-09-08T03:35:36.410891Z couchdb@pbx1-z2.domain.com <0.24863.708> -------- 4312 shards in cluster with only 1 copy on nodes that are currently up
[warning] 2021-09-08T03:35:36.411025Z couchdb@pbx1-z2.domain.com <0.24863.708> -------- 4312 shards in cluster with only 1 copy on nodes not in maintenance mode
[warning] 2021-09-08T03:35:36.411071Z couchdb@pbx1-z2.domain.com <0.24863.708> -------- 2 conflicted shards in cluster
[warning] 2021-09-08T03:35:36.981372Z couchdb@pbx1-z2.domain.com <0.469.710> -------- 4312 shards in cluster with only 1 copy on nodes that are currently up
[warning] 2021-09-08T03:35:36.981488Z couchdb@pbx1-z2.domain.com <0.469.710> -------- 4312 shards in cluster with only 1 copy on nodes not in maintenance mode
[warning] 2021-09-08T03:35:36.981549Z couchdb@pbx1-z2.domain.com <0.469.710> -------- 2 conflicted shards in cluster
[warning] 2021-09-08T03:54:36.904890Z couchdb@pbx1-z2.domain.com <0.15445.710> -------- 2 conflicted shards in cluster
[warning] 2021-09-08T03:54:36.904982Z couchdb@pbx1-z2.domain.com <0.15445.710> -------- 4312 shards in cluster with only 2 copies on nodes that are currently up
[warning] 2021-09-08T03:54:36.905051Z couchdb@pbx1-z2.domain.com <0.15445.710> -------- 4312 shards in cluster with only 2 copies on nodes not in maintenance mode
[error] 2021-09-08T03:54:37.861002Z couchdb@pbx1-z2.domain.com emulator -------- Error in process <0.815.707> on node 'couchdb@pbx1-z2.domain.com' with exit value:
{{rexi_EXIT,{{badmatch,{'EXIT',noproc}},[{couch_file,pread_binary,2,[{file,"src/couch_file.erl"},{line,172}]},{couch_file,pread_term,2,[{file,"src/couch_file.erl"},{line,160}]},{couch_btree,get_node,2,[{file,"src/couch_btree.erl"},{line,435}]},{couch_btree,lookup,3,[{file,"src/couch_btree.erl"},{line,286}]},{couch_btree,lookup,2,[{file,"src/couch_btree.erl"},{line,276}]},{couch_bt_engine,open_local_docs,2,[{file,"src/couch_bt_engine.erl"},{line,407}]},{couch_db,open_doc_int,3,[{file,"src/couch_db.erl"},{line,1664}]},{couch_db,open_doc,3,[{file,"src/couch_db.erl"},{line,292}]}]}},[{mem3_rpc,rexi_call,3,[{file,"src/mem3_rpc.erl"},{line,392}]},{mem3_rep,calculate_start_seq,3,[{file,"src/mem3_rep.erl"},{line,402}]},{maps,'-map/2-lc$^0/1-0-',2,[{file,"maps.erl"},{line,232}]},{maps,map,2,[{file,"maps.erl"},{line,232}]},{mem3_rep,calculate_start_seq_multi,1,[{file,"src/mem3_rep.erl"},{line,390}]},{mem3_rep,repl,1,[{file,"src/mem3_rep.erl"},{line,292}]},{mem3_rep,go,1,[{file,"src/mem3_rep.erl"},{line,111}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,"src/mem3_sync.erl"},{line,212}]}]}
[warning] 2021-09-08T03:54:37.861771Z couchdb@pbx1-z2.domain.com <0.336.0> -------- mem3_sync shards/40000000-7fffffff/account/1e/1f/3b973e1db60829651a007231a52f-202109.1630229760 couchdb@pbx2-z1.domain.com {{rexi_EXIT,{{badmatch,{'EXIT',noproc}},[{couch_file,pread_binary,2,[{file,[115,114,99,47,99,111,117,99,104,95,102,105,108,101,46,101,114,108]},{line,172}]},{couch_file,pread_term,2,[{file,[115,114,99,47,99,111,117,99,104,95,102,105,108,101,46,101,114,108]},{line,160}]},{couch_btree,get_node,2,[{file,[115,114,99,47,99,111,117,99,104,95,98,116,114,101,101,46,101,114,108]},{line,435}]},{couch_btree,lookup,3,[{file,[115,114,99,47,99,111,117,99,104,95,98,116,114,101,101,46,101,114,108]},{line,286}]},{couch_btree,lookup,2,[{file,[115,114,99,47,99,111,117,99,104,95,98,116,114,101,101,46,101,114,108]},{line,276}]},{couch_bt_engine,open_local_docs,2,[{file,[115,114,99,47,99,111,117,99,104,95,98,116,95,101,110,103,105,110,101,46,101,114,108]},{line,407}]},{couch_db,open_doc_int,3,[{file,[115,114,99,47,99,111,117,99,104,95,100,98,46,101,114,108]},{line,1664}]},{couch_db,open_doc,3,[{file,[115,114,99,47,99,111,117,99,104,95,100,98,46,101,114,108]},{line,292}]}]}},[{mem3_rpc,rexi_call,3,[{file,[115,114,99,47,109,101,109,51,95,114,112,99,46,101,114,108]},{line,392}]},{mem3_rep,calculate_start_seq,3,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,402}]},{maps,'-map/2-lc$^0/1-0-',2,[{file,[109,97,112,115,46,101,114,108]},{line,232}]},{maps,map,2,[{file,[109,97,112,115,46,101,114,108]},{line,232}]},{mem3_rep,calculate_start_seq_multi,1,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,390}]},{mem3_rep,repl,1,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,292}]},{mem3_rep,go,1,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,111}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,[115,114,99,47,109,101,109,51,95,115,121,110,99,46,101,114,108]},{line,212}]}]}
[warning] 2021-09-08T03:54:38.219543Z couchdb@pbx1-z2.domain.com <0.25714.711> -------- 2 conflicted shards in cluster
[error] 2021-09-08T03:54:39.795469Z couchdb@pbx1-z2.domain.com emulator -------- Error in process <0.2767.711> on node 'couchdb@pbx1-z2.domain.com' with exit value:
{function_clause,[{couch_db,incref,[undefined],[{file,"src/couch_db.erl"},{line,190}]},{couch_server,open_int,2,[{file,"src/couch_server.erl"},{line,106}]},{couch_server,open,2,[{file,"src/couch_server.erl"},{line,96}]},{couch_db,open,2,[{file,"src/couch_db.erl"},{line,163}]},{mem3_rep,go,1,[{file,"src/mem3_rep.erl"},{line,107}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,"src/mem3_sync.erl"},{line,212}]}]}
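The first two entries above are the Erlang distribution layer dropping the inter-node connection once the tick timeout expires. For reference, the tick time can be read, and changed on the running VM, from a remsh; a minimal sketch with standard net_kernel calls (the value 120 is only an illustration, not something we have validated):
%% Current distribution tick time in seconds (OTP default is 60):
net_kernel:get_net_ticktime().
%% Raise it for a slow/flaky inter-DC link; the change is applied gradually:
net_kernel:set_net_ticktime(120).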
The issue: if the connectivity between the DCs is interrupted, one or two of the instances run at high CPU because they keep retrying this, even after connectivity is re-established:
[error] 2021-09-08T04:44:45.926690Z couchdb@pbx1-z2.domain.com emulator -------- Error in process <0.9532.1> on node 'couchdb@pbx1-z2.domain.com' with exit value:
{{rexi_DOWN,{'couchdb@pbx2-z1.domain.com',noproc}},[{mem3_rpc,rexi_call,3,[{file,"src/mem3_rpc.erl"},{line,394}]},{mem3_rep,calculate_start_seq,3,[{file,"src/mem3_rep.erl"},{line,402}]},{maps,'-map/2-lc$^0/1-0-',2,[{file,"maps.erl"},{line,232}]},{maps,map,2,[{file,"maps.erl"},{line,232}]},{mem3_rep,calculate_start_seq_multi,1,[{file,"src/mem3_rep.erl"},{line,390}]},{mem3_rep,repl,1,[{file,"src/mem3_rep.erl"},{line,292}]},{mem3_rep,go,1,[{file,"src/mem3_rep.erl"},{line,111}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,"src/mem3_sync.erl"},{line,212}]}]}
[error] 2021-09-08T04:44:45.926935Z couchdb@pbx1-z2.domain.com emulator -------- Error in process <0.9531.1> on node 'couchdb@pbx1-z2.domain.com' with exit value:
{{rexi_DOWN,{'couchdb@pbx2-z1.domain.com',noproc}},[{mem3_rpc,rexi_call,3,[{file,"src/mem3_rpc.erl"},{line,394}]},{mem3_rep,calculate_start_seq,3,[{file,"src/mem3_rep.erl"},{line,402}]},{maps,'-map/2-lc$^0/1-0-',2,[{file,"maps.erl"},{line,232}]},{maps,map,2,[{file,"maps.erl"},{line,232}]},{mem3_rep,calculate_start_seq_multi,1,[{file,"src/mem3_rep.erl"},{line,390}]},{mem3_rep,repl,1,[{file,"src/mem3_rep.erl"},{line,292}]},{mem3_rep,go,1,[{file,"src/mem3_rep.erl"},{line,111}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,"src/mem3_sync.erl"},{line,212}]}]}
[warning] 2021-09-08T04:44:45.933007Z couchdb@pbx1-z2.domain.com <0.1364.0> -------- mem3_sync shards/80000000-bfffffff/account/34/05/a4821ee6f1ade789cb8a1b2fd89e.1620895874 couchdb@pbx2-z1.domain.com {{rexi_DOWN,{'couchdb@pbx2-z1.domain.com',noproc}},[{mem3_rpc,rexi_call,3,[{file,[115,114,99,47,109,101,109,51,95,114,112,99,46,101,114,108]},{line,394}]},{mem3_rep,calculate_start_seq,3,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,402}]},{maps,'-map/2-lc$^0/1-0-',2,[{file,[109,97,112,115,46,101,114,108]},{line,232}]},{maps,map,2,[{file,[109,97,112,115,46,101,114,108]},{line,232}]},{mem3_rep,calculate_start_seq_multi,1,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,390}]},{mem3_rep,repl,1,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,292}]},{mem3_rep,go,1,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,111}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,[115,114,99,47,109,101,109,51,95,115,121,110,99,46,101,114,108]},{line,212}]}]}
The only way I found to correct the issue is to restart the affected node.
Is there some setting to allow the node to "re-sync" after connectivity loss?
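Short of a full restart, the only remsh-level idea I have is something like the sketch below, but I have not verified that it actually clears the condition (killing mem3_sync assumes its supervisor restarts it cleanly, which is an assumption on my part):
%% Drop and re-establish the distribution connection to the remote node:
erlang:disconnect_node('couchdb@pbx2-z1.domain.com').
net_adm:ping('couchdb@pbx2-z1.domain.com').
%% If internal shard sync keeps spinning, force mem3_sync to be restarted by its supervisor:
exit(whereis(mem3_sync), kill).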
Here is our fabric & replicator config:
[fabric]
; all_docs_concurrency = 10
; changes_duration =
; shard_timeout_factor = 2
; uuid_prefix_len = 7
request_timeout = 600000
; all_docs_timeout = 10000
attachments_timeout = 600000
; view_timeout = 3600000
; partition_view_timeout = 3600000
[replicator]
; Random jitter applied on replication job startup (milliseconds)
startup_jitter = 5000
; Number of actively running replications
max_jobs = 500
; Scheduling interval in milliseconds.
interval = 60000
; Maximum number of replications to start and stop during rescheduling.
max_churn = 20
; More worker processes can give higher network throughput but can also
; imply more disk and network IO.
worker_processes = 4
; With lower batch sizes checkpoints are done more frequently. Lower batch sizes
; also reduce the total amount of RAM used.
worker_batch_size = 500
; Maximum number of HTTP connections per replication.
http_connections = 20
; HTTP connection timeout per replication.
; Even for very fast/reliable networks it might need to be increased if a remote
; database is too busy.
connection_timeout = 300000
; Request timeout
request_timeout = 600000
; If a request fails, the replicator will retry it up to N times.
retries_per_request = 5
; Use checkpoints
use_checkpoints = true
; Checkpoint interval
checkpoint_interval = 30000
socket_options = [{keepalive, true}, {nodelay, false}]
; Set to true to validate peer certificates.
verify_ssl_certificates = false
; File containing a list of peer trusted certificates (in the PEM format).
;ssl_trusted_certificates_file = /etc/ssl/certs/ca-certificates.crt
; Maximum peer certificate depth (must be set even if certificate validation is off).
ssl_certificate_max_depth = 3
; Maximum document ID length for replication.
;max_document_id_length = infinity
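For completeness, this is how I read values back from the running node to confirm what it actually picked up; a sketch from a remsh (config:get/3 is the getter I believe CouchDB's config application exposes):
%% config:get(Section, Key, Default)
config:get("replicator", "connection_timeout", undefined).
config:get("fabric", "request_timeout", undefined).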
Thanks,
Victor