cluster re-sync after connectivity loss #3748
Unanswered · WorkingClass asked this question in Q&A · 2 comments
- Aljoscha Weber:
Hey Victor, we ran into the same issue. Have you found any solution other than restarting the node?
- Victor (WorkingClass):
Hello Aljoscha,
We upgraded to the latest 3.1.x and increased the bandwidth between our two DCs. No magic.
Sorry for the late reply.
Thanks
Original question (WorkingClass):
Hello,
We run a CouchDB 3.x (git rev-parse HEAD: e83935c) cluster with 3 nodes.
[cluster]
q=4
n=3
placement = z1:2,z2:1
z1 and z2 are in different datacenters.
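For reference, when the inter-DC link is flaky I check which cluster nodes the Erlang VM still sees from a remsh; a minimal sketch with standard calls (the remsh path and the node names below are just our setup, adjust to yours):
%% e.g. /opt/couchdb/bin/remsh (path depends on the install)
%% Nodes this VM is currently connected to; expect both z1 nodes when healthy:
nodes().
%% Probe a specific peer; returns pong if reachable, pang otherwise:
net_adm:ping('couchdb@pbx1-z1.domain.com').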
Cause of the issue:
[error] 2021-09-08T03:35:35.775157Z couchdb@pbx1-z2.domain.com <0.27127.173> -------- ** Node 'couchdb@pbx1-z1.domain.com' not responding **
** Removing (timedout) connection **
[error] 2021-09-08T03:35:35.775288Z couchdb@pbx1-z2.domain.com <0.30996.2> -------- ** Node 'couchdb@pbx2-z1.domain.com' not responding **
** Removing (timedout) connection **
[error] 2021-09-08T03:35:35.781578Z couchdb@pbx1-z2.domain.com emulator -------- Error in process <0.1593.0> on node 'couchdb@pbx1-z2.domain.com' with exit value:
{{nocatch,{error,{nodedown,<<"progress not possible">>}}},[{fabric_view_changes,send_changes,6,[{file,"src/fabric_view_changes.erl"},{line,200}]},{fabric_view_changes,keep_sending_changes,8,[{file,"src/fabric_view_changes.erl"},{line,82}]},{fabric_view_changes,go,5,[{file,"src/fabric_view_changes.erl"},{line,43}]}]}
[warning] 2021-09-08T03:35:36.410891Z couchdb@pbx1-z2.domain.com <0.24863.708> -------- 4312 shards in cluster with only 1 copy on nodes that are currently up
[warning] 2021-09-08T03:35:36.411025Z couchdb@pbx1-z2.domain.com <0.24863.708> -------- 4312 shards in cluster with only 1 copy on nodes not in maintenance mode
[warning] 2021-09-08T03:35:36.411071Z couchdb@pbx1-z2.domain.com <0.24863.708> -------- 2 conflicted shards in cluster
[warning] 2021-09-08T03:35:36.981372Z couchdb@pbx1-z2.domain.com <0.469.710> -------- 4312 shards in cluster with only 1 copy on nodes that are currently up
[warning] 2021-09-08T03:35:36.981488Z couchdb@pbx1-z2.domain.com <0.469.710> -------- 4312 shards in cluster with only 1 copy on nodes not in maintenance mode
[warning] 2021-09-08T03:35:36.981549Z couchdb@pbx1-z2.domain.com <0.469.710> -------- 2 conflicted shards in cluster
[warning] 2021-09-08T03:54:36.904890Z couchdb@pbx1-z2.domain.com <0.15445.710> -------- 2 conflicted shards in cluster
[warning] 2021-09-08T03:54:36.904982Z couchdb@pbx1-z2.domain.com <0.15445.710> -------- 4312 shards in cluster with only 2 copies on nodes that are currently up
[warning] 2021-09-08T03:54:36.905051Z couchdb@pbx1-z2.domain.com <0.15445.710> -------- 4312 shards in cluster with only 2 copies on nodes not in maintenance mode
[error] 2021-09-08T03:54:37.861002Z couchdb@pbx1-z2.domain.com emulator -------- Error in process <0.815.707> on node 'couchdb@pbx1-z2.domain.com' with exit value:
{{rexi_EXIT,{{badmatch,{'EXIT',noproc}},[{couch_file,pread_binary,2,[{file,"src/couch_file.erl"},{line,172}]},{couch_file,pread_term,2,[{file,"src/couch_file.erl"},{line,160}]},{couch_btree,get_node,2,[{file,"src/couch_btree.erl"},{line,435}]},{couch_btree,lookup,3,[{file,"src/couch_btree.erl"},{line,286}]},{couch_btree,lookup,2,[{file,"src/couch_btree.erl"},{line,276}]},{couch_bt_engine,open_local_docs,2,[{file,"src/couch_bt_engine.erl"},{line,407}]},{couch_db,open_doc_int,3,[{file,"src/couch_db.erl"},{line,1664}]},{couch_db,open_doc,3,[{file,"src/couch_db.erl"},{line,292}]}]}},[{mem3_rpc,rexi_call,3,[{file,"src/mem3_rpc.erl"},{line,392}]},{mem3_rep,calculate_start_seq,3,[{file,"src/mem3_rep.erl"},{line,402}]},{maps,'-map/2-lc$^0/1-0-',2,[{file,"maps.erl"},{line,232}]},{maps,map,2,[{file,"maps.erl"},{line,232}]},{mem3_rep,calculate_start_seq_multi,1,[{file,"src/mem3_rep.erl"},{line,390}]},{mem3_rep,repl,1,[{file,"src/mem3_rep.erl"},{line,292}]},{mem3_rep,go,1,[{file,"src/mem3_rep.erl"},{line,111}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,"src/mem3_sync.erl"},{line,212}]}]}
[warning] 2021-09-08T03:54:37.861771Z couchdb@pbx1-z2.domain.com <0.336.0> -------- mem3_sync shards/40000000-7fffffff/account/1e/1f/3b973e1db60829651a007231a52f-202109.1630229760 couchdb@pbx2-z1.domain.com {{rexi_EXIT,{{badmatch,{'EXIT',noproc}},[{couch_file,pread_binary,2,[{file,[115,114,99,47,99,111,117,99,104,95,102,105,108,101,46,101,114,108]},{line,172}]},{couch_file,pread_term,2,[{file,[115,114,99,47,99,111,117,99,104,95,102,105,108,101,46,101,114,108]},{line,160}]},{couch_btree,get_node,2,[{file,[115,114,99,47,99,111,117,99,104,95,98,116,114,101,101,46,101,114,108]},{line,435}]},{couch_btree,lookup,3,[{file,[115,114,99,47,99,111,117,99,104,95,98,116,114,101,101,46,101,114,108]},{line,286}]},{couch_btree,lookup,2,[{file,[115,114,99,47,99,111,117,99,104,95,98,116,114,101,101,46,101,114,108]},{line,276}]},{couch_bt_engine,open_local_docs,2,[{file,[115,114,99,47,99,111,117,99,104,95,98,116,95,101,110,103,105,110,101,46,101,114,108]},{line,407}]},{couch_db,open_doc_int,3,[{file,[115,114,99,47,99,111,117,99,104,95,100,98,46,101,114,108]},{line,1664}]},{couch_db,open_doc,3,[{file,[115,114,99,47,99,111,117,99,104,95,100,98,46,101,114,108]},{line,292}]}]}},[{mem3_rpc,rexi_call,3,[{file,[115,114,99,47,109,101,109,51,95,114,112,99,46,101,114,108]},{line,392}]},{mem3_rep,calculate_start_seq,3,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,402}]},{maps,'-map/2-lc$^0/1-0-',2,[{file,[109,97,112,115,46,101,114,108]},{line,232}]},{maps,map,2,[{file,[109,97,112,115,46,101,114,108]},{line,232}]},{mem3_rep,calculate_start_seq_multi,1,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,390}]},{mem3_rep,repl,1,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,292}]},{mem3_rep,go,1,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,111}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,[115,114,99,47,109,101,109,51,95,115,121,110,99,46,101,114,108]},{line,212}]}]}
[warning] 2021-09-08T03:54:38.219543Z couchdb@pbx1-z2.domain.com <0.25714.711> -------- 2 conflicted shards in cluster
[error] 2021-09-08T03:54:39.795469Z couchdb@pbx1-z2.domain.com emulator -------- Error in process <0.2767.711> on node 'couchdb@pbx1-z2.domain.com' with exit value:
{function_clause,[{couch_db,incref,[undefined],[{file,"src/couch_db.erl"},{line,190}]},{couch_server,open_int,2,[{file,"src/couch_server.erl"},{line,106}]},{couch_server,open,2,[{file,"src/couch_server.erl"},{line,96}]},{couch_db,open,2,[{file,"src/couch_db.erl"},{line,163}]},{mem3_rep,go,1,[{file,"src/mem3_rep.erl"},{line,107}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,"src/mem3_sync.erl"},{line,212}]}]}
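The first two entries above are the Erlang distribution layer dropping the inter-node connection once the tick timeout expires. For reference, the tick time can be read, and changed on the running VM, from a remsh; a minimal sketch with standard net_kernel calls (the value 120 is only an illustration, not something we have validated):
%% Current distribution tick time in seconds (OTP default is 60):
net_kernel:get_net_ticktime().
%% Raise it for a slow/flaky inter-DC link; the change is applied gradually:
net_kernel:set_net_ticktime(120).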
The issue: if the connectivity between the DCs is interrupted, one or two of the instances run at high CPU because they keep retrying this, even after connectivity is re-established:
[error] 2021-09-08T04:44:45.926690Z couchdb@pbx1-z2.domain.com emulator -------- Error in process <0.9532.1> on node 'couchdb@pbx1-z2.domain.com' with exit value:
{{rexi_DOWN,{'couchdb@pbx2-z1.domain.com',noproc}},[{mem3_rpc,rexi_call,3,[{file,"src/mem3_rpc.erl"},{line,394}]},{mem3_rep,calculate_start_seq,3,[{file,"src/mem3_rep.erl"},{line,402}]},{maps,'-map/2-lc$^0/1-0-',2,[{file,"maps.erl"},{line,232}]},{maps,map,2,[{file,"maps.erl"},{line,232}]},{mem3_rep,calculate_start_seq_multi,1,[{file,"src/mem3_rep.erl"},{line,390}]},{mem3_rep,repl,1,[{file,"src/mem3_rep.erl"},{line,292}]},{mem3_rep,go,1,[{file,"src/mem3_rep.erl"},{line,111}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,"src/mem3_sync.erl"},{line,212}]}]}
[error] 2021-09-08T04:44:45.926935Z couchdb@pbx1-z2.domain.com emulator -------- Error in process <0.9531.1> on node 'couchdb@pbx1-z2.domain.com' with exit value:
{{rexi_DOWN,{'couchdb@pbx2-z1.domain.com',noproc}},[{mem3_rpc,rexi_call,3,[{file,"src/mem3_rpc.erl"},{line,394}]},{mem3_rep,calculate_start_seq,3,[{file,"src/mem3_rep.erl"},{line,402}]},{maps,'-map/2-lc$^0/1-0-',2,[{file,"maps.erl"},{line,232}]},{maps,map,2,[{file,"maps.erl"},{line,232}]},{mem3_rep,calculate_start_seq_multi,1,[{file,"src/mem3_rep.erl"},{line,390}]},{mem3_rep,repl,1,[{file,"src/mem3_rep.erl"},{line,292}]},{mem3_rep,go,1,[{file,"src/mem3_rep.erl"},{line,111}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,"src/mem3_sync.erl"},{line,212}]}]}
[warning] 2021-09-08T04:44:45.933007Z couchdb@pbx1-z2.domain.com <0.1364.0> -------- mem3_sync shards/80000000-bfffffff/account/34/05/a4821ee6f1ade789cb8a1b2fd89e.1620895874 couchdb@pbx2-z1.domain.com {{rexi_DOWN,{'couchdb@pbx2-z1.domain.com',noproc}},[{mem3_rpc,rexi_call,3,[{file,[115,114,99,47,109,101,109,51,95,114,112,99,46,101,114,108]},{line,394}]},{mem3_rep,calculate_start_seq,3,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,402}]},{maps,'-map/2-lc$^0/1-0-',2,[{file,[109,97,112,115,46,101,114,108]},{line,232}]},{maps,map,2,[{file,[109,97,112,115,46,101,114,108]},{line,232}]},{mem3_rep,calculate_start_seq_multi,1,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,390}]},{mem3_rep,repl,1,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,292}]},{mem3_rep,go,1,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,111}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,[115,114,99,47,109,101,109,51,95,115,121,110,99,46,101,114,108]},{line,212}]}]}
The only way I found to correct the issue is to restart the affected node.
Is there some setting to allow the node to "re-sync" after connectivity loss?
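Short of a full restart, the only remsh-level idea I have is something like the sketch below, but I have not verified that it actually clears the condition (killing mem3_sync assumes its supervisor restarts it cleanly, which is an assumption on my part):
%% Drop and re-establish the distribution connection to the remote node:
erlang:disconnect_node('couchdb@pbx2-z1.domain.com').
net_adm:ping('couchdb@pbx2-z1.domain.com').
%% If internal shard sync keeps spinning, force mem3_sync to be restarted by its supervisor:
exit(whereis(mem3_sync), kill).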
Here is our fabric & replicator config:
[fabric]
; all_docs_concurrency = 10
; changes_duration =
; shard_timeout_factor = 2
; uuid_prefix_len = 7
request_timeout = 600000
; all_docs_timeout = 10000
attachments_timeout = 600000
; view_timeout = 3600000
; partition_view_timeout = 3600000
[replicator]
; Random jitter applied on replication job startup (milliseconds)
startup_jitter = 5000
; Number of actively running replications
max_jobs = 500
; Scheduling interval in milliseconds.
interval = 60000
; Maximum number of replications to start and stop during rescheduling.
max_churn = 20
; More worker processes can give higher network throughput but can also
; imply more disk and network IO.
worker_processes = 4
; With lower batch sizes checkpoints are done more frequently. Lower batch sizes
; also reduce the total amount of RAM used.
worker_batch_size = 500
; Maximum number of HTTP connections per replication.
http_connections = 20
; HTTP connection timeout per replication.
; Even for very fast/reliable networks it might need to be increased if a remote
; database is too busy.
connection_timeout = 300000
; Request timeout
request_timeout = 600000
; If a request fails, the replicator will retry it up to N times.
retries_per_request = 5
; Use checkpoints
use_checkpoints = true
; Checkpoint interval
checkpoint_interval = 30000
socket_options = [{keepalive, true}, {nodelay, false}]
; Set to true to validate peer certificates.
verify_ssl_certificates = false
; File containing a list of peer trusted certificates (in the PEM format).
;ssl_trusted_certificates_file = /etc/ssl/certs/ca-certificates.crt
; Maximum peer certificate depth (must be set even if certificate validation is off).
ssl_certificate_max_depth = 3
; Maximum document ID length for replication.
;max_document_id_length = infinity
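For completeness, this is how I read values back from the running node to confirm what it actually picked up; a sketch from a remsh (config:get/3 is the getter I believe CouchDB's config application exposes):
%% config:get(Section, Key, Default)
config:get("replicator", "connection_timeout", undefined).
config:get("fabric", "request_timeout", undefined).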
Thanks,
Victor