
Jormungandr v0.8.7 is not propagating blocks to the network #1625

Open
lovelypool opened this issue Jan 23, 2020 · 14 comments
Labels
bug (Something isn't working), subsys-network (network related)

Comments

@lovelypool commented Jan 23, 2020

Describe the bug
I have had numerous blocks in epoch 40 fail to propagate to the network. My leader logs show the blocks were signed and sent.

Three examples of consecutively missed blocks:

- created_at_time: "2020-01-23T00:44:54.320114806+00:00"
  enclave_leader_id: 1
  finished_at_time: "2020-01-23T02:26:51.002096232+00:00"
  scheduled_at_date: "40.12997"
  scheduled_at_time: "2020-01-23T02:26:51+00:00"
  status:
  Block:
  block: 71bc2c42bfb03ef6c689929d143a6c4c478673b5399fbab48ed63e7d8f718595
  chain_length: 122860
  wake_at_time: "2020-01-23T02:26:51.001334781+00:00"

- created_at_time: "2020-01-23T00:44:54.320117912+00:00"
  enclave_leader_id: 1
  finished_at_time: "2020-01-23T02:55:29.005382635+00:00"
  scheduled_at_date: "40.13856"
  scheduled_at_time: "2020-01-23T02:55:29+00:00"
  status:
  Block:
  block: aa833ac0709d7d2119167e73b8ea815e82fb93add0f85ba9edd5dd5ea67b01e3
  chain_length: 122912
  wake_at_time: "2020-01-23T02:55:29.001757208+00:00"

- created_at_time: "2020-01-23T00:44:54.320132781+00:00"
  enclave_leader_id: 1
  finished_at_time: "2020-01-23T02:59:41.242523233+00:00"
  scheduled_at_date: "40.13982"
  scheduled_at_time: "2020-01-23T02:59:41+00:00"
  status:
  Block:
  block: bbed794f976a44dbfe323e102aa91cfcb8ada9cc2c57ba285f1b4f63fa4077ab
  chain_length: 122925
  wake_at_time: "2020-01-23T02:59:41.241818067+00:00"
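(Entries like the three above come from the node's leader-event logs. A minimal sketch for pulling them over the REST API follows; the host, port, and the /api/v0/leaders/logs path are assumptions for a default 0.8.x REST setup, so adjust them for your node.)

```python
# Minimal sketch: list produced blocks from the leader logs via the REST API.
# Host/port and endpoint path are assumptions for a default jormungandr 0.8.x
# node with the REST listener enabled; adjust for your configuration.
import requests

REST = "http://127.0.0.1:3100/api/v0"  # hypothetical REST listener address

def produced_blocks():
    entries = requests.get(f"{REST}/leaders/logs", timeout=10).json()
    for entry in entries:
        status = entry.get("status")
        # Entries for blocks that were actually signed carry a "Block" status.
        if isinstance(status, dict) and "Block" in status:
            block = status["Block"]
            print(entry["scheduled_at_date"], block["block"], block["chain_length"])

if __name__ == "__main__":
    produced_blocks()
```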

I monitor my node for slot latency (how fast a block propagated to me) and block height delta (whether my node missed a block). The blue arrows mark my leadership events. You can see that one leadership event may have been missed because my node skipped a block, but the other two have no explanation. Other operators on Telegram confirmed that they saw no mention of the block hashes above in their logs.

[image: slot latency and block height delta monitoring graph; blue arrows mark leadership events]
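(A minimal sketch of this kind of monitoring, polling the local REST API, is below. The endpoint path, port, and stats field names lastBlockHeight and lastBlockTime are assumptions for a 0.8.x node; adjust for your setup.)

```python
# Minimal sketch: poll node stats to watch propagation latency and height changes.
# Endpoint path, port, and field names (lastBlockHeight, lastBlockTime) are
# assumptions for a jormungandr 0.8.x REST listener; adjust for your node.
import time
from datetime import datetime, timezone

import requests

REST = "http://127.0.0.1:3100/api/v0"  # hypothetical REST listener address

last_height = None
while True:
    stats = requests.get(f"{REST}/node/stats", timeout=10).json()
    height = int(stats.get("lastBlockHeight", 0))
    if last_height is not None and height > last_height:
        # Slot latency: how long after the block's nominal time we adopted it.
        block_time = datetime.fromisoformat(stats["lastBlockTime"].replace("Z", "+00:00"))
        latency = (datetime.now(timezone.utc) - block_time).total_seconds()
        # Height delta between polls; comparing against a reference node's
        # height instead would show whether this node has fallen behind.
        delta = height - last_height
        print(f"height={height} delta={delta} latency={latency:.1f}s")
    last_height = height
    time.sleep(2)
```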

My node is set to run with 768 connections, and hovers around 350 established TCP connections. It usually becomes unstable (more missed blocks, more slot latency) within 4-6 hours.

I am also using a gossip interval of 60sec to improve uptime and stability.

Mandatory Information

  1. jcli --full-version output;
    jcli 0.8.7 (HEAD-364cd84e+, release, linux [x86_64]) - [rustc 1.40.0 (73528e339 2019-12-16)]
  2. jormungandr --full-version output;
    jormungandr 0.8.7 (HEAD-364cd84e+, release, linux [x86_64]) - [rustc 1.40.0 (73528e339 2019-12-16)]
@johnalotoski

Also a problem on v0.8.6. What does your CPU look like at the time you missed your blocks? Has it hit a ceiling it's staying near at that time? (The ceiling is not necessarily 100% CPU; it depends on the number of physical cores and logical threads of the machine, etc.)

@hodlonaut

Experienced a similar issue while running 0.8.7. The node had been up for several hours with hundreds of gossiping peers and 45 peers showing in the network p2p view for blocks; however, after the block was created there is no sign of it in the logs of other known nodes on the network, and no clue as to what went wrong with propagation:

monitoring8502.log:Jan 26 04:11:43.009 WARN processing the new block and propagating, date: 43.14343, parent: c155d2ac112d1f09ef407f86a1e15335c9fa603b451d8bda2afac3ea88f44ee7, hash: 04efbda0d474ec8b2a1cdd2ee92011a46cff045488f9a624054a7fc56b92621b, task: block
monitoring8502.log:Jan 26 04:11:43.009 INFO update current branch tip, date: 43.14343, parent: c155d2ac112d1f09ef407f86a1e15335c9fa603b451d8bda2afac3ea88f44ee7, hash: 04efbda0d474ec8b2a1cdd2ee92011a46cff045488f9a624054a7fc56b92621b, task: block
monitoring8502.log-Jan 26 04:11:43.012 INFO connecting to peer, node_id: 0cdee9e90f701fece10f27e74057b22f4ed642a5fccc7ad4, peer_addr: 3.19.241.151:9000, task: network
monitoring8502.log-Jan 26 04:11:43.012 INFO connecting to peer, node_id: 66ea9926f0c948a6128147e2024094ef3d1e427b4aa4c15c, peer_addr: 176.9.16.236:19300, task: network
monitoring8502.log-Jan 26 04:11:43.013 INFO connecting to peer, node_id: daddcce5d348e59e9c6b1b4bf434d1c62fcf25d531bab959, peer_addr: 159.89.242.163:8299, task: network
monitoring8502.log-Jan 26 04:11:43.013 INFO connecting to peer, node_id: 4f3c5e499af6c3b336a049b261a65fc52ee7b0da892b282f, peer_addr: 34.90.92.222:3300, task: network
monitoring8502.log-Jan 26 04:11:43.013 INFO connecting to peer, node_id: 7dc781913e7047019a9575a9e8e49baeb4d095a0c6851a3e, peer_addr: 139.99.222.97:3100, task: network

@Straightpool commented Jan 26, 2020

Hmm, I believe I have had the same issue twice; see #1651.

I am not sure but I believe I had this once on 0.8.6 as well, and once on 0.8.7.

Lovelypool, I looked at your three blocks which did not make it into the final blockchain and picked the blocks at the exact same chain length which actually did make it in:

Block 1:
https://explorer.incentivized-testnet.iohkdev.io/block/e9a65d5348b109a7be891eb8b6ce209b3d07c3614a4c22c0a0e8e144ebea87b9
Slot 13002, your node had this scheduled for 12997, 5 slots earlier

Block 2:
https://explorer.incentivized-testnet.iohkdev.io/block/9d22ed4e7262fdd788059ac5aba21255e782618c56f780789459e1596c1654be
Slot 13860, your node had this scheduled for 13856, 4 slots earlier.

Block 3:
https://explorer.incentivized-testnet.iohkdev.io/block/356f50378afc92b883c1f1dc511a2eedf35c06cac05b7f93a5a086534a5d841e
Slot 14013, your node had this scheduled for 13982, 31 slots earlier.

I have had the same issue twice already. For some reason, after successful block production our nodes seem to wrongly conclude that they are at the same blockchain height, with a scheduled slot offset of 3 to 40 slots from the actual slot that produced the block at that chain length.

@adascanBovich commented Jan 26, 2020

> Lovelypool, I looked at your three blocks which did not make it into the final blockchain and picked the blocks at the exact same chain length which actually did make it in:

I think you are confusing the slot schedule with chain length. Chain length has nothing to do with the schedule of slots; it is just an incrementing number representing the total number of blocks the blockchain has right now. Lovely's node produced a block for slot 12997 at chain_length 122860 and it was rejected by the network (or not propagated at all). The node that created the block for slot 13002 just wasn't aware of Lovely's block. Its blockchain had 122859 blocks before creation, which is why the slot 13002 block also has height 122860.

@Straightpool

Lovelypool, did you keep the logs of your missed blocks? Please check against
#1513
From the looks of it, many current issues around block propagation lead back to this root cause.

adascanBovich, according to https://cardanodocs.com/technical/ an epoch is divided into slots, and some of the slots have slot leaders from the community. If multiple slot leaders are assigned to the same slot we have a so-called competitive fork. The slots progress through the epoch, and in each slot a block is produced, adding to the length of the blockchain. So I still think slot progression is somewhat linked to chain-length progression. Due to bugs in Jormungandr not every slot produces a valid block, though.

@adascanBovich

> Lovelypool, did you keep the logs of your missed blocks? Please check against
> #1513
> From the looks of it, many current issues around block propagation lead back to this root cause.
>
> adascanBovich, according to https://cardanodocs.com/technical/ an epoch is divided into slots, and some of the slots have slot leaders from the community. If multiple slot leaders are assigned to the same slot we have a so-called competitive fork. The slots progress through the epoch, and in each slot a block is produced, adding to the length of the blockchain. So I still think slot progression is somewhat linked to chain-length progression. Due to bugs in Jormungandr not every slot produces a valid block, though.

Exactly! If multiple leaders are assigned to the SAME slot. 13002 and 12997 are not the same slot (from the example above), nor are 41.8603 and 41.8643 from your issue. Multiple leadership means that two or more pools try to produce a block in the same slot, 41.8603 for example.

@lovelypool
Author

Sorry, my logs for this event have since been rolled over, so I cannot provide any new data for this bug report. To work around this issue, I have been running v0.8.5 with my old restart script for the past few days and things have been good enough. Granted, it's been painful to see this issue not get much traction yet.

@hodlonaut

Looks like this can still happen on the latest version of jormungandr (0.8.9) - unless someone has evidence of a sighting of an announcement for block b251b0483770033960b6e82be36f873a005f1be9f29d6eb5740fa5e3a11f7a0f, that is!

@Oqulent commented Feb 2, 2020

I can confirm this is happening in v0.8.9. The node successfully produced the block, propagated(?) it, and saw it on its end, but the network never received it.

  • created_at_time: "2020-02-02T05:53:42.978154726+00:00"
    enclave_leader_id: 1
    finished_at_time: "2020-02-02T06:24:27.004548220+00:00"
    scheduled_at_date: "50.20125"
    scheduled_at_time: "2020-02-02T06:24:27+00:00"
    status:
    Block:
    block: dfb654859e509b185170cb593f93cbc79361237cfbe2cb412300ed7e71877dde
    chain_length: 155626
    wake_at_time: "2020-02-02T06:24:27.001879847+00:00"

@rinor added the bug (Something isn't working) and subsys-network (network related) labels on Feb 2, 2020
@Straightpool commented Feb 6, 2020

I just had the same issue on:

jcli --full-version output;
jcli 0.8.9 (8.9-30d20d2e, release, linux [x86_64]) - [rustc 1.40.0 (73528e339 2019-12-16)]
jormungandr --full-version output;
jormungandr 0.8.9 (8.9-30d20d2e, release, linux [x86_64]) - [rustc 1.40.0 (73528e339 2019-12-16)]

Block produced

  • created_at_time: "2020-02-06T19:29:07.411320236+00:00"
    enclave_leader_id: 1
    finished_at_time: "2020-02-06T22:08:25.381939659+00:00"
    scheduled_at_date: "55.5244"
    scheduled_at_time: "2020-02-06T22:08:25+00:00"
    status:
    Block:
    block: 34511c5d80fc15ce58b6b03f97294920aa16195f9287fb919196ba5325424f10
    chain_length: 170916
    wake_at_time: "2020-02-06T22:08:25.380166256+00:00"

Logoutput

499ns, event_end: 2020-02-06T22:08:27+00:00, event_start: 2020-02-06T22:08:25+00:00, event_date: 55.5244, leader_id: 1, task: leadership
123147:Feb 06 22:08:25 x start_node2.sh[38531]: Feb 06 22:08:25.384 INFO receiving block from leadership service, date: 55.5244, parent: 6ee4f706bf7b60195049854b47a9da0a106bc9cf25ed644e0704b42151107d6d, hash: 34511c5d80fc15ce58b6b03f97294920aa16195f9287fb919196ba5325424f10, task: block
123435-Feb 06 22:08:25 x start_node2.sh[38531]: Feb 06 22:08:25.396 INFO failed to connect to peer, reason: Connection refused (os error 111), node_id: aaa04779afac81a2201224d428bd331d1be54648b3457d11, peer_addr: 45.77.57.228:3000, task: network
123682:Feb 06 22:08:25 x start_node2.sh[38531]: Feb 06 22:08:25.408 INFO block from leader event successfully stored, date: 55.5244, parent: 6ee4f706bf7b60195049854b47a9da0a106bc9cf25ed644e0704b42151107d6d, hash: 34511c5d80fc15ce58b6b03f97294920aa16195f9287fb919196ba5325424f10, task: block
123974:Feb 06 22:08:25 x start_node2.sh[38531]: Feb 06 22:08:25.409 INFO update current branch tip: 6ee4f706-00029ba3-55.5230 -> 34511c5d-00029ba4-55.5244, date: 55.5244, parent: 6ee4f706bf7b60195049854b47a9da0a106bc9cf25ed644e0704b42151107d6d, hash: 34511c5d80fc15ce58b6b03f97294920aa16195f9287fb919196ba5325424f10, task: block
124304-Feb 06 22:08:25 x start_node2.sh[38531]: Feb 06 22:08:25.525 INFO connecting to peer, node_id: ffaf0bb7697de092926429cf39e90eb3e248b16651d29516, peer_addr: 167.71.118.1:3000, task: network
124501-Feb 06 22:08:25 x start_node2.sh[38531]: Feb 06 22:08:25.612 INFO failed to connect to peer, reason: Connection refused (os error 111), node_id: bb79641b9228c1fd4b39755f793fb607b3a5b38218bad194, peer_addr: 207.180.253.140:3000, task: network
124751-Feb 06 22:08:25 x start_node2.sh[38531]: Feb 06 22:08:25.617 INFO failed to connect to peer, reason: No route to host (os error 113), node_id: 835e61a1ab32e7424d475ad30b779dceb0a7621007739148, peer_addr: 45.79.33.61:3000, task: network
124995-Feb 06 22:08:25 x start_node2.sh[38531]: Feb 06 22:08:25.806 INFO connecting to peer, node_id: fff2288c81ba6f39df9c163ddb901fa71e0447da60b974a9, peer_addr: 78.47.174.210:3000, task: network
125193-Feb 06 22:08:25 x start_node2.sh[38531]: Feb 06 22:08:25.809 INFO connecting to peer, node_id: 0fe54118316354ef6f8212dc511ca96b0554fee0c9b25b5c, peer_addr: 185.177.59.209:3100, task: network

The parent was correct, as seen in the block before 170916 with hash 6ee4f706bf7b60195049854b47a9da0a106bc9cf25ed644e0704b42151107d6d
https://explorer.incentivized-testnet.iohkdev.io/block/chainLength/170915

Blockchain height 170916 was instead produced at slot 55.5255, as my block was not propagated due to this issue: https://explorer.incentivized-testnet.iohkdev.io/block/chainLength/170916

This was not a height battle; see also https://pooltool.s3-us-west-2.amazonaws.com/8e4d2a3/pools/3a192bc7285d7ddfd77c0b854b4008bbf4c5fe53d1ad42770f6ffdf9b3d4c7d6/winloss.json

Why is this issue not getting any love?

@rinor (Contributor) commented Feb 7, 2020

@Straightpool this issue is a side effect of other issues that have been fixed and/or are being fixed right now. It is getting very much "love", and a lot of work is being done that will hopefully make it into the next release. Some of the issues are related to the network topology (primetype/poldercast#15, primetype/poldercast#16), which combined with other issues also causes #1599 and has a temporary solution in #1739, but there is more work in progress. So it is not that this issue is not being checked or loved; it is just that there is too much work being done around different areas that will fix/improve the overall network issues and side effects, including this one.

@trailbreak

I can confirm that this issue is still occurring in v0.8.11. In two instances, blocks were created but not transmitted across the network. The node's clock was confirmed to be in sync with a max variance of 5.3ms and an estimated variance of 43μs. In both cases prior blocks were received within 700ms of creation. No battles or forks were detected by pooltool.io for those blocks. The node that created both blocks had been active for 26 hours and had successfully created blocks at 71.15272, 71.22154 and 72.10641 during that period.

  • created_at_time: "2020-02-23T19:13:52.191999071+00:00"
    enclave_leader_id: 1
    finished_at_time: "2020-02-24T01:48:19.007526036+00:00"
    scheduled_at_date: "72.11841"
    scheduled_at_time: "2020-02-24T01:48:19+00:00"
    status:
    Block:
    block: 739278dd26cccb9515101f69f11cabc36477e4159601003d61067643db22c4b7
    chain_length: 234574
    wake_at_time: "2020-02-24T01:48:19.000999302+00:00"

A second failure occurred at 72.16305. The node was restarted before I could view the logs, but in the second case I was watching my monitoring software and could see the scheduled block being created, and could see two backup nodes fail to acknowledge the newly created block.

In both cases, there was a significant delay between the time of block creation and the next scheduled block. The 6h average sync delay for blocks on that node was 512ms with a standard deviation of 434ms, with no delayed or doubled blocks.
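(A minimal sketch of the cross-check described above: asking backup nodes whether they ever saw the produced block via GET /api/v0/block/<hash>, which returns the block only if the node has it. The backup host list, port, and block hash below are placeholders.)

```python
# Minimal sketch: ask backup/relay nodes whether a produced block ever reached them.
# Hosts, port, and the block hash are hypothetical placeholders.
import requests

BACKUPS = ["http://10.0.0.11:3100", "http://10.0.0.12:3100"]  # nodes I control
BLOCK = "739278dd26cccb9515101f69f11cabc36477e4159601003d61067643db22c4b7"

for host in BACKUPS:
    try:
        resp = requests.get(f"{host}/api/v0/block/{BLOCK}", timeout=10)
        seen = resp.status_code == 200  # 200 means the node has the block stored
    except requests.RequestException:
        seen = False
    print(f"{host}: {'has the block' if seen else 'never saw it'}")
```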

P2P Config settings:

p2p:
  topics_of_interest:
    blocks: high
    messages: high
  max_connections: 2048
  max_connections_threshold: 128
  max_client_connections: 16
  gossip_interval: "1s"
  max_unreachable_nodes_to_connect_per_event: 5
  max_bootstrap_attempts: 1
  policy:
    quarantine_duration: "1h"

I had the log level set to "warn", and nothing was posted to the logs at or before block creation time to indicate that the node was unable to transmit blocks.

@arussell29

Along with trailbreak's recent comment, I will also confirm that I just experienced this same loss of a block using v0.8.11 in epoch 73. This is still happening. I confirmed the node was in sync and the block was made with the correct parent, produced and signed at the correct time in the log. It just did not propagate to the network.

@trailbreak commented Mar 15, 2020

I can again confirm this issue, now with v0.8.14. The node looked to be in perfect working order, with very low branch switching (< 3 events per hour), no slow or delayed blocks, and an average branch update time of < 400ms. The gossip rate was also very healthy at 16 "received block announcement" events per new block. The only factor that looks problematic is that the node had been active for 25.23 hours at the time of block creation. Every time we've seen this issue, it has been with a node that had been active for more than 24 hours.

Can you please add info-level log entries for when a block is sent out to the network, to help debug this issue? Thank you.

Leader Logs:

  • created_at_time: "2020-03-14T19:13:52.085093652+00:00"
    enclave_leader_id: 1
    finished_at_time: "2020-03-15T17:42:49.015243352+00:00"
    scheduled_at_date: "92.40476"
    scheduled_at_time: "2020-03-15T17:42:49+00:00"
    status:
    Block:
    block: 52fd81e493d32092e03c92eec427f4a087f594cc4dbd007a9cdd32fddf2af17f
    chain_length: 308691
    parent: 4cc6e927b340037e6c6a07bcf23d7e1195d62765076fe90a3b81d38708bcaca7
    wake_at_time: "2020-03-15T17:42:49.006057693+00:00"

Logs:

Mar 15 17:42:49 upool1 jormungandr[41569]: Leader event started, event_remaining_time: 1s 991ms 45us 458ns, event_end: 2020-03-15T17:42:51+00:00, event_start: 2020-03-15T17:42:49+00:00, event_date: 92.40476, leader_id: 1, task: leadership
Mar 15 17:42:49 upool1 jormungandr[41569]: receiving block from leadership service, date: 92.40476, parent: 4cc6e927b340037e6c6a07bcf23d7e1195d62765076fe90a3b81d38708bcaca7, hash: 52fd81e493d32092e03c92eec427f4a087f594cc4dbd007a9cdd32fddf2af17f, task: block
Mar 15 17:42:49 upool1 jormungandr[41569]: block from leader event successfully stored, date: 92.40476, parent: 4cc6e927b340037e6c6a07bcf23d7e1195d62765076fe90a3b81d38708bcaca7, hash: 52fd81e493d32092e03c92eec427f4a087f594cc4dbd007a9cdd32fddf2af17f, task: block
Mar 15 17:42:49 upool1 jormungandr[41569]: update current branch tip: 4cc6e927-0004b5d2-92.40473 -> 52fd81e4-0004b5d3-92.40476, date: 92.40476, parent: 4cc6e927b340037e6c6a07bcf23d7e1195d62765076fe90a3b81d38708bcaca7, hash: 52fd81e493d32092e03c92eec427f4a087f594cc4dbd007a9cdd32fddf2af17f, task: block
