Jormungandr v0.8.7 is not propagating blocks to the network #1625
Comments
Also a problem on v0.8.6. What does your CPU usage look like at the time you missed your blocks? Has it hit a ceiling it's staying near at that time? (The ceiling is not necessarily 100% CPU; it depends on the number of physical cores and logical threads of the machine, etc.)
Experienced a similar issue while running 0.8.7. The node had been up for several hours with hundreds of gossiping peers and 45 peers showing up in the network p2p view for blocks; however, after the block was created there is no sign of it in the logs of other known nodes on the network, and no clue as to what went wrong with propagation:
monitoring8502.log: Jan 26 04:11:43.009 WARN processing the new block and propagating, date: 43.14343, parent: c155d2ac112d1f09ef407f86a1e15335c9fa603b451d8bda2afac3ea88f44ee7, hash: 04efbda0d474ec8b2a1cdd2ee92011a46cff045488f9a624054a7fc56b92621b, task: block
Hmm, I believe I have had the same issue twice, see #1651. I am not sure, but I believe I had this once on 0.8.6 and once on 0.8.7. Lovelypool, I looked at your three blocks which did not make it into the final blockchain and picked the blocks with the exact same chain length which actually did make it into the final blockchain:
Block 1:
Block 2:
Block 3:
I have had the same issue twice already. For some reason, after successfully producing a block, our nodes seem to conclude in error that they are at the same blockchain height as a slot scheduled 3 to 40 slots after the slot that actually produced the block for that chain length.
I think you are confusing the slot schedule with chain length. Chain length has nothing to do with the schedule of slots; it is just an incrementing number representing the total number of blocks the blockchain currently contains. Lovely's node produced the block for slot 12997 at chain_length 122860 and it was rejected by the network (or not propagated at all). The node that created the block for slot 13002 just wasn't aware of Lovely's block. Its chain had 122859 blocks before creation, which is why the block for slot 13002 has height 122860.
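As an illustration of that distinction, here is a minimal sketch (Python, using only numbers already quoted in this thread) of two candidate blocks targeting the same chain length from different slots:

```python
# Two different slots can both target the same chain length; the network keeps
# whichever block it actually receives and adopts. Numbers are taken from the
# examples discussed in this thread.
lovely_block = {
    "slot": "40.12997",
    "chain_length": 122860,
    "block": "71bc2c42bfb03ef6c689929d143a6c4c478673b5399fbab48ed63e7d8f718595",
}
adopted_block = {
    "slot": "40.13002",   # produced by a node that never saw Lovely's block
    "chain_length": 122860,
}

# Same height, different slots: chain length says nothing about which slot "won".
assert lovely_block["chain_length"] == adopted_block["chain_length"]
assert lovely_block["slot"] != adopted_block["slot"]
```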
Lovelypool, did you keep the logs of your missed blocks? Please check against adascan. Bovich, according to https://cardanodocs.com/technical/ an epoch is divided into slots, and some of the slots have slot leaders from the community. If multiple slot leaders are assigned to the same slot, we have a so-called competitive fork. The slots progress through the epoch, and in each slot a block is produced, adding to the length of the blockchain. So I still think slot progression is somewhat linked to chain length progression. Due to bugs in Jormungandr, not every slot produces a valid block, though.
Exactly! If multiple leaders are assigned to the SAME slot. 13002 and 12997 are not the same slot (from the example above), nor are 41.8603 and 41.8643 from your issue. Multiple leadership means that two or more pools try to produce a block in the same slot, 41.8603 for example.
Sorry, my logs for this event have since been rolled over, so I cannot provide any new data for this bug report. To work around this issue, I have been running v0.8.5 with my old restart script for the past few days and things have been good enough. Granted, it's been painful to see this issue not get much traction yet.
Looks like this can still happen on the latest version of Jormungandr (0.8.9) - unless someone has evidence of a sighting of a block announcement for block b251b0483770033960b6e82be36f873a005f1be9f29d6eb5740fa5e3a11f7a0f, that is!
I can confirm this is happening in v0.8.9. The node successfully produced the block, (supposedly) propagated it, and saw it on its own end, but the network never received it.
I just had the same issue. jcli --full-version output:
Block produced:
Log output: 499ns, event_end: 2020-02-06T22:08:27+00:00, event_start: 2020-02-06T22:08:25+00:00, event_date: 55.5244, leader_id: 1, task: leadership
The parent was correct, as seen in the block before 170916 with hash 6ee4f706bf7b60195049854b47a9da0a106bc9cf25ed644e0704b42151107d6d. Blockchain height 170916 was instead produced by slot 55.5255, as my block was not propagated due to this issue: https://explorer.incentivized-testnet.iohkdev.io/block/chainLength/170916
This was not a height battle, see also https://pooltool.s3-us-west-2.amazonaws.com/8e4d2a3/pools/3a192bc7285d7ddfd77c0b854b4008bbf4c5fe53d1ad42770f6ffdf9b3d4c7d6/winloss.json
Why is this issue not getting any love?
@Straightpool this issue is a side effect of other issues fixed and/or being fixed right now. It is getting very much "love", and a lot of work is being done that hopefully will make it into the next release. Some of the issues are related to the network topology (primetype/poldercast#15, primetype/poldercast#16), which, combined with other ones, also causes #1599 and has a temporary solution in #1739, but there is other work being done as well. So it is not that this issue is not being checked or loved; there is just a lot of work being done across different areas that will fix/improve the overall network issues and side effects, including this one.
I can confirm that this issue is still occurring in v0.8.11. In two instances, blocks were created but not transmitted across the network. The node's clock was confirmed to be in sync with a max variance of 5.3ms and an estimated variance of 43μs. In both cases, prior blocks were received within 700ms of creation. No battles or forks were detected by pooltool.io for those blocks. The node that created both blocks had been active for 26 hours and had successfully created blocks at 71.15272, 71.22154 and 72.10641 during that period.
A second failure occurred at 72.16305. The node was restarted before I could view the logs, but in the second case I was observing my monitoring software and could see the scheduled block being created and two backup nodes failing to acknowledge the newly created block. In both cases, there was a significant delay between the time of block creation and the next scheduled block. The 6h average sync delay for blocks on that node was 512ms with a standard deviation of 434ms, with no delayed or doubled blocks.
P2P config settings: p2p:
I had the log level set to "warn" and no data was posted to the logs at or before block creation times to indicate that the node was not able to transmit blocks.
Along with trailbreak's recent comment, I can also confirm that I just experienced this same loss of a block using v0.8.11 in epoch 73. This is still happening. I confirmed the node was in sync and the block was made with the correct parent, done and signed at the correct time in the log. It just did not propagate to the network.
I can again confirm this issue, now with v0.8.14. The node looked to be in perfect working order, with very low branch switching (< 3 events per hour), no slow or delayed blocks, and an average branch update time of < 400ms. The gossip rate was also very healthy at 16 "received block announcement" events per new block. The only factor that looks problematic is that the node had been active for 25.23 hours at the time of block creation. Every time we've seen this issue, it's been with a node that had been active for more than 24 hours. Can you please add info-level log entries for when a block is sent out to the network, to help debug this issue? Thank you. Leader logs:
Logs: Mar 15 17:42:49 upool1 jormungandr[41569]: Leader event started, event_remaining_time: 1s 991ms 45us 458ns, event_end: 2020-03-15T17:42:51+00:00, event_start: 2020-03-15T17:42:49+00:00, event_date: 92.40476, leader_id: 1, task: leadership
Describe the bug
I have had numerous blocks in epoch 40 not propagate to the network. My leader logs show the blocks were signed and sent.
Three examples of consecutively missed blocks:
- created_at_time: "2020-01-23T00:44:54.320114806+00:00"
  enclave_leader_id: 1
  finished_at_time: "2020-01-23T02:26:51.002096232+00:00"
  scheduled_at_date: "40.12997"
  scheduled_at_time: "2020-01-23T02:26:51+00:00"
  status:
    Block:
      block: 71bc2c42bfb03ef6c689929d143a6c4c478673b5399fbab48ed63e7d8f718595
      chain_length: 122860
  wake_at_time: "2020-01-23T02:26:51.001334781+00:00"
- created_at_time: "2020-01-23T00:44:54.320117912+00:00"
  enclave_leader_id: 1
  finished_at_time: "2020-01-23T02:55:29.005382635+00:00"
  scheduled_at_date: "40.13856"
  scheduled_at_time: "2020-01-23T02:55:29+00:00"
  status:
    Block:
      block: aa833ac0709d7d2119167e73b8ea815e82fb93add0f85ba9edd5dd5ea67b01e3
      chain_length: 122912
  wake_at_time: "2020-01-23T02:55:29.001757208+00:00"
- created_at_time: "2020-01-23T00:44:54.320132781+00:00"
  enclave_leader_id: 1
  finished_at_time: "2020-01-23T02:59:41.242523233+00:00"
  scheduled_at_date: "40.13982"
  scheduled_at_time: "2020-01-23T02:59:41+00:00"
  status:
    Block:
      block: bbed794f976a44dbfe323e102aa91cfcb8ada9cc2c57ba285f1b4f63fa4077ab
      chain_length: 122925
  wake_at_time: "2020-01-23T02:59:41.241818067+00:00"
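One way to check whether a block from the leader logs ever reached the rest of the network is to ask another node's REST API for it. A rough sketch (Python) follows; it assumes the v0 REST API's block-by-id endpoint and a second node you can query - the address is a placeholder:

```python
# Sketch: ask another node whether it knows the blocks my leader logs claim I
# produced. HTTP 200 means the peer has the block; 404 strongly suggests it
# never propagated. NODE is a placeholder for a second node you control.
import requests

NODE = "http://127.0.0.1:3101/api/v0"  # placeholder REST address of another node

missed = [
    "71bc2c42bfb03ef6c689929d143a6c4c478673b5399fbab48ed63e7d8f718595",
    "aa833ac0709d7d2119167e73b8ea815e82fb93add0f85ba9edd5dd5ea67b01e3",
    "bbed794f976a44dbfe323e102aa91cfcb8ada9cc2c57ba285f1b4f63fa4077ab",
]

for block_id in missed:
    r = requests.get(f"{NODE}/block/{block_id}")
    status = "known" if r.status_code == 200 else "missing"
    print(f"{block_id[:16]}...  {status} ({r.status_code})")
```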
I monitor my node for slot latency (how fast a block propagates to me) and block height delta (whether my node missed a block). In my monitoring chart, the blue arrows mark my leadership events. You can see that one leadership event may have been missed due to my node skipping a block, but the other two do not have any explanation. Other operators on Telegram confirmed that they saw no mention of the block hashes above in their logs.
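For reference, the height-delta half of that monitoring can be approximated from the node stats endpoint alone. A minimal sketch (Python), assuming the v0 node/stats endpoint and a reference height obtained elsewhere (another node, an explorer, etc.):

```python
# Sketch: compare my node's tip height against a reference height to detect
# missed blocks. REFERENCE_HEIGHT is a placeholder; in practice it would come
# from a second node or an explorer.
import requests

MY_NODE = "http://127.0.0.1:3100/api/v0"   # placeholder REST listen address

stats = requests.get(f"{MY_NODE}/node/stats").json()
my_height = int(stats["lastBlockHeight"])  # field name as reported in 0.8.x; adjust if needed

REFERENCE_HEIGHT = 122860  # placeholder: height reported by a trusted source

delta = REFERENCE_HEIGHT - my_height
if delta > 0:
    print(f"node is {delta} block(s) behind the reference")
else:
    print("node is at (or ahead of) the reference height")
```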
My node is set to run with 768 connections, and hovers around 350 established TCP connections. It usually becomes unstable (more missed blocks, more slot latency) within 4-6 hours.
I am also using a gossip interval of 60sec to improve uptime and stability.
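As a side note, the established-connection count mentioned above can also be sampled over time from the REST API. A rough sketch (Python), assuming the v0 network/stats endpoint, which in the 0.8.x releases returns one record per peer; the exact fields vary between versions, so this only counts entries:

```python
# Sketch: sample how many peers the node currently reports in network stats.
import requests

MY_NODE = "http://127.0.0.1:3100/api/v0"   # placeholder REST listen address

peers = requests.get(f"{MY_NODE}/network/stats").json()
print(f"peers reported by network stats: {len(peers)}")
```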
Mandatory Information
jcli --full-version output:
jcli 0.8.7 (HEAD-364cd84e+, release, linux [x86_64]) - [rustc 1.40.0 (73528e339 2019-12-16)]
jormungandr --full-version output:
jormungandr 0.8.7 (HEAD-364cd84e+, release, linux [x86_64]) - [rustc 1.40.0 (73528e339 2019-12-16)]