Peer loss over time #5271

jajaislanina · 2024-02-20T14:55:01Z

Description

Over a period of 2 days, the Lighthouse peer count drops from ~70 to 0, which stops the sync.

Version

Docker image version 4.6.0

Present Behaviour

This happens a few times per week.
I am running multiple Ethereum Mainnet, Sepolia and Holesky nodes, and this happens mostly on Mainnet and Sepolia.
When the peer count drops to single digits, the node stops syncing.
Restarting the node fixes the issue.
Nothing in the logs stands out as relevant to this issue.

I am assuming that my nodes are somehow flagged as "bad" and over time get blacklisted by other nodes in the network, but I have no proof or means to confirm this.

Expected Behaviour

I would expect the peer count to remain stable over time and for Lighthouse to re-connect to peers - basically not allow the count to drop to 0.

Steps to resolve

Restart is a temporary mitigation.

pawanjay176 · 2024-02-21T04:44:28Z

Can you share your beacon node logs? You can send them on our Discord; I'm @pawan on the sigp Discord.

jajaislanina · 2024-02-21T11:14:31Z

Logs shared via Discord.

jajaislanina · 2024-02-23T09:15:00Z

Experienced another similar issue on another node - this time managed to capture all logs.
Sent via PM to @pawanjay176 on Discord.

demon-xxi · 2024-06-04T17:04:59Z

What was the resolution on this issue? I have the same thing happening consistently with a lighthouse+reth combo running in k8s. Everything works initially, with both consensus and execution getting peers without issues. But then lighthouse starts losing peers over time. I guess it just never gets new peers while old ones disconnect naturally over time.

Restarting the lighthouse container alone does not seem to help, and restarting reth alone does not fix this either. But restarting both seems to fix the issue.

They have different discovery ports configured. My lighthouse is configured like so:

 lighthouse bn --http --http-address=0.0.0.0 --execution-endpoint=http://localhost:8551
      --logfile-debug-level debug --port 9000 --enable-private-discovery --metrics
      --metrics-address=0.0.0.0 --execution-jwt=/config/jwt-secret.txt --disable-deposit-contract-sync
      --checkpoint-sync-url=https://checkpoint-sync.sepolia.ethpandaops.io/ --disable-backfill-rate-limiting
      --network=sepolia --datadir=/data --network-dir=/tmp --disable-upnp --execution-timeout-multiplier=1
      --disable-lock-timeouts

I have confirmed with netcat that ports 9000 and 9001 are listening and accepting external connections.

jajaislanina · 2024-06-10T11:52:15Z

Never found the solution to this. I am running Lighthouse+Geth in the same pod and have added a liveness probe that kills both containers if peer count on LH is <4 for longer than 60 minutes.
What we did find out was that dialing peers failed with timeouts, without any apparent root cause.
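
A minimal sketch of what such a liveness check could look like, not the probe actually used above. It assumes the standard Beacon API endpoint `/eth/v1/node/peer_count` is reachable on Lighthouse's default HTTP port 5052; the state file path and the function names are hypothetical. Because a probe command runs as a fresh process each time, the moment the peer count first dropped below the threshold is kept in a small file.

```python
import json
import os
import time
import urllib.request
from typing import Optional

MIN_PEERS = 4              # threshold mentioned in the comment above
MAX_LOW_SECONDS = 60 * 60  # 60 minutes


def fetch_connected_peers(base_url: str) -> int:
    """Read the connected-peer count from the Beacon API."""
    url = f"{base_url}/eth/v1/node/peer_count"
    with urllib.request.urlopen(url, timeout=5) as resp:
        payload = json.load(resp)
    # The API reports counts as decimal strings.
    return int(payload["data"]["connected"])


def update_low_since(peers: int, now: float,
                     low_since: Optional[float]) -> Optional[float]:
    """Track when the peer count first fell below MIN_PEERS."""
    if peers >= MIN_PEERS:
        return None  # healthy again, reset the timer
    return now if low_since is None else low_since


def is_unhealthy(now: float, low_since: Optional[float]) -> bool:
    """True once the count has been low for longer than MAX_LOW_SECONDS."""
    return low_since is not None and now - low_since > MAX_LOW_SECONDS


def probe(base_url: str = "http://localhost:5052",
          state_path: str = "/tmp/lh_low_since") -> int:
    """Return 0 if the node looks alive, 1 if it should be restarted.

    base_url and state_path are assumptions for illustration.
    """
    now = time.time()
    low_since = None
    if os.path.exists(state_path):
        with open(state_path) as f:
            low_since = float(f.read().strip())

    low_since = update_low_since(fetch_connected_peers(base_url), now, low_since)

    if low_since is None:
        if os.path.exists(state_path):
            os.remove(state_path)
        return 0
    with open(state_path, "w") as f:
        f.write(str(low_since))
    return 1 if is_unhealthy(now, low_since) else 0
```

An exec-style probe could call `sys.exit(probe())` from a small wrapper script. Note that this only automates the restart workaround; it does not address why dialing peers times out.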

michaelsproul · 2024-06-11T08:10:51Z

@jajaislanina That does sound strange. Please let us know if it continues in 5.2, as we've fixed a few sync & lookup bugs. Sounds like the dialing issue is unrelated to those fixes though.

jajaislanina · 2024-06-11T08:22:17Z

Will update in a few days. Currently upgrading Holesky nodes to 5.2.0 for the memory footprint (right now we have weird spikes of over 40GB of memory and 15 vCPU cores when the node is lagging).
Hopefully this also helps with the peer retention.

jajaislanina · 2024-06-16T10:17:54Z

Hi @michaelsproul

Just had one of the Sepolia nodes that was on version 5.2.0 of Lighthouse experience sync issues.
When we checked, the peer count was 5 and had been declining over the last few days.
Note that one node is fine and the other one (light blue) starts losing peers.

AgeManning · 2024-09-15T23:06:49Z

This is an old issue. I imagine it's a duplicate of #6384. Closing in favour of #6384.

chong-he added the Networking label Feb 26, 2024

chong-he mentioned this issue Feb 27, 2024

How to solve WARN Execution endpoint is not synced? #5312

Closed

chong-he mentioned this issue Sep 12, 2024

Peer count slowly decrease to 0 #6384

Open

AgeManning closed this as completed Sep 15, 2024
