Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

http - EventQueueGet - Certain curl errors result in viewer abandoning event delivery from main and child regions #3464

Open
monty-linden opened this issue Jan 23, 2025 · 3 comments
Labels
bug Something isn't working stale triage Flags issues that need to be triaged

Comments

@monty-linden
Copy link
Contributor

Environment

Firestorm 7.1.11.76496 Release

Description

[ Analyzed from Firestorm viewer logs (https://jira.firestormviewer.org/browse/FIRE-35085) but I'm duplicating it in viewer as something similar is very likely present in the long poll code. ]

This is likely a bug in the SL base code but I'm reporting via FS Jira as there may be special handling in Firestorm.  There will be a GHI on secondlife/viewer as well.

The following came in from logs from a resident so I'm editing to limit information leakage.  But the gist should be clear.  This stems from a mix of the EventQueueGet work underway and the eternal search for better TPs/RCs.  Resi reported a TP failure, I decide to dig.  Here's the discovery.

Working backwards, resi had a teleport failure with disconnect from the grid.  Examining Firestorm.log indicated a missing TeleportFinish message was the likely cause.

Moving to server, I found:

Jan 23 00:32:44 simhost-0f7e4be4373667b6e.agni.secondlife.io simulator[3223]: WARNING:#EventQueue LLAgentCommunication::sendEvent: 100 events queued (100 pending, 0 sent) for agent <resident 1>, discarding event: TeleportFinish
Jan 23 00:32:44 simhost-0f7e4be4373667b6e.agni.secondlife.io simulator[3223]: WARNING:#PendingTeleport LLTeleportFinishResponder::httpFailure: Error sending TeleportFinish to agent <resident 1> [(invalid):] [status:500] [reason:Max Events Queued] [content:!]

And that will end a session.  So, why was the EventQueue filled?  Was it not being read?  Hitting a different set of logs for the session, I found that the last event-get long-poll to the simulator was around Jan 22 19:46, almost five hours earlier.

Bouncing back to the viewer,

2025-01-22T19:46:54Z WARNING #CoreHttp# llcorehttp/_httppolicy.cpp(404) LLCore::HttpPolicy::stageAfterCompletion : HTTP request <handle1> failed after 0 retries.  Reason:  SSL connect error (Easy_35)
2025-01-22T19:46:54Z WARNING #CoreHttp# llcorehttp/_httppolicy.cpp(404) LLCore::HttpPolicy::stageAfterCompletion : HTTP request <handle2> failed after 0 retries.  Reason:  SSL connect error (Easy_35)
2025-01-22T19:46:54Z WARNING #CoreHttp# llcorehttp/_httppolicy.cpp(404) LLCore::HttpPolicy::stageAfterCompletion : HTTP request <handle3> failed after 0 retries.  Reason:  SSL connect error (Easy_35)
2025-01-22T19:46:54Z WARNING #CoreHttp# llcorehttp/_httppolicy.cpp(404) LLCore::HttpPolicy::stageAfterCompletion : HTTP request <handle4> failed after 0 retries.  Reason:  SSL connect error (Easy_35)
2025-01-22T19:46:54Z WARNING #CoreHttp# llcorehttp/_httppolicy.cpp(404) LLCore::HttpPolicy::stageAfterCompletion : HTTP request <handle5> failed after 0 retries.  Reason:  SSL connect error (Easy_35)
...
2025-01-22T19:46:54Z WARNING #CoreHTTP# llmessage/llcorehttputil.cpp(275) LLCoreHttpUtil::HttpCoroHandler::onCompleted : Possible failure [Easy_35] cannot POST url 'https://simhost-0f7e4be4373667b6e.agni.secondlife.io:12043/cap/<cap1>' because SSL connect error
2025-01-22T19:46:54Z WARNING #CoreHTTP# llmessage/llcorehttputil.cpp(275) LLCoreHttpUtil::HttpCoroHandler::onCompleted : Possible failure [Easy_35] cannot POST url 'https://simhost-0fb0a28082d21b6f3.agni.secondlife.io:12043/cap/<cap2>' because SSL connect error
2025-01-22T19:46:54Z WARNING #CoreHTTP# llmessage/llcorehttputil.cpp(275) LLCoreHttpUtil::HttpCoroHandler::onCompleted : Possible failure [Easy_35] cannot POST url 'https://simhost-0bea8ada1571717de.agni.secondlife.io:12043/cap/<cap3>' because SSL connect error
2025-01-22T19:46:54Z WARNING #CoreHTTP# llmessage/llcorehttputil.cpp(275) LLCoreHttpUtil::HttpCoroHandler::onCompleted : Possible failure [Easy_35] cannot POST url 'https://simhost-006b1b658c5a1fbc9.agni.secondlife.io:12043/cap/<cap4>' because SSL connect error
2025-01-22T19:46:54Z WARNING #CoreHTTP# llmessage/llcorehttputil.cpp(275) LLCoreHttpUtil::HttpCoroHandler::onCompleted : Possible failure [Easy_35] cannot POST url 'https://simhost-0b61d1438e744c2e7.agni.secondlife.io:12043/cap/<cap5>' because SSL connect error
...
2025-01-22T19:46:54Z WARNING #LLEventPollImpl# newview/lleventpoll.cpp(243) LLEventPolling::Details::LLEventPollImpl::eventPollCoro : Critical error from poll request returned from libraries.  Canceling coroutine.
2025-01-22T19:46:54Z WARNING #LLEventPollImpl# newview/lleventpoll.cpp(243) LLEventPolling::Details::LLEventPollImpl::eventPollCoro : Critical error from poll request returned from libraries.  Canceling coroutine.
2025-01-22T19:46:54Z WARNING #LLEventPollImpl# newview/lleventpoll.cpp(243) LLEventPolling::Details::LLEventPollImpl::eventPollCoro : Critical error from poll request returned from libraries.  Canceling coroutine.
2025-01-22T19:46:54Z WARNING #LLEventPollImpl# newview/lleventpoll.cpp(243) LLEventPolling::Details::LLEventPollImpl::eventPollCoro : Critical error from poll request returned from libraries.  Canceling coroutine.
2025-01-22T19:46:54Z WARNING #LLEventPollImpl# newview/lleventpoll.cpp(243) LLEventPolling::Details::LLEventPollImpl::eventPollCoro : Critical error from poll request returned from libraries.  Canceling coroutine.

So, certain spurious/transitory errors are leading to the abandonment not of the full circuit but just of the HTTP half found in EventQueue.  For both main and child regions.  The EventQueue continues to function on the simulator side - there's been no takedown of the circuit.  But the EventQueue fills, blocks further entries, and only critical operations, such as TeleportFinish, lead to a synchronization event which, unfortunately, ends the session.  Sometimes hours after the causal failure.

Reproduction steps

  1. Run a viewer
  2. Wait a long time
@monty-linden monty-linden added bug Something isn't working triage Flags issues that need to be triaged labels Jan 23, 2025
@Dan-Linden
Copy link
Contributor

Server, can you take a look at this? Toss back to viewer if it's a viewer issue.

@monty-linden
Copy link
Contributor Author

Dan, I filed this. It's a viewer issue in the case of FS. I haven't verified that it's identical in our viewer but suspect it probably is. Not a server issue. Bad client-side behavior.

@kylelinden kylelinden added this to the 2024.12-ForeverFPS milestone Feb 6, 2025
@monty-linden
Copy link
Contributor Author

To secure some related server-side historical metrics, attaching the following which you can get in the CloudWatch dashboards.

Image

One side-effect of the viewer issue is that events will queue up and get dropped on the simulator side. That happens for many reason and this chart is the result (up to 8E6 events lost per day).

@github-actions github-actions bot added the stale label Mar 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working stale triage Flags issues that need to be triaged
Projects
None yet
Development

No branches or pull requests

3 participants