http - EventQueueGet - Certain curl errors result in viewer abandoning event delivery from main and child regions #3464

monty-linden · 2025-01-23T19:21:13Z

Environment

Firestorm 7.1.11.76496 Release

Description

[ Analyzed from Firestorm viewer logs (https://jira.firestormviewer.org/browse/FIRE-35085) but I'm duplicating it in viewer as something similar is very likely present in the long poll code. ]

This is likely a bug in the SL base code but I'm reporting via FS Jira as there may be special handling in Firestorm. There will be a GHI on secondlife/viewer as well.

The following came in from logs from a resident so I'm editing to limit information leakage. But the gist should be clear. This stems from a mix of the EventQueueGet work underway and the eternal search for better TPs/RCs. Resi reported a TP failure, I decide to dig. Here's the discovery.

Working backwards, resi had a teleport failure with disconnect from the grid. Examining Firestorm.log indicated a missing TeleportFinish message was the likely cause.

Moving to server, I found:

Jan 23 00:32:44 simhost-0f7e4be4373667b6e.agni.secondlife.io simulator[3223]: WARNING:#EventQueue LLAgentCommunication::sendEvent: 100 events queued (100 pending, 0 sent) for agent <resident 1>, discarding event: TeleportFinish
Jan 23 00:32:44 simhost-0f7e4be4373667b6e.agni.secondlife.io simulator[3223]: WARNING:#PendingTeleport LLTeleportFinishResponder::httpFailure: Error sending TeleportFinish to agent <resident 1> [(invalid):] [status:500] [reason:Max Events Queued] [content:!]

And that will end a session. So, why was the EventQueue filled? Was it not being read? Hitting a different set of logs for the session, I found that the last event-get long-poll to the simulator was around Jan 22 19:46, almost five hours earlier.

Bouncing back to the viewer,

2025-01-22T19:46:54Z WARNING #CoreHttp# llcorehttp/_httppolicy.cpp(404) LLCore::HttpPolicy::stageAfterCompletion : HTTP request <handle1> failed after 0 retries.  Reason:  SSL connect error (Easy_35)
2025-01-22T19:46:54Z WARNING #CoreHttp# llcorehttp/_httppolicy.cpp(404) LLCore::HttpPolicy::stageAfterCompletion : HTTP request <handle2> failed after 0 retries.  Reason:  SSL connect error (Easy_35)
2025-01-22T19:46:54Z WARNING #CoreHttp# llcorehttp/_httppolicy.cpp(404) LLCore::HttpPolicy::stageAfterCompletion : HTTP request <handle3> failed after 0 retries.  Reason:  SSL connect error (Easy_35)
2025-01-22T19:46:54Z WARNING #CoreHttp# llcorehttp/_httppolicy.cpp(404) LLCore::HttpPolicy::stageAfterCompletion : HTTP request <handle4> failed after 0 retries.  Reason:  SSL connect error (Easy_35)
2025-01-22T19:46:54Z WARNING #CoreHttp# llcorehttp/_httppolicy.cpp(404) LLCore::HttpPolicy::stageAfterCompletion : HTTP request <handle5> failed after 0 retries.  Reason:  SSL connect error (Easy_35)
...
2025-01-22T19:46:54Z WARNING #CoreHTTP# llmessage/llcorehttputil.cpp(275) LLCoreHttpUtil::HttpCoroHandler::onCompleted : Possible failure [Easy_35] cannot POST url 'https://simhost-0f7e4be4373667b6e.agni.secondlife.io:12043/cap/<cap1>' because SSL connect error
2025-01-22T19:46:54Z WARNING #CoreHTTP# llmessage/llcorehttputil.cpp(275) LLCoreHttpUtil::HttpCoroHandler::onCompleted : Possible failure [Easy_35] cannot POST url 'https://simhost-0fb0a28082d21b6f3.agni.secondlife.io:12043/cap/<cap2>' because SSL connect error
2025-01-22T19:46:54Z WARNING #CoreHTTP# llmessage/llcorehttputil.cpp(275) LLCoreHttpUtil::HttpCoroHandler::onCompleted : Possible failure [Easy_35] cannot POST url 'https://simhost-0bea8ada1571717de.agni.secondlife.io:12043/cap/<cap3>' because SSL connect error
2025-01-22T19:46:54Z WARNING #CoreHTTP# llmessage/llcorehttputil.cpp(275) LLCoreHttpUtil::HttpCoroHandler::onCompleted : Possible failure [Easy_35] cannot POST url 'https://simhost-006b1b658c5a1fbc9.agni.secondlife.io:12043/cap/<cap4>' because SSL connect error
2025-01-22T19:46:54Z WARNING #CoreHTTP# llmessage/llcorehttputil.cpp(275) LLCoreHttpUtil::HttpCoroHandler::onCompleted : Possible failure [Easy_35] cannot POST url 'https://simhost-0b61d1438e744c2e7.agni.secondlife.io:12043/cap/<cap5>' because SSL connect error
...
2025-01-22T19:46:54Z WARNING #LLEventPollImpl# newview/lleventpoll.cpp(243) LLEventPolling::Details::LLEventPollImpl::eventPollCoro : Critical error from poll request returned from libraries.  Canceling coroutine.
2025-01-22T19:46:54Z WARNING #LLEventPollImpl# newview/lleventpoll.cpp(243) LLEventPolling::Details::LLEventPollImpl::eventPollCoro : Critical error from poll request returned from libraries.  Canceling coroutine.
2025-01-22T19:46:54Z WARNING #LLEventPollImpl# newview/lleventpoll.cpp(243) LLEventPolling::Details::LLEventPollImpl::eventPollCoro : Critical error from poll request returned from libraries.  Canceling coroutine.
2025-01-22T19:46:54Z WARNING #LLEventPollImpl# newview/lleventpoll.cpp(243) LLEventPolling::Details::LLEventPollImpl::eventPollCoro : Critical error from poll request returned from libraries.  Canceling coroutine.
2025-01-22T19:46:54Z WARNING #LLEventPollImpl# newview/lleventpoll.cpp(243) LLEventPolling::Details::LLEventPollImpl::eventPollCoro : Critical error from poll request returned from libraries.  Canceling coroutine.

So, certain spurious/transitory errors are leading to the abandonment not of the full circuit but just of the HTTP half found in EventQueue. For both main and child regions. The EventQueue continues to function on the simulator side - there's been no takedown of the circuit. But the EventQueue fills, blocks further entries, and only critical operations, such as TeleportFinish, lead to a synchronization event which, unfortunately, ends the session. Sometimes hours after the causal failure.

Reproduction steps

Run a viewer
Wait a long time

The text was updated successfully, but these errors were encountered:

Dan-Linden · 2025-02-04T17:50:56Z

Server, can you take a look at this? Toss back to viewer if it's a viewer issue.

monty-linden · 2025-02-04T18:08:50Z

Dan, I filed this. It's a viewer issue in the case of FS. I haven't verified that it's identical in our viewer but suspect it probably is. Not a server issue. Bad client-side behavior.

monty-linden · 2025-02-13T16:45:29Z

To secure some related server-side historical metrics, attaching the following which you can get in the CloudWatch dashboards.

One side-effect of the viewer issue is that events will queue up and get dropped on the simulator side. That happens for many reason and this chart is the result (up to 8E6 events lost per day).

monty-linden added bug Something isn't working triage Flags issues that need to be triaged labels Jan 23, 2025

kylelinden added this to the 2024.12-ForeverFPS milestone Feb 6, 2025

github-actions bot added the stale label Mar 16, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

http - EventQueueGet - Certain curl errors result in viewer abandoning event delivery from main and child regions #3464

http - EventQueueGet - Certain curl errors result in viewer abandoning event delivery from main and child regions #3464

monty-linden commented Jan 23, 2025

Dan-Linden commented Feb 4, 2025

monty-linden commented Feb 4, 2025

monty-linden commented Feb 13, 2025

http - EventQueueGet - Certain curl errors result in viewer abandoning event delivery from main and child regions #3464

http - EventQueueGet - Certain curl errors result in viewer abandoning event delivery from main and child regions #3464

Comments

monty-linden commented Jan 23, 2025

Environment

Description

Reproduction steps

Dan-Linden commented Feb 4, 2025

monty-linden commented Feb 4, 2025

monty-linden commented Feb 13, 2025