Analyze cause of duplicate events in db #333

Closed
acn-sbuad opened this issue Apr 20, 2023 · 9 comments
Assignees
Labels
kind/chore Non functional, often repeating tasks.

Comments

@acn-sbuad
Contributor

acn-sbuad commented Apr 20, 2023

Description

Analyze

Additional Information

No response

Tasks

No elements in poison queues, but a number of duplicates in db

select cloudevent->>'id', count(cloudevent->>'id') as noDuplicates
from events.events
where sequenceno > 6717116
group by cloudevent->>'id'
order by noDuplicates desc;

SELECT
    cloudevent->>'id' AS id,
    COUNT(cloudevent->>'id') AS noDuplicates,
    cloudevent->>'resource' AS resource,
    ARRAY_AGG(registeredtime) AS registered_times
FROM events.events
WHERE sequenceno > 24763608
GROUP BY cloudevent->>'id', cloudevent->>'resource'
HAVING COUNT(cloudevent->>'id') > 1;
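For offline analysis of an export, the same grouping can be done in a few lines of Python. This is only a sketch: it assumes the rows have already been fetched as (id, resource, registeredtime) tuples, and the helper names `find_duplicates` and `spread` are made up for illustration. The time spread between the first and last copy of an event may hint at whether the copies come from a quick retry or from something slower.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def find_duplicates(rows):
    """Group rows by (id, resource) and keep only keys seen more than once.

    rows: iterable of (event_id, resource, registered_time) tuples,
    i.e. the columns selected by the second query above.
    Returns {(event_id, resource): sorted list of registered times}.
    """
    grouped = defaultdict(list)
    for event_id, resource, registered in rows:
        grouped[(event_id, resource)].append(registered)
    return {key: sorted(times) for key, times in grouped.items() if len(times) > 1}


def spread(times):
    """Time between the first and the last copy of one event."""
    return max(times) - min(times)


# Hypothetical sample rows, not taken from the database.
sample_rows = [
    ("event-a", "urn:example:app", datetime(2024, 1, 1, 12, 0, 0)),
    ("event-b", "urn:example:app", datetime(2024, 1, 1, 12, 0, 5)),
    ("event-a", "urn:example:app", datetime(2024, 1, 1, 12, 0, 30)),
]

for key, times in find_duplicates(sample_rows).items():
    print(key, len(times), spread(times))
```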

Hypothesis

No need for the inbound endpoint in events if the function can push elements directly to the queue.
Conclusion: This did not fix the problem of duplicates in the database, but it does save us one Key Vault lookup per processed cloud event. I can't quite remember why we implemented it like this in the first place. Is there a reason the function cannot return the cloud event directly to the next queue?

==> Directly using an out binding for the function resulted in some lost events. We will need to find out whether we can change the function config:
[return: Queue("events-inbound", Connection = "QueueStorage")]

Hypothesis

Duplicates occur due to exhaustion of connections to key vault
Conclusion: The connection to KV fails far more often than we see duplicates.

Hypothesis 05.08.24

Duplicates are largely created during deployment.
Defining a preStop hook in the Helm deployment can allow us to postpone the shutdown process, potentially allowing the pod to complete all ongoing requests before being shut down.
Functions log: PostInbound event with id 661cc13f-9b21-4af2-9639-6de0e845aead failed with status code GatewayTimeout.
Docs
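To make the suspected mechanism concrete, here is a small simulation. It is a sketch under two assumptions that this issue has not confirmed: that the pod persists the event before the response is lost during shutdown, and that the function retries the call after a GatewayTimeout. `FakeInboundEndpoint` and `forward_with_retry` are hypothetical stand-ins, not the real events component or function code.

```python
class GatewayTimeout(Exception):
    """Stand-in for the GatewayTimeout status seen in the functions log."""


class FakeInboundEndpoint:
    """Hypothetical endpoint that stores the event, then loses the response."""

    def __init__(self, timeouts_after_save):
        self.stored = []
        self._timeouts_left = timeouts_after_save

    def post(self, cloud_event):
        # Assumption: the row is written before the pod goes away.
        self.stored.append(cloud_event)
        if self._timeouts_left > 0:
            self._timeouts_left -= 1
            raise GatewayTimeout("response lost during pod shutdown")
        return 200


def forward_with_retry(endpoint, cloud_event, max_attempts=3):
    """Retry on timeout, as an at-least-once sender would."""
    for attempt in range(1, max_attempts + 1):
        try:
            return endpoint.post(cloud_event)
        except GatewayTimeout:
            if attempt == max_attempts:
                raise


endpoint = FakeInboundEndpoint(timeouts_after_save=1)
status = forward_with_retry(endpoint, {"id": "example-id"})
print(status, len(endpoint.stored))
```

In this model one lost response is enough to store the same event id twice, which is why a pod that can still answer "200 OK" while shutting down would matter.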

Acceptance Criteria

No response

@acn-sbuad acn-sbuad added status/draft Status: When you create an issue before you have enough info to properly describe the issue. kind/chore Non functional, often repeating tasks. labels Apr 20, 2023
@acn-sbuad acn-sbuad self-assigned this Apr 20, 2023
@acn-sbuad acn-sbuad removed the status/draft Status: When you create an issue before you have enough info to properly describe the issue. label Apr 21, 2023
@acn-sbuad
Contributor Author

Next steps: add logging to events component and deploy to yt01 to log all event ids that are sent to the storage endpoint

@acn-sbuad
Contributor Author

The last duplicate in tt02 was created at "2023-06-05 12:32:57.818753+00". Were any changes implemented around this time, @SandGrainOne?

@acn-sbuad
Contributor Author

The last duplicates in production were created at "2023-06-12 08:16:24.930741+00", which roughly matches the deployment schedule, I guess.
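The match against the deployment schedule could be checked mechanically rather than by eye. The sketch below parses a timestamp in the format shown above and tests it against a list of deployment windows. The helper names are made up, and the window in the example is a placeholder far from any real date, since the actual deployment times are not recorded in this issue.

```python
import re
from datetime import datetime, timedelta, timezone


def parse_pg_timestamp(text):
    """Parse a timestamp as printed by Postgres, e.g. '...930741+00'."""
    # Postgres prints a bare '+00' offset; pad it to '+00:00' so that
    # datetime.fromisoformat accepts it on older Python versions too.
    if re.search(r"[+-]\d{2}$", text):
        text += ":00"
    return datetime.fromisoformat(text)


def in_any_window(moment, windows, slack=timedelta(0)):
    """True if moment falls inside any (start, end) window, padded by slack."""
    return any(start - slack <= moment <= end + slack for start, end in windows)


created = parse_pg_timestamp("2023-06-12 08:16:24.930741+00")

# Placeholder window only; replace with real deployment start/end times.
placeholder_windows = [
    (datetime(2000, 1, 1, 8, 0, tzinfo=timezone.utc),
     datetime(2000, 1, 1, 8, 30, tzinfo=timezone.utc)),
]

print(created.isoformat(), in_any_window(created, placeholder_windows))
```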

@annerisbakk
Member

@acn-sbuad Can you check whether there have been any duplicates since the last check? If there are none, perhaps this issue can be closed?

@acn-sbuad
Contributor Author

8 events have been duplicated in the last 90 days. @annerisbakk FYI

@olebhansen

FYI: As of 2024-08-05, there were 71 events with 2 or more entries during the past 90 days. This issue is still relevant...

@olebhansen

Continued in #573

@olebhansen

Things to check/understand (read docs?):

  • that a pod under (graceful) shutdown is still able to respond to the function with "200 OK"

@HenningNormann
Contributor

HenningNormann commented Aug 19, 2024

with duplicates as (
    select cloudevent->>'id' as id from events.events
    where cloudevent->>'resource' like 'urn:altinn:resource:app%'
    group by cloudevent->>'id'
    having count(cloudevent->>'id') > 1
)
select *
from events.events e
join duplicates on e.cloudevent->>'id' = duplicates.id
order by registeredtime;

Can't find any duplicates in prod or tt02. (Data older than 90 days is deleted.) Should we postpone further analysis until the problem is observed again?
