Error 503 Backend fetch failed

Backend fetch failed

Guru Meditation:

XID: 1165393923


Varnish cache server

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DUP metrics showing active passes that should be expired #346

Open
JLWade opened this issue Feb 16, 2024 · 10 comments
Open

DUP metrics showing active passes that should be expired #346

JLWade opened this issue Feb 16, 2024 · 10 comments
Assignees
Labels
bug Something isn't working

Comments

@JLWade
Copy link

JLWade commented Feb 16, 2024

In DUP Prod it is showing passes (mostly summer months) as active when they should be expired because they are in the past. When you look at the main db for passes, they are correctly set to expired. The metrics call is reading from a post processing cronjob that generates the metrics in a new table which the metrics page uses to fetch.

The cronjob for the metrics object is running at midnight and reading passes as still active before they expire and not respecting time zones. The expiry job is running at the same time as the metrics cronjob which is also causing some issue. The cronjob for the metrics is also only looking at 'today and forward' so once the clock ticks past midnight, any tickets missed before they changed status from active to expired will never be captured in a future run, creating the problem we are seeing.

To be fixed:

  • Cron job that changes passes to expired at the end of the day runs at 6pm instead of midnight
  • Script that is in DUP (Past Cam created it) is run to regenerate metrics objects for previous dates so that those active passes are corrected to expired.
  • Spike into the manual lookup index (and other indexes) would be prudent at this time as those indexes have grown substantially large.
@JLWade JLWade converted this from a draft issue Feb 16, 2024
@Dianadec Dianadec added the bug Something isn't working label Feb 16, 2024
@cameronpettit cameronpettit moved this from Todo to In Progress in BC Parks Digital Delivery (Team Osprey) Feb 20, 2024
@cameronpettit cameronpettit self-assigned this Feb 20, 2024
@cameronpettit
Copy link
Contributor

@JLWade @marklise just a heads up that I've discovered the checkExpiry Lambda that moves passes from active to expired runs every hour, on the hour, as opposed to once at midnight like we originally assumed. This doesn't change our logic very much as the root cause of this issue is still the PDT timezone shift in the summertime, but here is how the logic will operate anyways:

  • Every hour, on the hour, checkExpiry runs. It uses the passStatus-index to gather every pass in the system with status = active.
  • For each active pass, the pass status is set to expired if any of the following conditions are met:
    • It is past 6pm (18:00 PST/PDT) on the date the cronjob was run.
    • The active pass's date is earlier than the current date.
    • The active pass is an AM pass and it is past 12pm (12:00 PST/PDT, this time is currently hardcoded in our dynamoUtils) on the date the cronjob was run.

This handles almost every case, but there is still a small gap that might miss active passes if someone (for whatever reason) books a same-day pass between 23:00 and 23:59.

I also recommend looking at NOT running the metrics cronjob overnight - it runs every 5 minutes which isnt super necessary when no changes are being made. At the very least, we should turn it off between 11pm and 1am the following day, to give the checkExpiry cron enough time to properly sort out the active -> expired passes.

@cameronpettit
Copy link
Contributor

@JLWade the more I think about this, the more I think that the proper way to fix this is to disallow same-day bookings after some closing hour has passed (like 6pm). Doing it this way is more true to how the system is supposed to function and it avoids a bunch of complications that arise from cronjobs/timezones, etc. It isn't too difficult to implement, but there will likely be a public front-end portion of this as we will need to convey that you can't book/don't need a pass after the closing hour to public users. It might need some discussion/input from other team members if it now involves some UX/Design considerations? That's up to you.

As for now, I've remade all the metrics objects that were showing active passes, so the immediate issue has been handled and this ticket is okay to stagnate a bit as we won't run into this same issue again until daylight savings time starts next month AND Day Use Parks are open for use (May/June).

I'll also add some of my suggestions for the manual lookup index spike to the ticket info.

@cameronpettit
Copy link
Contributor

Some investigations about the index:

The checkExpiry cronjob leverages the passStatus-index index which has only a partition key on the pass status and no sort key. This means there are only four partitions to shard all the passes in the system which is extremely inefficient. It has not caused us many issues yet since the active, and reserved partitions never get very large, but the expired and cancelled partitions continue to grow as they are terminal states.

A far better index implementation would use the pass date as the partition key and the status as the sort key. This will create a partition for every day which can then be sorted by status. The drawback to this implementation is that if you wanted to gather every active pass in the system, you would have to do a query on every date and check for active passes - which, while it seems laborious, is the way DynamoDB was intended to be used. Increasing the number of distinct partitions is more scalable-friendly than increasing the size of the partitions themselves. Many + smaller > fewer + larger, and this becomes more important as the database size and query frequency increases.

In other words, you go from being able to ask the expensive query:

I want all the active passes in the database

to having to ask the cheaper query:

I want all the active passes on these dates

The passStatus-index index is used in a handful of places so there would be a moderate amount of refactoring involved.

Unrelated to indexes, another improvement that can be made to the metrics gathering process is to replace the current cronjob with a DynamoDB Stream, such that a day's metrics object is updated only when it needs to be (ie, when a pass for that day is booked). This follows the same paradigm as above: smaller, more frequent queries at a higher volume in place of fewer, larger queries, with the added bonus of being up-to-date instantaneously.

@Dianadec Dianadec moved this from In Progress to Blocked in BC Parks Digital Delivery (Team Osprey) Feb 22, 2024
@Dianadec
Copy link
Contributor

Standup: Cam identified that we need to clarify decision about the 6pm cutoff - cc @JLWade . Possible UX implications.

@cameronpettit
Copy link
Contributor

Spinning up #349 to handle the following implications of this ticket:

...the proper way to fix this is to disallow same-day bookings after some closing hour has passed (like 6pm). Doing it this way is more true to how the system is supposed to function and it avoids a bunch of complications that arise from cronjobs/timezones, etc. It isn't too difficult to implement, but there will likely be a public front-end portion of this as we will need to convey that you can't book/don't need a pass after the closing hour to public users.

For now, the cronjob has been moved forward which satisfies this ticket AC.

@manuji
Copy link

manuji commented Feb 29, 2024

Tested on PROD: Passed

  • Verified that there are no "Active" passes displayed for previous season.
  • @cameronpettit , will you be able to make the adjustment on the TEST environment as well? Please let me know. Meantime, I will move this ticket to PO review.

@cameronpettit
Copy link
Contributor

@manuji which adjustment are you meaning? There should be very few items in the TEST database that show the wrong status.

@manuji
Copy link

manuji commented Feb 29, 2024

@cameronpettit , the adjustment to get all the incorrect "Active" passes on TEST to show as expired.

@cameronpettit
Copy link
Contributor

I ran the same backfill metrics script on TEST but there are still 28 active passes showing on the front end between 2023-03-01 and 2023-11-30. I can confirm this isn't the case with the passes in the database so my guess is that there is something up with the TEST data that is causing the metrics objects to not be recreated with the backfill script.

Not sure its a great use of time to further hunt down this discrepancy considering its TEST data. Do we really need this fixed @manuji @JLWade ?

@JLWade
Copy link
Author

JLWade commented Feb 29, 2024

Nope for test we can leave it.
Looks like the issue has been resolved in prod - will move this to done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Development

No branches or pull requests

4 participants