Small fixes + UI jobs refresh + spam submission failure #1013

Closed
wants to merge 13 commits into from

@supraja-968 (Collaborator) commented Aug 7, 2024

What type of PR is this?

  • 🎮 Feature
  • 🐛 Bug Fix

Description

This PR makes the following changes:

  1. Bug example: a user's compute tally is 450 so far, the tier threshold is 500, and each colabdesign job costs 50 credits. The user should therefore be able to submit only 1 more job in this tier before being prompted to subscribe. However, the user could still submit more than 1 job combinatorially, because we updated the compute tally after job creation. For example, with 3 jobs submitted combinatorially, all 3 were submitted and the DB was updated with compute tally = 600, tier = 1. That tier = 1 only triggered the subscription prompt the next time the user submitted 1 or more jobs.
    Fix: calculate the projected compute tally before job creation (calculate only; the DB update still happens after job creation), and redirect to the subscribe page without submitting the jobs.

  2. Bug: jobs weren't updating live in the UI. The user had to refresh the page every time to see the current state of the experiment.
    Fix: add a polling mechanism scoped to the jobs accordion, so the jobs refresh with their current state without reloading the whole page. (Note to dev: the dot next to the experiment name still lags and needs a refresh to catch up, but this can be addressed in a follow-up PR.)

  3. Bug: API keys were being created, but they disappeared after a refresh. The creation worked; the fetch did not.
    Fix: the fetch was filtering on wallet_address, whereas the column is named user_id (it still holds the wallet address). This was missed in the big DB migration. As a temporary fix, the fetch now looks up user_id, instead of migrating the column and renaming it to wallet_address.

  4. Bug: with a combinatorial submission or a spam of resubmissions, some jobs were failing with 'unexpected Ray state running'.
    Fix: this was caused by logic carried over from Ray services when we migrated to Ray jobs. The gateway was setting a job to pending and then to running BEFORE submitting it to Ray's internal queue. The fix removes those state updates before submission, so the status lifecycle is now: queued -> processing -> submit to Ray -> set to pending -> start monitoring -> set to running/stopped/failed/succeeded based on the response. With this fix, we monitor jobs in the pending state as well as the running state. Note: 'pending' is Ray's internal convention for jobs waiting in its queue. In a previous PR we therefore introduced a separate 'processing' status to distinguish jobs waiting on the gateway side to be submitted from jobs in Ray's internal queue waiting to be picked up by a worker.

  5. Bug: PDB files were only used to display checkpoints; there was no way to download them.
    Fix: the addFilesToDB function handled only non-PDB files, because PDB files are categorized separately in the RayJobResponse struct. PDB files are now added to the DB separately, after the rest of the files.
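
The pre-submission check from item 1 can be sketched roughly as follows. This is a minimal illustration, not the gateway code: `check_submission` and `SubmissionDecision` are hypothetical names, and it assumes that landing exactly on the tier threshold is still allowed (inferred from the example, where 450 + 50 = 500 is the one permitted job).

```python
from dataclasses import dataclass


@dataclass
class SubmissionDecision:
    allowed: bool          # False means: redirect to the subscribe page
    projected_tally: int   # tally the user would have if all jobs ran


def check_submission(current_tally: int, tier_threshold: int,
                     cost_per_job: int, num_jobs: int) -> SubmissionDecision:
    """Compute the projected tally BEFORE creating any jobs.

    Nothing is written here; the DB update still happens after job creation.
    Assumption: a projected tally equal to the threshold is still allowed.
    """
    projected = current_tally + cost_per_job * num_jobs
    return SubmissionDecision(allowed=projected <= tier_threshold,
                              projected_tally=projected)


# Numbers from the bug example: tally 450, threshold 500, 50 credits per job.
one_job = check_submission(450, 500, 50, num_jobs=1)
three_jobs = check_submission(450, 500, 50, num_jobs=3)
print(one_job)
print(three_jobs)
```

With the numbers from the example, a single job stays within the tier, while a combinatorial submission of 3 jobs projects to 600 and would be redirected before any job is created.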

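The status lifecycle from item 4 can be read as a small state machine. The sketch below is illustrative only: `JobStatus`, `ALLOWED_TRANSITIONS`, `advance`, and `should_monitor` are hypothetical names, and the transition table assumes only the transitions spelled out in the description (for example, it does not model a failure during submission itself).

```python
from enum import Enum


class JobStatus(Enum):
    QUEUED = "queued"
    PROCESSING = "processing"   # waiting on the gateway side to be submitted
    PENDING = "pending"         # in Ray's internal queue, set only AFTER submission
    RUNNING = "running"
    STOPPED = "stopped"
    FAILED = "failed"
    SUCCEEDED = "succeeded"


# Assumed transition table, following the lifecycle in the description.
ALLOWED_TRANSITIONS = {
    JobStatus.QUEUED: {JobStatus.PROCESSING},
    JobStatus.PROCESSING: {JobStatus.PENDING},
    JobStatus.PENDING: {JobStatus.RUNNING, JobStatus.STOPPED,
                        JobStatus.FAILED, JobStatus.SUCCEEDED},
    JobStatus.RUNNING: {JobStatus.STOPPED, JobStatus.FAILED,
                        JobStatus.SUCCEEDED},
}

# Jobs in either of these states are monitored after the fix.
MONITORED = {JobStatus.PENDING, JobStatus.RUNNING}


def advance(current: JobStatus, new: JobStatus) -> JobStatus:
    """Return the new status, or raise if the transition is not allowed."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"unexpected transition {current.value} -> {new.value}")
    return new


def should_monitor(status: JobStatus) -> bool:
    return status in MONITORED
```

Under this table, jumping from queued straight to running (what the gateway effectively did before submission) is rejected, while queued -> processing -> pending -> running is accepted.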

@supraja-968 supraja-968 requested a review from acashmoney August 7, 2024 14:41
@supraja-968 supraja-968 changed the title from "Support demo job" to "Small fixes + UI jobs refresh + spam submission failure" Aug 7, 2024
@supraja-968 supraja-968 marked this pull request as ready for review August 7, 2024 14:48
@supraja-968 (Collaborator, Author) commented

This PR has already been included in the plex migration PR to convexity. The changes are in main and deployed to test and prod. Closing this PR.
