optimize object store cache operations #17025

Merged (8 commits) on Nov 20, 2023

Conversation

SergeyYakubov
Contributor

This is the second step after #16783. We introduce a sync_cache parameter to the get_file_name() function and use it to postpone pulling data into the Galaxy cache until the data is really needed. We also add a cache_updated_data parameter to the object store config that allows saving local storage by sending data directly to the object store without keeping a copy in the cache. This is useful by itself (e.g. when a job is running in Pulsar and we just want to send results to the object store) and is also used in integration tests to test the sync_cache functionality.
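To make the two parameters concrete, here is a minimal sketch of the idea using a toy object store backed by two local directories. The class `ToyCachingObjectStore`, its methods `update_from_file` and `in_cache`, and the `pull_count` counter are hypothetical names made up for illustration; only the parameter names `sync_cache` and `cache_updated_data` come from this PR, and the real Galaxy code may behave differently.

```python
import os
import shutil
import tempfile


class ToyCachingObjectStore:
    """Toy stand-in for a caching object store (NOT Galaxy's implementation)."""

    def __init__(self, remote_dir, cache_dir, cache_updated_data=True):
        self.remote_dir = remote_dir      # plays the role of S3 or similar
        self.cache_dir = cache_dir        # plays the role of the local cache
        self.cache_updated_data = cache_updated_data
        self.pull_count = 0               # hypothetical counter, for the demo only

    def _cache_path(self, key):
        return os.path.join(self.cache_dir, key)

    def _remote_path(self, key):
        return os.path.join(self.remote_dir, key)

    def in_cache(self, key):
        return os.path.exists(self._cache_path(key))

    def update_from_file(self, key, source_path):
        """Send data to the remote store; keep a cached copy only if configured."""
        shutil.copyfile(source_path, self._remote_path(key))
        if self.cache_updated_data:
            shutil.copyfile(source_path, self._cache_path(key))

    def get_filename(self, key, sync_cache=True):
        """Return the cache path; pull from the remote only when sync_cache is set."""
        path = self._cache_path(key)
        if sync_cache and not os.path.exists(path):
            shutil.copyfile(self._remote_path(key), path)
            self.pull_count += 1
        return path


def demo():
    with tempfile.TemporaryDirectory() as root:
        remote = os.path.join(root, "remote")
        cache = os.path.join(root, "cache")
        os.makedirs(remote)
        os.makedirs(cache)
        source = os.path.join(root, "result.txt")
        with open(source, "w") as fh:
            fh.write("job output")

        store = ToyCachingObjectStore(remote, cache, cache_updated_data=False)
        store.update_from_file("dataset_1.dat", source)
        cached_after_upload = store.in_cache("dataset_1.dat")

        # Only the path is needed here, so the pull is skipped.
        store.get_filename("dataset_1.dat", sync_cache=False)
        cached_after_lookup = store.in_cache("dataset_1.dat")

        # The data is really needed now, so the pull happens.
        with open(store.get_filename("dataset_1.dat", sync_cache=True)) as fh:
            content = fh.read()
        return cached_after_upload, cached_after_lookup, store.pull_count, content
```

In this sketch, uploading with `cache_updated_data=False` leaves the cache empty, and the file is copied into the cache exactly once, on the first call that asks for synchronization.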

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

added sync_cache parameter to the get_filename function that allows
skipping the pull to cache when not needed
Co-authored-by: John Davis <jdavcs@gmail.com>
@jdavcs
Member

jdavcs commented Nov 20, 2023

Thank you @SergeyYakubov, this looks great! I'll wait for the remaining tests to pass, after which we can merge this.

@SergeyYakubov
Contributor Author

I had to configure the S3 object store to always update the cache on upload - d0eef6a. These tests were failing because Galaxy did not pull data correctly from the object store for composite datasets. This problem is "hidden" when the cache is shared between the job and Galaxy: the pull still does not happen, but the data is there because set_metadata put it there.

It would be good to create integration tests that explicitly reveal the problem: Galaxy, Pulsar, extended metadata, and an object store, with the Galaxy cache and the Pulsar cache on separate storage. It is unrelated to this PR, so I just made the tests work for now by using a shared cache as before.
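The masking effect described above can be illustrated with a toy model in which caches and the object store are plain dictionaries. Everything here is hypothetical (the function names, the key `dataset_1_files/index.html`, and the assumption that the faulty read simply never pulls); it is meant only to show why a shared cache hides a missing pull, not how Galaxy or Pulsar are implemented.

```python
def run_job(job_cache, remote, key, data):
    """Job side: write the output into the job's cache and push it to the remote."""
    job_cache[key] = data
    remote[key] = data


def galaxy_read_without_pull(galaxy_cache, key):
    """Galaxy side with the pull step missing: only the local cache is consulted."""
    return galaxy_cache.get(key)


def scenario(shared_cache):
    remote = {}
    job_cache = {}
    # With a shared cache, Galaxy and the job look at the same storage.
    galaxy_cache = job_cache if shared_cache else {}
    key = "dataset_1_files/index.html"  # hypothetical composite dataset part
    run_job(job_cache, remote, key, b"composite part")
    return galaxy_read_without_pull(galaxy_cache, key)
```

With `shared_cache=True` the read succeeds even though nothing was pulled; with `shared_cache=False` the same read returns `None`, which is the situation a dedicated integration test would expose.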

@jdavcs
Member

jdavcs commented Nov 20, 2023

> I had to configure the S3 object store to always update the cache on upload - d0eef6a. These tests were failing because Galaxy did not pull data correctly from the object store for composite datasets. This problem is "hidden" when the cache is shared between the job and Galaxy: the pull still does not happen, but the data is there because set_metadata put it there.
>
> It would be good to create integration tests that explicitly reveal the problem: Galaxy, Pulsar, extended metadata, and an object store, with the Galaxy cache and the Pulsar cache on separate storage. It is unrelated to this PR, so I just made the tests work for now by using a shared cache as before.

Added this to our Testing Requests board

@jdavcs jdavcs merged commit 36fc64a into galaxyproject:dev Nov 20, 2023
49 checks passed

This PR was merged without a "kind/" label, please correct.
