
Outline the best settings for Parsl configuration with CytoTable #176

Open
d33bs opened this issue Mar 26, 2024 · 7 comments

@d33bs
Member

d33bs commented Mar 26, 2024

Determine the best settings for Parsl configuration with CytoTable (and make subsequent changes within documentation or possibly implement automated configuration). This touches on aspects of #25 but is a bit of a deeper dive into system resources and how they operate in conjunction with CytoTable through Parsl.

  • I've recently heard that the HTE (HighThroughputExecutor, the default configuration) sometimes works better than the TPE (ThreadPoolExecutor).

Originally posted by @d33bs in discussion with @shntnu via #163 (comment)
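
For comparison with the ThreadPoolExecutor example later in this thread, a minimal HighThroughputExecutor configuration might look like the sketch below (the label and worker count are illustrative assumptions, not tuned recommendations):

from parsl.config import Config
from parsl.executors import HighThroughputExecutor

# minimal local HTE configuration, which could be passed to
# cytotable.convert via the parsl_config parameter.
hte_config = Config(
    executors=[
        HighThroughputExecutor(
            label="htex_local",
            # illustrative cap on concurrent workers for the local node
            max_workers=4,
        )
    ]
)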

@d33bs d33bs added the enhancement New feature or request label Mar 26, 2024
@d33bs
Member Author

d33bs commented Mar 26, 2024

From #163:

Anything we should keep in mind as we attempt to convert ~3000 plates using this?

Consider increasing your chunk_size to a higher number to improve runtime performance. The best value will depend on the amount of system memory available where you run CytoTable, the shape of the source data (row and column counts), and the complexity of the join operations. It may be worth a quick test using chunk sizes [10000, 100000, 500000, 1000000] to see which performs best (a timing sketch follows below).
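
As a rough illustration, one could time a single plate conversion at each candidate chunk size (the source path, destination path, and preset below are hypothetical placeholders; adjust them to the data involved):

import time

import cytotable

# time one conversion per candidate chunk size; the paths and preset
# here are illustrative assumptions, not recommendations.
for chunk_size in (10_000, 100_000, 500_000, 1_000_000):
    start = time.time()
    cytotable.convert(
        source_path="plate.sqlite",
        dest_path=f"plate_chunk{chunk_size}.parquet",
        dest_datatype="parquet",
        preset="cellprofiler_sqlite_pycytominer",
        chunk_size=chunk_size,
    )
    print(f"chunk_size={chunk_size}: {time.time() - start:.1f} s")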

... I recommend using cytotable==0.0.6 to help address issues with memory during post-join concatenation (as per #168). I've also noticed there may be issues with the default Parsl executor for CytoTable, the HighThroughputExecutor (HTE), documented as part of #169 (these errors may stem from the HTE itself or from resource constraints, but I'm not certain at this time).

As a result, I might recommend using the ThreadPoolExecutor instead, which may be configured as follows:

import cytotable
from parsl.config import Config
from parsl.executors import ThreadPoolExecutor

cytotable.convert(
    ...
    # parsl_config expects a parsl.Config object
    # (parsl.load returns a DataFlowKernel, not a Config)
    parsl_config=Config(
        executors=[
            ThreadPoolExecutor(
                # set the maximum number of threads at any time, for example 3.
                # if not set, Parsl's default is 2.
                max_threads=3,
            )
        ]
    ),
)
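
(If I understand the current behavior correctly, CytoTable loads the provided parsl.Config itself, so no explicit parsl.load call should be needed here.)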

@d33bs
Member Author

d33bs commented Mar 29, 2024

Hi @shntnu - I wanted to follow up as you prepare for the work you originally mentioned in #163 ("... convert ~3000 plates using this ..."). In order to provide the best possible guidance, I'd wonder:

  • Would you be able to share a link to examples or actual data which will be used?
  • What type of system will run CytoTable? (HPC/not, CPU, memory, etc).

I can also work towards providing more generalized guidance here, but figured it might be good to work this issue from your perspective.

@shntnu
Member

shntnu commented Mar 30, 2024

Hi @shntnu - I wanted to follow up as you prepare for the work you originally mentioned in #163 ("... convert ~3000 plates using this ..."). In order to provide the best possible guidance, I'd wonder:

Thank you for doing this!

  • Would you be able to share a link to examples or actual data which will be used?

This is a draft of the script we will use; we will likely need to iterate on the join (unrelated to CytoTable)

#163 (comment)

We will eventually want to do it on all 2378 SQLite files in cpg0016-jump, which can be fetched like this: https://gist.github.com/shntnu/a57ac6413ed41c653b566c885f29f95a

But to get started, we will likely do it on a smaller batch of data from cpg0020-varchamp (@emiglietta @bethac07 @ErinWeisbart @Zitong-Chen-16 @jessica-ewald)

  • What type of system will run CytoTable? (HPC/not, CPU, memory, etc).

We will do this on an EC2 instance, so we can configure it at will. Most likely, we will create something similar to https://github.com/DistributedScience/Distributed-Collate for this, although some of us are experimenting with SkyPilot for tasks like this.

@d33bs
Member Author

d33bs commented Apr 11, 2024

Thanks @shntnu! I took a look at the resources you provided - very neat to see the things happening within https://github.com/broadinstitute/cpg 🙂.

Sharing some thoughts and questions based on work through this Google Colab notebook (and GitHub Gist backup).

  • I didn't find any objects under cpg0020-varchamp, but I may have been in the wrong bucket. I used aws s3 ls --no-sign-request --region us-west-2 s3://cellpainting-gallery/cpg0020-varchamp/cpg0020-varchamp to try to view objects. Is there a different place I should look to find those data?
  • I ran into AWS CLI snags through cpgdata.utils which I thought might be worth knowing about (let me know if opening issues would be more appropriate than citing them here). These both stem from attempting to use cpg index sync ... commands.
    • I encountered a bug where issuing cpg commands would return botocore objects (e.g. <botocore.awsrequest.AWSRequest object at 0x7a6435835db0>). This apparently stems from AWS modules sometimes requiring the region to be set (see this SO solution). I found env vars were ineffective at resolving this and instead had to override the cpgdata.utils.sync_s3_prefix function with a region entry in order to use it properly (see the notebook above for a reference implementation, and the rough sketch after this list).
    • Through the Colab environment (Python 3.10.12 on Linux x86_64) I had to update cryptography in order to circumvent errors with AttributeError: module 'lib' has no attribute 'X509_V_FLAG_CB_ISSUER_CHECK' (I believe arising through AWS modules). cryptography is constrained by cpgdata's dependencies (pip reports: awscli 2.15.19 requires cryptography<40.0.2,>=3.3.2). Pip was upset about it, but I was able to install and use cryptography-42.0.5 to avoid this error (post-install of cpgdata). It's possible this has something to do with how Google Colab environments are set up (same for the other item), but I thought it could be important to mention just in case.
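
As a rough illustration of the region workaround (not the actual cpgdata override - the client setup and region value below are assumptions; see the linked notebook for the real implementation):

import boto3
from botocore import UNSIGNED
from botocore.config import Config as BotoConfig

# build the S3 client with an explicit region (and unsigned requests
# for the public bucket) so botocore can resolve a region; the region
# value here is an assumption.
s3_client = boto3.client(
    "s3",
    region_name="us-west-2",
    config=BotoConfig(signature_version=UNSIGNED),
)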

All this said, I'm still looking into testing Parsl configurations through CytoTable to ensure this works as well as it can for your use case. Could the following object paths for cpg0016-jump be a representative sample of others in the dataset?

$ aws s3 ls --recursive --human-readable --summarize --no-sign-request s3://cellpainting-gallery/cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12 | grep .sqlite
2022-10-02 06:14:43   22.0 GiB cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126113/BR00126113.sqlite
2022-10-02 07:47:03    3.6 GiB cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126114/BR00126114.sqlite
2022-10-02 06:14:43   10.3 GiB cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126115/BR00126115.sqlite
2022-10-02 06:14:42   17.5 GiB cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126116/BR00126116.sqlite
2022-10-02 07:21:28   27.9 GiB cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126117/BR00126117.sqlite
2022-10-02 06:57:52   18.7 GiB cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126706/BR00126706.sqlite
2022-10-02 06:14:43   25.9 GiB cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126707/BR00126707.sqlite
2022-10-02 06:14:43   17.6 GiB cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126708/BR00126708.sqlite
2022-10-02 07:29:05   18.2 GiB cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126709/BR00126709.sqlite
2022-10-02 07:32:34   18.8 GiB cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126710/BR00126710.sqlite
2022-10-02 06:14:42   18.0 GiB cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126711/BR00126711.sqlite
2022-10-02 06:14:42   17.4 GiB cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126712/BR00126712.sqlite
2022-10-02 07:27:15   18.7 GiB cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126714/BR00126714.sqlite
2022-10-02 08:01:59   19.9 GiB cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126715/BR00126715.sqlite
2022-10-02 07:24:01   18.4 GiB cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126716/BR00126716.sqlite
2022-10-02 06:14:43   17.8 GiB cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126717/BR00126717.sqlite
2022-10-02 06:14:41   18.4 GiB cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126718/BR00126718.sqlite

For now I might start by benchmarking operations with BR00126114.sqlite due to its small storage footprint; I imagine we may see variation in performance depending on the total size of the databases and/or compartment tables.

@ErinWeisbart
Member

For cpg0020-varchamp you had one too many nestings.
You can run aws s3 ls s3://cellpainting-gallery/cpg0020-varchamp/ and see that the "source" subfolder is broad, i.e. s3://cellpainting-gallery/cpg0020-varchamp/broad/

@shntnu
Member

shntnu commented Apr 12, 2024

Thanks @shntnu ! I took a look at those resources you provided - very neat to see the things happening within https://github.com/broadinstitute/cpg 🙂 .

That was all @leoank 🎉, who built on @johnarevalo's https://github.com/jump-cellpainting/data-validation (private).

@ErinWeisbart has been leading the broader effort of making the gallery useful for humanity https://arxiv.org/abs/2402.02203

Sharing some thoughts and questions based on work through this Google Colab notebook (and GitHub Gist backup).

Thanks for diving in!

  • I ran into AWS CLI snags through cpgdata.utils which I thought might be worth knowing about (let me know if opening issues would be more appropriate than citing them here). These both stem from attempting to use cpg index sync ... commands.

@leoank can decide what to do here (create an issue or resolve here)

All this said, I'm still looking into testing Parsl configurations through CytoTable to ensure this works as well as it can for your use case. Could the following object paths for cpg0016-jump be a representative sample of others in the dataset?

Yes

For now I might start by benchmarking operations with BR00126114.sqlite due to its small storage footprint; I imagine we may see variation in performance depending on the total size of the databases and/or compartment tables.

That works

Thanks again for looking into this!

@d33bs
Member Author

d33bs commented Apr 15, 2024

Thank you @ErinWeisbart and @shntnu for the replies here - very helpful! I'm planning to follow up this week with findings and thoughts on best practices.
