-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
optimize lazy execution during pvar read #11
Conversation
I can only replicate the error here within the .zst files. I will need to investigate further.
|
The more I look at this, the more it seems like a duckdb + polars problem rather than just polars. Particularly, when there is an INFO column (large one at that), polars chokes hard regardless of optimization strategy. A tool like duckdb could efficiently ignore that column. Example on why I do not feel polars alone is enough here - I have a pvar file of ~10GB with the INFO column taking up 80-90% of that size. When scanning the file and selecting just the ID column and collecting using polars, memory maxes out at ~11GB as shown in the image below, that is still not good enough IMO. For reference, reading in the entire file instead does take up ~23GB of memory so the improvements over that are still good. I don't have much experience optimizing memory management for full projects like this, but I think it is worth considering other options. Curious to hear thoughts on the direction you guys are wishing this to go. E: Just as a further demonstration of my suggestion, running the same benchmark (tracking indices too) and including filtering down to a small list of 5 variants took 160MB using duckdb, whereas just to get row indices and start filtering columns with polars I was past 10GB. Speed is moderately decreased (took my query 20 seconds from reading to filtering). On smaller files, the speed would be even less noticeable. |
Thanks for investigating this! We tried to optimize for the reading speed and memory usage of the SNPObject, but haven't done many tests on the memory usage of reading itself, so some improvements might be possible. However, we would not like to compromise speed since it might often be preferable to use more memory if the processing is faster, for example, when working in a cluster with a lot of memory and processing whole biobanks. We have not used duckdb in the past, and the fourth Google result for duckdb is https://www.reddit.com/r/dataengineering/comments/zp6ai6/what_are_the_actual_use_cases_for_duckdb_when_you/ which says that it's very slow compared to polars, but the post is 2 years old, so things might have changed. However, if you believe that using duckdb or any other modification can improve memory usage (without compromising speed, or even improving speed), we are happy to accept your contributions! |
I hadn't read the edit before. I understand that memory usage is improved as a tradeoff of the reading speed. If you believe the improvements are considerable, we could consider having two backends, one with only polars and one integrating duckdb. This is actually similar to what we did for the VCF reader, since VCF files are effectively CSV/TSV too (what we did there is to use our own polars reader by default, but the user can also choose to use scikit-allel as an alternative backend, which tends to be slower but more memory-efficient). |
Understood, it is good to know the overall goals. I think the polars solution is clearly best if speed is paramount. I think more testing needs to be done before making any serious changes, but it is something to think about. For the time being, I will clean up this PR with a single commit that takes care of the issues I found with 'zst' files and should improve the header reading logic. 'zst' files need to be read in using a simple read_csv as opposed to the scan. |
* row indices create an issue with lazy execution. any function polars provides to add indices requires the lf be read into memory. * instead of reading the entire lf into memory, we track indices by using the variant ID. * we can determine a blank header from the number of columns in a pvar file - pvar spec allows only for 5/6 cols when header is blank. * the order of the control block for filtering variants provided an issue where the entire lf was read in the case that variant_idxs were not specified, even if variant_ids was. * fix the ordering logic into one if,elif,else block and copy the structure to the psam section. Signed-off-by: Kyle Scott <kms309@miami.edu>
We released v0.2.10 that includes this enhancement and some other bug fixes! |
I noticed some issues with the lazy execution on the pvar file. To the best of my knowledge, there is only so much that can be done to save memory when row indices are important, but these fixes do help. I did not change any of the behavior regarding row indices for psam, so there are some logical differences between the pvar/psam code blocks here. psam files are so small that it probably isn't worth optimizing except for consistency.
I also did some work on the header checking. As it stands, if the header doesn't exist (i.e. #CHROM is not found), the entire file will be read and the if statement checking for #CHROM will continue on until the very end. Doesn't seem like the ideal behavior, so I propose some fixes to that here.
Sorry to drop it all in one commit, I figured I would put up what I have before bed. I may try to look at this a little more, but do with the PR what you wish :)