Try to manage input as streams instead of storing everything in RAM #94

Open
biologyguy opened this issue Feb 17, 2017 · 2 comments

@biologyguy
Owner

biologyguy commented Feb 17, 2017

BuddySuite does not work well on very large files because the Buddy classes read the entire input into memory.
File size can be determined up front, and files could perhaps be managed as open handles instead. This will likely require a major rewrite.

Any way to leverage SQLite databases?

This may be useful (not sure yet):
https://github.com/mdshw5/pyfaidx
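
For context, the general idea behind an indexed FASTA reader is to record where each record starts on disk, then seek to it on demand. The sketch below illustrates that idea with the standard library only. It is not pyfaidx's API or its index format, and `build_index` and `fetch` are hypothetical names chosen for this example.

```python
import os
import tempfile


def build_index(path):
    """Map each record ID to the byte offset of its header line."""
    index = {}
    with open(path, "rb") as handle:
        offset = 0
        for line in handle:
            if line.startswith(b">"):
                fields = line[1:].split()
                rec_id = fields[0].decode() if fields else ""
                index[rec_id] = offset
            offset += len(line)  # binary mode, so len() counts bytes
    return index


def fetch(path, index, rec_id):
    """Seek straight to one record and read only that record's lines."""
    parts = []
    with open(path, "rb") as handle:
        handle.seek(index[rec_id])
        handle.readline()  # skip the header line
        for line in handle:
            if line.startswith(b">"):
                break  # reached the next record
            parts.append(line.strip())
    return b"".join(parts).decode()


# Small demo on a throwaway file (the contents are made up).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.fa")
    with open(demo_path, "w") as out:
        out.write(">seq1 first\nACGT\nACGT\n>seq2\nTTTT\n")
    demo_index = build_index(demo_path)
    seq2 = fetch(demo_path, demo_index, "seq2")
```

Only the index (one integer per record) stays in memory; the sequences themselves are read from disk when requested.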

@KarlKeat
Collaborator

The best way to do this would probably be to read the files in buffered chunks, which would definitely require a rewrite of pretty much everything. I'm also not sure how you would parse a chunk of a sequence file, especially in alignment formats.

If you want to cut down on memory usage, it's probably easiest to look for places where data is being copied unnecessarily. I noticed before that BuddySuite sometimes uses several times more memory than the file size, especially with larger files, so there's probably room for improvement. On the other hand, a lot of that could be Biopython's fault, in which case it would be hard to fix.
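
On the question of parsing a sequence file piece by piece: for a record-oriented format like FASTA, one option is a generator that reads the handle line by line and yields one record at a time, so only the current record is held in memory. The following is a minimal sketch under that assumption, not BuddySuite code; `iter_fasta` is a hypothetical name. It would not help with interleaved alignment formats, where each sequence is spread across blocks throughout the file.

```python
import io


def iter_fasta(handle):
    """Yield (header, sequence) pairs one record at a time from an open handle."""
    header, chunks = None, []
    for line in handle:  # file objects are iterated lazily, line by line
        line = line.rstrip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)  # emit the final record


# Usage: find the longest record without ever building a list of all records.
demo = io.StringIO(">seq1\nACGT\nACGT\n>seq2\nTTTT\n")
longest = max(iter_fasta(demo), key=lambda rec: len(rec[1]))
```

Operations that need every sequence at once (sorting, for example) would still need another strategy, such as an on-disk index.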

As for SQLite databases, BuddySuite could load people's massive files once and store them in a reusable SQLite database. That would limit the overhead on future runs, since sequences could be accessed straight from the database without parsing the files or loading everything into memory at once. I'm not entirely sure this would yield a performance increase, though. It depends on how SQLite databases work under the hood: if everything is loaded into memory anyway, then basically nothing is gained except for skipping the parsing step.
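
A rough sketch of the caching idea, using Python's built-in `sqlite3` module. The table layout and the function names `store_records` and `fetch_sequence` are assumptions made up for this example, not anything that exists in BuddySuite. As far as I know, an on-disk SQLite database is read page by page through a bounded cache rather than loaded whole, so a lookup by ID should only touch the pages it needs.

```python
import sqlite3


def store_records(conn, records):
    """Insert (id, sequence) pairs; `records` can be any iterable, including a generator."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seqs (id TEXT PRIMARY KEY, seq TEXT NOT NULL)"
    )
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT OR REPLACE INTO seqs (id, seq) VALUES (?, ?)", records
        )


def fetch_sequence(conn, rec_id):
    """Return one sequence by ID, or None if it is not in the database."""
    row = conn.execute("SELECT seq FROM seqs WHERE id = ?", (rec_id,)).fetchone()
    return row[0] if row else None


# Demo uses an in-memory database; a reusable cache would pass a file path instead.
conn = sqlite3.connect(":memory:")
store_records(conn, [("seq1", "ACGTACGT"), ("seq2", "TTTT")])
hit = fetch_sequence(conn, "seq2")
miss = fetch_sequence(conn, "seq3")
conn.close()
```

Because `store_records` accepts a generator, records could be fed in one at a time while the source file is being parsed, so the full file never needs to sit in memory during the initial load.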

@biologyguy
Owner Author

Thanks for the thoughts, Karl.
I have a feeling this is going to be a lingering issue... Too many other juicy pieces of low-hanging fruit to grab ;)
