Try to manage input as streams instead of storing everything in RAM #94

biologyguy · 2017-02-17T02:15:47Z

BuddySuite does not work well on very large files, because the Buddy classes read everything into memory.
File size can be checked before loading, and large files could perhaps be managed as open file handles instead of in-memory objects. This will likely require a major rewrite.
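
To make the idea concrete, here is a minimal sketch of checking the file size up front and then walking a FASTA file record by record from an open handle. It is not BuddySuite code: `STREAM_THRESHOLD_BYTES`, `should_stream`, and `iter_fasta` are hypothetical names, the threshold value is arbitrary, and only plain FASTA is handled.

```python
import os
from typing import Iterator, TextIO, Tuple

# Hypothetical cutoff (500 MiB); a sensible value depends on available RAM.
STREAM_THRESHOLD_BYTES = 500 * 1024 * 1024


def should_stream(path: str, threshold: int = STREAM_THRESHOLD_BYTES) -> bool:
    """Return True when the file on disk is larger than the threshold."""
    return os.path.getsize(path) > threshold


def iter_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs one record at a time.

    Only the record currently being assembled is held in memory.
    """
    header = None
    chunks = []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None and line:
            chunks.append(line)
    # Emit the final record once the handle is exhausted.
    if header is not None:
        yield header, "".join(chunks)
```

A caller could branch on `should_stream(path)` and, for large inputs, loop over `iter_fasta(open_handle)` instead of building a full list of records. Any operation that needs all records at once (sorting, for example) would still need a different strategy.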

Any way to leverage SQLite databases?

This may be useful (not sure yet):
https://github.com/mdshw5/pyfaidx
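
The general idea behind indexed FASTA access can be sketched with the standard library alone: scan the file once to record where each record starts, then seek straight to a record when it is needed. This is a simplified illustration, not pyfaidx's API or index format, and `build_offset_index` and `fetch_sequence` are hypothetical names.

```python
from typing import BinaryIO, Dict


def build_offset_index(handle: BinaryIO) -> Dict[str, int]:
    """Map each record ID to the byte offset of its '>' header line."""
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            # Use the first whitespace-delimited token as the record ID.
            parts = line[1:].split()
            if parts:
                index[parts[0].decode()] = offset
    return index


def fetch_sequence(handle: BinaryIO, index: Dict[str, int], seq_id: str) -> str:
    """Seek to one record and read only its sequence lines."""
    handle.seek(index[seq_id])
    handle.readline()  # skip the header line
    parts = []
    for line in iter(handle.readline, b""):
        if line.startswith(b">"):
            break  # reached the next record
        parts.append(line.strip())
    return b"".join(parts).decode()
```

The index holds one integer per record, so memory use scales with the number of records and not with the total sequence length. The file must be opened in binary mode (`"rb"`) so that the stored offsets are true byte positions.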

KarlKeat · 2017-02-17T02:49:22Z

The best way to do this would probably be to read the files in buffered chunks, which would definitely require a rewrite of pretty much everything. I'm also not sure how you would go about parsing a chunk of a sequence file, especially in alignment formats. If you want to cut down on memory usage, it's probably easiest to just look for places where data is being copied unnecessarily. I noticed before that BuddySuite was sometimes using several times more memory than the file size, especially with larger files, so there's probably room for improvement. On the other hand, it could be that a lot of that is Biopython's fault, in which case it would be hard to fix.
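
One way to hunt for those unnecessary copies is to measure peak allocation around individual calls with the standard library's `tracemalloc` module. The helper below is only a sketch; `peak_memory_bytes` is a hypothetical name, and `tracemalloc` only sees memory allocated through Python's allocators, so allocations made directly by C extensions may not show up.

```python
import tracemalloc


def peak_memory_bytes(func, *args, **kwargs):
    """Run func and return (result, peak traced allocation in bytes)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory() returns (current, peak) since start().
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def load_everything(path):
    """Read a whole file into a list of lines (the memory-hungry approach)."""
    with open(path) as handle:
        return handle.readlines()


def count_lines(path):
    """Count lines while holding only one line at a time."""
    with open(path) as handle:
        return sum(1 for _ in handle)
```

Comparing `peak_memory_bytes(load_everything, path)` with `peak_memory_bytes(count_lines, path)` on the same file gives a rough picture of how much a full load costs relative to the file size. Wrapping individual BuddySuite or Biopython calls the same way could help show which layer is responsible for the extra memory.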

In terms of SQLite databases, you could have BuddySuite load a massive file once and store it in a SQLite database that can be reused on later runs. That way I think you'd limit the overhead on future runs by accessing the sequences straight from the database, without having to parse the files or load everything into memory at once. I'm not entirely sure this would yield a performance increase, though. It depends on how SQLite works under the hood: if the whole database is loaded into memory, then basically nothing is gained except for skipping the parsing step.
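
A rough sketch of that idea with the standard library's `sqlite3` module is below. The table layout and the function names (`store_records`, `get_sequence`, `iter_sequences`) are assumptions made for illustration, not an existing BuddySuite schema. For what it's worth, a file-backed SQLite database is read from disk page by page through a bounded cache, so it is not loaded into memory wholesale.

```python
import sqlite3
from typing import Iterable, Iterator, Optional, Tuple


def store_records(conn: sqlite3.Connection,
                  records: Iterable[Tuple[str, str]]) -> None:
    """Insert (id, sequence) pairs; accepts a generator, so input can be streamed."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sequences ("
        "id TEXT PRIMARY KEY, seq TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO sequences (id, seq) VALUES (?, ?)", records
    )
    conn.commit()


def get_sequence(conn: sqlite3.Connection, seq_id: str) -> Optional[str]:
    """Fetch a single sequence by ID, or None if it is not stored."""
    row = conn.execute(
        "SELECT seq FROM sequences WHERE id = ?", (seq_id,)
    ).fetchone()
    return row[0] if row else None


def iter_sequences(conn: sqlite3.Connection) -> Iterator[Tuple[str, str]]:
    """Yield all (id, sequence) rows lazily from a cursor."""
    cursor = conn.execute("SELECT id, seq FROM sequences ORDER BY id")
    for seq_id, seq in cursor:
        yield seq_id, seq
```

Connecting with `sqlite3.connect("sequences.db")` (a hypothetical file name) would keep the data on disk between runs, while `":memory:"` is only useful for testing. Whether this beats re-parsing the original file would need to be measured.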

biologyguy · 2017-02-17T03:09:06Z

Thanks for the thoughts Karl.
I have a feeling this is going to be a lingering issue... Too many other juicy pieces of low hanging fruit to grab ;)

biologyguy added the enhancement label Feb 17, 2017
