Optimize HGMD information search in Feature Engineering Part 1 #61
Background
In Issue #54, we identified that Feature Engineering Part 1 was the primary bottleneck when processing WGS-level VCF files.
Based on profiling by @hyunhwan-bcm, we pinpointed the following lines as the main culprits for the slowdown:
Given that these operations are executed 400,000 times (with N as the number of rows in varDf) and hgmdDf has 350,000 rows (denoted by M), the time complexity scales to O(NM). This results in an intractable number of operations, proportional to 140 billion pairs.
Optimized Code
We've improved the implementation to achieve O(N log M) complexity:
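As a minimal sketch of the general idea (not necessarily the code merged in this PR), one way to reach O(N log M) is to sort the HGMD records once by a lookup key and then binary-search that sorted key list for each variant. Everything below is an assumption made for illustration: the `(chrom, pos)` key, the label values, and the small lists standing in for hgmdDf and varDf are hypothetical.

```python
from bisect import bisect_left

# Hypothetical stand-ins for hgmdDf (chrom, pos, label) and varDf (chrom, pos).
hgmd_rows = [("1", 500, "DM"), ("1", 100, "DM?"), ("2", 100, "DM"), ("X", 42, "DM")]
variants = [("1", 100), ("1", 101), ("X", 42), ("2", 7)]


def lookup_linear(key, rows):
    """O(M) per query: scan every HGMD row, which gives O(NM) overall."""
    for chrom, pos, label in rows:
        if (chrom, pos) == key:
            return label
    return None


# One-time O(M log M) preparation: sort rows by key, keep keys in a parallel list.
sorted_rows = sorted(hgmd_rows, key=lambda r: (r[0], r[1]))
sorted_keys = [(r[0], r[1]) for r in sorted_rows]


def lookup_sorted(key, keys=sorted_keys, rows=sorted_rows):
    """O(log M) per query: binary search over the sorted keys."""
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return rows[i][2]
    return None


# Both strategies should return the same labels; only the cost differs.
linear_hits = [lookup_linear(v, hgmd_rows) for v in variants]
sorted_hits = [lookup_sorted(v) for v in variants]
```

With N = 400,000 and M = 350,000, the per-variant binary search needs about log2(350,000) ≈ 18 comparisons, so the total is on the order of 7 million comparisons rather than 140 billion pairs.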
Conclusion
After applying these optimizations, the processing time for WGS data was reduced from 5 hours to 42 minutes, with identical output to the original implementation.