Optimize HGMD information search in Feature Engineering Part 1 #61

jylee-bcm · 2024-08-19T18:31:45Z

Background

In Issue #54, we identified that Feature Engineering Part 1 was the primary bottleneck when processing WGS-level VCF files.

Based on profiling by @hyunhwan-bcm, we pinpointed the following lines as the main culprits for the slowdown:

np.any(hgmdDf["gene_sym"].isin([varObj.geneSymbol]))

varDf = hgmdHPOScoreDf[hgmdHPOScoreDf["acc_num"] == varObj.hgmd_id]
varScore = max(varDf["Similarity_Score"].tolist())

These operations are executed 400,000 times (N, the number of rows in varDf), and each one scans a table of 350,000 rows (M, the size of hgmdDf), so the total cost scales as O(NM). That works out to roughly 140 billion comparisons, which is prohibitively slow.
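The arithmetic behind that estimate can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the variable names are made up for illustration, and the second figure assumes a binary-search-style lookup costing at most ceil(log2(M)) steps per query.

```python
import math

# Figures quoted in the description above.
n_lookups = 400_000  # N: how many times the lookups run
m_rows = 350_000     # M: rows scanned by each lookup

# Scan-based lookups touch every row on every call.
scan_comparisons = n_lookups * m_rows

# Assumed upper bound for a lookup against a sorted index.
sorted_steps = n_lookups * math.ceil(math.log2(m_rows))

print(f"scan-based:   {scan_comparisons:,} comparisons")
print(f"sorted-index: {sorted_steps:,} steps (upper bound)")
```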

Optimized Code

We've improved the implementation to achieve O(N log M) complexity:

varObj.geneSymbol in hgmdHPOScoreGeneSortedDf.index

varScore = hgmdHPOScoreAccSortedDf.loc[varObj.hgmd_id].Similarity_Score

where the two indexed frames are built as follows:

hgmdHPOScoreGeneSortedDf = hgmdHPOScoreDf.groupby('gene_sym').first().sort_index()
hgmdHPOScoreAccSortedDf = hgmdHPOScoreDf.groupby('acc_num').first().sort_index()
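As a minimal, self-contained sketch of the pattern above: the toy rows and the `lookup` helper are hypothetical and only illustrate the idea, and the sketch assumes the indexed frames are built once, outside the per-variant loop.

```python
import pandas as pd

# Toy stand-in for hgmdHPOScoreDf; the values are invented for illustration.
hgmdHPOScoreDf = pd.DataFrame(
    {
        "acc_num": ["CM001", "CM002", "CM003"],
        "gene_sym": ["BRCA1", "TP53", "BRCA1"],
        "Similarity_Score": [0.8, 0.5, 0.3],
    }
)

# Build the indexed frames once, before iterating over variants.
hgmdHPOScoreGeneSortedDf = hgmdHPOScoreDf.groupby("gene_sym").first().sort_index()
hgmdHPOScoreAccSortedDf = hgmdHPOScoreDf.groupby("acc_num").first().sort_index()


def lookup(gene_symbol, hgmd_id):
    """Hypothetical helper: return (gene found?, similarity score or None)."""
    # Membership test against the index instead of scanning the column.
    gene_found = gene_symbol in hgmdHPOScoreGeneSortedDf.index
    if hgmd_id in hgmdHPOScoreAccSortedDf.index:
        # Label-based row lookup instead of a boolean mask over all rows.
        score = hgmdHPOScoreAccSortedDf.loc[hgmd_id].Similarity_Score
    else:
        score = None
    return gene_found, score


print(lookup("BRCA1", "CM002"))
print(lookup("UNKNOWN", "CM999"))
```

Guarding `.loc` with a membership test avoids a `KeyError` for accession numbers that are absent from the table; whether the real code needs that guard depends on how `varObj.hgmd_id` is populated.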

Conclusion

After applying these optimizations, the processing time for WGS data was reduced from 5 hours to 42 minutes, with identical output to the original implementation.
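One way to spot-check that equivalence on a small sample is sketched below. The data, the helper names `old_lookup` and `new_lookup`, and the query list are all hypothetical. The sketch assumes each acc_num appears in exactly one row; if an accession had several rows with different scores, `first()` and `max()` could disagree, so that case would be worth checking separately.

```python
import numpy as np
import pandas as pd

# Toy stand-in for hgmdHPOScoreDf (invented values, one row per acc_num).
hgmdHPOScoreDf = pd.DataFrame(
    {
        "acc_num": ["CM001", "CM002", "CM003"],
        "gene_sym": ["BRCA1", "TP53", "BRCA1"],
        "Similarity_Score": [0.8, 0.5, 0.3],
    }
)
gene_sorted = hgmdHPOScoreDf.groupby("gene_sym").first().sort_index()
acc_sorted = hgmdHPOScoreDf.groupby("acc_num").first().sort_index()


def old_lookup(gene_symbol, hgmd_id):
    # Scan-based version, following the original lines quoted above.
    gene_found = bool(np.any(hgmdHPOScoreDf["gene_sym"].isin([gene_symbol])))
    varDf = hgmdHPOScoreDf[hgmdHPOScoreDf["acc_num"] == hgmd_id]
    return gene_found, max(varDf["Similarity_Score"].tolist())


def new_lookup(gene_symbol, hgmd_id):
    # Index-based version, following the optimized lines quoted above.
    gene_found = gene_symbol in gene_sorted.index
    return gene_found, float(acc_sorted.loc[hgmd_id].Similarity_Score)


# Queries use accession numbers present in the table.
queries = [("BRCA1", "CM001"), ("TP53", "CM002"), ("UNKNOWN", "CM003")]
mismatches = [q for q in queries if old_lookup(*q) != new_lookup(*q)]
print("mismatches:", mismatches)
```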

hyunhwan-bcm

LGTM

hyunhwan-bcm · 2024-08-19T20:27:40Z

bin/annotation/utils_for_marrvel_flatfile.py

    """
    function to get HGMD from local flat file
    Params:
    varObj:a varaint object read from VEP annotation
-    hgmdDf: HGMD data frame read from local file (CL: now it refers to hgmdHPOScoreDf in main.py)


Thanks for cleaning up the comments.

hyunhwan-bcm · 2024-08-19T20:28:37Z

bin/annotation/utils_for_marrvel_flatfile.py

-    # print('\thgmdVarFound:',hgmdVarFound,'hgmdGeneFound:',hgmdGeneFound,
-    #      'hgmdVarPhenIdList:',hgmdVarPhenIdList,'hgmdVarHPOIdList:',hgmdVarHPOIdList,
-    #      'hgmdVarHPOStrList:',hgmdVarHPOStrList)
-    # return
    retList = [


Not sure we need the other things here, but let's keep them for now and remove them in a future PR.

jylee-bcm requested a review from hyunhwan-bcm August 19, 2024 19:12

Optimize HGMD information search in Feature Engineering Part 1

49dd7ce

jylee-bcm force-pushed the fe1 branch from 4d3faab to 49dd7ce August 19, 2024 19:40

jylee-bcm marked this pull request as ready for review August 19, 2024 19:46

jylee-bcm assigned jylee-bcm and hyunhwan-bcm Aug 19, 2024

jylee-bcm added the enhancement label Aug 19, 2024

hyunhwan-bcm approved these changes Aug 19, 2024

hyunhwan-bcm merged commit d04d442 into nextflow_conversion Aug 19, 2024
0 of 2 checks passed

jylee-bcm mentioned this pull request Aug 23, 2024

Optimize bin/feature.py for memory usage and processing time #54

Closed
