Optimize HGMD information search in Feature Engineering Part 1 #61

Merged
merged 1 commit into from
Aug 19, 2024

Conversation

@jylee-bcm jylee-bcm commented Aug 19, 2024

Background

In Issue #54, we identified that Feature Engineering Part 1 was the primary bottleneck when processing WGS-level VCF files.

Based on profiling by @hyunhwan-bcm, we pinpointed the following lines as the main culprits for the slowdown:

np.any(hgmdDf["gene_sym"].isin([varObj.geneSymbol]))
varDf = hgmdHPOScoreDf[hgmdHPOScoreDf["acc_num"] == varObj.hgmd_id]
varScore = max(varDf["Similarity_Score"].tolist())

These operations are executed 400,000 times (denoted by N), and each call scans hgmdDf, which has 350,000 rows (denoted by M), so the total cost scales as O(NM). That is on the order of 140 billion row comparisons, which is impractically slow.
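As a rough check of the numbers above, the following sketch compares the number of row comparisons for a full scan per lookup against a binary search on a sorted index. The variable names are illustrative only, and the log2(M) cost per lookup is an assumption taken from the O(N log M) target; it is not a measurement of pandas internals.

```python
import math

n_lookups = 400_000  # N: number of lookups, from the description above
m_rows = 350_000     # M: rows in hgmdDf, from the description above

# Full scan: every lookup compares against every row.
scan_ops = n_lookups * m_rows

# Sorted index with binary search: about ceil(log2(M)) comparisons per lookup.
indexed_ops = n_lookups * math.ceil(math.log2(m_rows))

print(f"full scan:     {scan_ops:,}")
print(f"indexed:       {indexed_ops:,}")
print(f"ratio (approx): {scan_ops // indexed_ops:,}x")
```

The ratio only counts comparisons, so it should not be read as a predicted wall-clock speedup.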

Optimized Code

We've improved the implementation to achieve O(N log M) complexity:

varObj.geneSymbol in hgmdHPOScoreGeneSortedDf.index
varScore = hgmdHPOScoreAccSortedDf.loc[varObj.hgmd_id].Similarity_Score

Where,

hgmdHPOScoreGeneSortedDf = hgmdHPOScoreDf.groupby('gene_sym').first().sort_index()
hgmdHPOScoreAccSortedDf = hgmdHPOScoreDf.groupby('acc_num').first().sort_index()
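A minimal, self-contained sketch of the two lookup styles on toy data. The rows, the accession IDs, and the helper names `lookup_scan` and `lookup_indexed` are hypothetical and only serve to illustrate the pattern. Note that `groupby(...).first()` keeps the first row per key while the original code takes the maximum score, so the two agree only if each `acc_num` maps to a single score; the toy data assumes that.

```python
import pandas as pd

# Hypothetical toy stand-in for hgmdHPOScoreDf (the real table has ~350,000 rows).
hgmdHPOScoreDf = pd.DataFrame({
    "acc_num": ["CM001", "CM002", "CM003"],
    "gene_sym": ["BRCA1", "BRCA1", "TP53"],
    "Similarity_Score": [0.8, 0.3, 0.5],
})

# Built once, before the per-variant loop.
hgmdHPOScoreGeneSortedDf = hgmdHPOScoreDf.groupby("gene_sym").first().sort_index()
hgmdHPOScoreAccSortedDf = hgmdHPOScoreDf.groupby("acc_num").first().sort_index()


def lookup_scan(gene_symbol, hgmd_id):
    """Original style: scan the whole frame on every call."""
    gene_found = bool(hgmdHPOScoreDf["gene_sym"].isin([gene_symbol]).any())
    var_df = hgmdHPOScoreDf[hgmdHPOScoreDf["acc_num"] == hgmd_id]
    return gene_found, max(var_df["Similarity_Score"].tolist())


def lookup_indexed(gene_symbol, hgmd_id):
    """Optimized style: membership test and .loc on a prebuilt index."""
    gene_found = gene_symbol in hgmdHPOScoreGeneSortedDf.index
    score = hgmdHPOScoreAccSortedDf.loc[hgmd_id].Similarity_Score
    return gene_found, float(score)


# Both styles should agree for every accession ID in the toy table.
for acc in hgmdHPOScoreDf["acc_num"]:
    assert lookup_scan("BRCA1", acc) == lookup_indexed("BRCA1", acc)
```

Like the optimized lines above, `lookup_indexed` raises `KeyError` when `hgmd_id` is absent from the index, so the caller is assumed to check that the variant exists before asking for its score.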

Conclusion

After applying these optimizations, the processing time for WGS data was reduced from 5 hours to 42 minutes, with identical output to the original implementation.

@jylee-bcm jylee-bcm requested a review from hyunhwan-bcm August 19, 2024 19:12
@jylee-bcm jylee-bcm marked this pull request as ready for review August 19, 2024 19:46
@jylee-bcm jylee-bcm added the enhancement New feature or request label Aug 19, 2024
@hyunhwan-bcm hyunhwan-bcm left a comment

LGTM

"""
function to get HGMD from local flat file
Params:
varObj: a variant object read from VEP annotation
hgmdDf: HGMD data frame read from local file (CL: now it refers to hgmdHPOScoreDf in main.py)

Thanks for cleaning the comments up.

# print('\thgmdVarFound:',hgmdVarFound,'hgmdGeneFound:',hgmdGeneFound,
# 'hgmdVarPhenIdList:',hgmdVarPhenIdList,'hgmdVarHPOIdList:',hgmdVarHPOIdList,
# 'hgmdVarHPOStrList:',hgmdVarHPOStrList)
# return
retList = [

Not sure we need the other things here, but let's keep them for now and remove them in a future PR.

@hyunhwan-bcm hyunhwan-bcm merged commit d04d442 into nextflow_conversion Aug 19, 2024
0 of 2 checks passed