Optimize bin/feature.py for memory usage and processing time #54

Closed
hyunhwan-bcm opened this issue Aug 14, 2024 · 2 comments
Labels: enhancement (New feature or request)


hyunhwan-bcm commented Aug 14, 2024

Description

The bin/feature.py script currently consumes substantial memory and runs slowly. We need to optimize it, focusing on how it uses pandas and on the running time of specific functions.

Current Performance Issues

Memory Usage

  1. After pd.read_csv, the resulting DataFrames occupy more memory than the source files do on disk.
  2. Significant memory increments are observed for various DataFrame operations.
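One common way to shrink a DataFrame produced by pd.read_csv is to load only the columns that are used and to declare compact dtypes up front. The sketch below illustrates the idea on made-up data. The column names and dtypes are assumptions, not the actual schema of the annotation files, and the savings on the real files would need to be measured.

```python
import io

import pandas as pd

# Hypothetical stand-in for an annotation file; the real column names may differ.
csv_text = "chrom,start,end,gene,score\n" + "".join(
    f"chr{i % 22 + 1},{i},{i + 100},GENE{i % 50},{i % 7}\n" for i in range(10_000)
)

# Default load: pandas infers dtypes, typically int64 for integers and a
# per-value string representation for text columns.
naive = pd.read_csv(io.StringIO(csv_text))

# Leaner load: keep only the needed columns and declare compact dtypes.
lean = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["chrom", "start", "gene", "score"],
    dtype={"chrom": "category", "gene": "category", "start": "int32", "score": "int8"},
)

# deep=True also counts the memory held by string values, not just pointers.
naive_bytes = int(naive.memory_usage(deep=True).sum())
lean_bytes = int(lean.memory_usage(deep=True).sum())
print(f"naive: {naive_bytes:,} bytes, lean: {lean_bytes:,} bytes")
```

The category dtype pays off when a column has few distinct values relative to its length, as gene symbols or chromosome names often do. For columns with mostly unique values it can use more memory than plain strings.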

Time Consumption

  1. The varDf.apply call (line 316) takes 27.9% of the total time (326.6 s of 1169.54 s).
  2. The hgmdSymMatch function consumes 65.0% of the total time (760.6 s over 55,531 calls).
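A row-wise DataFrame.apply with axis=1 calls a Python function once per row, which is usually much slower than expressing the same logic as column operations. The body of the real f is not shown in this issue, so the sketch below uses an invented f and invented columns purely to show the shape of such a rewrite.

```python
import pandas as pd

# Hypothetical variant table; the real varDf columns are not shown in this issue.
var_df = pd.DataFrame(
    {
        "ref": ["A", "AT", "G", "C"],
        "alt": ["T", "A", "GGC", "C"],
        "af": [0.01, 0.20, 0.0005, 0.5],
    }
)


def f(row):
    # Row-wise version: called once per row, each time on a freshly built Series.
    return {
        "is_indel": len(row["ref"]) != len(row["alt"]),
        "is_rare": row["af"] < 0.01,
    }


slow = var_df.apply(f, axis=1, result_type="expand")

# Column-wise version: each expression runs once over the whole column.
fast = pd.DataFrame(
    {
        "is_indel": var_df["ref"].str.len() != var_df["alt"].str.len(),
        "is_rare": var_df["af"] < 0.01,
    }
)

# Both versions should hold the same values.
print(slow.astype(bool).equals(fast))
```

Whether the real f can be rewritten this way depends on what it computes. Logic that needs external lookups per row may be better served by a merge than by column arithmetic.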
@hyunhwan-bcm hyunhwan-bcm added this to the v1.0 milestone Aug 14, 2024
hyunhwan-bcm (author) commented

Annotation file size

[ 204]  .
├── [ 272]  anno_hg19
│   ├── [4.7M]  decipher.csv
│   ├── [164M]  dgv.csv
│   ├── [1.8M]  gene_clinvar.csv
│   ├── [ 43M]  gene_omim.json
│   ├── [ 13M]  gnomad.v2.1.1.lof_metrics.by_gene.txt
│   └── [4.4M]  omim_alleric_variants.json
├── [ 272]  anno_hg38
│   ├── [4.7M]  decipher.csv
│   ├── [ 38M]  dgv.csv
│   ├── [1.8M]  gene_clinvar.csv
│   ├── [ 43M]  gene_omim.json
│   ├── [ 13M]  gnomad.v2.1.1.lof_metrics.by_gene.txt
│   └── [4.4M]  omim_alleric_variants.json
└── [8.8K]  feature_stats.csv

The running time profile

Timer unit: 1 s

Total time: 1169.54 s
File: /Users/hyun-hwanjeong/Workspaces/AI_MARRVEL/bin/feature.py
Function: main at line 47

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================

   316         1        326.6    326.6     27.9          annotateInfoDf = varDf.apply(f, axis=1, result_type='expand')

   352         1         28.2     28.2      2.4              resDf = annotateInfoDf.apply(f, axis=1, result_type='expand')

   360     55531         27.5      0.0      2.3              omimSymMatch(varObj, omimHPOScoreDf, args.inFileType)

   361     55531        760.6      0.0     65.0              hgmdSymMatch(varObj, hgmdHPOScoreDf)

   428         1          1.2      1.2      0.1      score.to_csv("scores.csv", index=False)
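hgmdSymMatch is called 55,531 times, so any work that scans hgmdHPOScoreDf inside the function is repeated for every variant. Its implementation is not shown here. The sketch below assumes, hypothetically, that each call filters the score table by gene symbol and takes the best score, and shows how that pattern could be replaced by a single precomputed lookup. The function name sym_match_scan and all column names are invented for illustration.

```python
import pandas as pd

# Hypothetical score table and variants; the real hgmdHPOScoreDf schema is not shown here.
score_df = pd.DataFrame(
    {"gene_symbol": ["BRCA1", "TP53", "TP53", "CFTR"], "score": [0.9, 0.4, 0.7, 0.2]}
)
variants = pd.DataFrame(
    {"var_id": ["v1", "v2", "v3"], "gene_symbol": ["TP53", "CFTR", "XYZ"]}
)


def sym_match_scan(symbol, table):
    # Per-call pattern: a boolean mask scans the whole table on every call.
    hits = table.loc[table["gene_symbol"] == symbol, "score"]
    return hits.max() if len(hits) else 0.0


scanned = [sym_match_scan(s, score_df) for s in variants["gene_symbol"]]

# Alternative: reduce the table to one score per symbol once,
# then map all variants in a single vectorized step.
best_by_symbol = score_df.groupby("gene_symbol")["score"].max()
mapped = variants["gene_symbol"].map(best_by_symbol).fillna(0.0).tolist()

print(scanned, mapped)
```

The per-call version costs one full table scan per variant, while the precomputed version scans the table once. If the real function does something other than a symbol-keyed lookup, a different restructuring would be needed.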

The memory profile

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================

   108  121.605 MiB   35.859 MiB           1       gnomadMetricsGeneDf = pd.read_csv(fileName, sep="\t")

   136  127.469 MiB    5.863 MiB           1           omimHPOScoreDf = pd.read_csv(fileName, sep="\t")

   140  179.078 MiB   51.609 MiB           1           hgmdHPOScoreDf = pd.read_csv(fileName, sep="\t")

   153  185.281 MiB    6.203 MiB           1           clinvarGeneDf = pd.read_csv(fileName, sep=",")

   164  237.117 MiB   50.961 MiB           1               omimGeneList = json.load(f)

   207  604.121 MiB  356.621 MiB           1           dgvDf = pd.read_csv(fileName, sep=",")

   273 1221.754 MiB  661.543 MiB           2           varDf = pd.read_csv(

   304 1703.629 MiB  315.082 MiB       55532           def f(row):

   423 1538.184 MiB   27.973 MiB           1       score = load_raw_matrix(annotateInfoDf)

   425 1537.270 MiB   25.773 MiB           1       score = hgmdCurate(score)
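The largest single increment in the profile comes from loading varDf. If some rows can be discarded or reduced before the full table is needed, reading the file in chunks keeps the peak lower than loading everything at once. Whether that applies to varDf is an assumption. The columns and the filter in this sketch are invented.

```python
import io

import pandas as pd

# Hypothetical variant file; column names and the filter are illustrative only.
tsv_text = "chrom\tpos\tgene\tqual\n" + "".join(
    f"chr1\t{i}\tGENE{i % 10}\t{i % 100}\n" for i in range(5_000)
)

kept = []
# chunksize makes read_csv return an iterator of DataFrames instead of one big frame.
for chunk in pd.read_csv(io.StringIO(tsv_text), sep="\t", chunksize=1_000):
    # Filter each chunk before keeping it, so discarded rows never accumulate.
    kept.append(chunk[chunk["qual"] >= 90])

var_df = pd.concat(kept, ignore_index=True)
print(len(var_df))
```

This only helps when the filter removes a meaningful share of rows. If every row is needed downstream, compact dtypes are likely the more useful lever.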

jylee-bcm commented

Can I get an update on this issue? Did the recent PR #61 improve the memory usage and processing time?
