Optimize bin/feature.py for memory usage and processing time #54

hyunhwan-bcm · 2024-08-14T03:36:28Z

Description

The bin/feature.py script currently consumes substantial memory and runs slowly. We need to optimize it, focusing on how pandas is used and on reducing the processing time of specific functions.

Current Performance Issues

Memory Usage

After pd.read_csv, the resulting DataFrames occupy more memory than the source files do on disk.
Several DataFrame operations cause significant memory increments.
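
One common way to shrink the footprint of a pd.read_csv result is to pass explicit dtypes and load only the columns that are needed. The sketch below is illustrative only: the table and its column names (chrom, start, end, variant_type) are made up and are not the columns of the real annotation files.

```python
import io

import pandas as pd

# Hypothetical stand-in for an annotation table; column names are illustrative.
rows = ["chrom,start,end,variant_type"] + [
    f"chr{i % 22 + 1},{i * 100},{i * 100 + 50},{'gain' if i % 2 else 'loss'}"
    for i in range(10_000)
]
csv_text = "\n".join(rows)


def load_default(text):
    """Let pandas infer dtypes (64-bit integers and generic strings)."""
    return pd.read_csv(io.StringIO(text))


def load_compact(text):
    """Load only the listed columns, with narrow and categorical dtypes."""
    return pd.read_csv(
        io.StringIO(text),
        usecols=["chrom", "start", "end", "variant_type"],
        dtype={
            "chrom": "category",         # few distinct values, repeated often
            "start": "int32",            # assumes coordinates fit in 32 bits
            "end": "int32",
            "variant_type": "category",
        },
    )


def total_bytes(df):
    """Total memory of a DataFrame, including string payloads."""
    return int(df.memory_usage(deep=True).sum())


default_df = load_default(csv_text)
compact_df = load_compact(csv_text)
print(total_bytes(default_df), total_bytes(compact_df))
```

Whether this helps in feature.py depends on which columns are actually used downstream and on whether their values fit the narrower types.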

Time Consumption

The apply function on varDf takes 27.9% of the total time.
The hgmdSymMatch function consumes 65.0% of the total time.
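
The body of hgmdSymMatch is not shown in this issue, so the following is only a guess at the kind of change that could help. If the function filters the score table by gene symbol on every call, building a lookup once and reusing it avoids rescanning the table each time. The table contents and the column names gene_symbol and score are hypothetical.

```python
import pandas as pd

# Hypothetical score table; the real hgmdHPOScoreDf columns are not shown here.
score_df = pd.DataFrame(
    {
        "gene_symbol": ["BRCA1", "TP53", "BRCA1", "CFTR"],
        "score": [0.9, 0.4, 0.7, 0.2],
    }
)


def max_score_by_scan(df, symbol):
    """Slow pattern: a full boolean-mask scan of the table on every call."""
    hits = df.loc[df["gene_symbol"] == symbol, "score"]
    return float(hits.max()) if len(hits) else 0.0


def build_score_index(df):
    """Group once up front so each later lookup is a dictionary access."""
    return df.groupby("gene_symbol")["score"].max().to_dict()


def max_score_by_index(index, symbol):
    """Fast pattern: constant-time lookup with a default for unknown genes."""
    return float(index.get(symbol, 0.0))


index = build_score_index(score_df)
for symbol in ["BRCA1", "TP53", "UNKNOWN"]:
    print(symbol, max_score_by_scan(score_df, symbol), max_score_by_index(index, symbol))
```

With tens of thousands of calls, the one-time cost of building the index is usually small compared with repeating the scan per variant.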


hyunhwan-bcm · 2024-08-14T03:42:31Z

Annotation file size

[ 204]  .
├── [ 272]  anno_hg19
│   ├── [4.7M]  decipher.csv
│   ├── [164M]  dgv.csv
│   ├── [1.8M]  gene_clinvar.csv
│   ├── [ 43M]  gene_omim.json
│   ├── [ 13M]  gnomad.v2.1.1.lof_metrics.by_gene.txt
│   └── [4.4M]  omim_alleric_variants.json
├── [ 272]  anno_hg38
│   ├── [4.7M]  decipher.csv
│   ├── [ 38M]  dgv.csv
│   ├── [1.8M]  gene_clinvar.csv
│   ├── [ 43M]  gene_omim.json
│   ├── [ 13M]  gnomad.v2.1.1.lof_metrics.by_gene.txt
│   └── [4.4M]  omim_alleric_variants.json
└── [8.8K]  feature_stats.csv

The running time profile

Timer unit: 1 s

Total time: 1169.54 s
File: /Users/hyun-hwanjeong/Workspaces/AI_MARRVEL/bin/feature.py
Function: main at line 47

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================

   316         1        326.6    326.6     27.9          annotateInfoDf = varDf.apply(f, axis=1, result_type='expand')

   352         1         28.2     28.2      2.4              resDf = annotateInfoDf.apply(f, axis=1, result_type='expand')

   360     55531         27.5      0.0      2.3              omimSymMatch(varObj, omimHPOScoreDf, args.inFileType)

   361     55531        760.6      0.0     65.0              hgmdSymMatch(varObj, hgmdHPOScoreDf)

   428         1          1.2      1.2      0.1      score.to_csv("scores.csv", index=False)
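
The profile above attributes 27.9% of the run time to a row-wise apply on varDf. What the function f computes is not shown here, so the sketch below uses a made-up per-row rule (classifying a variant by comparing ref and alt lengths) only to illustrate replacing apply(axis=1) with column-wise operations. The columns ref and alt and the function classify_row are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical variant table; the real varDf columns are not shown in this issue.
var_df = pd.DataFrame({"ref": ["A", "AT", "G"], "alt": ["T", "A", "GCC"]})


def classify_row(row):
    """Per-row rule, called once per row by DataFrame.apply(axis=1)."""
    if len(row["ref"]) == len(row["alt"]):
        return "snv"
    return "del" if len(row["ref"]) > len(row["alt"]) else "ins"


# Row-wise version: one Python function call per row.
row_wise = var_df.apply(classify_row, axis=1)

# Column-wise version: the same rule expressed on whole columns at once.
ref_len = var_df["ref"].str.len()
alt_len = var_df["alt"].str.len()
vectorized = pd.Series(
    np.select([ref_len == alt_len, ref_len > alt_len], ["snv", "del"], default="ins"),
    index=var_df.index,
)

print(list(row_wise), list(vectorized))
```

Not every row function can be rewritten this way. Logic that calls external lookups per row may be better served by a merge or a precomputed dictionary.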

The memory profile

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================

   108  121.605 MiB   35.859 MiB           1       gnomadMetricsGeneDf = pd.read_csv(fileName, sep="\t")

   136  127.469 MiB    5.863 MiB           1           omimHPOScoreDf = pd.read_csv(fileName, sep="\t")

   140  179.078 MiB   51.609 MiB           1           hgmdHPOScoreDf = pd.read_csv(fileName, sep="\t")

   153  185.281 MiB    6.203 MiB           1           clinvarGeneDf = pd.read_csv(fileName, sep=",")

   164  237.117 MiB   50.961 MiB           1               omimGeneList = json.load(f)

   207  604.121 MiB  356.621 MiB           1           dgvDf = pd.read_csv(fileName, sep=",")

   273 1221.754 MiB  661.543 MiB           2           varDf = pd.read_csv(

   304 1703.629 MiB  315.082 MiB       55532           def f(row):

   423 1538.184 MiB   27.973 MiB           1       score = load_raw_matrix(annotateInfoDf)

   425 1537.270 MiB   25.773 MiB           1       score = hgmdCurate(score)
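
The largest increments above come from loading whole files with pd.read_csv. If rows can be filtered or reduced before the full table is needed (an assumption, since the downstream use of varDf and dgvDf is not shown here), reading in chunks keeps peak memory closer to the size of one chunk plus the retained rows. The data, the column names, and the helper load_filtered_in_chunks are hypothetical.

```python
import io

import pandas as pd

# Hypothetical input; the column names are illustrative only.
csv_text = "\n".join(["gene,score"] + [f"G{i % 50},{i % 7}" for i in range(1000)])


def load_filtered_in_chunks(text, min_score, chunksize=100):
    """Read the CSV piece by piece and keep only rows that pass the filter."""
    kept = []
    for chunk in pd.read_csv(io.StringIO(text), chunksize=chunksize):
        kept.append(chunk[chunk["score"] >= min_score])
    return pd.concat(kept, ignore_index=True)


filtered = load_filtered_in_chunks(csv_text, min_score=5)
print(len(filtered))
```

If every row is needed in memory at once, chunking alone does not reduce the final footprint, and narrower dtypes are the more relevant lever.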

jylee-bcm · 2024-08-23T19:39:04Z

Can I get an update on this issue? Did the recent PR #61 improve the memory usage and processing time?

hyunhwan-bcm added this to the v1.0 milestone Aug 14, 2024

hyunhwan-bcm assigned hyunhwan-bcm and jylee-bcm Aug 14, 2024

hyunhwan-bcm added the enhancement label Aug 15, 2024

jylee-bcm mentioned this issue Aug 19, 2024

Optimize HGMD information search in Feature Engineering Part 1 #61

Merged

hyunhwan-bcm closed this as completed Aug 26, 2024
