dsgelab/gnn_family_pedigree

GNN ON FAMILY PEDIGREE

Repository containing the code used for evaluating a Graph Neural Network (GNN) model for family pedigree data.

RESEARCH QUESTION:
Can we “impute” a phenotype by knowing nothing about the target individual and only leveraging information for each node in the familial pedigree?

FILES

FILE STRUCTURE

|--- / 

    pipeline_data.sh 
    pipeline_model.sh 
    run_explainability.sh 
    run_tuning.sh

    |--- data/
        statfile.csv
        maskfile.csv
        edgefile_onlyparents.csv
        featfile_chd.csv
        |--- extended_data/
            statfile_Drug.csv
            statfile_EndPt.csv
            statfile_SES.csv
            statfile_all.csv
            featfile_chd_Drug.csv
            featfile_chd_EndPt.csv
            featfile_chd_SES.csv
            featfile_chd_all.csv
        |--- scripts/
            extract_study_population.py
            extract_edge_onlyparents.py
            add_extra_features.py

    |--- src/
        main.py
        data.py
        model.py
        utils.py
        explainability.py
        my_explainability.py

    |--- logs/
    |--- output/

FILE CONTENT

data/
the folder contains all input files used for the GNN models:

  • statfile    : contains basic information available for every patient
  • maskfile  : specifies whether each patient is included in the project, whether they are a target patient, and whether they are used to train, validate or test the model
  • edgefile   : specifies all the graph edges i.e. all the connections between patients
  • featfile    : needs to be manually generated, specifies the features to be used for training the model

plus the scripts used to create them:

  • extract_study_population.py   : create statfile.csv and maskfile.csv
  • extract_edge_onlyparents.py  : create edgefile_onlyparents.csv
  • add_extra_features.py     : extend the main statfile with extra registry information (see extended_data folder)

NB:
extended_data/ can be substituted with another folder containing a different extension of the statfiles, e.g. using a subsample of all the available covariates
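
To make the role of the maskfile more concrete, here is a minimal sketch of how target patients could be grouped into train/validation/test sets. The column names (`patient_id`, `in_study`, `is_target`, `split`) and their values are hypothetical, chosen only for illustration; the real maskfile.csv produced by extract_study_population.py may use a different layout.

```python
import csv
import io

# Hypothetical maskfile contents; the real column names and codes may differ.
MASK_CSV = """patient_id,in_study,is_target,split
p1,1,1,train
p2,1,0,train
p3,1,1,valid
p4,1,1,test
p5,0,0,train
"""


def split_targets(mask_text):
    """Group target patients by split, skipping patients outside the study."""
    splits = {"train": [], "valid": [], "test": []}
    for row in csv.DictReader(io.StringIO(mask_text)):
        # Only patients that are in the study AND flagged as targets are kept.
        if row["in_study"] == "1" and row["is_target"] == "1":
            splits[row["split"]].append(row["patient_id"])
    return splits


splits = split_targets(MASK_CSV)
print(splits)
```

In this toy example, non-target relatives such as `p2` stay in the graph as context but are never used as prediction targets.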

src/
the folder contains all the scripts used for the GNN:

  • utils.py   : utility functions
  • model.py   : GNN model architecture
  • data.py   : construct pytorch_geometric objects
  • main.py  : train and test the model

plus the shell pipelines used (located in the repository root):

  • pipeline_data.sh    : used for extracting the study population and creating the GNN input files
  • pipeline_model.sh     : used for training and testing the desired models
  • run_explainability.sh  : used for extracting the GNNExplainer results on the desired model
  • run_tuning.sh     : used for hyperparameter tuning (using the Optuna package)
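
As a rough illustration of the step that turns an edgefile into graph input for data.py, the sketch below maps patient IDs to contiguous node indices and builds a connectivity list of shape [2, num_edges], which is the layout PyTorch Geometric expects for `edge_index`. The function name `build_edge_index`, the example IDs, and the parent-to-child edge direction are assumptions made for this example; this is not the repository's actual implementation.

```python
def build_edge_index(edges):
    """Map patient IDs to node indices and return (node_index, edge_index).

    `edges` is an iterable of (source_id, target_id) pairs. The returned
    edge_index is a pair of lists [sources, targets] (COO layout).
    """
    node_index = {}
    sources, targets = [], []
    for src, dst in edges:
        # Assign the next free index the first time an ID is seen.
        for pid in (src, dst):
            node_index.setdefault(pid, len(node_index))
        sources.append(node_index[src])
        targets.append(node_index[dst])
    return node_index, [sources, targets]


# Hypothetical pedigree: two parents connected to one child.
edges = [("mother", "child"), ("father", "child")]
node_index, edge_index = build_edge_index(edges)
print(node_index)
print(edge_index)
```

With PyTorch installed, a list like `edge_index` can be converted with `torch.tensor(edge_index, dtype=torch.long)` before being passed to a `torch_geometric.data.Data` object.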

REFERENCES

Project inspired by Sophie Wharrie's paper on a similar analysis in FinRegistry
PREPRINT: https://arxiv.org/abs/2304.05010

PEOPLE

CODE AUTHOR

COLLABORATORS