Repository containing the code used for evaluating a Graph Neural Network (GNN) model for family pedigree data.
RESEARCH QUESTION:
Can we “impute” a phenotype for a target individual without using any information about that individual, leveraging only the information available for the nodes in their familial pedigree?
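To make the question concrete, here is a toy sketch of the idea: the target's own features are never read, and a representation is built only from its relatives. This is an illustration, not the repository's model. The IDs, feature values, and the helper `aggregate_from_parents` are all hypothetical, and a plain mean over the parents stands in for whatever aggregation the actual GNN uses.

```python
# Toy pedigree: each target maps to its parents (hypothetical IDs).
parents = {"child": ["mother", "father"]}

# Hypothetical features per individual: [scaled_age, has_phenotype].
features = {
    "mother": [0.5, 1.0],
    "father": [1.0, 0.0],
    "child": [0.25, 1.0],  # present in the data, but never shown to the model
}


def aggregate_from_parents(target, parents, feats):
    """Average the parents' feature vectors (one neighbourhood-aggregation step).

    The target's own entry in `feats` is deliberately never accessed.
    """
    vectors = [feats[p] for p in parents[target]]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


embedding = aggregate_from_parents("child", parents, features)

# Changing the target's own features must not change the result.
features_changed = dict(features, child=[0.0, 0.0])
embedding_changed = aggregate_from_parents("child", parents, features_changed)
print(embedding, embedding == embedding_changed)
```

In a real GNN the averaged vector would be passed through learned layers and a prediction head; the point here is only that the target's own record plays no part in the input.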
|--- /
    pipeline_data.sh
    pipeline_model.sh
    run_explainability.sh
    run_tuning.sh
|--- data/
    statfile.csv
    maskfile.csv
    edgefile_onlyparents.csv
    featfile_chd.csv
|--- extended_data/
    statfile_Drug.csv
    statfile_EndPt.csv
    statfile_SES.csv
    statfile_all.csv
    featfile_chd_Drug.csv
    featfile_chd_EndPt.csv
    featfile_chd_SES.csv
    featfile_chd_all.csv
|--- scripts/
    extract_study_population.py
    extract_edge_onlyparents.py
    add_extra_features.py
|--- src/
    main.py
    data.py
    model.py
    utils.py
    explainability.py
    my_explainability.py
|--- logs/
|--- output/
data/
the folder contains all input files used for the GNN models:
- statfile : contains basic information available for every patient
- maskfile : specifies whether a patient is included in the project, whether they are a target patient, and whether they are used to train/validate or to test the model
- edgefile : specifies all the graph edges, i.e. all the connections between patients
- featfile : must be generated manually; specifies the features to be used for training the model
plus the scripts (in scripts/) used to create them:
- extract_study_population.py : creates statfile.csv and maskfile.csv
- extract_edge_onlyparents.py : creates edgefile_onlyparents.csv
- add_extra_features.py : extends the main statfile with extra registry information (see extended_data folder)
NB:
extended_data/ can be substituted with another folder containing a different extension of the statfiles, e.g. one using a subsample of all the available covariates
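As a rough illustration of how the maskfile and edgefile described above might fit together, the sketch below reads two tiny inline CSVs, keeps only the patients included in the study, and builds an edge list over contiguous node indices. The column names (`patient_id`, `in_study`, `is_target`, `split`, `parent_id`, `child_id`), the split labels, and the helper names are assumptions made for this example; the real files may be laid out differently.

```python
import csv
import io

# Hypothetical file contents; the real column names and values may differ.
MASKFILE = """patient_id,in_study,is_target,split
A,1,0,train
B,1,0,train
C,1,1,train
D,1,1,test
E,0,0,train
"""

EDGEFILE = """parent_id,child_id
A,C
B,C
A,D
E,D
"""


def load_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def build_graph(mask_rows, edge_rows):
    """Return (node_index, edge_index, target_nodes) for patients in the study."""
    kept = [row for row in mask_rows if row["in_study"] == "1"]
    # Map each patient ID to a contiguous integer node index.
    node_index = {row["patient_id"]: i for i, row in enumerate(kept)}

    # Keep an edge only if both endpoints are in the study.
    sources, destinations = [], []
    for edge in edge_rows:
        parent, child = edge["parent_id"], edge["child_id"]
        if parent in node_index and child in node_index:
            sources.append(node_index[parent])
            destinations.append(node_index[child])
    edge_index = [sources, destinations]  # 2 x num_edges layout

    # Group target patients by the split they are assigned to.
    target_nodes = {}
    for row in kept:
        if row["is_target"] == "1":
            target_nodes.setdefault(row["split"], []).append(node_index[row["patient_id"]])
    return node_index, edge_index, target_nodes


node_index, edge_index, target_nodes = build_graph(load_rows(MASKFILE), load_rows(EDGEFILE))
print(node_index)
print(edge_index)
print(target_nodes)
```

The two-row `edge_index` layout mirrors the convention used by pytorch_geometric, so a list like this could be converted to a tensor when constructing the graph objects.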
src/
the folder contains all the scripts used for the GNN:
- utils.py : utility functions
- model.py : GNN model architecture
- data.py : constructs the pytorch_geometric objects
- main.py : trains and tests the model
plus the shell pipelines used:
- pipeline_data.sh : used for extracting the study population and creating the GNN input files
- pipeline_model.sh : used for training and testing the desired models
- run_explainability.sh : used for extracting the GNNExplainer results for the desired model
- run_tuning.sh : used for hyperparameter tuning (using the Optuna package)
Project inspired by Sophie Wharrie's paper on a similar analysis in FinRegistry.
PREPRINT:
https://arxiv.org/abs/2304.05010
CODE AUTHOR
- Matteo Ferro matteo.ferro@helsinki.fi
COLLABORATORS
- Zhiyu Yang zhiyu.yang@helsinki.fi
- Sophie Wharrie sophie.wharrie@aalto.fi