Using AlphaFold and GNN to Predict Catalytic Efficiency of Enzymes
- Create a conda environment with Python>3.7
conda create --name afgnn python=3.8
conda activate afgnn
- Install required packages
pip install -r requirements.txt
Use PCA to reduce the dimensionality of the logits in the output file XXXXX.pkl of the modified AlphaFold.
XXXXX.pkl is a Python dict object holding the AlphaFold output for a protein sequence of length e_dim. Its dict_keys are:
- distogram
  - bin_edges
  - logits
- experimentally_resolved
- masked_msa
- predicted_lddt
- representations
  - msa
  - msa_first_row
  - pair
  - single
  - structure_module
- structure_module
- plddt
We only need the logits and single for the enzyme features.
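As a minimal sketch of pulling out these two entries: the function below assumes that logits is stored under distogram and single under representations. The name load_enzyme_features is hypothetical and not part of this repository.

```python
import pickle


def load_enzyme_features(pkl_path):
    """Return (logits, single) from one AlphaFold output pickle.

    Assumption: logits sits under "distogram" and single under
    "representations"; adjust the keys if your output is laid out differently.
    """
    with open(pkl_path, "rb") as f:
        result = pickle.load(f)
    logits = result["distogram"]["logits"]
    single = result["representations"]["single"]
    return logits, single
```

A call would look like `logits, single = load_enzyme_features("XXXXX.pkl")`, with the file name replaced by an actual output file.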
- Modify ./pca/pca.py to set AF2_OUTPUT_DIR to the path where the actual AlphaFold outputs are saved.
- You may need to change the padding size in ./pca/pca.py if the maximum e_dim is greater than 1600.
- Conduct PCA on the pkl files in AF2_OUTPUT_DIR
cd pca
nohup python3 pca.py >> log.txt
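The idea behind this step (pad every protein to a common length, then apply PCA) can be sketched as follows. This is only an illustration and not the contents of pca.py: it assumes the logits array has shape (e_dim, e_dim, n_bins), computes PCA with a plain NumPy SVD, and uses hypothetical names (pad_logits, pca_reduce, MAX_LEN) and toy sizes.

```python
import numpy as np

MAX_LEN = 1600  # padding size mentioned above; raise it for longer sequences


def pad_logits(logits, max_len=MAX_LEN):
    """Zero-pad the two residue axes of an (e_dim, e_dim, n_bins) array."""
    e_dim = logits.shape[0]
    if e_dim > max_len:
        raise ValueError(f"e_dim={e_dim} exceeds padding size {max_len}")
    extra = max_len - e_dim
    return np.pad(logits, ((0, extra), (0, extra), (0, 0)))


def pca_reduce(X, n_components):
    """Project the rows of X (n_samples, n_features) onto the top principal axes."""
    Xc = X - X.mean(axis=0, keepdims=True)  # centre each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt are principal axes
    return Xc @ Vt[:n_components].T


# Toy example: three "proteins" of different lengths, padded to a common size.
rng = np.random.default_rng(0)
toy = [rng.normal(size=(n, n, 4)) for n in (5, 7, 6)]
X = np.stack([pad_logits(a, max_len=8).reshape(-1) for a in toy])
reduced = pca_reduce(X, n_components=2)
print(reduced.shape)  # (3, 2)
```

Raising an error when e_dim exceeds the padding size makes the need to enlarge it explicit instead of silently truncating a sequence.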
Construct the reaction graph and the node features from the raw data.
- Modify the model_name in the second code block of ./datasets/preprocess.ipynb to match the target dataset (i.e. iMM904, iYO844 or iML1515).
- Run through the code blocks in ./datasets/preprocess.ipynb.
- [Optional] Modify the train_test_split ratio in the 19th code block to get a different test set size.
The output graph and features will be saved to ./dataset/<dataset_name>/.
- The node information, with the enzymes' single representations and the molecules' features, is saved to node.pkl as a pandas DataFrame.
- The link information is saved to link.dat.
- The training labels are saved to label.dat.
- The test labels are saved to label.dat.test.
- The logits features of the enzyme nodes are saved to logits/X.npy, where X denotes the node index.
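To inspect these outputs, a minimal sketch along the following lines may help. It only covers node.pkl and the per-node logits files, since the formats of link.dat and the label files are not described here. The helper names load_node_table and load_node_logits are hypothetical.

```python
from pathlib import Path

import numpy as np
import pandas as pd


def load_node_table(dataset_dir):
    """Read node.pkl, which is described above as a pandas DataFrame."""
    return pd.read_pickle(Path(dataset_dir) / "node.pkl")


def load_node_logits(dataset_dir, node_index):
    """Read the logits feature of one enzyme node from logits/<node_index>.npy."""
    return np.load(Path(dataset_dir) / "logits" / f"{node_index}.npy")
```

For instance, `load_node_table("./dataset/iYO844")` would return the node table for the iYO844 dataset, assuming the preprocessing notebook has already been run for it.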
- Train the modified GCN
cd gnn
nohup python3 train.py --dataset iYO844 --num-layers 2 >> logs/iYO844.txt
Code derived and reshaped from HGB.