MFs Dataset Creation Journal

Process

Downloaded the human proteins from AlphaFold (UP000005640) - 23391 Proteins
Kept only the F1 (fragment 1) structure files from AlphaFold. Left with 19966 Proteins
Retrieved all the protein accession numbers and sequences
Extracted molecular function keywords for each protein using UniProt API calls. Left with 11222 Proteins
Extracted protein sequence descriptors using Protr R library
Extracted protein sequence embeddings from UniProt for each protein
Extracted amino acid descriptors from protein sequences using Peptides R library (Converted to numpy arrays)
Extracted amino acid embeddings from UniProt (Converted to numpy arrays)
Extracted protein PSSM using Protr R library (Converted to numpy arrays)
Extracted contact maps from protein 3D structures using a nanoHUB tool (Converted to numpy arrays)
One-hot encoded the molecular function keywords
Split the UniProt protein sequence embeddings so that each element of the embedding vector gets its own column
Used PCA to reduce Tripeptide Descriptors from 8000 to 4813
Removed 20 entries for missing protein sequence descriptors. Left with 11202 Proteins
Decided to simplify the problem from multi-label classification to binary classification of a single molecular function. The molecular function we would be trying to predict would be "DNA Binding" as it was the most prevalent
Used RFECV to reduce the protein sequence descriptors from 7757 to 144
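
Sketch for the F1 filtering step. This is not the script that was actually run; it assumes the downloaded files follow the AlphaFold naming pattern AF-<accession>-F<n>-model_v<k> with a .pdb or .cif extension (optionally gzipped). The function name and the example file names are illustrative only.

```python
import re

# Assumed AlphaFold file name pattern: AF-<accession>-F<fragment>-model_v<version>.<ext>
AF_NAME = re.compile(
    r"^AF-(?P<accession>[A-Z0-9]+)-F(?P<fragment>\d+)-model_v\d+\.(?:pdb|cif)(?:\.gz)?$"
)


def keep_first_fragments(filenames):
    """Return {accession: filename} for files whose fragment index is 1."""
    kept = {}
    for name in filenames:
        match = AF_NAME.match(name)
        if match and match.group("fragment") == "1":
            kept[match.group("accession")] = name
    return kept


# Illustrative file names only, not the real download listing.
example_files = [
    "AF-P04637-F1-model_v4.pdb.gz",
    "AF-Q8WZ42-F1-model_v4.pdb.gz",
    "AF-Q8WZ42-F2-model_v4.pdb.gz",
    "README.txt",
]
first_fragments = keep_first_fragments(example_files)
```

The keys of the returned dict double as the accession list needed for the later UniProt lookups.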
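
Sketch for the molecular function keyword step. Assumptions to verify against the UniProt documentation: entries are fetched as JSON from https://rest.uniprot.org/uniprotkb/<accession>.json, and each item in the entry's "keywords" list carries "category" and "name" fields. The helper names are hypothetical and the sample entries are hand-written stand-ins, not real UniProt responses.

```python
import json
import urllib.request

# Assumed REST endpoint; check against the current UniProt API documentation.
UNIPROT_ENTRY_URL = "https://rest.uniprot.org/uniprotkb/{accession}.json"


def fetch_entry(accession):
    """Download one UniProtKB entry as a dict (network call, not run below)."""
    with urllib.request.urlopen(UNIPROT_ENTRY_URL.format(accession=accession)) as resp:
        return json.load(resp)


def molecular_function_keywords(entry):
    """Return the names of keywords whose category is 'Molecular function'."""
    return [
        kw["name"]
        for kw in entry.get("keywords", [])
        if kw.get("category") == "Molecular function"
    ]


# Hand-written stand-ins for two entries (illustrative, not real data).
sample_entries = {
    "ACC_A": {"keywords": [
        {"category": "Molecular function", "name": "DNA-binding"},
        {"category": "Biological process", "name": "Transcription"},
    ]},
    "ACC_B": {"keywords": [
        {"category": "Cellular component", "name": "Nucleus"},
    ]},
}

# Keep only proteins that have at least one molecular function keyword.
functions = {acc: molecular_function_keywords(e) for acc, e in sample_entries.items()}
annotated = {acc: kws for acc, kws in functions.items() if kws}
```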
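
Sketch of what a contact map is, for reference. The nanoHUB tool's method and distance cutoff are not recorded in this journal, so this is a swapped-in, common definition rather than what the tool does: two residues are in contact when their C-alpha atoms are closer than a cutoff (8 angstroms assumed here). The coordinates are toy values.

```python
import numpy as np


def contact_map(ca_coords, cutoff=8.0):
    """Binary residue-residue contact map from C-alpha coordinates (N x 3)."""
    coords = np.asarray(ca_coords, dtype=float)
    # Pairwise differences via broadcasting: shape (N, N, 3)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return (dist < cutoff).astype(np.uint8)


# Toy coordinates: three residues spaced 3.8 angstroms apart, one far away.
toy_coords = [[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [7.6, 0.0, 0.0], [20.0, 0.0, 0.0]]
cmap = contact_map(toy_coords)
```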
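
Sketch for the one-hot encoding step and the later switch to a single "DNA Binding" target. The label strings and the helper name are toy placeholders; the real label vocabulary comes from the UniProt keywords.

```python
import numpy as np


def multi_hot(label_lists):
    """Encode a list of label lists as a 0/1 matrix with one column per label."""
    vocab = sorted({label for labels in label_lists for label in labels})
    column = {label: i for i, label in enumerate(vocab)}
    matrix = np.zeros((len(label_lists), len(vocab)), dtype=np.int8)
    for row, labels in enumerate(label_lists):
        for label in labels:
            matrix[row, column[label]] = 1
    return matrix, vocab


# Toy labels for three proteins (illustrative only).
toy_labels = [["DNA-binding", "Activator"], ["Kinase"], ["DNA-binding"]]
Y, vocab = multi_hot(toy_labels)

# The most prevalent function is the column with the largest sum.
most_prevalent = vocab[int(Y.sum(axis=0).argmax())]

# Binary target: 1 if the protein has that function, else 0.
y = Y[:, vocab.index(most_prevalent)]
```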
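
Sketch for the PCA step on the tripeptide descriptors. The journal does not record how the 4813 components were chosen; this example assumes a cumulative explained variance threshold (0.95 here) and standardized inputs, both of which are assumptions. The data is synthetic and much smaller than the real 8000-column matrix.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the descriptor matrix: 60 samples, 40 features,
# where the last 20 columns are noisy copies of the first 20.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 40))
X[:, 20:] = X[:, :20] + 0.05 * rng.normal(size=(60, 20))

# A float n_components keeps the smallest number of components whose
# cumulative explained variance exceeds that fraction (assumed criterion).
reducer = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_reduced = reducer.fit_transform(X)
pca = reducer.named_steps["pca"]
```

pca.n_components_ then reports how many columns survive, which is the figure to compare against the 8000 to 4813 reduction.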
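
Sketch for the RFECV step. The estimator, scoring metric, step size and fold count used for the real 7757 to 144 reduction are not recorded in this journal; logistic regression, accuracy, a step of 1 and 5 stratified folds are assumptions, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic binary problem standing in for "DNA Binding" vs. not.
X, y = make_classification(
    n_samples=200, n_features=30, n_informative=5, n_redundant=0, random_state=0
)

# Recursive feature elimination with cross-validation: drop the weakest
# feature each round and keep the feature count with the best CV score.
selector = RFECV(
    estimator=LogisticRegression(max_iter=1000),  # assumed estimator
    step=1,
    cv=StratifiedKFold(n_splits=5),
    scoring="accuracy",  # assumed metric
    min_features_to_select=1,
)
selector.fit(X, y)
X_selected = selector.transform(X)
```

selector.support_ is a boolean mask over the original columns, so it can be saved and reapplied to new data without refitting.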