MFs Dataset Creation Journal
George Iniatis edited this page Mar 5, 2023 · 10 revisions
- Downloaded the human proteins from AlphaFold (UP000005640) - 23391 Proteins
- Kept only the F1 (fragment 1) entries from AlphaFold. Left with 19966 Proteins
- Retrieved all the protein accession numbers and sequences
- Extracted molecular function keywords for each protein using UniProt API calls. Left with 11222 Proteins
- Extracted protein sequence descriptors using the protr R library
- Extracted protein sequence embeddings from UniProt for each protein
- Extracted amino acid descriptors from protein sequences using the Peptides R library (converted to NumPy arrays)
- Extracted amino acid embeddings from UniProt (converted to NumPy arrays)
- Extracted protein PSSMs (position-specific scoring matrices) using the protr R library (converted to NumPy arrays)
- Extracted contact maps from protein 3D structures using this nanoHUB tool (Converted to numpy arrays)
- One-hot encoded all the molecular function keywords
- Expanded the UniProt protein sequence embeddings so that each embedding entry has its own column
- Used PCA to reduce the tripeptide descriptors from 8000 to 4813 dimensions
- Removed 20 entries with missing protein sequence descriptors. Left with 11202 Proteins
- Decided to simplify the problem from multi-label classification to binary classification of a single molecular function. The function to predict is "DNA Binding", as it was the most prevalent
- Used RFECV to reduce the protein sequence descriptors from 7757 to 144
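
The contact map step can be illustrated with a minimal sketch. The journal does not say how the nanoHUB tool defines a contact, so the C-alpha distance criterion, the 8 Å threshold, and the `contact_map` helper below are all assumptions made for illustration.

```python
import numpy as np


def contact_map(ca_coords, threshold=8.0):
    """Binary contact map from C-alpha coordinates (hypothetical helper).

    Two residues count as "in contact" when their C-alpha atoms are
    closer than `threshold` angstroms. The 8 A cutoff is an assumption.
    """
    coords = np.asarray(ca_coords, dtype=float)
    # Pairwise difference vectors, shape (n, n, 3), via broadcasting.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    return (dist < threshold).astype(np.uint8)


# Toy input: four residues on a straight line, 3.8 A apart.
toy_coords = np.array(
    [[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [7.6, 0.0, 0.0], [11.4, 0.0, 0.0]]
)
cmap = contact_map(toy_coords)
```

In this toy case the first and last residues are 11.4 Å apart, so they are the only pair not marked as a contact.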
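
The one-hot encoding of molecular function keywords, and the later reduction to a single "DNA Binding" target, might look roughly like the sketch below. It uses scikit-learn's `MultiLabelBinarizer`; the protein names and keyword lists are made-up toy data, and the journal does not state which tool was actually used for the encoding.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data: protein identifier -> molecular function keywords.
keywords_by_protein = {
    "PROT_A": ["DNA Binding", "Activator"],
    "PROT_B": ["Kinase", "Transferase"],
    "PROT_C": ["DNA Binding"],
    "PROT_D": ["Hydrolase"],
}

proteins = list(keywords_by_protein)

# Multi-label one-hot matrix: one row per protein, one column per keyword.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform([keywords_by_protein[p] for p in proteins])

# Pick the most prevalent keyword and keep only its column as a 0/1 target.
counts = Y.sum(axis=0)
most_common = mlb.classes_[counts.argmax()]
y_binary = Y[:, counts.argmax()]
```

Here `Y` is the multi-label target and `y_binary` is the simplified target for the single most frequent function.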
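
As a rough illustration of the PCA step, the sketch below reduces a synthetic descriptor matrix by keeping enough components to explain a fixed share of the variance. The journal does not say how the target of 4813 components was chosen, so the 99% variance threshold and the toy matrix sizes are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Toy stand-in for the descriptor matrix: 200 "proteins" x 50 descriptors,
# generated from 10 latent factors plus a little noise.
latent = rng.normal(size=(200, 10))
mixing = rng.normal(size=(10, 50))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 50))

# A float n_components keeps the smallest number of components whose
# cumulative explained variance reaches that fraction (assumed: 0.99).
pca = PCA(n_components=0.99, svd_solver="full")
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
```

Passing an integer instead (for example `n_components=4813` on the real 8000-column matrix) would fix the output dimensionality directly.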
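
The RFECV step (recursive feature elimination with cross-validation) can be sketched with scikit-learn as follows. The estimator, scoring metric, fold count, and synthetic data are assumptions; the journal only records that RFECV reduced the descriptors from 7757 to 144.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic binary problem standing in for "DNA Binding" vs. not.
X, y = make_classification(
    n_samples=300,
    n_features=30,
    n_informative=5,
    n_redundant=0,
    random_state=0,
)

# Assumed setup: logistic regression ranked by coefficient magnitude,
# dropping one feature per round, scored by ROC AUC over 5 stratified folds.
selector = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=1,
    cv=StratifiedKFold(n_splits=5),
    scoring="roc_auc",
    min_features_to_select=1,
)
selector.fit(X, y)

# Keep only the columns RFECV selected.
X_selected = selector.transform(X)
print("features kept:", selector.n_features_)
```

On a matrix with thousands of columns, a larger `step` would cut the number of refits considerably at the cost of a coarser search.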