
MFs Dataset Creation Journal

George Iniatis edited this page Mar 5, 2023 · 10 revisions

Process

  • Downloaded the human proteome from AlphaFold (UP000005640): 23391 proteins
  • Kept only the F1 entries from AlphaFold. Left with 19966 proteins
  • Retrieved all the protein accession numbers and sequences
  • Extracted molecular function keywords for each protein using UniProt API calls. Left with 11222 proteins
  • Extracted protein sequence descriptors using Protr R library
  • Extracted protein sequence embeddings from UniProt for each protein
  • Extracted amino acid descriptors from protein sequences using Peptides R library (converted to NumPy arrays)
  • Extracted amino acid embeddings from UniProt (converted to NumPy arrays)
  • Extracted protein PSSMs using Protr R library (converted to NumPy arrays)
  • Extracted contact maps from protein 3D structures using this nanoHUB tool (converted to NumPy arrays)
  • One-hot encoded all the molecular function keywords
  • Split the UniProt protein sequence embeddings so that each embedding entry is given its own column
  • Used PCA to reduce Tripeptide Descriptors from 8000 to 4813
  • Removed 20 entries with missing protein sequence descriptors. Left with 11202 proteins
  • Decided to simplify the problem from multi-label classification to predicting a single molecular function. The function we would be trying to predict would be "DNA Binding", as it was the most prevalent
  • Used RFECV to reduce the protein sequence descriptors from 7757 to 144
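
The encoding and target-selection steps above can be sketched roughly as follows. This is a minimal illustration, not the code used for the dataset: the accessions, keyword sets, and function names (`one_hot_encode`, `most_prevalent`, `single_label_target`) are hypothetical, and it assumes the keywords are available as a plain mapping from accession to a list of keyword strings.

```python
from collections import Counter

# Hypothetical toy input: accession -> molecular function keywords.
# Accessions and keyword sets are made up for illustration only.
protein_keywords = {
    "P00001": ["DNA Binding", "Activator"],
    "P00002": ["Kinase", "Transferase"],
    "P00003": ["DNA Binding"],
    "P00004": ["Hydrolase"],
    "P00005": ["DNA Binding", "Repressor"],
}


def one_hot_encode(keywords_by_protein):
    """Return (sorted vocabulary, accession -> list of 0/1 flags)."""
    vocabulary = sorted(
        {kw for kws in keywords_by_protein.values() for kw in kws}
    )
    rows = {}
    for accession, kws in keywords_by_protein.items():
        present = set(kws)
        # One column per keyword; 1 if the protein carries that keyword.
        rows[accession] = [1 if kw in present else 0 for kw in vocabulary]
    return vocabulary, rows


def most_prevalent(keywords_by_protein):
    """Return the keyword annotated on the largest number of proteins."""
    counts = Counter(
        kw for kws in keywords_by_protein.values() for kw in set(kws)
    )
    return counts.most_common(1)[0][0]


def single_label_target(vocabulary, rows, keyword):
    """Keep only the column for `keyword` as a 0/1 target per protein."""
    column = vocabulary.index(keyword)
    return {accession: row[column] for accession, row in rows.items()}


vocabulary, rows = one_hot_encode(protein_keywords)
target_keyword = most_prevalent(protein_keywords)
target = single_label_target(vocabulary, rows, target_keyword)

print(vocabulary)
print(target_keyword)
print(target)
```

Keeping the full one-hot matrix and slicing out one column afterwards means the multi-label version of the dataset stays available if the simplified problem is later extended.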