-
Starting a new discussion (following up on this one).

GOAL: Have a final annotation table with chr, gene ID, Uniprot Accession, gene name/abbreviation, gene description, and GO terms.

WHAT I'VE TRIED
Initially, I thought I could get Uniprot Accession information by using
I then tried using the Uniprot ID mapping script from @kubu4 (see this discussion). We realized this wasn't working because the Uniprot database does not include the information from the new genome annotation! I will work from the protein annotation FastA for ID mapping and will report back here.

MY NEW PROBLEM
Assuming the ID mapping from the protein FastA works, how do I map the protein IDs to gene IDs? Looking at the annotation README, I don't see any RefSeq files that may help with this. Any suggestions?
Replies: 4 comments 8 replies
-
Looks like RefSeq accessions can be reliably mapped to NCBI gene IDs: e.g.

This info is in the GFF. Using GFFutils, you'll need to operate on e.g.

    gtf_extract \
    --feature mRNA \
    --fields=chrom,start,end,ID \
    --gff ${data_dir}/${orig_gff} \
    | sed 's/rna-//'

Then, you can proceed with isolating column 4 to generate your list. Then, map using the Perl script, updated to use the RefSeq protein accession instead of the NCBI gene ID.

EDITED: Removed the pound symbol in front of the column number to remove the autolink to a GitHub Issue.
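
For the protein ID to gene ID question specifically, here is a minimal Python sketch of one way to pull that pairing straight out of the GFF. It assumes the file follows the usual NCBI RefSeq GFF3 layout, where each CDS row carries a `protein_id` attribute and a `Dbxref` entry of the form `GeneID:<number>`. The function name `protein_to_gene` and the accessions in the example are made up for illustration, so check a few CDS rows of the real GFF before relying on it.

```python
import re


def protein_to_gene(gff_lines):
    """Map protein accessions to NCBI GeneIDs using the CDS rows of a GFF3.

    Assumes NCBI-style attributes (protein_id=..., Dbxref=GeneID:...).
    """
    mapping = {}
    for line in gff_lines:
        if line.startswith("#"):
            continue  # skip header and comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue  # only CDS rows carry protein_id
        # Column 9 is a semicolon-separated list of key=value attributes.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        protein_id = attrs.get("protein_id")
        gene_match = re.search(r"GeneID:(\d+)", attrs.get("Dbxref", ""))
        if protein_id and gene_match:
            mapping[protein_id] = gene_match.group(1)
    return mapping


# Hypothetical rows for illustration only; not taken from the real annotation.
example = [
    "##gff-version 3",
    "NC_000001.1\tGnomon\tmRNA\t100\t300\t.\t+\t.\t"
    "ID=rna-XM_000001.1;Dbxref=GeneID:111;gene=LOC111",
    "NC_000001.1\tGnomon\tCDS\t100\t200\t.\t+\t0\t"
    "ID=cds-XP_000001.1;Parent=rna-XM_000001.1;"
    "Dbxref=GeneID:111,Genbank:XP_000001.1;gene=LOC111;protein_id=XP_000001.1",
]
print(protein_to_gene(example))
```

On a real file you would pass an open file handle instead of the `example` list, then write the resulting pairs out as a two-column table to join against the rest of the annotation.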
-
Extracted protein IDs from the protein annotation FastA, got 0 results:
Will try working from the GFF.
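
In case it helps with double-checking the ID list itself, below is a rough Python sketch of pulling accessions out of FastA headers. It assumes the accession is the first whitespace-delimited token after `>`, and it optionally strips a trailing `.N` version suffix, since some mapping tools expect unversioned accessions. This is only a sanity check on the input format, not a diagnosis of the 0 results; the function name `fasta_accessions` and the example records are hypothetical.

```python
def fasta_accessions(fasta_lines, keep_version=True):
    """Return the accession (first token) from each FastA header line."""
    accessions = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence lines are ignored
        header_parts = line[1:].split()
        if not header_parts:
            continue  # bare ">" with no identifier
        accession = header_parts[0]
        if not keep_version:
            # Drop a trailing version suffix such as ".1" if one is present.
            accession = accession.rsplit(".", 1)[0]
        accessions.append(accession)
    return accessions


# Hypothetical records for illustration only.
example_fasta = [
    ">XP_000001.1 hypothetical protein LOC111 [Some species]",
    "MKTAYIAKQR",
    ">XP_000002.1 hypothetical protein LOC112 [Some species]",
    "MSDNELLKQA",
]
print(fasta_accessions(example_fasta))
print(fasta_accessions(example_fasta, keep_version=False))
```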
-
FWIW, |
-
Uniprot GUI mapping yielded data for only ~1000 entries, and only UniParc mappings: https://d.pr/i/jXB0Y9

Decided to just use blastx to annotate the genome in order to get GO terms for enrichment.
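
For anyone following the same route, here is a small Python sketch of the step that comes after blastx: reducing the hits to one Uniprot accession per query, which can then be used to look up GO terms. It assumes the search was run with the standard 12-column tabular output (`-outfmt 6`) against a Swiss-Prot-style database whose subject IDs look like `sp|ACCESSION|ENTRY_NAME`. The function name `best_hits` and the example rows are made up for illustration.

```python
def best_hits(blast_lines):
    """Keep the highest-bitscore subject accession for each query.

    Expects BLAST tabular rows: qseqid, sseqid, pident, length, mismatch,
    gapopen, qstart, qend, sstart, send, evalue, bitscore.
    """
    best = {}
    for line in blast_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 12:
            continue  # skip blank or malformed rows
        query, subject = cols[0], cols[1]
        bitscore = float(cols[11])
        # "sp|P12345|NAME_SPECIES" -> "P12345"; otherwise keep the ID as is.
        parts = subject.split("|")
        accession = parts[1] if len(parts) >= 3 else subject
        if query not in best or bitscore > best[query][1]:
            best[query] = (accession, bitscore)
    return {query: accession for query, (accession, _) in best.items()}


# Hypothetical rows for illustration only.
example_blast = [
    "XM_000001.1\tsp|P11111|AAA_HUMAN\t60.0\t100\t40\t0\t1\t300\t1\t100\t1e-30\t120.0",
    "XM_000001.1\tsp|P22222|BBB_MOUSE\t80.0\t100\t20\t0\t1\t300\t1\t100\t1e-50\t200.0",
    "XM_000002.1\tsp|P33333|CCC_DANRE\t55.0\t90\t40\t1\t1\t270\t5\t94\t1e-20\t95.5",
]
print(best_hits(example_blast))
```

Picking by bitscore is just one reasonable choice; if blastx was run with a cap of one target sequence per query, this filtering step is unnecessary.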