Mapping protein FastA to gene information #1228

yaaminiv · 2021-06-02T17:58:21Z

yaaminiv
Jun 2, 2021
Collaborator

Starting a new discussion (follow ups from this one about blastx and this one about Uniprot ID mapping).

GOAL: Have a final annotation table with chr, gene ID, Uniprot Accession, gene name/abbreviation, gene description, and GOterms

WHAT I'VE TRIED

Initially, I thought I could get Uniprot Accession information by using DIAMOND blastx to annotate the genome. This did not work for a variety of reasons (see this discussion). I also tried running DIAMOND blastx without the index or block size information, and got some recommendations (-b12 -c1). I'm running it again with those recommendations now.

I then tried using the Uniprot ID mapping script from @kubu4 (see this discussion). We realized this wasn't working because the Uniprot database does not include the information from the new genome annotation! I will work from the protein annotation FastA for ID mapping and will report back here.

MY NEW PROBLEM

Assuming the ID mapping from the protein FastA works, how do I map the protein IDs to gene IDs? Looking at the annotation README, I don't see any RefSeq files that may help with this. Any suggestions?

Answered by yaaminiv

Jun 6, 2021


kubu4 · 2021-06-02T18:14:36Z

kubu4
Jun 2, 2021
Maintainer

Looks like RefSeq accession can be reliably mapped to NCBI gene IDs:

e.g. NM_001305288 only maps to 105326593

This info is in the GFF. Using GFFutils, you'll need to operate on mRNA features instead of gene features. Then, you'll have to strip the leading rna- prefix from the ID values in the results.

e.g.

# Extract mRNA features, then strip the leading "rna-" prefix from each ID
gtf_extract \
--feature mRNA \
--fields=chrom,start,end,ID \
--gff "${data_dir}/${orig_gff}" \
| sed 's/rna-//'

Then, you can proceed with isolating column 4 to generate your list. Then, map using the updated Perl script to use the RefSeq protein accession instead of the NCBI gene ID.
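
For reference, the accession-to-gene-ID pairing described above can also be pulled straight from the GFF with awk. This is only a sketch and not the GFFutils approach from this comment: it assumes an NCBI-style GFF3 in which each mRNA row carries ID=rna-<accession> and a Dbxref=GeneID:<id> entry in column 9. The function name mrna_to_geneid is hypothetical.

```shell
# Sketch: print "accession<TAB>GeneID" for every mRNA feature in a GFF3 file.
# Assumption: column 9 looks like
#   ID=rna-NM_001305288.1;Parent=...;Dbxref=GeneID:105326593,Genbank:NM_001305288.1;...
mrna_to_geneid() {
  awk -F'\t' '
    $0 !~ /^#/ && $3 == "mRNA" {
      id = ""; gene = ""
      n = split($9, attrs, ";")
      for (i = 1; i <= n; i++) {
        if (attrs[i] ~ /^ID=/) {
          id = attrs[i]
          sub(/^ID=(rna-)?/, "", id)        # drop "ID=" and the optional "rna-" prefix
        } else if (attrs[i] ~ /^Dbxref=/) {
          m = split(substr(attrs[i], 8), refs, ",")
          for (j = 1; j <= m; j++)
            if (refs[j] ~ /^GeneID:/) gene = substr(refs[j], 8)
        }
      }
      if (id != "" && gene != "") print id "\t" gene
    }
  ' "$1"
}
```

Calling mrna_to_geneid on the GFF and keeping column 1 of its output (for example with cut -f1) would give the accession list for the Perl script, while the full two-column output keeps the link back to the NCBI gene ID.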

EDITED: Remove pound symbol in front of column number to remove autolink to GitHub Issue.

5 replies

yaaminiv Jun 2, 2021
Collaborator Author

To clarify, these are the three changes I would make to your original salmon genome script:

use the genome.gff
modify gtf_extract to run on mRNA
map with the updated Perl script

kubu4 Jun 2, 2021
Maintainer

Yes. You might need to have the Perl script use REFSEQ_NT_ID instead of P_REFSEQ_AC to work on nucleotides instead of proteins.

yaaminiv Jun 2, 2021
Collaborator Author

Okay! I have some new errors.

Here is my gene list: https://github.com/RobertsLab/project-oyster-oa/blob/master/analyses/Haws_08-GOterm-annotation/20210602_cgigas_roslin_gene-list.txt

I ran my perl script with REFSEQ_NT_ID, no output. I got the following error for each line of my gene list:

Semicolon seems to be missing at 20210602_cgigas_roslin_gene-list.txt line 1

Not sure what semicolon I need to have for input!

deleted previous comment because it was on the wrong thread

kubu4 Jun 2, 2021
Maintainer

> I ran my perl script with REFSEQ_NT_ID

Have you tried with P_REFSEQ_AC?

Per the semicolon error message, maybe it doesn't like accessions ending with .1?

Add this to the end of your gtf_extract command to remove that:

| sed 's/\.1$//'
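
A slightly more defensive way to drop the version suffix is sketched below. It escapes the dot, since an unescaped "." in sed matches any character (so an unversioned accession that happens to end in 1 would lose its last two characters), and it accepts any version number rather than only .1. The function name strip_version is hypothetical.

```shell
# Sketch: remove a trailing RefSeq version suffix (".1", ".2", ".12", ...)
# from each input line. Lines without a version suffix pass through unchanged.
strip_version() {
  sed 's/\.[0-9][0-9]*$//'
}
```

It reads from standard input, so it can sit at the end of a pipeline in the same position as the sed command above.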

yaaminiv Jun 2, 2021
Collaborator Author

> Have you tried with P_REFSEQ_AC

Tried and got the same error!

> Per the semicolon error message, maybe it doesn't like accessions ending with .1?

Trying without that now.

yaaminiv · 2021-06-02T18:26:02Z

yaaminiv
Jun 2, 2021
Collaborator Author

> I will work from the protein annotation FastA for ID mapping and will report back here.

Extracted protein IDs from the protein annotation FastA, got 0 results:

Failed, got 500 Server closed connection without sending any data back for https://www.uniprot.org/uploadlists/

Will try working from the GFF

1 reply

kubu4 Jun 2, 2021
Maintainer

You might need to have the Perl script use REFSEQ_NT_ID instead of P_REFSEQ_AC to work on nucleotides instead of proteins.

sr320 · 2021-06-02T18:27:02Z

sr320
Jun 2, 2021
Maintainer

FWIW,
I would grab RefSeq IDs from the FASTA.
Upload to the web interface, download the associated info.
It might be the case that your RefSeq IDs do not come back with annotation (e.g., GO terms).
In that case I would then take the Salmon FASTA and blastp (not DIAMOND) against a FASTA of the corresponding mapped UniProtKB seqs.
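
As a rough sketch of the first step, the IDs can be taken from the FASTA header lines. This assumes the usual layout where the ID is the first whitespace-delimited token after ">"; the function name fasta_ids is hypothetical.

```shell
# Sketch: list the sequence IDs in a FASTA file, one per line.
# Assumption: headers look like ">XP_011414590.2 description text ...".
fasta_ids() {
  grep '^>' "$1" | sed 's/^>//; s/[[:space:]].*$//'
}
```

Redirecting the output to a text file would give a list that can be uploaded to the ID mapping web interface.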

2 replies

yaaminiv Jun 2, 2021
Collaborator Author

@sr320 To clarify:

grab RefSeq IDs from protein Fasta and use web interface
blastp Roslin genome fasta to something else...?

sr320 Jun 2, 2021
Maintainer

> grab RefSeq IDs from protein Fasta and use web interface

Yes.

> blastp Roslin genome fasta to something else...?

That would be the protein FASTA, and only if:

> It might be the case that your RefSeq ID does not come back down with annotation (eg GO etc)

Only if this is the case. Then you would need to know which RefSeq IDs correspond with annotations, and I would use blastp with the UniProtKB sequences found in the mapping as the db.

happy to walk through in zoom if helpful
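
One way to work out which RefSeq IDs came back without annotation is a set difference between the submitted ID list and the IDs present in the downloaded mapping results. The sketch below assumes both are plain text files with one ID per line; the function name ids_without_annotation is hypothetical.

```shell
# Sketch: print IDs that are in the submitted list ($1) but absent from the
# list of IDs that returned annotation ($2). Both files: one ID per line.
ids_without_annotation() {
  all_sorted=$(mktemp)
  hit_sorted=$(mktemp)
  sort -u "$1" > "$all_sorted"
  sort -u "$2" > "$hit_sorted"
  comm -23 "$all_sorted" "$hit_sorted"   # keep lines unique to the first file
  rm -f "$all_sorted" "$hit_sorted"
}
```

The resulting list could then be used to subset the protein FASTA before running blastp against the mapped UniProtKB sequences.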

yaaminiv · 2021-06-06T01:25:13Z

yaaminiv
Jun 6, 2021
Collaborator Author

Uniprot GUI mapping yielded data for ~1000 entries, only UniParc mapping https://d.pr/i/jXB0Y9

Decided to just use blastx to annotate the genome in order to get GOterms for enrichment

0 replies
