Need for a file with MOD mappings #257

Open
nataled opened this issue Sep 10, 2021 · 10 comments
@nataled
Collaborator

nataled commented Sep 10, 2021

From PIR-PRO discussion: It is possible that the mapping between types of modifications and the color used to display them is incomplete. @Julie-Cowart will find the source of the mapping in the code. @nataled will attempt to create a mapping file, akin to how GO terms are mapped to slims.

@Julie-Cowart
Collaborator

There are three parts to how this works. The first is a mod.txt file that maps mod ids to a modification prefix (p for phosphorylation, ac for acetylation, etc.). The second maps these prefixes to CSS styles (colors). This currently supports

modTypeList = ['p', 'ac', 'g', 'm', 'ub']

as well as sequence variants ('v') and unmodified ('un'). Everything else shows as Other ('o'). Note that there is a bug in how unmodified is applied that should be fixed: it seems to show as Other even though the style for un is defined as white with a grey border.

The third part is the legend, which uses the following list (and the corresponding style to show each color block):

var msaModTypeDict = {"mod-p":"Phosphorylation","mod-ac":"Acetylation","mod-g":"Glycosylation","mod-m":"Methylation","mod-ub":"Ubiquitination","mod-v":"Sequence Variant","mod-o":"Other"};

(@nataled) can review mod.txt for completeness and we can consider adding more modification types as different colors (to the code, style sheet, and legend).
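
The three-step lookup described above can be sketched roughly as follows. This is a minimal Python illustration, not the site's actual code: `MOD_ID_TO_PREFIX` is a stand-in for mod.txt (whose real format is not shown in this thread), and the function names `css_class_for` and `legend_label_for` are hypothetical. The prefix list and legend labels are copied from the comment above.

```python
# Step 1: stand-in for mod.txt (mod id -> modification prefix).
# Only two illustrative entries, using ids mentioned in this thread;
# the real file format and contents are assumptions.
MOD_ID_TO_PREFIX = {
    "MOD:00696": "p",   # phosphorylated residue
    "MOD:02078": "ac",  # acetylated residue
}

# Step 2: prefixes that have their own CSS style (color).
MOD_TYPE_LIST = ["p", "ac", "g", "m", "ub"]
SPECIAL_TYPES = ["v", "un"]  # sequence variant, unmodified

# Step 3: legend labels keyed by CSS class.
MSA_MOD_TYPE_DICT = {
    "mod-p": "Phosphorylation",
    "mod-ac": "Acetylation",
    "mod-g": "Glycosylation",
    "mod-m": "Methylation",
    "mod-ub": "Ubiquitination",
    "mod-v": "Sequence Variant",
    "mod-o": "Other",
}


def css_class_for(mod_id):
    """Return the CSS class for a mod id, falling back to 'mod-o' (Other)."""
    prefix = MOD_ID_TO_PREFIX.get(mod_id, "o")
    if prefix not in MOD_TYPE_LIST and prefix not in SPECIAL_TYPES:
        prefix = "o"
    return "mod-" + prefix


def legend_label_for(mod_id):
    """Return the legend label shown next to the color block."""
    return MSA_MOD_TYPE_DICT.get(css_class_for(mod_id), "Other")
```

Any mod id missing from the first table ends up as Other, which is why the completeness of mod.txt matters for the coloring.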

@nataled
Collaborator Author

nataled commented Sep 17, 2021

Here is my progress so far. First, the slims I've selected:

MOD:90001    = lipoacylated
MOD:90002    = peptide chain
MOD:00033    = cross-linked
MOD:00427    = methylated
MOD:00649    = acylated 
MOD:00674    = amidated
MOD:00675    = oxidized
MOD:00677    = hydroxylated
MOD:00696    = phosphorylated
MOD:00701    = nucleic
MOD:00703    = isoprenylated
MOD:00764    = glycoconjugated
MOD:01152    = carboxylated
MOD:02078    = acetylated

Note the made-up slims (MOD:9000x). MOD:90001 (lipoacylated) was made up to handle the large number of cases where the modification is considered both lipoconjugated and acylated, while MOD:90002 (peptide chain) combines things like ubiquitinated, sumoylated, etc.

Next, I checked all the MOD terms mentioned in either PRO or Reactome (since PRO will eventually contain all of Reactome) to see the frequency of mention. Only the lines marked 'SLIM' are mapped to a slim; the others would appear on the alignment--at least if we adopt the slims above--as 'other':

13104 SLIM  MOD:00764 glycoconjugated residue
7867 SLIM  MOD:00696 phosphorylated residue
6690 SLIM  MOD:00677 hydroxylated residue
1368 SLIM  MOD:90002 peptide-linked
981 SLIM  MOD:02078 acetylated residue
449 SLIM  MOD:00427 methylated residue
448 SLIM  MOD:01152 carboxylated residue
312 SLIM  MOD:90001 lipoacylated
207 SLIM  MOD:00033 crosslinked residues
196 SLIM  MOD:00703 isoprenylated residue
136 SLIM  MOD:00674 amidated residue
89 SLIM  MOD:00649 acylated residue
88 SLIM  MOD:00701 nucleotide or nucleic acid modified residue
85  MOD:00128 N6-pyridoxal phosphate-L-lysine
72  MOD:00130 L-allysine
45 SLIM  MOD:00675 oxidized residue
32  MOD:00219 L-citrulline
20  MOD:00207 L-isoglutamyl-polyglutamic acid
19  MOD:00206 L-isoglutamyl-polyglycine
14  MOD:00314 glycine cholesterol ester
9  MOD:00159 O-phosphopantetheine-L-serine
8  MOD:01116 S-farnesyl-L-cysteine methyl ester
5  MOD:01119 S-geranylgeranyl-L-cysteine methyl ester
4  MOD:00317 N6-3,4-didehydroretinylidene-L-lysine
4  MOD:00685 deamidated L-glutamine
4  MOD:00912 modified L-lysine residue
4  MOD:01999 N6-(11-cis)-retinylidene-L-lysine
3  MOD:00031 L-selenocysteine residue
3  MOD:00274 L-cysteine persulfide
3  MOD:00909 modified L-histidine residue
2  MOD:00049 2'-[3-carboxamido-3-(trimethylammonio)propyl]-L-histidine
2  MOD:00125 hypusine
2  MOD:00181 O4'-sulfo-L-tyrosine
2  MOD:00237 L-beta-methylthioaspartic acid
2  MOD:01048 2-pyrrolidone-5-carboxylic acid
2  MOD:01699 protonated residue                                  <-----------------------------
2  MOD:01777 S-(glycyl)-L-cysteine (Gly)
2  MOD:01786 3'-nitro-L-tyrosine
2  MOD:01880 L-deoxyhypusine
1  MOD:00129 N6-retinylidene-L-lysine
1  MOD:00908 modified glycine residue
1  MOD:00913 modified L-methionine residue
1  MOD:01623 1-thioglycine (C-terminal)
1  MOD:01625 1-thioglycine
1  MOD:01684 palmitoylated-L-cysteine

The above numbers indicate to me that the selected slims are well-suited to the need, since all the non-slims (with the exception of 'protonated residue'; see arrow) are for specific amino acids and not of the general "something-ated" residue type.

@karenross @Julie-Cowart do you agree with the selections? No need to worry about colors for now, beyond noting that we'll need 19 of them (the above 14 plus other, sequence variant, conserved site, conserved substitution, and one that's used to highlight the mouse-over'd position).
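
For anyone who wants to re-check how much of the ranked list is covered by slims, a tally along these lines should work. This is only a sketch: it assumes each line has the form `count [SLIM] MOD:id name` as in the list above, and `slim_coverage` is a hypothetical helper name, not an existing script.

```python
import re

# count, optional SLIM marker, MOD id, free-text name
LINE_RE = re.compile(r"^(\d+)\s+(SLIM\s+)?(MOD:\d+)\s+(.*)$")


def slim_coverage(lines):
    """Return (mentions mapped to a slim, mentions that would show as 'other')."""
    slim_total = 0
    other_total = 0
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip blank or unrecognized lines
        count = int(match.group(1))
        if match.group(2):
            slim_total += count
        else:
            other_total += count
    return slim_total, other_total


# Four lines copied from the ranked list above.
sample = [
    "13104 SLIM  MOD:00764 glycoconjugated residue",
    "7867 SLIM  MOD:00696 phosphorylated residue",
    "85  MOD:00128 N6-pyridoxal phosphate-L-lysine",
    "72  MOD:00130 L-allysine",
]
slim_total, other_total = slim_coverage(sample)
print(slim_total, other_total)
```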

@Julie-Cowart
Collaborator

I'm not sure what you mean by your selections. Are you proposing that the 15 MOD ids you listed as slims each have a separate color? I think that is too many. Past 8 or so you run out of easy-to-distinguish colors. Based on the counts it looks like we should add hydroxylated, and possibly carboxylated and lipoacylated too.

Not sure about peptide linked. Is that instead of ubiquitinated? Is that a more useful categorization to users?

@nataled
Collaborator Author

nataled commented Sep 17, 2021

Potentially, yes, if we adopt all the indicated selections they would need separate colors. The three you suggested are already on the list.

I struggled with the peptide-linked one because it would indeed replace ubiquitinated. But I could not justify giving ubiquitinated its own color and not sumoylated, neddylated, and similar. Remember, these are just for the color scheme. We'd have documentation indicating what is included in each color for those merged cases.

@nataled
Collaborator Author

nataled commented Sep 17, 2021

Again, we need not worry about colors for now. There are ways of dealing with the need for too many colors. For example, selecting colors on the fly based on actual need for the entry (I'm sure we won't have any entries that require all colors simultaneously), or using a resource that lists optimal colors (for example https://sashamaps.net/docs/resources/20-colors/).

@Julie-Cowart
Collaborator

I wouldn't do that. The colors should be consistent across views. E.g. Phosphorylation is always pink. But yes we can define more categories now and then decide to combine to the same color later. Or leave as other.

So I'm starting from how it is implemented now and trying to figure out how to get it to do what you are suggesting. For most of those categories it's just a matter of making sure mod.txt is more complete, with all the mod ids for all the leaves under those terms (for example, we missed some of the phosphorylation ids that are children of MOD:00696). But as I documented above, the mod id is first converted to a prefix, and that prefix is then used for the colors. So I'm pretty sure ubiquitination should stay ub, but if we did something similar for all the other peptide chain versions and just happened to give them the same color in the style sheet, that would work. Then we change the label in the legend to say 'Peptide Linked' while the mouseover still says ub for ubiquitinated. I'd need to think more about lipoacylated. I think multiple modifications on the same site just get treated as other at the moment, but if we want this to be a special case it might be possible. It will definitely need special handling in the code, because it's coloring based on multiple modifications, not just one like all the rest.
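
To make the idea concrete, here is a rough sketch of prefixes sharing one color group while keeping their own mouseover text, plus a special case for a site with a specific combination of modifications. Everything here is hypothetical: only 'ub' and 'p' exist as prefixes today, and 'sumo', 'nedd', 'lip', 'acyl', and the function names are made up for illustration.

```python
# Several prefixes can point at the same color group (one style, one legend entry).
PREFIX_TO_COLOR_GROUP = {
    "ub": "peptide-linked",
    "sumo": "peptide-linked",   # hypothetical prefix
    "nedd": "peptide-linked",   # hypothetical prefix
    "p": "phosphorylation",
}

LEGEND_LABELS = {
    "peptide-linked": "Peptide Linked",
    "phosphorylation": "Phosphorylation",
    "lipoacylated": "Lipoacylated",
    "other": "Other",
}

# Special case: this exact set of prefixes on one site counts as one category.
COMBINED_GROUPS = {frozenset({"lip", "acyl"}): "lipoacylated"}


def color_group(site_prefixes):
    """Pick the color group for one site given all prefixes found on it."""
    prefixes = frozenset(site_prefixes)
    if len(prefixes) == 1:
        (prefix,) = prefixes
        return PREFIX_TO_COLOR_GROUP.get(prefix, "other")
    # Multiple modifications: 'other' unless the combination is listed.
    # An empty input also lands here; unmodified sites need their own handling.
    return COMBINED_GROUPS.get(prefixes, "other")


def describe_site(site_prefixes):
    """Return (legend label, mouseover text); the mouseover keeps the specific prefixes."""
    label = LEGEND_LABELS[color_group(site_prefixes)]
    mouseover = ", ".join(sorted(set(site_prefixes)))
    return label, mouseover
```

With this layout, `describe_site(["ub"])` gives the shared legend label 'Peptide Linked' but still reports ub on mouseover.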

By the way, can you find an example of ubiquitinated (or other peptide linked, for that matter) where the site is specified? I find many ubiquitinated forms, but they don't have a specified site so they don't show in the MSA. Same for finding a lipoacylated example.

@nataled
Collaborator Author

nataled commented Sep 17, 2021

I intend to create a file that looks like mod.txt once we have the slims and prefixes decided. The mapping information (from MOD to MODslim) is already done. Making mod.txt complete is why I'm doing the slims. The file will list all the MOD identifiers (or the numeric part, as is done now) that map to a slim (and therefore a color 'code'), though I'll leave out those that map to 'other' (I presume that currently any unlisted identifier automatically becomes 'other').

Assigning the ub code to all the peptide/protein-linked cases is what I intended. The lipoacylated cases would be treated as a single modification because I already did the combining behind the scenes (in my case I created fake MOD IDs, but those are just as easily converted to a color code identifier).
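
A file like that could be generated along these lines. This is a sketch under stated assumptions, not the actual procedure: the tab-separated `numeric id, code` layout is a guess at what mod.txt looks like, the `children` map stands in for the MOD ontology hierarchy, and MOD:99991 and MOD:99992 are placeholder ids invented for the example.

```python
def descendants(term, children):
    """Collect every term below `term` in a parent -> children map."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for child in children.get(current, []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def build_mapping_lines(slim_to_code, children):
    """Emit one 'numeric id<TAB>code' line per term under each slim.

    Terms that map to no slim are simply left out, on the assumption
    that unlisted ids are shown as 'other'. A term sitting under two
    slims would be emitted twice, so such cases need resolving first.
    """
    lines = []
    for slim, code in sorted(slim_to_code.items()):
        for term in sorted(descendants(slim, children) | {slim}):
            numeric = term.split(":", 1)[1]
            lines.append(f"{numeric}\t{code}")
    return lines


# Placeholder hierarchy: the two child ids are made up for illustration.
example_children = {
    "MOD:00696": ["MOD:99991"],
    "MOD:99991": ["MOD:99992"],
}
example_lines = build_mapping_lines({"MOD:00696": "p"}, example_children)
print("\n".join(example_lines))
```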

As far as finding the cases you mentioned, bear in mind that we might not have any in PRO yet, but they will come. That's why I included Reactome as a source when making the counts; the cases are there and will be imported into PRO in the next year or so.

@karenross
Collaborator

karenross commented Sep 17, 2021 via email

@nataled
Collaborator Author

nataled commented Sep 17, 2021

That's covered here: #256

In short, there's a bug that's preventing these from rendering as they should.

@nataled
Collaborator Author

nataled commented Sep 24, 2021

Here is a new ranked list that accounts only for those in Reactome plus the non-Reactome subset of PRO that is also position-specific (estimated as those that are organism-modification with PRO-proteoform-std). I removed the non-slim hits that are not generalized.

6232 SLIM  MOD:00677 hydroxylated residue
4533 SLIM  MOD:00696 phosphorylated residue
3956 SLIM  MOD:00764 glycoconjugated residue
1109 SLIM  MOD:90002 peptide-linked
437 SLIM  MOD:01152 carboxylated residue
270 SLIM  MOD:02078 acetylated residue
268 SLIM  MOD:90001 lipoacylated
226 SLIM  MOD:00427 methylated residue
205 SLIM  MOD:00033 crosslinked residues
189 SLIM  MOD:00703 isoprenylated residue
135 SLIM  MOD:00674 amidated residue
48 SLIM  MOD:00701 nucleotide or nucleic acid modified residue
43 SLIM  MOD:00675 oxidized residue
20 SLIM  MOD:00649 acylated residue
2  MOD:01699 protonated residue
