RfC: Whether to allow gap characters in consensus submissions or not, and if so, how to handle them (strip or keep) #3560

Open
corneliusroemer opened this issue Jan 21, 2025 · 3 comments

Comments

@corneliusroemer
Contributor

corneliusroemer commented Jan 21, 2025

As part of testing the new gapped sequence error message I checked what our docs said on Pathoplexus about support of gaps (or prohibition):

  • As far as I can tell, we don't actually say which characters are supported in submission FASTAs. We should add this to the docs for both Loculus and Pathoplexus.
  • We currently error on - in FASTA submissions, but we might not actually want to do this. - is allowed in FASTA, so if we keep erroring we definitely need to make it very explicit that we accept FASTA with a restricted alphabet. I can see a case for allowing submissions with -; we should check what INSDC does and align with it as closely as possible.
  • If we do allow gaps, we need to decide how to handle them in alignment outputs, but alignments are generally less important to end users (as they are processed and not ground truth).
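To make the options concrete, here is a minimal sketch of the three gap policies under discussion (reject, strip, keep), applied to a single sequence. It is not taken from the Loculus codebase: `GapPolicy`, `check_sequence`, the `allow_ambiguous` flag, and the choice to upper-case input are all hypothetical and only there for illustration.

```python
from enum import Enum

# Hypothetical names throughout; this is an illustration, not Loculus code.
UNAMBIGUOUS = frozenset("ACGTN")
IUPAC_AMBIGUOUS = frozenset("RYSWKMBDHV")  # standard IUPAC nucleotide ambiguity codes
GAP = "-"


class GapPolicy(Enum):
    REJECT = "reject"  # error on '-' (what the issue describes as current behaviour)
    STRIP = "strip"    # drop '-' before any further checks
    KEEP = "keep"      # pass '-' through unchanged


def check_sequence(seq, gap_policy=GapPolicy.REJECT, allow_ambiguous=False):
    """Return the (possibly gap-stripped) sequence, or raise ValueError."""
    allowed = (UNAMBIGUOUS | IUPAC_AMBIGUOUS) if allow_ambiguous else UNAMBIGUOUS
    seq = seq.upper()  # assumption: lower-case input is treated as equivalent
    if gap_policy is GapPolicy.STRIP:
        seq = seq.replace(GAP, "")
    elif gap_policy is GapPolicy.KEEP:
        allowed = allowed | {GAP}
    # Under REJECT, '-' is simply not in the allowed set and is reported below.
    bad = sorted(set(seq) - allowed)
    if bad:
        raise ValueError(f"unsupported characters: {''.join(bad)}")
    return seq
```

With something like this, `check_sequence("ACGT-N", GapPolicy.STRIP)` would give back the ungapped sequence, while the default `GapPolicy.REJECT` raises on the same input. Whichever policy is chosen, the docs would need to state the allowed alphabet explicitly.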

Prior discussion around gaps:

This has been factored out into a separate issue
  • The same goes for non-ACTGN characters: the Loculus docs state we don't support ambiguous characters, but I'm pretty sure INSDC does, so we should too. Either we already support ambiguous characters and the docs are wrong, or we should change the code to allow them and update the docs.
@theosanderson
Member

On gaps specifically:

We have discussed gaps at some length, e.g. in https://loculus.slack.com/archives/C06JCAZLG14/p1727307173560199 and, I think, in meetings around then. FASTA supports gaps because FASTA supports sequence alignments. But we don't, as far as I know, want people submitting sequence alignments to Pathoplexus, and that would be unexpected behaviour. FASTA supports protein sequences too, but they shouldn't be submitted. I don't think we need to document that you shouldn't submit amino acid sequences, and I think you can make a similar (but somewhat less strong) case about gaps. IMO there isn't an issue with the current behaviour/docs, but if submitters flag it as confusing, we should of course look at it again.

