-
Reading through the specs, it appears that normalisation does not work as intended. The example in https://vrs.ga4gh.org/en/1.1/impl-guide/normalization.html#id3 with
Another issue is that NCBI’s Variant Overprecision Correction Algorithm only works for isolated variants. If there are multiple variants, the normalisation procedure will change the haplotype sequence. Take the following example:
This is commonly represented as {pos 5, A -> G}, {pos 6, A -> AA}. If the latter gets normalised to {pos 2: AAAAAAA -> AAAAAAA} then the information about the G occurring before the inserted A is lost. Traditional left-alignment has the same problem: the normalisation process will change the haplotype if it moves a variant past the position of another variant.
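To make the effect concrete, here is a minimal sketch. The reference `TAAAAAAC`, the helper `apply_variants`, and the 0-based interbase `(start, end, alt)` tuples are assumptions chosen for illustration; they are not taken from the VRS documentation or from the original example.

```python
def apply_variants(ref, variants):
    """Apply (start, end, alt) edits in 0-based interbase coordinates to ref."""
    seq = ref
    # Apply right-to-left so that earlier coordinates are not shifted by later edits.
    for start, end, alt in sorted(variants, key=lambda v: (v[0], v[1]), reverse=True):
        seq = seq[:start] + alt + seq[end:]
    return seq


REF = "TAAAAAAC"              # hypothetical reference, A-run at 1-based positions 2-7
snv = (4, 5, "G")             # 1-based pos 5, A -> G
ins_as_called = (6, 6, "A")   # 1-based pos 6, A -> AA
ins_left = (1, 1, "A")        # the same insertion shifted to the left edge of the A-run

# In isolation, both placements of the insertion give the same sequence.
alone_called = apply_variants(REF, [ins_as_called])
alone_left = apply_variants(REF, [ins_left])

# Combined with the SNV, the two placements give different haplotypes.
hap_called = apply_variants(REF, [snv, ins_as_called])
hap_left = apply_variants(REF, [snv, ins_left])

print(alone_called == alone_left)  # True
print(hap_called, hap_left)        # TAAAGAAAC TAAAAGAAC
```

In the first haplotype the G is followed by three As; in the second it is followed by only two, so shifting the insertion past the SNV has changed the sequence being described.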
-
Thanks for this comment, @d-cameron. I have to say I really appreciate you digging into these details and checking the assumptions of the VRS model. In short, it is my view that normalization does work as intended, but I understand the points you are making and will give my perspective on how VRS was designed to handle the concerns you are raising. Also tagging @reece @larrybabb and @andreasprlic, who have been designing VRS since the early days of VMC and may have additional insight.
In your first point, you mention:
I'm not sure if your concern is semantic or algorithmic. If it is just about the definition of a Sequence Location:
A SequenceLocation is a Location defined by an interval on a referenced Sequence.
I would say that since a Location exists on a sequence, it is valid to use the entire sequence to check for a normalized representation. I do not think this is your actual concern, since the notion of changes across multiple locations on shared reference sequences is common among all variant calling paradigms.
Assuming then that your concern is algorithmic, due to the coordinates changing after normalization: I think this is the same type of assumption as is made by any normalization process for variants that are not known to be pre-normalized. In other words, a check is made to see whether a variant for which multiple representations could exist needs to be adjusted to the preferred representation per the conventions of the specification. In HGVS, your coordinates may shift right (differing from the input location); in VCF, they may shift left (differing from the input location). In SPDI and VRS they shift inwards or outwards (differing from the input location). In all cases they create equivalent representations. This is intentional design, and step 5 of the algorithm explicitly updates the position and sequence of the alleles to account for the fact that they change following the normalization procedure.
If I misinterpreted your concern on this first point, please let me know!
I think your second concern speaks to something that was addressed in the design of VRS, which I will illustrate using your example. But first, a small tweak to your example:
I am going to revise this slightly to match your example description. In your example there are 7
In this case, if you have enough information to call
What I think your question (quite rightly) gets at is why is this not a Haplotype of two Alleles? And you are correct that if composed as two Alleles in a Haplotype, we would lose information about where that However, this is not a problem in VRS because VRS explicitly does not allow overlapping Alleles within a Haplotype:
If this were the case, the variant must first be represented as an MNV Allele, and then normalized.
The question of whether Alleles that are non-adjacent and non-overlapping should be represented as SNVs in a Haplotype or as MNVs with intervening ref-match sequence is a different class of problem. It is a policy choice, and one that persists across representation formats. I think @andreasprlic is working on writing up some draft guidance on this point, but to date we leave it to the implementer to decide how to handle that.
@d-cameron please let me know if I misinterpreted the problem you were raising or if the above response does not address your concerns. I truly value your feedback, and if your fresh eyes on VRS are catching something we missed in VRS 1.x, I want to make sure we fix it in 2.0!
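For readers who want to experiment, the following is a simplified sketch of fully-justified normalization (trim, roll, expand) applied first to an isolated insertion and then to a combined MNV Allele. The function name `fully_justify`, the reference `TAAAAAAC`, and the 0-based interbase tuples are assumptions for illustration only; this is not the VRS reference implementation and it only handles a single alternate allele.

```python
def fully_justify(ref_seq, start, end, alt):
    """Return (start, end, alt) in fully-justified form, 0-based interbase coordinates."""
    ref = ref_seq[start:end]
    # Trim the common suffix, then the common prefix.
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt, end = ref[:-1], alt[:-1], end - 1
    while ref and alt and ref[0] == alt[0]:
        ref, alt, start = ref[1:], alt[1:], start + 1
    # Substitutions (both non-empty) and reference matches (both empty) do not roll.
    if bool(ref) == bool(alt):
        return start, end, alt
    unit = ref or alt  # the inserted or deleted sequence
    n = len(unit)
    # Roll left and right for as long as the flanking reference matches the unit cyclically.
    left = 0
    while start - 1 - left >= 0 and ref_seq[start - 1 - left] == unit[(n - 1 - left) % n]:
        left += 1
    right = 0
    while end + right < len(ref_seq) and ref_seq[end + right] == unit[right % n]:
        right += 1
    # Expand the allele to cover the whole region over which it can roll.
    new_start, new_end = start - left, end + right
    new_alt = ref_seq[new_start:start] + alt + ref_seq[end:new_end]
    return new_start, new_end, new_alt


REF = "TAAAAAAC"  # hypothetical reference with a six-base A-run

# Isolated insertion of A: expands to cover the entire A-run.
ins_alone = fully_justify(REF, 6, 6, "A")
print(ins_alone)  # (1, 7, 'AAAAAAA')

# SNV (A -> G at index 4) and the insertion combined into one Allele over REF[4:6].
mnv = fully_justify(REF, 4, 6, "GAA")
print(mnv)        # (4, 4, 'G')
```

Under these assumptions, the combined Allele trims down to a single unambiguous insertion of G, which cannot roll in either direction, whereas the isolated insertion expands across the whole repeat.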
-
Thanks for the great write up @ahwagner! I agree that the example would best get represented as an
-
My first concern was about the reference allele sequence only being defined for the input start/end. Rereading the normalisation procedure, I think the current definition is OK and I was misinterpreting the algorithm as described.
The second concern isn't entirely addressed. Even though VRS disallows overlapping Alleles within a Haplotype, there are many instances where the variants are unphased. VRS really wants variants to be normalised (everything short of saying Alleles MUST be normalised). Normalising the unphased variants as (not absolutely) required by VRS makes the interpretation of these variants ambiguous. VRS also doesn't specify what should happen when this normalisation would create ambiguity. There's no approach that is 'correct' in all circumstances, but VRS could do better than just saying "normalized to a fully justified form unless there is a compelling reason to do otherwise", and we should explicitly define what should be done in this particular compelling circumstance. Essentially, we need an extension to SPDI that describes how to handle any combination of phased and unphased variants for which normalisation of any subset of these variants* would cause an overlap in the SPDI normalisation. Is there a VRS equivalent of the VCF reference
A third, related issue is that the presence of the A-to-G breaks the computed identifier for the ins-A. If the ins-A is an ancestral variant, then one would naively assume that a second mutation in a descendant sample isn't going to cause the computed identifier for the ancestral variant to change. This is not the case. I was under the impression that the goal of normalisation is for the same variant to always be represented with the same computed identifier.
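A rough sketch of the identifier problem follows. The truncated digest (`sha512t24u`: SHA-512, first 24 bytes, base64url) is the GA4GH digest function, but the serialisation in `toy_allele_id`, the `toy:VA.` prefix, the sequence label, and the two allele tuples are simplified assumptions for illustration, so these values will not match real VRS computed identifiers.

```python
import base64
import hashlib
import json


def sha512t24u(blob: bytes) -> str:
    """SHA-512 digest, truncated to 24 bytes, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(blob).digest()[:24]).decode("ascii")


def toy_allele_id(seq_id, start, end, alt):
    # Hypothetical, simplified serialisation; NOT the canonical VRS form.
    blob = json.dumps(
        {"alt": alt, "end": end, "seq": seq_id, "start": start},
        sort_keys=True,
        separators=(",", ":"),
    ).encode("ascii")
    return "toy:VA." + sha512t24u(blob)


# Ancestral sample: the ins-A on its own, in fully-justified form over an assumed A-run.
ancestral = toy_allele_id("hypothetical-ref", 1, 7, "AAAAAAA")

# Descendant sample: ins-A plus the A-to-G, represented as one combined, normalised Allele.
descendant = toy_allele_id("hypothetical-ref", 4, 4, "G")

print(ancestral)
print(descendant)
print(ancestral == descendant)  # False
```

Because the descendant's single Allele no longer contains the ancestral ins-A as a separate object, a lookup by the ancestral identifier finds nothing in the descendant sample.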
-
@ahwagner I'm assigning this to you because I'm still unclear on issue 3 that @d-cameron is raising above. I do think we should 'find a way' to address how intentionally ambiguous non-normalized based variation and location digests are handled (e.g. maybe we provide some type of digest that is clearly not a valid digest? Probably a dumb idea).