Normalisation for Ancestral Variant Lineages #470

d-cameron · 2023-05-01T16:15:25Z

d-cameron
May 1, 2023
Collaborator

Reading through the specs, it appears that normalisation does not work as intended.

The example in https://vrs.ga4gh.org/en/1.1/impl-guide/normalization.html#id3 with S:g.5_6delinsCAGCA has start/end at 4,6. The example then goes on to left_roll and right_roll outside the bounds defined by the SequenceLocation. This is problematic as the normalisation procedure on that page explicitly defines the reference as the subsequence defined by that SequenceLocation.

Another issue is that NCBI’s Variant Overprecision Correction Algorithm only works for isolated variants. If there are multiple variants then the normalisation procedure will change the haplotype sequence.

Take the following example:

POS: 12345678901
REF: TAAAAAAAGC
ALT: TAAAGAAAAGC

This is commonly represented as {pos 5, A->G}, {pos 6, A->AA}. If the latter gets normalised to {pos 2: AAAAAAA->AAAAAAAA} then the information about the G occurring before the inserted A is lost. Traditional left-alignment has this same problem, and the normalisation process will change the haplotype if the normalisation moves a variant past the position of another variant.
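A minimal sketch of the problem, assuming VCF-style 1-based records applied to the example reference above. The helper `apply_vcf_variants` is hypothetical and written only for this illustration; it shows that moving the insertion to its left-aligned position while keeping the SNV in place yields a different haplotype.

```python
REF = "TAAAAAAAGC"

def apply_vcf_variants(ref, variants):
    """Apply (pos, ref_allele, alt_allele) records, with 1-based pos, to ref.

    Records are applied right to left so that earlier coordinates stay valid.
    """
    seq = ref
    for pos, ref_allele, alt_allele in sorted(variants, reverse=True):
        start = pos - 1
        # Guard against records that do not match the sequence they are applied to.
        assert seq[start:start + len(ref_allele)] == ref_allele
        seq = seq[:start] + alt_allele + seq[start + len(ref_allele):]
    return seq

# SNV at pos 5 plus the insertion as originally called at pos 6.
as_called = apply_vcf_variants(REF, [(5, "A", "G"), (6, "A", "AA")])

# Same SNV, but the insertion left-aligned to the start of the A run.
left_aligned = apply_vcf_variants(REF, [(5, "A", "G"), (1, "T", "TA")])

print(as_called)     # TAAAGAAAAGC
print(left_aligned)  # TAAAAGAAAGC
```

Both results have the same length, but the G sits one base further right once the insertion is moved past it.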

Answered by ahwagner

ahwagner · 2023-05-03T03:24:52Z

ahwagner
May 3, 2023
Maintainer

Thanks for this comment, @d-cameron. I have to say I really appreciate you digging into these details and checking the assumptions of the VRS model. In short, it is my view that normalization does work as intended, but I understand the points you are making and will give my perspective on how VRS was designed to handle the concerns you are raising. Also tagging @reece @larrybabb and @andreasprlic who have been designing VRS since the early days of VMC and may have additional insight.

In your first point, you mention:

[Expanding sequence bounds beyond the input SequenceLocation] is problematic as the normalisation procedure on that page explicitly defines the reference as the subsequence defined by that SequenceLocation.

I'm not sure if your concern is semantic or algorithmic. If it is just about the definition of a Sequence Location:

A SequenceLocation is a Location defined by an interval on a referenced Sequence.

I would say that since a Location exists on a sequence, it is valid to use the entire sequence to check for a normalized representation. I do not think this is your actual concern, since the notion of changes across multiple locations on shared reference sequences is common among all variant calling paradigms.

Assuming then that your concern is algorithmic, due to the changing coordinates post normalization:

I think this is the same type of assumption made by any normalization process for variants that are not known to be pre-normalized. In other words, a variant for which multiple representations could exist is checked and, if needed, adjusted to the preferred representation per the conventions of the specification. In HGVS, your coordinates may shift right (differing from input location); in VCF, they may shift left (differing from input location). In SPDI and VRS they shift inwards or outwards (differing from input location). In all cases they create equivalent representations. This is intentional design, and step 5 of the algorithm explicitly updates the position and sequence of the alleles to account for the fact that they change following the normalization procedure.
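To make the "shift outwards" behaviour concrete, here is a rough sketch of a fully-justifying normalizer in interresidue coordinates. It is loosely modelled on the trim, roll left, roll right, expand idea described above; the function name `fully_justify` and its signature are assumptions for illustration and this is not the VRS reference implementation.

```python
def fully_justify(ref_seq: str, start: int, end: int, alt: str) -> tuple[int, int, str]:
    """Return (start, end, alt) expanded over the region of ambiguity.

    Coordinates are interresidue: 0-based, end-exclusive.
    """
    ref = ref_seq[start:end]
    # Trim the shared suffix, then the shared prefix.
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt, end = ref[:-1], alt[:-1], end - 1
    while ref and alt and ref[0] == alt[0]:
        ref, alt, start = ref[1:], alt[1:], start + 1
    # Substitutions (both non-empty) and reference agreement (both empty)
    # are unambiguous in this sketch and are returned as trimmed.
    if bool(ref) == bool(alt):
        return start, end, alt
    unit = ref or alt  # the inserted or deleted sequence
    n = len(unit)
    # Roll left: compare upstream reference bases against the unit, cyclically from its end.
    left = 0
    while start - left > 0 and ref_seq[start - left - 1] == unit[-(left + 1) % n]:
        left += 1
    # Roll right: compare downstream reference bases against the unit, cyclically from its start.
    right = 0
    while end + right < len(ref_seq) and ref_seq[end + right] == unit[right % n]:
        right += 1
    new_start, new_end = start - left, end + right
    # Expanding only adds flanking reference bases, so the resulting sequence is unchanged.
    return new_start, new_end, ref_seq[new_start:start] + alt + ref_seq[end:new_end]

print(fully_justify("TAAAAAAAGC", 5, 5, "A"))  # (1, 8, 'AAAAAAAA')
print(fully_justify("TAAAAAAAGC", 4, 5, "G"))  # (4, 5, 'G')
```

An insertion of a single A anywhere in the A run expands to cover the whole run, while the SNV is left where it is.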

If I misinterpreted your concern on this first point, please let me know!

I think your second concern speaks to something that was addressed in the design of VRS, which I will illustrate using your example. But first, a small tweak to your example:

POS: 12345678901
REF: TAAAAAAAGC
ALT: TAAAGAAAAGC

I am going to revise this slightly to match your example description. In your example there are 7 A residues in both ref and alt, but from your description I think you meant for there to be 8 total A residues in the alt sequence with the inserted A to be 3-prime of the G, which would look like this:

POS: 123456789012
REF: TAAAAAAAGC
ALT: TAAAGAAAAAGC

In this case, if you have enough information to call TAAAGAAAAAGC as a haplotype, then the resultant VRS representation (in interresidue coordinates) would be an Allele with the following attributes:

Location:    # simplified to start and end for brevity
  Start: 4
  End:   4
State: 'GA'  # simplified to string for brevity
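
As a quick check of the simplified Allele above, the following sketch applies it to the reference using interresidue coordinates. The helper `apply_allele` is a hypothetical name used only here.

```python
REF = "TAAAAAAAGC"

def apply_allele(ref, start, end, state):
    """Replace ref[start:end] (interresidue coordinates) with state."""
    return ref[:start] + state + ref[end:]

# Start 4, End 4, State 'GA' is a pure insertion between residues 4 and 5.
haplotype = apply_allele(REF, 4, 4, "GA")
print(haplotype)  # TAAAGAAAAAGC
```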

What I think your question (quite rightly) gets at is why is this not a Haplotype of two Alleles? And you are correct that if composed as two Alleles in a Haplotype, we would lose information about where that G should go.

However, this is not a problem in VRS because VRS explicitly does not allow overlapping Alleles within a Haplotype:

Alleles within a Haplotype MUST not overlap

If two changes would overlap, the variant must first be represented as a single MNV Allele, and then normalized.
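
A small sketch of what such an overlap check could look like for interresidue intervals. The function `overlaps` is hypothetical, and it assumes that intervals which merely abut (including a zero-width insertion at the edge of another interval) do not count as overlapping; an implementation may choose a stricter policy.

```python
def overlaps(a, b):
    """True if interresidue intervals a=(start, end) and b=(start, end) share sequence."""
    (s1, e1), (s2, e2) = a, b
    return s1 < e2 and s2 < e1

snv = (4, 5)              # the A-to-G substitution
ins_as_called = (5, 5)    # zero-width insertion of A just after the SNV
ins_justified = (1, 8)    # the same insertion expanded over the whole A run

print(overlaps(snv, ins_as_called))  # False
print(overlaps(snv, ins_justified))  # True
```

The insertion only collides with the SNV after it has been fully justified, which is the case the MNV route is meant to cover.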

The question about whether or not Alleles that are non-adjacent and non-overlapping should be represented as SNVs in a Haplotype or MNVs with intervening ref-match sequence is a different class of problem that is a policy choice, and one that persists across representation formats. I think @andreasprlic is working on writing up some draft guidance on this point, but to date we leave it to the implementer to decide how to handle that.

@d-cameron please let me know if I misinterpreted the problem you were raising or the above response does not address your concerns. I truly value your feedback and if your fresh eyes on VRS are catching something we missed in VRS 1.x, I want to make sure we fix it in 2.0!

0 replies

andreasprlic · 2023-05-03T05:01:07Z

andreasprlic
May 3, 2023

Thanks for the great write up @ahwagner! I agree that the example would best get represented as an insGA variant. Just to refer also to the next sentence after "Alleles within a Haplotype must not overlap" from the VRS haplotypes guidance:

Alleles that create a net insertion or deletion of sequence MUST NOT change the location of “downstream” Alleles.

0 replies

d-cameron · 2023-05-04T06:25:36Z

d-cameron
May 4, 2023
Collaborator Author

My first concern was about the reference allele sequence only being defined for input start/end. Rereading the normalisation procedure, I think the current definition is ok and I was misinterpreting the algorithm as described.

The second concern isn't entirely addressed, though. Even though VRS disallows overlapping Alleles within a Haplotype, there are many instances where the variants are unphased. VRS really wants variants to be normalised (everything short of saying Alleles MUST be normalised). Normalising the unphased variants as (not absolutely) required by VRS makes the interpretation of these variants ambiguous.

VRS also doesn't specify what should happen when this normalisation would create ambiguity. There's no approach that is 'correct' in all circumstances, but VRS could do better than just say "normalized to a fully justified form unless there is a compelling reason to do otherwise", and we should explicitly define what should be done in this particular compelling circumstance. Essentially, we need an extension to SPDI that describes how to handle any combination of phased and unphased variants for which normalisation of any subset of these variants would cause an overlap in the SPDI normalisation. Is there a VRS equivalent of the VCF reference <*> allele? That would also impact the normalisation bounds.

A third related issue is that the presence of the A-to-G breaks the computed identifier for the ins-A. If the ins-A is an ancestral variant then one would naively assume that a descendant sample with a second mutation isn't going to cause the computed identifier for the ancestral variant to change. This is not the case. I was under the impression that the goal of normalisation is for the same variant to always be represented with the same computed identifier.

We need to handle any combination for scenarios such as:
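
A rough illustration of this third issue, assuming the descendant haplotype is merged into one Allele and trimmed. The digest here is a toy stand-in: it uses a SHA-512 hash truncated to 24 bytes and base64url-encoded, but over a made-up `start:end:state` string rather than the real VRS serialization, and `toy_digest` and `apply_allele` are hypothetical helpers.

```python
import base64
import hashlib

REF = "TAAAAAAAGC"

def apply_allele(ref, start, end, state):
    """Replace ref[start:end] (interresidue coordinates) with state."""
    return ref[:start] + state + ref[end:]

def toy_digest(start, end, state):
    """Truncated SHA-512 over a toy serialization (NOT the VRS serialization)."""
    blob = f"{start}:{end}:{state}".encode("ascii")
    return base64.urlsafe_b64encode(hashlib.sha512(blob).digest()[:24]).decode("ascii")

# Ancestral ins-A on its own, fully justified over the A run.
ancestral = (1, 8, "AAAAAAAA")
# Descendant: ins-A plus A-to-G merged into one Allele and trimmed.
descendant = (4, 4, "G")

print(apply_allele(REF, *ancestral))   # TAAAAAAAAGC
print(apply_allele(REF, *descendant))  # TAAAGAAAAGC
print(toy_digest(*ancestral) == toy_digest(*descendant))  # False
```

Under these assumptions the merged Allele reduces to a single inserted G, so nothing resembling the ancestral ins-A (or its identifier) survives in the descendant's representation.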

REF: ACACACAC
VCF variants:
1 A AC (no ambiguity by itself)
2 C AC (no ambiguity by itself)
->
ALT: ACACACACAC (haplotype is expansion by one AC repeat unit)

1 reply

ahwagner Mar 5, 2024
Maintainer

Linking the reply to this follow-up, below.

larrybabb · 2024-01-10T16:02:27Z

larrybabb
Jan 10, 2024
Maintainer

@ahwagner I'm assigning this to you because I'm still unclear on issue 3 that @d-cameron is raising above. I do think we should 'find a way' to address how intentionally ambiguous, non-normalized variation and location digests are handled (e.g. maybe we provide some type of digest that is clearly not a valid digest? Probably a dumb idea).

1 reply

ahwagner Mar 5, 2024
Maintainer

Thanks for re-raising this to my attention @larrybabb. It looks like @d-cameron is clear on the first issue from my earlier write-up.

The second issue, regarding whether Alleles SHOULD or MUST normalize, is one that has been discussed in the past. I think our compromise solution was that Alleles MUST be normalized to use VRS computed identifiers; but lacking any metadata that states "this VRS object follows the VRS conventions", every implementation is left to re-normalize ingested Alleles to ensure these conventions are met. I agree with Daniel that the better world is the one in which we tighten this constraint to MUST; this is no different than similar constraints supplied by SPDI or HGVS, and it wouldn't stop resources from doing whatever VRS-like things they wanted to do anyways; ClinVar-style and MaveDB-style HGVS are good examples of this, as are gnomAD-style VCF strings and CIViC representative alleles.

The final issue, regarding ancestral Alleles, I do understand. Daniel is saying that an ancestor has a variant and passes it on to a descendant, who acquires a recombined or de novo variant directly adjacent to the ancestral variant. If that is then expressed as an Allele, positional information about the ancestral variant is lost in the new, "combined" Allele. It is a situation where you would likely want to represent both Alleles in the descendant as distinct, adjacent Alleles in the Haplotype. This gets complicated if you wish to do that for an unambiguous insertion in the middle of an ancestral, ambiguous Allele, as they are then overlapping. You can make the ambiguous "side" of the ancestral Allele a new, smaller ambiguity in a Haplotype model, but that would be a poor policy as it doesn't allow you to convey 1:1 parallelism to the ancestral state.

Fortunately, VRS concerns itself only with the current state of a sequence; it does not convey information about current state in relation to ancestral state, only in relation to a reference sequence. IMO, the way to handle this specific scenario is to do what VRS does best: use the standard VRS components to create a more complex document that handles this type of ancestral variant lineage. I think @d-cameron has already recognized the need to approach it in this way.

One thing that may be a beneficial addition to VRS 2.x is a nested variation block: a way of specifying variation that occurs within an existing variant. This will be helpful for graph-based variant representation as well.

I am going to convert this to a Discussion, since there are not any clear outcomes defined here yet, but this does highlight an important area of consideration for future development.
