Defining kmax and problems with gIMble optimize #145
-
First of all, thanks so much for developing this great method. I am working on an extremely rapid tropical tree radiation, and am aiming to infer barriers to gene flow between several co-occurring species pairs using whole-genome resequencing data at 20x coverage. These trees are expected to display rampant incomplete lineage sorting because of their large Nc and the fact that they are outcrossing, although they do have relatively short generation times for trees. I have gone through the gIMble pipeline, but the best model that gIMble optimize selects differs depending on which parameter ranges I specify. Specifically, with my initial prior ranges the parameter estimates often hit the upper or lower prior bounds (frequently making no biological sense), and when I relax or change these priors, the best-selected model changes. I was wondering if this is because I am using the 2,2,2,2 kmax mutation configuration in gIMble tally; the gIMble paper says "... the choice of --kmax involves a trade-off between the information gained by distinguishing the probabilities of rare bSFS configurations and the increase in computational complexity." If I increased this to, for example, 4,4,4,4, would I capture more information from my sequence data, giving me more power to distinguish between models and make more reliable parameter estimates? Or is there anything else you might suggest to help me make the most robust inferences possible? Thanks a lot!
Replies: 6 comments
-
Hi Rowan, Your question was likely directed to @DRL and @GertjanBisschop, but I will post my answer given that I have been in a similar situation. It is true that increasing the kmax value would generate a more informative bSFS. However, there should be more than enough information with kmax=2,2,2,2 to fit the IM model (as well as the simpler nested models). Setting higher values is unlikely to lead to very different results (most blocks should be unaffected by the kmax value), but it would increase the computation time substantially. For example, I would expect kmax=4,4,4,4 to be prohibitively slow and to introduce more error (blocks with >10 SNPs are likely to be mapping artefacts) than u… The optimisation behaviour you describe suggests that your bSFS cannot be well described by an IM model with sensible parameter values. There could be a few different reasons for this:
I encountered the problem of too recent a split time when analysing a pair of subspecies with T ~ 0.1 Ne. There is no real solution other than to sync the three Ne values, or alternatively to leave them free but interpret the values with caution. My suggestion would be to start by reducing the block size (unless it is already set to a small value). There is no standard way to determine the right block size, but when the rates of mutation and recombination are similar it is best to set the block size so that the number of segregating sites per block is something like 1, 1.5 or 2. If you reduce the block size and your parameter estimates change dramatically, this suggests that the problem is (at least partially) due to hidden recombination. Cheers, Alex
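As a back-of-the-envelope check (illustrative Python, not a gIMble command), the expected number of segregating sites per block is roughly the block length times per-site diversity, which can also be inverted to pick a block length for a target count:

```python
# Illustrative sketch only -- not part of the gIMble API.
# Expected segregating sites per block ~ block_length * per-site diversity.

def expected_segsites(block_length: int, per_site_div: float) -> float:
    """Rough expectation of segregating sites in a block."""
    return block_length * per_site_div

def block_length_for_target(target_segsites: float, per_site_div: float) -> int:
    """Block length (bp) giving roughly target_segsites on average."""
    return round(target_segsites / per_site_div)

# Example with a hypothetical per-site diversity of 0.016:
print(expected_segsites(64, 0.016))         # ~1 segregating site per 64 bp block
print(block_length_for_target(1.5, 0.016))  # 94 bp for a target of 1.5
```

This is just the arithmetic behind the "1, 1.5 or 2 segregating sites per block" guideline; the diversity value plugged in is hypothetical.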
-
Thank you so much Alex, that makes a lot of sense and your response is very much appreciated. I will start by reducing the block size. My current block size is 64; is this considered relatively small?
-
It depends on the genetic diversity within (and between) the populations that you are analysing. However, diversity would have to be very high for 64 bases to be particularly problematic. 64 bases was used for the Heliconius butterflies in the gIMble paper (where per-site pi is around 0.016 in both species and dxy is 0.022) and I think it worked well there. How does diversity in your populations compare to those Heliconius species (you can get this information from …)? If you have multiple population pairs then you should consider setting the block length independently for each. In this paper we set the block length so that blocks contained an average of 1.5 segregating sites, and did this separately for each dataset.
-
Thanks for your interest in gIMble and your question @RSchley, and thanks for the quick and excellent answer @A-J-F-Mackintosh. As @A-J-F-Mackintosh explains, it is best to think of the block length as a relative choice. A useful (but still arbitrary) starting point is l = 2/d_xy, i.e. choose your block length such that a pair-block contains on average two differences between haplotypes sampled from different species.
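As a quick worked example of that rule of thumb (illustrative Python with a d_xy value from this thread, not gIMble code):

```python
# Rule-of-thumb starting block length l = 2 / d_xy: a pair-block then contains
# ~2 differences between haplotypes from different species on average.
# Illustrative sketch only.

def starting_block_length(dxy: float) -> int:
    """Suggested starting block length (bp) for a given per-site d_xy."""
    return round(2 / dxy)

# For a Heliconius-like d_xy of 0.022:
print(starting_block_length(0.022))  # 91 bp
```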
-
Thanks so much to you both for the insight. My four problematic population pairs have a dxy that ranges from 0.020 to 0.022, with per-site pi ranging from 0.018 to 0.020 per species. All other aspects of these datasets seem normal (four-gamete-violation blocks <1%, missing data <20% in all individuals). In all, they seem quite similar to the Heliconius system in those respects. In that case, perhaps it is something to do with the level of divergence between these species. This would make sense, given that they are likely to have very large Nc (also suggesting large Ne?) and phylogenetic estimates suggest they have diverged within the last ~1 Ma. Therefore, as you mention @A-J-F-Mackintosh, it is possible that the split is << Ne generations ago. Thanks again!
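Plugging the reported d_xy range into the l = 2/d_xy starting point gives implied block lengths for these pairs (a sketch using the numbers quoted in this thread; purely illustrative):

```python
# Implied starting block lengths for the reported d_xy range (illustrative only).

def starting_block_length(dxy: float) -> int:
    """Starting block length (bp) under the l = 2 / d_xy rule of thumb."""
    return round(2 / dxy)

for dxy in (0.020, 0.022):
    print(f"d_xy = {dxy}: l ~ {starting_block_length(dxy)} bp")
# Both implied lengths (91-100 bp) sit above the current 64 bp block size.
```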
-
With the above in mind, is there anything that I can do with gIMble using these species pairs, or is it best to jettison them?