Defining kmax and problems with gIMble optimize #145
-
First of all, thanks so much for developing this great method. I am working on an extremely rapid tropical tree radiation, and am aiming to infer barriers to gene flow between several co-occurring species pairs using whole-genome resequencing data at 20x coverage. These trees are expected to display rampant incomplete lineage sorting because of their large Nc and the fact that they are outcrossing, although they do have relatively short generation times for trees. I have gone through the gIMble pipeline, but the best model that gIMble optimize selects differs depending on which parameter ranges I specify. Specifically, with my initial prior ranges the parameter estimates often hit the upper or lower prior bounds (frequently making no biological sense), and when I relax or change these priors, the best-selected model changes. I was wondering if this is because I am using the 2,2,2,2 kmax mutation configuration in gIMble tally; the gIMble paper says "... the choice of --kmax involves a trade-off between the information gained by distinguishing the probabilities of rare bSFS configurations and the increase in computational complexity." If I increased this to, for example, 4,4,4,4, would I capture more information from my sequence data, giving me more power to distinguish between models and make more reliable parameter estimates? Or is there anything else you might suggest to help me make the most robust inferences possible? Thanks a lot!
Replies: 6 comments
-
Hi Rowan, Your question was likely directed to @DRL and @GertjanBisschop, but I will post my answer given that I have been in a similar situation. It is true that increasing the kmax value would generate a more informative bSFS. However, there should be more than enough information with kmax=2,2,2,2 to fit the IM model (as well as the simpler nested models). Setting higher values is unlikely to lead to very different results (most blocks should be unaffected by the kmax value), but it would increase the computation time substantially. For example, I would expect kmax=4,4,4,4 to be prohibitively slow and to introduce more error (blocks with >10 SNPs are likely to be mapping artefacts) than u… The optimisation behaviour you describe suggests that your bSFS cannot be well described by an IM model with sensible parameter values. There could be a few different reasons for this:
I encountered the problem of too recent a split time when analysing a pair of subspecies with T ~ 0.1 Ne. There is no real solution other than to sync the three Ne values, or alternatively to leave them free but interpret the values with caution. My suggestion would be to start by reducing the block size (unless it is already set to a small value). There is no standard way to determine the right block size, but when the rates of mutation and recombination are similar it is best to set the block size so that the number of segregating sites per block is something like 1, 1.5 or 2. If you reduce the block size and your parameter estimates change dramatically, this suggests that the problem is (at least partially) due to hidden recombination. Cheers, Alex
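As a back-of-the-envelope check (illustrative Python, not a gIMble command), the expected number of segregating sites per block is roughly the block length times per-site diversity, which can also be inverted to pick a block length for a target count:

```python
# Illustrative sketch only -- not part of the gIMble API.
# Expected segregating sites per block ~ block_length * per-site diversity.

def expected_segsites(block_length: int, per_site_div: float) -> float:
    """Rough expectation of segregating sites in a block."""
    return block_length * per_site_div

def block_length_for_target(target_segsites: float, per_site_div: float) -> int:
    """Block length (bp) giving roughly target_segsites on average."""
    return round(target_segsites / per_site_div)

# Example with a hypothetical per-site diversity of 0.016:
print(expected_segsites(64, 0.016))         # ~1 segregating site per 64 bp block
print(block_length_for_target(1.5, 0.016))  # 94 bp for a target of 1.5
```

This is just the arithmetic behind the "1, 1.5 or 2 segregating sites per block" guideline; the diversity value plugged in is hypothetical.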
-
Thank you so much Alex, that makes a lot of sense and your response is very much appreciated. I will start by reducing the block size. My current block size is 64; is this considered relatively small?
-
It depends on the genetic diversity within (and between) the populations that you are analysing. However, diversity would have to be very high for 64 bases to be particularly problematic. 64 bases was used for the Heliconius butterflies in the gIMble paper (where per-site pi is around 0.016 in both species and dxy is 0.022) and I think it worked well there. How does diversity in your populations compare to those Heliconius species (you can get this information from …)? If you have multiple population pairs then you should consider setting the block length independently for each. In this paper we set the block length so that blocks contained an average of 1.5 segregating sites, and did this separately for each dataset.
-
Thanks for your interest in gIMble and your question @RSchley, and thanks for the quick and excellent answer @A-J-F-Mackintosh. As @A-J-F-Mackintosh explains, it is best to think of the block length as a relative choice. A useful (but still arbitrary) starting point is l = 2/d_xy, i.e. choose your block length such that a pair-block contains on average two differences between haplotypes sampled from different species.
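As a quick worked example of that rule of thumb (illustrative Python with a d_xy value from this thread, not gIMble code):

```python
# Rule-of-thumb starting block length l = 2 / d_xy: a pair-block then contains
# ~2 differences between haplotypes from different species on average.
# Illustrative sketch only.

def starting_block_length(dxy: float) -> int:
    """Suggested starting block length (bp) for a given per-site d_xy."""
    return round(2 / dxy)

# For a Heliconius-like d_xy of 0.022:
print(starting_block_length(0.022))  # 91 bp
```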
-
Thanks so much to you both for the insight. My four problematic population pairs have a dxy that ranges from 0.020 to 0.022, with per-site pi ranging from 0.018 to 0.020 per species. All other aspects of these datasets seem normal (four-gamete-violation blocks <1%, missing data <20% in all individuals). In all, they seem quite similar to the Heliconius system in those respects. In that case, perhaps it is something to do with the level of divergence between these species. This would make sense, given that they are likely to have very large Nc (also suggesting large Ne?) and phylogenetic estimates suggest they have diverged within the last ~1 Ma. Therefore, as you mention @A-J-F-Mackintosh, it is possible that the split is << Ne generations ago. Thanks again!
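Plugging the reported d_xy range into the l = 2/d_xy starting point gives implied block lengths for these pairs (a sketch using the numbers quoted in this thread; purely illustrative):

```python
# Implied starting block lengths for the reported d_xy range (illustrative only).

def starting_block_length(dxy: float) -> int:
    """Starting block length (bp) under the l = 2 / d_xy rule of thumb."""
    return round(2 / dxy)

for dxy in (0.020, 0.022):
    print(f"d_xy = {dxy}: l ~ {starting_block_length(dxy)} bp")
# Both implied lengths (91-100 bp) sit above the current 64 bp block size.
```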
-
With the above in mind, is there anything that I can do with gIMble using these species pairs, or is it best to jettison them?