-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
IUPAC sites were not removed #6
Comments
Hi Cui, I will look into this next week. I apologize for the delay, but I am swamped this week. Just wanted to give you a heads-up that it might be a bit. If I happen to forget to follow up on this, feel free to send me a reminder message. -Bradley |
Thanks a lot Bradley! That's okay, please take your time! thanks, |
I just wanted to add here that I also experienced the same thing; the invariable sites (E.g. all samples having A nucleotides and one R, meaning A/G), are not removed, and raxml-ng is complaining about a large number of remaining invariable sites in the MSA. |
I added a section to the def filter_invariants(dframe):
initial_dframe_shape=dframe.shape
bases = ["A","G","C","T","R","Y","S","W","K","M","B","D","H","V"]
IUPAC_bases = ["R","Y","S","W","K","M","B","D","H","V","N"]
IUPAC_dict = {'B': 'CGT', 'D': 'AGT', 'H': 'ACT', 'K': 'GT', 'M': 'AC', 'N': 'ACGT', 'S': 'CG', 'R': 'AG', 'W': 'AT', 'V': 'ACG', 'Y': 'CT', 'X': 'ACGT'}
# collections::Counter library
stamatakis_cnt = Counter()
fels_cnt = 0
invariant_lst = list()
# Loop through each dataframe column
for i in dframe.columns:
# Gets unique values at each column and saves to list
column_unique = [x.upper() for x in dframe[i].unique().tolist()]
# Intersects column_unique with bases list
intersect = [value for value in bases if value in column_unique]
# If column contains only ambigous or IUPAC characters
# Save the column index for dropping later
if not any(value for value in bases if value in column_unique):
invariant_lst.append(i)
# If site is invariant (only A, C, G, or T); ignores N's and "-"
if len(intersect) == 1:
# Uses collections::Counter to get Stamatakis counts
stamatakis_cnt[intersect[0]] += 1
# Counts number of invariant sites for Felsenstein count
fels_cnt += 1
# Saves column indexes to list
invariant_lst.append(i)
# Assesses whether the column is considered invariate because a pattern like AAAAAARAAAAA (R being A or G)
# Save the column index for dropping later
IUPAC_bases_this_column = [x for x in intersect if x in IUPAC_bases]
column_unique_without_IUPAC = [x for x in intersect if x not in IUPAC_bases_this_column]
if len(IUPAC_bases_this_column) == 1 and \
len(column_unique_without_IUPAC) == 1 and \
column_unique_without_IUPAC[0] in list(IUPAC_dict[IUPAC_bases_this_column[0]]):
## if 1 IUPAC base AND 1 non-IUPAC base AND the non-IUPAC base is part of the IUPAC base set stored in IUPAC_dict
print(column_unique_without_IUPAC[0])
invariant_lst.append(i)
if len(IUPAC_bases_this_column) > 1:
## if there are more than one IUPAC bases in this column
invariant_lst.append(i)
# Drops invariant sites from dataframe
invariant_set=set(invariant_lst)
dframe.drop(invariant_set, axis=1, inplace=True)
print("## dimensions of the initial alignment (samples, nr of positions):",initial_dframe_shape)
print("##",len(invariant_set), "invariable sites removed")
print("## dimensions of the remaining alignment (samples, nr of positions):", dframe.shape)
return stamatakis_cnt, fels_cnt, dframe |
Awesome thanks! Sorry for the trouble. This was one of the first python scripts I ever wrote and I haven't touched it in a long time. @pvdam3 do you think you could do a pull request and I'll merge it with my original code so that this issue could be fixed? |
Hi,
I was using the ascbias.py script to remove the invariant sites. It seems to not work for me. I first converted my vcf file to phylip format using the script https://raw.githubusercontent.com/edgardomortiz/vcf2phylip/master/vcf2phylip.py, and then using this script to remove invariant sites for raxml phylogeny inference. But the phylip file after processing still contains IUPAC sites, and raxml was conplaining there are invariant sites in the input file that I shouldn't user the model GTR+ASC_LEWIS.
Do you know what could be the reason?
thanks,
Cui
The text was updated successfully, but these errors were encountered: