Hi all,
Thanks a lot for building such an excellent ecosystem of tools; your hard work is very much appreciated by everyone in the community :)
I've been checking out this pre-print and wanted to build a CHM13 af index. This has sent me down a rabbit hole, and I hope you can advise, or tell me if I'm going in completely the wrong direction.
I'm using the bioconda release of simpleaf:
I've checked out the links in the CHM13 GitHub repo, which offers a number of genomic FASTA options and GFF files. The paper uses the chm13v2.0_maskedY.fa FASTA and GENCODE 35 annotations from this repo.
Converting the GFF3 file to a friendly format that encompasses all the fields needed has been tricky, but a bit of gffread magic helped:
To mimic the expected annotation as closely as possible, though, I took the CellRanger human GTF and lifted it over with the provided chain file.
Using CHM13_CellRanger.gtf.gz and chm13v2.0_maskedY.fa as input to simpleaf index resulted in an error suggesting that some exons for a given transcript were on different chromosomes or strands. So I wrote a little Python to pick those out:
```python
# check_strands.py
import pandas as pd

# Column names for the nine standard GTF fields
col_names = ["seqname", "source", "feature", "start", "end", "score", "strand", "frame", "attributes"]

# Read the GTF file, skipping header comment lines
df = pd.read_csv('CHM13_CellRanger.gtf', sep='\t', comment='#', names=col_names)

# Extract the transcript_id from the attributes column
df['transcript_id'] = df['attributes'].str.extract(r'transcript_id "([^"]+)"', expand=False)

# Group by transcript_id and flag transcripts with more than one unique strand
grouped = df.groupby('transcript_id')['strand'].nunique()
multi_strand_transcripts = grouped[grouped > 1]
print(multi_strand_transcripts)
```
I then opted to remove those transcripts from the GTF.
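For anyone following along, the removal step could look something like the sketch below. It is not necessarily how the filtering was actually done here: the function names are made up for illustration, it only uses the standard library, and it flags a transcript when its records disagree on either chromosome or strand, since the error message mentions both.

```python
import re
from collections import defaultdict

# Matches the transcript_id attribute in the ninth GTF column
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def find_inconsistent_transcripts(lines):
    """Return IDs of transcripts whose records span more than one (chromosome, strand) pair."""
    placements = defaultdict(set)
    for line in lines:
        if line.startswith("#"):
            continue  # header/comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            # fields[0] is the chromosome (seqname), fields[6] is the strand
            placements[match.group(1)].add((fields[0], fields[6]))
    return {tid for tid, seen in placements.items() if len(seen) > 1}


def drop_transcripts(lines, bad_ids):
    """Yield GTF lines unchanged, skipping records that belong to a transcript in bad_ids."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        match = TRANSCRIPT_ID.search(fields[8]) if len(fields) >= 9 else None
        if match and match.group(1) in bad_ids:
            continue
        yield line


if __name__ == "__main__":
    # Tiny made-up example: t1 sits on both strands, t2 is consistent
    demo = [
        'chr1\tdemo\texon\t1\t10\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n',
        'chr1\tdemo\texon\t20\t30\t.\t-\t.\tgene_id "g1"; transcript_id "t1";\n',
        'chr2\tdemo\texon\t1\t10\t.\t+\t.\tgene_id "g2"; transcript_id "t2";\n',
    ]
    bad = find_inconsistent_transcripts(demo)
    for kept in drop_transcripts(demo, bad):
        print(kept, end="")
```

Note that this keeps gene-level records (which carry no transcript_id) even when every transcript of that gene was dropped, so those may need a separate clean-up depending on what the indexer expects.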
This resulted in a new error on the index build, this time about the lowercase letters in the genomic FASTA file (i.e. the soft-masking). For lack of better options, I uppercased the sequence.
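As a rough illustration of the uppercasing step (a minimal sketch, not necessarily the approach used here; `uppercase_fasta` and the output filename are hypothetical), the sequence lines can be uppercased while header lines are left untouched:

```python
import io


def uppercase_fasta(in_handle, out_handle):
    """Copy a FASTA stream, uppercasing sequence lines and leaving '>' headers as-is."""
    for line in in_handle:
        if line.startswith(">"):
            out_handle.write(line)  # header: keep the original case
        else:
            out_handle.write(line.upper())  # sequence: removes soft-masking


if __name__ == "__main__":
    # Small in-memory demo; for the real genome one would open files instead, e.g.
    #   with open("chm13v2.0_maskedY.fa") as src, \
    #        open("chm13v2.0_maskedY.upper.fa", "w") as dst:  # hypothetical output name
    #       uppercase_fasta(src, dst)
    src = io.StringIO(">chr1 demo\nacgtNNac\nACgt\n")
    dst = io.StringIO()
    uppercase_fasta(src, dst)
    print(dst.getvalue(), end="")
```

This streams line by line, so it does not need to hold the whole genome in memory. The trade-off is that the soft-masking information (which bases were flagged as repeats) is lost in the output file.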
...and then the index builds successfully:
Any thoughts or suggestions are welcome, or if I've missed an existing build somewhere, a point in the right direction would be greatly appreciated.
Thanks,
Andrew