Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CHM13 Index #100

Open
AndrewSkelton opened this issue Aug 1, 2023 · 0 comments
Open

CHM13 Index #100

AndrewSkelton opened this issue Aug 1, 2023 · 0 comments

Comments

@AndrewSkelton
Copy link

Hi all,

Thanks a lot for building such an excellent eco system of tools, your hard work is very appreciated by everyone in the community :)

I've been checking out this pre-print, and wanted to build a CHM13 af index. This has sent me down a rabbit hole and hope you could advise or give an opinion if I'm going in the complete wrong direction.

I'm using the bioconda release of simpleaf:

alevin-fry v0.8.2
piscem v0.6.0
salmon v1.10.2
simpleaf v0.14.1

I've checked out the links in the CHM13 github repo, which offers a number of genomic fasta options and gff files. The paper uses the chm13v2.0_maskedY.fa fasta and GENCODE 35 annotations from this repo.

Converting the GFF3 file to a friendly format which encompasses all the fields needed has been tricky, but a bit of gffread magic helped:

gunzip -c CHM13_Gencode_v38.gff3.gz \
| gffread -F -T -o - \
| sed 's/gene://g; s/transcript://g' \
| gzip > CHM13_Gencode_v38.gtf.gz

To try and mimic annotation as close to expected though, I took the CellRanger human GTF and used liftover which the provided chain file

liftover -gff refdata-gex-GRCh38-2020-A/genes/genes.gtf grch38-chm13v2.chain CHM13_CellRanger.gtf CHM13_CellRanger.unmapped.gtf
pigz CHM13_CellRanger.gtf CHM13_CellRanger.unmapped.gtf grch38-chm13v2.chain

Using CHM13_CellRanger.gtf.gz and chm13v2.0_maskedY.fa as input to simpleaf index resulted in an error suggesting that some exons for a given transcript were on different chromosomes or strands. So a little python to pick those out

# check_strands.py
import pandas as pd
# Define the column names for a GTF file
col_names = ["seqname", "source", "feature", "start", "end", "score", "strand", "frame", "attributes"]
# Read the GTF file
df = pd.read_csv('CHM13_CellRanger.gtf', sep='\t', comment='#', names=col_names)
# Extract the transcript_id from the attributes column
df['transcript_id'] = df['attributes'].str.extract('transcript_id "([^"]+)"')
# Group by transcript_id and check if there is more than one unique strand per group
grouped = df.groupby('transcript_id')['strand'].nunique()
multi_strand_transcripts = grouped[grouped > 1]
print(multi_strand_transcripts)
pigz -d CHM13_CellRanger.gtf.gz
python check_strands.py | grep ENST - | cut -f 1 -d ' ' | sort | uniq > strand.tsv
grep -Ff strand.tsv CHM13_CellRanger.gtf | awk -v FS="\t" '{split($9,a,"; "); for(i in a) {if(a[i] ~ /gene_name/) {print a[i]}}}' | awk -F\" '{print $2}' | sort | uniq > gene_names.tsv
## Genes with liftover strand issues
> cat gene_names.tsv
C5orf60
CCDC200
CDR2
DAZ1
DAZ2
DAZ4
GPAT2
IFITM2
LINC01410
NPIPB8
PRAMEF18
TRIM49D1

I then opted to remove those transcripts from the GTF

grep -v -F -f strand.tsv CHM13_CellRanger.gtf > CHM13_CellRanger.filtered.gtf
pigz CHM13_CellRanger.gtf CHM13_CellRanger.filtered.gtf

This resulted in a new error on the index build, around the lower case letters in the genomic fasta file (ie the softmasking). For lack of better options, I uppercased.

pigz -d chm13v2.0_maskedY.fa.gz
awk '{ if ($0 !~ />/) {print toupper($0)} else {print $0} }' chm13v2.0_maskedY.fa > chm13v2.0_maskedY_upper.fa

...and then the index builds successfully:

simpleaf index \
            --output ./Index_SimpleAF_CHM13 \
            --fasta chm13v2.0_maskedY_upper.fa \
            --gtf CHM13_CellRanger.filtered.gtf.gz \
            --rlen 91 \
            --threads 8 \
            --use-piscem
tree Index_SimpleAF_CHM13 > Index_SimpleAF_CHM13.tree

Any thoughts or suggestions are welcome, or if I've missed an existing build somewhere, a point in the right direction would be greatly appreciated.

Thanks,

Andrew

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant