
How to produce RV=$tmp/${sample}_R1_filtered.final.fastq.gz #1

Open

xieduo7 opened this issue Oct 25, 2024 · 11 comments

Comments

@xieduo7

xieduo7 commented Oct 25, 2024

Hi @Zhiliang-Bai ,

Thanks for your code.

I am wondering how to generate RV=$tmp/${sample}_R1_filtered.final.fastq.gz from the raw reads before running st_pipeline_run.py. Is it done by trimming read 1 with Cutadapt? I was confused because I saw that st_pipeline_run.py can also trim the reads.

Thank you!

Best,
Duo

xieduo7 changed the title from "The command used to trim read 1 using Cutadapt" to "How to produce RV=$tmp/${sample}_R1_filtered.final.fastq.gz" on Oct 25, 2024
@Sina-soton

I am also stuck at the same point. Would greatly appreciate any advice!

Best wishes
Sina

@Zhiliang-Bai
Owner

Hi Duo and Sina,

The raw data I uploaded to GEO has already been filtered, with the R1 file for each sample ready to be used as input for the st_pipeline. However, the GEO team has requested the upload of the original, untrimmed raw data, and I am currently working on addressing this request.
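Roughly speaking, the filtered R1 from GEO goes in as the transcript (reverse) read of st_pipeline_run.py. The sketch below is illustrative only; the sample name, paths, and options are placeholders rather than the exact command we used:

# Illustrative sketch only; placeholder sample name, paths and options
tmp=./tmp
sample=MouseE13_Rep1_50um
FW=$tmp/${sample}_R2.fastq.gz                  # read assumed to carry the spatial barcode and UMI
RV=$tmp/${sample}_R1_filtered.final.fastq.gz   # filtered R1 downloaded from GEO (cDNA read)
st_pipeline_run.py \
  --expName $sample \
  --ids Spatial_barcode_50x50.txt \
  --ref-map /path/to/STARindex \
  --ref-annotation /path/to/annotation.gtf \
  --output-folder ./output \
  --temp-folder $tmp \
  $FW $RV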

Best,
Zhiliang

@xieduo7
Author

xieduo7 commented Nov 30, 2024

Hi Dr. @Zhiliang-Bai,

Thanks for your reply.

I think we still need to trim the TSO primer (AAGCAGTGGTATCAACGCAGAGTGAATrGrG+G, as denoted in Supplementary Table 3) before running st_pipeline, as in the 10x pipeline.
I saw lots of TSO primer sequences in the GEO R1 reads:
[screenshot: R1 reads containing the TSO primer sequence]

st_pipeline does not include this trimming step, so if we don't trim these TSO primers they will introduce many mismatches and may prevent these reads from mapping to the reference genome.
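Something along these lines is what I have in mind (the sequence is the DNA form of the TSO, i.e. the riboGs written as GGG; file names are only examples):

# example only: remove the 5'-anchored TSO from the cDNA read before st_pipeline
cutadapt -g ^AAGCAGTGGTATCAACGCAGAGTGAATGGG \
         -m 10 \
         -o sample_R1_TSOtrimmed.fastq.gz \
         sample_R1.fastq.gz > cutadapt_report.txt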

Did you do this before running st_pipeline?

Thank you!

Best,
Duo

@Sina-soton

Hi Zhiliang and Duo

Thank you very much for your reply Zhiliang.

I agree with Duo: are there any specific trimming requirements before alignment? I am having trouble getting any significant mapping of the read 1 transcripts to the reference genome.

Best wishes
Sina

@jjznn

jjznn commented Dec 9, 2024

I trimmed the TSO primer using this command: cutadapt -j 16 -g ^AAGCAGTGGTATCAACGCAGAGTGAATGGG -m 10 -o R1_trimmed.fastq.gz mouseE13_1.fastq.gz > trimming_report1.txt
Then I ran st_pipeline. However, when it gets to the STAR mapping step, the uniquely mapped rate is still low:
[screenshot: STAR mapping statistics]
So I suppose there's still something wrong. Can anybody help me out?
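If it helps anyone diagnose this, a quick check for leftover TSO sequence in the trimmed file would be something like (file name as in my command above; only the first 27 nt of the TSO are used since the terminal riboGs are variable):

# count trimmed reads whose sequence line still starts with the TSO; should be close to zero
zcat R1_trimmed.fastq.gz | awk 'NR % 4 == 2' | grep -c "^AAGCAGTGGTATCAACGCAGAGTGAAT"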

@xieduo7
Author

xieduo7 commented Dec 9, 2024

Dear @jjznn ,

Is the sample you showed GSM8454077_MouseE13_Rep1_50um? What is the number of reads before mapping (i.e., after trimming)?
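A quick way to count reads in a gzipped FASTQ (file name taken from your cutadapt command) would be something like:

# four lines per FASTQ record
echo $(( $(zcat R1_trimmed.fastq.gz | wc -l) / 4 ))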

Best,
Duo

@jjznn

jjznn commented Dec 9, 2024 via email

@xieduo7
Author

xieduo7 commented Dec 9, 2024

Hi @jjznn ,
What fraction of reads was removed by trimming? Could you post the full log file?

@jjznn

jjznn commented Dec 10, 2024

Hi @xieduo7 ,
My command lines are as follows:
#downsample
seqtk sample -s100 mouseE13_1.fastq.gz 0.1 > mouseE13_1_downsampled.fastq
seqtk sample -s100 mouseE13_2.fastq.gz 0.1 > mouseE13_2_downsampled.fastq
gzip mouseE13_1_downsampled.fastq
gzip mouseE13_2_downsampled.fastq
#cut TSO
cutadapt -j 16 -g ^AAGCAGTGGTATCAACGCAGAGTGAATGGG -m 10 -o R1_trimmed.fastq.gz "/data2/sjy/patho-DBiT/data/mouseE13_1_downsampled.fastq.gz" > trimming_report1.txt
sbatch ST_Pipeline.sh

And my log file of cutadapt:
This is cutadapt 4.6 with Python 3.8.5
Command line parameters: -j 16 -g ^AAGCAGTGGTATCAACGCAGAGTGAATGGG -m 10 -o R1_trimmed.fastq.gz /data2/sjy/patho-DBiT/data/mouseE13_1_downsampled.fastq.gz
Processing single-end reads on 16 cores ...
Finished in 45.909 s (1.999 µs/read; 30.01 M reads/minute).

=== Summary ===

Total reads processed: 22,961,933
Reads with adapters: 21,987,666 (95.8%)

== Read fate breakdown ==
Reads that were too short: 0 (0.0%)
Reads written (passing filters): 22,961,933 (100.0%)

Total basepairs processed: 3,444,289,950 bp
Total written (filtered): 2,785,198,536 bp (80.9%)

=== Adapter 1 ===

Sequence: AAGCAGTGGTATCAACGCAGAGTGAATGGG; Type: anchored 5'; Length: 30; Trimmed: 21987666 times

No. of allowed errors: 3

Overview of removed sequences
length count expect max.err error counts
27 7706 0.0 2 0 0 0 7706
28 75085 0.0 2 0 0 40068 35017
29 603824 0.0 2 0 374695 136569 92560
30 21072849 0.0 3 18286219 1879507 606209 300914
31 218603 0.0 3 0 124522 62534 31547
32 8854 0.0 3 0 0 2895 5959
33 745 0.0 3 0 0 0 745

st_pipeline log file:
INFO:STPipeline:ST Pipeline 1.8.1
INFO:STPipeline:Output directory: /../patho-DBiT
INFO:STPipeline:Temporary directory: /../patho-DBiT/tmp
INFO:STPipeline:Dataset name: E13
INFO:STPipeline:Forward(R1) input file: mouseE13_2_downsampled.fastq.gz
INFO:STPipeline:Reverse(R2) input file: R1_trimmed.fastq.gz
INFO:STPipeline:Reference mapping STAR index folder: /../patho-DBiT/ref/STARindex
INFO:STPipeline:Reference annotation file: /../patho-DBiT/ref/mm10.knownGene.gtf
INFO:STPipeline:CPU Nodes: 16
INFO:STPipeline:Ids(barcodes) file: /../patho-DBiT/Spatial_barcode_50x50.txt
INFO:STPipeline:TaggD allowed mismatches: 2
INFO:STPipeline:TaggD kmer size: 5
INFO:STPipeline:TaggD overhang: 0
INFO:STPipeline:TaggD metric: Subglobal
INFO:STPipeline:Mapping reverse trimming: 0
INFO:STPipeline:Mapping inverse reverse trimming: 0
INFO:STPipeline:Mapping tool: STAR
INFO:STPipeline:Mapping minimum intron size allowed (splice alignments) with STAR: 1
INFO:STPipeline:Mapping maximum intron size allowed (splice alignments) with STAR: 1
INFO:STPipeline:STAR genome loading strategy NoSharedMemory
INFO:STPipeline:Annotation tool: HTSeq
INFO:STPipeline:Annotation mode: intersection-nonempty
INFO:STPipeline:Annotation strandness yes
INFO:STPipeline:UMIs start position: 16
INFO:STPipeline:UMIs end position: 26
INFO:STPipeline:UMIs allowed mismatches: 1
INFO:STPipeline:UMIs clustering algorithm: AdjacentBi
INFO:STPipeline:Allowing an offset of 250 when clustering UMIs by strand-start in a gene-spot
INFO:STPipeline:Allowing 6 low quality bases in an UMI
INFO:STPipeline:Discarding reads that after trimming are shorter than 18
INFO:STPipeline:Removing polyA sequences of a length of at least: 10
INFO:STPipeline:Removing polyT sequences of a length of at least: 10
INFO:STPipeline:Removing polyG sequences of a length of at least: 10
INFO:STPipeline:Removing polyC sequences of a length of at least: 10
INFO:STPipeline:Removing polyN sequences of a length of at least: 10
INFO:STPipeline:Allowing 0 mismatches when removing homopolymers
INFO:STPipeline:Remove reads whose AT content is 90%
INFO:STPipeline:Remove reads whose GC content is 90%
INFO:STPipeline:Starting the pipeline: 2024-12-09 16:53:40.359164
INFO:STPipeline:Start filtering raw reads 2024-12-09 16:53:40.384107
INFO:STPipeline:Trimming stats total reads (pair): 22961933
INFO:STPipeline:Trimming stats 5722308 reads have been dropped!
INFO:STPipeline:Trimming stats you just lost about 24.92% of your data
INFO:STPipeline:Trimming stats reads remaining: 17239625
INFO:STPipeline:Trimming stats dropped pairs due to incorrect UMI: 0
INFO:STPipeline:Trimming stats dropped pairs due to low quality UMI: 213426
INFO:STPipeline:Trimming stats dropped pairs due to high AT content: 5324537
INFO:STPipeline:Trimming stats dropped pairs due to high GC content: 215
INFO:STPipeline:Trimming stats dropped pairs due to presence of artifacts: 148828
INFO:STPipeline:Trimming stats dropped pairs due to being too short: 35302
INFO:STPipeline:Starting genome alignment 2024-12-09 17:00:04.360453
INFO:STPipeline:Mapping stats:
INFO:STPipeline:Mapping stats are computed from all the pair reads present in the raw files
INFO:STPipeline: Uniquely mapped reads number | 8118993
INFO:STPipeline: Uniquely mapped reads % | 47.09%
INFO:STPipeline: Number of reads mapped to multiple loci | 5053456
INFO:STPipeline: % of reads mapped to multiple loci | 29.31%
INFO:STPipeline: % of reads unmapped: too short | 20.46%
INFO:STPipeline:Total mapped reads: 13172449
INFO:STPipeline:Starting barcode demultiplexing 2024-12-09 17:04:10.328277
INFO:STPipeline:Demultiplexing Mapping stats:
INFO:STPipeline:# Total reads: 13172449
INFO:STPipeline:# Total reads written: 10940135
INFO:STPipeline:# Ambiguous matches: 189111 [1.4356555868995964%]
INFO:STPipeline:# - Non-unique ambiguous matches: 463857
INFO:STPipeline:# Unmatched: 275796 [2.0937336709369685%]
INFO:STPipeline:Starting annotation 2024-12-09 17:16:36.205658
INFO:STPipeline:Annotated reads: 2459510
INFO:STPipeline:Starting creating dataset 2024-12-09 17:33:29.893733
INFO:STPipeline:Number of reads present: 2421189
INFO:STPipeline:Number of unique events (gene-spot) present: 957467
INFO:STPipeline:Number of unique genes present: 64344
INFO:STPipeline:Max number of genes over all spots: 1824
INFO:STPipeline:Min number of genes over all spots: 1
INFO:STPipeline:Max number of reads over all spots: 5463.0
INFO:STPipeline:Min number of reads over all spots: 1.0
INFO:STPipeline:Average number genes per spot: 391.9226361031519
INFO:STPipeline:Average number reads per spot: 991.0720425706099
INFO:STPipeline:Std. number genes per spot: 378.19517881383473
INFO:STPipeline:Std. number reads per spot: 1075.5654300880701
INFO:STPipeline:Number of discarded reads (possible duplicates): 38321
INFO:STPipeline:Total Execution Time: 0:46:00.227009

It seems my problem is with the mapping rate. Thank you for taking a look.

@xieduo7
Author

xieduo7 commented Dec 10, 2024

Hi @jjznn ,

My scenario is very similar to yours, and I am also trying to find ways to improve the mapping rate. Given that the mapping rate computed on the reads that survive trimming is relatively acceptable (about 80%), I think one key to this problem is to retain more reads after trimming. I'm still running some experiments, but haven't made any improvement yet.

Best,
Duo

@Zhiliang-Bai
Owner

Dear all,

Thank you for your valuable feedback. We utilized our in-house developed algorithms, extending beyond the capabilities of the st-pipeline. However, we were unable to fully finalize this aspect at the time of publication. Our team is actively working on this, and we plan to submit a comprehensive computational work addressing this within the next two months. We kindly ask for your patience as we complete this important component.

Thank you for your understanding!

Best,
Zhiliang
