
How to produce RV=$tmp/${sample}_R1_filtered.final.fastq.gz #1

Open

xieduo7 opened this issue Oct 25, 2024 · 11 comments

Comments

@xieduo7

xieduo7 commented Oct 25, 2024

Hi @Zhiliang-Bai ,

Thanks for your code.

I am wondering how to generate RV=$tmp/${sample}_R1_filtered.final.fastq.gz from the raw reads before running st_pipeline_run.py. Is it done by trimming read 1 with Cutadapt? I was confused because I saw that st_pipeline_run.py can also trim the reads.

Thank you!

Best,
Duo

xieduo7 changed the title from "The command used to trim read 1 using Cutadapt" to "How to produce RV=$tmp/${sample}_R1_filtered.final.fastq.gz" on Oct 25, 2024
@Sina-soton

I am also stuck at the same point. Would greatly appreciate any advice!

Best wishes
Sina

@Zhiliang-Bai
Owner

Hi Duo and Sina,

The raw data I uploaded to GEO has already been filtered, with the R1 file for each sample ready to be used as input for the st_pipeline. However, the GEO team has requested the upload of the original, untrimmed raw data, and I am currently working on addressing this request.
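Roughly speaking, the filtered R1 from GEO goes in as the transcript (reverse) read of st_pipeline_run.py. The sketch below is illustrative only; the sample name, paths, and options are placeholders rather than the exact command we used:

# Illustrative sketch only; placeholder sample name, paths and options
tmp=./tmp
sample=MouseE13_Rep1_50um
FW=$tmp/${sample}_R2.fastq.gz                  # read assumed to carry the spatial barcode and UMI
RV=$tmp/${sample}_R1_filtered.final.fastq.gz   # filtered R1 downloaded from GEO (cDNA read)
st_pipeline_run.py \
  --expName $sample \
  --ids Spatial_barcode_50x50.txt \
  --ref-map /path/to/STARindex \
  --ref-annotation /path/to/annotation.gtf \
  --output-folder ./output \
  --temp-folder $tmp \
  $FW $RV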

Best,
Zhiliang

@xieduo7
Author

xieduo7 commented Nov 30, 2024

Hi Dr. @Zhiliang-Bai,

Thanks for your reply.

I think we still need to trim the TSO primer (AAGCAGTGGTATCAACGCAGAGTGAATrGrG+G, as denoted in Supplementary Table 3) before running st_pipeline, as in the 10x pipeline.
I saw lots of TSO primer sequences in the GEO R1 reads:
[screenshot: R1 reads containing the TSO primer sequence]

st_pipeline does not include this trimming step, so if we don't trim these TSO primers they will introduce many mismatches and may prevent these reads from mapping to the reference genome.
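Something along these lines is what I have in mind (the sequence is the DNA form of the TSO, i.e. the riboGs written as GGG; file names are only examples):

# example only: remove the 5'-anchored TSO from the cDNA read before st_pipeline
cutadapt -g ^AAGCAGTGGTATCAACGCAGAGTGAATGGG \
         -m 10 \
         -o sample_R1_TSOtrimmed.fastq.gz \
         sample_R1.fastq.gz > cutadapt_report.txt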

Did you do this before running st_pipeline?

Thank you!

Best,
Duo

@Sina-soton

Hi Zhiliang and Duo

Thank you very much for your reply Zhiliang.

I agree with Duo: are there any specific trimming requirements before alignment? I am having trouble getting any significant mapping of the read 1 transcripts to the reference genome.

Best wishes
Sina

@jjznn

jjznn commented Dec 9, 2024

I trimmed the TSO primer using this command: cutadapt -j 16 -g ^AAGCAGTGGTATCAACGCAGAGTGAATGGG -m 10 -o R1_trimmed.fastq.gz mouseE13_1.fastq.gz > trimming_report1.txt
Then I ran st_pipeline. However, when it gets to the STAR mapping step, the uniquely mapped rate is still low:
[screenshot: STAR mapping statistics]
So I suppose there's still something wrong. Can anybody help me out?
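If it helps anyone diagnose this, a quick check for leftover TSO sequence in the trimmed file would be something like (file name as in my command above; only the first 27 nt of the TSO are used since the terminal riboGs are variable):

# count trimmed reads whose sequence line still starts with the TSO; should be close to zero
zcat R1_trimmed.fastq.gz | awk 'NR % 4 == 2' | grep -c "^AAGCAGTGGTATCAACGCAGAGTGAAT"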

@xieduo7
Author

xieduo7 commented Dec 9, 2024

Dear @jjznn ,

Is the sample you showed GSM8454077_MouseE13_Rep1_50um? What is the number of reads before mapping (i.e., after trimming)?
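A quick way to count reads in a gzipped FASTQ (file name taken from your cutadapt command) would be something like:

# four lines per FASTQ record
echo $(( $(zcat R1_trimmed.fastq.gz | wc -l) / 4 ))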

Best,
Duo

@jjznn

jjznn commented Dec 9, 2024 via email

@xieduo7
Author

xieduo7 commented Dec 9, 2024

Hi @jjznn ,
What fraction of reads was removed by trimming? Could you post the full log file?

@jjznn

jjznn commented Dec 10, 2024

Hi @xieduo7 ,
My command lines are as follows:
#downsample
seqtk sample -s100 mouseE13_1.fastq.gz 0.1 > mouseE13_1_downsampled.fastq
seqtk sample -s100 mouseE13_2.fastq.gz 0.1 > mouseE13_2_downsampled.fastq
gzip mouseE13_1_downsampled.fastq
gzip mouseE13_2_downsampled.fastq
#cut TSO
cutadapt -j 16 -g ^AAGCAGTGGTATCAACGCAGAGTGAATGGG -m 10 -o R1_trimmed.fastq.gz "/data2/sjy/patho-DBiT/data/mouseE13_1_downsampled.fastq.gz" > trimming_report1.txt
sbatch ST_Pipeline.sh

And my log file of cutadapt:
This is cutadapt 4.6 with Python 3.8.5
Command line parameters: -j 16 -g ^AAGCAGTGGTATCAACGCAGAGTGAATGGG -m 10 -o R1_trimmed.fastq.gz /data2/sjy/patho-DBiT/data/mouseE13_1_downsampled.fastq.gz
Processing single-end reads on 16 cores ...
Finished in 45.909 s (1.999 µs/read; 30.01 M reads/minute).

=== Summary ===

Total reads processed: 22,961,933
Reads with adapters: 21,987,666 (95.8%)

== Read fate breakdown ==
Reads that were too short: 0 (0.0%)
Reads written (passing filters): 22,961,933 (100.0%)

Total basepairs processed: 3,444,289,950 bp
Total written (filtered): 2,785,198,536 bp (80.9%)

=== Adapter 1 ===

Sequence: AAGCAGTGGTATCAACGCAGAGTGAATGGG; Type: anchored 5'; Length: 30; Trimmed: 21987666 times

No. of allowed errors: 3

Overview of removed sequences
length count expect max.err error counts
27 7706 0.0 2 0 0 0 7706
28 75085 0.0 2 0 0 40068 35017
29 603824 0.0 2 0 374695 136569 92560
30 21072849 0.0 3 18286219 1879507 606209 300914
31 218603 0.0 3 0 124522 62534 31547
32 8854 0.0 3 0 0 2895 5959
33 745 0.0 3 0 0 0 745

st_pipeline log file:
INFO:STPipeline:ST Pipeline 1.8.1
INFO:STPipeline:Output directory: /../patho-DBiT
INFO:STPipeline:Temporary directory: /../patho-DBiT/tmp
INFO:STPipeline:Dataset name: E13
INFO:STPipeline:Forward(R1) input file: mouseE13_2_downsampled.fastq.gz
INFO:STPipeline:Reverse(R2) input file: R1_trimmed.fastq.gz
INFO:STPipeline:Reference mapping STAR index folder: /../patho-DBiT/ref/STARindex
INFO:STPipeline:Reference annotation file: /../patho-DBiT/ref/mm10.knownGene.gtf
INFO:STPipeline:CPU Nodes: 16
INFO:STPipeline:Ids(barcodes) file: /../patho-DBiT/Spatial_barcode_50x50.txt
INFO:STPipeline:TaggD allowed mismatches: 2
INFO:STPipeline:TaggD kmer size: 5
INFO:STPipeline:TaggD overhang: 0
INFO:STPipeline:TaggD metric: Subglobal
INFO:STPipeline:Mapping reverse trimming: 0
INFO:STPipeline:Mapping inverse reverse trimming: 0
INFO:STPipeline:Mapping tool: STAR
INFO:STPipeline:Mapping minimum intron size allowed (splice alignments) with STAR: 1
INFO:STPipeline:Mapping maximum intron size allowed (splice alignments) with STAR: 1
INFO:STPipeline:STAR genome loading strategy NoSharedMemory
INFO:STPipeline:Annotation tool: HTSeq
INFO:STPipeline:Annotation mode: intersection-nonempty
INFO:STPipeline:Annotation strandness yes
INFO:STPipeline:UMIs start position: 16
INFO:STPipeline:UMIs end position: 26
INFO:STPipeline:UMIs allowed mismatches: 1
INFO:STPipeline:UMIs clustering algorithm: AdjacentBi
INFO:STPipeline:Allowing an offset of 250 when clustering UMIs by strand-start in a gene-spot
INFO:STPipeline:Allowing 6 low quality bases in an UMI
INFO:STPipeline:Discarding reads that after trimming are shorter than 18
INFO:STPipeline:Removing polyA sequences of a length of at least: 10
INFO:STPipeline:Removing polyT sequences of a length of at least: 10
INFO:STPipeline:Removing polyG sequences of a length of at least: 10
INFO:STPipeline:Removing polyC sequences of a length of at least: 10
INFO:STPipeline:Removing polyN sequences of a length of at least: 10
INFO:STPipeline:Allowing 0 mismatches when removing homopolymers
INFO:STPipeline:Remove reads whose AT content is 90%
INFO:STPipeline:Remove reads whose GC content is 90%
INFO:STPipeline:Starting the pipeline: 2024-12-09 16:53:40.359164
INFO:STPipeline:Start filtering raw reads 2024-12-09 16:53:40.384107
INFO:STPipeline:Trimming stats total reads (pair): 22961933
INFO:STPipeline:Trimming stats 5722308 reads have been dropped!
INFO:STPipeline:Trimming stats you just lost about 24.92% of your data
INFO:STPipeline:Trimming stats reads remaining: 17239625
INFO:STPipeline:Trimming stats dropped pairs due to incorrect UMI: 0
INFO:STPipeline:Trimming stats dropped pairs due to low quality UMI: 213426
INFO:STPipeline:Trimming stats dropped pairs due to high AT content: 5324537
INFO:STPipeline:Trimming stats dropped pairs due to high GC content: 215
INFO:STPipeline:Trimming stats dropped pairs due to presence of artifacts: 148828
INFO:STPipeline:Trimming stats dropped pairs due to being too short: 35302
INFO:STPipeline:Starting genome alignment 2024-12-09 17:00:04.360453
INFO:STPipeline:Mapping stats:
INFO:STPipeline:Mapping stats are computed from all the pair reads present in the raw files
INFO:STPipeline: Uniquely mapped reads number | 8118993
INFO:STPipeline: Uniquely mapped reads % | 47.09%
INFO:STPipeline: Number of reads mapped to multiple loci | 5053456
INFO:STPipeline: % of reads mapped to multiple loci | 29.31%
INFO:STPipeline: % of reads unmapped: too short | 20.46%
INFO:STPipeline:Total mapped reads: 13172449
INFO:STPipeline:Starting barcode demultiplexing 2024-12-09 17:04:10.328277
INFO:STPipeline:Demultiplexing Mapping stats:
INFO:STPipeline:# Total reads: 13172449
INFO:STPipeline:# Total reads written: 10940135
INFO:STPipeline:# Ambiguous matches: 189111 [1.4356555868995964%]
INFO:STPipeline:# - Non-unique ambiguous matches: 463857
INFO:STPipeline:# Unmatched: 275796 [2.0937336709369685%]
INFO:STPipeline:Starting annotation 2024-12-09 17:16:36.205658
INFO:STPipeline:Annotated reads: 2459510
INFO:STPipeline:Starting creating dataset 2024-12-09 17:33:29.893733
INFO:STPipeline:Number of reads present: 2421189
INFO:STPipeline:Number of unique events (gene-spot) present: 957467
INFO:STPipeline:Number of unique genes present: 64344
INFO:STPipeline:Max number of genes over all spots: 1824
INFO:STPipeline:Min number of genes over all spots: 1
INFO:STPipeline:Max number of reads over all spots: 5463.0
INFO:STPipeline:Min number of reads over all spots: 1.0
INFO:STPipeline:Average number genes per spot: 391.9226361031519
INFO:STPipeline:Average number reads per spot: 991.0720425706099
INFO:STPipeline:Std. number genes per spot: 378.19517881383473
INFO:STPipeline:Std. number reads per spot: 1075.5654300880701
INFO:STPipeline:Number of discarded reads (possible duplicates): 38321
INFO:STPipeline:Total Execution Time: 0:46:00.227009

It seems my problem is with the mapping rate. Thank you for taking a look.

@xieduo7
Author

xieduo7 commented Dec 10, 2024

Hi @jjznn ,

My scenario is very similar to yours, and I am also trying to find ways to improve the mapping rate. Given that the mapping rate computed on the reads that survive trimming is relatively acceptable (about 80%), I think one key to this problem is to retain more reads after trimming. I'm still running some experiments, but haven't made any improvement yet.

Best,
Duo

@Zhiliang-Bai
Owner

Dear all,

Thank you for your valuable feedback. We utilized our in-house developed algorithms, extending beyond the capabilities of the st-pipeline. However, we were unable to fully finalize this aspect at the time of publication. Our team is actively working on this, and we plan to submit a comprehensive computational work addressing this within the next two months. We kindly ask for your patience as we complete this important component.

Thank you for your understanding!

Best,
Zhiliang
