Looking for correct simpleaf commands for 10X4 with GEX/HTO and piscem #165

cs-tum · 2024-11-13T15:47:53Z

Hello,

Thank you for your efforts in building this ecosystem! I'd like to try simpleaf on my dataset (previously analyzed using cellranger 8.0.1). I have a 10x GEM-X (3' V4) mouse dataset with GEX and HTO files. I have the following questions:

1. According to [1], the rlen parameter should be based on the R2 read length for the chemistry. As far as I can see in [2], this would be 90 for 10X3'V4. However, in my FASTQ files all reads (GEX and HTO) are 150 bp long rather than 90 (see question 4 for an example R2 read). Do I have to adjust any settings because of this? For now, I have generated the GEX index with `simpleaf index --output $IDX_DIR --fasta /refdata/refdata-gex-GRCm39-2024-A/fasta/genome.fa --gtf /refdata/refdata-gex-GRCm39-2024-A/genes/genes.gtf --rlen 90 --threads 24 --use-piscem`.
2. What is the correct command to generate the index for the HTOs using piscem? I have two HTOs (BioLegend TotalSeq-B) in the dataset. So far I have only found instructions for salmon [3,4], not for simpleaf with piscem. I'm also confused about which whitelists to use (if any) and when, and about what kind of mapping is needed to reconcile the GEX and HTO barcodes [4,5], if this is still needed today. Does cellranger count do this internally? I ask because the HTO and GEX cell barcodes match well in its output matrices.
3. As I understand it, after indexing I have to run two quant commands, one for GEX and one for HTO. What would these commands look like today with simpleaf for my dataset? I'm assuming chemistry=10xv4-3p (even though the source code [6] says that this is identical to 10XV3, despite the different read lengths, 91 vs 90 [2]), but I'm lost as to what to set for expected-ori, resolution, etc.
4. Specifically for the HTO quant step, I saw that, at least in the past, a custom chemistry was needed (`1{b[16]u[10]x:}2{x[10]r[15]x:}` [7]) when the R2 read carrying the barcode looks like this: GNGTGTTACA*CCTATGGACTTGGAC*TGTGCCCCCGCTTTAAGGCCGGTCCTAGCAACGACGACTGCCACTGCACAGATGGTTGCCTGTCTCTTATACACATCTGACGCTGCCGACGAACTTGTGTCAGTGTAGATCTCGGTGGTCGCCGT. Is this still the case with piscem today?
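
To make the R2 part of the geometry in question 4 concrete, here is a minimal Python sketch of how I read `2{x[10]r[15]x:}`: discard the first 10 bases, keep the next 15 as the feature barcode, and discard the rest. This reading of the geometry string is my assumption, and `split_r2` and `R2_PREFIX` are hypothetical names used only for illustration; the sequence is just the first 43 bases of the R2 read quoted above.

```python
# First 43 bases of the example R2 read from question 4 (truncated for brevity).
R2_PREFIX = "GNGTGTTACACCTATGGACTTGGACTGTGCCCCCGCTTTAAGG"


def split_r2(read, skip=10, feature_len=15):
    """Split an R2 read the way I assume 2{x[10]r[15]x:} describes it.

    skip        -> x[10]: leading bases that are discarded
    feature_len -> r[15]: bases kept as the mappable feature barcode
    remainder   -> x:   : everything after that is discarded
    """
    prefix = read[:skip]
    feature = read[skip:skip + feature_len]
    rest = read[skip + feature_len:]
    return prefix, feature, rest


prefix, feature, rest = split_r2(R2_PREFIX)
print(prefix)   # GNGTGTTACA
print(feature)  # CCTATGGACTTGGAC
```

Under this reading, the 15 bases marked with asterisks in the read above are exactly the part that would be mapped against the HTO index, regardless of whether R2 is 90 or 150 bp long.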

I realize that there is a template [7,8], but I'd like to use individual commands to have a better understanding of the process. Also the barcode_translation link hardcoded in the template is no longer valid [9].

References
[1] https://combine-lab.github.io/alevin-fry-tutorials/2023/simpleaf-piscem/
[2] https://www.10xgenomics.com/support/single-cell-gene-expression/documentation/steps/sequencing/sequencing-requirements-for-single-cell-3
[3] https://divingintogeneticsandgenomics.com/post/how-to-use-salmon-alevin-to-preprocess-cite-seq-data/
[4] https://combine-lab.github.io/alevin-fry-tutorials/2021/af-feature-bc/
[5] COMBINE-lab/salmon#576
[6] Source code
[7] COMBINE-lab/alevin-fry#136
[8] https://github.com/COMBINE-lab/protocol-estuary/blob/main/protocols/10x-feature-barcode-antibody/10x-feature-barcode-antibody.jsonnet
[9] (dead) https://github.com/10XGenomics/cellranger/raw/master/lib/python/cellranger/barcodes/translation/3M-february-2018.txt.gz