Skip to content

TEspeX - pipeline for Transposable Elements expression quantification

License

Notifications You must be signed in to change notification settings

fansalon/TEspeX

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

TEspeX

Read about TEspeX in:

Ansaloni et al., Bioinformatics, 2022: https://doi.org/10.1093/bioinformatics/btac526
Ansaloni et al., BMC Bioinformatics, 2019: https://doi.org/10.1186/s12859-019-3088-7

Overview

TEspeX (Transposable Elements SPEcific eXpression) is a tool for the TE expression quantification from Illummina short RNA-seq reads. The rationale of the pipeline is to map the reads against a reference transcriptome composed by i) TE consensus sequences, ii) coding transcripts and iii) non-coding transcripts and to quantify the expression of TEs avoiding counting reads deriving from TE fragments embedded in coding/non-coding non-TE annotated transcripts.

Techincally, TEspeX:

  • merges the 3 fasta files generating a reference transcriptome
  • builds the transcriptome index (STAR)
  • maps the reads to the transcriptome assigning the best alignment score (AS) to all the alignments showing the best AS
  • selects only the best scoring alignments (samtools)
  • discards all the reads mapping with best scoring alignments on coding transcripts/non-coding transcripts
  • counts only the TE-specific reads

Possible scenarios:

  • A read maps with best alignment score to TE consensus sequences but not to coding/non-coding transcripts --> counted as TE-specific
  • A read maps with best alignment score to TE consensus sequences but not to coding/non-coding transcripts. However, the reads multi-map to >10 loci and it is aligned to all of them with an alignment score (AS) ranging between the maxAS and the maxAS-1 --> discarded as not assignable to a specific TE subfamily
  • A read maps to BOTH TE consensus sequences and coding/non-coding transcripts. However, the alignment(s) on TEs is flagged as best whereas the one(s) on non-TE transcripts is not --> counted as TE-specific
  • A read maps with best alignment score to BOTH TE consensus sequences and coding/non-coding transcripts --> discarded as it may be generated from TE fragments embedded in coding/non-coding transcripts
  • A read maps with best alignment score to coding/non-coding transcripts --> discarded as it is (probably) transcribed from coding/non-coding transcripts.

The pipeline has been written in python3 and it has been tested on Ubuntu, CentOS and Mac OS X systems.

How to install TEspeX on Unix systems

Install TEspeX and TEspeX prerequisites

0. Prerequisites

Please note how this manual assumes you have already installed and in $PATH:

  • gcc
  • make
  • bash

You can easily check this by typing which <name>. If they are not installed, please install them before installing TEspeX.

1. conda

  • first check if conda, activate and deactivate are already installed and in $PATH typing:
    • which conda
    • which activate
    • which deactivate

If paths are printed, you can directly go to 2., otherwise follow the steps below:

  • download and install miniconda
    • download the installer wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    • install it bash Miniconda3-latest-Linux-x86_64.sh
    • follow the prompts on the installer screens
    • there is no need to run conda init. What is, however, required is the add /path/to/miniconda3/bin to the $PATH
    • this depends on where you have installed miniconda3 so first declare where the miniconda3 is conda_dir=</path/to/miniconda3/>
    • then add it to the .bashrc echo "export PATH=\${PATH}:${conda_dir}/bin" >> ~/.bashrc
    • refresh it source ~/.bashrc
    • check conda is in $PATH which conda

2. java (JDK)

  • test if java JDK is already installed and in path:
    • java -version

If this prints something like:

openjdk version "11.0.11" 2021-04-20
OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.18.04)
OpenJDK 64-Bit Server VM (build 11.0.11+9-Ubuntu-0ubuntu2.18.04, mixed mode, sharing)

Skip to 3.; otherwise:

  • conda install -c anaconda openjdk

3. git

  • test if git is already installed:
    • which git

If the git version is printed you can directly go to 4., otherwise type:

  • conda install git

4. zlib

It is likely that you already have zlib installed and properly configured somewhere. To check this type:

  • echo $LD_LIBRARY_PATH
  • echo $CFLAGS
  • echo $LDFLAGS

If paths to zlib lib and include folders are returned, everything should be OK - skip to 5.

Otherwise:

  • cd </path/to/miniconda3/>
  • wget https://www.zlib.net/fossils/zlib-1.2.11.tar.gz
  • tar -zxvf zlib-1.2.11.tar.gz
  • cd zlib-1.2.11/
  • ./configure --prefix=$PWD/packages
  • make
  • make install
  • export LD_LIBRARY_PATH=$PWD/packages/lib/:$LD_LIBRARY_PATH
  • export CFLAGS="-I$PWD/packages/include"
  • export LDFLAGS="-L$PWD/packages/lib"

5. clone TEspeX and create working directory variable

  • cd </path/where/install/TEspeX>
  • git clone https://github.com/fansalon/TEspeX
  • cd TEspeX/
  • tespex=$PWD

6. Picard

A file called 'picard.jar' is contained in the 'bin/picard' directory.
To check whether java is properly installed on your machine and picard properly works, type:

  • java -jar $tespex/bin/picard/picard.jar

If the picard help is printed everything is fine. If an error rises java is not (properly) installed on your machine. Possible solutions: i) go back to 2. and check conda has successfully installed java, ii) check that the java installed by conda is in $PATH and iii) check that the invoked java is really the one installed by conda.

7. STAR

STAR executables should be in the 'bin/STAR-2.6.0c' directory.
To check whether STAR is properly working, type:

  • $tespex/bin/STAR-2.6.0c/STAR --version

this should return: STAR_2.6.0c

8. samtools

install samtools-1.3.1

cd $tespex/bin
wget -O samtools-1.3.1.tar.bz2 https://github.com/samtools/samtools/releases/download/1.3.1/samtools-1.3.1.tar.bz2
tar xjf samtools-1.3.1.tar.bz2
cd samtools-1.3.1
./configure --prefix=$PWD/
(if during the --configure step you encounter an error like: "configure: error: curses development files not found" please relaunch the command adding the "--without-curses" flag)
make
make install
bin/samtools --version

this should return something like:

samtools 1.3.1
Using htslib 1.3.1
Copyright (C) 2016 Genome Research Ltd.

9. create a conda environment where the required version of python, pysam and pandas are installed

To create the conda environment and install the required libraries type:

conda create -n TEspeX_deps --override-channels -c bioconda -c defaults python=3.6 pandas=0.23.0 pysam'>=0.15.0,<=0.15.1'
## --> you will be asked to let conda download and install new packages: type Y

# activate the environment - to be done every time TEspeX is used
# **PLEASE DO NOT CHANGE THE ENVIRONMENT NAME AS IT WILL COMPROMISE THE FUNCTIONING OF THE PIPELINE**
source activate TEspeX_deps

# to check everything properly worked
which python3
## --> /path/to/envs/TEspeX_deps/bin/python3
which pip3
## --> /path/to/envs/TEspeX_deps/bin/pip3
python3 --version
## --> Python 3.6.13 :: Anaconda, Inc.
pip3 --version
## --> pip 21.1.3 from /path/to/envs/TEspeX_deps/lib/python3.6/site-packages/pip (python 3.6)

# deactivate the environment
conda deactivate

10. Test TEspeX has been successfully installed

In the folder 'example' a copy of the files used to perform the TE expression analysis on 2 C. elegans embryonic fastq files (Tintori SC, et al. - Dev. Cell - 2016 - https://www.ncbi.nlm.nih.gov/pubmed/27554860) is deposited. Please, please, DO test whether the pipeline is working properly following the steps below:

  • change to TEspeX dir and create the list of fq/fq.gz to be analysed:

    • cd $tespex
    • ls $tespex/example/*.fastq.gz > $tespex/example/reads.txt
  • activate TEspeX, this has to be done every time TEspeX is launched

    • source activate TEspeX_deps
  • launch the pipeline typing the following command:

    python3 TEspeX.py --TE example/ce.Dfam.fa.gz \
    --cdna example/Caenorhabditis_elegans.WBcel235.cdna.all.fa.gz \
    --ncrna example/Caenorhabditis_elegans.WBcel235.ncrna.fa.gz \
    --sample example/reads.txt --paired F --length 50 --out test --strand no
    
  • check the output files are as expected:

    • cd $tespex
    • diff test/outfile.txt example/outfile.txt
    • diff test/mapping_stats.txt example/mapping_stats.txt

    If nothing is printed it means all went fine. Otherwise, error has rose.

Input files

  • TE consensus sequences in fasta (fa/fa.gz) format (--TE argument). Follow these instructions to generate a proper TE consensus sequences input file (https://github.com/fansalon/TEconsensus)
  • coding transcripts in fasta format (we suggest the '*pc_transcripts.fa' fasta file downloaded from gencode) (--cdna argument)
  • non coding trasncripts in fasta format (we suggest the '*.lncRNA_transcripts.fa' fasta file downloaded from gencode) (--ncrna argument)
  • please, pay specific attention to the selection of the input files. In gencode, the *transcripts.fa and *.lncRNA_transcripts.fa contain duplicated sequences. TEspeX requires no duplicated sequence headers are found across the 3 input fasta files. To this end, prior to start the analysys, TEspeX will check for duplicated sequence headers across the 3 input fasta files and stop the run if found
  • RNA-seq data in fastq (fq/fq.gz) format. TEspeX expects the full path of the fq/fq.gz files to be written in a plain txt file (1 file per row. If paired-end files, TEspeX expects the fq/fq.gz to be listed in two tab-separated columns - PE1 in column 1, PE2 of the same fq/fq.gz in column 2)
# If single-ended
/path/to/fq1
/path/to/fq2
/path/to/fq3
/path/to/fq4

# If paired-ended
/path/to/fq1_1  /path/to/fq1_2
/path/to/fq2_1  /path/to/fq2_2
/path/to/fq3_1  /path/to/fq3_2
/path/to/fq4_1  /path/to/fq4_2

Output files

  • outfile.txt: txt file containing the raw counts of reads mapping specifically on TEs. The first column contains the TE names as they are in the TE consensus fasta file, the other columns contain the read counts for each fq input file
  • mapping_stats.txt: txt file containing mapping statistics

How to run TEspeX

TEspeX can be run calling directly the script from the command line. However, for those users disposing a queue managment system we suggest to use a wrapper script that can launch several jobs in parallel saving consistent amount of time. Wrapper file we use on our slurm queue managment system is provided (see the section TEspeX in wrapper mode for more info).

TEspeX in standard mode

cd $tespex
source activate TEspeX_deps
python3 TEspeX.py --help

This command shows the help that should be something very similar to:

usage: TEspeX.py [-h] --TE TE --cdna CDNA --ncrna NCRNA --sample SAMPLE
                 --paired PAIRED --length LENGTH --out OUT --strand STRAND
                 [--num_threads NUM_THREADS] [--remove REMOVE] [--index INDEX]
                 [--mask MASK] [--version]

optional arguments:
  -h, --help            show this help message and exit
  --TE TE               fa/fa.gz file containing TE consensus sequences in
                        fasta format [required]
  --cdna CDNA           fa/fa.gz file containing cdna Ensembl sequences in
                        fasta format [required]
  --ncrna NCRNA         fa/fa.gz file containing ncrna Ensembl sequences in
                        fasta format [required]
  --sample SAMPLE       txt file containing fq/fq.gz FULL PATHS. If reads are
                        single end, one path should be written in each line.
                        If reads are paired end the two mates should be
                        written in the same line separated by \t [required]
  --paired PAIRED       T (true) or F (false). T means the reads are paired
                        and consequently the sample file is expected to
                        contain 2 columns. F means the reads are not paired,
                        sample file is expected to contain 1 single column
                        [required]
  --length LENGTH       length of the read given as input. This is used to
                        calculate STAR index parameters. If your fq/fq.gz file
                        contains reads with different length specify the most
                        frequent read length [required]
  --out OUT             directory where the output files will be written. This
                        directory is created by the pipeline, specificy a non-
                        yet-existing directory
  --strand STRAND       strandeness of the RNAseq library. no =
                        unstranded/htseqcount 'no', yes = htseqcount 'yes',
                        reverse = htseqcount 'reverse'
  --num_threads NUM_THREADS
                        number of threads used by STAR and samtools [2]
  --remove REMOVE       T (true) or F (false). If this parameter is set to T
                        all the bam files are removed. If it is F they are not
                        removed [T]
  --index INDEX         If you want TEspeX to build the index for you, leave
                        the default value [recommended]. Otherwise provide
                        FULL path to a directoray containing STAR indexes
                        [not_recommended] [F]
  --mask MASK           fasta file containing sequences to be masked. If this
                        file is provided, the sequences contained within it
                        are considered as coding/non-coding transcripts and
                        are added to the --cdna and --ncrna fasta files. This
                        might be of help if the users wish to consider some
                        specific regions as belonging to coding/non-coding
                        transcripts even though they are not reported in
                        --cdna and --ncrna fasta files. (e.g., N kb downstream
                        to the transcript TTS for a better handling of
                        readthrough process or non-genic TE-derived sequences
                        known to be passively transcribed from criptic
                        promoters). [F]
  --multimap MULTIMAP   maximum number of loci a read/read pair is allowed to
                        map to. This value will be provided to STAR
                        --outFilterMultimapNmax and --winAnchorMultimapNmax
                        parameters. It is warmly suggested not to change the
                        default value [10]
  --version             show the version number and exit

All the arguments, except fot --num_threads, --remove, --index, --mask and --multimap are required.
We suggest to use as argument of --TE argument a fasta file containing TE consensus sequences and as arguments of the --cdna and --ncrna arguments the transcriptome files containing cdna and ncrna from ensembl (or genecode if working with human or mouse data).
The --length of the read is only needed to build the index of the reference transcriptome - in case of trimmed reads just provide the most frequent read lenght.
As --strand TEspeX expects the same nomenclature as the one used by htseq-count (if you are unsure about the strandedness of your data please take a look at: https://chipster.csc.fi/manual/library-type-summary.html) .
By default --num_threads is set to 2, --remove is set to T by default (meaning all the bam files are removed) and --index is set to F (meaning TEspeX will take care about index building).

TEspeX in wrapper mode

The file wrapper_slurm.py is the SLURM wrapper file.

To see the help, type:

cd $tespex
source activate TEspeX_deps
python3 wrapper_slurm.py -h

This should print to screen:

usage: wrapper_slurm.py [-h] --script SCRIPT --TE TE --cdna CDNA --ncrna NCRNA
                        --sample SAMPLE --paired PAIRED --length LENGTH --out
                        OUT --strand STRAND --job JOB --q Q --walltime
                        WALLTIME [--num_threads NUM_THREADS] [--remove REMOVE]
                        [--mask MASK] [--module MODULE] [--version]

optional arguments:
  -h, --help            show this help message and exit
  --script SCRIPT       path to the TEspeX*.py pipeline for calculation of TE
                        expression [required]
  --TE TE               fa/fa.gz file containing TE consensus sequences in
                        fasta format [required]
  --cdna CDNA           fa/fa.gz file containing cdna Ensembl sequences in
                        fasta format [required]
  --ncrna NCRNA         fa/fa.gz file containing ncrna Ensembl sequences in
                        fasta format [required]
  --sample SAMPLE       txt file containing fq/fq.gz FULL PATHS. If reads are
                        single end, one path should be written in each line.
                        If reads are paired end the two mates should be
                        written in the same line separated by \t [required]
  --paired PAIRED       T (true) or F (false). T means the reads are paired
                        and consequently the sample file is expected to
                        contain 2 columns. F means the reads are not paired,
                        sample file is expected to contain 1 single column
                        [required]
  --length LENGTH       length of the read given as input. This is used to
                        calculateSTAR index parameters. If your fq/fq.gz file
                        contains reads with different length specify the most
                        frequent read length [required]
  --out OUT             directory where the output files will be written. This
                        directory is created by the pipeline, specificy a non-
                        yet-existing directory
  --strand STRAND       strandeness of the RNAseq library. no =
                        unstranded/htseqcount 'no', yes = htseqcount 'yes',
                        reverse = htseqcount 'reverse'
  --job JOB             number of jobs that can be run at the same time
  --q Q                 name of the the queue of your SLURM system you want
                        TEspeX to be run (SBATCH -p parameter) [required]
  --walltime WALLTIME   walltime of each job TEspeX will launch on your SLURM
                        system reported in hh:mm:ss format (SBATCH -t
                        parameter) [required]
  --num_threads NUM_THREADS
                        number of threads used by STAR and samtools. Minimum
                        number of threads: 4 [4]
  --remove REMOVE       T (true) or F (false). If this parameter is set to T
                        all the bam files are removed. If it is F they are not
                        removed [T]
  --mask MASK           fasta file containing sequences to be masked. If this
                        file is provided, the sequences contained within it
                        are considered as coding/non-coding transcripts and
                        are added to the --cdna and --ncrna fasta files. This
                        might be of help if the users wish to consider some
                        specific regions as belonging to coding/non-coding
                        transcripts even though they are not reported in
                        --cdna and --ncrna fasta files. (e.g., N kb downstream
                        to the transcript TTS for a better handling of
                        readthrough process or non-genic TE-derived sequences
                        known to be passively transcribed from criptic
                        promoters). [F]
  --module MODULE       list of modules to be uploaded within each job by
                        'module load'. If more than one modules need to be
                        loaded provide a comma separated list (e.g.,
                        java,samtools) Reported modules should be available
                        through the 'module av' command. [F]
  --multimap MULTIMAP   maximum number of loci a read/read pair is allowed to
                        map to. This value will be provided to STAR
                        --outFilterMultimapNmax and --winAnchorMultimapNmax
                        parameters. It is warmly suggested not to change the
                        default value [10]
  --version             show the version number and exit

The parameters are exactly the same of TEspeX.py script except for 3 parameters:

  • --module: it might be that while running job some installed modules need to be upoloaded. If this is the case, provide the list of modules to this parameter. TEspeX will then take care to upload the listed modules in each job by module load <module name>
  • --script: it requires the path to TEspeX.py script
  • --q: the name of the SLURM queue to be used for the TEspeX job
  • --walltime: walltime of each job TEspeX will launch on your SLURM system reported in hh:mm:ss format (SBATCH -t parameter)
  • --job: it requires the number of jobs you want to run at the same time. This depends on the settings of your system. If you can run 40 jobs at the same time and you set --job 40 and you have 80 fq/fq.gz written in the txt file given as input to --sample the wrapper_slurm.py script will:
    • subset the --sample file in 40 sub-files containing 2 (80/40) fq/fq.gz each (named: sample0, sample1, .., sample40)
    • create 40 folders (named: 0, 1, .., 40)
    • launch 40 different jobs (named: job_0, job_1, .., job_40)
    • when all the jobs have finished the cleanup.py job is automatically launched and all the output files are merged together

When all is done you should have in your --out folder: slurm output files and 4 output files (TE_transc_reference.fa, TE_transc_reference.fai and the 2 output files mapping_stats_total.txt and outfile_total.txt) and 3 directories (index/, mappings/ and tmp/). In the mappings/ directory there is one directory for each fq/fq.gz analyzed containing all the temporary output files.

How to test for differentially expressed TEs

TEspeX provides a raw-count output file that can be potentially used as input for any of the several tools developed to detect differential expression of genes among two biological conditions (e.g. DESeq2 and edgeR). However, the large majority (all?) of such tools estimate the library size (i.e., sequencing depth) of each analysed sample by summing together the reads mapped on all the genes. While this assumption is perfectly working when analysing gene expression data, this might be unprecise when analysing TE expression data (i.e. the sum of the reads mapping on TE consensus is not a good aproximation of the total library size).

Thus, we strongly recommend not to allow the tool to automatically calculate the library size, providing instead as library size the total number of reads TEspeX has succesfully mapped on the reference transcriptome (i.e., TE consensus+cdna+ncrna). This information is contained in the 3rd column of the mapping_stats.txt/mapping_stats_total.txt TEspeX outputs in the working directory.

In edgeR this can be easily done with the following commands:

norm <- read.table("mapping_stats_total.txt",header=T,sep='\t')
y <- DGEList(counts=counts,group=group,lib.size = norm$mapped)  # assuming conuts is the matrix containing the raw counts and group the factor containing metadata information

It is likely that the perfect tool for testing for DE TEs does not exist yet as most of the assumptions made on gene expression data are not necessarily true in the TE scenario. We have internally tested the DE analysis downstream to TEspeX with i) edgeR, ii) DESeq2 and iii) applying a Welch t.test to normalised counts (RPM: raw counts / tot mapped reads *1M).

Our tests suggest that edgeR works slightly better than both DESeq2 (as it better handles the TEs showing no expression in multiple samples) and t.test (as it is less sensible to the sample size) and we thus suggest to use edgeR. DE testing downstream to TEspeX has been, nevertheless, tested with all the 3 methods providing consistent results in all the tested scenarios.

If you are interested in the development of the perfect tool for testing for DE TEs - or at least to try - feel free to contact Federico at (federico.ansaloni@gmail.com).

Development and help

The TEspeX pipeline has been developed by Federico Ansaloni, former PhD student in the Computational Genomics lab (SISSA/ISAS - Trieste - Italy) of prof. Remo Sanges (https://www.sangeslab.eu).
Nicolo' Gualandi and Mauro Esposito contributed to the pipeline testing.
To report bugs or suggestions please feel free to write to the TEspeX supporting group https://groups.google.com/g/tespex-help

Overview

If you found TEspeX usefull for your research, please cite:

Ansaloni et al., Bioinformatics, 2022: https://doi.org/10.1093/bioinformatics/btac526
Ansaloni et al., BMC Bioinformatics, 2019: https://doi.org/10.1186/s12859-019-3088-7

About

TEspeX - pipeline for Transposable Elements expression quantification

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors 3

  •  
  •  
  •  

Languages