- FastQC (https://github.com/s-andrews/FastQC)
- cutadapt (https://cutadapt.readthedocs.io/en/stable/)
- trimmomatic (http://www.usadellab.org/cms/?page=trimmomatic)
- sortMeRNA (http://bioinfo.lifl.fr/RNA/sortmerna/)
- bwa (https://sourceforge.net/projects/bio-bwa/)
- tophat/STAR (https://ccb.jhu.edu/software/tophat/index.shtml; https://github.com/alexdobin/STAR)
- samtools (http://samtools.sourceforge.net/)
- htseq-count (https://htseq.readthedocs.io/en/release_0.9.1/)
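
Before starting the pipeline, it can help to confirm that the tools listed above are available on the `PATH`. The following is a minimal sketch in Python. The executable names in `TOOLS` are assumptions (installations may use different names; Trimmomatic, for example, is often run from a `.jar` file), and `missing_tools` is a hypothetical helper that is not part of this package.

```python
import shutil

# Assumed executable names; adjust them to match your installation.
TOOLS = [
    "fastqc", "cutadapt", "trimmomatic", "sortmerna", "bwa",
    "tophat", "STAR", "samtools", "htseq-count",
]


def missing_tools(tools=TOOLS, which=shutil.which):
    """Return the names from `tools` that cannot be found on the PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

The `which` argument only exists so the lookup can be replaced in tests; in normal use the default `shutil.which` is sufficient.
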
The host_microbe_mapper was developed to analyse dual RNA-seq expression data from a host and a pathogen. All commands are collected in the script "host_pathogen_mapping.sh", which includes the commands for running Tophat and Bowtie. All required input parameters can be specified in a separate graphical user interface (GUI), which is provided with this package. For questions, please use the GitHub issue tracker.
Furthermore, the reads that sortMeRNA mapped to the 16S rRNA gene can be extracted from the sortMeRNA output and used as input for REAGO, provided that paired-end sequencing data were used. To do this, run the script 'data/extract_16.py' and point it to the output directory of the host_microbe_mapper: 'python data/extract_16.py <OUTPUT_DIR>'
'python GUI_host_pathogen_mapping.py' opens a GUI in which the FastQ files can be selected and the reference sequences can be defined. It creates a 'pipeline.sh' file that, when started, performs the mapping with the help of the 'data/host_pathogen_mapping.sh' script.
To demonstrate the pipeline, we used human, mouse and the pathogen Neisseria meningitidis. For the two hosts only chromosome 19 was used, and reads were generated with ArtificialFastqGenerator.jar (Frampton et al. 2012). The reference sequences are human (NC_000019.10), mouse (NC_000085.6) and N. meningitidis (NC_003112.2).

| Species | GenBank Identifier |
|---|---|
| Homo sapiens | GCF_000001405.37 - 38.p11 |
| Mus musculus | GCF_000001635.26 - 38.p6 |
| Neisseria meningitidis | GCF_000008805.1 - ASM880v1 |

The host_microbe_mapper can also be run without the GUI by calling the script directly:
./data/host_pathogen_mapping.sh -F ${FWD} -R ${RVS} -P ${REF} -C 4 -X /naslx//HOST_MICROBE_MAPPER/host_pathogen_mapping-master/output2 -H $HUMAN -I $MOUSE
where F represents the forward reads, R the reverse reads, P the microbial genome, C the number of processors, X the output directory, H the first host reference genome (here: human) and I the second host reference genome (here: mouse).

| Parameter | Meaning |
|---|---|
| F | forward reads in FASTQ format |
| R | reverse reads in FASTQ format |
| P | microbe reference in FASTA format |
| C | number of processors |
| X | output directory |
| H | human mapping reference |
| I | mouse mapping reference |

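
The command above can also be assembled programmatically, for example when several samples are processed in a loop. The sketch below only builds the argument list from the documented flags (`-F`, `-R`, `-P`, `-C`, `-X`, `-H`, `-I`) and prints it. The function name `build_command`, its parameter names and the example file paths are hypothetical and chosen for illustration; `shlex.join` requires Python 3.8 or later.

```python
import shlex

# Path of the pipeline script as used in the command above.
SCRIPT = "./data/host_pathogen_mapping.sh"


def build_command(fwd, rvs, microbe_ref, out_dir, host1_ref, host2_ref, threads=4):
    """Return the argument list for one pipeline run (nothing is executed)."""
    return [
        SCRIPT,
        "-F", fwd,            # forward reads (FASTQ)
        "-R", rvs,            # reverse reads (FASTQ)
        "-P", microbe_ref,    # microbe reference (FASTA)
        "-C", str(threads),   # number of processors
        "-X", out_dir,        # output directory
        "-H", host1_ref,      # first host reference (here: human)
        "-I", host2_ref,      # second host reference (here: mouse)
    ]


if __name__ == "__main__":
    # Illustrative values only; replace them with your own files.
    cmd = build_command(
        fwd="FWD.fastq",
        rvs="RVS.fastq",
        microbe_ref="references/Neisseria_meningitidis_genome",
        out_dir="output",
        host1_ref="references/Homo_sapiens_chr19.fasta",
        host2_ref="references/Mus_musculus_chr19.fasta",
    )
    print(shlex.join(cmd))
```

To actually start the run, the returned list could be passed to `subprocess.run(cmd, check=True)` from the standard library.
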
14 DECEMBER 2017
Output dir: /naslx//HOST_MICROBE_MAPPER/host_pathogen_mapping-master/output2_28feb18
Pathogen genome: ../read_simulator/references/Neisseria_meningitidis_genome
Eukaryotic genome: ../read_simulator/references/Homo_sapiens_chr19.fasta
Eukaryotic genome: ../read_simulator/references/Mus_musculus_chr19.fasta
The mapping tools 'tophat' was selected for mapping RNA-seq in the host genomes
WORKING directory: /gpfs/proj/abc/tmp.qtfwssmZHr
no subsets - complete set to use
step-0: fastqc
step-1: cutadapt
step-2: trimmomatic
TrimmomaticPE: Started with arguments:
-threads 4 -phred33 /gpfs/proj/abc/tmp.qtfwssmZHr/FWD.fastq /gpfs/proj/abc/tmp.qtfwssmZHr/RVS.fastq /gpfs/proj/abc/tmp.qtfwssmZHr/trimmomatic_forward_paired.fq.gz /gpfs/proj/abc/tmp.qtfwssmZHr/trimmomatic_forward_unpaired.fq.gz /gpfs/proj/abc/tmp.qtfwssmZHr/trimmomatic_reverse_paired.fq.gz /gpfs/proj/abc/tmp.qtfwssmZHr/trimmomatic_reverse_unpaired.fq.gz LEADING:8 TRAILING:10 SLIDINGWINDOW:4:15 MINLEN:50
Input Read Pairs: 7500 Both Surviving: 7500 (100.00%) Forward Only Surviving: 0 (0.00%) Reverse Only Surviving: 0 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully
step-3: sortMeRNA
step-4: bwa (mapping to bacterium)
14970 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
3858 + 0 mapped (25.77% : N/A)
14970 + 0 paired in sequencing
7485 + 0 read1
7485 + 0 read2
2674 + 0 properly paired (17.86% : N/A)
2848 + 0 with itself and mate mapped
1010 + 0 singletons (6.75% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
[bam_sort_core] merging from 0 files and 4 in-memory blocks...
step-5: tophat (mapping to Host)
adding: gpfs/proj/abc/tmp.qtfwssmZHr/tophat/accepted_hits.bam (deflated 9%)
[bam_sort_core] merging from 0 files and 4 in-memory blocks...
step-6: tophat (mapping to Host-2)
[bam_sort_core] merging from 0 files and 4 in-memory blocks...
No count extraction for - pathogen
No count extraction for - host1
No count extraction for - host2
INFO: Calculations are stored under: tmp.qtfwssmZHr
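
The read counts reported for the bwa step in the log above follow the layout printed by `samtools flagstat` (`<passed> + <failed> <category>`), so they can be parsed to re-check the reported percentages. The following is a minimal sketch, assuming the relevant log lines are available as text. `parse_flagstat` and `percent` are hypothetical helpers, and the parser only handles lines of the form shown in the log.

```python
import re

# Lines copied from the bwa step of the example log above.
LOG = """\
14970 + 0 in total (QC-passed reads + QC-failed reads)
3858 + 0 mapped (25.77% : N/A)
2674 + 0 properly paired (17.86% : N/A)
1010 + 0 singletons (6.75% : N/A)
"""

# <passed> + <failed> <category>, optionally followed by "(...)".
LINE = re.compile(r"^(\d+) \+ (\d+) (.+?)(?: \(.*\))?$")


def parse_flagstat(text):
    """Map each category name to its QC-passed read count.

    If a category name occurs more than once, the first occurrence is kept.
    """
    counts = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            counts.setdefault(match.group(3), int(match.group(1)))
    return counts


def percent(part, total):
    """Return part/total as a percentage rounded to two decimals."""
    return round(100.0 * part / total, 2)


if __name__ == "__main__":
    counts = parse_flagstat(LOG)
    total = counts["in total"]
    for key in ("mapped", "properly paired", "singletons"):
        print(f"{key}: {counts[key]} of {total} reads "
              f"({percent(counts[key], total):.2f}%)")
```

Recomputing the fractions this way reproduces the percentages shown in the log (25.77%, 17.86% and 6.75% of 14970 reads).
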