- This repository contains a protocol to analyse RNA-seq data, focusing on alternative splicing & polyadenylation, authored by Oliver Ziff.
- The contents are based on multiple resources including:
- RNAseq worksheet
- Biostars handbook
- rnaseq.wiki
- RNA-seqlopedia
- RNA Seq blog
- Bioconductor Course Materials
- Data Camp
- Coursera
- and most importantly the experience of established experts in RNAseq analysis within the Luscombe lab - my host laboratory.
- RNA-seq workflow: the rnaseqGene vignette, as served by the local R help server at http://127.0.0.1:13884/library/rnaseqGene/doc/rnaseqGene.html
- The protocol uses a combination of bash (Unix command line) and R scripts.
- FAQs: https://journals.plos.org/ploscompbiol/article/file?type=supplementary&id=info:doi/10.1371/journal.pcbi.1004393.s009
- Tools: https://journals.plos.org/ploscompbiol/article/file?type=supplementary&id=info:doi/10.1371/journal.pcbi.1004393.s004
- RNA seq workflow
- Wet-lab RNA sequencing phase
- Accessing sequencing data
- QC of sequencing files
- Alignment
- Visualisation in IGV browser
- QC of aligned reads
- Read quantification
- Differential expression analysis
- Splicing analysis
- Gene enrichment analysis
The aim of RNA-seq is to interrogate relative transcript abundance and diversity. Its accuracy is superior to microarrays and similar to qPCR. Applications include:
- transcript discovery
- genome annotation
- alternative expression analysis
- gene fusion detection
- viral detection
- RNA editing detection (CRISPR/Cas9)
- Extract & isolate RNA
- Prepare library: break RNA into small fragments, enrich nonribosomal RNA, convert to cDNA, construct fragment library (add sequencing adapters, PCR amplify)
- High-throughput sequencing of the cDNA library: generate single- or paired-end reads of 30-300 bp in length. Key concepts: flow cell, base calling & quality scores, replicates (technical = multiple lanes in a flow cell; biological = multiple samples from each condition)
https://www.biostarhandbook.com/rnaseq/rnaseq-intro.html
- Process raw reads: download FASTQ files from SRA, quality scores (Phred), paired- vs single-end sequencing, FastQC quality control, variability, spike-ins, blocking & randomisation, filter out low-quality reads & artifacts (adapter sequence reads).
- Align (map) reads to a reference genome (FASTA, GFF, GTF): annotation file (BED), alignment program (STAR, HISAT), reference genomes (GENCODE, Ensembl), generate a genome index, create & manipulate SAM/BAM files containing sequence alignment data
- Visualise & explore alignment data in IGV and RStudio: ggplot2, bias identification with QoRTs
- Quantify reads (estimate abundance) with gene-based read counting
- Compare abundances between conditions & replicates (differential expression): normalise by adjusting each gene's read count for the total aligned reads within each sample. Summarise the data with pairwise correlation, hierarchical clustering and PCA to look for differences between samples & identify outliers to consider excluding.
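The normalisation step above (adjusting each gene's count for the total reads in its sample) can be sketched with counts per million (CPM). This is a minimal illustration, assuming a tab-separated gene-by-sample count table; the file contents and sample names are made up, and CPM is only one simple scaling, not the size-factor normalisation that DESeq performs.

```shell
# Toy gene-by-sample count table (tab-separated); values are invented for illustration.
counts_file="$(mktemp)"
printf 'gene\tsampleA\tsampleB\n'  > "$counts_file"
printf 'gene1\t100\t400\n'        >> "$counts_file"
printf 'gene2\t300\t400\n'        >> "$counts_file"
printf 'gene3\t600\t1200\n'       >> "$counts_file"

# The file is read twice: pass 1 sums each sample column (library size),
# pass 2 rescales every count to CPM = count / library_size * 1e6.
cpm_table="$(awk -F'\t' -v OFS='\t' '
    NR == FNR { if (FNR > 1) for (i = 2; i <= NF; i++) total[i] += $i; next }
    FNR == 1  { print; next }
    { for (i = 2; i <= NF; i++) $i = sprintf("%.1f", $i / total[i] * 1e6); print }
' "$counts_file" "$counts_file")"

printf '%s\n' "$cpm_table"
rm -f "$counts_file"
```

After scaling, each sample column sums to one million, so genes can be compared between samples that were sequenced to different depths.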
On the CAMP cluster most packages are preinstalled, but to use them you need to load them with the module load function (ml is its shorthand):
ml STAR
ml ncbi-vdb
ml fastq-tools
ml SAMtools
ml RSeQC
ml QoRTs
ml multiqc
ml Subread
ml Java
Use module spider to search for packages.
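As a rough sketch, the modules listed above could be loaded from one setup script. It assumes the cluster provides the Lmod module command and that the module names match the list in this protocol; names and available versions differ between clusters, so treat them as placeholders.

```shell
# Modules named in this protocol; exact names and versions vary between clusters.
modules="STAR ncbi-vdb fastq-tools SAMtools RSeQC QoRTs multiqc Subread Java"

if command -v module >/dev/null 2>&1; then
    for m in $modules; do
        # Report a missing module instead of aborting the whole setup.
        module load "$m" || echo "could not load $m; try: module spider $m" >&2
    done
    module list    # show what is now loaded
else
    echo "module command not found; this sketch only works where Lmod is set up" >&2
fi
```

Running module spider followed by a name (for example module spider STAR) lists the versions installed on the cluster, which helps when a plain module load fails.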
Install conda and activate bioconda
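One way to activate bioconda is to add its channels to an existing conda installation. The sketch below assumes conda is already installed and on the PATH; the channel order follows the usual bioconda setup, and the package names in the final comment are examples only.

```shell
# Channels are added in this order so that conda-forge ends up with the highest priority.
channels="bioconda conda-forge"

if command -v conda >/dev/null 2>&1; then
    for ch in $channels; do
        conda config --add channels "$ch"
    done
    conda config --set channel_priority strict
    conda config --show channels    # confirm the resulting channel list
    # Tools can then be installed by name, for example:
    # conda install star samtools subread multiqc
else
    echo "conda not found; install Miniconda or a similar distribution first" >&2
fi
```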
Installing packages in R
install.packages("package_name")
Bioconductor is a free software project for genomic analyses based on R programming.
Install Bioconductor
# biocLite() was superseded by BiocManager from Bioconductor 3.8
install.packages("BiocManager")
BiocManager::install("package_name")
# erccdashboard (for artificial spike-in quantification)
BiocManager::install("erccdashboard")
BiocManager::install("DESeq")
Even though packages have been installed into R locally, they need to be loaded into the current session before you can use them:
library("erccdashboard")
library("DESeq")