
Code to produce an IntroVerse-like database from any of the RNA-sequencing projects available on recount3.


joshuagi/recount3-database-project

 
 


Splicing accuracy varies across human introns, tissues, age and disease

Sonia Garcia-Ruiz, David Zhang, Emil K Gustavsson, Guillermo Rocamora-Perez, Melissa Grant-Peters, Aine Fairbrother-Browne, Regina H Reynolds, Jonathan W Brenton, Ana L Gil-Martinez, Zhongbo Chen, Donald C Rio, Juan A Botia, Sebastian Guelfi, Leonardo Collado-Torres, Mina Ryten

bioRxiv 2023.03.29.534370; doi: https://doi.org/10.1101/2023.03.29.534370

Overview

The recount3-database-project repository contains the code used to generate all the databases produced for the manuscript "Splicing accuracy varies across human introns, tissues, age and disease". It contains R scripts that build three different SQLite databases from exon-exon junction data, using publicly available datasets (recount3, GTEx v8 and ENCODE).

Table of Contents

  1. Installation
  2. Usage
  3. Function Descriptions
  4. Dependencies
  5. Environments
  6. License

Installation

To use the scripts in this repository, ensure you have R installed. You can install the required packages by running:

From R:

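install.packages(c("dplyr", "DBI", "RSQLite", "BiocManager"))
## recount3 is distributed through Bioconductor rather than CRAN.
BiocManager::install("recount3")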

From Bash command-line:

git clone https://github.com/SoniaRuiz/recount3-database-project.git
cd recount3-database-project

Usage

The init.R script downloads publicly available junction data from any recount3 project and builds a junction database.

## Remember to set the variable `recount3_project_IDs` to the ID of the recount3 project that you would like to download and build a database from.
source("init.R")

The init_age.R script utilizes previously downloaded junction data from the GTEx project, stratifying samples by age before constructing a junction database.

It first clusters the GTEx samples into the age groups "20-39", "40-59" and "60-79" years old. Then, reusing the exon-exon junction data previously downloaded to create the Splicing database (init.R), it clusters the exon-exon split reads and count matrices across the samples of each age category. Next, it pairs the annotated split reads with the novel donor and novel acceptor split reads across the samples of each age cluster at the tissue level. Finally, it creates the "Age Stratification" intron database.

## To run the age stratification of the GTEx samples, it is necessary to have downloaded, processed and databased the GTEx v8 junctions using the `init.R` script.
source("init_age.R")
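
The age grouping described above can be illustrated with a minimal shell sketch. It is not the code used by init_age.R: the function name `assign_age_group` and the two-column, tab-separated input (sample ID, 10-year age bracket such as "30-39") are assumptions made for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the repository.
# Reads tab-separated lines (sample_id, age_bracket) from stdin and appends
# the age supergroup ("20-39", "40-59" or "60-79") as a third column.
assign_age_group() {
  awk -F'\t' 'BEGIN { OFS = "\t" }
  {
    split($2, bounds, "-")      # "30-39" -> bounds[1] = "30"
    lower = bounds[1] + 0       # lower bound of the bracket, as a number
    if      (lower >= 20 && lower < 40) grp = "20-39"
    else if (lower >= 40 && lower < 60) grp = "40-59"
    else if (lower >= 60 && lower < 80) grp = "60-79"
    else                                grp = "NA"   # outside the three groups
    print $1, $2, grp
  }'
}

# Example with three made-up samples.
printf 'S1\t20-29\nS2\t50-59\nS3\t70-79\n' | assign_age_group
```

Samples whose bracket falls outside the three supergroups are labelled "NA" here; how the actual script treats such samples is not specified in this README.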

The init_ENCODE.R script downloads BAM files from the ENCODE platform, extracts junctions, and generates a junction database. The BAM files downloaded correspond to the knockdown experiments of RNA-binding proteins involved in post-transcriptional processes published by Van Nostrand et al. (2020).

source("init_ENCODE.R")

Function Descriptions

init.R:

Data download and processing:

  • download_recount3_data(): downloads, processes and annotates, using a given Ensembl annotation, the exon-exon junction split read data from the indicated recount3 project.
  • prepare_recount3_data(): groups the processed split-read data by sample cluster for a given recount3 project.

Junction pairing:

  • junction_pairing(): pairs novel junctions and annotated introns across the samples of each sample cluster.
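
As an illustration of the pairing idea, the sketch below matches each novel junction to an annotated intron with which it shares one splice site. It rests on assumptions that this README does not state: a novel donor junction is taken to share its acceptor site with the annotated intron, a novel acceptor junction its donor site, and the input files are hypothetical tab-separated tables. The function name `pair_junctions` is also made up; the repository's junction_pairing() is written in R and additionally works per sample cluster.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the repository's junction_pairing().
# $1: annotated introns, TSV: chr, start, end, strand, intron_id
# $2: novel junctions,   TSV: chr, start, end, strand, type, junction_id
#     where type is "novel_donor" or "novel_acceptor"
# Prints: junction_id, type, paired intron_id
pair_junctions() {
  awk -F'\t' 'BEGIN { OFS = "\t" }
  NR == FNR {
    # On the + strand the donor site is the start coordinate and the
    # acceptor site is the end coordinate; on the - strand it is reversed.
    donor    = ($4 == "+") ? $2 : $3
    acceptor = ($4 == "+") ? $3 : $2
    # Simplification: if several introns share a site, the last one wins.
    by_donor[$1 FS $4 FS donor]       = $5
    by_acceptor[$1 FS $4 FS acceptor] = $5
    next
  }
  {
    donor    = ($4 == "+") ? $2 : $3
    acceptor = ($4 == "+") ? $3 : $2
    key_don  = $1 FS $4 FS donor
    key_acc  = $1 FS $4 FS acceptor
    if ($5 == "novel_donor" && (key_acc in by_acceptor))
      print $6, $5, by_acceptor[key_acc]
    else if ($5 == "novel_acceptor" && (key_don in by_donor))
      print $6, $5, by_donor[key_don]
  }' "$1" "$2"
}

# Example with made-up coordinates.
demo_dir=$(mktemp -d)
printf 'chr1\t100\t200\t+\tintronA\n' > "$demo_dir/annotated.tsv"
printf 'chr1\t120\t200\t+\tnovel_donor\tj1\nchr1\t100\t180\t+\tnovel_acceptor\tj2\n' \
  > "$demo_dir/novel.tsv"
pair_junctions "$demo_dir/annotated.tsv" "$demo_dir/novel.tsv"
rm -r "$demo_dir"
```

Novel junctions that share no splice site with any annotated intron produce no output line in this sketch.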

Junction processing prior to databasing:

  • get_all_annotated_split_reads(): loops through the sample clusters of the current recount3 project and obtains all unique split reads found across their samples.
  • get_all_raw_jxn_pairings(): loops through the sample clusters of the current recount3 project and obtains all junction pairings.
  • tidy_data_pior_sql(): discards ambiguous junctions and prepares the data prior to the generation of the SQL database.
  • generate_transcript_biotype_percentage(): calculates the percentage of protein-coding transcripts in which a given junction may appear.
  • generate_recount3_tpm(): obtains the raw count data and transforms them (scaling by library size). Calculates the median gene TPM across all samples of each sample cluster.
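
For reference, the median TPM summary mentioned in the last item can be written with the standard TPM definition. This is a sketch under assumptions: the symbols below (read count c, gene length ℓ, sample cluster k) are introduced here for illustration, and the exact transformation applied by generate_recount3_tpm() may differ in detail.

```latex
% c_{g,s}: read count of gene g in sample s;  \ell_g: length of gene g
\mathrm{TPM}_{g,s} = 10^{6} \cdot
  \frac{c_{g,s} / \ell_g}{\sum_{h} c_{h,s} / \ell_h},
\qquad
\mathrm{TPM}^{\mathrm{median}}_{g,k} =
  \operatorname{median}_{s \in k} \, \mathrm{TPM}_{g,s}
```

The first expression normalises each gene by its length and by the per-sample total, and the second takes the median over the samples s of a cluster k.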

SQLite database generation:

  • sql_database_generation()

init_age.R:

  • age_stratification_init_data(): clusters the GTEx v8 samples by age supergroup, i.e. "20-39", "40-59" and "60-79" years-old.
  • age_stratification_annotate(): creates the split reads annotation files per age group.
  • age_stratification_junction_pairing(): performs the split read junction pairing between novel junctions and annotated introns across the samples of each age group.
  • get_all_annotated_split_reads()
  • get_all_raw_jxn_pairings()
  • generate_recount3_tpm()
  • tidy_data_pior_sql()
  • generate_transcript_biotype_percentage()
  • sql_database_generation()

init_ENCODE.R:

  • ENCODE_download_metadata(): downloads metadata from the gene-silencing knockdown experiments of RBPs from the ENCODE platform. Code adapted from https://github.com/guillermo1996.
  • ENCODE_download_bams(): downloads the BAM files corresponding to the gene-silencing knockdown experiments of RBPs from the ENCODE platform. Code adapted from https://github.com/guillermo1996.
  • prepare_encode_data(): extracts the exon-exon junction split read data from each ENCODE knockdown experiment, then processes and annotates them.
  • junction_pairing()
  • get_all_annotated_split_reads()
  • get_all_raw_jxn_pairings()
  • tidy_data_pior_sql()
  • generate_transcript_biotype_percentage()
  • sql_database_generation()

Testing Framework

The testing framework was developed by the talented Guillermo Rocamora. Library used: https://testthat.r-lib.org/

Dependencies

  1. regtools (visit: https://regtools.readthedocs.io/en/latest/)
$ git clone https://github.com/griffithlab/regtools
$ cd regtools/
$ mkdir build
$ cd build/
$ cmake ..
$ make
  2. samtools (visit: https://www.htslib.org/download/)
$ wget https://github.com/samtools/samtools/releases/download/1.21/samtools-1.21.tar.bz2
$ tar -xvjf samtools-1.21.tar.bz2
$ cd samtools-1.21/
$ ./configure --prefix=/absolute_path/to_your_samtools_folder/samtools-1.21/
$ make
$ make install
  3. 'hg38-blacklist.v2.bed': file containing the ENCODE blacklist regions for hg38 (paper: https://www.nature.com/articles/s41598-019-45839-z).
$ wget https://github.com/Boyle-Lab/Blacklist/raw/refs/heads/master/lists/hg38-blacklist.v2.bed.gz
$ gzip -d hg38-blacklist.v2.bed.gz
  4. 'Homo_sapiens.GRCh38.111.chr.gtf'
$ wget http://ftp.ensembl.org/pub/release-111/gtf/homo_sapiens/Homo_sapiens.GRCh38.111.chr.gtf.gz
$ gzip -d Homo_sapiens.GRCh38.111.chr.gtf.gz
  5. 'MANE.GRCh38.v1.0.ensembl_genomic.gtf' (MANE info: https://www.ncbi.nlm.nih.gov/refseq/MANE/)
$ wget https://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/current/MANE.GRCh38.v1.4.ensembl_genomic.gtf.gz
$ gzip -d MANE.GRCh38.v1.4.ensembl_genomic.gtf.gz
  6. 'Homo_sapiens.GRCh38.dna.primary_assembly.fa' and 'Homo_sapiens.GRCh38.dna.primary_assembly.fa.fai'
$ wget https://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz --no-check-certificate
$ gzip -d Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
$ samtools faidx ./Homo_sapiens.GRCh38.dna.primary_assembly.fa
  7. MaxEntScan scoring software, downloaded from http://hollywood.mit.edu/burgelab/software.html (more info: http://hollywood.mit.edu/burgelab/maxent/download/)
$ wget http://hollywood.mit.edu/burgelab/maxent/download/fordownload.tar.gz
$ tar -zxvf fordownload.tar.gz
  8. hg38.phastCons17way.bw
$ wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/phastCons17way/hg38.phastCons17way.bw
  9. Context-dependent tolerance scores (CDTS): constraint scores computed in the paper "The human noncoding genome defined by genetic diversity" (di Iulio et al., 2018).
     Download page: http://www.hli-opendata.com/noncoding/
     Download link for N7794unrelated.txt.gz: http://www.hli-opendata.com/noncoding/coord_CDTS_percentile_N7794unrelated.txt.gz
  10. 'major_introns_tidy.rds' and 'minor_introns_tidy.rds': original BED files downloaded from the IAOD database.
$ wget https://introndb.lerner.ccf.org/static/bed/GRCh38_U12.bed
$ wget https://introndb.lerner.ccf.org/static/bed/GRCh38_U2.bed
  11. 'clinvar.vcf': data downloaded from https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz
$ wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz
$ gzip -d clinvar.vcf.gz
  12. bedtools (more info: https://bedtools.readthedocs.io/en/latest/content/installation.html)
$ wget https://github.com/arq5x/bedtools2/releases/download/v2.29.1/bedtools-2.29.1.tar.gz
$ tar -zxvf bedtools-2.29.1.tar.gz
$ cd bedtools2
$ make

Alternatively:

$ apt-get install bedtools
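
Before running the R scripts, it can help to confirm that the tools and reference files listed above are in place. The following is a minimal sketch of such a check, not something shipped with the repository: the helper names `check_tools` and `check_files` are hypothetical, and the file paths assume that the references were downloaded into the current directory.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helpers; they are not part of the repository.

# Print every command-line tool in the arguments that is not on PATH.
# Returns non-zero if at least one tool is missing.
check_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing tool: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Print every argument that is not an existing regular file.
# Returns non-zero if at least one file is missing.
check_files() {
  local missing=0 f
  for f in "$@"; do
    if [ ! -f "$f" ]; then
      echo "missing file: $f"
      missing=1
    fi
  done
  return "$missing"
}

# Example: tool names follow the list above; the file locations are
# assumptions and should point to wherever the references were downloaded.
check_tools regtools samtools bedtools || echo "Some tools are not on PATH."
check_files hg38-blacklist.v2.bed Homo_sapiens.GRCh38.111.chr.gtf clinvar.vcf \
  || echo "Some reference files were not found in the current directory."
```

Tools compiled from source, as in the steps above, only pass this check once their build or install directory has been added to PATH.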

Environments

The code included within this repository has been successfully tested on:

  • Ubuntu version "16.04.7 LTS (Xenial Xerus)"
  • Ubuntu version "22.04.2 LTS (Jammy Jellyfish)"

License

This project is licensed under the MIT License. See the LICENSE file for more details.

Acknowledgments

Aligning Science Across Parkinson's (ASAP)
