Building RNA3DB from scratch
If you wish to build your own version of RNA3DB from scratch, please follow the steps below carefully.
To install RNA3DB for reproduction, simply run:
$ git clone https://github.com/marcellszi/rna3db.git && cd rna3db
$ python -m pip install -e .
We recommend Python 3.10.* for running RNA3DB.
The following non-Python dependencies are also required to reproduce all steps:
First, you must download the mmCIF files to scan for RNAs. We scan all crystal structures in the PDB for RNA chains for our data set.
To make this simple, we provide scripts/download_pdb_mmcif.sh, which can be used to download all crystal structures in the PDB. You can do this via:
$ scripts/download_pdb_mmcif.sh data/cif/
Note: This step requires downloading and extracting 343GB+ of gzip files.
Before parsing mmCIF files, a modifications_cache.json must be generated. This file contains conversions from three_letter_code to one_letter_code for modified residues. To do this, we start by downloading the latest version of components.cif from the Chemical Component Dictionary, and running:
$ scripts/generate_modifications_cache.py path/to/components.cif data/modifications_cache.json
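As a rough illustration of the mapping this cache holds, the sketch below builds a three_letter_code to one_letter_code dictionary from a tiny hand-written stand-in for components.cif. It is not the code in scripts/generate_modifications_cache.py: the function name, the sample text, and the choice to skip entries without a one-letter code are all assumptions, and a real components.cif would need a proper mmCIF parser.

```python
import json

# Hand-written stand-in for components.cif, for illustration only.
SAMPLE_CCD = """\
data_PSU
_chem_comp.id                PSU
_chem_comp.one_letter_code   U
_chem_comp.three_letter_code PSU
#
data_5MC
_chem_comp.id                5MC
_chem_comp.one_letter_code   C
_chem_comp.three_letter_code 5MC
#
data_HOH
_chem_comp.id                HOH
_chem_comp.one_letter_code   ?
_chem_comp.three_letter_code HOH
#
"""


def build_modifications_cache(cif_text):
    """Map three_letter_code -> one_letter_code (hypothetical helper)."""
    cache = {}
    three = one = None
    for line in cif_text.splitlines():
        if line.startswith("data_"):
            # A new component block starts: forget the previous values.
            three = one = None
        parts = line.split()
        if len(parts) != 2:
            continue
        key, value = parts
        if key == "_chem_comp.three_letter_code":
            three = value
        elif key == "_chem_comp.one_letter_code":
            one = value
        if three and one:
            # "?" and "." mark missing values in mmCIF; skip those entries.
            if one not in ("?", "."):
                cache[three] = one
            three = one = None
    return cache


cache = build_modifications_cache(SAMPLE_CCD)
print(json.dumps(cache, indent=2, sort_keys=True))
```

The line-based parsing only works for simple key/value rows like the ones in the sample.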
The next step is to parse the mmCIF files and extract all RNA chains.
$ python -m rna3db parse data/pdb_mmcif/mmcif_files output/parse.json
As the next step, we perform homology search on all RNAs in the PDB. However, if you don't want this information for all chains, you can significantly reduce the number of searched sequences (thereby speeding up the search) by filtering redundant, short, or low resolution sequences first. See the Filtering step below.
We must generate a FASTA file containing the RNAs we want to scan. We provide a convenient script to do this via:
$ scripts/json_to_fasta.py output/parse.json output/parse.fasta
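The conversion itself is simple, and a minimal sketch is shown here. The layout of parse.json is not described in this document, so the flat mapping from chain ID to an entry with a "sequence" key, the chain IDs, and the function name are all assumptions made for illustration. This is not the code in scripts/json_to_fasta.py.

```python
import json


def json_to_fasta(records, width=80):
    """Render {chain_id: {"sequence": ...}} as FASTA text (assumed layout)."""
    lines = []
    for chain_id, entry in records.items():
        lines.append(f">{chain_id}")
        seq = entry["sequence"]
        # Wrap long sequences at a fixed line width.
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Hypothetical input; real parse.json entries may be structured differently.
example = json.loads(
    '{"1abc_A": {"sequence": "GGCUAGCC"}, "2xyz_B": {"sequence": "AUGC"}}'
)
fasta_text = json_to_fasta(example, width=4)
print(fasta_text, end="")
```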
Next, we must use Infernal's cmscan to do homology search. First, we need to download the latest version of Rfam's covariance models. Then we can simply run:
$ cmscan -o output/cmscans/parse.o --tbl output/cmscans/parse.tbl data/Rfam.cm output/parse.fasta
To attempt to find hits for those chains without significant hits, we rescan them with the --max flag. First, generate a FASTA of chains without hits in parse.tbl:
$ scripts/get_nohits.py output/parse.json output/nohits.fasta output/cmscans/parse.tbl
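To make the selection of no-hit chains concrete, here is a minimal sketch that reads Infernal's tabular output and lists the chains with no row marked as significant. It assumes the default cmscan table layout, where the query name is the third whitespace-separated column and the inc column (17th) is "!" for hits above the inclusion threshold. The sample table, the chain IDs, and the function names are hypothetical, and the exact criterion used by scripts/get_nohits.py is not stated here.

```python
# Hand-written sample in the style of a cmscan table, for illustration only.
SAMPLE_TBL = """\
#target name  accession  query name  accession  mdl  mdl from  mdl to  seq from  seq to  strand  trunc  pass  gc  bias  score  E-value  inc  description of target
tRNA  RF00005  1abc_A  -  cm  1  71  1  73  +  no  1  0.55  0.0  60.1  1.2e-12  !  tRNA
5S_rRNA  RF00001  2xyz_B  -  cm  1  119  4  40  +  no  1  0.50  0.0  8.3  2.1  ?  5S ribosomal RNA
#
"""


def queries_with_hits(tbl_text, significant_only=True):
    """Collect query (chain) names that appear in at least one hit row."""
    hits = set()
    for line in tbl_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        if significant_only and fields[16] != "!":
            continue  # row is below the inclusion threshold
        hits.add(fields[2])
    return hits


def chains_without_hits(chain_ids, tbl_text, significant_only=True):
    """Return the chains, in input order, that have no qualifying hit."""
    hits = queries_with_hits(tbl_text, significant_only)
    return [chain for chain in chain_ids if chain not in hits]


no_hits = chains_without_hits(["1abc_A", "2xyz_B", "3def_C"], SAMPLE_TBL)
print(no_hits)
```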
Then run cmscan again (this time with the --max flag):
$ cmscan --max -o output/cmscans/nohits.o --tbl output/cmscans/nohits.tbl data/Rfam.cm output/nohits.fasta
Note: Both of these scans are prohibitively slow to run on most consumer-grade hardware. We recommend running this step on a compute cluster. The latest homology search took 110 hours on a single Intel Xeon Platinum 8358 processor with 32 cores.
Filtering can be performed prior to the homology search if you are not interested in homology information for the chains that get filtered out. In any case, to filter a set of raw chains, you can run:
$ python -m rna3db filter output/parse.json output/filter.json
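The kind of filtering described earlier (dropping redundant, short, or low resolution chains) can be sketched as follows. This is only an illustration: the entry layout with "sequence" and "resolution" keys, the chain IDs, the threshold values, and the use of exact sequence identity as the redundancy test are all assumptions, not RNA3DB's actual criteria or defaults.

```python
def filter_chains(chains, min_length, max_resolution):
    """Keep chains that are long enough, well resolved, and not duplicates."""
    kept = {}
    seen_sequences = set()
    for chain_id, entry in chains.items():
        seq = entry["sequence"]
        if len(seq) < min_length:
            continue  # too short
        if entry["resolution"] > max_resolution:
            continue  # larger value in angstroms means lower resolution
        if seq in seen_sequences:
            continue  # exact duplicate of a chain already kept
        seen_sequences.add(seq)
        kept[chain_id] = entry
    return kept


# Hypothetical chains and placeholder thresholds, for illustration only.
chains = {
    "1abc_A": {"sequence": "GGCUAGCCAUGC", "resolution": 2.1},
    "1abc_B": {"sequence": "GGCUAGCCAUGC", "resolution": 2.1},
    "2xyz_A": {"sequence": "AUGC", "resolution": 1.8},
    "3def_A": {"sequence": "CCGGAAUUCCGG", "resolution": 7.5},
}
filtered = filter_chains(chains, min_length=10, max_resolution=4.0)
print(sorted(filtered))
```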
Note: This step requires that you have the .tbl files produced by Infernal's cmscan. Please see the Homology step above.
Once you have gone through the filtering step, you must cluster by both sequence and structure. To do this, you can run:
$ python -m rna3db cluster output/filter.json output/cluster.json --tbl_dir output/cmscans/
where output/cmscans is your folder that contains all .tbl files used for homology. If RNA3DB cannot find MMseqs2, you may need to manually provide a path to the binary with the --mmseqs_binary_path flag.
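One way to picture combining sequence clusters with homology hits is to link any two chains that share a sequence cluster or a family hit, then take the connected groups. The union-find sketch below does exactly that. It is an illustration, not RNA3DB's clustering algorithm: the input dictionaries (as might be derived from MMseqs2 output and the .tbl files), the chain names, and the function name are all hypothetical.

```python
def group_chains(seq_cluster, families):
    """Group chains linked by a shared sequence cluster or shared family hit."""
    parent = {chain: chain for chain in seq_cluster}

    def find(x):
        # Follow parent links to the group representative, compressing paths.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    first_seen = {}  # label -> first chain observed carrying that label
    for chain in seq_cluster:
        labels = [("seq", seq_cluster[chain])]
        labels += [("fam", fam) for fam in families.get(chain, ())]
        for label in labels:
            if label in first_seen:
                union(chain, first_seen[label])
            else:
                first_seen[label] = chain

    groups = {}
    for chain in seq_cluster:
        groups.setdefault(find(chain), set()).add(chain)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical inputs: chain -> sequence cluster ID, chain -> family hits.
seq_cluster = {"a": 0, "b": 0, "c": 1, "d": 2}
families = {"b": {"RF00005"}, "c": {"RF00005"}}
groups = group_chains(seq_cluster, families)
print(groups)
```

Here chains a and b share a sequence cluster, and b and c share a family hit, so all three land in one group while d stays alone.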
The final step is to create a training/testing split. This can be done by running:
$ python -m rna3db split output/cluster.json output/split.json
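The point of splitting after clustering is that related chains stay together, so no group contributes to both sets. A minimal sketch of such a group-level split follows. The greedy assignment, the train_fraction parameter, and the example groups are assumptions for illustration and do not describe how the rna3db split command chooses its sets.

```python
def split_groups(groups, train_fraction):
    """Assign whole groups to train or test, never splitting a group."""
    total = sum(len(group) for group in groups)
    train, test = [], []
    n_train = 0
    # Place the largest groups first, filling train up to the target size.
    for group in sorted(groups, key=len, reverse=True):
        if n_train + len(group) <= train_fraction * total:
            train.extend(group)
            n_train += len(group)
        else:
            test.extend(group)
    return train, test


# Hypothetical groups of chain IDs and a placeholder fraction.
example_groups = [["a", "b", "c"], ["d", "e"], ["f"]]
train, test = split_groups(example_groups, train_fraction=0.7)
print("train:", train)
print("test:", test)
```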