Skip to content

COSMIS is a framework for quantifying the mutational constraint on amino acid sites in 3D spatial neighborhoods. The framework currently maps the landscape of 3D mutational constraint on 6.1 amino acid sites covering >80% (16,533) of human proteins.

License

Notifications You must be signed in to change notification settings

CapraLab/cosmis

Repository files navigation

COSMIS

COSMIS is a novel framework for quantifying the 3D mutational constraint on amino acid sites in the human proteome. If you find COSMIS useful in your work, please consider citing the following paper:

What's in each folder?

cosmis

The cosmis folder contains utility code that the top level cosmis code cosmis.py, cosmis_batch.py, and cosmis_sp.py depend on.

figure-code-data

The figure-code-data folder contains standlone R code that can be run to reproduce figures in the main text and supplementary document of the manuscript.

cosmis-scores

The cosmis-scores folder constains precomputed scores for all 16,533 proteins of the human reference proteome currently covered by the framework.

helpers

The helpers folder contains utility scripts written in Python that were called to obtain processed datasets in the mapping-files folder.

structures

The structures folder contains all protein structure in the PDB format based on which COSMIS scores were computed.

supplementary-data

The supplementray-data folder contains all supplementary tables referred to in the published COSMIS paper.

scripts

The scripts folder contains the main application scripts that can be run to compute COSMIS scores depending on use cases.

Using the COSMIS framework

It is recommended that interested users of the COSMIS framework download precomputed COSMIS scores from this repository. However, should you need to run COSMIS using custom-built protein structural models, or to compute COSMIS scores based on protein-protein complexes, please follow the steps below. Also note that we are in the process of repackaging COSMIS into an installable Python package, please check later for updates.

Clone COSMIS

Clone COSMIS to a local directory.

git clone https://github.com/CapraLab/cosmis.git

Note that cloning might fail as this repository is tracked with Git Large File Storage and is over the data quota currently allowed by Git LFS. Please check later as we are sorting out this quota issue.

Install from source

To install cosmis from source, go to the directory where cosmis was cloned and run the following command:

python3 setup.py install

export PYTHONPATH="$PYTHONPATH:</path/to/cosmis>"

Replace </path/to/cosmis> with the actual path to cosmis on your system. You should then be able to locate the main applications scripts in ./build/scripts-3.8 or ./build/scripts-3.9 depending on the Python version on your system. You can copy these scripts to a convenient location of your choice to run them.

Run within a virtual environment

One can also run cosmis application scripts within a virtual environment. It is easiest to set up a separate conda environment to install all required packages and to run these scripts. All required packages can be installed when creating the conda environment, using the following commands:

# run this command to create a conda environment
conda create --name cosmis --file requirements.txt

# activate the environment
conda activate cosmis

# then also use pip to install wget under the environment
pip install wget

Obviously, you will need to install Miniconda or Anaconda before running the command above.

Download required datasets

Some of the required datasets are already made available within this repository (in the database_files folder). However, due to limits on file size, we had to made larger files available through other means. Follow the following steps to get all required input datasets.

  1. Get all transcript coding sequences from Ensembl (this dataset is already available in database_files).
wget http://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/cds/Homo_sapiens.GRCh38.cds.all.fa.gz
  1. Get the amino acid sequences of human reference proteome as annotated by UniProt (this dataset is already available in database_files).
wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000005640/UP000005640_9606.fasta.gz
  1. Download counts of unique variants at each amino acid position for all gnomAD annotated transcripts.

    We preprocessed the gnomAD database and created a JSON formatted file that maps Ensembl stable transcript IDs to unique variant counts and variant types (missense or synonymous) for all position in the human proteome where a SNP variant was annotated by gnomAD. We have made this dataset available through FigShare.

  2. Get the mapping table from UniProt protein IDs to Ensembl stable transcript IDs (this dataset is already available in database_files).

    A mapping from UniProt protein IDs to Ensembl stable transcript IDs is also required to run COSMIS. We have created such a mapping table and made it available through FigShare

  3. Get transcript-level mutation probabilities (this dataset is already available in database_files).

    Transcript-level mutation probabilities are required to run COSMIS. You can get them from FigShare.

Run COSMIS

Depending on whether you want run COSMIS on a single monomeric protein or homo-multimeric protein, or a list of monomeric proteins, the script and setup are slightly different.

  1. Run COSMIS on a single monomeric or homo-multimeric protein.

1.1 Use the following JSON formatted template to supply paths to database files.

{
    "ensembl_cds": "/path/to/Homo_sapiens.GRCh38.cds.all.fa.gz",
    "uniprot_pep": "/path/to/UP000005640_9606.fasta.gz",
    "gnomad_variants": "/path/to/gnomad_filtered/gnomad_variant_counts_hg38.json",
    "uniprot_to_enst": "/path/to/uniprot_to_enst.json",
    "enst_mp_counts": "/path/to/mutation_probs.tsv"
}

Then, save the file as data_paths.json, for example.

1.2 If you'd like to compute COSMIS score WITHOUT accounting for contacts from neighboring subunits. Run this command

python cosmis_sp.py -c data_paths.json -u <UniProt ID> -p <PDB file> --chain <chain ID of subunit> -o monomeric_cosmis.tsv

1.3 If you'd to compute COSMIS score accounting for contacts from neighboring subunits. Add --multimer to the command above, i.e.

python cosmis_sp.py -c data_paths.json -u <UniProt ID> -p <PDB file> --chain <chain ID of subunit> -o multimeric_cosmis.tsv --multimer

1.4 One can also run the following command to compute COSMIS scores for a subunit which is part of a hetero-oligomeric protein complex. However, cosmis_complex.py has not been thoroughly tested or benchmarked. Interpret the results with caution and let us know if you find anything buggy.

python cosmis_complex.py -c data_path.json -i <chain_to_uniprot_mapping file> -p <PDB file> --chain <chain ID of subunit> -o <output file>

# example
cd examples/
python ../cosmis_complex.py -c cosmis_config.json -i KCNQ1_chain_to_uniprot.txt -p KCNQ1.pdb -o KCNQ1_cosmis.tsv --chain A
  1. Run COSMIS on a list of monomeric proteins whose structures were obtained from AlphaFold database or SWISS-MODEL repository.

2.1 Use the following JSON formatted template to supply paths to database files.

{
    "ensembl_cds": "/path/to/Homo_sapiens.GRCh38.cds.all.fa.gz",
    "uniprot_pep": "/path/to/UP000005640_9606.fasta.gz",
    "gnomad_variants": "/path/to/gnomad_filtered/gnomad_variant_counts_hg38.json",
    "uniprot_to_enst": "/path/to/uniprot_to_enst.json",
    "enst_mp_counts": "/path/to/mutation_probs.tsv"
    "pdb_dir": "/path/to/pdb_files/",
    "output_dir": "./"
}

Then, save the file as data_paths.json, for example.

2.2 If the structures were obtained from AlphaFold database, run the following command

python cosmis_batch.py -c data_paths.json -i <input.txt> -d AlphaFold -l af_cosmis.log

The input file input.txt contains on each line a pair of UniProt ID and PDB filename. For example

A0A024R1R8 A0/A0/AF-A0A024R1R8-F1-model_v1.pdb
A0A024RBG1 A0/A0/AF-A0A024RBG1-F1-model_v1.pdb
A0A024RCN7 A0/A0/AF-A0A024RCN7-F1-model_v1.pdb
A0A075B6H5 A0/A0/AF-A0A075B6H5-F1-model_v1.pdb
A0A075B6H7 A0/A0/AF-A0A075B6H7-F1-model_v1.pdb
A0A075B6H8 A0/A0/AF-A0A075B6H8-F1-model_v1.pdb
A0A075B6H9 A0/A0/AF-A0A075B6H9-F1-model_v1.pdb
A0A075B6I0 A0/A0/AF-A0A075B6I0-F1-model_v1.pdb
A0A075B6I1 A0/A0/AF-A0A075B6I1-F1-model_v1.pdb
A0A075B6I3 A0/A0/AF-A0A075B6I3-F1-model_v1.pdb

assuming that the base directory where the PDB files are stored is /path/to/pdb_files/.

2.3 If the structures were obtained from SWISS-MODEL repository, run the following command

python cosmis_batch.py -c data_paths.json -i <input.txt> -d SWISS-MODEL -l swiss_model_cosmis.log

The input file input.txt contains on each line a pair of UniProt ID and PDB filename. For example

A8MWA4 A8/MW/A4/swissmodel/109_299_5v3m.1.C.pdb
A8MWD9 A8/MW/D9/swissmodel/3_76_4wzj.3.G.pdb
A8MWL7 A8/MW/L7/swissmodel/11_110_2loo.1.A.pdb
A8MX76 A8/MX/76/swissmodel/21_683_3bow.1.A.pdb
A8MXE2 A8/MX/E2/swissmodel/68_331_7jhi.1.A.pdb
A8MXQ7 A8/MX/Q7/swissmodel/136_377_4qfv.1.A.pdb
A8MXT2 A8/MX/T2/swissmodel/95_311_2wa0.1.A.pdb
A8MXU0 A8/MX/U0/swissmodel/23_60_2lwl.1.A.pdb
A8MXY4 A8/MX/Y4/swissmodel/200_532_5v3m.1.C.pdb
A8MYX2 A8/MY/X2/swissmodel/110_172_5cwg.1.A.pdb

again, assuming that the base directory where the PDB files are stored is /path/to/pdb_files/.

About

COSMIS is a framework for quantifying the mutational constraint on amino acid sites in 3D spatial neighborhoods. The framework currently maps the landscape of 3D mutational constraint on 6.1 amino acid sites covering >80% (16,533) of human proteins.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published