VisualPhysiologyDB/optics

Opsin Phenotype Tool for Inference of Color Sensitivity (OPTICS)

Example Box Plot Output for Bootstrap Predictions of Opsin λmax by OPTICS


Description

  • OPTICS is an open-source tool that predicts the Opsin Phenotype (λmax) from unaligned opsin amino-acid sequences.
  • OPTICS leverages machine learning models trained on the Visual Physiology Opsin Database (VPOD).
  • OPTICS is also available as an online tool here, hosted on our Galaxy Project server.

Key Features

  • λmax Prediction: Predicts the peak light absorption wavelength (λmax) for opsin proteins.
  • Model Selection: Choose from different pre-trained models for prediction.
  • Encoding Methods: Select between one-hot encoding or amino-acid property encoding for model training and prediction.
  • BLAST Analysis: Optionally perform BLASTp analysis to compare query sequences against reference datasets.
  • Bootstrap Predictions: Optionally enable bootstrap predictions to better assess prediction accuracy (we suggest a limit of 10 sequences for bootstrap visualizations).
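
As a rough illustration of what one-hot encoding of a protein sequence means, here is a minimal Python sketch. It is not the OPTICS implementation: the function name `one_hot_encode`, the 20-letter alphabet, and the choice to encode unknown characters (such as alignment gaps) as all zeros are assumptions made for this example only.

```python
# Illustrative sketch only; NOT the encoding code used inside OPTICS.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues (assumed alphabet)


def one_hot_encode(sequence, alphabet=AMINO_ACIDS):
    """Return a flat list of 0/1 values, len(alphabet) entries per position.

    Characters outside the alphabet (e.g. a gap '-') become all zeros;
    this gap handling is an assumption of the sketch.
    """
    index = {aa: i for i, aa in enumerate(alphabet)}
    encoded = []
    for residue in sequence.upper():
        row = [0] * len(alphabet)
        if residue in index:
            row[index[residue]] = 1
        encoded.extend(row)
    return encoded


features = one_hot_encode("MNG-")
print(len(features))  # 4 positions x 20 letters per position
```

Each position contributes one block of 20 values, so every encoded sequence of the same length yields a feature vector of the same size, which is what a model needs as input.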

Installation

  1. Clone the repository:

     git clone https://github.com/VisualPhysiologyDB/optics.git
    
  2. Install dependencies: [Make sure you are working in the repository directory from here on]

    A. Create a Conda environment for OPTICS (make sure you have Conda installed)

    conda create --name optics_env python=3.11 
    • THEN
    conda activate optics_env

    B. Use the 'requirements.txt' file to install the base package dependencies for OPTICS

    pip install -r requirements.txt

    C. Download MAFFT and BLAST

    IF working on a MAC or LINUX device:

    • Install BLAST and MAFFT directly from the bioconda channel
      conda install bioconda::blast bioconda::mafft
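
After installing, it can help to confirm that both tools are visible from the active environment. The following is a minimal Python sketch; the helper name `missing_tools` is hypothetical, and it assumes the executables are named `blastp` and `mafft`, as they are in the standard BLAST+ and MAFFT distributions.

```python
import shutil


def missing_tools(tools=("blastp", "mafft")):
    """Return the entries of `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("blastp and mafft are both available.")
```

Run it with the `optics_env` environment activated; anything reported as missing should be installed before running OPTICS.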

    IF working on a WINDOWS device:

  3. Usage

    • MAKE SURE ALL DEPENDENCIES ARE INSTALLED AND THAT YOU ARE IN THE OPTICS REPOSITORY DIRECTORY BEFORE RUNNING ANY SCRIPTS!

    • Parameters

        -in - FASTA file containing unaligned opsin sequences.

        -rd - Name for job; used to create output folder. (default = optics_on_unamed_{date_and_time_label})

        -out - Name for output file. (default = 'optics_predictions.txt')

        -m - Select model to use for prediction. Options are 'whole-dataset', 'vertebrate', 'invertebrate', 'wildtype', or 'wildtype-vert'

        -e - Select preferred encoding method used to train model and make predictions. Options are 'one-hot' or 'aa_prop'

        -b - Option to enable/disable BLASTp analysis on query sequences. [True/False]

        -ir - Name for the BLASTp report output file. (default = 'blastp_report.txt')

        -r - Select reference sequence used for position numbering for BLASTp analysis. Options are 'bovine', 'squid', or 'custom'

        -f - Custom reference sequence file used for BLASTp analysis - **ONLY NEEDED IF 'CUSTOM' IS SELECTED FOR REFERENCE SEQUENCE**

        -s - Option to enable/disable bootstrap predictions on query sequences. [True/False] **NOTE: VISUALIZATION ONLY PRODUCED FOR 10 SEQUENCES OR FEWER**

        -bsv - Name for the PDF output file for visualizing bootstrap predictions. (default = 'bootstrap_viz.pdf')
      
    • Example Command Line Usage:

       python optics_predictions.py -in ./examples/optics_ex_short.txt -rd ex_test_of_optics -out ex_predictions.tsv -m wildtype -e aa_prop -b True -ir ex_blastp_report.tsv -r squid -s True -bsv ex_bs_viz
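
For scripted or batch runs, the same call can be assembled in Python before launching it. This is a minimal sketch, not part of OPTICS: the helper `build_optics_command` is hypothetical, it uses only flags and option values listed in the parameters above, and it assumes the script is launched from the repository directory.

```python
import sys

# Option values copied from the parameter list in this README.
MODELS = ("whole-dataset", "vertebrate", "invertebrate", "wildtype", "wildtype-vert")
ENCODINGS = ("one-hot", "aa_prop")
REFERENCES = ("bovine", "squid", "custom")


def build_optics_command(fasta_path, job_name, model="wildtype", encoding="aa_prop",
                         blast=True, reference="squid", custom_reference=None,
                         bootstrap=False):
    """Hypothetical helper: validate options and return an argument list."""
    if model not in MODELS:
        raise ValueError(f"unknown model {model!r}; expected one of {MODELS}")
    if encoding not in ENCODINGS:
        raise ValueError(f"unknown encoding {encoding!r}; expected one of {ENCODINGS}")
    if reference not in REFERENCES:
        raise ValueError(f"unknown reference {reference!r}; expected one of {REFERENCES}")
    if reference == "custom" and custom_reference is None:
        raise ValueError("a custom reference file (-f) is needed when -r is 'custom'")

    cmd = [sys.executable, "optics_predictions.py",
           "-in", fasta_path, "-rd", job_name,
           "-m", model, "-e", encoding,
           "-b", str(blast), "-r", reference, "-s", str(bootstrap)]
    if reference == "custom":
        cmd += ["-f", custom_reference]
    return cmd


cmd = build_optics_command("./examples/optics_ex_short.txt", "ex_test_of_optics",
                           bootstrap=True)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Validating the option values first means a typo in a model or encoding name is caught before any prediction work starts.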
      

Input

  • Unaligned FASTA file containing opsin amino-acid sequences.
  • Example FASTA Entry:
      >NP_001014890.1_rhodopsin_Bos taurus
      MNGTEGPNFYVPFSNKTGVVRSPFEAPQYYLAEPWQFSMLAAYMFLLIMLGFPINFLTLYVTVQHKKLRT 
      PLNYILLNLAVADLFMVFGGFTTTLYTSLHGYFVFGPTGCNLEGFFATLGGEIALWSLVVLAIERYVVVC 
      KPMSNFRFGENHAIMGVAFTWVMALACAAPPLVGWSRYIPEGMQCSCGIDYYTPHEETNNESFVIYMFVV 
      HFIIPLIVIFFCYGQLVFTVKEAAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQG 
      SDFGPIFMTIPAFFAKTSAVYNPVIYIMMNKQFRNCMVTTLCCGKNPLGDDEASTTVSKTETSQVAPA   
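
Before submitting a file, it can be useful to check that it parses as FASTA and contains only amino-acid letters. The sketch below is a generic pre-flight check in Python, not the parser OPTICS uses; the names `read_fasta` and `invalid_residues` are hypothetical, and the set of accepted letters (the 20 standard residues plus `X` for unknown) is an assumption.

```python
def read_fasta(lines):
    """Parse an iterable of FASTA text lines into (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Assumed set of acceptable letters: 20 standard residues plus 'X' (unknown).
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYX")


def invalid_residues(sequence):
    """Return the sorted characters in `sequence` that are not accepted letters."""
    return sorted(set(sequence.upper()) - VALID_RESIDUES)


# First two sequence lines of the example entry above, shortened for brevity.
example = """>NP_001014890.1_rhodopsin_Bos taurus
MNGTEGPNFYVPFSNKTGVVRSPFEAPQYYLAEPWQFSMLAAYMFLLIMLGFPINFLTLYVTVQHKKLRT
PLNYILLNLAVADLFMVFGGFTTTLYTSLHGYFVFGPTGCNLEGFFATLGGEIALWSLVVLAIERYVVVC
""".splitlines()

for header, seq in read_fasta(example):
    print(header, len(seq), invalid_residues(seq))
```

To check a real file, pass an open file handle instead of the example lines, e.g. `read_fasta(open("my_opsins.fasta"))`, where the file name is a placeholder.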
    

Output

  • Predictions (TSV): λmax values, model used, and encoding method.

  • BLAST Results (TXT, optional): Comparison of query sequences to reference datasets.

  • Bootstrap Graphs (PDF, optional): Visualization of bootstrap prediction results.

  • Job Log (TXT): Log file containing input command to OPTICS, including encoding method and model used.

    Note - All outputs are written into sub-folders within the 'prediction_outputs' folder, and are marked by time and date.
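
For readers unfamiliar with bootstrap predictions, the general idea behind the optional bootstrap output can be sketched as follows: refit a model on many resampled copies of the training data, predict the query each time, and summarize the spread of those predictions. This is a generic illustration only. It uses a one-feature least-squares line and synthetic, unitless toy numbers; it does not reproduce the OPTICS models, features, or data, and all function names are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for a single feature; returns (slope, intercept)."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def bootstrap_predictions(train_x, train_y, query_x, n_boot=200, seed=0):
    """Refit on resampled training data n_boot times and predict query_x each time."""
    rng = random.Random(seed)
    n = len(train_x)
    predictions = []
    while len(predictions) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        xs = [train_x[i] for i in idx]
        ys = [train_y[i] for i in idx]
        if len(set(xs)) < 2:
            continue  # a line cannot be fitted to one distinct x value
        slope, intercept = fit_line(xs, ys)
        predictions.append(slope * query_x + intercept)
    return predictions


# Synthetic, unitless toy data (NOT opsin measurements).
toy_x = [1, 2, 3, 4, 5, 6]
toy_y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]

preds = bootstrap_predictions(toy_x, toy_y, query_x=7)
cuts = statistics.quantiles(preds, n=40)  # cut points at 2.5%, 5%, ..., 97.5%
low, high = cuts[0], cuts[-1]
print(f"median={statistics.median(preds):.2f}, 95% interval=({low:.2f}, {high:.2f})")
```

A box plot of `preds` for each query sequence is the kind of summary a bootstrap visualization conveys: a narrow box suggests the prediction is stable under resampling, while a wide one suggests it is sensitive to which training examples were included.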


License

All data and code are covered under the GNU General Public License (GPL), Version 3, in accordance with Open Source Initiative (OSI) policies.

Citation

  • IF citing this GitHub repository and its contents, use the following DOI provided by Zenodo:

    10.5281/zenodo.10667840
    
  • IF you use OPTICS in your research, please cite the following paper:

    Seth A. Frazer, Mahdi Baghbanzadeh, Ali Rahnavard, Keith A. Crandall, & Todd H. Oakley. Discovering genotype-phenotype relationships with machine learning and the Visual Physiology Opsin Database (VPOD). GigaScience, 2024.09.01. https://doi.org/10.1093/gigascience/giae073
    

Contact

Contact information for author questions or feedback.

Todd H. Oakley - ORCID ID

oakley@ucsb.edu

Seth A. Frazer - ORCID ID

sethfrazer@ucsb.edu

Additional Notes/Resources

  • Want to use OPTICS without the hassle of the setup? -> CLICK HERE to visit our Galaxy Project server and use our tool!

  • OPTICS v1.0 uses VPOD v1.2 for training.

  • Here is a link to a bibliography of the publications used to build VPOD_1.2 (Full version not yet released)

  • If you know of publications for training opsin ML models not included in the VPOD_1.2 database, please send them to us through this form

  • Check out the VPOD GitHub repository to learn more about our database and ML models!