Skip to content

FusionDTI utilises a Token-level Fusion module to effectively learn fine-grained information for Drug-Target Interaction Prediction.

Notifications You must be signed in to change notification settings

ZhaohanM/FusionDTI

Repository files navigation

FusionDTI

Website Paper Demo

Introduction

FusionDTI utilises a Token-level Fusion module to effectively learn fine-grained information for Drug-Target Interaction Prediction. In particular, our proposed model uses the SELFIES representation of drugs to mitigate sequence fragment invalidation and incorporates the structure-aware (SA) vocabulary of target proteins to address the limitation of amino acid sequences in structural information, additionally leveraging pre-trained language models (PLMs) extensively trained on large-scale biomedical datasets as encoders to capture the complex information of drugs and targets.

Framework

FusionDTI

Installation Guide

Clone this Github repo and set up a new conda environment.

# create a new conda environment
$ conda create --name FusionDTI python=3.8
$ conda activate FusionDTI

# install requried python dependencies
$ conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
$ pip install transformers
$ pip install wandb

# clone the source code of FusionDTI
$ git https://github.com/ZhaohanM/FusionDTI.git
$ cd FusionDTI

Datasets

All data used in FusionDTI are from public resource: BindingDB [1], BioSNAP [2] and Human [3]. The dataset can be downloaded from here.

Train

For the experiments with FusionDTI, you can directly run the following command. The dataset could either be BindingDB, Biosnap, and Human.

$ python main_token.py --dataset BindingDB

Inference

After training the FusionDTI model, the best saved model is used to inference a single drug and target pair. In visualize_attention.ipynb, we provide the function of entering protein and drug sequences to visualise attention weights.

$ python attention.py --dataset BindingDB

How to obtain the structure-aware sequence of protein?

The structure-aware sequence of protein is based on 3D structure file (.cif) using Foldseek from the AlphafoldDB database. SaProt provides a function to convert a protein structure into a structure-aware sequence. The function calls the foldseek binary file to encode the structure. You can download the binary file from here and place it in the utils folder.

The following three steps are the obtainment process:

Step 1: If you do not have protein structure files, you will need to obtain them from the AlphafoldDB database via the UniProt IDs on the UniProt website. The UniProt IDs are then saved as a comma-delimited text file.

Step 2: Retrieve protein structure files from AlphafoldDB through corresponding UniProt IDs.

$ python get_alphafold.py

Step 3: The structure-aware protein sequences are obtained with 3D structure files (cif).

$ python generate_stru_seq.py

How to obtain SELFIES of drug?

Install the Python package that converts SMILES strings to SELFIES strings.

$ pip install selfies 
$ pip install pandarallel

Run the following code to generate SELFIES based on your SMILES.

$ python generate_selfies.py

Results

FusionDTI

Citation

Please cite our paper if you find our work useful in your own research.

@inproceedings{meng2024fusiondti,
title={Fusion{DTI}: Fine-grained Binding Discovery with Token-level Fusion for Drug-Target Interaction},
author={Zhaohan Meng, Zaiqiao Meng, Ke Yuan and Iadh Ounis},
booktitle={ICML 2024 AI for Science Workshop},
year={2024},
url={https://openreview.net/forum?id=SRdvBPDdXB}
}

About

FusionDTI utilises a Token-level Fusion module to effectively learn fine-grained information for Drug-Target Interaction Prediction.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published