CTL is a toolkit for transfer learning with the Chemprop message-passing neural network model builder, provided via the package chemprop_transfer. If you use this package, please cite J. Chem. Inf. Model. 2022, 62, 22, 5397–5410. https://doi.org/10.1021/acs.jcim.2c00841. Scripts for generating the pre-trained models and figures used in the publication can be found in the directory "JCIM_2022". For example, the script "run_chemprop_model.py" in "/JCIM_2022/Table_8_Figure_6_Table_9_Table_S2" can be used to train the model that generated the results in Tables 8, 9, and S2 as well as Figure 6. The file "ani_properties_filtered_and_normalized.csv" in "/data" contains all data used in co-training the best models. The required chemprop branch can be obtained via the command "git clone --branch transfer_learning https://github.com/JLans/chemprop.git".
Many chemical datasets, particularly experimental datasets, are small, often limited to hundreds or even just dozens of datapoints. To enable property prediction with Chemprop on these types of datasets, an emerging methodology is to apply transfer-learning techniques. In the journal publication this software accompanies, we take the approach of co-training on the experimental property together with computationally derived properties for 10 to 100 times as many molecules.
This package facilitates the building of a working transfer-learned model. Relevant tools include software for filtering large datasets of molecules by similarity to the target chemical dataset; for sorting the filtered molecules to build a co-training dataset that maximizes inter-dataset similarity to the target dataset while limiting its intra-dataset similarity; and for computing properties with the ANI-1ccx machine-learned force field for co-training. A multiprocessing class is also provided for running the software on multiple cores. In addition, classes for transferring the parameters from one model to another and for freezing individual layers of a model before retraining are included.
A setup.py file is provided for installation from source.
cd chemprop_transfer
pip install .
The correct chemprop branch can be installed from source with the following commands.
git clone --branch transfer_learning https://github.com/JLans/chemprop.git
cd chemprop
pip install .
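A quick import check can confirm that both installs succeeded (a minimal sanity check; these are the same modules used in the examples below).
# verify both packages import cleanly
import chemprop
from chemprop_transfer.data import DATASET, MP_functions
from chemprop_transfer.utils import Transfer_Model
print('chemprop and chemprop_transfer imported successfully')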
See the examples folder.
Import the multiprocessing function class and the DATASET class, then create an instance of the multiprocessing class.
import multiprocessing as mp
from chemprop_transfer.data import DATASET
from chemprop_transfer.data import MP_functions
num_cpus = mp.cpu_count() - 2
mp_func = MP_functions(num_cpus)
Remove all molecules from the target dataset that contain elements outside the relevant scope and canonicalize the SMILES strings.
if __name__ == '__main__':
    mp_func.set_filter_atoms(['C', 'H', 'N', 'O'])
    data_path = '../data/Mathieu_2020.csv'
    data = DATASET(data_path, chunksize=155)
    data.load_data()
    names = data.get_column_names()
    mp_func.apply_function(mp_func.filter_and_canonicalize, data.data
                           , in_columns=names
                           , out_file='../data/Mathieu_2020_CHNO.csv'
                           , out_columns=names)
Filter through "large_data.csv" to find the 1000 molecules most similar to each molecule in "Mathieu_2020_CHNO.csv". "large_data.csv" comes from PNNL, as described in the main text of the associated paper, but it can be any .csv file with SMILES strings in a column labeled "smiles" (a minimal stand-in is sketched after the code below).
if __name__ == '__main__':
    comparison_data_path = '../data/Mathieu_2020_CHNO.csv'
    data_path = '../data/large_data.csv'
    data = DATASET(data_path, chunksize=5000)
    data.load_data()
    mp_func.get_similar_mols(comparison_data_path, data.data
                             , '../data/similar_molecules.csv', 1000
                             , rate=1
                             , fast=False)
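If the PNNL file is unavailable, a minimal stand-in can be generated to test the pipeline (a hypothetical sketch; the SMILES below are arbitrary CHNO molecules, not from the paper's dataset).
import pandas as pd
# write a toy large_data.csv with the required 'smiles' column
toy = pd.DataFrame({'smiles': ['CCO', 'c1ccccc1', 'CC(=O)O', 'O=[N+]([O-])c1ccccc1']})
toy.to_csv('../data/large_data.csv', index=False)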
Sort the molecules so that a diverse co-training dataset can be built, one that maximizes similarity to the target dataset while minimizing intra-group similarity.
from chemprop_transfer.data import DATASET
data_path = r'../data/similar_molecules.csv'
out_file = r'../data/sorted_molecules.csv'
data = DATASET(data_path)
data.load_data()
data.order_dataset(out_file, new_molecules='new_mol'
                   , target_molecules=['match_mol', 'max_mol']
                   , sim_names=['match_sim', 'max_sim']
                   , sim_func='sum')
Calculate properties with ANI to be used for co-training. A fixed list of properties is calculated; their names are passed via out_columns.
import multiprocessing as mp
from chemprop_transfer.data import MP_functions
import torchani
from chemprop_transfer.data import DATASET
from chemprop_transfer.property_generator import PROPERTY_GENERATOR
model = torchani.models.ANI1ccx(periodic_table_index=True).double()
PG = PROPERTY_GENERATOR(model)
if __name__ == '__main__':
    out_file = '../data/ani_properties_sorted.csv'
    out_columns = ['smiles', 'energy', 'fmax', 'SYM', 'MOI1', 'MOI2', 'MOI3'
                   , 'Hvib75', 'Hvib150', 'Hvib300', 'Hvib600', 'Hvib1200'
                   , 'TSvib75', 'TSvib150', 'TSvib300', 'TSvib600', 'TSvib1200'
                   , 'FC_1', 'FC_2', 'FC_3']
    in_column = 'new_mol'
    data_path = '../data/sorted_molecules.csv'
    data = DATASET(data_path, chunksize=10000)
    data.load_data()
    num_cpus = mp.cpu_count() - 2
    mp_func = MP_functions(num_cpus)
    mp_func.apply_function(PG.get_properties, data.data, in_column, out_file
                           , out_columns, verbose=True)
Load the required modules.
from chemprop_transfer.data import split_data
from chemprop_transfer.data import combine_files
import shutil
from chemprop.args import TrainArgs
from chemprop.train import cross_validate, run_training
from chemprop_transfer.utils import Transfer_Model
import os
from chemprop_transfer.data import DATASET
Extract, select, normalize, and canonicalize the data. Filtering on fmax < 0.05 keeps only structures whose geometry optimizations converged to low residual forces.
data_path = r'../data/ani_properties_sorted.csv'
data = DATASET(data_path)
data.load_data()
data.data = data.data[data.data['fmax'] < 0.05]
data.data = data.data[['smiles', 'energy']]
data.normalize()
data.canonicalize_data(column='smiles')
data_path_FF = './ani_energy_10000r.csv'
data.save(data_path_FF, rows=10000)
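As a quick sanity check (a minimal sketch using pandas; the two columns follow from the selection above), the saved co-training file can be inspected.
import pandas as pd
df = pd.read_csv('./ani_energy_10000r.csv')
print(df.columns.tolist())      # expect ['smiles', 'energy']
print(df['energy'].describe())  # statistics of the normalized energy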
Split the data into training, validation, and test sets. Combine the co-training and primary data.
seed = 1
data_path_Mathieu = '../data/log_Mathieu_2020_CHNOFCl.csv'
num_folds = 5
temp_dir = './temp'
val = 0.1
test = 0.29333
directory1 = './Mathieu_5+energy_10000r'
split_data(data_path_Mathieu, (1-val-test, val, test), seed=seed, num_folds=num_folds
, save_dir=directory1)
split_data(data_path_FF, (1, 0, 0), seed=seed, num_folds=num_folds
, save_dir=temp_dir)
#combine files
combined_dir = './combined_5M+energy_10000r'
combine_files(combined_dir, [directory1, temp_dir], multiply=[5,1])
shutil.rmtree(temp_dir)
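The training loop below assumes the directory layout produced by split_data and combine_files (inferred from the paths used in the loop, not from separate documentation): a top-level test_full.csv plus fold_i subfolders containing train_full.csv and val_full.csv. A quick listing confirms it.
import os
# print the combined directory tree to confirm the expected files exist
for root, dirs, files in os.walk(combined_dir):
    print(root, sorted(files))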
Train the base model, sweeping over co-training loss weightings.
for weighting in ['.001', '.01', '.05', '0.1', '0.5', '0.75', '1', '5']:
    save_dir = r'./combined_5M+10000r_' + weighting + 'w'
    separate_test_path = os.path.join(combined_dir, 'test_full.csv')
    fold_list = ['fold_' + str(i) for i in range(num_folds)]
    base_model = Transfer_Model()
    for fold in fold_list:
        fold_folder = os.path.join(save_dir, fold)
        data_folder = os.path.join(combined_dir, fold)
        separate_val_path = os.path.join(data_folder, 'val_full.csv')
        data_path = os.path.join(data_folder, 'train_full.csv')
        if __name__ == '__main__':  # and '__file__' in globals()
            # training arguments
            additional_args = [
                '--data_path', data_path,
                '--separate_val_path', separate_val_path,
                '--separate_test_path', separate_test_path,
                '--save_dir', fold_folder,
                '--epochs', '10',  # 10
                '--batch_size', '25',  # 25
                '--final_lr', '0.00005',  # .00005
                '--init_lr', '0.00001',  # .00001
                '--max_lr', '0.001',  # 0.0005
                #'--ffn_hidden_size', '300', '20', '1000', '1000',
                '--loss_weighting', weighting,
                '--hidden_size', '300',
                '--multi_branch_ffn', "(300, 300, 20, (50, 50), (50,50))"
                ]
            train_args = base_model.get_train_args(additional_args)
            args = TrainArgs().parse_args(train_args)
            # pretrain a model on the combined (experimental + ANI) data
            mean_score, std_score = cross_validate(args=args
                                                   , train_func=run_training)
Transfer model parameters to a new model. Any number of model layers can be frozen. If the models are branched, any number of branches can be transferred.
from chemprop.train import cross_validate, run_training
from chemprop_transfer.utils import Transfer_Model
from chemprop.train.make_predictions import make_predictions
from chemprop.args import PredictArgs
import os
data_dir = r'./Mathieu_5+energy_10000r'
if __name__ == '__main__':
    for weighting in ['.001', '.01', '.05', '0.1', '0.5']:
        folder = './Mathieu_5M+10000r_' + weighting + 'w'
        combined_dir = './combined_5M+10000r_' + weighting + 'w'
        separate_test_path = os.path.join(data_dir, 'test_full.csv')
        fold_list = ['fold_' + str(i) for i in range(5)]
        for fold in fold_list:
            data_folder = os.path.join(data_dir, fold)
            fold_folder = os.path.join(folder, fold)
            separate_val_path = os.path.join(data_folder, 'val_full.csv')
            data_path = os.path.join(data_folder, 'train_full.csv')
            # training arguments
            additional_args = [
                '--data_path', data_path,
                '--separate_val_path', separate_val_path,
                '--separate_test_path', separate_test_path,
                '--save_dir', fold_folder,
                '--epochs', '0'
                ]
            # load the pretrained base model and transfer its parameters, freezing all FFN layers
            BaseModel_path = os.path.join(combined_dir, fold, 'fold_0', 'model_0', 'model.pt')
            BaseModel = Transfer_Model(BaseModel_path)
            transfer_model, args = BaseModel.get_transfer_model(frzn_ffn_layers='all'
                                                                , args=additional_args)
            #args.target_stds[0] = 0.1
            mean_score, std_score = cross_validate(args=args, train_func=run_training
                                                   , model_list=[transfer_model])
            # per-fold validation predictions
            predict_args = ['--checkpoint_dir', fold_folder
                            , '--test_path', separate_val_path
                            , '--preds_path', os.path.join(fold_folder, 'val_preds.csv')
                            , '--num_workers', '0'
                            ]
            prediction_args = PredictArgs().parse_args(predict_args)
            make_predictions(args=prediction_args)
        # ensemble predictions over all folds for this weighting
        predict_args = ['--checkpoint_dir', folder
                        , '--test_path', separate_test_path
                        , '--preds_path', os.path.join(folder, 'ensemble_preds.csv')
                        , '--num_workers', '0'
                        ]
        prediction_args = PredictArgs().parse_args(predict_args)
        make_predictions(args=prediction_args)
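To summarize transfer performance, the ensemble predictions can be compared against the held-out test set (a minimal sketch; 'target' is a hypothetical placeholder for the actual property column in log_Mathieu_2020_CHNOFCl.csv, and the Chemprop prediction file is assumed to share the 'smiles' and target column names with the training data).
import os
import pandas as pd

folder = './Mathieu_5M+10000r_0.1w'  # one of the weighting folders above
test_path = './Mathieu_5+energy_10000r/test_full.csv'
target = 'target'  # hypothetical; substitute the real property column name

preds = pd.read_csv(os.path.join(folder, 'ensemble_preds.csv'))
test = pd.read_csv(test_path)
merged = test.merge(preds, on='smiles', suffixes=('_true', '_pred'))
rmse = ((merged[target + '_true'] - merged[target + '_pred']) ** 2).mean() ** 0.5
print(f'ensemble test RMSE: {rmse:.3f}')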
See the publication for details.
Contributors:
Joshua L. Lansford
Brian C. Barnes
This project is licensed under the MIT license.