Skip to content

Commit

Permalink
Update to v 0.2.4
Browse files Browse the repository at this point in the history
Mainly implementing logging instead of printing.
  • Loading branch information
niklases committed May 8, 2023
1 parent ab7054b commit 49dd496
Show file tree
Hide file tree
Showing 26 changed files with 716 additions and 52,169 deletions.
9 changes: 7 additions & 2 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1,5 +1,10 @@
build/
dist/
pypef.egg-info/
*.egg-info/
.installed.cfg
*.egg
workflow/.ipynb_checkpoints/
.idea/
.idea/
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
16 changes: 8 additions & 8 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@ This repository contains the source files and supplementary information for the

Niklas E. Siedhoff<sup>*1,§*</sup>, Alexander-Maurice Illig<sup>*1,§*</sup>, Ulrich Schwaneberg<sup>*1,2*</sup>, Mehdi D. Davari<sup>*3,\**</sup>, <br>
PyPEF – An Integrated Framework for Data-Driven Protein Engineering, *J. Chem. Inf. Model.* 2021, 61, 3463-3476 <br>
https://doi.org/10.1021/acs.jcim.1c00099<br>
https://doi.org/10.1021/acs.jcim.1c00099 <br>

as well as additional framework features described in the preprint<br>

Expand Down Expand Up @@ -42,7 +42,7 @@ a framework written in Python 3 for performing sequence-based machine learning-a
<img src="workflow/test_dataset_aneh/exemplary_validation_color_plot.png" alt="drawing" width="500"/>
</p>

Protein engineering by rational or random approaches generates data that can aid the construction of self-learned sequence-function landscapes to predict beneficial variants by using probabilistic methods that can screen the unexplored sequence space with uncertainty *in silico*. Such predictive methods can be applied for increasing the success/effectivity of an engineering campaign while partly offering the prospect to reveal (higher-order) epistatic effects. Here we present an engineering framework termed PyPEF for assisting the supervised training and testing of regression models for predicting beneficial combinations of (identified) amino acid substitutions using machine learning algorithms from the [Scikit-learn](https://github.com/scikit-learn/scikit-learn) package. As training input, the developed framework requires the variant sequences and the corresponding screening results (fitness labels) of the identified variants as CSV (or FASTA-like datasets following a self-defined convention). Using linear or nonlinear regression methods (partial least squares (PLS), Ridge, Lasso, Elastic net, support vector machines (SVR), random forest (RF), and multilayer perceptron (MLP)-based regression), PyPEF trains on the given learning data while optimizing model hyperparameters (default: five-fold cross-validation) and can compute model performances on left-out test data. As sequences are encoded using amino acid descriptor sets taken from the [AAindex database](https://www.genome.jp/aaindex/), finding the best index-dependent encoding for a specific test set can be seen as a hyperparameter search on the test set. In addition, one-hot and [direct coupling analysis](https://en.wikipedia.org/wiki/Direct_coupling_analysis)-based feature generation are implemented as sequence encoding techniques, which often outperform AAindex-based encoding techniques. Finally, the selected or best identified encoding technique and regression model can be used to perform directed evolution walks *in silico* (see [Church-lab implementation](https://github.com/churchlab/UniRep) or the [reimplementation](https://github.com/ivanjayapurna/low-n-protein-engineering)) or to predict natural diverse or recombinant variant sequences that subsequently are to be designed and validated in the wet-lab.
Protein engineering by rational or random approaches generates data that can aid the construction of self-learned sequence-function landscapes to predict beneficial variants by using probabilistic methods that can screen the unexplored sequence space with uncertainty *in silico*. Such predictive methods can be applied for increasing the success/effectivity of an engineering campaign while partly offering the prospect to reveal (higher-order) epistatic effects. Here we present an engineering framework termed PyPEF for assisting the supervised training and testing of regression models for predicting beneficial combinations of (identified) amino acid substitutions using machine learning algorithms from the [Scikit-learn](https://github.com/scikit-learn/scikit-learn) package. As training input, the developed framework requires the variant sequences and the corresponding screening results (fitness labels) of the identified variants as CSV (or FASTA-Like (FASL) datasets following a self-defined convention). Using linear or nonlinear regression methods (partial least squares (PLS), Ridge, Lasso, Elastic net, support vector machines (SVR), random forest (RF), and multilayer perceptron (MLP)-based regression), PyPEF trains on the given learning data while optimizing model hyperparameters (default: five-fold cross-validation) and can compute model performances on left-out test data. As sequences are encoded using amino acid descriptor sets taken from the [AAindex database](https://www.genome.jp/aaindex/), finding the best index-dependent encoding for a specific test set can be seen as a hyperparameter search on the test set. In addition, one-hot and [direct coupling analysis](https://en.wikipedia.org/wiki/Direct_coupling_analysis)-based feature generation are implemented as sequence encoding techniques, which often outperform AAindex-based encoding techniques. Finally, the selected or best identified encoding technique and regression model can be used to perform directed evolution walks *in silico* (see [Church-lab implementation](https://github.com/churchlab/UniRep) or the [reimplementation](https://github.com/ivanjayapurna/low-n-protein-engineering)) or to predict natural diverse or recombinant variant sequences that subsequently are to be designed and validated in the wet-lab.

For detailed information, please refer to the above-mentioned publications and related Supporting Information.

Expand Down Expand Up @@ -96,7 +96,7 @@ pypef mklsts -w WT_SEQUENCE.FASTA -i VARIANT-FITNESS_DATA.CSV

Training and testing a model (encoding technique = {`aaidx`, `onehot`, `dca`}, regression model = {`pls`, `ridge`, `lasso`, `elasticnet`, `svr`, `rf`, `mlp`}):
```
pypef ml -e aaidx -l LEARNING_SET.FASTA -t TEST_SET.FASTA --regressor pls
pypef ml -e aaidx -l LEARNING_SET.FASL -t TEST_SET.FASL --regressor pls
```

Show the model performance(s) (reads and prints the created Model_Results.txt file):
Expand All @@ -106,7 +106,7 @@ pypef ml --show

Load a trained model, predict fitness of test sequences using that model, and plot the measured versus the predicted fitness values:
```
pypef ml -e aaidx -m MODEL -f TEST_SET.FASTA
pypef ml -e aaidx -m MODEL -f TEST_SET.FASL
```
`-m MODEL`is the saved model Pickle file name, for `-e aaidx` this will be the AAindex to use for encoding, e.g. `-m ARGP820101`, for `-e onehot` it will be `-m ONEHOTMODEL` and for `-e dca` it will be `-m DCAMODEL`.

Expand Down Expand Up @@ -149,7 +149,7 @@ pypef ml extrapolation -i VARIANT-FITNESS-ENCODING_DATA.CSV --regressor pls
The use of the hybrid model (`pypef hybrid`) - instead of a pure ML model (`pypef ml`) as described in the steps above - is quite similar in terms of commands, but does not require the definition of the `-e`/`--encoding` and the `--regressor` flags, since it depends only on the DCA-based encoding technique and (so far) only uses Ridge regression for modeling. However, DCA-based encoding of sequences always requires a parameter file as input, which comes from the preprocessing of a query-specific multiple sequence alignment (MSA) and results in the parameter file generated by [plmc](https://github.com/debbiemarkslab/plmc). E.g. for training a model on a learning set and testing it on a test set, the command for hybrid modeling is:

```
pypef hybrid -l LEARNING_SET.FASTA -t TEST_SET.FASTA --params PLMC_FILE.params
pypef hybrid -l LEARNING_SET.FASL -t TEST_SET.FASL --params PLMC_FILE.params
```

Sample files for testing PyPEF routines are provided in the workflow directory, which are also used when running the notebook tutorial. PyPEF's package dependencies are linked [here](https://github.com/niklases/PyPEF/network/dependencies).
Expand All @@ -159,7 +159,7 @@ As standard input files, PyPEF requires the target protein wild-type sequence in

<a name="tutorial"></a>
## Tutorial
Before starting running the tutorial, it is a good idea to set-up a new Python environment using Anaconda, https://www.anaconda.com/, e.g. using [Anaconda](https://www.anaconda.com/products/individual) ([Anaconda3-2020.11-Linux-x86_64.sh installer download](https://repo.anaconda.com/archive/Anaconda3-2020.11-Linux-x86_64.sh)) or [Miniconda](https://docs.conda.io/en/latest/miniconda.html).
Before starting running the tutorial, it is a good idea to set up a new Python environment using Anaconda, https://www.anaconda.com/, e.g. using [Anaconda](https://www.anaconda.com/products/individual) ([Anaconda3-2020.11-Linux-x86_64.sh installer download](https://repo.anaconda.com/archive/Anaconda3-2020.11-Linux-x86_64.sh)) or [Miniconda](https://docs.conda.io/en/latest/miniconda.html).
Change to the download directory and run the installation, e.g. in Linux:

```
Expand Down Expand Up @@ -339,11 +339,11 @@ plmc -o ANEH_72.6.params -le 72.6 -m 100 -g -f WT_ANEH ANEH_jhmmer.a2m

Done! The output parameter (.params) file can be used for encoding sequences with the DCA-based encoding technique (`-e dca`) by providing it to PyPEF; e.g. for pure ML modeling:
```
pypef ml -e dca -l LS.fasta -t TS.fasta --regressor pls --params ANEH_72.6.params
pypef ml -e dca -l LS.fasl -t TS.fasl --regressor pls --params ANEH_72.6.params
```
Or for hybrid modeling:
```
pypef hybrid -l LS.fasta -t TS.fasta --params ANEH_72.6.params
pypef hybrid -l LS.fasl -t TS.fasl --params ANEH_72.6.params
```

<a name="api-usage"></a>
Expand Down
2 changes: 1 addition & 1 deletion pypef/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -17,4 +17,4 @@
# §Equal contribution


VERSION = '0.2.3-alpha'
__version__ = '0.2.4-alpha'
66 changes: 30 additions & 36 deletions pypef/dca/encoding.py
Original file line number Diff line number Diff line change
Expand Up @@ -39,6 +39,8 @@


from collections.abc import Iterable # originally imported 'from collections'
import logging
logger = logging.getLogger('pypef.dca.encoding')

import numpy as np
import ray
Expand All @@ -47,7 +49,7 @@


_SLICE = np.s_[:]
np.warnings.filterwarnings('error', category=np.VisibleDeprecationWarning)
# np.warnings.filterwarnings('error', category=np.VisibleDeprecationWarning) # DEV

class InvalidVariantError(Exception):
"""
Expand All @@ -73,7 +75,7 @@ def __init__(self, variant: str):
super().__init__(self.message)


class ActiveSiteError(Exception):
class EffectiveSiteError(Exception):
"""
Description
-----------
Expand All @@ -93,10 +95,10 @@ def __init__(self, position: int, variant: str, verbose: bool = True):
self.position = position
self.variant = variant
self.verbose = verbose
message = f"Info: The position {self.position} of variant '{self.variant}' is " \
message = f"The position {self.position} of variant '{self.variant}' is " \
f"not an effective site in the DCA model and thus cannot be predicted."
if self.verbose:
print(message)
logger.info(message)
self.message = message
super().__init__(self.message)

Expand Down Expand Up @@ -285,19 +287,16 @@ def __read_plmc_v2(self, filename, precision):
self.wt_aa_pos = []
for aa, pos in zip(self._target_seq, self.index_list):
self.wt_aa_pos.append(str(aa) + str(pos))
print(f'Evaluating gap content of PLMC parameter file... '
logger.info(f'Evaluating gap content of PLMC parameter file... '
f'First amino acid position used in the MSA (PLMC params file) is '
f'{self._target_seq[0]}{self.index_list[0]} and the last position '
f'used is {self._target_seq[-1]}{self.index_list[-1]}.')
if len(not_valid) > 0:
print(f'\nFurther, non-included positions are:')
print(str(not_valid)[1:-1])
logger.info(f'Further, non-included positions are:\n{str(not_valid)[1:-1]}')
if self.verbose:
print(f'\nSummary of all effective positions represented in the MSA '
f'based on wild-type sequence ({len(valid)} encoded positions):')
for aa_pos in self.wt_aa_pos:
print(f'{aa_pos}', end=' ')
print('\n')
logger.info(f'Summary of all effective positions represented in the MSA '
f'based on wild-type sequence ({len(valid)} encoded positions):\n'
f'{str([aa_pos for aa_pos in self.wt_aa_pos])[1:-1]}'.replace("'", ""))

# single site frequencies f_i and fields h_i
self.f_i, = np.fromfile(
Expand Down Expand Up @@ -372,11 +371,8 @@ def target_seq(self, sequence):
Length of sequence must correspond to model length (self.L)
"""
if len(sequence) != self.L:
raise ValueError(
"Sequence length inconsistent with model length: {} {}".format(
len(sequence), self.L
)
)
raise ValueError(f"Sequence length inconsistent with model length: {len(sequence)} {self.L}")


if isinstance(sequence, str):
sequence = list(sequence)
Expand Down Expand Up @@ -405,11 +401,7 @@ def index_list(self, mapping):
Length of list must correspond to model length (self.L)
"""
if len(mapping) != self.L:
raise ValueError(
"Mapping length inconsistent with model length: {} {}".format(
len(mapping), self.L
)
)
raise ValueError(f"Mapping length inconsistent with model length: {len(mapping)} {self.L}")

self._index_list = np.array(mapping)
self.index_map = {b: a for a, b in enumerate(self.index_list)}
Expand All @@ -430,9 +422,7 @@ def __map(self, indices, mapping):
"""
if ((isinstance(indices, Iterable) and not isinstance(indices, str)) or
(isinstance(indices, str) and len(indices) > 1)):
return np.array(
[mapping[i] for i in indices]
)
return np.array([mapping[i] for i in indices])
else:
return mapping[indices]

Expand Down Expand Up @@ -461,6 +451,7 @@ def __4d_access(self, matrix, i=None, j=None, A_i=None, A_j=None):
j = self.__map(j, self.index_map) if j is not None else _SLICE
A_i = self.__map(A_i, self.alphabet_map) if A_i is not None else _SLICE
A_j = self.__map(A_j, self.alphabet_map) if A_j is not None else _SLICE

return matrix[i, j, A_i, A_j]

def __2d_access(self, matrix, i=None, A_i=None):
Expand All @@ -482,6 +473,7 @@ def __2d_access(self, matrix, i=None, A_i=None):
"""
i = self.__map(i, self.index_map) if i is not None else _SLICE
A_i = self.__map(A_i, self.alphabet_map) if A_i is not None else _SLICE

return matrix[i, A_i]

def Jij(self, i=None, j=None, A_i=None, A_j=None):
Expand Down Expand Up @@ -570,6 +562,7 @@ def Ji(self, i: int, A_i: str, sequence: np.ndarray) -> float:
Ji = 0.0
for j, A_j in zip(self.index_list, sequence):
Ji += self.Jij(i=i, A_i=A_i, j=j, A_j=A_j)

return Ji

@staticmethod
Expand Down Expand Up @@ -633,7 +626,7 @@ def encode_variant(self, variant: str) -> np.ndarray:

i = self._get_position_internal(position)
if not i:
raise ActiveSiteError(position, variant, self.verbose)
raise EffectiveSiteError(position, variant, self.verbose)

self.check_substitution_naming_against_wt(substitution, variant)
i_mapped = self.index_map[i]
Expand All @@ -645,7 +638,7 @@ def encode_variant(self, variant: str) -> np.ndarray:

return X_var

def collect_encoded_sequences(self, variants: list) -> list:
def collect_encoded_sequences(self, variants: list) -> np.ndarray:
"""
Description
-----------
Expand Down Expand Up @@ -677,10 +670,10 @@ def collect_encoded_sequences(self, variants: list) -> list:
for i, variant in enumerate(tqdm(variants, disable=set_silence)):
try:
encoded_sequences.append(self.encode_variant(variant))
except ActiveSiteError:
except EffectiveSiteError:
encoded_sequences.append([None])

return encoded_sequences
return np.array(encoded_sequences, dtype='object')


"""
Expand Down Expand Up @@ -711,7 +704,7 @@ def get_encoded_sequence(
"""
try:
encoded_seq = dca_encode.encode_variant(variant)
except ActiveSiteError: # position not included in processed MSA
except EffectiveSiteError: # position not included in processed MSA
return

return encoded_seq
Expand Down Expand Up @@ -748,8 +741,9 @@ def _get_data_parallel(
for i, (variant, fitness) in enumerate(zip(variants, fitnesses)):
try:
data.append([variant, dca_encode.encode_variant(variant), fitness])
except ActiveSiteError: # do not append non-encoded sequences and
except EffectiveSiteError: # do not append non-encoded sequences and
pass # associated fitness values

return data


Expand Down Expand Up @@ -785,12 +779,12 @@ def get_dca_data_parallel(
List of variant names that cannot be used for modelling as they are not effective
positions in the underlying MSA used for generating local and coupling terms.
"""
print(f'{len(variants)} input variants.')
print(f'Encoding variant sequences. This might take some time...')
logger.info(f'{len(variants)} input variants. Encoding variant sequences. '
f'This might take some time...')

idxs_nan = np.array([i for i, b in enumerate(np.isnan(fitnesses)) if b]) # find fitness NaNs
if idxs_nan.size > 0: # remove NaNs if present
print('Fitness NaNs are:', idxs_nan)
logger.info(f'Fitness NaNs are: {idxs_nan}')
fitnesses = np.delete(fitnesses, idxs_nan)
variants = np.delete(variants, idxs_nan)

Expand All @@ -810,7 +804,7 @@ def get_dca_data_parallel(
encoded_sequences = [item[1] for item in data]
fitnesses = [item[2] for item in data]

print(f'{len(data)} variants after NaN-valued and non-effective '
f'site-substituted variant (ActiveSiteError) dropping.')
logger.info(f'{len(data)} variants after NaN-valued and non-effective '
f'site-substituted variant (EffectiveSiteError) dropping.')

return variants, encoded_sequences, fitnesses
Loading

0 comments on commit 49dd496

Please sign in to comment.