Update to v 0.2.4

Mainly implementing logging instead of printing.
niklases · May 8, 2023 · 49dd496 · 49dd496
1 parent ab7054b
commit 49dd496
Show file tree

Hide file tree

Showing 26 changed files with 716 additions and 52,169 deletions.
diff --git a/.gitignore b/.gitignore
@@ -1,5 +1,10 @@
 build/
 dist/
-pypef.egg-info/
+*.egg-info/
+.installed.cfg
+*.egg
 workflow/.ipynb_checkpoints/
-.idea/
+.idea/
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
diff --git a/README.md b/README.md
@@ -2,7 +2,7 @@ This repository contains the source files and supplementary information for the
 
 Niklas E. Siedhoff<sup>*1,§*</sup>, Alexander-Maurice Illig<sup>*1,§*</sup>, Ulrich Schwaneberg<sup>*1,2*</sup>, Mehdi D. Davari<sup>*3,\**</sup>, <br>
 PyPEF – An Integrated Framework for Data-Driven Protein Engineering, *J. Chem. Inf. Model.* 2021, 61, 3463-3476 <br>
-https://doi.org/10.1021/acs.jcim.1c00099<br>
+https://doi.org/10.1021/acs.jcim.1c00099 <br>
 
 as well as additional framework features described in the preprint<br>
 
@@ -42,7 +42,7 @@ a framework written in Python 3 for performing sequence-based machine learning-a
     <img src="workflow/test_dataset_aneh/exemplary_validation_color_plot.png" alt="drawing" width="500"/>
 </p>
 
-Protein engineering by rational or random approaches generates data that can aid the construction of self-learned sequence-function landscapes to predict beneficial variants by using probabilistic methods that can screen the unexplored sequence space with uncertainty *in silico*. Such predictive methods can be applied for increasing the success/effectivity of an engineering campaign while partly offering the prospect to reveal (higher-order) epistatic effects. Here we present an engineering framework termed PyPEF for assisting the supervised training and testing of regression models for predicting beneficial combinations of (identified) amino acid substitutions using machine learning algorithms from the [Scikit-learn](https://github.com/scikit-learn/scikit-learn) package. As training input, the developed framework requires the variant sequences and the corresponding screening results (fitness labels) of the identified variants as CSV (or FASTA-like datasets following a self-defined convention). Using linear or nonlinear regression methods (partial least squares (PLS), Ridge, Lasso, Elastic net, support vector machines (SVR), random forest (RF), and multilayer perceptron (MLP)-based regression), PyPEF trains on the given learning data while optimizing model hyperparameters (default: five-fold cross-validation) and can compute model performances on left-out test data. As sequences are encoded using amino acid descriptor sets taken from the [AAindex database](https://www.genome.jp/aaindex/), finding the best index-dependent encoding for a specific test set can be seen as a hyperparameter search on the test set. In addition, one-hot and [direct coupling analysis](https://en.wikipedia.org/wiki/Direct_coupling_analysis)-based feature generation are implemented as sequence encoding techniques, which often outperform AAindex-based encoding techniques. Finally, the selected or best identified encoding technique and regression model can be used to perform directed evolution walks *in silico* (see [Church-lab implementation](https://github.com/churchlab/UniRep) or the [reimplementation](https://github.com/ivanjayapurna/low-n-protein-engineering)) or to predict natural diverse or recombinant variant sequences that subsequently are to be designed and validated in the wet-lab.
+Protein engineering by rational or random approaches generates data that can aid the construction of self-learned sequence-function landscapes to predict beneficial variants by using probabilistic methods that can screen the unexplored sequence space with uncertainty *in silico*. Such predictive methods can be applied for increasing the success/effectivity of an engineering campaign while partly offering the prospect to reveal (higher-order) epistatic effects. Here we present an engineering framework termed PyPEF for assisting the supervised training and testing of regression models for predicting beneficial combinations of (identified) amino acid substitutions using machine learning algorithms from the [Scikit-learn](https://github.com/scikit-learn/scikit-learn) package. As training input, the developed framework requires the variant sequences and the corresponding screening results (fitness labels) of the identified variants as CSV (or FASTA-Like (FASL) datasets following a self-defined convention). Using linear or nonlinear regression methods (partial least squares (PLS), Ridge, Lasso, Elastic net, support vector machines (SVR), random forest (RF), and multilayer perceptron (MLP)-based regression), PyPEF trains on the given learning data while optimizing model hyperparameters (default: five-fold cross-validation) and can compute model performances on left-out test data. As sequences are encoded using amino acid descriptor sets taken from the [AAindex database](https://www.genome.jp/aaindex/), finding the best index-dependent encoding for a specific test set can be seen as a hyperparameter search on the test set. In addition, one-hot and [direct coupling analysis](https://en.wikipedia.org/wiki/Direct_coupling_analysis)-based feature generation are implemented as sequence encoding techniques, which often outperform AAindex-based encoding techniques. Finally, the selected or best identified encoding technique and regression model can be used to perform directed evolution walks *in silico* (see [Church-lab implementation](https://github.com/churchlab/UniRep) or the [reimplementation](https://github.com/ivanjayapurna/low-n-protein-engineering)) or to predict natural diverse or recombinant variant sequences that subsequently are to be designed and validated in the wet-lab.
 
 For detailed information, please refer to the above-mentioned publications and related Supporting Information.
 
@@ -96,7 +96,7 @@ pypef mklsts -w WT_SEQUENCE.FASTA -i VARIANT-FITNESS_DATA.CSV
 
 Training and testing a model (encoding technique = {`aaidx`, `onehot`, `dca`}, regression model = {`pls`, `ridge`, `lasso`, `elasticnet`, `svr`, `rf`, `mlp`}):
 ```
-pypef ml -e aaidx -l LEARNING_SET.FASTA -t TEST_SET.FASTA --regressor pls 
+pypef ml -e aaidx -l LEARNING_SET.FASL -t TEST_SET.FASL --regressor pls 
 ```
 
 Show the model performance(s) (reads and prints the created Model_Results.txt file):
@@ -106,7 +106,7 @@ pypef ml --show
 
 Load a trained model, predict fitness of test sequences using that model, and plot the measured versus the predicted fitness values:
 ```
-pypef ml -e aaidx -m MODEL -f TEST_SET.FASTA
+pypef ml -e aaidx -m MODEL -f TEST_SET.FASL
 ```
 `-m MODEL`is the saved model Pickle file name, for `-e aaidx` this will be the AAindex to use for encoding, e.g. `-m ARGP820101`, for `-e onehot` it will be `-m ONEHOTMODEL` and for `-e dca` it will be `-m DCAMODEL`.
 
@@ -149,7 +149,7 @@ pypef ml extrapolation -i VARIANT-FITNESS-ENCODING_DATA.CSV --regressor pls
 The use of the hybrid model (`pypef hybrid`) - instead of a pure ML model (`pypef ml`) as described in the steps above - is quite similar in terms of commands, but does not require the definition of the `-e`/`--encoding` and the `--regressor` flags, since it depends only on the DCA-based encoding technique and (so far) only uses Ridge regression for modeling. However, DCA-based encoding of sequences always requires a parameter file as input, which comes from the preprocessing of a query-specific multiple sequence alignment (MSA) and results in the parameter file generated by [plmc](https://github.com/debbiemarkslab/plmc). E.g. for training a model on a learning set and testing it on a test set, the command for hybrid modeling is:
 
 ```
-pypef hybrid -l LEARNING_SET.FASTA -t TEST_SET.FASTA --params PLMC_FILE.params
+pypef hybrid -l LEARNING_SET.FASL -t TEST_SET.FASL --params PLMC_FILE.params
 ``` 
 
 Sample files for testing PyPEF routines are provided in the workflow directory, which are also used when running the notebook tutorial. PyPEF's package dependencies are linked [here](https://github.com/niklases/PyPEF/network/dependencies).
@@ -159,7 +159,7 @@ As standard input files, PyPEF requires the target protein wild-type sequence in
 
 <a name="tutorial"></a>
 ## Tutorial
-Before starting running the tutorial, it is a good idea to set-up a new Python environment using Anaconda, https://www.anaconda.com/, e.g. using [Anaconda](https://www.anaconda.com/products/individual) ([Anaconda3-2020.11-Linux-x86_64.sh installer download](https://repo.anaconda.com/archive/Anaconda3-2020.11-Linux-x86_64.sh)) or [Miniconda](https://docs.conda.io/en/latest/miniconda.html).
+Before starting running the tutorial, it is a good idea to set up a new Python environment using Anaconda, https://www.anaconda.com/, e.g. using [Anaconda](https://www.anaconda.com/products/individual) ([Anaconda3-2020.11-Linux-x86_64.sh installer download](https://repo.anaconda.com/archive/Anaconda3-2020.11-Linux-x86_64.sh)) or [Miniconda](https://docs.conda.io/en/latest/miniconda.html).
 Change to the download directory and run the installation, e.g. in Linux:
 
 ```
@@ -339,11 +339,11 @@ plmc -o ANEH_72.6.params -le 72.6 -m 100 -g -f WT_ANEH ANEH_jhmmer.a2m
 
 Done! The output parameter (.params) file can be used for encoding sequences with the DCA-based encoding technique (`-e dca`) by providing it to PyPEF; e.g. for pure ML modeling:
 ```
-pypef ml -e dca -l LS.fasta -t TS.fasta --regressor pls --params ANEH_72.6.params
+pypef ml -e dca -l LS.fasl -t TS.fasl --regressor pls --params ANEH_72.6.params
 ```
 Or for hybrid modeling:
 ```
-pypef hybrid -l LS.fasta -t TS.fasta --params ANEH_72.6.params
+pypef hybrid -l LS.fasl -t TS.fasl --params ANEH_72.6.params
 ```
 
 <a name="api-usage"></a>

diff --git a/pypef/__init__.py b/pypef/__init__.py
@@ -17,4 +17,4 @@
 # §Equal contribution
 
 
-VERSION = '0.2.3-alpha'
+__version__ = '0.2.4-alpha'
diff --git a/pypef/dca/encoding.py b/pypef/dca/encoding.py
@@ -39,6 +39,8 @@
 
 
 from collections.abc import Iterable  # originally imported 'from collections'
+import logging
+logger = logging.getLogger('pypef.dca.encoding')
 
 import numpy as np
 import ray
@@ -47,7 +49,7 @@
 
 
 _SLICE = np.s_[:]
-np.warnings.filterwarnings('error', category=np.VisibleDeprecationWarning)
+# np.warnings.filterwarnings('error', category=np.VisibleDeprecationWarning)  # DEV
 
 class InvalidVariantError(Exception):
     """
@@ -73,7 +75,7 @@ def __init__(self, variant: str):
         super().__init__(self.message)
 
 
-class ActiveSiteError(Exception):
+class EffectiveSiteError(Exception):
     """
     Description
     -----------
@@ -93,10 +95,10 @@ def __init__(self, position: int, variant: str, verbose: bool = True):
         self.position = position
         self.variant = variant
         self.verbose = verbose
-        message = f"Info: The position {self.position} of variant '{self.variant}' is " \
+        message = f"The position {self.position} of variant '{self.variant}' is " \
                   f"not an effective site in the DCA model and thus cannot be predicted."
         if self.verbose:
-            print(message)
+            logger.info(message)
             self.message = message
             super().__init__(self.message)
 
@@ -285,19 +287,16 @@ def __read_plmc_v2(self, filename, precision):
             self.wt_aa_pos = []
             for aa, pos in zip(self._target_seq, self.index_list):
                 self.wt_aa_pos.append(str(aa) + str(pos))
-            print(f'Evaluating gap content of PLMC parameter file... '
+            logger.info(f'Evaluating gap content of PLMC parameter file... '
                   f'First amino acid position used in the MSA (PLMC params file) is '
                   f'{self._target_seq[0]}{self.index_list[0]} and the last position '
                   f'used is {self._target_seq[-1]}{self.index_list[-1]}.')
             if len(not_valid) > 0:
-                print(f'\nFurther, non-included positions are:')
-                print(str(not_valid)[1:-1])
+                logger.info(f'Further, non-included positions are:\n{str(not_valid)[1:-1]}')
             if self.verbose:
-                    print(f'\nSummary of all effective positions represented in the MSA '
-                          f'based on wild-type sequence ({len(valid)} encoded positions):')
-                    for aa_pos in self.wt_aa_pos:
-                        print(f'{aa_pos}', end=' ')
-                    print('\n')
+                    logger.info(f'Summary of all effective positions represented in the MSA '
+                                f'based on wild-type sequence ({len(valid)} encoded positions):\n'
+                                f'{str([aa_pos for aa_pos in self.wt_aa_pos])[1:-1]}'.replace("'", ""))
 
             # single site frequencies f_i and fields h_i
             self.f_i, = np.fromfile(
@@ -372,11 +371,8 @@ def target_seq(self, sequence):
             Length of sequence must correspond to model length (self.L)
         """
         if len(sequence) != self.L:
-            raise ValueError(
-                "Sequence length inconsistent with model length: {} {}".format(
-                    len(sequence), self.L
-                )
-            )
+            raise ValueError(f"Sequence length inconsistent with model length: {len(sequence)} {self.L}")
+
 
         if isinstance(sequence, str):
             sequence = list(sequence)
@@ -405,11 +401,7 @@ def index_list(self, mapping):
             Length of list must correspond to model length (self.L)
         """
         if len(mapping) != self.L:
-            raise ValueError(
-                "Mapping length inconsistent with model length: {} {}".format(
-                    len(mapping), self.L
-                )
-            )
+            raise ValueError(f"Mapping length inconsistent with model length: {len(mapping)} {self.L}")
 
         self._index_list = np.array(mapping)
         self.index_map = {b: a for a, b in enumerate(self.index_list)}
@@ -430,9 +422,7 @@ def __map(self, indices, mapping):
         """
         if ((isinstance(indices, Iterable) and not isinstance(indices, str)) or
                 (isinstance(indices, str) and len(indices) > 1)):
-            return np.array(
-                [mapping[i] for i in indices]
-            )
+            return np.array([mapping[i] for i in indices])
         else:
             return mapping[indices]
 
@@ -461,6 +451,7 @@ def __4d_access(self, matrix, i=None, j=None, A_i=None, A_j=None):
         j = self.__map(j, self.index_map) if j is not None else _SLICE
         A_i = self.__map(A_i, self.alphabet_map) if A_i is not None else _SLICE
         A_j = self.__map(A_j, self.alphabet_map) if A_j is not None else _SLICE
+
         return matrix[i, j, A_i, A_j]
 
     def __2d_access(self, matrix, i=None, A_i=None):
@@ -482,6 +473,7 @@ def __2d_access(self, matrix, i=None, A_i=None):
         """
         i = self.__map(i, self.index_map) if i is not None else _SLICE
         A_i = self.__map(A_i, self.alphabet_map) if A_i is not None else _SLICE
+
         return matrix[i, A_i]
 
     def Jij(self, i=None, j=None, A_i=None, A_j=None):
@@ -570,6 +562,7 @@ def Ji(self, i: int, A_i: str, sequence: np.ndarray) -> float:
         Ji = 0.0
         for j, A_j in zip(self.index_list, sequence):
             Ji += self.Jij(i=i, A_i=A_i, j=j, A_j=A_j)
+
         return Ji
 
     @staticmethod
@@ -633,7 +626,7 @@ def encode_variant(self, variant: str) -> np.ndarray:
 
             i = self._get_position_internal(position)
             if not i:
-                raise ActiveSiteError(position, variant, self.verbose)
+                raise EffectiveSiteError(position, variant, self.verbose)
 
             self.check_substitution_naming_against_wt(substitution, variant)
             i_mapped = self.index_map[i]
@@ -645,7 +638,7 @@ def encode_variant(self, variant: str) -> np.ndarray:
 
         return X_var
 
-    def collect_encoded_sequences(self, variants: list) -> list:
+    def collect_encoded_sequences(self, variants: list) -> np.ndarray:
         """
         Description
         -----------
@@ -677,10 +670,10 @@ def collect_encoded_sequences(self, variants: list) -> list:
         for i, variant in enumerate(tqdm(variants, disable=set_silence)):
             try:
                 encoded_sequences.append(self.encode_variant(variant))
-            except ActiveSiteError:
+            except EffectiveSiteError:
                 encoded_sequences.append([None])
 
-        return encoded_sequences
+        return np.array(encoded_sequences, dtype='object')
 
 
 """
@@ -711,7 +704,7 @@ def get_encoded_sequence(
     """
     try:
         encoded_seq = dca_encode.encode_variant(variant)
-    except ActiveSiteError:  # position not included in processed MSA
+    except EffectiveSiteError:  # position not included in processed MSA
         return
 
     return encoded_seq
@@ -748,8 +741,9 @@ def _get_data_parallel(
     for i, (variant, fitness) in enumerate(zip(variants, fitnesses)):
         try:
             data.append([variant, dca_encode.encode_variant(variant), fitness])
-        except ActiveSiteError:  # do not append non-encoded sequences and
+        except EffectiveSiteError:  # do not append non-encoded sequences and
             pass                 # associated fitness values
+
     return data
 
 
@@ -785,12 +779,12 @@ def get_dca_data_parallel(
         List of variant names that cannot be used for modelling as they are not effective
         positions in the underlying MSA used for generating local and coupling terms.
     """
-    print(f'{len(variants)} input variants.')
-    print(f'Encoding variant sequences. This might take some time...')
+    logger.info(f'{len(variants)} input variants. Encoding variant sequences. '
+                f'This might take some time...')
 
     idxs_nan = np.array([i for i, b in enumerate(np.isnan(fitnesses)) if b])  # find fitness NaNs
     if idxs_nan.size > 0:  # remove NaNs if present
-        print('Fitness NaNs are:', idxs_nan)
+        logger.info(f'Fitness NaNs are: {idxs_nan}')
         fitnesses = np.delete(fitnesses, idxs_nan)
         variants = np.delete(variants, idxs_nan)
 
@@ -810,7 +804,7 @@ def get_dca_data_parallel(
     encoded_sequences = [item[1] for item in data]
     fitnesses = [item[2] for item in data]
 
-    print(f'{len(data)} variants after NaN-valued and non-effective '
-          f'site-substituted variant (ActiveSiteError) dropping.')
+    logger.info(f'{len(data)} variants after NaN-valued and non-effective '
+                f'site-substituted variant (EffectiveSiteError) dropping.')
 
     return variants, encoded_sequences, fitnesses
Original file line number	Diff line number	Diff line change
Expand Up		@@ -17,4 +17,4 @@
		# §Equal contribution


		VERSION = '0.2.3-alpha'
		__version__ = '0.2.4-alpha'