
serialization-deserialization bug #143

Open
patrickleonardy opened this issue Jan 11, 2023 · 4 comments · May be fixed by #178
Labels: bug (Something isn't working), good first issue (Good for newcomers)

Comments

@patrickleonardy
Contributor

Bug Report

After serializing and de-serializing a PreProcessor with only continuous variables (still to be checked whether the same happens when categorical variables are present):

  1. the preprocessor object cannot be printed -> AttributeError
  2. when trying to transform data, the KBinsDiscretizer throws -> NotFittedError

Description

For the first point: the problem seems to lie in the difference between the attribute names and the parameter names in the function definition. self._get_param_names() returns "categorical_data_processor", but getattr() only knows "_categorical_data_processor".
Renaming them resolves the problem, but is there no other way?
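
To illustrate, here is a hypothetical minimal sketch of the mismatch, assuming a scikit-learn-style _get_param_names() that derives parameter names from the __init__ signature (the class below is a stand-in, not Cobra's actual code):

    import inspect

    class PreProcessorSketch:
        def __init__(self, categorical_data_processor=None):
            # attribute stored with a leading underscore, unlike the parameter name
            self._categorical_data_processor = categorical_data_processor

        @classmethod
        def _get_param_names(cls):
            # derives names from the __init__ signature, scikit-learn style
            return [p for p in inspect.signature(cls.__init__).parameters
                    if p != "self"]  # -> ["categorical_data_processor"]

    p = PreProcessorSketch()
    for name in p._get_param_names():
        getattr(p, name)  # AttributeError, which is what breaks printing the object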

For the second point: there is a problem when creating the pipeline dictionary; it seems that some keywords are empty even though they should have a value.

Steps to Reproduce

  1. Load a dataset:
    from sklearn.datasets import load_iris
    import pandas as pd
    X, y = load_iris(return_X_y=True, as_frame=True)
    df = pd.concat([X, y], axis=1)  # concatenate column-wise so the target becomes a column
    df = df.rename({0: "target"}, axis=1)
  2. Create a preprocessor and fit it:
    from cobra.preprocessing import PreProcessor
    preprocessor = PreProcessor.from_params()
    continuous_vars = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
    discrete_vars = []
    preprocessor.fit(df, continuous_vars=continuous_vars, discrete_vars=discrete_vars, target_column_name="target")
  3. Serialize the preprocessor:
    pipeline_serialized = preprocessor.serialize_pipeline()
  4. De-serialize:
    new_preprocessor = PreProcessor.from_pipeline(pipeline_serialized)
  5. See what happens when printing:
    new_preprocessor
  6. See what happens when transforming:
    new_preprocessor.transform(df, continuous_vars=continuous_vars, discrete_vars=[])

Actual Results

I got the errors shown in the attached screenshots: an AttributeError when printing the de-serialized preprocessor, and a NotFittedError when calling transform.

@patrickleonardy patrickleonardy added the bug Something isn't working label Jan 12, 2023
@sandervh14 sandervh14 assigned sandervh14 and unassigned sandervh14 Feb 22, 2023
@sandervh14 sandervh14 added the good first issue Good for newcomers label Feb 22, 2023
@sandervh14 sandervh14 added this to the 2023-03 milestone Mar 9, 2023
@sandervh14 sandervh14 modified the milestones: 2023-03, 2023-04 Apr 7, 2023
@joostneuj joostneuj self-assigned this May 10, 2023
@joostneuj

The fact that you cannot print the de-serialized preprocessor is not necessarily an issue, I think? The behaviour seems to be the same before and after (de)serialization.

I was looking into the issue of why you cannot transform after de-serializing, and I think some information is lost in the (de-)serialization process.

To give an example:

  • After creating and fitting a preprocessor object (called preprocessor) on sample data (using only continuous variables), information on the bins per column is stored, which becomes visible by running:
    preprocessor._discretizer._bins_by_column.

The _bins_by_column element is not visible when just looking at the _discretizer, but it is still there.

  • After serializing and de-serializing the same preprocessor object, this information is lost. When calling the transform method of the kbins_discretizer class, the following test is run (line 272):
    if len(self._bins_by_column) == 0:
        msg = ("{} instance is not fitted yet. Call 'fit' with "
               "appropriate arguments before using this method.")
        raise NotFittedError(msg.format(self.__class__.__name__))

This is why (for me, at least) new data cannot be transformed directly after de-serializing. I wanted to leave some information here already, but will investigate further. I can imagine the same is happening in the categorical data processor.

A way forward would probably be to make sure the full information gets (de-)serialized.
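
A quick way to confirm the loss, reusing the reproduction steps from the issue description (the attribute names follow the comment above; this is a diagnostic sketch, not part of Cobra's public API):

    # Inspect the fitted bins before and after the serialization round trip
    before = preprocessor._discretizer._bins_by_column
    pipeline_serialized = preprocessor.serialize_pipeline()
    new_preprocessor = PreProcessor.from_pipeline(pipeline_serialized)
    after = new_preprocessor._discretizer._bins_by_column

    print(len(before))  # > 0: bins were fitted per column
    print(len(after))   # 0: bins were dropped, so transform raises NotFittedError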

@sandervh14 sandervh14 modified the milestones: 2023-04, 2023-05 May 24, 2023
@sandervh14
Contributor

Patrick's logged issue #176 might be a duplicate; @patrickleonardy, could you check whether yours was on the model and Joost's on the preprocessor?
If you have questions on how to fix it, you can also ask Benoît, who struggled with the issue about a month ago.

@patrickleonardy
Contributor Author

#176 relates to the serialization/de-serialization of the LinearRegressionModel and maybe the LogisticRegressionModel, whereas this issue is about the PreProcessor serialization/de-serialization.

So no duplicate here.

@joostneuj

joostneuj commented Jun 16, 2023

@patrickleonardy @sandervh14

For me the issue is solved now. The main issue was in target_encoder.py.

At line 126, there is a check on a parameter (_global_mean) of the target encoder. This is a floating-point number, in my case of type np.float64. The if statement only checked whether type == float. This check failed, and hence the variable was left empty during de-serialization. Therefore Cobra assumed the target_encoder was not fitted.

I extended the check to take different kinds of floating-point numbers into account using:
isinstance(params["_global_mean"], (np.floating, float))
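
A minimal sketch of why the original check fails, assuming the de-serialized value comes back as np.float64 (the variable name here is illustrative):

    import numpy as np

    global_mean = np.float64(0.42)  # what de-serialization yields in this case

    # Original check: exact type comparison rejects numpy floats
    print(type(global_mean) == float)                      # False

    # Extended check from the fix: accepts numpy and built-in floats
    print(isinstance(global_mean, (np.floating, float)))   # True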

I have tested the entire flow with continuous and categorical variables, and everything seems to work fine now. The debugging is documented in a notebook, which has been pushed to git as well.

joostneuj added a commit that referenced this issue Jun 16, 2023
@joostneuj joostneuj linked a pull request Jun 16, 2023 that will close this issue