
Bug transformer fitting #385

Merged · 31 commits into 1.7.0 on Jan 15, 2025

Conversation

stewarthe6 (Collaborator):

Transformers for normalizing inputs, outputs, and weights are being fitted on the full dataset instead of just the training dataset. This causes data leakage.
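For context, here is a minimal sketch in plain NumPy (not AMPL code) of the leakage being fixed: a normalizer whose statistics are computed on the full dataset absorbs information from the held-out data, while one fitted on the training split alone does not.

```python
import numpy as np

train_y = np.array([1.0, 2.0, 3.0])
test_y = np.array([100.0])                 # an extreme value in the held-out split

# Leaky: normalization statistics computed over train + test
full_y = np.concatenate([train_y, test_y])
leaky_mean, leaky_std = full_y.mean(), full_y.std()

# Correct: statistics computed on the training split only, then applied to every split
train_mean, train_std = train_y.mean(), train_y.std()
scaled_test = (test_y - train_mean) / train_std

print(leaky_mean, train_mean)              # 26.5 vs 2.0: the leaky mean is pulled toward the test point
```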

@stewarthe6 stewarthe6 marked this pull request as draft December 11, 2024 23:05
```diff
@@ -301,7 +301,7 @@ def load_featurize_data(self, params=None):
         # is fitted to the training data only. The transformers are then applied to the training,
         # validation and test sets separately.
         if not params.split_only:
-            self.model_wrapper.create_transformers(self.data)
+            self.model_wrapper.create_transformers(trans.get_all_training_datasets(self.data))
```
Collaborator:

this get_all_training_datasets seems like a function that belongs in model_datasets?

stewarthe6 (Collaborator, Author):

I disagree. The structure created by get_all_training_datasets is very specific to how transformers are implemented and I would like all that code in one place. Eventually I think there should be a TransformerManager object that can build all needed transformers given a dataset.
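As a purely hypothetical illustration of that idea (no TransformerManager exists in the codebase), such an object might fit every transformer type per fold from the structure returned by trans.get_all_training_datasets():

```python
class TransformerManager:
    """Hypothetical sketch: build all needed transformers in one place, fold by fold."""

    def __init__(self, params):
        self.params = params
        self.transformers_y = {}   # fold key -> response transformers
        self.transformers_x = {}   # fold key -> feature transformers
        self.transformers_w = {}   # fold key -> weight transformers

    def fit(self, training_datasets):
        """training_datasets maps a fold key (int or 'final') to that fold's training data."""
        for fold, train_data in training_datasets.items():
            # each entry would be fitted on train_data only; empty lists stand in here
            self.transformers_y[fold] = []
            self.transformers_x[fold] = []
            self.transformers_w[fold] = []
        return self
```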

codecov bot commented Dec 11, 2024

Codecov Report

Attention: Patch coverage is 80.71066% with 38 lines in your changes missing coverage. Please review.

Files with missing lines                  | Patch %  | Lines
atomsci/ddm/pipeline/model_wrapper.py     | 78.30%   | 23 Missing ⚠️
atomsci/ddm/pipeline/perf_data.py         | 80.00%   |  9 Missing ⚠️
atomsci/ddm/pipeline/transformations.py   | 89.65%   |  3 Missing ⚠️
atomsci/ddm/pipeline/model_pipeline.py    | 33.33%   |  2 Missing ⚠️
atomsci/ddm/pipeline/featurization.py     |  0.00%   |  1 Missing ⚠️
@@            Coverage Diff             @@
##            1.7.0     #385      +/-   ##
==========================================
+ Coverage   20.29%   28.53%   +8.24%     
==========================================
  Files          47       47              
  Lines       12883    12919      +36     
==========================================
+ Hits         2615     3687    +1072     
+ Misses      10268     9232    -1036     
Files with missing lines                  | Coverage Δ
atomsci/ddm/pipeline/model_datasets.py    | 60.86% <100.00%> (+12.98%) ⬆️
atomsci/ddm/pipeline/featurization.py     | 51.10% <0.00%> (+12.88%) ⬆️
atomsci/ddm/pipeline/model_pipeline.py    | 43.55% <33.33%> (+14.70%) ⬆️
atomsci/ddm/pipeline/transformations.py   | 56.85% <89.65%> (+17.86%) ⬆️
atomsci/ddm/pipeline/perf_data.py         | 69.56% <80.00%> (+57.32%) ⬆️
atomsci/ddm/pipeline/model_wrapper.py     | 60.21% <78.30%> (+30.53%) ⬆️

... and 3 files with indirect coverage changes

```diff
-        # Set up transformers for weights, if needed
-        self.transformers_w = trans.create_weight_transformers(self.params, model_dataset)
+        # Set up transformers for weights, if needed
+        self.transformers_w[k] = trans.create_weight_transformers(self.params, td)
```
Collaborator:

Should this one be a privileged function like _create_weight_transformers?

stewarthe6 (Collaborator, Author):

I made this match the create_feature_transformers above. I realize that this is similar to the create*_transformers in model_wrapper. These functions can be called by anyone, but the ones in model_wrapper used to have side effects.
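To make the side-effect-free pattern concrete, here is a rough sketch with assumed names (not the AMPL implementation): a free function returns weight transformers fitted on a single training fold, and the caller keeps them in a per-fold dictionary.

```python
import numpy as np

def create_weight_transformers_sketch(params, train_y):
    """Return class-balancing weights fitted on this fold's training responses only."""
    if params.get("weight_transform_type") != "balancing":
        return []
    classes, counts = np.unique(train_y, return_counts=True)
    # inverse-frequency weight per class
    return [{cls: len(train_y) / (len(classes) * n) for cls, n in zip(classes, counts)}]

params = {"weight_transform_type": "balancing"}
training_folds = {0: np.array([0, 0, 0, 1]), 1: np.array([0, 1, 1, 1])}
transformers_w = {k: create_weight_transformers_sketch(params, td) for k, td in training_folds.items()}
print(transformers_w[0])   # the minority class in fold 0 gets the larger weight
```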

```python
if os.path.exists(local_path):
    self.log.info(f"Reloading transformers from model tarball {local_path}")
    with open(local_path, 'rb') as txfmr:
        transformers_tuple = pickle.load(txfmr)
else:
```
Collaborator:
This always prefers the model tarball transformers over a supplied path to a transformers.pkl object. Is this what we want? I think in other situations we provide a path (like a path to external training data) to override the one inherent in the model, so this is the opposite logic.

Also, we'd need some tests to make sure a user-supplied transformer object is valid.

Collaborator:

I think the tests could wait if we leave the logic the way it is, since it will be rare that the user-supplied case comes up.

stewarthe6 (Collaborator, Author):

I think this logic matches the current transformer logic. We can change it later during a larger refactor.
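The precedence under discussion, roughly sketched with assumed names (not exact AMPL code): the copy pickled inside the model tarball wins, and a user-supplied path is only consulted when that copy is absent.

```python
import os
import pickle

def load_transformers(tarball_pkl_path, user_supplied_path=None):
    if os.path.exists(tarball_pkl_path):
        # current behavior: always prefer the copy baked into the model tarball
        path = tarball_pkl_path
    elif user_supplied_path is not None:
        # fallback only when no transformers.pkl is found in the tarball
        path = user_supplied_path
    else:
        raise FileNotFoundError("No transformers.pkl available")
    with open(path, "rb") as txfmr:
        return pickle.load(txfmr)
```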


```python
self.transformers[i] = ty
self.transformers_x[i] = tx
self.transformers_w[i] = tw
```
Collaborator:
I would like to see self.transformers updated to be self.transformers_y in a refactor / when we're comfortable breaking backward compatibility.

stewarthe6 (Collaborator, Author), Dec 12, 2024:

Yes, I agree. But I don't think many people directly access transformers.

"""Transform the responses and/or features in the given DeepChem dataset using the current transformers.

Args:
dataset: The DeepChem DiskDataset that contains a dataset
fold (int): Which fold is being transformed.
Collaborator:

The 'train_valid' fold isn't an int, but that's a minor quibble as long as the transformers object is a dictionary everywhere.

Collaborator:

If you're specifying a fold here, maybe we should make it more explicitly linked to the correct fold of the dataset as well, instead of relying on a different part of the code to pass the correct fold of the dataset into this function. So instead of just setting transformed_dataset=dataset, use the fold=0 parameter to do something like transformed_dataset=dataset.get_fold(fold)?

Collaborator:

After reading the model_pipeline code, it makes sense to leave the k-fold selection of the dataset there with the k-fold loop; maybe just add to the docstring of this one that, for k-fold splits, the data passed in should be / is the k-th fold, so people understand.

stewarthe6 (Collaborator, Author):

Yeah, I'll have to go back and document this to explain when to use an integer and when to use the 'final' key.
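For illustration, the keying convention being described might look like this (assumed from this thread, not verified against the final docstring): integer keys while looping over CV folds, and a 'final' key for the transformers used by the final model.

```python
transformers_y = {
    0: [],        # transformers fitted on the training portion of fold 0
    1: [],
    2: [],
    'final': [],  # transformers fitted on the training data used for the final model
}

def pick_transformers(transformers_y, fold):
    """fold is an int while iterating over CV folds, and the string 'final' afterwards."""
    return transformers_y[fold]
```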

atomsci/ddm/pipeline/model_wrapper.py (outdated comment, resolved)

```python
    return result

def test_KFoldRegressionPerfData():
```
Collaborator:

Docstring: is the goal of this test to make sure you can create the object and that values are stored where they're supposed to be stored? Should this also be tested with non-k-fold data to make sure the object is not created?

stewarthe6 (Collaborator, Author):

There are non-k-fold examples that use SimpleRegressionPerfData(), for instance.


```python
# perfect score every time
assert r2_mean == 1
assert r2_std == 0
```
Collaborator:

Here you're testing multitask but only checking the mean performance metrics, not per-task?

stewarthe6 (Collaborator, Author):

Yeah, that's what compute_perf_metrics does: it averages all the r2 values, including r2s from different tasks.
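A small sketch of that averaging behavior (standard scikit-learn and NumPy, not the AMPL implementation): per-task r2 scores are computed and then collapsed into a single mean and standard deviation, so the per-task detail is not reported.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 31.0]])   # two tasks
y_pred = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])   # second task slightly off

per_task_r2 = [r2_score(y_true[:, j], y_pred[:, j]) for j in range(y_true.shape[1])]
r2_mean, r2_std = np.mean(per_task_r2), np.std(per_task_r2)
print(per_task_r2, r2_mean, r2_std)   # the per-task values are folded into one number
```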


```python
def test_one_task_no_missing_values():
    y = np.array([[1.0], [3.0], [5.0]])
    w = np.array([[1, 1], [1, 1], [1, 1]])
```
Collaborator:

Is w supposed to be the same shape as y?

stewarthe6 (Collaborator, Author):

Yes, I asked Copilot to write that test. I should be more careful.
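For reference, a corrected version of that fixture would give the weights the same shape as the responses, i.e. one column per task (a sketch, not the committed test):

```python
import numpy as np

y = np.array([[1.0], [3.0], [5.0]])
w = np.array([[1.0], [1.0], [1.0]])   # one weight column per task, matching y
assert w.shape == y.shape
```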

@@ -1279,16 +1238,12 @@ def get_pred_values(self):
"""
ids = sorted(self.pred_vals.keys())
Collaborator:

Unrelated, but didn't we agree not to sort predictions?

Collaborator:

I double checked, KFoldRegressionPerfData and KFoldClassificationPerfData sort in get_pred_values(), get_real_values() and get_weights() while SimpleClassificationPerfData, SimpleRegressionPerfData and SimpleHybridPerfData do NOT sort by ID.

stewarthe6 (Collaborator, Author):

I think the order in which the keys are returned from self.pred_vals is effectively random, so it's either sorted or random. Not sure which would be better.

```diff
 # Convert true values to one-hot encoding
 if self.num_classes > 2:
-    self.real_vals = np.concatenate([dc.metrics.to_one_hot(dataset.y[:,j], self.num_classes).reshape(-1,1,self.num_classes)
+    self.real_vals = np.concatenate([dc.metrics.to_one_hot(self.real_classes[:,j], self.num_classes).reshape(-1,1,self.num_classes)
                                      for j in range(self.num_tasks)], axis=1)
```
Collaborator:

Does one-hot encoding count as a transformation, for a refactor?

stewarthe6 (Collaborator, Author):

I think sklearn lets you add a one-hot step into the pipeline.
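For example, in plain scikit-learn (not AMPL code) a one-hot step can sit inside a Pipeline, so its categories are learned from the training split and only applied to validation and test data, like any other fitted transformer:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ("clf", LogisticRegression()),
])

X_train = [["red"], ["green"], ["blue"], ["red"]]
y_train = [0, 1, 0, 1]
pipe.fit(X_train, y_train)        # encoder categories learned from the training data only
print(pipe.predict([["green"]]))
```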

@mauvais2 (Collaborator):

Just to mention that these tests were affected:

  • FAILED test_delaney_panel.py::test_reg_config_delaney_fit_RF_3fold - IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices

  • FAILED test_delaney_panel.py::test_reg_config_delaney_fit_XGB_3fold - IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices

  • FAILED test_delaney_panel.py::test_reg_kfold_config_delaney_fit_NN_graphconv - IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices

  • FAILED test_balancing_transformer.py::test_balancing_transformer - ValueError: not enough values to unpack (expected 2, got 1)

  • FAILED test_balancing_transformer.py::test_all_transformers - assert 0.0

  • FAILED test_perf_data.py::test_KFoldRegressionPerfData - IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices

  • FAILED test_perf_data.py::test_KFoldRegressionPerfDataMulti - IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices

  • FAILED test_perf_data.py::test_KFoldClassificationPerfData - IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices

@stewarthe6 stewarthe6 marked this pull request as ready for review January 14, 2025 18:13
@stewarthe6 stewarthe6 merged commit 94f93d0 into 1.7.0 Jan 15, 2025
12 checks passed