
Bug transformer fitting #385

Merged · 31 commits into 1.7.0 on Jan 15, 2025

Conversation

stewarthe6 (Collaborator):

Transformers for normalizing inputs, outputs, and weights are being fitted on the full dataset instead of just the training dataset. This causes data leakage.
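For context, here is a minimal sketch in plain NumPy (not AMPL code) of the leakage being fixed: a normalizer whose statistics are computed on the full dataset absorbs information from the held-out data, while one fitted on the training split alone does not.

```python
import numpy as np

train_y = np.array([1.0, 2.0, 3.0])
test_y = np.array([100.0])                 # an extreme value in the held-out split

# Leaky: normalization statistics computed over train + test
full_y = np.concatenate([train_y, test_y])
leaky_mean, leaky_std = full_y.mean(), full_y.std()

# Correct: statistics computed on the training split only, then applied to every split
train_mean, train_std = train_y.mean(), train_y.std()
scaled_test = (test_y - train_mean) / train_std

print(leaky_mean, train_mean)              # 26.5 vs 2.0: the leaky mean is pulled toward the test point
```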

@stewarthe6 stewarthe6 marked this pull request as draft December 11, 2024 23:05
```diff
@@ -301,7 +301,7 @@ def load_featurize_data(self, params=None):
         # is fitted to the training data only. The transformers are then applied to the training,
         # validation and test sets separately.
         if not params.split_only:
-            self.model_wrapper.create_transformers(self.data)
+            self.model_wrapper.create_transformers(trans.get_all_training_datasets(self.data))
```
Collaborator:

this get_all_training_datasets seems like a function that belongs in model_datasets?

stewarthe6 (Collaborator, Author):

I disagree. The structure created by get_all_training_datasets is very specific to how transformers are implemented and I would like all that code in one place. Eventually I think there should be a TransformerManager object that can build all needed transformers given a dataset.
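As a purely hypothetical illustration of that idea (no TransformerManager exists in the codebase), such an object might fit every transformer type per fold from the structure returned by trans.get_all_training_datasets():

```python
class TransformerManager:
    """Hypothetical sketch: build all needed transformers in one place, fold by fold."""

    def __init__(self, params):
        self.params = params
        self.transformers_y = {}   # fold key -> response transformers
        self.transformers_x = {}   # fold key -> feature transformers
        self.transformers_w = {}   # fold key -> weight transformers

    def fit(self, training_datasets):
        """training_datasets maps a fold key (int or 'final') to that fold's training data."""
        for fold, train_data in training_datasets.items():
            # each entry would be fitted on train_data only; empty lists stand in here
            self.transformers_y[fold] = []
            self.transformers_x[fold] = []
            self.transformers_w[fold] = []
        return self
```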

codecov bot commented Dec 11, 2024

Codecov Report

Attention: Patch coverage is 80.71066% with 38 lines in your changes missing coverage. Please review.

Files with missing lines                  | Patch %  | Lines
atomsci/ddm/pipeline/model_wrapper.py     | 78.30%   | 23 Missing ⚠️
atomsci/ddm/pipeline/perf_data.py         | 80.00%   |  9 Missing ⚠️
atomsci/ddm/pipeline/transformations.py   | 89.65%   |  3 Missing ⚠️
atomsci/ddm/pipeline/model_pipeline.py    | 33.33%   |  2 Missing ⚠️
atomsci/ddm/pipeline/featurization.py     |  0.00%   |  1 Missing ⚠️
@@            Coverage Diff             @@
##            1.7.0     #385      +/-   ##
==========================================
+ Coverage   20.29%   28.53%   +8.24%     
==========================================
  Files          47       47              
  Lines       12883    12919      +36     
==========================================
+ Hits         2615     3687    +1072     
+ Misses      10268     9232    -1036     
Files with missing lines                  | Coverage Δ
atomsci/ddm/pipeline/model_datasets.py    | 60.86% <100.00%> (+12.98%) ⬆️
atomsci/ddm/pipeline/featurization.py     | 51.10% <0.00%> (+12.88%) ⬆️
atomsci/ddm/pipeline/model_pipeline.py    | 43.55% <33.33%> (+14.70%) ⬆️
atomsci/ddm/pipeline/transformations.py   | 56.85% <89.65%> (+17.86%) ⬆️
atomsci/ddm/pipeline/perf_data.py         | 69.56% <80.00%> (+57.32%) ⬆️
atomsci/ddm/pipeline/model_wrapper.py     | 60.21% <78.30%> (+30.53%) ⬆️

... and 3 files with indirect coverage changes

```diff
-        # Set up transformers for weights, if needed
-        self.transformers_w = trans.create_weight_transformers(self.params, model_dataset)
+        # Set up transformers for weights, if needed
+        self.transformers_w[k] = trans.create_weight_transformers(self.params, td)
```
Collaborator:

Should this one be a privileged function like _create_weight_transformers?

stewarthe6 (Collaborator, Author):

I made this match the create_feature_transformers above. I realize that this is similar to the create*_transformers in model_wrapper. These functions can be called by anyone, but the ones in model_wrapper used to have side effects.
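To make the side-effect-free pattern concrete, here is a rough sketch with assumed names (not the AMPL implementation): a free function returns weight transformers fitted on a single training fold, and the caller keeps them in a per-fold dictionary.

```python
import numpy as np

def create_weight_transformers_sketch(params, train_y):
    """Return class-balancing weights fitted on this fold's training responses only."""
    if params.get("weight_transform_type") != "balancing":
        return []
    classes, counts = np.unique(train_y, return_counts=True)
    # inverse-frequency weight per class
    return [{cls: len(train_y) / (len(classes) * n) for cls, n in zip(classes, counts)}]

params = {"weight_transform_type": "balancing"}
training_folds = {0: np.array([0, 0, 0, 1]), 1: np.array([0, 1, 1, 1])}
transformers_w = {k: create_weight_transformers_sketch(params, td) for k, td in training_folds.items()}
print(transformers_w[0])   # the minority class in fold 0 gets the larger weight
```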

```python
if os.path.exists(local_path):
    self.log.info(f"Reloading transformers from model tarball {local_path}")
    with open(local_path, 'rb') as txfmr:
        transformers_tuple = pickle.load(txfmr)
else:
```
Collaborator:
This always prefers the model tarball transformers over a supplied path to a transformers.pkl object. Is this what we want? I think in other situations we provide a path (like a path to external training data) to override the one inherent in the model, so this is the opposite logic.

Also, we'd need some tests to make sure a user-supplied transformer object is valid.

Collaborator:

I think the tests could wait if we leave the logic the way it is, since it will be rare that the user-supplied case comes up.

stewarthe6 (Collaborator, Author):

I think this logic matches the current transformer logic. We can change it later during a larger refactor.
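The precedence under discussion, roughly sketched with assumed names (not exact AMPL code): the copy pickled inside the model tarball wins, and a user-supplied path is only consulted when that copy is absent.

```python
import os
import pickle

def load_transformers(tarball_pkl_path, user_supplied_path=None):
    if os.path.exists(tarball_pkl_path):
        # current behavior: always prefer the copy baked into the model tarball
        path = tarball_pkl_path
    elif user_supplied_path is not None:
        # fallback only when no transformers.pkl is found in the tarball
        path = user_supplied_path
    else:
        raise FileNotFoundError("No transformers.pkl available")
    with open(path, "rb") as txfmr:
        return pickle.load(txfmr)
```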


```python
self.transformers[i] = ty
self.transformers_x[i] = tx
self.transformers_w[i] = tw
```
Collaborator:
I would like to see self.transformers updated to be self.transformers_y in a refactor / when we're comfortable breaking backward compatibility.

stewarthe6 (Collaborator, Author), Dec 12, 2024:

Yes, I agree. But I don't think many people directly access transformers.

"""Transform the responses and/or features in the given DeepChem dataset using the current transformers.

Args:
dataset: The DeepChem DiskDataset that contains a dataset
fold (int): Which fold is being transformed.
Collaborator:

The 'train_valid' fold isn't an int, but that's a minor quibble as long as the transformers object is a dictionary everywhere.

Collaborator:

If you're specifying a fold here, maybe we should make it more explicitly linked to the correct fold of the dataset as well, instead of relying on a different part of the code to pass the correct fold of the dataset into this function. So instead of just setting transformed_dataset=dataset, use the fold=0 parameter to do something like transformed_dataset=dataset.get_fold(fold)?

Collaborator:

After reading the model_pipeline code, it makes sense to leave the k-fold selection of the dataset there with the k-fold loop; maybe just add to the docstring of this one that, for k-fold splits, the data passed in should be / is the k-th fold, so people understand.

stewarthe6 (Collaborator, Author):

Yeah, I'll have to go back and document this to explain when to use an integer and when to use the 'final' key.
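For illustration, the keying convention being described might look like this (assumed from this thread, not verified against the final docstring): integer keys while looping over CV folds, and a 'final' key for the transformers used by the final model.

```python
transformers_y = {
    0: [],        # transformers fitted on the training portion of fold 0
    1: [],
    2: [],
    'final': [],  # transformers fitted on the training data used for the final model
}

def pick_transformers(transformers_y, fold):
    """fold is an int while iterating over CV folds, and the string 'final' afterwards."""
    return transformers_y[fold]
```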

atomsci/ddm/pipeline/model_wrapper.py (outdated comment, resolved)

```python
    return result

def test_KFoldRegressionPerfData():
```
Collaborator:

Docstring: is the goal of this test to make sure you can create the object and that values are stored where they're supposed to be stored? Should this also be tested with non-k-fold data to make sure the object is not created?

stewarthe6 (Collaborator, Author):

There are non-k-fold examples that use SimpleRegressionPerfData(), for instance.


```python
# perfect score every time
assert r2_mean == 1
assert r2_std == 0
```
Collaborator:

Here you're testing multitask but only checking the mean performance metrics, not per-task?

stewarthe6 (Collaborator, Author):

Yeah, that's what compute_perf_metrics does: it averages all the r2 values, including r2s from different tasks.
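A small sketch of that averaging behavior (standard scikit-learn and NumPy, not the AMPL implementation): per-task r2 scores are computed and then collapsed into a single mean and standard deviation, so the per-task detail is not reported.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 31.0]])   # two tasks
y_pred = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])   # second task slightly off

per_task_r2 = [r2_score(y_true[:, j], y_pred[:, j]) for j in range(y_true.shape[1])]
r2_mean, r2_std = np.mean(per_task_r2), np.std(per_task_r2)
print(per_task_r2, r2_mean, r2_std)   # the per-task values are folded into one number
```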


```python
def test_one_task_no_missing_values():
    y = np.array([[1.0], [3.0], [5.0]])
    w = np.array([[1, 1], [1, 1], [1, 1]])
```
Collaborator:

Is w supposed to be the same shape as y?

stewarthe6 (Collaborator, Author):

Yes, I asked Copilot to write that test. I should be more careful.
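For reference, a corrected version of that fixture would give the weights the same shape as the responses, i.e. one column per task (a sketch, not the committed test):

```python
import numpy as np

y = np.array([[1.0], [3.0], [5.0]])
w = np.array([[1.0], [1.0], [1.0]])   # one weight column per task, matching y
assert w.shape == y.shape
```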

@@ -1279,16 +1238,12 @@ def get_pred_values(self):
"""
ids = sorted(self.pred_vals.keys())
Collaborator:

Unrelated, but didn't we agree not to sort predictions?

Collaborator:

I double checked, KFoldRegressionPerfData and KFoldClassificationPerfData sort in get_pred_values(), get_real_values() and get_weights() while SimpleClassificationPerfData, SimpleRegressionPerfData and SimpleHybridPerfData do NOT sort by ID.

stewarthe6 (Collaborator, Author):

I think the order in which the keys are returned from self.pred_vals is effectively random, so it's either sorted or random. Not sure which would be better.

```diff
 # Convert true values to one-hot encoding
 if self.num_classes > 2:
-    self.real_vals = np.concatenate([dc.metrics.to_one_hot(dataset.y[:,j], self.num_classes).reshape(-1,1,self.num_classes)
+    self.real_vals = np.concatenate([dc.metrics.to_one_hot(self.real_classes[:,j], self.num_classes).reshape(-1,1,self.num_classes)
                                      for j in range(self.num_tasks)], axis=1)
```
Collaborator:

Does one-hot encoding count as a transformation, for a refactor?

stewarthe6 (Collaborator, Author):

I think sklearn lets you add a one-hot step into the pipeline.
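For example, in plain scikit-learn (not AMPL code) a one-hot step can sit inside a Pipeline, so its categories are learned from the training split and only applied to validation and test data, like any other fitted transformer:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ("clf", LogisticRegression()),
])

X_train = [["red"], ["green"], ["blue"], ["red"]]
y_train = [0, 1, 0, 1]
pipe.fit(X_train, y_train)        # encoder categories learned from the training data only
print(pipe.predict([["green"]]))
```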

@mauvais2 (Collaborator):

Just to mention that these tests were affected:

  • FAILED test_delaney_panel.py::test_reg_config_delaney_fit_RF_3fold - IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices

  • FAILED test_delaney_panel.py::test_reg_config_delaney_fit_XGB_3fold - IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices

  • FAILED test_delaney_panel.py::test_reg_kfold_config_delaney_fit_NN_graphconv - IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices

  • FAILED test_balancing_transformer.py::test_balancing_transformer - ValueError: not enough values to unpack (expected 2, got 1)

  • FAILED test_balancing_transformer.py::test_all_transformers - assert 0.0

  • FAILED test_perf_data.py::test_KFoldRegressionPerfData - IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices

  • FAILED test_perf_data.py::test_KFoldRegressionPerfDataMulti - IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices

  • FAILED test_perf_data.py::test_KFoldClassificationPerfData - IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices

@stewarthe6 stewarthe6 marked this pull request as ready for review January 14, 2025 18:13
@stewarthe6 stewarthe6 merged commit 94f93d0 into 1.7.0 Jan 15, 2025
12 checks passed