Bug transformer fitting #385
Conversation
…he default fold, transformers only see one Dataset during creation, updated unit tests
…er, transforms happen right before it's used. Transformations are un-done before passing into perf_data
…ll a tuple, but the elements are now dictionaries instead of lists. The rest of the code should work the same.
@@ -301,7 +301,7 @@ def load_featurize_data(self, params=None):
            # is fitted to the training data only. The transformers are then applied to the training,
            # validation and test sets separately.
            if not params.split_only:
-               self.model_wrapper.create_transformers(self.data)
+               self.model_wrapper.create_transformers(trans.get_all_training_datasets(self.data))
this get_all_training_datasets seems like a function that belongs in model_datasets?
I disagree. The structure created by get_all_training_datasets is very specific to how transformers are implemented and I would like all that code in one place. Eventually I think there should be a TransformerManager object that can build all needed transformers given a dataset.
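A minimal sketch of the per-fold structure described here; the attribute and helper names below (train_valid_dsets, combined_training_data) are assumptions about the ModelDataset interface, not the actual AMPL implementation:

```python
# Hypothetical sketch: build a dict of training-only subsets, keyed by fold
# index plus a 'final' entry, so each transformer is fitted on training data only.
def get_all_training_datasets(model_dataset):
    training_datasets = {}
    # One entry per k-fold split: transformers for fold k see only that fold's training split.
    for k, (train, valid) in enumerate(model_dataset.train_valid_dsets):
        training_datasets[k] = train
    # The 'final' entry covers the model retrained on the combined training data.
    training_datasets['final'] = model_dataset.combined_training_data()
    return training_datasets
```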
Codecov Report
Attention: Patch coverage is …

@@            Coverage Diff             @@
##            1.7.0     #385      +/-   ##
==========================================
+ Coverage   20.29%   28.53%    +8.24%
==========================================
  Files          47       47
  Lines       12883    12919       +36
==========================================
+ Hits         2615     3687     +1072
+ Misses      10268     9232     -1036
-            # Set up transformers for weights, if needed
-            self.transformers_w = trans.create_weight_transformers(self.params, model_dataset)
+            # Set up transformers for weights, if needed
+            self.transformers_w[k] = trans.create_weight_transformers(self.params, td)
should this one be a privileged function like _create_weight_transformers?
I made this match the create_feature_transformers above. I realize that this is similar to the create*_transformers in model_wrapper. These functions can be called by anyone, but the ones in model_wrapper used to have side effects.
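A sketch of the distinction being drawn, with hypothetical signatures: the module-level helpers return the transformers they build, while the older model_wrapper methods stored them on self as a side effect. The 'balancing' parameter value and the use of DeepChem's BalancingTransformer are placeholders for illustration only.

```python
import deepchem as dc

# Stateless helper style: returns its result, safe to call from anywhere.
def create_weight_transformers(params, dataset):
    transformers_w = []
    if getattr(params, 'weight_transform_type', None) == 'balancing':
        transformers_w.append(dc.trans.BalancingTransformer(dataset=dataset))
    return transformers_w

# Older model_wrapper style: builds the same objects but assigns them to self.
class ModelWrapper:
    def __init__(self, params):
        self.params = params

    def create_transformers(self, training_datasets):
        self.transformers_w = {}
        for k, td in training_datasets.items():
            self.transformers_w[k] = create_weight_transformers(self.params, td)
```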
        if os.path.exists(local_path):
            self.log.info(f"Reloading transformers from model tarball {local_path}")
            with open(local_path, 'rb') as txfmr:
                transformers_tuple = pickle.load(txfmr)
        else:
this always prefers the model tarball transformers over a supplied path to a transformers.pkl object. Is this what we want? I think in other situations we provide a path (like a path to external training data) to override the one inherent in the model, so this is the opposite logic.
Also, we'd need some tests to make sure a user-supplied transformer object is valid.
I think the tests could wait if we leave the logic the way it is, since it will be rare that the user-supplied case comes up.
I think this logic matches the current transformer logic. We can change it later during a larger refactor.
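For that later refactor, a hedged sketch of the alternative precedence the review comment describes, where an explicitly supplied transformers.pkl path would override the copy in the model tarball. The params.transformer_key attribute name is an assumption, and params and local_path are the variables in scope in the snippet above.

```python
import os
import pickle

# Prefer a user-supplied pickle if one was given and exists; otherwise fall back
# to the transformers bundled in the model tarball.
supplied_path = getattr(params, 'transformer_key', None)
if supplied_path and os.path.exists(supplied_path):
    transformer_path = supplied_path
elif os.path.exists(local_path):
    transformer_path = local_path
else:
    raise ValueError("No pickled transformers found")

with open(transformer_path, 'rb') as txfmr:
    transformers_tuple = pickle.load(txfmr)
```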
            self.transformers[i] = ty
            self.transformers_x[i] = tx
            self.transformers_w[i] = tw
I would like to see self.transformers updated to be self.transformers_y in a refactor / when we're comfortable breaking backward compatibility.
Yes, I agree. But I don't think many people directly access transformers.
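One backward-compatible way to stage the rename (a sketch, not a committed plan): keep self.transformers as a thin alias for self.transformers_y until callers have migrated.

```python
class ModelWrapper:
    def __init__(self):
        self.transformers_y = {}

    @property
    def transformers(self):
        # Deprecated alias: external code that still reads .transformers keeps working.
        return self.transformers_y

    @transformers.setter
    def transformers(self, value):
        self.transformers_y = value
```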
"""Transform the responses and/or features in the given DeepChem dataset using the current transformers. | ||
|
||
Args: | ||
dataset: The DeepChem DiskDataset that contains a dataset | ||
fold (int): Which fold is being transformed. |
fold train_valid isn't an int, but that's a minor quibble as long as the transformers thing is a dictionary everywhere.
If you're specifying a fold here, maybe we should make it more explicitly linked to the correct fold of dataset as well, instead of relying on a different part of the code to pass the correct fold of dataset into this function. So instead of just setting transformed_dataset=dataset, use the fold=0 parameter to do something like transformed_dataset=dataset.get_fold(fold)?
After reading the model_pipeline code, it makes sense to leave the k-fold selection of dataset there with the k-fold loop; maybe just add to the docstring here that if it's k-fold, the data passed in should be / is the k-th fold, so people understand.
Yeah, I'll have to go back and document this to explain when to use an integer and when to use the 'final' key.
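A hypothetical usage sketch of the convention being documented (model_dataset and model_wrapper, and the attributes used on them, are assumptions): integer keys index the k-fold transformers during cross-validation, and the 'final' key is used for the model trained on the combined training data.

```python
# k-fold cross-validation: transformers fitted on fold k are applied to fold k's splits.
for k, (train, valid) in enumerate(model_dataset.train_valid_dsets):
    train_t = model_wrapper.transform_dataset(train, fold=k)
    valid_t = model_wrapper.transform_dataset(valid, fold=k)

# Final model: transformers fitted on the combined training data use the 'final' key.
final_train_t = model_wrapper.transform_dataset(model_dataset.combined_training_data(), fold='final')
```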
    return result


def test_KFoldRegressionPerfData():
docstring: goal of test is to make sure you can create the object and that values are stored where they're supposed to be stored? Should this be tested with non k-fold data to make sure the object is not created?
There are non k-fold examples that use SimpleRegressionPerfData(), for example.
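If the extra negative test were wanted later, it might look roughly like this; the helper name, constructor arguments, and the exception type are all assumptions rather than the current API.

```python
import pytest

def test_KFoldRegressionPerfData_rejects_non_kfold_split():
    # Hypothetical helper that builds a plain train/valid/test ModelDataset (no k-fold splits).
    model_dataset = make_train_valid_test_dataset()
    with pytest.raises(ValueError):
        perf_data.KFoldRegressionPerfData(model_dataset, transformers=[], subset='train')
```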
    # perfect score every time
    assert r2_mean==1
    assert r2_std==0
here you're testing multitask but just checking the mean performance metrics, not per-task?
Yeah, that's what compute_perf_metrics does. It averages all the r2 values, including r2s from different tasks.
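A small standalone sketch of that pooling behavior (not the AMPL compute_perf_metrics code itself): per-task r2 values are collapsed into a single mean and standard deviation.

```python
import numpy as np
from sklearn.metrics import r2_score

def pooled_r2(y_real, y_pred):
    """Return mean and std of per-task r2 scores for arrays of shape (n_samples, n_tasks)."""
    scores = [r2_score(y_real[:, t], y_pred[:, t]) for t in range(y_real.shape[1])]
    return np.mean(scores), np.std(scores)

# With perfect predictions every per-task r2 is 1, so the mean is 1 and the std is 0,
# matching the assertions in the test above.
```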
def test_one_task_no_missing_values():
    y = np.array([[1.0], [3.0], [5.0]])
    w = np.array([[1, 1], [1, 1], [1, 1]])
is w supposed to be the same shape as y?
Yes, I asked Copilot to write that test. I should be more careful.
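For reference, a corrected version of the shapes being discussed: for a single task, w should match y's (n_samples, n_tasks) shape, i.e. (3, 1) here rather than (3, 2).

```python
import numpy as np

y = np.array([[1.0], [3.0], [5.0]])   # shape (3, 1): three samples, one task
w = np.array([[1.0], [1.0], [1.0]])   # weights with the same shape as y
assert y.shape == w.shape == (3, 1)
```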
@@ -1279,16 +1238,12 @@ def get_pred_values(self):
        """
        ids = sorted(self.pred_vals.keys())
unrelated but didn't we agree not to sort predictions?
I double checked, KFoldRegressionPerfData and KFoldClassificationPerfData sort in get_pred_values(), get_real_values() and get_weights(), while SimpleClassificationPerfData, SimpleRegressionPerfData and SimpleHybridPerfData do NOT sort by ID.
I think the order in which the keys are returned from self.pred_vals is totally random. So it's either sorted or totally random; not sure which would be better.
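To make the trade-off concrete, a small standalone sketch (pred_vals here is a plain dict of per-compound arrays, not the AMPL attribute): sorting by ID is deterministic and keeps predicted values, real values, and weights in the same order across the three getters, whereas iterating the dict directly follows insertion order.

```python
import numpy as np

def collect_sorted(pred_vals):
    """Stack per-compound values in a deterministic, ID-sorted order."""
    ids = sorted(pred_vals.keys())
    return ids, np.array([pred_vals[i] for i in ids])
```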
        # Convert true values to one-hot encoding
        if self.num_classes > 2:
-            self.real_vals = np.concatenate([dc.metrics.to_one_hot(dataset.y[:,j], self.num_classes).reshape(-1,1,self.num_classes)
+            self.real_vals = np.concatenate([dc.metrics.to_one_hot(self.real_classes[:,j], self.num_classes).reshape(-1,1,self.num_classes)
                                             for j in range(self.num_tasks)], axis=1)
Does one-hot encoding count as a transformation? For a refactor?
I think sklearn lets you add a one-hot step into the pipeline.
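For illustration, the sklearn pattern mentioned; note that Pipeline steps transform the features X, so one-hot encoding the response (as dc.metrics.to_one_hot does above) would still have to be handled separately.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

# One-hot encoding as a fitted pipeline step: the encoder learns its categories
# from the training data only and reuses them at prediction time.
pipe = Pipeline([
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ('clf', LogisticRegression(max_iter=1000)),
])
# pipe.fit(X_train, y_train); pipe.predict(X_test)
```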
…response value found
Just to mention that these tests got affected.
… to use a new json file specific to that test
…AMPL into bug_transformer_fitting
Transformers for normalizing inputs, outputs, and weights are being trained on the full dataset instead of just the training dataset. This is causing data leakage.
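A minimal illustration of the principle behind the fix, using DeepChem's NormalizationTransformer directly; the dataset variables are placeholders, not the AMPL pipeline code: fit on the training split only, then apply the fitted transformer to every split.

```python
import deepchem as dc

# Fit normalization statistics on the training split only.
transformer = dc.trans.NormalizationTransformer(transform_y=True, dataset=train_dataset)

# Apply the already-fitted transformer to every split; validation and test data
# never influence the statistics, so there is no leakage.
train_t = transformer.transform(train_dataset)
valid_t = transformer.transform(valid_dataset)
test_t = transformer.transform(test_dataset)
```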