Bug transformer fitting #385
…er, transforms happen right before it's used. Transformations are un-done before passing into perf_data
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -379,6 +379,7 @@ def get_featurized_data(self, params=None): | |
if params.prediction_type=='classification': | ||
w = w.astype(np.float32) | ||
|
||
+ self.untransformed_dataset = NumpyDataset(features, self.vals, ids=ids) | ||
self.dataset = NumpyDataset(features, self.vals, ids=ids, w=w) | ||
self.log.info("Using prefeaturized data; number of features = " + str(self.n_features)) | ||
return | ||
|
@@ -404,6 +405,7 @@ def get_featurized_data(self, params=None): | |
self.log.debug("Number of features: " + str(self.n_features)) | ||
|
||
# Create the DeepChem dataset | ||
+ self.untransformed_dataset = NumpyDataset(features, self.vals, ids=ids) | ||
self.dataset = NumpyDataset(features, self.vals, ids=ids, w=w) | ||
# Checking for minimum number of rows | ||
if len(self.dataset) < params.min_compound_number: | ||
|
@@ -681,7 +683,7 @@ def has_all_feature_columns(self, dset_df): | |
|
||
# ************************************************************************************* | ||
|
||
- def get_subset_responses_and_weights(self, subset, transformers): | ||
+ def get_subset_responses_and_weights(self, subset): | ||
"""Returns a dictionary mapping compound IDs in the given dataset subset to arrays of response values | ||
and weights. Used by the perf_data module under k-fold CV. | ||
|
||
|
@@ -703,16 +705,33 @@ def get_subset_responses_and_weights(self, subset, transformers): | |
else: | ||
raise ValueError('Unknown dataset subset type "%s"' % subset) | ||
|
||
- y = dc.trans.undo_transforms(dataset.y, transformers) | ||
+ response_vals = dict() | ||
+ dataset_ids = set(dataset.ids) | ||
+ for id, y in zip(self.untransformed_dataset.ids, self.untransformed_dataset.y): | ||
+ if id in dataset_ids: | ||
+ response_vals[id] = y | ||
|
||
w = dataset.w | ||
- response_vals = dict([(id, y[i,:]) for i, id in enumerate(dataset.ids)]) | ||
weights = dict([(id, w[i,:]) for i, id in enumerate(dataset.ids)]) | ||
self.subset_response_dict[subset] = response_vals | ||
self.subset_weight_dict[subset] = weights | ||
return self.subset_response_dict[subset], self.subset_weight_dict[subset] | ||
|
||
# ************************************************************************************* | ||
|
||
+ def get_untransformed_responses(self, ids): | ||
+ """ Returns a numpy array of untransformed response values | ||
+ """ | ||
+ response_vals = np.zeros((len(ids), self.untransformed_dataset.y.shape[1])) | ||
+ response_dict = dict([(id, y) for id, y in zip(self.untransformed_dataset.ids, self.untransformed_dataset.y)]) | ||
Review comment: you can call
Review comment: Where is this function used? Does it make sense to make `response_vals` a dict and then turn that `dict.keys()` into a numpy array elsewhere? |
||
|
||
+ for i, id in enumerate(ids): | ||
+ response_vals[i] = response_dict[id] | ||
+ |
+ ||
+ return response_vals | ||
|
||
# ************************************************************************************* | ||
|
||
def _get_split_key(self): | ||
"""Creates the proper CSV name for a split file | ||
|
||
|
@@ -828,6 +847,8 @@ def get_featurized_data(self, dset_df, is_featurized=False): | |
params, self.contains_responses) | ||
self.log.warning("Done") | ||
self.n_features = self.featurization.get_feature_count() | ||
|
||
+ self.untransformed_dataset= NumpyDataset(features, self.vals, ids=ids) | ||
self.dataset = NumpyDataset(features, self.vals, ids=ids) | ||
|
||
# **************************************************************************************** | ||
|
Review comment: Need to deal with missing values if `dataset_ids` aren't in `untransformed_dataset.ids`. This shouldn't happen, but here it is skipped silently, so you could have `len(response_vals) < len(dataset_ids)`.
Review comment: I think you should call `get_untransformed_responses` here instead of this for loop? `get_untransformed_responses` returns an np.array right now, but I don't know if that makes sense for anything, since it could be arbitrary IDs flattened into an array.
Reply: Ok, I'll add a check here.
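
For illustration only, the check discussed in this thread might look roughly like the sketch below. It is not the code that was merged: the helper names `untransformed_response_dict` and `untransformed_response_array` are hypothetical, and the sketch assumes the untransformed responses are a 2-D NumPy array aligned row by row with the ID list. The dict helper raises when a subset ID is absent instead of skipping it, and the array form is built on top of it so both callers share one lookup.

```python
import numpy as np


def untransformed_response_dict(all_ids, all_y, subset_ids):
    """Map each ID in subset_ids to its untransformed response row.

    Hypothetical helper (not part of the PR). It raises rather than
    silently skipping IDs that are absent from the untransformed data.
    """
    lookup = dict(zip(all_ids, all_y))
    missing = [cid for cid in subset_ids if cid not in lookup]
    if missing:
        raise KeyError(
            "%d subset IDs not found in untransformed dataset, e.g. %s"
            % (len(missing), missing[:5])
        )
    return {cid: lookup[cid] for cid in subset_ids}


def untransformed_response_array(all_ids, all_y, subset_ids):
    """Array form: row i holds the responses for subset_ids[i]."""
    by_id = untransformed_response_dict(all_ids, all_y, subset_ids)
    out = np.zeros((len(subset_ids), all_y.shape[1]))
    for i, cid in enumerate(subset_ids):
        out[i] = by_id[cid]
    return out
```

With this shape, a caller such as `get_subset_responses_and_weights` could use the dict form directly, while code that needs positional rows could use the array form; whether the array form is needed at all is the open question raised above.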