
Ml29 adv validation #42
Merged · 6 commits · Nov 14, 2019
Conversation

atselikov (Contributor):

Added adversarial validation checking at the data-loading stage, so the user is informed in case the train and test sets are completely different.

Resolves #29

Alex Tselikov and others added 2 commits November 8, 2019 15:28
@atselikov atselikov requested review from mahedeeb and aalhour November 8, 2019 15:25
@mahedeeb (Contributor) left a comment:

adversarial_validation(dataframes_dict, ignore_columns)
I would suggest adding a new list of dataframe names. Maybe the user has three dataframes (datasets) [train, validate, test] and is interested in applying the test between validate and test; the user should be able to do this by applying any of the following:
adversarial_validation(dataframes_dict, ignore_columns, datasets=["validate", "test"]) or
adversarial_validation(dataframes_dict, ignore_columns, datasets=["train", "test"]) or
adversarial_validation(dataframes_dict, ignore_columns, datasets=["train", "validate"])
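A sketch of what that extended signature could look like (the `datasets` parameter, its default, and the placeholder body are assumptions for illustration, not code from this PR):

```python
from typing import List, Optional


def adversarial_validation(dataframes_dict: dict,
                           ignore_columns: list,
                           datasets: Optional[List[str]] = None) -> float:
    """Run adversarial validation between the two dataframes named in `datasets`.

    Defaults to comparing "train" against "test" when `datasets` is omitted.
    """
    if datasets is None:
        datasets = ["train", "test"]
    if len(datasets) != 2 or any(name not in dataframes_dict for name in datasets):
        raise ValueError("datasets must name exactly two entries of dataframes_dict")
    first, second = (dataframes_dict[name] for name in datasets)
    # ... fit a classifier to distinguish `first` from `second`,
    # dropping `ignore_columns`, and return the mean ROC-AUC ...
    return 0.5  # placeholder: 0.5 means the two sets are indistinguishable
```

With this shape, all three calls above work without changing the default behaviour.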

ignore_columns: list = [],
max_dataframe_length: int = 100000,
threshold: float = 0.7) -> float:
""" Training a probabilistic classifier to distinguish train/test examples.
Contributor:

I suggest giving the function a short name here, e.g. "train/test probabilistic classifier", then a blank line, and then the summary of the function.

Contributor Author:

done

""" Training a probabilistic classifier to distinguish train/test examples.
See more info here: http://fastml.com/adversarial-validation-part-one/

This function tries to check whether test and train data coming from the same data distribution.
Contributor:

maybe to say it checks instead of tries? Or do you mean that the function can fail?

Contributor Author:

ok, "checks" is better, agree

# Check if only one dataframe is provided
if len(dataframe_dict) != 2:
# do nothing and return the original data
logger.info("Can't apply adversarial_validation because count of dataframes is not equal to 2")
Contributor:

I would suggest including a print statement as follows:
print("The number of the dataframes is not equal to 2. Therefore, adversarial validation will not be performed!")
The idea is to inform the user.

Contributor Author:

ok

clf = xgb.XGBClassifier(**xgb_params, seed=10)
results = []
logger.info('Adversarial validation checking:')
for fold, (train_index, test_index) in enumerate(skf.split(df_joined, y)):
Contributor:

I would suggest extracting this into a new function, if possible.
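One possible shape for that extraction (a sketch using a scikit-learn classifier in place of the XGBoost one quoted above; `run_cv_folds` is a hypothetical name, and `X` is assumed to be a NumPy array rather than the original dataframe):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def run_cv_folds(clf, X, y, n_splits: int = 3, seed: int = 44) -> list:
    """Fit `clf` on each stratified fold and return the per-fold ROC-AUC scores."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    results = []
    for fold, (train_index, test_index) in enumerate(skf.split(X, y)):
        clf.fit(X[train_index], y[train_index])
        probabilities = clf.predict_proba(X[test_index])[:, 1]
        results.append(roc_auc_score(y[test_index], probabilities))
    return results
```

The calling code then reduces to a single call, and the fold loop becomes testable in isolation.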

@mahedeeb (Contributor) left a comment:

Thanks a lot for the great work!!!
The main problem now is the return statement that you forgot to remove. It should be removed, otherwise the code doesn't work.

if len(dataframe_dict) != 2:
# do nothing and return the original data
print("Can't apply adversarial_validation because count of dataframes is not equal to 2")
return
Contributor:

Thanks a lot for your great refactoring. You have a "return" statement here; I think you forgot to remove it. I tried to test your code in one of the flows, but it did not work. I would suggest testing your code in all flows and checking that it works without any problem. Even better, I would strongly recommend writing unit tests for the functions that you created.
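A sketch of the kind of unit test being suggested, with the early-return branch as the behaviour under test (the simplified stand-in function is an assumption for illustration, not the PR's actual code):

```python
from typing import Optional


def adversarial_validation(dataframe_dict: dict) -> Optional[float]:
    """Simplified stand-in: only exactly two dataframes can be compared."""
    if len(dataframe_dict) != 2:
        print("Can't apply adversarial_validation because count of dataframes is not equal to 2")
        return None
    return 0.5  # placeholder for the real mean ROC-AUC


def test_adversarial_validation_requires_two_dataframes():
    assert adversarial_validation({"train": object()}) is None
    assert adversarial_validation({"train": object(), "test": object()}) is not None
```

A test runner such as pytest would pick up the `test_` function automatically.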

print("Can't apply adversarial_validation because count of dataframes is not equal to 2")
return

# TODO: support > 2 dataframes
Contributor:

I would suggest removing the "TODO" comments and opening issues for those tasks.

train = train[columns_to_use]
test = test[columns_to_use]
# add identifier and combine
train['istrain'] = 1
Contributor:

Could you please check this again? Do you need to repeat the code here? Let us discuss this if the answer is yes.

| mean of KFold validation results (ROC-AUC scores)
"""

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=44)
Contributor:

Let us discuss this. I would still write it to an external file
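Writing the fold results to an external file could look roughly like this (the JSON format and the function name are assumptions raised for the discussion, not something decided in this PR):

```python
import json
from pathlib import Path


def save_validation_results(fold_scores: list, path: str) -> None:
    """Persist per-fold ROC-AUC scores and their mean as JSON for later inspection."""
    payload = {
        "fold_scores": fold_scores,
        "mean_score": sum(fold_scores) / len(fold_scores),
    }
    Path(path).write_text(json.dumps(payload, indent=2))
```

That keeps the validation report around after the run instead of only in the log output.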

return str(list(sorted_x[:top_features]))


def round_and_sort_dict(feat_imp: dict) -> dict:
Contributor:

PyCharm says that this function should return a dict but it got a list. Is the output a list?

Contributor Author:

It's a string; the purpose is to print the first top_features.
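Given that answer, the annotation should say `str` rather than `dict`. A minimal sketch consistent with the `return str(list(sorted_x[:top_features]))` line quoted above (the function name and the descending sort key here are assumptions):

```python
def format_top_features(feat_imp: dict, top_features: int = 10) -> str:
    """Return a printable string of the `top_features` highest-importance features."""
    sorted_x = sorted(feat_imp.items(), key=lambda item: item[1], reverse=True)
    return str(list(sorted_x[:top_features]))
```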

@mahedeeb (Contributor) left a comment:

Works nicely! Thanks a lot.

@mahedeeb mahedeeb merged commit c5371da into master Nov 14, 2019
Successfully merging this pull request may close these issues.

Add Adversarial validation check on the stage of *scale_data*