-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ml29 adv validation #42
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adversarial_validation(dataframes_dict, ignore_columns)
I would suggest adding a new list of dataframes names. Maybe the user has three 3 dataframes(datasets)[trian, validate, test] and the user interested in applying the test between validate and test, the user should be able to do this by applying the following:
adversarial_validation(dataframes_dict, ignore_columns, datasets=["validate", "test"]) or
adversarial_validation(dataframes_dict, ignore_columns, datasets=["train", "test"]) or
adversarial_validation(dataframes_dict, ignore_columns, datasets=["train", "validate"])
ignore_columns: list = [], | ||
max_dataframe_length: int = 100000, | ||
threshold: float = 0.7) -> float: | ||
""" Training a probabilistic classifier to distinguish train/test examples. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest giving the function a short name here e.g. train/test probabilistic classifier. and then a blank line and then the summary of the function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
""" Training a probabilistic classifier to distinguish train/test examples. | ||
See more info here: http://fastml.com/adversarial-validation-part-one/ | ||
|
||
This function tries to check whether test and train data coming from the same data distribution. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe to say it checks instead of tries? Or do you mean that the function can fail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok ,check is better, agree
# Check if it only one dataframe provided | ||
if len(dataframe_dict) != 2: | ||
# do nothing and return the original data | ||
logger.info("Can't apply adversarial_validation because count of dataframes is not equal to 2") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest including a print statement as follows:
print("The number o the dataframes is not equal to 2. Therefore, aversarial validation will not be performed!")
The idea is to inform the user
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
clf = xgb.XGBClassifier(**xgb_params, seed=10) | ||
results = [] | ||
logger.info('Adversarial validation checking:') | ||
for fold, (train_index, test_index) in enumerate(skf.split(df_joined, y)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest extracting this in a new function if it is possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for the great work!!!
The main problem now is the retrun statement that you forgot. This should be removed otherwise the code doesn't work
if len(dataframe_dict) != 2: | ||
# do nothing and return the original data | ||
print("Can't apply adversarial_validation because count of dataframes is not equal to 2") | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for your great refactoring. You have a "return" statement here. I think you forgot removing it. I tried to test your code in one of the flows, but it did not work. I would suggest testing your code in all flows and check if it works without any problem. Even better, I would recommend strongly to write unit tests for the functions that you created.
print("Can't apply adversarial_validation because count of dataframes is not equal to 2") | ||
return | ||
|
||
# TODO: support > 2 dataframes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest removing the "TODO" functions and open issues for those tasks
train = train[columns_to_use] | ||
test = test[columns_to_use] | ||
# add identifier and combine | ||
train['istrain'] = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please check this again. do you need to repeat the code here. Let us discuss this if the answer is yes.
| mean of KFold validation results (ROC-AUC scores) | ||
""" | ||
|
||
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=44) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let us discuss this. I would still write it to an external file
return str(list(sorted_x[:top_features])) | ||
|
||
|
||
def round_and_sort_dict(feat_imp: dict) -> dict: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pycharm says that this function should return a dict but it got a list. Is the output a list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
string - the purpose to print the first top_features
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
works nicly! Thanks a lot
Added adversarial validation checking on the stage of data loading, so it will inform user in case of train and test is completely different.
resolve #29