
Ml29 adv validation #42
Merged · 6 commits · Nov 14, 2019
Conversation

atselikov (Contributor):

Added adversarial validation checking at the data-loading stage, so the user is informed in case the train and test sets are completely different.

Resolves #29

Alex Tselikov and others added 2 commits November 8, 2019 15:28
@atselikov atselikov requested review from mahedeeb and aalhour November 8, 2019 15:25
@mahedeeb (Contributor) left a comment:

adversarial_validation(dataframes_dict, ignore_columns)
I would suggest adding a new list of dataframe names. Maybe the user has three dataframes (datasets) [train, validate, test] and is interested in applying the test between validate and test; the user should be able to do this by applying any of the following:
adversarial_validation(dataframes_dict, ignore_columns, datasets=["validate", "test"]) or
adversarial_validation(dataframes_dict, ignore_columns, datasets=["train", "test"]) or
adversarial_validation(dataframes_dict, ignore_columns, datasets=["train", "validate"])
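A sketch of what that extended signature could look like (the `datasets` parameter, its default, and the placeholder body are assumptions for illustration, not code from this PR):

```python
from typing import List, Optional


def adversarial_validation(dataframes_dict: dict,
                           ignore_columns: list,
                           datasets: Optional[List[str]] = None) -> float:
    """Run adversarial validation between the two dataframes named in `datasets`.

    Defaults to comparing "train" against "test" when `datasets` is omitted.
    """
    if datasets is None:
        datasets = ["train", "test"]
    if len(datasets) != 2 or any(name not in dataframes_dict for name in datasets):
        raise ValueError("datasets must name exactly two entries of dataframes_dict")
    first, second = (dataframes_dict[name] for name in datasets)
    # ... fit a classifier to distinguish `first` from `second`,
    # dropping `ignore_columns`, and return the mean ROC-AUC ...
    return 0.5  # placeholder: 0.5 means the two sets are indistinguishable
```

With this shape, all three calls above work without changing the default behaviour.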

ignore_columns: list = [],
max_dataframe_length: int = 100000,
threshold: float = 0.7) -> float:
""" Training a probabilistic classifier to distinguish train/test examples.
Contributor:

I suggest giving the function a short name here, e.g. "train/test probabilistic classifier", then a blank line, and then the summary of the function.

Contributor Author:

done

""" Training a probabilistic classifier to distinguish train/test examples.
See more info here: http://fastml.com/adversarial-validation-part-one/

This function tries to check whether test and train data coming from the same data distribution.
Contributor:

maybe to say it checks instead of tries? Or do you mean that the function can fail?

Contributor Author:

ok, "checks" is better, agree

# Check if only one dataframe is provided
if len(dataframe_dict) != 2:
# do nothing and return the original data
logger.info("Can't apply adversarial_validation because count of dataframes is not equal to 2")
Contributor:

I would suggest including a print statement as follows:
print("The number of the dataframes is not equal to 2. Therefore, adversarial validation will not be performed!")
The idea is to inform the user.

Contributor Author:

ok

clf = xgb.XGBClassifier(**xgb_params, seed=10)
results = []
logger.info('Adversarial validation checking:')
for fold, (train_index, test_index) in enumerate(skf.split(df_joined, y)):
Contributor:

I would suggest extracting this into a new function, if possible.
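One possible shape for that extraction (a sketch using a scikit-learn classifier in place of the XGBoost one quoted above; `run_cv_folds` is a hypothetical name, and `X` is assumed to be a NumPy array rather than the original dataframe):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def run_cv_folds(clf, X, y, n_splits: int = 3, seed: int = 44) -> list:
    """Fit `clf` on each stratified fold and return the per-fold ROC-AUC scores."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    results = []
    for fold, (train_index, test_index) in enumerate(skf.split(X, y)):
        clf.fit(X[train_index], y[train_index])
        probabilities = clf.predict_proba(X[test_index])[:, 1]
        results.append(roc_auc_score(y[test_index], probabilities))
    return results
```

The calling code then reduces to a single call, and the fold loop becomes testable in isolation.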

@mahedeeb (Contributor) left a comment:

Thanks a lot for the great work!!!
The main problem now is the return statement that you forgot to remove. It should be removed, otherwise the code doesn't work.

if len(dataframe_dict) != 2:
# do nothing and return the original data
print("Can't apply adversarial_validation because count of dataframes is not equal to 2")
return
Contributor:

Thanks a lot for your great refactoring. You have a "return" statement here; I think you forgot to remove it. I tried to test your code in one of the flows, but it did not work. I would suggest testing your code in all flows and checking that it works without any problem. Even better, I would strongly recommend writing unit tests for the functions that you created.
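A sketch of the kind of unit test being suggested, with the early-return branch as the behaviour under test (the simplified stand-in function is an assumption for illustration, not the PR's actual code):

```python
from typing import Optional


def adversarial_validation(dataframe_dict: dict) -> Optional[float]:
    """Simplified stand-in: only exactly two dataframes can be compared."""
    if len(dataframe_dict) != 2:
        print("Can't apply adversarial_validation because count of dataframes is not equal to 2")
        return None
    return 0.5  # placeholder for the real mean ROC-AUC


def test_adversarial_validation_requires_two_dataframes():
    assert adversarial_validation({"train": object()}) is None
    assert adversarial_validation({"train": object(), "test": object()}) is not None
```

A test runner such as pytest would pick up the `test_` function automatically.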

print("Can't apply adversarial_validation because count of dataframes is not equal to 2")
return

# TODO: support > 2 dataframes
Contributor:

I would suggest removing the "TODO" comments and opening issues for those tasks.

train = train[columns_to_use]
test = test[columns_to_use]
# add identifier and combine
train['istrain'] = 1
Contributor:

Could you please check this again? Do you need to repeat the code here? Let us discuss this if the answer is yes.

| mean of KFold validation results (ROC-AUC scores)
"""

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=44)
Contributor:

Let us discuss this. I would still write it to an external file
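Writing the fold results to an external file could look roughly like this (the JSON format and the function name are assumptions raised for the discussion, not something decided in this PR):

```python
import json
from pathlib import Path


def save_validation_results(fold_scores: list, path: str) -> None:
    """Persist per-fold ROC-AUC scores and their mean as JSON for later inspection."""
    payload = {
        "fold_scores": fold_scores,
        "mean_score": sum(fold_scores) / len(fold_scores),
    }
    Path(path).write_text(json.dumps(payload, indent=2))
```

That keeps the validation report around after the run instead of only in the log output.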

return str(list(sorted_x[:top_features]))


def round_and_sort_dict(feat_imp: dict) -> dict:
Contributor:

PyCharm says that this function should return a dict but it got a list. Is the output a list?

Contributor Author:

It's a string; the purpose is to print the first top_features.
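Given that answer, the annotation should say `str` rather than `dict`. A minimal sketch consistent with the `return str(list(sorted_x[:top_features]))` line quoted above (the function name and the descending sort key here are assumptions):

```python
def format_top_features(feat_imp: dict, top_features: int = 10) -> str:
    """Return a printable string of the `top_features` highest-importance features."""
    sorted_x = sorted(feat_imp.items(), key=lambda item: item[1], reverse=True)
    return str(list(sorted_x[:top_features]))
```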

@mahedeeb (Contributor) left a comment:

Works nicely! Thanks a lot.

@mahedeeb mahedeeb merged commit c5371da into master Nov 14, 2019
Successfully merging this pull request may close these issues.

Add Adversarial validation check on the stage of *scale_data*