
Make the Validation Use TabPFN Sklearn Interface #18

Open
LennartPurucker opened this issue Mar 20, 2025 · 14 comments

@LennartPurucker
Owner

See the title; I have to push this code to the main branch at some point.

Right now, we validate without the preprocessing and ensembling of the sklearn interface.
Ideally, we want to check whether fine-tuning improves over the full-interface baseline, not just over a setup without preprocessing or ensembling.
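
For concreteness, here is a minimal sketch of that comparison, assuming the public TabPFNClassifier API and an illustrative AUC metric (the helper and variable names are not from the repo):

```python
# Illustrative sketch: evaluate both the default and the fine-tuned model
# through the full sklearn interface, so preprocessing and ensembling are
# part of the baseline. Helper name and metric are assumptions.
from sklearn.metrics import roc_auc_score
from tabpfn import TabPFNClassifier

def score(clf, X_train, y_train, X_val, y_val):
    clf.fit(X_train, y_train)  # runs the interface's own preprocessing/ensembling
    return roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])  # binary case

# baseline_auc = score(TabPFNClassifier(device="cuda"), X_tr, y_tr, X_v, y_v)
# finetuned_auc = score(TabPFNClassifier(model_path=ckpt, device="cuda"), X_tr, y_tr, X_v, y_v)
```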

@iivalchev
Contributor

@LennartPurucker is there any progress on that? I might have some time and would be happy to tackle this.

@LennartPurucker
Owner Author

Give me a day to share the code I created for this, and then you can bootstrap based on that!

@LennartPurucker
Owner Author

@iivalchev it seems I already did this here: https://github.com/LennartPurucker/finetune_tabpfn_v2/blob/stop_ot_testing/finetuning_scripts/training_utils/validation_utils.py#L62

It only covers binary classification and is very hacky, but this is how I used it.

(@AlexanderPfefferle also take a look at this)

@iivalchev
Contributor

@LennartPurucker thanks, will take a look!

@iivalchev
Contributor

@LennartPurucker let's see if I read the changes right. In validation_utils.py, if use_native_validation is True, validation skips the sklearn preprocessing and retains the current behavior. Otherwise, it fits/predicts with the full TabPFN sklearn interface, preprocessing included.
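
If that reading is right, the control flow would look roughly like the sketch below (only use_native_validation is taken from validation_utils.py; the function and the native path are hypothetical placeholders):

```python
from tabpfn import TabPFNClassifier

def validate(save_path, X_train, y_train, X_val, y_val, *, use_native_validation):
    if use_native_validation:
        # Current behavior: raw model forward pass without sklearn preprocessing.
        # Placeholder only; the real path lives in validation_utils.py.
        raise NotImplementedError("native validation path")
    # Otherwise: fit/predict with the full sklearn interface, preprocessing included.
    clf = TabPFNClassifier(model_path=save_path, device="cuda")
    clf.fit(X_train, y_train)
    return clf.predict_proba(X_val)
```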

I like the practicality of the current approach. I was wondering whether it is possible to extract the sklearn preprocessing and invoke it on its own. However, that could be fragile; I guess using the standard TabPFN model API is safer.

What needs to happen next?

  • support regression
  • make the validation method configurable from the fine-tuning entry point
  • testing

A few things puzzle me.

  1. from autogluon.core.utils.early_stopping import ESWrapperOOF: I can't find this; also, is an autogluon dependency desirable?

  2. this line looks funny: es_wrapper_oof.update(y=y_val, y_score=y_pred_proba_val, cur_round=0, y_pred_proba=y_pred_proba_val) (the same predictions are passed as both y_score and y_pred_proba)

  3. Why did you add the X_test and X_ubiased_val data sets? Shouldn't one be sufficient?

@LennartPurucker
Owner Author

> I like the practicality of the current approach. I was wondering whether it is possible to extract the sklearn preprocessing and invoke it on its own. However, that could be fragile; I guess using the standard TabPFN model API is safer.

That should be possible as well. I know @MagnusBuehler did something like this.

> A few things puzzle me.

You can ignore these things. The branch has some "secret" research code I tested. I suggest creating a new branch and extracting only the use_native_validation logic.

@iivalchev
Contributor

@LennartPurucker thank you for the clarifications. I would then cherry-pick just the portions related to validating against the full-blown TabPFN model and ignore the rest, so no additional validation data sets will be needed. I will go for the simple solution of not trying to extract the pipeline and glue it to the fine-tuning. If that happens not to work well, I will dig into how to improve it.

@LennartPurucker
Owner Author

Sounds good, let me know when I should take a look at a draft PR or similar!

@MagnusBuehler

> That should be possible as well. I know @MagnusBuehler did something like this.

I have extracted the preprocessing into a separate class so that it can be easily applied to the fine-tuning data. I am happy to share this code if you are interested.

@LennartPurucker I have seen that n_estimators is fixed to 1 in the validation code. Is there a reason for this?

clf = TabPFNClassifier(model_path=save_path, n_estimators=1, categorical_features_indices=categorical_features_indices, device="cuda")

@LennartPurucker
Owner Author

Mostly for the sake of speed 🤔

@iivalchev
Contributor

iivalchev commented Apr 14, 2025

iivalchev#1 started porting the initial changes by @LennartPurucker. Needs to be tested.

@MagnusBuehler it would be great if you could share, but is the change in the main TabPFN codebase?

Some fine-tuning effort is also being done here: PriorLabs/TabPFN#273, FYI.

@LennartPurucker
Owner Author

> Some fine-tuning effort is also being done here: PriorLabs/TabPFN#273, FYI.

I am aware of this, but the primary focus seems to be on being able to fine-tune the sklearn interface. Last time I checked, the code disabled preprocessing to achieve this.

It is orthogonal to this code base in the sense that the PR does not focus on the training but on making it trainable, so to speak.

@MagnusBuehler

@iivalchev Here is a slightly modified version, which I use: https://gist.github.com/MagnusBuehler/3e62613b7f0f7556eacb16653e584533
I replicated the default sklearn parameters to align the preprocessing with the default TabPFNClassifier. One detail: the n_estimators setting affects how many differently augmented views of the data set are generated (with n_estimators=4, four different xs, ys pairs are generated).
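
As a rough illustration of that detail via the public interface only (the view-generation behavior is paraphrased in the comments, not taken from the gist):

```python
from sklearn.datasets import make_classification
from tabpfn import TabPFNClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# n_estimators controls how many differently augmented (xs, ys) views of the
# training data the preprocessing generates; each view becomes one ensemble
# member, and predictions are averaged across them.
clf = TabPFNClassifier(n_estimators=4)
clf.fit(X, y)
proba = clf.predict_proba(X)  # averaged over the 4 augmented views
```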

@iivalchev
Contributor

@MagnusBuehler thanks!
