
Calculation of feature importances in a supervised setting #677

Merged: 14 commits, Apr 7, 2024

Conversation

Lilly-May
Collaborator

@Lilly-May Lilly-May commented Mar 28, 2024

PR Checklist

  • This comment contains a description of changes (with reason)
  • Referenced issue is linked (closes Calculate feature importances easily #258)
  • If you've fixed a bug or added code that should be tested, add tests!

Description of changes

  • Added a new method feature_importances to calculate the contribution of each feature to the prediction of a specified feature. For instance, one might ask which features are good predictors of disease severity; depending on the disease, features such as age or pre-existing conditions would then have high feature importance.
  • Users can choose between a continuous and a categorical prediction target, as well as between SVM, random forest, and regression models for performing the prediction.
  • Implemented a basic plotting function for visualizing individual feature importances with a bar plot (see example below)
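To illustrate the idea behind the new method, here is a minimal sketch of how impurity-based feature importances can be obtained with scikit-learn's random forest (the "rf" model option mentioned above). This is not ehrapy's actual implementation; the toy DataFrame and column names are made up for illustration.

```python
# Sketch only: fit a random forest on all other features and read off its
# impurity-based feature importances for the predicted feature.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy data; in ehrapy this would come from adata.X / adata.var.
df = pd.DataFrame(
    {
        "age": [40, 55, 63, 29, 71, 48],
        "weight": [70, 82, 91, 60, 77, 85],
        "tco2_first": [24.0, 22.5, 21.0, 26.0, 20.5, 23.0],
    }
)

predicted_feature = "tco2_first"
X = df.drop(columns=[predicted_feature])
y = df[predicted_feature]

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# One importance value per input feature, normalized by scikit-learn.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

A categorical prediction target would use RandomForestClassifier in the same way.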

ToDos

  • Test performance on the test set. Should we then print the resulting loss/accuracy? Should that somehow affect the plotting function, e.g. as an indicator of reliability?
  • Improve the plotting function. Personally, I would prefer to plot everything in the same color - what do you think?

Considerations

  • Currently, the importances are only comparable relative to each other; we can't compare them across models or datasets. I'll think more about how we could improve interpretability.
  • How do we handle non-numerical features? Currently, I'm simply dropping them and issuing a warning. Should we instead throw an error stating that features must be properly encoded first?
  • We should discuss if we want to incorporate bias detection (e.g. FairLearn).
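Regarding the first consideration (importances only being comparable relative to each other): one possible direction, sketched here under assumed toy data rather than ehrapy's API, is permutation importance, which is expressed as the drop in a held-out score and is therefore somewhat more comparable across models.

```python
# Sketch only: permutation importance as an alternative to impurity-based
# importances. Column names and data are hypothetical.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = pd.DataFrame(
    {
        "age": rng.normal(50, 10, 200),    # informative feature
        "noise": rng.normal(0, 1, 200),    # uninformative feature
    }
)
y = 0.5 * X["age"] + rng.normal(0, 1, 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Shuffle each feature on the held-out set and measure the score drop.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for name, imp in zip(X.columns, result.importances_mean):
    print(f"{name}: {imp:.3f}")
```

Because the score drop is measured on a test set, a near-zero value has a direct interpretation (the model does not rely on that feature), which relative impurity-based importances lack.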

Example

adata = ep.dt.mimic_2(encoded=False)
ep.pp.knn_impute(adata, n_neighbours=5)
ep.tl.feature_importances(adata, predicted_feature="tco2_first", prediction_type="continuous", model="rf", input_features="all")
ep.pl.feature_importances(adata)

[Attached screenshot: bar plot of individual feature importances produced by ep.pl.feature_importances]

@github-actions github-actions bot added the enhancement New feature or request label Mar 28, 2024
@Zethson
Member

Zethson commented Mar 28, 2024

How do we handle non-numerical features? Currently, I'm simply dropping them and issuing a warning. Should we instead throw an error stating that features must be properly encoded first?

Yes, I would not drop them. Maybe even throw a ValueError with the suggested solution

@Lilly-May
Collaborator Author

Yes, I would not drop them. Maybe even throw a ValueError with the suggested solution

I adjusted the code as suggested: Non-numeric features now cause the function to fail with a ValueError, requesting a proper encoding of the features beforehand.
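The validation described above could look roughly like the following sketch. The helper name and error message are hypothetical, not the exact code merged in this PR.

```python
# Sketch only: fail early with a ValueError if any input feature is
# non-numeric, instead of silently dropping it.
import pandas as pd

def check_features_numeric(df: pd.DataFrame) -> None:
    """Raise a ValueError listing all non-numeric feature columns."""
    non_numeric = [
        col for col in df.columns
        if not pd.api.types.is_numeric_dtype(df[col])
    ]
    if non_numeric:
        raise ValueError(
            f"Feature(s) {non_numeric} are not numeric. "
            "Please encode all features before computing feature importances."
        )
```

Listing every offending column in one error saves the user from fixing features one at a time.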

@Lilly-May Lilly-May marked this pull request as ready for review April 3, 2024 07:52
@Lilly-May Lilly-May requested a review from Zethson April 3, 2024 07:53
Member

@Zethson Zethson left a comment


Really good work! Thank you very much.

Besides my comments, we also need to add this to the documentation of ehrapy somewhere. I would try not to add a new section, but I'll leave it up to you to make a suggestion.

Review threads (outdated, resolved):
  • ehrapy/plot/supervised/_feature_importances.py (4 threads)
  • ehrapy/tools/supervised/_feature_importances.py (6 threads)
@github-actions github-actions bot added the chore label Apr 3, 2024
@Lilly-May Lilly-May removed the chore label Apr 3, 2024
@Zethson
Member

Zethson commented Apr 3, 2024

Awesome! Just one more thing to resolve (#677 (comment)) and then we're good to go!

@github-actions github-actions bot added the chore label Apr 7, 2024
Co-authored-by: Lukas Heumos <lukas.heumos@posteo.net>
@Lilly-May Lilly-May removed the chore label Apr 7, 2024
@Lilly-May Lilly-May merged commit 47e5c07 into main Apr 7, 2024
9 of 17 checks passed
@Zethson Zethson deleted the feature/feature_importances branch April 17, 2024 08:02
Labels
enhancement New feature or request
Successfully merging this pull request may close these issues.

Calculate feature importances easily
2 participants