
Calculation of feature importances in a supervised setting #677

Merged: 14 commits, Apr 7, 2024

Conversation

Lilly-May
Collaborator

@Lilly-May Lilly-May commented Mar 28, 2024

PR Checklist

  • This comment contains a description of changes (with reason)
  • Referenced issue is linked (closes Calculate feature importances easily #258)
  • If you've fixed a bug or added code that should be tested, add tests!

Description of changes

  • Added a new method feature_importances to calculate the contribution of each feature to the prediction of a specified feature. For instance, one might ask which features are good predictors of disease severity; depending on the disease, features such as age or pre-existing conditions would then have high feature importance.
  • Users can choose between a continuous and a categorical prediction target, as well as between SVM, random forest, and regression models for performing the prediction.
  • Implemented a basic plotting function for visualizing individual feature importances with a bar plot (see example below)
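To illustrate the idea behind the new method, here is a minimal sketch of how impurity-based feature importances can be obtained with scikit-learn's random forest (the "rf" model option mentioned above). This is not ehrapy's actual implementation; the toy DataFrame and column names are made up for illustration.

```python
# Sketch only: fit a random forest on all other features and read off its
# impurity-based feature importances for the predicted feature.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy data; in ehrapy this would come from adata.X / adata.var.
df = pd.DataFrame(
    {
        "age": [40, 55, 63, 29, 71, 48],
        "weight": [70, 82, 91, 60, 77, 85],
        "tco2_first": [24.0, 22.5, 21.0, 26.0, 20.5, 23.0],
    }
)

predicted_feature = "tco2_first"
X = df.drop(columns=[predicted_feature])
y = df[predicted_feature]

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# One importance value per input feature, normalized by scikit-learn.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

A categorical prediction target would use RandomForestClassifier in the same way.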

ToDos

  • Test performance on the test set. Should we then print the resulting loss/accuracy? Should that somehow affect the plotting function, e.g. as an indicator of reliability?
  • Improve the plotting function. Personally, I would prefer to plot everything in the same color - what do you think?

Considerations

  • Currently, the importances are only comparable relative to each other; we can't compare them across models or datasets. I'll think more about how we could improve interpretability.
  • How do we handle non-numerical features? Currently, I'm simply dropping them and issuing a warning. Should we instead throw an error stating that features must be properly encoded first?
  • We should discuss if we want to incorporate bias detection (e.g. FairLearn).
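Regarding the first consideration (importances only being comparable relative to each other): one possible direction, sketched here under assumed toy data rather than ehrapy's API, is permutation importance, which is expressed as the drop in a held-out score and is therefore somewhat more comparable across models.

```python
# Sketch only: permutation importance as an alternative to impurity-based
# importances. Column names and data are hypothetical.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = pd.DataFrame(
    {
        "age": rng.normal(50, 10, 200),    # informative feature
        "noise": rng.normal(0, 1, 200),    # uninformative feature
    }
)
y = 0.5 * X["age"] + rng.normal(0, 1, 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Shuffle each feature on the held-out set and measure the score drop.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for name, imp in zip(X.columns, result.importances_mean):
    print(f"{name}: {imp:.3f}")
```

Because the score drop is measured on a test set, a near-zero value has a direct interpretation (the model does not rely on that feature), which relative impurity-based importances lack.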

Example

adata = ep.dt.mimic_2(encoded=False)
ep.pp.knn_impute(adata, n_neighbours=5)
ep.tl.feature_importances(adata, predicted_feature="tco2_first", prediction_type="continuous", model="rf", input_features="all")
ep.pl.feature_importances(adata)

[Attached screenshot: bar plot of individual feature importances produced by ep.pl.feature_importances]

@github-actions github-actions bot added the enhancement New feature or request label Mar 28, 2024
@Zethson
Member

Zethson commented Mar 28, 2024

How do we handle non-numerical features? Currently, I'm simply dropping them and issuing a warning. Should we instead throw an error stating that features must be properly encoded first?

Yes, I would not drop them. Maybe even throw a ValueError with the suggested solution

@Lilly-May
Collaborator Author

Yes, I would not drop them. Maybe even throw a ValueError with the suggested solution

I adjusted the code as suggested: Non-numeric features now cause the function to fail with a ValueError, requesting a proper encoding of the features beforehand.
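The validation described above could look roughly like the following sketch. The helper name and error message are hypothetical, not the exact code merged in this PR.

```python
# Sketch only: fail early with a ValueError if any input feature is
# non-numeric, instead of silently dropping it.
import pandas as pd

def check_features_numeric(df: pd.DataFrame) -> None:
    """Raise a ValueError listing all non-numeric feature columns."""
    non_numeric = [
        col for col in df.columns
        if not pd.api.types.is_numeric_dtype(df[col])
    ]
    if non_numeric:
        raise ValueError(
            f"Feature(s) {non_numeric} are not numeric. "
            "Please encode all features before computing feature importances."
        )
```

Listing every offending column in one error saves the user from fixing features one at a time.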

@Lilly-May Lilly-May marked this pull request as ready for review April 3, 2024 07:52
@Lilly-May Lilly-May requested a review from Zethson April 3, 2024 07:53
Member

@Zethson Zethson left a comment


Really good work! Thank you very much.

Besides my comments, we also need to add this to the documentation of ehrapy somewhere. I would try not to add a new section, but I'll leave it up to you to make a suggestion.

Review threads (outdated, resolved):
  • ehrapy/plot/supervised/_feature_importances.py (4 threads)
  • ehrapy/tools/supervised/_feature_importances.py (6 threads)
@github-actions github-actions bot added the chore label Apr 3, 2024
@Lilly-May Lilly-May removed the chore label Apr 3, 2024
@Zethson
Member

Zethson commented Apr 3, 2024

Awesome! Just one more thing to resolve (#677 (comment)) and then we're good to go!

@github-actions github-actions bot added the chore label Apr 7, 2024
Co-authored-by: Lukas Heumos <lukas.heumos@posteo.net>
@Lilly-May Lilly-May removed the chore label Apr 7, 2024
@Lilly-May Lilly-May merged commit 47e5c07 into main Apr 7, 2024
9 of 17 checks passed
@Zethson Zethson deleted the feature/feature_importances branch April 17, 2024 08:02
Labels
enhancement New feature or request
Successfully merging this pull request may close these issues.

Calculate feature importances easily
2 participants