Add GBDTs feature importance #292

Open · wants to merge 9 commits into master

Conversation

@xnuohz xnuohz (Contributor) commented Dec 12, 2023

  1. Add APIs to get feature importance
  2. Add test case
  3. Update example

codecov bot commented Dec 12, 2023

Codecov Report

Attention: Patch coverage is 95.45455% with 1 line in your changes missing coverage. Please review.

Project coverage is 93.41%. Comparing base (a90a154) to head (3b4ceb5).
Report is 87 commits behind head on master.

Files with missing lines | Patch % | Lines
torch_frame/gbdt/gbdt.py | 85.71%  | 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master     #292   +/-   ##
=======================================
  Coverage   93.41%   93.41%           
=======================================
  Files         116      116           
  Lines        5949     5970   +21     
=======================================
+ Hits         5557     5577   +20     
- Misses        392      393    +1     


@@ -232,3 +232,7 @@ def _load(self, path: str) -> None:
import xgboost

self.model = xgboost.Booster(model_file=path)

def _feature_importance(self) -> list:
scores = self.model.get_score(importance_type='weight')
Contributor:

Maybe weight could be passed as an argument instead of being hard-coded.
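
For example (a sketch; the importance_type values follow xgboost's Booster.get_score, which accepts 'weight', 'gain', 'cover', 'total_gain', and 'total_cover'):

def _feature_importance(self, importance_type: str = 'weight') -> list:
    assert importance_type in [
        'weight', 'gain', 'cover', 'total_gain', 'total_cover'
    ], f'Unexpected importance_type {importance_type}.'
    # get_score returns a dict mapping feature name to score; keep the
    # list return type so all GBDT backends stay consistent.
    scores = self.model.get_score(importance_type=importance_type)
    return list(scores.values())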

@@ -226,3 +226,7 @@ def _load(self, path: str) -> None:
import lightgbm

self.model = lightgbm.Booster(model_file=path)

def _feature_importance(self) -> list:
scores = self.model.feature_importance(importance_type='gain')
Contributor:

same here.

@@ -135,6 +139,19 @@ def load(self, path: str) -> None:
self._load(path)
self._is_fitted = True

def feature_importance(self) -> list:
Contributor:

Suggested change:
- def feature_importance(self) -> list:
+ def feature_importance(self, *args, **kwargs) -> list:

num_features = 0
for x in stypes:
if x == stype.numerical:
num_features += 3 * 1
Contributor:

I don't quite get the code here.

Contributor (Author):

Here I want to get the total number of FakeDataset features: 3 is the number of columns and 1 is the dimension per column.

Contributor:

Can you do this in a more generic way, e.g. by getting the column counts and dimensions from col_names_dict or tensor_frame, rather than using magic numbers? See the sketch below.
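
One possible way (a sketch; assumes each column contributes a single feature after the GBDT conversion, and that the materialized dataset exposes tensor_frame):

num_features = sum(
    len(col_names)
    for col_names in dataset.tensor_frame.col_names_dict.values())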

@yiweny yiweny (Contributor) left a comment:

Left some comments. @weihua916 or @zechengz or @akihironitta should also take a look.

iteration (int, optional): Limits the number of iterations used in
    the feature importance calculation. If None, the best iteration
    is used when it exists; otherwise, all trees are used. If <= 0,
    all trees are used (no limit).
Contributor:

Add a doc-string for iteration.

'feature': dataset.feat_cols,
'importance': gbdt.feature_importance()
}).sort_values(by='importance', ascending=False)
print(scores)
Contributor:

Add some more explanatory text around the scores output.
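
For example (a sketch; the exact wording is just a suggestion):

print('GBDT feature importance (higher scores indicate features the '
      'model relied on more heavily):')
print(scores)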

Member:

Maybe we can add a parser argument to let the user specify whether they want feature importance.
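
For example (a sketch; the --feature_importance flag name is hypothetical, and pandas is assumed to be imported as pd as in the example script):

parser.add_argument('--feature_importance', action='store_true',
                    help='Print GBDT feature importance after training.')

# ...after training:
if args.feature_importance:
    scores = pd.DataFrame({
        'feature': dataset.feat_cols,
        'importance': gbdt.feature_importance(),
    }).sort_values(by='importance', ascending=False)
    print(scores)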

], f'Expect split or gain, got {importance_type}.'
scores = self.model.feature_importance(importance_type=importance_type,
iteration=iteration)
return scores.tolist()
Member:

Will this just be a flat list of scores? IMO it's better to return a dictionary where keys are column names and values are the corresponding scores. WDYT?

Contributor (Author):

The feature importance APIs of the GBDT libraries have different return types, so for convenience I converted them all to lists:

lightgbm -> ndarray
xgboost -> dict[str, float]
catboost -> ndarray
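
If a dict keyed by column name is preferred, one way to normalize all three backends (a sketch; assumes the wrapper keeps the ordered column names around, here as a hypothetical self.feat_cols attribute):

def feature_importance(self) -> dict[str, float]:
    # Assumes _feature_importance() yields one score per column in
    # column order (lightgbm/catboost do; xgboost's get_score returns
    # a dict and may omit unused features, so it may need alignment).
    scores = self._feature_importance()
    return dict(zip(self.feat_cols, scores))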
