
Add a proper TFIDF transformation #104

Open
johann-petrak opened this issue Jan 29, 2019 · 2 comments

Comments

@johann-petrak (Collaborator)

Currently we use the CorpusStats plugin to create tf-idf scores per token, which we can then inject into the sparse vector (instead of just the raw tf) via the featureName4Value option of an attribute.
However, this does not work for n-grams with n>1: in that case we multiply the per-token scores, but that is not really what we want.
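To illustrate why multiplying per-token scores is not what we want, here is a minimal Python sketch (toy corpus and one simple smoothed idf variant, not the CorpusStats implementation): the product of the unigram tf-idf values differs from the tf-idf computed from the bigram's own document frequency.

```python
import math

# Toy corpus: each document is a list of tokens.
docs = [
    ["new", "york", "city"],
    ["new", "car"],
    ["york", "minster"],
]

def df(term, docs):
    """Document frequency: number of documents containing the term."""
    return sum(1 for d in docs if term in d)

def tfidf(term, doc, docs):
    tf = doc.count(term)
    # One smoothed idf variant; CorpusStats offers several, this is just an example.
    idf = math.log((1 + len(docs)) / (1 + df(term, docs)))
    return tf * idf

doc = docs[0]
# Multiplying the unigram scores (current behaviour for n>1) ...
product = tfidf("new", doc, docs) * tfidf("york", doc, docs)

# ... is not the tf-idf of the bigram itself, which depends on the
# bigram's own document frequency:
bigram_docs = [list(zip(d, d[1:])) for d in docs]
bigram_score = tfidf(("new", "york"), bigram_docs[0], bigram_docs)

print(product, bigram_score)  # the two values differ
```

Here "new" and "york" each occur in two of the three documents, but the bigram ("new", "york") occurs in only one, so the bigram's idf is genuinely different from anything derivable from the unigram scores alone.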

Maybe we can add the CorpusStats code as a subroutine that gathers statistics on the fly during feature extraction, and then update the generated instances based on those statistics. This would of course only work for Mallet representations, not for any representation that gets written out immediately.
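The on-the-fly approach could look roughly like this sketch (function name and representation are hypothetical, not the plugin API): one pass that extracts raw counts while accumulating document frequencies, then a second pass that rewrites the in-memory instances, which is exactly why it only works for an in-memory (Mallet-style) representation.

```python
import math
from collections import Counter

def extract_and_transform(token_docs):
    """Two-pass sketch: gather df stats during extraction, then
    rewrite the kept-in-memory instances to tf-idf."""
    instances = []   # list of {feature: raw count} dicts
    df = Counter()   # document frequency per feature
    for tokens in token_docs:
        counts = Counter(tokens)
        instances.append(dict(counts))
        df.update(counts.keys())        # stats gathered on the fly
    n = len(instances)
    for inst in instances:              # second pass over the instances
        for feat, tf in inst.items():
            inst[feat] = tf * math.log((1 + n) / (1 + df[feat]))
    return instances

vecs = extract_and_transform([["a", "a", "b"], ["b", "c"]])
```

A representation that streams each instance to disk as it is created would have no second pass to hook into, so the counts would already be written before the corpus-wide stats are known.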

@johann-petrak (Collaborator, Author)

We would have to gather statistics on a per-attribute basis, probably based on some new flag in the attribute declaration. Currently, we set the sparse vector element of a nominal attribute to 1.0 and the sparse vector element of an ngram to the number of times the ngram occurs within the span.

We should probably make ngram counting configurable (allow just using 1.0 there as well).

Then we need a per-attribute stats object that can be updated concurrently, and the feature extraction code needs to know that we want to do this. Finally, we have to add a transformer stage to the pipeline that transforms the counts based on the stats, according to one of several configurable methods.
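A rough sketch of those two pieces, with hypothetical names (not the plugin's API): a per-attribute stats object guarded by a lock so that several extraction threads can update it concurrently, and a transformer stage that applies one of several configurable methods.

```python
import math
import threading
from collections import Counter

class AttributeStats:
    """Per-attribute corpus statistics, safe for concurrent updates."""
    def __init__(self):
        self._lock = threading.Lock()
        self.df = Counter()   # document frequency per feature
        self.n_docs = 0

    def update(self, features):
        with self._lock:
            self.n_docs += 1
            self.df.update(set(features))  # count each feature once per doc

def transform(vector, stats, method="tfidf"):
    """Transformer stage: rewrite raw counts according to the stats."""
    if method == "tfidf":
        return {f: tf * math.log((1 + stats.n_docs) / (1 + stats.df[f]))
                for f, tf in vector.items()}
    if method == "binary":
        return {f: 1.0 for f in vector}
    raise ValueError(f"unknown method: {method}")
```

In the real (Java) implementation the lock could be replaced by a concurrent map with atomic counters; the point here is only the shape of the two stages.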

Finally, we should also allow filtering based on some kind of threshold, e.g. not including a feature if its df is too small or too large. So we would have to check whether we can remove a feature from a sparse vector at that stage, or whether just setting it to 0.0 works equally well (possibly creating a significant number of non-sparse zeroes).
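A sketch of the df-threshold filtering on a sparse vector represented as a feature-to-value dict (thresholds and names are illustrative). It shows both options discussed above: actually removing the entry, which keeps the vector sparse, versus setting it to 0.0, which keeps the key and creates an explicit non-sparse zero.

```python
def filter_by_df(vector, df, n_docs, min_df=2, max_df_ratio=0.5, zero_out=False):
    """Drop (or zero out) features whose df is too small or too large."""
    kept = {}
    for feat, val in vector.items():
        feat_df = df.get(feat, 0)
        too_rare = feat_df < min_df
        too_common = feat_df > max_df_ratio * n_docs
        if too_rare or too_common:
            if zero_out:
                kept[feat] = 0.0   # explicit zero stays in the vector
            # else: remove the entry entirely, vector stays sparse
        else:
            kept[feat] = val
    return kept
```

With many rare features, the zero_out variant would retain all those keys as explicit zeroes, which is exactly the sparsity concern raised above.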

The problem with all this is that, in order to make it work properly, a large number of possible approaches and options would have to be supported.

@johann-petrak (Collaborator, Author) commented Jan 29, 2019

Maybe we should try to implement some simple filtering first, based on just the DF stats of individual unigrams, even for the ngrams: filter out all ngrams or values where the featureName4Value value is e.g. 0.0 or null. For now, this would allow a simple approach where we run CorpusStats and then a Groovy script to filter the features.

Currently, we impute 1.0 if featureName4Value is specified but the feature is not found; we could simply change this to do the filtering instead.
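The difference between the current imputation and the proposed filtering could be sketched like this (hypothetical names; tokens are modeled as plain dicts carrying the featureName4Value feature):

```python
def inject_values(tokens, score_feature, filter_missing=False):
    """Build a sparse vector from per-token score features."""
    vector = {}
    for tok in tokens:
        score = tok.get(score_feature)
        if score is None:
            if filter_missing:
                continue                  # proposed: filter the feature out
            vector[tok["string"]] = 1.0   # current: impute 1.0
        elif filter_missing and score == 0.0:
            continue                      # also drop explicit 0.0 scores
        else:
            vector[tok["string"]] = score
    return vector

toks = [
    {"string": "the"},                    # score feature missing
    {"string": "zero", "tfidf": 0.0},
    {"string": "york", "tfidf": 0.7},
]
```

With filter_missing=True, only the token with a non-zero score survives, which is the simple CorpusStats-plus-filtering workflow suggested above.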
