
Add a proper TFIDF transformation #104

Open
johann-petrak opened this issue Jan 29, 2019 · 2 comments

Comments

@johann-petrak (Collaborator)

Currently we use the CorpusStats plugin to create tf-idf scores per token, which we can then inject into the sparse vector (instead of just the raw tf) via the featureName4Value option of an attribute.
However, this does not work for n-grams with n>1: in that case we multiply the per-token scores, but that is not really what we want.
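To illustrate why multiplying per-token scores is not what we want, here is a minimal Python sketch (toy corpus and one simple smoothed idf variant, not the CorpusStats implementation): the product of the unigram tf-idf values differs from the tf-idf computed from the bigram's own document frequency.

```python
import math

# Toy corpus: each document is a list of tokens.
docs = [
    ["new", "york", "city"],
    ["new", "car"],
    ["york", "minster"],
]

def df(term, docs):
    """Document frequency: number of documents containing the term."""
    return sum(1 for d in docs if term in d)

def tfidf(term, doc, docs):
    tf = doc.count(term)
    # One smoothed idf variant; CorpusStats offers several, this is just an example.
    idf = math.log((1 + len(docs)) / (1 + df(term, docs)))
    return tf * idf

doc = docs[0]
# Multiplying the unigram scores (current behaviour for n>1) ...
product = tfidf("new", doc, docs) * tfidf("york", doc, docs)

# ... is not the tf-idf of the bigram itself, which depends on the
# bigram's own document frequency:
bigram_docs = [list(zip(d, d[1:])) for d in docs]
bigram_score = tfidf(("new", "york"), bigram_docs[0], bigram_docs)

print(product, bigram_score)  # the two values differ
```

Here "new" and "york" each occur in two of the three documents, but the bigram ("new", "york") occurs in only one, so the bigram's idf is genuinely different from anything derivable from the unigram scores alone.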

Maybe we can add the CorpusStats code as a subroutine that gathers statistics on the fly during feature extraction, and then update the generated instances based on those statistics. This would of course only work for Mallet representations, not for any representation that gets written out immediately.
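The on-the-fly approach could look roughly like this sketch (function name and representation are hypothetical, not the plugin API): one pass that extracts raw counts while accumulating document frequencies, then a second pass that rewrites the in-memory instances, which is exactly why it only works for an in-memory (Mallet-style) representation.

```python
import math
from collections import Counter

def extract_and_transform(token_docs):
    """Two-pass sketch: gather df stats during extraction, then
    rewrite the kept-in-memory instances to tf-idf."""
    instances = []   # list of {feature: raw count} dicts
    df = Counter()   # document frequency per feature
    for tokens in token_docs:
        counts = Counter(tokens)
        instances.append(dict(counts))
        df.update(counts.keys())        # stats gathered on the fly
    n = len(instances)
    for inst in instances:              # second pass over the instances
        for feat, tf in inst.items():
            inst[feat] = tf * math.log((1 + n) / (1 + df[feat]))
    return instances

vecs = extract_and_transform([["a", "a", "b"], ["b", "c"]])
```

A representation that streams each instance to disk as it is created would have no second pass to hook into, so the counts would already be written before the corpus-wide stats are known.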

@johann-petrak (Collaborator, Author)

We would have to gather statistics on a per-attribute basis, probably based on some new flag in the attribute declaration. Currently, we set the sparse vector element of a nominal attribute to 1.0 and the sparse vector element of an ngram to the number of times the ngram occurs within the span.

We should probably make ngram counting configurable (allow just using 1.0 there as well).

Then we need a per-attribute stats object that can be updated concurrently, and the feature extraction code needs to know that we want to do this. Finally, we have to add a transformer stage to the pipeline that transforms the counts based on the stats, according to one of several configurable methods.
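A rough sketch of those two pieces, with hypothetical names (not the plugin's API): a per-attribute stats object guarded by a lock so that several extraction threads can update it concurrently, and a transformer stage that applies one of several configurable methods.

```python
import math
import threading
from collections import Counter

class AttributeStats:
    """Per-attribute corpus statistics, safe for concurrent updates."""
    def __init__(self):
        self._lock = threading.Lock()
        self.df = Counter()   # document frequency per feature
        self.n_docs = 0

    def update(self, features):
        with self._lock:
            self.n_docs += 1
            self.df.update(set(features))  # count each feature once per doc

def transform(vector, stats, method="tfidf"):
    """Transformer stage: rewrite raw counts according to the stats."""
    if method == "tfidf":
        return {f: tf * math.log((1 + stats.n_docs) / (1 + stats.df[f]))
                for f, tf in vector.items()}
    if method == "binary":
        return {f: 1.0 for f in vector}
    raise ValueError(f"unknown method: {method}")
```

In the real (Java) implementation the lock could be replaced by a concurrent map with atomic counters; the point here is only the shape of the two stages.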

Finally, we should also allow filtering based on some kind of threshold, e.g. not including a feature if its df is too small or too large. So we would have to check whether we can remove a feature from a sparse vector at that stage, or whether just setting it to 0.0 works equally well (possibly creating a significant number of non-sparse zeroes).
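A sketch of the df-threshold filtering on a sparse vector represented as a feature-to-value dict (thresholds and names are illustrative). It shows both options discussed above: actually removing the entry, which keeps the vector sparse, versus setting it to 0.0, which keeps the key and creates an explicit non-sparse zero.

```python
def filter_by_df(vector, df, n_docs, min_df=2, max_df_ratio=0.5, zero_out=False):
    """Drop (or zero out) features whose df is too small or too large."""
    kept = {}
    for feat, val in vector.items():
        feat_df = df.get(feat, 0)
        too_rare = feat_df < min_df
        too_common = feat_df > max_df_ratio * n_docs
        if too_rare or too_common:
            if zero_out:
                kept[feat] = 0.0   # explicit zero stays in the vector
            # else: remove the entry entirely, vector stays sparse
        else:
            kept[feat] = val
    return kept
```

With many rare features, the zero_out variant would retain all those keys as explicit zeroes, which is exactly the sparsity concern raised above.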

The problem with all this is that, in order to make it work properly, a large number of possible approaches and options would have to be supported.

@johann-petrak (Collaborator, Author) commented Jan 29, 2019

Maybe we should try to implement some simple filtering first, based on just the DF stats of individual unigrams, even for the ngrams: filter out all ngrams or values where the featureName4Value value is e.g. 0.0 or null. For now, this would allow a simple approach where we run CorpusStats and then a Groovy script to filter the features.

Currently, we impute 1.0 if featureName4Value is specified but the feature is not found; we could simply change this to do the filtering instead.
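The difference between the current imputation and the proposed filtering could be sketched like this (hypothetical names; tokens are modeled as plain dicts carrying the featureName4Value feature):

```python
def inject_values(tokens, score_feature, filter_missing=False):
    """Build a sparse vector from per-token score features."""
    vector = {}
    for tok in tokens:
        score = tok.get(score_feature)
        if score is None:
            if filter_missing:
                continue                  # proposed: filter the feature out
            vector[tok["string"]] = 1.0   # current: impute 1.0
        elif filter_missing and score == 0.0:
            continue                      # also drop explicit 0.0 scores
        else:
            vector[tok["string"]] = score
    return vector

toks = [
    {"string": "the"},                    # score feature missing
    {"string": "zero", "tfidf": 0.0},
    {"string": "york", "tfidf": 0.7},
]
```

With filter_missing=True, only the token with a non-zero score survives, which is the simple CorpusStats-plus-filtering workflow suggested above.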
