Generating patterns from documents #11267
Hey @dimidd - thanks for the question. This is something I've been curious about myself for a while. There is a library in the spaCy Universe for this kind of thing, but I had done something similar at a past job and thought I'd resurrect that code and share it with you. In this case, I'm just extracting lemma features with spaCy, building an ML model on top of them, and interpreting the features by importance. It looks for lemma patterns between 2 and 4 tokens long. This should at least give some insight into which patterns might be valuable, and you can reverse-engineer some simpler rules from there. For this example, I'm using the ag_news dataset from Hugging Face.

import pandas as pd
import spacy
from datasets import load_dataset
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import precision_score, recall_score
from tqdm import tqdm
dataset = load_dataset("ag_news")
LIMIT = 10000
# Load data and set the business class to the positive label
train_text = dataset["train"]["text"][:LIMIT]
train_labels = dataset["train"]["label"][:LIMIT]
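# In the Hugging Face ag_news dataset, label 2 is the Business category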
train_labels_business = [1 if label == 2 else 0 for label in train_labels]
# Process docs to get lemmas
nlp = spacy.load("en_core_web_md")
lemmatized_docs = []
for doc in tqdm(nlp.pipe(train_text, disable=["ner"]), total=len(train_text)):
    lemmatized_docs.append([token.lemma_ for token in doc if token.is_alpha])
def identity(tokens):
    return tokens
# Fit a binary count vectorizer for features
# then use a RandomForest for the model
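# The docs are already lists of lemmas, so identity is used as both the
# tokenizer and the preprocessor; binary=True records only whether an n-gram
# occurs in a doc, not how many times.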
cv = CountVectorizer(
    tokenizer=identity, preprocessor=identity, binary=True, ngram_range=(2, 4)
)
X = cv.fit_transform(lemmatized_docs)
y = train_labels_business
clf = RandomForestClassifier(random_state=1234)
clf.fit(X, y)
# Calculate feature statistics
feature_names = cv.get_feature_names_out().tolist()
features = pd.DataFrame(
    list(zip(feature_names, clf.feature_importances_)),
    columns=["feature", "score"],
).sort_values("score", ascending=False)
features_top = features.head(500).copy()
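# Score each top n-gram as if it were a standalone rule: treat the binary
# feature column (does the n-gram occur in the doc?) as a prediction and
# compare it against the true labels.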
features_top["precision"] = features_top["feature"].apply(
    lambda x: precision_score(y, X[:, feature_names.index(x)].toarray().ravel())
)
features_top["recall"] = features_top["feature"].apply(
    lambda x: recall_score(y, X[:, feature_names.index(x)].toarray().ravel())
)
features_top["f1"] = (features_top["precision"] * features_top["recall"]) / 2
print(features_top.sort_values("f1", ascending=False).head(10).round(2))
You can see that this highlights the importance of knowing your data and doing some preprocessing: the top patterns for business-related news articles are apparently the ones from Reuters with NEW YORK as the location. Depending on your use case, those might not be good patterns for you (e.g. if your incoming articles don't include a location or don't come from the same source). There's obviously a bit more to do with this. From here you'd have to translate the lemma n-grams into Matcher patterns. You could also explore adding additional features and looking at combinations of rules, but be warned that the possibility space of combined rules grows really fast. You might also want to penalize longer patterns or write a custom scoring function.
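To make that translation step a bit more concrete, here is a rough sketch of one way the top lemma n-grams could be turned into Matcher patterns. It's an illustration rather than a finished recipe, and it assumes the snippet above has already been run so that nlp, train_text, and features_top exist. One caveat: the n-grams were built after dropping non-alphabetic tokens, so the pattern allows any number of non-alphabetic tokens between the lemmas.

from spacy.matcher import Matcher

def ngram_to_pattern(ngram):
    # CountVectorizer joins the lemmas with spaces, so split them back out.
    # Allow non-alphabetic tokens (punctuation, digits) between lemmas,
    # since those were filtered out before the n-grams were built.
    pattern = []
    for i, lemma in enumerate(ngram.split()):
        if i > 0:
            pattern.append({"IS_ALPHA": False, "OP": "*"})
        pattern.append({"LEMMA": lemma})
    return pattern

matcher = Matcher(nlp.vocab)
for ngram in features_top.sort_values("f1", ascending=False).head(10)["feature"]:
    matcher.add(ngram, [ngram_to_pattern(ngram)])

# Quick sanity check against one of the training texts
doc = nlp(train_text[0])
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], "->", doc[start:end].text)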
Hi, thanks for this great library.
I was wondering if there's any recommended workflow, or even automated tooling, for creating Matcher patterns based on labelled docs. Suppose we have a labelled dataset for text classification, where each document has multiple labels, and we'd like to create a set of patterns for each label with some desiderata in mind.
Is there a recommended methodology other than manually examining features such as lemma, POS, dependency labels, etc.?
I'd love to learn from the collective experience of Explosion and the NLP community at large.