Generating patterns from documents #11267
Hey @dimidd - thanks for the question. This is something I've been curious about myself for a while. There is a library in the spaCy Universe for this kind of thing, but I had done something similar at a past job and thought I'd resurrect that code and share it with you. In this case, I'm just extracting lemma features with spaCy, building an ML model on top of them, and interpreting the features by importance. It looks for lemma patterns between 2 and 4 tokens long. This should at least give some insight into which patterns might be valuable, and you can reverse-engineer some simpler rules from there. For this example, I'm using the ag_news dataset from Hugging Face.

import pandas as pd
import spacy
from datasets import load_dataset
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import precision_score, recall_score
from tqdm import tqdm
dataset = load_dataset("ag_news")
LIMIT = 10000
# Load data and set the business class to the positive label
train_text = dataset["train"]["text"][:LIMIT]
train_labels = dataset["train"]["label"][:LIMIT]
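# In the Hugging Face ag_news dataset, label 2 is the Business category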
train_labels_business = [1 if label == 2 else 0 for label in train_labels]
# Process docs to get lemmas
nlp = spacy.load("en_core_web_md")
lemmatized_docs = []
for doc in tqdm(nlp.pipe(train_text, disable=["ner"]), total=len(train_text)):
    lemmatized_docs.append([token.lemma_ for token in doc if token.is_alpha])
def identity(tokens):
    return tokens
# Fit a binary count vectorizer for features
# then use a RandomForest for the model
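# The docs are already lists of lemmas, so identity is used as both the
# tokenizer and the preprocessor; binary=True records only whether an n-gram
# occurs in a doc, not how many times.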
cv = CountVectorizer(
    tokenizer=identity, preprocessor=identity, binary=True, ngram_range=(2, 4)
)
X = cv.fit_transform(lemmatized_docs)
y = train_labels_business
clf = RandomForestClassifier(random_state=1234)
clf.fit(X, y)
# Calculate feature statistics
feature_names = cv.get_feature_names_out().tolist()
features = pd.DataFrame(
    list(zip(feature_names, clf.feature_importances_)),
    columns=["feature", "score"],
).sort_values("score", ascending=False)
features_top = features.head(500).copy()
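# Score each top n-gram as if it were a standalone rule: treat the binary
# feature column (does the n-gram occur in the doc?) as a prediction and
# compare it against the true labels.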
features_top["precision"] = features_top["feature"].apply(
    lambda x: precision_score(y, X[:, feature_names.index(x)].toarray().ravel())
)
features_top["recall"] = features_top["feature"].apply(
    lambda x: recall_score(y, X[:, feature_names.index(x)].toarray().ravel())
)
features_top["f1"] = (features_top["precision"] * features_top["recall"]) / 2
print(features_top.sort_values("f1", ascending=False).head(10).round(2))
You can see that this highlights the importance of knowing your data and doing some preprocessing: the top patterns for business-related news articles are apparently the ones from Reuters with NEW YORK as the location. Depending on your use case, those might not be good patterns for you (e.g. if your incoming articles don't include a location or don't come from the same source). There's obviously a bit more to do with this. From here you'd have to translate the lemma n-grams into Matcher patterns. You could also explore adding additional features and looking at combinations of rules, but be warned that the possibility space of combined rules grows really fast. You might also want to penalize longer patterns or write a custom scoring function.
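To make that translation step a bit more concrete, here is a rough sketch of one way the top lemma n-grams could be turned into Matcher patterns. It's an illustration rather than a finished recipe, and it assumes the snippet above has already been run so that nlp, train_text, and features_top exist. One caveat: the n-grams were built after dropping non-alphabetic tokens, so the pattern allows any number of non-alphabetic tokens between the lemmas.

from spacy.matcher import Matcher

def ngram_to_pattern(ngram):
    # CountVectorizer joins the lemmas with spaces, so split them back out.
    # Allow non-alphabetic tokens (punctuation, digits) between lemmas,
    # since those were filtered out before the n-grams were built.
    pattern = []
    for i, lemma in enumerate(ngram.split()):
        if i > 0:
            pattern.append({"IS_ALPHA": False, "OP": "*"})
        pattern.append({"LEMMA": lemma})
    return pattern

matcher = Matcher(nlp.vocab)
for ngram in features_top.sort_values("f1", ascending=False).head(10)["feature"]:
    matcher.add(ngram, [ngram_to_pattern(ngram)])

# Quick sanity check against one of the training texts
doc = nlp(train_text[0])
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], "->", doc[start:end].text)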
Hi, thanks for this great library.
I was wondering if there's any recommended workflow, or even automated tooling, for creating Matcher patterns based on labelled docs. Suppose we have a labelled dataset for text classification, where each document has multiple labels, and we'd like to create a set of patterns for each label with some desiderata in mind.
Is there a recommended methodology other than manually examining features such as lemma, POS, dependency labels, etc.?
I'd love to learn from the collective experience of Explosion and the NLP community at large.