diff --git a/docs/changelog.md b/docs/changelog.md index f4b32ef5..a6ae4948 100644 --- a/docs/changelog.md +++ b/docs/changelog.md @@ -1,3 +1,38 @@ +--- +hide: + - navigation +--- + + +## **Version 0.7.0** +*Release date: 3 November, 2022* + +**Highlights**: + +* Cleaned up documentation and added several visual representations of the algorithm (excluding MMR / MaxSum) +* Added function to extract and pass word- and document embeddings which should make fine-tuning much faster + +```python +from keybert import KeyBERT + +kw_model = KeyBERT() + +# Prepare embeddings +doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs) + +# Extract keywords without needing to re-calculate embeddings +keywords = kw_model.extract_keywords(docs, doc_embeddings=doc_embeddings, word_embeddings=word_embeddings) +``` + +Do note that the parameters passed to `.extract_embeddings` for creating the vectorizer should be exactly the same as those in `.extract_keywords`. + +**Fixes**: + +* Redundant documentation was removed by [@mabhay3420](https://github.com/priyanshul-govil) in [#123](https://github.com/MaartenGr/KeyBERT/pull/123) +* Fixed Gensim backend not working after v4 migration ([#71](https://github.com/MaartenGr/KeyBERT/issues/71)) +* Fixed `candidates` not working ([#122](https://github.com/MaartenGr/KeyBERT/issues/122)) + + ## **Version 0.6.0** *Release date: 25 July, 2022* diff --git a/docs/faq.md b/docs/faq.md index 82f16f74..04814585 100644 --- a/docs/faq.md +++ b/docs/faq.md @@ -1,3 +1,8 @@ +--- +hide: + - navigation +--- + ## **Which embedding model works best for which language?** Unfortunately, there is not a definitive list of the best models for each language, this highly depends on your data, the model, and your specific use-case. However, the default model in KeyBERT diff --git a/docs/guides/quickstart.md b/docs/guides/quickstart.md index 29b35ccb..d6e2871e 100644 --- a/docs/guides/quickstart.md +++ b/docs/guides/quickstart.md @@ -14,7 +14,13 @@ pip install keybert[spacy] pip install keybert[use] ``` -## **Usage** + +
+--8<-- "docs/images/pipeline.svg" +
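The pipeline figure above can be made concrete with a few lines of scikit-learn and sentence-transformers. The snippet below is only a rough sketch of the same steps (tokenize candidates with `CountVectorizer`, embed the document and candidates, rank by cosine similarity), not KeyBERT's exact implementation; the example sentence is taken from the figure, the model name is simply KeyBERT's default, and `get_feature_names_out` assumes scikit-learn >= 1.0:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

doc = "Most microbats use echolocation to navigate and find food."

# 1. Tokenize the document into candidate keywords (scikit-learn >= 1.0)
count = CountVectorizer(ngram_range=(1, 1), stop_words="english").fit([doc])
candidates = list(count.get_feature_names_out())

# 2. Embed the document and all candidates with the same model
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_embedding = model.encode([doc])
candidate_embeddings = model.encode(candidates)

# 3. Rank candidates by cosine similarity to the document and keep the top 5
distances = cosine_similarity(doc_embedding, candidate_embeddings)[0]
top_n = distances.argsort()[-5:][::-1]
keywords = [(candidates[i], round(float(distances[i]), 4)) for i in top_n]
print(keywords)
```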
+ + +## **Basic usage** The most minimal example can be seen below for the extraction of keywords: ```python @@ -70,6 +76,12 @@ keywords = kw_model.extract_keywords(doc, highlight=True) I would advise either `"all-MiniLM-L6-v2"` for English documents or `"paraphrase-multilingual-MiniLM-L12-v2"` for multi-lingual documents or any other language. +## **Fine-tuning** + +As a default, KeyBERT simply compares the documents and candidate keywords/keyphrases based on their cosine similarity. However, this might lead +to very similar words ending up in the list of most accurate keywords/keyphrases. To make sure they are a bit more diversified, there are two +approaches that we can take in order to fine-tune our output, **Max Sum Distance** and **Maximal Marginal Relevance**. + ### **Max Sum Distance** To diversify the results, we take the 2 x top_n most similar words/phrases to the document. @@ -93,8 +105,8 @@ keywords / keyphrases which is also based on cosine similarity. The results with **high diversity**: ```python ->>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english', - use_mmr=True, diversity=0.7) +kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english', + use_mmr=True, diversity=0.7) [('algorithm generalize training', 0.7727), ('labels unseen instances', 0.1649), ('new examples optimal', 0.4185), @@ -114,58 +126,93 @@ The results with **low diversity**: ('learning algorithm generalize', 0.7514)] ``` -### **Candidate Keywords/Keyphrases** +## **Candidate Keywords/Keyphrases** In some cases, one might want to be using candidate keywords generated by other keyword algorithms or retrieved from a select list of possible keywords/keyphrases. In KeyBERT, you can easily use those candidate keywords to perform keyword extraction: ```python import yake from keybert import KeyBERT -doc = """ - Supervised learning is the machine learning task of learning a function that - maps an input to an output based on example input-output pairs.[1] It infers a - function from labeled training data consisting of a set of training examples.[2] - In supervised learning, each example is a pair consisting of an input object - (typically a vector) and a desired output value (also called the supervisory signal). - A supervised learning algorithm analyzes the training data and produces an inferred function, - which can be used for mapping new examples. An optimal scenario will allow for the - algorithm to correctly determine the class labels for unseen instances. This requires - the learning algorithm to generalize from the training data to unseen situations in a - 'reasonable' way (see inductive bias). - """ - # Create candidates kw_extractor = yake.KeywordExtractor(top=50) candidates = kw_extractor.extract_keywords(doc) candidates = [candidate[0] for candidate in candidates] -# KeyBERT init +# Pass candidates to KeyBERT kw_model = KeyBERT() -keywords = kw_model.extract_keywords(doc, candidates) +keywords = kw_model.extract_keywords(doc, candidates=candidates) ``` -### **Guided KeyBERT** +## **Guided KeyBERT** Guided KeyBERT is similar to Guided Topic Modeling in that it tries to steer the training towards a set of seeded terms. When applying KeyBERT it automatically extracts the most related keywords to a specific document. However, there are times when stakeholders and users are looking for specific types of keywords. 
For example, when publishing an article on your website through Contentful, you typically already know the global keywords related to the article. However, there might be a specific topic in the article that you would like to see extracted through the keywords. To achieve this, we simply give KeyBERT a set of related seeded keywords (it can also be a single one!) and search for keywords that are similar to both the document and the seeded keywords. +
+--8<-- "docs/images/guided.svg" +
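As the figure shows, Guided KeyBERT adds a single step to the regular pipeline: the document embedding is averaged with the embedding of the seeded keywords, weighted 3:1 in favour of the document, before the cosine similarities are calculated. A minimal sketch of that averaging step, mirroring the `np.average(..., weights=[3, 1])` call in `KeyBERT.extract_keywords`; the example sentence and the seed word "sonar" are taken from the figure and used only as placeholders:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
doc = "Most microbats use echolocation to navigate and find food."
seed_keywords = ["sonar"]

# Embed the document and the (joined) seeded keywords separately
doc_embedding = model.encode([doc])
seed_embedding = model.encode([" ".join(seed_keywords)])

# A 3:1 weighted average nudges the document representation towards the seeds
# without letting them dominate; candidates are then ranked against this vector
guided_embedding = np.average([doc_embedding, seed_embedding], axis=0, weights=[3, 1])
```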
+ Using this feature is as simple as defining a list of seeded keywords and passing them to KeyBERT: ```python -doc = """ - Supervised learning is the machine learning task of learning a function that - maps an input to an output based on example input-output pairs.[1] It infers a - function from labeled training data consisting of a set of training examples.[2] - In supervised learning, each example is a pair consisting of an input object - (typically a vector) and a desired output value (also called the supervisory signal). - A supervised learning algorithm analyzes the training data and produces an inferred function, - which can be used for mapping new examples. An optimal scenario will allow for the - algorithm to correctly determine the class labels for unseen instances. This requires - the learning algorithm to generalize from the training data to unseen situations in a - 'reasonable' way (see inductive bias). - """ - +from keybert import KeyBERT kw_model = KeyBERT() + +# Define our seeded term seed_keywords = ["information"] -keywords = kw_model.extract_keywords(doc, use_mmr=True, diversity=0.1, seed_keywords=seed_keywords) +keywords = kw_model.extract_keywords(doc, seed_keywords=seed_keywords) +``` + +## **Prepare embeddings** + +When you have a large dataset and you want to fine-tune parameters such as `diversity`, it can take quite a while to re-calculate the document and +word embeddings each time you change a parameter. Instead, we can pre-calculate these embeddings and pass them to `.extract_keywords` such that +we only have to calculate them once: + + +```python +from keybert import KeyBERT + +kw_model = KeyBERT() +doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs) +``` + +You can then use these embeddings and pass them to `.extract_keywords` to speed up tuning the model: + +```python +keywords = kw_model.extract_keywords(docs, doc_embeddings=doc_embeddings, word_embeddings=word_embeddings) +``` + +There are several parameters in `.extract_embeddings` that define how the list of candidate keywords/keyphrases is generated: + +* `candidates` +* `keyphrase_ngram_range` +* `stop_words` +* `min_df` +* `vectorizer` + +The values of these parameters need to be exactly the same in `.extract_embeddings` as they are in `.extract_keywords`.
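One way to keep these parameters in sync is to create a single `CountVectorizer` and pass it to both calls; since a passed `vectorizer` overrides the other parameters, both methods then build their vocabulary in exactly the same way. A sketch of that approach (the n-gram range and stop words are arbitrary, and `docs` is a list of documents as in the examples above):

```python
from sklearn.feature_extraction.text import CountVectorizer
from keybert import KeyBERT

# One vectorizer object, reused for both calls so the vocabularies match exactly
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english")

kw_model = KeyBERT()
doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs, vectorizer=vectorizer)
keywords = kw_model.extract_keywords(docs, vectorizer=vectorizer,
                                     doc_embeddings=doc_embeddings,
                                     word_embeddings=word_embeddings)
```

If you set the individual parameters instead, they simply need to take identical values in both calls.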
+ +In other words, the following will work as they use the same parameter subset: + +```python +from keybert import KeyBERT + +kw_model = KeyBERT() +doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs, min_df=1, stop_words="english") +keywords = kw_model.extract_keywords(docs, min_df=1, stop_words="english", + doc_embeddings=doc_embeddings, + word_embeddings=word_embeddings) +``` + +The following, however, will throw an error since we did not use the same values for `min_df` and `stop_words`: + +```python +from keybert import KeyBERT + +kw_model = KeyBERT() +doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs, min_df=3, stop_words="dutch") +keywords = kw_model.extract_keywords(docs, min_df=1, stop_words="english", + doc_embeddings=doc_embeddings, + word_embeddings=word_embeddings) ``` diff --git a/docs/images/guided.svg b/docs/images/guided.svg new file mode 100644 index 00000000..ee6e65b0 --- /dev/null +++ b/docs/images/guided.svg @@ -0,0 +1,16 @@ + + + + + + + Input DocumentTokenize WordsEmbed TokensExtract EmbeddingsAverage seed keyword and document embeddingsCalculateCosine SimilarityMost microbats use echolocationto navigate and find food.Most microbats...sonarmostmicrobatsuse echolocationtonavigate andfindfood0.110.550.320.28................0.720.960.490.34mostfoodMost microbats...mostfood.......08.73We use the CountVectorizer from Scikit-Learn to tokenize our document into candidate kewords/keyphrases.We embed the seeded keywords (e.g., the word “sonar”) and calculate a weighted average with the document embedding (1:3). We calculate the cosine similarity between all candidate keywords and the input document. The keywords that have the largest similarity to the document are extracted. \ No newline at end of file diff --git a/docs/images/pipeline.svg b/docs/images/pipeline.svg new file mode 100644 index 00000000..b93e4241 --- /dev/null +++ b/docs/images/pipeline.svg @@ -0,0 +1,16 @@ + + + + + + + Input DocumentTokenize WordsEmbed TokensExtract EmbeddingsEmbed DocumentCalculateCosine SimilarityMost microbats use echolocationto navigate and find food.Most microbats use echolocationto navigate and find food.mostmicrobatsuse echolocationtonavigate andfindfood0.110.550.28............0.720.960.34mostfoodMost microbats...mostfood.......08.73We use the CountVectorizer from Scikit-Learn to tokenize our document into candidate kewords/keyphrases.We can use any language model that can embed both documents and keywords, like sentence-transformers.We calculate the cosine similarity between all candidate keywords and the input document. The keywords that have the largest similarity to the document are extracted. 
\ No newline at end of file diff --git a/docs/index.md b/docs/index.md index 332a8926..0af611bc 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,3 +1,8 @@ +--- +hide: + - navigation +--- + # **KeyBERT** diff --git a/docs/stylesheets/extra.css b/docs/stylesheets/extra.css index 4cf2bd22..54661441 100644 --- a/docs/stylesheets/extra.css +++ b/docs/stylesheets/extra.css @@ -5,3 +5,15 @@ :root>* { --md-typeset-a-color: #0277BD; } + +body[data-md-color-primary="black"] .excalidraw svg { + filter: invert(100%) hue-rotate(180deg); +} + +body[data-md-color-primary="black"] .excalidraw svg rect { + fill: transparent; +} + +.excalidraw { + text-align: center; +} diff --git a/keybert/__init__.py b/keybert/__init__.py index 00697950..048b901d 100644 --- a/keybert/__init__.py +++ b/keybert/__init__.py @@ -1,3 +1,3 @@ from keybert._model import KeyBERT -__version__ = "0.6.0" +__version__ = "0.7.0" diff --git a/keybert/_model.py b/keybert/_model.py index 8a386606..999c73fa 100644 --- a/keybert/_model.py +++ b/keybert/_model.py @@ -30,6 +30,10 @@ class KeyBERT: The most similar words could then be identified as the words that best describe the entire document. + +
+ --8<-- "docs/images/pipeline.svg" +
""" def __init__(self, model="all-MiniLM-L6-v2"): @@ -65,6 +69,8 @@ def extract_keywords( vectorizer: CountVectorizer = None, highlight: bool = False, seed_keywords: List[str] = None, + doc_embeddings: np.array = None, + word_embeddings: np.array = None, ) -> Union[List[Tuple[str, float]], List[List[Tuple[str, float]]]]: """Extract keywords and/or keyphrases @@ -97,6 +103,12 @@ def extract_keywords( NOTE: This does not work if multiple documents are passed. seed_keywords: Seed keywords that may guide the extraction of keywords by steering the similarities towards the seeded keywords. + doc_embeddings: The embeddings of each document. + word_embeddings: The embeddings of each potential keyword/keyphrase across + across the vocabulary of the set of input documents. + NOTE: The `word_embeddings` should be generated through + `.extract_embeddings` as the order of these embeddings depend + on the vectorizer that was used to generate its vocabulary. Returns: keywords: The top n keywords for a document with their respective distances @@ -113,8 +125,7 @@ def extract_keywords( keywords = kw_model.extract_keywords(doc) ``` - To extract keywords from multiple documents, - which is typically quite a bit faster: + To extract keywords from multiple documents, which is typically quite a bit faster: ```python from keybert import KeyBERT @@ -152,9 +163,21 @@ def extract_keywords( words = count.get_feature_names() df = count.transform(docs) + # Check if the right number of word embeddings are generated compared with the vectorizer + if word_embeddings is not None: + if word_embeddings.shape[0] != len(words): + raise ValueError("Make sure that the `word_embeddings` are generated from the function " + "`.extract_embeddings`. \nMoreover, the `candidates`, `keyphrase_ngram_range`," + "`stop_words`, and `min_df` parameters need to have the same values in both " + "`.extract_embeddings` and `.extract_keywords`.") + # Extract embeddings - doc_embeddings = self.model.embed(docs) - word_embeddings = self.model.embed(words) + if doc_embeddings is None: + doc_embeddings = self.model.embed(docs) + if word_embeddings is None: + word_embeddings = self.model.embed(words) + if seed_keywords is not None: + seed_embeddings = self.model.embed([" ".join(seed_keywords)]) # Find keywords all_keywords = [] @@ -169,7 +192,6 @@ def extract_keywords( # Guided KeyBERT with seed keywords if seed_keywords is not None: - seed_embeddings = self.model.embed([" ".join(seed_keywords)]) doc_embedding = np.average( [doc_embedding, seed_embeddings], axis=0, weights=[3, 1] ) @@ -215,3 +237,92 @@ def extract_keywords( all_keywords = all_keywords[0] return all_keywords + + def extract_embeddings( + self, + docs: Union[str, List[str]], + candidates: List[str] = None, + keyphrase_ngram_range: Tuple[int, int] = (1, 1), + stop_words: Union[str, List[str]] = "english", + min_df: int = 1, + vectorizer: CountVectorizer = None + ) -> Union[List[Tuple[str, float]], List[List[Tuple[str, float]]]]: + """Extract document and word embeddings for the input documents and the + generated candidate keywords/keyphrases respectively. + + Note that all potential keywords/keyphrases are not returned but only their + word embeddings. This means that the values of `candidates`, `keyphrase_ngram_range`, + `stop_words`, and `min_df` need to be the same between using `.extract_embeddings` and + `.extract_keywords`. 
+ + Arguments: + docs: The document(s) for which to extract keywords/keyphrases + candidates: Candidate keywords/keyphrases to use instead of extracting them from the document(s) + NOTE: This is not used if you passed a `vectorizer`. + keyphrase_ngram_range: Length, in words, of the extracted keywords/keyphrases. + NOTE: This is not used if you passed a `vectorizer`. + stop_words: Stopwords to remove from the document. + NOTE: This is not used if you passed a `vectorizer`. + min_df: Minimum document frequency of a word across all documents + if keywords for multiple documents need to be extracted. + NOTE: This is not used if you passed a `vectorizer`. + vectorizer: Pass in your own `CountVectorizer` from + `sklearn.feature_extraction.text.CountVectorizer` + + Returns: + doc_embeddings: The embeddings of each document. + word_embeddings: The embeddings of each potential keyword/keyphrase across + across the vocabulary of the set of input documents. + NOTE: The `word_embeddings` should be generated through + `.extract_embeddings` as the order of these embeddings depend + on the vectorizer that was used to generate its vocabulary. + + Usage: + + To generate the word and document embeddings from a set of documents: + + ```python + from keybert import KeyBERT + + kw_model = KeyBERT() + doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs) + ``` + + You can then use these embeddings and pass them to `.extract_keywords` to speed up the tuning the model: + + ```python + keywords = kw_model.extract_keywords(docs, doc_embeddings=doc_embeddings, word_embeddings=word_embeddings) + ``` + """ + # Check for a single, empty document + if isinstance(docs, str): + if docs: + docs = [docs] + else: + return [] + + # Extract potential words using a vectorizer / tokenizer + if vectorizer: + count = vectorizer.fit(docs) + else: + try: + count = CountVectorizer( + ngram_range=keyphrase_ngram_range, + stop_words=stop_words, + min_df=min_df, + vocabulary=candidates, + ).fit(docs) + except ValueError: + return [] + + # Scikit-Learn Deprecation: get_feature_names is deprecated in 1.0 + # and will be removed in 1.2. Please use get_feature_names_out instead. 
+ if version.parse(sklearn_version) >= version.parse("1.0.0"): + words = count.get_feature_names_out() + else: + words = count.get_feature_names() + + doc_embeddings = self.model.embed(docs) + word_embeddings = self.model.embed(words) + + return doc_embeddings, word_embeddings diff --git a/keybert/backend/_gensim.py b/keybert/backend/_gensim.py index 0b450ada..13e81dc8 100644 --- a/keybert/backend/_gensim.py +++ b/keybert/backend/_gensim.py @@ -1,7 +1,9 @@ import numpy as np from tqdm import tqdm from typing import List +from packaging import version from keybert.backend import BaseEmbedder +from gensim import __version__ as gensim_version from gensim.models.keyedvectors import Word2VecKeyedVectors @@ -49,9 +51,13 @@ def embed(self, documents: List[str], verbose: bool = False) -> np.ndarray: Document/words embeddings with shape (n, m) with `n` documents/words that each have an embeddings size of `m` """ - vector_shape = self.embedding_model.word_vec( - list(self.embedding_model.vocab.keys())[0] - ).shape + if version.parse(gensim_version) >= version.parse("4.0.0"): + get_vector = self.embedding_model.get_vector + vector_shape = get_vector(self.embedding_model.index_to_key[0]).shape + else: + get_vector = self.embedding_model.word_vec + vector_shape = get_vector(list(self.embedding_model.vocab.keys())[0]).shape + empty_vector = np.zeros(vector_shape[0]) embeddings = [] @@ -61,7 +67,7 @@ def embed(self, documents: List[str], verbose: bool = False) -> np.ndarray: # Extract word embeddings for word in doc.split(" "): try: - word_embedding = self.embedding_model.word_vec(word) + word_embedding = get_vector(word) doc_embedding.append(word_embedding) except KeyError: doc_embedding.append(empty_vector) diff --git a/mkdocs.yml b/mkdocs.yml index d1a1bac4..9135d529 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -7,6 +7,7 @@ use_directory_urls: false extra_css: - stylesheets/extra.css + nav: - Home: index.md - Guides: @@ -45,18 +46,23 @@ theme: - navigation.tracking - toc.follow palette: - - scheme: black - toggle: - icon: material/weather-sunny - name: Switch to dark mode - - scheme: slate - toggle: - icon: material/weather-night - name: Switch to light mode + - media: "(prefers-color-scheme: light)" + scheme: black + toggle: + icon: material/weather-sunny + name: Switch to dark mode + - media: "(prefers-color-scheme: dark)" + scheme: slate + primary: black + toggle: + icon: material/weather-night + name: Switch to light mode + markdown_extensions: - admonition - pymdownx.details - pymdownx.highlight - pymdownx.superfences - - toc: + - pymdownx.snippets + - toc: permalink: true diff --git a/setup.py b/setup.py index 5175f662..3b6789e0 100644 --- a/setup.py +++ b/setup.py @@ -37,7 +37,7 @@ setup( name="keybert", packages=find_packages(exclude=["notebooks", "docs"]), - version="0.6.0", + version="0.7.0", author="Maarten Grootendorst", author_email="maartengrootendorst@gmail.com", description="KeyBERT performs keyword extraction with state-of-the-art transformer models.", diff --git a/tests/test_model.py b/tests/test_model.py index 4db6380c..539e512e 100644 --- a/tests/test_model.py +++ b/tests/test_model.py @@ -1,9 +1,13 @@ import pytest -from .utils import get_test_data -from sklearn.feature_extraction.text import CountVectorizer from keybert import KeyBERT +from sklearn.datasets import fetch_20newsgroups +from sklearn.feature_extraction.text import CountVectorizer + +from .utils import get_test_data + doc_one, doc_two = get_test_data() +docs = fetch_20newsgroups(subset='test', remove=('headers', 
'footers', 'quotes'))['data'] model = KeyBERT(model="all-MiniLM-L6-v2") @@ -39,42 +43,65 @@ def test_single_doc(keyphrase_length, vectorizer): @pytest.mark.parametrize( "vectorizer", [None, CountVectorizer(ngram_range=(1, 1), stop_words="english")] ) -def test_extract_keywords_single_doc(keyphrase_length, mmr, maxsum, vectorizer): +@pytest.mark.parametrize( + "candidates", [None, ["praise"]] +) +@pytest.mark.parametrize( + "seed_keywords", [None, ["time", "night", "day", "moment"]] +) +def test_extract_keywords_single_doc(keyphrase_length, mmr, maxsum, vectorizer, candidates, seed_keywords): """Test extraction of protected single document method""" top_n = 5 keywords = model.extract_keywords( doc_one, top_n=top_n, + candidates=candidates, keyphrase_ngram_range=keyphrase_length, + seed_keywords=seed_keywords, use_mmr=mmr, use_maxsum=maxsum, diversity=0.5, vectorizer=vectorizer, ) assert isinstance(keywords, list) - assert isinstance(keywords[0][0], str) - assert isinstance(keywords[0][1], float) - assert len(keywords) == top_n + if not candidates: + assert isinstance(keywords[0][0], str) + assert isinstance(keywords[0][1], float) + assert len(keywords) == top_n for keyword in keywords: assert len(keyword[0].split(" ")) <= keyphrase_length[1] + if candidates and keyphrase_length[1] == 1 and not vectorizer and not maxsum: + assert keywords[0][0] == candidates[0] + @pytest.mark.parametrize("keyphrase_length", [(1, i + 1) for i in range(5)]) -def test_extract_keywords_multiple_docs(keyphrase_length): - """Test extractino of protected multiple document method""" +@pytest.mark.parametrize( + "candidates", [None, ["praise"]] +) +def test_extract_keywords_multiple_docs(keyphrase_length, candidates): + """Test extraction of protected multiple document method""" top_n = 5 keywords_list = model.extract_keywords( - [doc_one, doc_two], top_n=top_n, keyphrase_ngram_range=keyphrase_length + [doc_one, doc_two], + top_n=top_n, + keyphrase_ngram_range=keyphrase_length, + candidates=candidates ) assert isinstance(keywords_list, list) assert isinstance(keywords_list[0], list) assert len(keywords_list) == 2 - for keywords in keywords_list: - assert len(keywords) == top_n + if not candidates: + for keywords in keywords_list: + assert len(keywords) == top_n + + for keyword in keywords: + assert len(keyword[0].split(" ")) <= keyphrase_length[1] - for keyword in keywords: - assert len(keyword[0].split(" ")) <= keyphrase_length[1] + if candidates and keyphrase_length[1] == 1: + assert keywords_list[0][0][0] == candidates[0] + assert len(keywords_list[1]) == 0 def test_guided(): @@ -98,3 +125,29 @@ def test_empty_doc(): result = model.extract_keywords(doc) assert result == [] + + +def test_extract_embeddings(): + """Test extracting embeddings and testing out different parameters""" + n_docs = 50 + doc_embeddings, word_embeddings = model.extract_embeddings(docs[:n_docs]) + keywords_fast = model.extract_keywords( + docs[:n_docs], + doc_embeddings=doc_embeddings, + word_embeddings=word_embeddings + ) + keywords_slow = model.extract_keywords(docs[:n_docs]) + + assert doc_embeddings.shape[1] == word_embeddings.shape[1] + assert doc_embeddings.shape[0] == n_docs + assert keywords_fast == keywords_slow + + # When we use `min_df=3` to extract the keywords, this should give an error since + # this value was not used when extracting the embeddings and should be the same. 
+ with pytest.raises(ValueError): + _ = model.extract_keywords( + docs[:n_docs], + doc_embeddings=doc_embeddings, + word_embeddings=word_embeddings, + min_df=3 + )
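Finally, the Gensim backend change above fixes how word vectors are looked up for gensim >= 4.0. As a quick usage sketch of that backend: the pre-trained vector set below is only an example (and a large download), and it assumes that pre-trained gensim `KeyedVectors` can be handed directly to the `KeyBERT` constructor:

```python
import gensim.downloader as api
from keybert import KeyBERT

# Pre-trained word vectors; gensim >= 4.0 returns a KeyedVectors object
# (assumption: KeyedVectors are accepted directly by the KeyBERT constructor)
ft = api.load("fasttext-wiki-news-subwords-300")

kw_model = KeyBERT(model=ft)
keywords = kw_model.extract_keywords(
    "Supervised learning is the machine learning task of learning a function "
    "that maps an input to an output based on example input-output pairs."
)
```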