* Use paraphrase-MiniLM-L6-v2 as the default embedding model
* Highlight a document's keywords
* Added FAQ
Showing 18 changed files with 242 additions and 83 deletions.
@@ -0,0 +1,20 @@
## **Which embedding model works best for which language?**
Unfortunately, there is no definitive list of the best models for each language; this depends highly
on your data, the model, and your specific use case. However, the default model in KeyBERT
(`"paraphrase-MiniLM-L6-v2"`) works great for **English** documents. In contrast, for **multilingual**
documents or any other language, `"paraphrase-multilingual-MiniLM-L12-v2"` has shown great performance.

If you want to use a model that provides higher quality but takes more compute time, then I would advise using `paraphrase-mpnet-base-v2` or `paraphrase-multilingual-mpnet-base-v2` instead.
## **Should I preprocess the data?**
No. By using document embeddings there is typically no need to preprocess the data, as all parts of a document
are important for understanding its general topic. Although this holds true in 99% of cases, if you
have data that contains a lot of noise, for example, HTML tags, then it would be best to remove them. HTML tags
typically do not contribute to the meaning of a document and should therefore be removed. However, if you apply
topic modeling to HTML code to extract topics from that code, then they become important.
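As a rough illustration of the HTML-tag case above, a minimal preprocessing step could look like the following. This is a sketch using only Python's standard library, not part of KeyBERT, and the function name `strip_html_tags` is hypothetical; for messy real-world HTML an actual parser is more robust than a regex.

```python
import re

def strip_html_tags(doc: str) -> str:
    """Remove HTML tags and collapse the leftover whitespace.

    A deliberately simple regex-based sketch; for production use,
    prefer a real HTML parser (e.g. html.parser or BeautifulSoup).
    """
    text = re.sub(r"<[^>]+>", " ", doc)       # drop anything that looks like a tag
    return re.sub(r"\s+", " ", text).strip()  # normalize whitespace

print(strip_html_tags("<p>Supervised <b>learning</b> is great.</p>"))
# → Supervised learning is great.
```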
## **Can I use the GPU to speed up the model?**
Yes! Since KeyBERT uses embeddings as its backend, a GPU is actually preferred when using this package.
Although it is possible to use it without a dedicated GPU, the inference speed will be significantly slower.
@@ -1,3 +1,3 @@
-from keybert.model import KeyBERT
+from keybert._model import KeyBERT

-__version__ = "0.3.0"
+__version__ = "0.4.0"
@@ -0,0 +1,96 @@
import re
from typing import List, Tuple

from rich.console import Console
from rich.highlighter import RegexHighlighter


class NullHighlighter(RegexHighlighter):
    """Highlighter that deliberately matches nothing, so that only the
    explicit keyword tags added below are rendered by rich."""

    base_style = ""
    highlights = [r""]


def highlight_document(doc: str,
                       keywords: List[Tuple[str, float]]):
    """Highlight keywords in a document

    Arguments:
        doc: The document for which to extract keywords/keyphrases
        keywords: The top n keywords for a document with their respective
                  distances to the input document

    Returns:
        highlighted_text: The document with additional tags to highlight keywords
                          according to the rich package
    """
    keywords_only = [keyword for keyword, _ in keywords]
    max_len = max([len(token.split(" ")) for token in keywords_only])

    if max_len == 1:
        highlighted_text = _highlight_one_gram(doc, keywords_only)
    else:
        highlighted_text = _highlight_n_gram(doc, keywords_only)

    console = Console(highlighter=NullHighlighter())
    console.print(highlighted_text)


def _highlight_one_gram(doc: str,
                        keywords: List[str]) -> str:
    """Highlight 1-gram keywords in a document

    Arguments:
        doc: The document for which to extract keywords/keyphrases
        keywords: The top n keywords for a document

    Returns:
        highlighted_text: The document with additional tags to highlight keywords
                          according to the rich package
    """
    tokens = re.sub(r' +', ' ', doc.replace("\n", " ")).split(" ")

    highlighted_text = " ".join([f"[black on #FFFF00]{token}[/]"
                                 if token.lower() in keywords
                                 else f"{token}"
                                 for token in tokens]).strip()
    return highlighted_text


def _highlight_n_gram(doc: str,
                      keywords: List[str]) -> str:
    """Highlight n-gram keywords in a document

    Arguments:
        doc: The document for which to extract keywords/keyphrases
        keywords: The top n keywords for a document

    Returns:
        highlighted_text: The document with additional tags to highlight keywords
                          according to the rich package
    """
    max_len = max([len(token.split(" ")) for token in keywords])
    tokens = re.sub(r' +', ' ', doc.replace("\n", " ")).strip().split(" ")
    n_gram_tokens = [[" ".join(tokens[i: i + max_len][0: j + 1]) for j in range(max_len)]
                     for i, _ in enumerate(tokens)]
    highlighted_text = []
    skip = False

    for n_grams in n_gram_tokens:
        candidate = False

        if not skip:
            for index, n_gram in enumerate(n_grams):

                if n_gram.lower() in keywords:
                    candidate = f"[black on #FFFF00]{n_gram}[/]" + n_grams[-1].split(n_gram)[-1]
                    skip = index + 1

            if not candidate:
                candidate = n_grams[0]

            highlighted_text.append(candidate)

        else:
            skip = skip - 1
    highlighted_text = " ".join(highlighted_text)
    return highlighted_text
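The 1-gram path above boils down to token-by-token tag wrapping. A dependency-free sketch of that same idea (the tag string is rich console markup; `highlight_one_gram` here is an illustrative stand-in, not the module's private function):

```python
import re
from typing import List

def highlight_one_gram(doc: str, keywords: List[str]) -> str:
    """Wrap every token that matches a keyword in rich-style markup,
    mirroring the logic of _highlight_one_gram."""
    # Normalize newlines and repeated spaces, then split into tokens
    tokens = re.sub(r" +", " ", doc.replace("\n", " ")).split(" ")
    return " ".join(
        f"[black on #FFFF00]{token}[/]" if token.lower() in keywords else token
        for token in tokens
    ).strip()

print(highlight_one_gram("Supervised learning is a task", ["supervised", "learning"]))
# → [black on #FFFF00]Supervised[/] [black on #FFFF00]learning[/] is a task
```

The n-gram path is trickier only because a matched phrase spans several tokens, which is what the `skip` counter handles: after tagging an (index + 1)-token match, the loop skips the tokens that the match already consumed.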
File renamed without changes.
File renamed without changes.