Feature/matcher alignment #7319 (Closed)
broaddeep wants to merge 136 commits into explosion:develop from broaddeep:feature/matcher-alignment
Conversation
Commits

- Co-authored-by: tewodrosm <tedmaam2006@gmail.com>
- Update srsly pin
- Remove `nlp.tokenizer` from quickstart template so that the default language-specific tokenizer settings are filled instead.
- …nizers [ci skip] Remove nlp.tokenizer from quickstart template
- Link components across enabled, resumed and frozen; revert renaming; revert renaming, the sequel
- Add capture argument to project_run and run_commands; git bump to 3.0.1; set version to 3.0.1.dev0 (Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>)
- EL set_kb docs fix; custom warning for set_kb mistake
- Fix class variable and init for `UkrainianLemmatizer` so that it loads the `uk` dictionaries rather than having the parent `RussianLemmatizer` override with the `ru` settings.
- Fix NEL config and IO, and n_sents functionality; add docs; fix test
- Import SpanGroup from tokens module; revert edits from different PR; add to `__all__`
- Add tip about --gpu-id to training quickstart
- …matter Add time and level to default logging formatter
- Set the `include_dirs` in each `Extension` rather than in `setup()` to handle the case where there is a custom `distutils.cfg` that modifies the include paths, in particular for Python from Homebrew.
- Reapply the refactoring (#4721) so that `Sentencizer` uses the faster `predict` and `set_annotations` for both `__call__` and `pipe`.
- Check `component_names` instead of `pipe_names` to allow sourcing disabled components.
- import wandb failure - UX
- Fix formatting in bg/bn quickstart recs
- has_annotation docs fix
- Extend docs related to Vocab.get_noun_chunks
- kb.get_candidates renamed to get_alias_candidates
- Set include_dirs in Extension
- Failing unit test; ensure that doc.spans refers to the copied doc, not the old; add type info
Description

Support for match alignments.

Many users have wanted the rule-based matcher to support subgroup labeling (#3275), group capture (#4642), and regex-style look-around operators (#6420). However, as the discussion in #3275 shows, this has not been an easy task.

To address this, I propose the concept of match alignments: for each matched token, we record which part of the token pattern contributed to the match.

For example, suppose we have the pattern `[{"ORTH": "a", "OP": "+"}, {"ORTH": "b"}]` and the text `a a a b`. The matched span will contain four tokens (in the longest greedy setup). We can easily verify that the first three matched tokens (`a a a`) were matched by the first token pattern (`{"ORTH": "a", "OP": "+"}`), and the last token (`b`) was matched by the second token pattern (`{"ORTH": "b"}`). We can write this as a `List[int]`: `[0, 0, 0, 1]`. Using this information, the matcher can achieve the same effect as group capture, look-around operators, or subgroup labeling.
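The matching-with-alignment idea above can be sketched in plain Python. This is a toy illustration only, assuming a greedy `+` operator and exact `ORTH` matching at position 0; it is not the spaCy Matcher implementation:

```python
def match_with_alignment(tokens, pattern):
    """Greedy-longest match of a simple ORTH/OP pattern against the
    start of `tokens`.

    Returns (matched_tokens, alignment), where alignment[i] is the
    index of the token pattern that matched tokens[i], or None if the
    pattern does not match. Toy sketch, not the spaCy Matcher.
    """
    i = 0
    alignment = []
    for p_idx, p in enumerate(pattern):
        op = p.get("OP", "1")
        if op == "+":
            # Consume one or more matching tokens greedily.
            count = 0
            while i < len(tokens) and tokens[i] == p["ORTH"]:
                alignment.append(p_idx)
                i += 1
                count += 1
            if count == 0:
                return None
        else:
            # Exactly one token must match this pattern element.
            if i < len(tokens) and tokens[i] == p["ORTH"]:
                alignment.append(p_idx)
                i += 1
            else:
                return None
    return tokens[:i], alignment


pattern = [{"ORTH": "a", "OP": "+"}, {"ORTH": "b"}]
span, alignment = match_with_alignment(["a", "a", "a", "b"], pattern)
# alignment is [0, 0, 0, 1]: three tokens from pattern 0, one from pattern 1
```

Running this on the example text reproduces the `[0, 0, 0, 1]` alignment described above.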
Implementation details
API
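As a hedged sketch of how downstream code might consume such an alignment list to recover per-pattern subgroups (the helper below is illustrative only and not part of spaCy's API):

```python
from collections import defaultdict


def subgroups(matched_tokens, alignment):
    """Group matched tokens by the index of the token pattern that
    matched them; alignment[i] is the pattern index for token i.
    Illustrative helper, not part of the spaCy API.
    """
    groups = defaultdict(list)
    for token, p_idx in zip(matched_tokens, alignment):
        groups[p_idx].append(token)
    return dict(groups)


# With the alignment [0, 0, 0, 1] from the example above, the tokens
# split into one subgroup per token pattern.
groups = subgroups(["a", "a", "a", "b"], [0, 0, 0, 1])
# groups == {0: ["a", "a", "a"], 1: ["b"]}
```

This is the sense in which an alignment list subsumes group capture: each pattern index acts as an implicit capture group.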
Types of change
New feature
Checklist