Feature/matcher alignment #7319

Closed
wants to merge 136 commits into from

Conversation

broaddeep
Contributor

@broaddeep broaddeep commented Mar 6, 2021

Description

Support for match alignments.

Many users have wanted the rule-based Matcher to support subgroup labeling (#3275), group capture (#4642), and regex-style look-around operators (#6420).
However, as the discussion in #3275 shows, this hasn't been an easy task.

To address these requests, I propose the concept of match alignments.
For each token in a matched span, the alignment records which token pattern contributed to the match.

For example, suppose we have the pattern
[{"ORTH": "a", "OP": "+"}, {"ORTH": "b"}]
and the text is given as
a a a b

The matched span will have four tokens (in the longest greedy setup).
We can easily verify that the first three matched tokens (a a a) were matched by the first token pattern ({"ORTH": "a", "OP": "+"}),
and the last token (b) was matched by the second token pattern ({"ORTH": "b"}).

We can encode this as a List[int]: [0, 0, 0, 1].

Using this information, users can achieve the same effect as group capture, look-around operators, or subgroup labeling.
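To make the idea concrete, here is a minimal, pure-Python sketch (not part of this PR) of how an alignment list like [0, 0, 0, 1] groups matched tokens by the token pattern that matched them; the helper name group_by_pattern is hypothetical:

```python
def group_by_pattern(tokens, alignments):
    """Group matched tokens by the index of the token pattern that matched them."""
    groups = {}
    for tok, pat_idx in zip(tokens, alignments):
        groups.setdefault(pat_idx, []).append(tok)
    return groups

print(group_by_pattern(["a", "a", "a", "b"], [0, 0, 0, 1]))
# {0: ['a', 'a', 'a'], 1: ['b']}
```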

Implementation details

  • Each time the matcher's state changes, it records the index of the token pattern active at that point and the length of the span so far.
  • See be3a664
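As a rough illustration only (the real implementation is the Cython state machine in the commit above), a naive matcher for the two-pattern example could record the active pattern index as it consumes each token:

```python
def match_with_alignment(tokens):
    """Naively match [{"ORTH": "a", "OP": "+"}, {"ORTH": "b"}] at position 0,
    recording which token-pattern index consumed each token."""
    alignment = []
    i = 0
    if i >= len(tokens) or tokens[i] != "a":
        return None                # first pattern requires at least one "a"
    while i < len(tokens) and tokens[i] == "a":
        alignment.append(0)        # consumed by token pattern index 0
        i += 1
    if i < len(tokens) and tokens[i] == "b":
        alignment.append(1)        # consumed by token pattern index 1
        i += 1
        return (0, i, alignment)   # (start, end, alignment)
    return None

print(match_with_alignment("a a a b".split()))
# (0, 4, [0, 0, 0, 1])
```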

API

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')

pattern = [
    {'ENT_TYPE': 'PERSON', 'OP': '+'}, 
    {'LEMMA': 'love'}, 
    {'ENT_TYPE': 'PERSON', 'OP': '+'}
]

matcher = Matcher(nlp.vocab)
matcher.add("test", [pattern], greedy='LONGEST')

doc = nlp("John Doe loves Jane Doe. John loves Jane.")

matches = matcher(doc, match_alignments=True)

for m in matches:
    print(m)
# (1618900948208871284, 0, 5, [0, 0, 1, 2, 2])
# (1618900948208871284, 6, 9, [0, 1, 2])
  • It does not require breaking changes. (All tests passed.)
  • Test case added.
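To show how the returned alignments could be consumed downstream, here is a sketch of "group capture" over the output above. The match tuples are copied from the printed results; the whitespace tokenization is an assumption standing in for the spaCy Doc, and extract_groups is a hypothetical helper:

```python
# Stand-in for the tokens of the Doc in the example above.
tokens = "John Doe loves Jane Doe . John loves Jane .".split()

# Match tuples as printed by matcher(doc, match_alignments=True) above.
matches = [
    (1618900948208871284, 0, 5, [0, 0, 1, 2, 2]),
    (1618900948208871284, 6, 9, [0, 1, 2]),
]

def extract_groups(tokens, start, end, alignment, pattern_index):
    """Return the matched tokens contributed by the given token pattern."""
    span = tokens[start:end]
    return [tok for tok, a in zip(span, alignment) if a == pattern_index]

for _, start, end, alignment in matches:
    lover = extract_groups(tokens, start, end, alignment, 0)  # first PERSON pattern
    loved = extract_groups(tokens, start, end, alignment, 2)  # second PERSON pattern
    print(" ".join(lover), "->", " ".join(loved))
# John Doe -> Jane Doe
# John -> Jane
```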

Types of change

New feature

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@broaddeep broaddeep closed this Mar 6, 2021
@broaddeep broaddeep deleted the feature/matcher-alignment branch March 6, 2021 12:41