Feature/matcher alignment #7319

Closed
wants to merge 136 commits into from

Conversation

broaddeep
Contributor

@broaddeep broaddeep commented Mar 6, 2021

Description

Support for match alignments.

Many users have wanted the rule-based Matcher to support subgroup labeling (#3275), group capture (#4642), and regex-style look-around operators (#6420).
However, as the discussion in #3275 shows, this hasn't been an easy task.

To address these requests, I propose the concept of match alignments.
For each token in a matched span, the alignment records which token pattern contributed to the match.

For example, suppose we have the pattern
[{"ORTH": "a", "OP": "+"}, {"ORTH": "b"}]
and the text is given as
a a a b

The matched span will have four tokens (in the longest greedy setup).
We can easily verify that the first three matched tokens (a a a) were matched by the first token pattern ({"ORTH": "a", "OP": "+"}),
and the last token (b) was matched by the second token pattern ({"ORTH": "b"}).

We can encode this as a List[int]: [0, 0, 0, 1].

Using this information, users can achieve the same effect as group capture, look-around operators, or subgroup labeling.
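To make the idea concrete, here is a minimal, pure-Python sketch (not part of this PR) of how an alignment list like [0, 0, 0, 1] groups matched tokens by the token pattern that matched them; the helper name group_by_pattern is hypothetical:

```python
def group_by_pattern(tokens, alignments):
    """Group matched tokens by the index of the token pattern that matched them."""
    groups = {}
    for tok, pat_idx in zip(tokens, alignments):
        groups.setdefault(pat_idx, []).append(tok)
    return groups

print(group_by_pattern(["a", "a", "a", "b"], [0, 0, 0, 1]))
# {0: ['a', 'a', 'a'], 1: ['b']}
```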

Implementation details

  • Each time the matcher's state changes, it records the index of the token pattern active at that point and the length of the span so far.
  • See be3a664
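As a rough illustration only (the real implementation is the Cython state machine in the commit above), a naive matcher for the two-pattern example could record the active pattern index as it consumes each token:

```python
def match_with_alignment(tokens):
    """Naively match [{"ORTH": "a", "OP": "+"}, {"ORTH": "b"}] at position 0,
    recording which token-pattern index consumed each token."""
    alignment = []
    i = 0
    if i >= len(tokens) or tokens[i] != "a":
        return None                # first pattern requires at least one "a"
    while i < len(tokens) and tokens[i] == "a":
        alignment.append(0)        # consumed by token pattern index 0
        i += 1
    if i < len(tokens) and tokens[i] == "b":
        alignment.append(1)        # consumed by token pattern index 1
        i += 1
        return (0, i, alignment)   # (start, end, alignment)
    return None

print(match_with_alignment("a a a b".split()))
# (0, 4, [0, 0, 0, 1])
```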

API

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')

pattern = [
    {'ENT_TYPE': 'PERSON', 'OP': '+'}, 
    {'LEMMA': 'love'}, 
    {'ENT_TYPE': 'PERSON', 'OP': '+'}
]

matcher = Matcher(nlp.vocab)
matcher.add("test", [pattern], greedy='LONGEST')

doc = nlp("John Doe loves Jane Doe. John loves Jane.")

matches = matcher(doc, match_alignments=True)

for m in matches:
    print(m)
# (1618900948208871284, 0, 5, [0, 0, 1, 2, 2])
# (1618900948208871284, 6, 9, [0, 1, 2])
  • It does not require breaking changes. (All tests passed.)
  • Test case added.
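To show how the returned alignments could be consumed downstream, here is a sketch of "group capture" over the output above. The match tuples are copied from the printed results; the whitespace tokenization is an assumption standing in for the spaCy Doc, and extract_groups is a hypothetical helper:

```python
# Stand-in for the tokens of the Doc in the example above.
tokens = "John Doe loves Jane Doe . John loves Jane .".split()

# Match tuples as printed by matcher(doc, match_alignments=True) above.
matches = [
    (1618900948208871284, 0, 5, [0, 0, 1, 2, 2]),
    (1618900948208871284, 6, 9, [0, 1, 2]),
]

def extract_groups(tokens, start, end, alignment, pattern_index):
    """Return the matched tokens contributed by the given token pattern."""
    span = tokens[start:end]
    return [tok for tok, a in zip(span, alignment) if a == pattern_index]

for _, start, end, alignment in matches:
    lover = extract_groups(tokens, start, end, alignment, 0)  # first PERSON pattern
    loved = extract_groups(tokens, start, end, alignment, 2)  # second PERSON pattern
    print(" ".join(lover), "->", " ".join(loved))
# John Doe -> Jane Doe
# John -> Jane
```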

Types of change

New feature

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@broaddeep broaddeep closed this Mar 6, 2021
@broaddeep broaddeep deleted the feature/matcher-alignment branch March 6, 2021 12:41