Lookaround operators on Matcher patterns #6420
Labels: enhancement, feat / matcher, help wanted
Feature description
The Matcher supports the `!`, `?`, `+`, and `*` operators and quantifiers. I have text where it would be useful to have something like the regex lookaround patterns, where a pattern should or should not be matched but is not included in the matched range.

For example, consider the following text.
I want to create patterns for `AB CD site` and `XY site` and label them as source and destination spans. The `from` and `to` tokens are needed to distinguish between `AB CD site` and `XY site`, but should not be part of the match.

The first match spans the tokens for `from AB CD site`. I want just `AB CD site` back as the match. The same applies to the second match.

Proposal
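The Matcher should support the following new ops, roughly based on their regex counterparts.
`?=`: positive lookaround; the token pattern must match, but the token is excluded from the match range.
`?!`: negative lookaround; the token pattern must not match, and the token is excluded from the match range.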
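
For reference, here is a small sketch of how the regex counterparts behave in Python's `re` module. The sample sentence is an assumption made up for illustration. Note that regex spells lookbehind as `(?<=...)` and `(?<!...)`, while `(?=...)` and `(?!...)` are lookahead.

```python
import re

# Hypothetical sample sentence, used only for illustration.
text = "Move the data from AB CD site to XY site"

# The lookbehind requires "from " / "to " before the match,
# but the preposition itself is not part of the matched text.
source = re.search(r"(?<=from )[A-Z ]+ site", text)
destination = re.search(r"(?<=to )[A-Z ]+ site", text)

print(source.group())       # AB CD site
print(destination.group())  # XY site
```

The ops proposed here would give the same effect at the token level rather than the character level.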
Zero or more lookaround operators can be used at the start and at the end of the pattern. A lookaround operator cannot be surrounded on both sides by non-lookaround operators in a pattern.
While there is a distinction between lookahead and lookbehind in regex, these operators are just positive/negative matchers that are not included in the result.
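
To make the intended semantics concrete, here is a rough pure-Python sketch that emulates them on a plain list of token strings. It is not spaCy code and not a suggested implementation: `find_spans`, the `(predicate, op)` pattern format, and the sample sentence are all hypothetical, and only the `"1"` (exactly one) and `"+"` quantifiers are handled in the non-lookaround part of the pattern.

```python
LOOKAROUND_OPS = {"?=", "?!"}  # the ops proposed in this issue


def _split(pattern):
    """Split a pattern into leading lookarounds, core steps, trailing lookarounds."""
    start = 0
    while start < len(pattern) and pattern[start][1] in LOOKAROUND_OPS:
        start += 1
    end = len(pattern)
    while end > start and pattern[end - 1][1] in LOOKAROUND_OPS:
        end -= 1
    core = pattern[start:end]
    if not core:
        raise ValueError("pattern needs at least one non-lookaround step")
    if any(op in LOOKAROUND_OPS for _, op in core):
        raise ValueError("lookaround ops are only allowed at the start or end")
    return pattern[:start], core, pattern[end:]


def _lookaround_ok(tokens, steps, first):
    """Check lookaround steps against the tokens starting at index `first`."""
    for offset, (test, op) in enumerate(steps):
        pos = first + offset
        present = 0 <= pos < len(tokens) and test(tokens[pos])
        if present != (op == "?="):
            return False
    return True


def _core_ends(tokens, core, i):
    """Yield possible end indices for `core` started at token i, longest first."""
    if not core:
        yield i
        return
    (test, op), rest = core[0], core[1:]
    if i >= len(tokens) or not test(tokens[i]):
        return
    if op == "+":
        j = i + 1
        while j < len(tokens) and test(tokens[j]):
            j += 1
        for end in range(j, i, -1):  # greedy, then backtrack
            yield from _core_ends(tokens, rest, end)
    else:  # "1": exactly one token
        yield from _core_ends(tokens, rest, i + 1)


def find_spans(tokens, pattern):
    """Return (start, end) spans that exclude the lookaround tokens."""
    lead, core, trail = _split(pattern)
    spans = []
    for start in range(len(tokens)):
        if not _lookaround_ok(tokens, lead, start - len(lead)):
            continue
        for end in _core_ends(tokens, core, start):
            if _lookaround_ok(tokens, trail, end):
                spans.append((start, end))
                break  # keep only the longest match per start
    return spans


# Hypothetical sample sentence, used only for illustration.
tokens = "Move the data from AB CD site to XY site".split()
name = (str.isupper, "+")
site = (lambda t: t == "site", "1")
source = [(lambda t: t == "from", "?="), name, site]
destination = [(lambda t: t == "to", "?="), name, site]

print([" ".join(tokens[s:e]) for s, e in find_spans(tokens, source)])
print([" ".join(tokens[s:e]) for s, e in find_spans(tokens, destination)])
```

Unlike regex lookaround, which is zero-width, each lookaround step in this sketch lines up with one token position next to the match and is simply left out of the returned span.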
The `from` and `to` tokens are matched but are not part of the match range.

Could the feature be a custom component or spaCy plugin?
No.