tanloong/pytregex

Tregex is a Java program for identifying patterns in constituency trees. PyTregex provides similar functionality in Python.

Usage

Command-line

Install it with pip and run it with python -m pytregex.

$ pip install pytregex

$ echo '(NP(DT The)(NN battery)(NN plant))' | python -m pytregex pattern 'NP < NN' -filter
# (NP (DT The) (NN battery) (NN plant))
# (NP (DT The) (NN battery) (NN plant))
# There were 2 matches in total.

$ echo '(NP(DT The)(NN battery)(NN plant))' > trees.txt
$ python -m pytregex pattern 'NP < NN' ./trees.txt
# (NP (DT The) (NN battery) (NN plant))
# (NP (DT The) (NN battery) (NN plant))
# There were 2 matches in total.

$ python -m pytregex pattern 'NP < NN' -C ./trees.txt
# 2

$ python -m pytregex pattern 'NP < NN=a' -h a ./trees.txt
# (NN battery)
# (NN plant)
# There were 2 matches in total.

$ python -m pytregex explain '<'
# 'A < B' means A immediately dominates B

$ python -m pytregex pprint '(NP(DT The)(NN battery)(NN plant))'
# NP
# ├── DT
# │   └── The
# ├── NN
# │   └── battery
# └── NN
#     └── plant
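
The pattern 'NP < NN' reports two matches above even though the tree has a single NP node. One way to read this is that a match is counted for each (parent, child) pair that satisfies the relation. The following is a minimal, standard-library-only sketch of that reading, not PyTregex's implementation; `parse` and `immediately_dominates` are hypothetical helpers introduced only for illustration.

```python
import re

def parse(text):
    """Parse a bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1                      # consume "("
            label = tokens[pos]
            pos += 1
            children = []
            while tokens[pos] != ")":
                children.append(node())
            pos += 1                      # consume ")"
            return (label, children)
        label = tokens[pos]               # a bare token is a leaf
        pos += 1
        return (label, [])

    return node()

def immediately_dominates(tree, parent_label, child_label):
    """Yield (parent, child) for every pair where parent is directly above child."""
    label, children = tree
    for child in children:
        if label == parent_label and child[0] == child_label:
            yield tree, child
        yield from immediately_dominates(child, parent_label, child_label)

tree = parse("(NP(DT The)(NN battery)(NN plant))")
pairs = list(immediately_dominates(tree, "NP", "NN"))
print(len(pairs))  # one pair per NN child of the NP
```

Under this reading, the NP is printed once per matching NN child, which is consistent with the two identical lines in the command-line output.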

Inline

from pytregex.tregex import TregexPattern

p = TregexPattern("NP < NN=a")
matches = p.findall("(NP(DT The)(NN battery)(NN plant))")
handles = p.get_nodes("a")
print("matches nodes:\n{}\n".format("\n".join(str(m) for m in matches)))
print("named nodes:\n{}".format("\n".join(str(h) for h in handles)))

# Output:
# matches nodes:
# (NP (DT The) (NN battery) (NN plant))
# (NP (DT The) (NN battery) (NN plant))
#
# named nodes:
# (NN battery)
# (NN plant)

See tests for more examples.

Differences from Tregex

Tregex is whitespace-sensitive: it distinguishes between | and ␣|␣. PyTregex ignores whitespace and uses different symbols in place of ␣|␣.

Construct                Tregex        PyTregex
node disjunction         A|B           A|B, A␣|␣B
condition disjunction    A<B␣|␣<C      A<B␣||␣<C, A<B||<C
expression disjunction   A␣|␣B         N/A
expression separation    N/A           A;B, A␣;␣B

In the table above, expression disjunction and expression separation differ in whether "expressions stop evaluating as soon as the result is known." For example, in Tregex, NP=a | NNP=b does not assign b if NP matches, even if the tree contains an NNP. In PyTregex, NP=a ; NNP=b assigns b whenever an NNP is found, regardless of whether NP matches.
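
The difference between the two evaluation styles can be sketched with a toy model. This is only an illustration of short-circuit versus exhaustive evaluation, not how either Tregex or PyTregex is implemented; `find_label` and the flat `labels` list are hypothetical stand-ins for real tree matching.

```python
def find_label(labels, target, handle, handles):
    """Toy matcher: record the handle if target occurs among the tree's labels."""
    if target in labels:
        handles[handle] = target
        return True
    return False

labels = ["NP", "NNP"]                 # labels present in a toy tree
exprs = [("NP", "a"), ("NNP", "b")]    # NP=a and NNP=b

# Disjunction-style: stop as soon as one expression succeeds.
short_circuit = {}
any(find_label(labels, t, h, short_circuit) for t, h in exprs)

# Separation-style: evaluate every expression regardless of earlier results.
evaluate_all = {}
for t, h in exprs:
    find_label(labels, t, h, evaluate_all)

print(short_circuit)   # {'a': 'NP'}
print(evaluate_all)    # {'a': 'NP', 'b': 'NNP'}
```

In the short-circuit case b is never assigned because evaluation stops after NP succeeds; in the exhaustive case both handles are filled.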

Missing features

Backreferencing

$ tree='(NP NP , NP ,)'
$ pattern='(@NP <, (@NP $+ (/,/ $+ (@NP $+ /,/=comma))) <- =comma)' 

$ echo "$tree" | java edu.stanford.nlp.trees.tregex.TregexPattern "$pattern" -filter -s 2>/dev/null
# (NP NP , NP ,)

$ echo "$tree" | python -m pytregex pattern "$pattern" -filter
# (@NP <, (@NP $+ (/,/ $+ (@NP $+ /,/=comma))) <- =comma)
#                                                 ˄
# Parsing error at token '='

Headfinders

PyTregex currently has only one HeadFinder, which is for English. If your patterns target trees in other languages and contain <#, >#, <<#, or >>#, they may not work as expected.

Variable groups

$ tree='(SBAR (WHNP-11 (WP who)) (S (NP-SBJ (-NONE- *T*-11)) (VP (VBD resigned))))' 
$ pattern='@SBAR < /^WH.*-([0-9]+)$/#1%index << (__=empty < (/^-NONE-/ < /^\*T\*-([0-9]+)$/#1%index))' 

$ echo "$tree" | java edu.stanford.nlp.trees.tregex.TregexPattern "$pattern" -filter 2>/dev/null
# (SBAR
#   (WHNP-11 (WP who))
#   (S
#     (NP-SBJ (-NONE- *T*-11))
#     (VP (VBD resigned))))

$ echo "$tree" | python -m pytregex pattern "$pattern" -filter
# Tokenization error: Illegal character "#"
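
In the Tregex pattern above, #1%index binds capture group 1 of each regex to a variable named index, and the pattern matches only if both captures are the same string. The constraint itself can be illustrated with Python's re module. This is a rough sketch of what the variable group checks, not a workaround offered by PyTregex; `indices_agree` is a hypothetical helper that works on plain label strings rather than on trees.

```python
import re

WH_LABEL = re.compile(r"^WH.*-([0-9]+)$")
TRACE_LEAF = re.compile(r"^\*T\*-([0-9]+)$")

def indices_agree(wh_label, trace_leaf):
    """Return True if both strings match and share the same captured index."""
    wh = WH_LABEL.search(wh_label)
    trace = TRACE_LEAF.search(trace_leaf)
    # Both regexes must match, and capture group 1 must be the same string.
    return bool(wh and trace and wh.group(1) == trace.group(1))

print(indices_agree("WHNP-11", "*T*-11"))  # True
print(indices_agree("WHNP-11", "*T*-12"))  # False
```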

Acknowledgments

Thanks to Galen Andrew, Roger Levy, Anna Rafferty, and John Bauer for their work on Tregex. One-third of PyTregex's code is translated directly from Tregex.

This program uses David Beazley's PLY (Python Lex-Yacc) for pattern tokenization and parsing.

