Skip to content

Latest commit

 

History

History
140 lines (118 loc) · 3.38 KB

README.md

File metadata and controls

140 lines (118 loc) · 3.38 KB

trie-search

Trie-search is a package for text pattern search using marisa-trie.

Installation

$ pip install trie-search

Usage

Create trie dictionary

Before using this package, you need to create trie dictionary, or prepare a list of patterns.

The following example simply creates trie dictionary of marisa_trie.Trie from list of article titles in English version of Wikipedia, and saves it to ./example/triedict.

$ cd ./example
$ curl https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-all-titles-in-ns0.gz | gzcat | python create_triedict.py

NOTICE : This script will consume more than 2GB memory.

trie_search.TrieSearch

Create an instance, and load dictionary:

>>> import trie_search
>>> trie = trie_search.TrieSearch(filepath='./example/triedict')

If you have list or tuple object of patterns, you can create an instance as follows:

>>> patterns = ['pattern1', 'pattern2', 'pattern3']
>>> trie = trie_search.TrieSearch(patterns)

TrieSearch.search_all_patterns

Search all patterns in an input text:

>>> text = ('in computer science , a trie , also called digital tree and '
...         'sometimes radix tree or prefix tree ( as they can be searched '
...         'by prefixes ) , is a kind of search tree - an ordered tree data '
...         'structure that is used to store a dynamic set or associative array '
...         'where the keys are usually strings .')
>>> for pattern, start_idx in trie.search_all_patterns(text):
...     print pattern, start_idx
...
in 0
computer 3
computer science 3
science 12
, 20
a 22
trie 24
, 29
also 31
called 36
digital 43
... skipped ...
array 246
where 252
where the 252
the 258
the keys 258
keys 262
are 267
usually 271
strings 279 
  • The text is the 1st sentence of https://en.wikipedia.org/wiki/Trie. For normalization, remove capitalizations and add single white space before/after symbols.

  • search_all_patterns returns an iterator. Each searched pattern is represented as a tuple (pattern_string, pattern_start_index). The results are sorted by the start index. If you want to get the result as a list object, use list function as follow:

     >>> patterns = list(trie.search_all_patterns(text))

TrieSearch.search_longest_patterns

Search longest patterns in an input text:

>>> for pattern, start_idx in sorted(trie.search_longest_patterns(text), key=lambda x: x[1]):
...     print pattern, start_idx
...
in 0
computer science 3
, 20
a 22
trie 24
, 29
also 31
called 36
digital tree 43
and 56
sometimes 60
radix tree 70
or 81
prefix tree 84
( 96
as 98
they 101
can 106
be 110
by 122
prefixes 125
) 134
, 136
is a 138
kind 143
of 148
search tree 151
- 163
an 165
ordered tree data structure 168
that 196
is 201
used to 204
store 212
a 218
dynamic set 220
or 232
associative array 235
where the 253
the keys 259
are 268
usually 272
strings 280
  • search_all_patterns also returns an iterator. The result sorted by the length of patterns. In the above example, the result is re-sorted by the start index.

trie_search.RecordTrieSearch

trie_search.RecordTrieSearch is a sub class of marisa_trie.RecordTrie, which maps unicode keys to data tuples.

The functions, search_all_patterns and search_longest_patterns, are also implemented in trie_search.RecordTrieSearch.