Improve large dictionary matching performance #5532
Replies: 6 comments
-
You'd know for sure by doing more detailed profiling, but I suspect this depends a lot on the speed of the tokenizer. Your example looks like it didn't get copied quite correctly? You can also just use:

```python
for label, terms in word_dict.items():
    matcher.add(label, [nlp.make_doc(term) for term in terms])
```

As a very rough estimate, I'd expect the default English tokenizer to take about a minute to process 1 million 3-word texts, but it can depend a lot on the tokenizer settings and the contents of the texts (the tokenizer has a cache).
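For a fuller picture, here is a minimal, self-contained sketch of that suggestion; `word_dict` is a hypothetical mapping of labels to lists of phrase strings, and the blank English pipeline is just a stand-in for whatever tokenizer is actually in use:

```python
import spacy
from spacy.matcher import PhraseMatcher

# Blank pipeline: only the tokenizer runs, which is all PhraseMatcher patterns need.
nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)

# Hypothetical stand-in for the ~1M-entry dictionary: {label: [phrase, ...]}
word_dict = {
    "FRUIT": ["red apple", "green pear"],
    "CITY": ["New York", "San Francisco"],
}

for label, terms in word_dict.items():
    # nlp.make_doc() only tokenizes; with a loaded model it is far cheaper than nlp()
    matcher.add(label, [nlp.make_doc(term) for term in terms])

doc = nlp.make_doc("I bought a red apple in New York")
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], doc[start:end].text)
```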
-
I am investigating the bottleneck, and the time spent on English is a good reference point. Actually, I am using character-based tokens in this case.
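One way to narrow down the bottleneck is to time tokenization and `matcher.add` separately; a rough sketch, again assuming the hypothetical `word_dict` shape from above:

```python
import time

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # replace with the pipeline actually in use
matcher = PhraseMatcher(nlp.vocab)

word_dict = {"LABEL": ["some phrase", "another phrase"]}  # hypothetical data

# Time tokenization on its own.
t0 = time.perf_counter()
patterns = {label: [nlp.make_doc(term) for term in terms]
            for label, terms in word_dict.items()}
t1 = time.perf_counter()

# Time adding the pre-tokenized patterns to the matcher.
for label, docs in patterns.items():
    matcher.add(label, docs)
t2 = time.perf_counter()

print(f"tokenizing: {t1 - t0:.2f}s, matcher.add: {t2 - t1:.2f}s")
```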
-
The EntityRuler uses a PhraseMatcher internally for phrase patterns, so it won't be faster. Reloading a pickled PhraseMatcher should be faster than rebuilding it from the raw terms on every run.
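A sketch of the pickling idea, assuming the `PhraseMatcher` in your spaCy version pickles cleanly (the pickle carries the matcher's vocab along, so the file can get large for big dictionaries):

```python
import pickle

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("LABEL", [nlp.make_doc("some phrase")])  # hypothetical pattern

# Build once (the slow part), then save the compiled matcher to disk.
with open("matcher.pkl", "wb") as f:
    pickle.dump(matcher, f)

# On later runs, load the pickled matcher instead of re-tokenizing all entries.
with open("matcher.pkl", "rb") as f:
    matcher = pickle.load(f)

print(matcher(nlp.make_doc("text containing some phrase")))
```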
-
It actually took about 2.10 minutes. I had just eyeballed it and felt like it took 10 minutes, but it's really only a bit more than 2 minutes. Not bad.
-
I build the dictionary in an initializer, once per run. During development it's quite common to run repeatedly, so a pickled matcher is worth trying. Thanks for the note.
-
You can use flashtext; there is a "plugin" on the spaCy universe page: https://spacy.io/universe/project/spacy-lookup
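For reference, a minimal sketch of flashtext used directly (the spacy-lookup plugin wraps it as a spaCy pipeline component); the dictionary shape mirrors the hypothetical `word_dict` from above:

```python
from flashtext import KeywordProcessor

# Hypothetical dictionary: {label: [phrase, phrase, ...]}
word_dict = {
    "FRUIT": ["red apple", "green pear"],
    "CITY": ["New York", "San Francisco"],
}

kp = KeywordProcessor(case_sensitive=False)
# flashtext maps each keyword to a "clean name"; here we use the label.
for label, terms in word_dict.items():
    for term in terms:
        kp.add_keyword(term, label)

# Returns the labels of matched keywords, with character offsets.
print(kp.extract_keywords("I bought a red apple in New York", span_info=True))
# e.g. [('FRUIT', 11, 20), ('CITY', 24, 32)]
```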
-
I have a dictionary which contains nearly 1 million text entries. I used the `PhraseMatcher` to compile all the entries into patterns, and it takes quite a while to complete the compiling process:
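(The snippet itself didn't survive copying, as the first reply points out; below is a plausible reconstruction of the kind of loop being described, with `word_dict` as a hypothetical {label: [terms]} mapping.)

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # stand-in for the actual character-based pipeline
matcher = PhraseMatcher(nlp.vocab)

# Tiny stand-in for the ~1 million-entry dictionary.
word_dict = {"LABEL": ["some phrase", "another phrase"]}

for label, terms in word_dict.items():
    # Calling nlp() on every term runs the whole pipeline when a model is
    # loaded; for pattern building, nlp.make_doc() (tokenizer only) is cheaper.
    matcher.add(label, [nlp(term) for term in terms])
```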
This piece of code alone takes about 10 minutes to complete. Is there a way to make it faster, given the size of the dictionary? Would the EntityRuler be faster than the PhraseMatcher?