This is our solution for RuCoCo-23 shared task: Coreference Resolution in Russian (only single antecedent resolution).
Colab notebook for baseline (no RST) model.
-
(Option 1) With Docker
- Run the container locally or remotely using the following command:
docker run --rm -d -p 3334:3333 --name spacy_ru tchewik/isanlp_spacy:ru
- Connect to it from Python:
from isanlp.processor_remote import ProcessorRemote spacy_address = ['0.0.0.0', 3334] spacy_processor = (ProcessorRemote(spacy_address[0], spacy_address[1], '0'), ['tokens', 'sentences'], {'lemma': 'lemma', 'postag': 'postag', 'morph': 'morph', 'syntax_dep_tree': 'syntax_dep_tree', 'entities': 'entities'})
- Run the container locally or remotely using the following command:
-
(Option 2) Locally
- Download the model
python -m spacy download ru_core_news_lg
- Initialize in Python using
ProcessorSpaCy
from isanlp.processor_spacy import ProcessorSpaCy spacy_processor = (ProcessorSpaCy(model_name='ru_core_news_lg'), ['tokens', 'sentences'], {'lemma': 'lemma', 'postag': 'postag', 'morph': 'morph', 'syntax_dep_tree': 'syntax_dep_tree', 'entities': 'entities'})
- Download the model
-
(Only option) With Docker
- Run the container locally or remotely using the following command:
docker run --rm -d -p 3335:3333 --name rst_ru tchewik/isanlp_rst:2.1-rstreebank
- Connect to it from Python:
from isanlp.processor_remote import ProcessorRemote rst_address = ['0.0.0.0', 3335] rst_processor = (ProcessorRemote(rst_address[0], rst_address[1], 'default'), ['text', 'tokens', 'sentences', 'postag', 'morph', 'lemma', 'syntax_dep_tree'], {'rst': 'rst'})
- Run the container locally or remotely using the following command:
There are two models from the test leaderboard of RuCoCo-23: base and Rh-enhanced. The latter requires RST parsing which makes it slow. There are also two options for running: with Docker or locally.
name | F1 (dev) | F1 (test) | time (example, CPU only) |
for local run (place into models/ ) |
docker image |
---|---|---|---|---|---|
base | 74.3 | 72.8 | ~883 ms | model_base.tar.gz | tchewik/corefhd:base |
base+rh | 74.6 | 73.3 | ~19 s | model_rh.tar.gz | tchewik/corefhd:rh |
-
(Option 1) With Docker
- Run the container locally or remotely using the following command using selected tag (
base
orrh
):docker run --rm -d -p 3336:3333 --name corefhd tchewik/isanlp_corefhd:<tag>
- Connect to it from Python:
from isanlp.processor_remote import ProcessorRemote coref_address = ['0.0.0.0', 3336] # Base model corefhd = (ProcessorRemote(coref_address[0], coref_address[1], 'default'), ['text', 'tokens', 'sentences', 'lemma', 'postag', 'syntax_dep_tree', 'entities'], {'entity_clusters': 'entity_clusters'}) # Rh model corefhd = (ProcessorRemote(coref_address[0], coref_address[1], 'default'), ['text', 'tokens', 'sentences', 'lemma', 'postag', 'syntax_dep_tree', 'entities', 'rst'], {'entity_clusters': 'entity_clusters'})
- Run the container locally or remotely using the following command using selected tag (
-
(Option 2) Locally
- Download the model as
models/model_base.tar.gz
ormodels/model_rh.tar.gz
(link in the table). - Find the python path for allennlp and update for LUKE (see
load_custom_allennlp_scripts.bash
) - Initialize in Python using
ProcessorCorefHD
:from processor_corefhd import ProcessorCorefHD # Base model corefhd_processor = (ProcessorCorefHD(cuda_device=-1, use_discourse=False), ['text', 'tokens', 'sentences', 'lemma', 'postag', 'syntax_dep_tree', 'entities'], {0: 'entity_clusters'}) # Rh model corefhd_processor = (ProcessorCorefHD(cuda_device=-1, use_discourse=True), ['text', 'tokens', 'sentences', 'lemma', 'postag', 'syntax_dep_tree', 'entities', 'rst'], {'entity_clusters': 'entity_clusters'})
- Download the model as
-
Construct the pipeline from initialized processors:
-
For base model
from isanlp import PipelineCommon from isanlp.processor_razdel import ProcessorRazdel ppl = PipelineCommon([ (ProcessorRazdel(), ['text'], {'tokens': 'tokens', 'sentences': 'sentences'}), spacy_processor, corefhd_processor ])
-
For Rh model
from isanlp import PipelineCommon from isanlp.processor_razdel import ProcessorRazdel ppl = PipelineCommon([ (ProcessorRazdel(), ['text'], {'tokens': 'tokens', 'sentences': 'sentences'}), spacy_processor, rst_processor, corefhd_processor ])
-
-
Run the constructed pipeline:
text = open('text_example.txt', 'r').read().strip() result = ppl(text)
The result is given in token spans:
>>> result['entity_clusters'] [[[0, 1], [7, 7], [19, 19], [103, 104], [126, 126]], [[23, 27], [30, 30]], [[68, 69], [72, 72]], [[78, 83], [132, 132]], [[44, 53], [138, 138], [152, 152]], [[133, 134], [140, 140], [149, 149]], [[89, 90], [142, 142]]]
Example finding the corresponding text spans:
def print_coreference_clusters(text, tokens, entity_clusters): def mention_to_str(mention): return text[tokens[mention[0]].begin: tokens[mention[1]].end] for entity in entity_clusters: print(f'{mention_to_str(entity[0])} ::: {[mention_to_str(mention) for mention in entity[1:]]}') >>> print_coreference_clusters(result['text'], result['tokens'], result['entity_clusters']) Иоганн Шильтбергер ::: ['он', 'отрок', 'сам Иоганн', 'он'] рыцаря по имени Леонгарт Рихартингер ::: ['его'] венгерские крестоносцы ::: ['которым'] 24-летним сыном герцога Бургундии Жаном Бесстрашным ::: ['Жана'] венгерский король и будущий император Священной Римской империи Сигизмунд I ::: ['Сигизмунда', 'Сигизмунд'] бургундские рыцари ::: ['Они', 'им'] турецкой армией ::: ['турок']
Further information and examples can be found in our paper:
@INPROCEEDINGS{chistova2023light,
author = {Chistova, E. and Smirnov, I.},
title = {Light Coreference Resolution for Russian with Hierarchical Discourse Features},
booktitle = {Computational Linguistics and Intellectual Technologies. Papers from the Annual International Conference "Dialogue" (2023)},
year = {2023},
number = {22},
pages = {34--41}
}