GitHub - tchewik/corefhd: This is our solution for the RuCoCo-23 shared task described in "Light Coreference Resolution for Russian with Hierarchical Discourse Features"

Light Coreference Resolution for Russian with Hierarchical Discourse Features

This is our solution for the RuCoCo-23 shared task: Coreference Resolution in Russian (single-antecedent resolution only).

A Colab notebook is available for the baseline (no RST) model.

1. Set up the syntax & NER parser

(Option 1) With Docker

Run the container locally or remotely using the following command:

   docker run --rm -d -p 3334:3333 --name spacy_ru tchewik/isanlp_spacy:ru

Connect to it from Python:

from isanlp.processor_remote import ProcessorRemote
 
spacy_address = ['0.0.0.0', 3334]
spacy_processor = (ProcessorRemote(spacy_address[0], spacy_address[1], '0'),
                   ['tokens', 'sentences'],
                   {'lemma': 'lemma',
                    'postag': 'postag',
                    'morph': 'morph',
                    'syntax_dep_tree': 'syntax_dep_tree',
                    'entities': 'entities'})

(Option 2) Locally

Download the model:

python -m spacy download ru_core_news_lg

Initialize it in Python using ProcessorSpaCy:

from isanlp.processor_spacy import ProcessorSpaCy

spacy_processor = (ProcessorSpaCy(model_name='ru_core_news_lg'),
                  ['tokens', 'sentences'],
                  {'lemma': 'lemma',
                   'postag': 'postag',
                   'morph': 'morph',
                   'syntax_dep_tree': 'syntax_dep_tree',
                   'entities': 'entities'})
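
Before constructing `ProcessorSpaCy`, it can help to confirm that the model package is importable. The sketch below assumes the spaCy model was installed as a regular Python package, which is how `spacy download` installs models. The helper name `package_is_installed` is hypothetical and not part of isanlp or spaCy.

```python
import importlib.util


def package_is_installed(name: str) -> bool:
    """Return True if a top-level package called `name` can be imported."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    model_name = "ru_core_news_lg"  # model name used in the download command above
    if package_is_installed(model_name):
        print(f"{model_name} is installed")
    else:
        print(f"{model_name} is missing; run: python -m spacy download {model_name}")
```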

2. Set up the RST parser (only for `model_rh`)

(Only option) With Docker

Run the container locally or remotely using the following command:

docker run --rm -d -p 3335:3333 --name rst_ru tchewik/isanlp_rst:2.1-rstreebank

Connect to it from Python:

from isanlp.processor_remote import ProcessorRemote

rst_address = ['0.0.0.0', 3335]
rst_processor = (ProcessorRemote(rst_address[0], rst_address[1], 'default'),
                 ['text', 'tokens', 'sentences', 'postag', 'morph', 'lemma', 'syntax_dep_tree'],
                 {'rst': 'rst'})

3. Set up the coreference resolver

Two models from the RuCoCo-23 test leaderboard are available: base and Rh-enhanced. The latter requires RST parsing, which makes it much slower (about 19 s versus under 1 s on the example, CPU only; see the table below). Each model can be run either with Docker or locally.

| Name | F1 (dev) | F1 (test) | Time (example, CPU only) | For local run (place into `models/`) | Docker image |
|---|---|---|---|---|---|
| base | 74.3 | 72.8 | ~883 ms | model_base.tar.gz | `tchewik/corefhd:base` |
| base+rh | 74.6 | 73.3 | ~19 s | model_rh.tar.gz | `tchewik/corefhd:rh` |
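
To make the trade-off between the two models concrete, the following sketch recomputes the F1 gain and the slowdown from the figures in the table above. The timings are approximate in the source, so the ratio is only a rough indication; the dictionary layout and variable names are illustrative.

```python
# Figures copied from the table above (timings are approximate).
models = {
    "base":    {"f1_dev": 74.3, "f1_test": 72.8, "seconds": 0.883},
    "base+rh": {"f1_dev": 74.6, "f1_test": 73.3, "seconds": 19.0},
}

base, rh = models["base"], models["base+rh"]

# Round to one decimal to avoid floating-point noise in the differences.
f1_gain_dev = round(rh["f1_dev"] - base["f1_dev"], 1)
f1_gain_test = round(rh["f1_test"] - base["f1_test"], 1)
slowdown = round(rh["seconds"] / base["seconds"], 1)

print(f"F1 gain: +{f1_gain_dev} (dev), +{f1_gain_test} (test); slowdown: ~{slowdown}x")
```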

(Option 1) With Docker

Run the container locally or remotely with the following command, substituting the selected tag (base or rh):
```
   docker run --rm -d -p 3336:3333 --name corefhd tchewik/isanlp_corefhd:<tag>
```

Connect to it from Python:

from isanlp.processor_remote import ProcessorRemote

coref_address = ['0.0.0.0', 3336]

# Base model
corefhd = (ProcessorRemote(coref_address[0], coref_address[1], 'default'),
           ['text', 'tokens', 'sentences',
            'lemma', 'postag', 'syntax_dep_tree', 'entities'],
           {'entity_clusters': 'entity_clusters'})

# Rh model
corefhd = (ProcessorRemote(coref_address[0], coref_address[1], 'default'),
           ['text', 'tokens', 'sentences',
            'lemma', 'postag', 'syntax_dep_tree', 'entities', 'rst'],
           {'entity_clusters': 'entity_clusters'})
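
When the processors run in Docker, the containers may need some time to start before `ProcessorRemote` can reach them. The sketch below polls the host ports used in the `docker run -p` commands of this README (3334, 3335, 3336) with plain TCP connections. The helper names and the `localhost` default are assumptions for illustration; an open port only shows that the container is listening, not that the model has finished loading.

```python
import socket
import time


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_services(services, host="localhost", retries=30, delay=2.0):
    """Poll each named port until it accepts connections.

    Returns the sorted names of the services that are still unreachable
    after `retries` attempts (an empty list means everything is up).
    """
    pending = dict(services)
    for _ in range(retries):
        pending = {name: port for name, port in pending.items()
                   if not port_is_open(host, port)}
        if not pending:
            break
        time.sleep(delay)
    return sorted(pending)


if __name__ == "__main__":
    # Container names and host ports taken from the commands in this README.
    down = wait_for_services({"spacy_ru": 3334, "rst_ru": 3335, "corefhd": 3336},
                             retries=1, delay=0)
    print("not reachable:", down or "none")
```

For the base model the `rst_ru` entry can be dropped, since that model does not use the RST parser.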

(Option 2) Locally

Download the model and save it as models/model_base.tar.gz or models/model_rh.tar.gz (link in the table).
Find the Python path of your allennlp installation and update it for LUKE (see load_custom_allennlp_scripts.bash).
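
A quick check that the archive is where this README says to put it can save a confusing failure later. The sketch below assumes the script runs from the repository root and that `use_discourse=True` corresponds to `model_rh.tar.gz`; the helper `expected_model_archive` is hypothetical and not part of the repository.

```python
from pathlib import Path


def expected_model_archive(use_discourse: bool, models_dir: str = "models") -> Path:
    """Return the archive path this README names for the chosen model."""
    filename = "model_rh.tar.gz" if use_discourse else "model_base.tar.gz"
    return Path(models_dir) / filename


if __name__ == "__main__":
    for use_discourse in (False, True):
        archive = expected_model_archive(use_discourse)
        status = "found" if archive.is_file() else "missing"
        print(f"{archive}: {status}")
```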

Initialize in Python using ProcessorCorefHD:

from processor_corefhd import ProcessorCorefHD

# Base model
corefhd_processor = (ProcessorCorefHD(cuda_device=-1, use_discourse=False),
           ['text', 'tokens', 'sentences',
            'lemma', 'postag', 'syntax_dep_tree', 'entities'],
           {0: 'entity_clusters'})

# Rh model
corefhd_processor = (ProcessorCorefHD(cuda_device=-1, use_discourse=True),
           ['text', 'tokens', 'sentences',
            'lemma', 'postag', 'syntax_dep_tree', 'entities', 'rst'],
           {'entity_clusters': 'entity_clusters'})

4. Process the texts

Construct the pipeline from initialized processors:

For the base model:

  from isanlp import PipelineCommon
  from isanlp.processor_razdel import ProcessorRazdel

  ppl = PipelineCommon([
     (ProcessorRazdel(), ['text'],
      {'tokens': 'tokens',
       'sentences': 'sentences'}),
     spacy_processor,
     corefhd_processor
  ])

For the Rh model:

  from isanlp import PipelineCommon
  from isanlp.processor_razdel import ProcessorRazdel

  ppl = PipelineCommon([
     (ProcessorRazdel(), ['text'],
      {'tokens': 'tokens',
       'sentences': 'sentences'}),
     spacy_processor,
     rst_processor,
     corefhd_processor
  ])
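
The two pipelines differ only in the optional RST stage, so the ordering can be captured in one small helper. This is a sketch with a hypothetical function name, `build_stages`; it only arranges whatever objects it is given and does not depend on isanlp. The ordering follows from the inputs listed earlier: the RST parser consumes the syntax parser's annotations, and the Rh coreference model consumes `'rst'`.

```python
def build_stages(tokenizer, syntax, coref, rst=None):
    """Return the processors in the order the pipeline needs them.

    The RST parser, when given, goes after the syntax parser (it consumes
    its annotations) and before the coreference resolver (which consumes 'rst').
    """
    stages = [tokenizer, syntax]
    if rst is not None:
        stages.append(rst)
    stages.append(coref)
    return stages


if __name__ == "__main__":
    # Placeholder strings stand in for the processor tuples defined above.
    print(build_stages("razdel", "spacy", "corefhd"))
    print(build_stages("razdel", "spacy", "corefhd", rst="rst"))
```

With the real objects, the returned list would be passed to `PipelineCommon`, with the `(ProcessorRazdel(), ['text'], {...})` tuple as the tokenizer stage.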

Run the constructed pipeline:

with open('text_example.txt', 'r', encoding='utf-8') as f:
    text = f.read().strip()
result = ppl(text)

The result is a list of clusters; each mention is an inclusive [first_token, last_token] pair of token indices:

   >>> result['entity_clusters']
   [[[0, 1], [7, 7], [19, 19], [103, 104], [126, 126]],
    [[23, 27], [30, 30]],
    [[68, 69], [72, 72]],
    [[78, 83], [132, 132]],
    [[44, 53], [138, 138], [152, 152]],
    [[133, 134], [140, 140], [149, 149]],
    [[89, 90], [142, 142]]]

Example of recovering the corresponding text spans:

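def print_coreference_clusters(text, tokens, entity_clusters):
    """Print the first mention of each cluster, followed by its other mentions."""
    def mention_to_str(mention):
        # A mention is an inclusive [first_token, last_token] pair of token indices.
        return text[tokens[mention[0]].begin: tokens[mention[1]].end]

    for entity in entity_clusters:
        print(f'{mention_to_str(entity[0])} ::: {[mention_to_str(mention) for mention in entity[1:]]}')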
   
>>> print_coreference_clusters(result['text'], result['tokens'], result['entity_clusters'])
Иоганн Шильтбергер ::: ['он', 'отрок', 'сам Иоганн', 'он']
рыцаря по имени Леонгарт Рихартингер ::: ['его']
венгерские крестоносцы ::: ['которым']
24-летним сыном герцога Бургундии Жаном Бесстрашным ::: ['Жана']
венгерский король и будущий император Священной Римской империи Сигизмунд I ::: ['Сигизмунда', 'Сигизмунд']
бургундские рыцари ::: ['Они', 'им']
турецкой армией ::: ['турок']
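
For downstream use it is often handy to ask which entity a given mention belongs to. The sketch below builds such a lookup from the `entity_clusters` structure shown above; the function name `mention_to_cluster` is hypothetical, and the cluster index is simply the position of the cluster in the list.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]


def mention_to_cluster(entity_clusters: List[List[List[int]]]) -> Dict[Span, int]:
    """Map each (first_token, last_token) mention to the index of its cluster."""
    lookup: Dict[Span, int] = {}
    for cluster_id, cluster in enumerate(entity_clusters):
        for first, last in cluster:
            lookup[(first, last)] = cluster_id
    return lookup


if __name__ == "__main__":
    # First three clusters from the example output above.
    clusters = [
        [[0, 1], [7, 7], [19, 19], [103, 104], [126, 126]],
        [[23, 27], [30, 30]],
        [[68, 69], [72, 72]],
    ]
    lookup = mention_to_cluster(clusters)
    print(lookup[(7, 7)] == lookup[(126, 126)])  # same entity
    print(lookup[(7, 7)] == lookup[(30, 30)])    # different entities
```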

Cite

Further information and examples can be found in our paper:

@INPROCEEDINGS{chistova2023light,
      author = {Chistova, E. and Smirnov, I.},
      title = {Light Coreference Resolution for Russian with Hierarchical Discourse Features},
      booktitle = {Computational Linguistics and Intellectual Technologies. Papers from the Annual International Conference "Dialogue" (2023)},
      year = {2023},
      number = {22},
      pages = {34--41}
}
