What is the cleanest method to get coreferenced text returned #12142

tomasonjo · 2023-01-22T21:01:44Z

Hey, I like using spaCy but I have little to no experience with Doc or Token objects. I would simply like to return the coreferenced text that I get by using the coreferee plugin. It seems that you can check coref clusters and resolve tokens with the appropriate method, but I have no idea what the cleanest option would be to return the coreferenced text of the entire Doc object.

Thanks,
Tomaž

danieldk · 2023-01-23T11:29:53Z

Could you explain a bit more what you'd like to accomplish?

There are two ways to access the coreference chains: (1) through the Doc._.coref_chains attribute of docs; or (2) through the Token._.coref_chains attribute of a token. The data model of coref chains is described here:

https://github.com/msg-systems/coreferee#2-interacting-with-the-data-model

Suppose that we have the English example from the coreferee documentation:

>>> doc = nlp("Although he was very busy with his work, Peter had had enough of it. He and his wife decided they needed a holiday. They travelled to Spain because they loved the country very much.")

At a basic level, you can see a Doc as a container of tokens:

>>> [(idx, token.text) for idx, token in enumerate(doc)]
[(0, 'Although'), (1, 'he'), (2, 'was'), (3, 'very'), (4, 'busy'), (5, 'with'), (6, 'his'), (7, 'work'), (8, ','), (9, 'Peter'), (10, 'had'), (11, 'had'), (12, 'enough'), (13, 'of'), (14, 'it'), (15, '.'), (16, 'He'), (17, 'and'), (18, 'his'), (19, 'wife'), (20, 'decided'), (21, 'they'), (22, 'needed'), (23, 'a'), (24, 'holiday'), (25, '.'), (26, 'They'), (27, 'travelled'), (28, 'to'), (29, 'Spain'), (30, 'because'), (31, 'they'), (32, 'loved'), (33, 'the'), (34, 'country'), (35, 'very'), (36, 'much'), (37, '.')]

Suppose that you want the first (and only) coreference chain that wife is part of. You can index a Doc like a container to get the token, then read the chain from it:

>>> doc[19]._.coref_chains[0]
2: [16, 19], [21], [26], [31]

This is a Chain object, which gives you access to the identifier of the chain (index) and the mentions (mentions) of that particular coreference chain. Each mention is a list of token indices. So, if you want to get the string representations of the mentions in the chain of wife, you could do so as follows:

>>> [[doc[idx].text for idx in mention] for mention in doc[19]._.coref_chains[0].mentions]
[['He', 'wife'], ['they'], ['They'], ['they']]

To unpack this:

for mention in doc[19]._.coref_chains[0].mentions: iterate over each mention in the chain
[doc[idx].text for idx in mention]: each mention is a list of token indices (plain ints), which you can use to index a token in the Doc (doc[idx]). This gives you a Token, which has a text property that returns the orthographic representation of the token.

Since a Chain acts as a list-like object, you can leave out the mentions attribute:

>>> [[type(idx) for idx in mention] for mention in doc[19]._.coref_chains[0]]
[[<class 'int'>, <class 'int'>], [<class 'int'>], [<class 'int'>], [<class 'int'>]]
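The index-to-text mapping above can also be tried without a loaded pipeline. The following is a minimal sketch that uses plain Python stand-ins: `token_texts` and `mentions` are copied from the outputs shown in this thread, and `mention_texts` is a hypothetical helper name, not part of spaCy or coreferee.

```python
def mention_texts(tokens, mentions):
    """Map each mention (a list of token indices) to the texts of its tokens."""
    return [[tokens[idx] for idx in mention] for mention in mentions]


# Stand-in data (assumption: copied from the example outputs above).
# Only the token positions that occur in chain 2 are listed.
token_texts = {16: "He", 19: "wife", 21: "they", 26: "They", 31: "they"}
mentions = [[16, 19], [21], [26], [31]]

print(mention_texts(token_texts, mentions))
# [['He', 'wife'], ['they'], ['They'], ['they']]
```

With a real Doc, the same comprehension works when `tokens[idx]` is replaced by `doc[idx].text`, as in the REPL example.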

I would recommend going through the documentation of Token and Doc to understand how these classes work and what properties are available:

https://spacy.io/api/doc
https://spacy.io/api/token

tomasonjo · 2023-01-23T11:37:49Z

Author

What I want to return is ->

doc = nlp("Although he was very busy with his work, Peter had had enough of it. He and his wife decided they needed a holiday. They travelled to Spain because they loved the country very much.")

and output is:

Although Peter was very busy with Peter work, Peter had had enough of work. Peter and his wife decided they needed a holiday. Peter and wife travelled to Spain because Peter and wife loved the country very much.

The coreference is only a starting step in the information extraction pipeline. I want to retrieve the coreferenced text and input it into the REBEL model next.

I could perhaps come up with some crappy code to do this, but I want to use this example in my book and would therefore like as clean code as possible. Thanks

tomasonjo · 2023-01-23T19:05:49Z

Author

A method like the one NeuralCoref provides would be great:

doc._.coref_resolved

tomasonjo · 2023-01-23T20:29:07Z

Author

Found a solution

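# coref_doc is the Doc produced by a pipeline that includes coreferee,
# e.g. coref_doc = nlp(text)
resolved_text = ""
for token in coref_doc:
    # resolve() returns the tokens an anaphor refers to, or None
    repres = coref_doc._.coref_chains.resolve(token)
    if repres:
        resolved_text += " " + " and ".join([t.text for t in repres])
    else:
        resolved_text += " " + token.text

# Drop the leading space added before the first token
print(resolved_text.strip())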
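For reuse, the loop above could be wrapped in a small function. This is only a sketch under a few assumptions: `resolve_coreferences` is a hypothetical name (it is not part of coreferee), `doc` is a Doc processed by a pipeline that includes coreferee, and the original spacing is kept by appending spaCy's `Token.whitespace_` instead of inserting a space before every token.

```python
def resolve_coreferences(doc):
    """Return the text of ``doc`` with resolvable anaphors replaced by their referents.

    Assumption: ``doc`` comes from a spaCy pipeline that includes coreferee,
    so ``doc._.coref_chains.resolve(token)`` is available.
    """
    pieces = []
    for token in doc:
        # resolve() returns a list of tokens, or None when the token is not
        # an anaphor that could be resolved.
        referents = doc._.coref_chains.resolve(token)
        if referents:
            replacement = " and ".join(t.text for t in referents)
        else:
            replacement = token.text
        # whitespace_ is the trailing whitespace recorded for the token, so
        # punctuation stays attached to the preceding word.
        pieces.append(replacement + token.whitespace_)
    return "".join(pieces)
```

Calling `resolve_coreferences(nlp(text))` would then return a single string. Like the loop it is based on, it replaces a pronoun with the bare referent tokens only, so possessives such as "his" become "Peter" rather than "Peter's".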

richardpaulhudson · 2023-01-24T12:49:54Z

Hi @tomasonjo, we considered but decided against implementing this requirement at the very beginning of Coreferee's life: see msg-systems/coreferee#1. The problem is that some of the languages Coreferee supports — and probably a majority of the world's languages that it might one day support — do not consistently express anaphors with overt pronouns. So in many cases there is nothing for a referred-to noun to replace. You can do this in English, but because it's a language-specific thing it is more appropriate to use the code supplied above outside Coreferee rather than within it.

tomasonjo · 2023-01-24T16:52:26Z

Author

Hi @richardpaulhudson, thanks for your input. However, it would probably make sense to at least include the code for the English version in the docs. At least, that's my opinion, as coreference is usually only a single step in the NLP pipeline and not the end output.

Thanks

2 replies

richardpaulhudson Jan 25, 2023

It's true that coreference resolution is rarely an end in itself. However, a downstream component does not need a human-readable representation of its output to be able to make effective use of that output, and even for English (where it's grammatically possible), I don't think replacing pronouns like this makes very much sense. If a pronoun's referring back to a noun that heads a phrase, do we just replace the noun or the whole phrase? And if it's referring back to a noun with an indefinite article, do we use a definite article when replacing the pronoun? You quickly get into language generation with all the problems that entails, but without any particular purpose because when doing information extraction you can write code to traverse the cluster structures without generating a new textual representation of each input document.
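As a rough illustration of traversing the cluster structures instead of generating text, the sketch below builds a lookup from token index to chain index. It assumes the chains have already been copied into plain Python lists (for example from each Chain's index and mentions, as shown earlier in this thread); `index_tokens_by_chain` is a hypothetical helper name.

```python
def index_tokens_by_chain(chains):
    """Return {token_index: chain_index} for every token that is part of a mention."""
    lookup = {}
    for chain_index, mentions in chains.items():
        for mention in mentions:
            for token_index in mention:
                lookup[token_index] = chain_index
    return lookup


# Chain 2 as printed earlier in this thread: "2: [16, 19], [21], [26], [31]".
chains = {2: [[16, 19], [21], [26], [31]]}
lookup = index_tokens_by_chain(chains)

# "wife" (token 19) and "They" (token 26) belong to the same chain.
print(lookup[19] == lookup[26])  # True
```

A downstream component could use such a lookup to treat two tokens as the same entity without rewriting the input text.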

tomasonjo Jan 25, 2023
Author

I understand your perspective. I'll give you a perspective from a non-expert NLP user, who likes to play around with NLP.

There are three coreference plugins for spaCy, but only one works with the latest version and doesn't have older dependencies that make it a pain in the ass to combine with other models (apart from creating multiple environments). Now, the other two plugins both implement a method that returns the coreference-resolved text, but since I don't want to deal with dependency hell, I am dreading the idea of using them. That makes the coreferee plugin the go-to plugin. As you said, coreference resolution is usually an intermediate step in the pipeline. For example, say that I want to use the resolved text of coreference resolution for NER or RE downstream. With NER, the only difference between using coreference resolution or not would be the count of mentions, so if that's not critical it is not a big deal. However, with RE, it makes a big difference. Since I am not implementing custom models and pipelines, it is really hard for me to do anything downstream with coreference cluster information. For example, I have no idea how coreference clusters would help me with a downstream REBEL model.

So all in all, you are treating coreferee as a starting point for experienced spaCy users and essentially raising the entry bar. An inexperienced user doesn't care all that much about (in)definite articles and so on, and knows the results aren't perfect. Even the actual clusters are hardly ever perfect, but they are good enough, so we use them. The same thing would apply to resolved text IMO. The way I see it now is that you have decided to leave the problems of resolving text or using cluster information downstream to users, which raises the bar significantly for testing and simple implementations.
