Annotations extending beyond the end of the document. #52

Open
greenwoodma opened this issue Aug 1, 2018 · 0 comments

greenwoodma commented Aug 1, 2018

I'm currently writing a plugin for GATE (http://gate.ac.uk) to enable us to read in documents annotated with Anafora. You can find the latest version of the plugin at https://github.com/GateNLP/gateplugin-Format_Anafora

While the XML format Anafora uses is nice and easy to parse, I've come across quite a few documents where an annotator has somehow managed to produce an annotation that ends after the end of the document. In many cases the difference isn't just one or two characters (something I could understand if there were encoding issues) but 50 characters or more. Currently I simply truncate these annotations to match the document length, and this seems to result in the same annotations appearing in both Anafora and GATE.
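For illustration, the truncation described above could look roughly like the following. This is a minimal sketch in Python, not the plugin's actual code; `clamp_span` is a hypothetical helper name, and it assumes offsets are character positions with an exclusive end.

```python
def clamp_span(start, end, doc_length):
    """Clamp a (start, end) character span so it lies within [0, doc_length].

    Hypothetical helper: assumes offsets count characters and that
    `end` is exclusive, so `end == doc_length` is still valid.
    """
    clamped_start = max(0, min(start, doc_length))
    # Never let the end fall before the (clamped) start.
    clamped_end = max(clamped_start, min(end, doc_length))
    return clamped_start, clamped_end


# A span that runs past the end of an 817-character document is cut at 817;
# a span already inside the document is left untouched.
truncated = clamp_span(656, 835, 817)
unchanged = clamp_span(10, 20, 817)
```

A span that starts beyond the end of the document collapses to an empty span at `doc_length`; whether such annotations should be kept or dropped is a separate decision.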

As I haven't checked every annotation on every document, though, I'm wondering whether this is an isolated issue with annotations that end at the end of a document, or whether there might be a wider issue with annotation offsets being stored incorrectly.

I've attached a document and annotation file so you can see what I mean. In this instance the document is 817 characters long (I'm assuming it's UTF-8 but with no multi-byte characters, as it's also 817 bytes long), but the second annotation in the file produced by Anafora spans from offset 656 to 835; in other words, it goes 18 characters beyond the end of the document.
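The two checks mentioned here (character length versus UTF-8 byte length, and how far a span overshoots) can be reproduced with a short Python sketch. `describe_overshoot` is a hypothetical name, and the stand-in text below is just 817 ASCII characters, not the contents of the attached document.

```python
def describe_overshoot(text, span_end):
    """Compare character and UTF-8 byte lengths and report span overshoot.

    Hypothetical helper: `span_end` is assumed to be a character offset.
    """
    char_len = len(text)
    byte_len = len(text.encode("utf-8"))
    return {
        "chars": char_len,
        "bytes": byte_len,
        # Equal lengths mean every character encodes to a single byte.
        "single_byte_only": char_len == byte_len,
        # How many characters the span extends past the end of the text.
        "overshoot": max(0, span_end - char_len),
    }


# Stand-in for the attached document: 817 single-byte characters.
report = describe_overshoot("a" * 817, 835)
```

If the character and byte counts differed, it would be worth checking whether the stored offsets are byte offsets rather than character offsets, since that alone could account for spans that run past the end of the text.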

doc000149.zip
