I'm currently writing a plugin for GATE (http://gate.ac.uk) to enable us to read in documents annotated with Anafora. You can find the latest version of the plugin at https://github.com/GateNLP/gateplugin-Format_Anafora
While the XML format Anafora uses is nice and easy to parse, I've come across quite a few documents where an annotator has somehow managed to produce an annotation that ends after the end of the document. In many cases this isn't just a difference of one or two characters (something I could understand if there were encoding issues, etc.) but a difference of 50 characters or more. Currently I simply truncate these annotations to match the document length, and this seems to result in the same annotations appearing in both Anafora and GATE.
As I haven't checked every annotation on every document, though, I'm wondering if this is an isolated issue with annotations that end at the end of a document, or if there might be a wider issue with annotation offsets being stored incorrectly.
I've attached a document and annotation file so you can see what I mean. In this instance the document is 817 characters long (I'm assuming it's UTF-8 but with no multi-byte characters, as it's also 817 bytes long), but the second annotation in the file produced by Anafora spans from offset 656 to 835; in other words, it goes 18 characters beyond the end of the document.
doc000149.zip
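For anyone wanting to reproduce the check, here is a minimal Python sketch of the detection and truncation described above. It is not the plugin's code. It assumes that each annotation stores its offsets in a `<span>` element as `start,end` pairs separated by `;`, and the `SAMPLE` fragment is a hypothetical stand-in built from the numbers quoted above, not the attached file.

```python
import xml.etree.ElementTree as ET


def parse_spans(span_text):
    """Parse 'start,end' pairs separated by ';' into (start, end) tuples."""
    spans = []
    for part in span_text.split(";"):
        start, end = part.split(",")
        spans.append((int(start), int(end)))
    return spans


def find_overruns(xml_text, doc_length):
    """Return ((start, end), overrun) for every span ending past the document."""
    root = ET.fromstring(xml_text)
    overruns = []
    # Assumption: offsets live in elements named "span".
    for span_el in root.iter("span"):
        for start, end in parse_spans(span_el.text.strip()):
            if end > doc_length:
                overruns.append(((start, end), end - doc_length))
    return overruns


def clamp(span, doc_length):
    """Truncate a span so that neither offset exceeds the document length."""
    start, end = span
    return (min(start, doc_length), min(end, doc_length))


# Hypothetical fragment using the offsets quoted in this report.
SAMPLE = (
    "<data><annotations><entity>"
    "<span>656,835</span>"
    "</entity></annotations></data>"
)

print(find_overruns(SAMPLE, 817))  # [((656, 835), 18)]
print(clamp((656, 835), 817))      # (656, 817)
```

Running `find_overruns` over a whole corpus, with `doc_length` taken as the character count of each decoded text file, would show whether the overruns only affect annotations at the very end of a document or point to a wider offset problem.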