Support for AnnotatedText #79

Open
mishushakov opened this issue Apr 28, 2022 · 10 comments

Comments

@mishushakov

mishushakov commented Apr 28, 2022

hey, thanks for this awesome project!
would you consider adding AnnotatedText support?

this would allow nlprule to be used to spell-check markdown/word/html/etc. documents converted to AnnotatedText format (supported by LanguageTool)

right now i'm thinking about how it could be done, but i can't quite figure out how LanguageTool manages to spell-check while ignoring the markup and then map the ranges back to the original document

@mishushakov
Author

mishushakov commented Apr 29, 2022

i think i've actually cracked it, and it's really easy:
you start at offset 0 and line 0
you iterate over the nodes/elements in the annotation
when a node is interpreted as a new line, you increase the current line count
you push each text node's contents into texts at the current line, together with the current offset
then you add the length of the node (text or markup) to the offset

the only thing remaining is to iterate over texts line by line and add each text's offset to the offset nlprule reports
for example, if a text starts at offset 30 and nlprule reports a misspelling at index 4 within it, the final offset is 30 + 4 = 34, which is its position in the original document

here it is implemented in js

let annotatedText = {"annotation": [
  {"text": "A "},
  {"markup": "<b>"},
  {"text": "test"},
  {"markup": "</b>"},
  {"markup": "<p>", "interpretAs": "\n\n"},
  {"text": "Interpret as new line"},
  {"markup": "</p>"}
]}

let offset = 0
let currentLine = 0
let texts = []
let result = ''

annotatedText.annotation.forEach((node) => {
  // result is only for debugging
  result += node.text ? node.text : node.markup

  if (node.interpretAs === '\n\n') {
    currentLine++
  }

  if (node.text) {
    if (!texts[currentLine]) texts[currentLine] = []
    texts[currentLine].push({text: node.text, offset})
  }

  offset += node.text ? node.text.length : node.markup.length
})

console.log(texts)
console.log(result)

here's the result, you can check that the offsets are correct

[
 [{ text: 'A ', offset: 0 }, { text: 'test', offset: 5 }],
 [{ text: 'Interpret as new line', offset: 16 }]
]
A <b>test</b><p>Interpret as new line</p>

i'm thinking of translating into Rust now, but i'd need to make sure about the edge-cases
first one that i can think of is utf8
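
To make the utf8 concern concrete, here is a small sketch (the helper name `lengths` is made up for illustration). JS `String.length` counts UTF-16 code units, while Rust's `str::len` counts UTF-8 bytes, so the same node can advance the offset by different amounts in the two implementations:

```rust
/// Length of `s` in the three units that commonly show up in offset APIs.
/// Returns (UTF-8 bytes, Unicode scalar values, UTF-16 code units).
fn lengths(s: &str) -> (usize, usize, usize) {
    let bytes = s.len();                  // what Rust's str::len reports
    let chars = s.chars().count();        // Unicode scalar values
    let utf16 = s.encode_utf16().count(); // what JS String.length reports
    (bytes, chars, utf16)
}
```

For ASCII-only text all three agree; for "\u{e9}" (é) or "\u{1F600}" (an emoji) they do not, so whichever unit is chosen has to be used consistently for both the text and the markup lengths.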

@bminixhofer
Owner

Hi, I'm not familiar with the AnnotatedText format, can you link some resource? Is this specific to LanguageTool?

In principle this does sound like a good feature though.

Thanks for the sample implementation, it does seem pretty straightforward.

@mishushakov
Author

mishushakov commented Apr 29, 2022

hi!
yep, take a look at the LanguageTool HTTP API

https://languagetool.org/http-api/#!/default/post_check

the annotated text feature in LanguageTool allows you to check documents with markup (html/word/markdown) without writing parsers

you only have to convert the document into the annotated text format (using tools already available)
personally, i'd be interested in building annotated text converters so that people who want to check markup don't have to build their own (as was the case with prose-md)

annotated text is just a nice abstraction to allow that
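
As a rough idea of what such a converter could look like, here is a deliberately naive sketch for HTML-like input. The `Node` enum and `annotate` function are hypothetical names, not part of nlprule or LanguageTool, and a real converter would use a proper parser (this one ignores comments, CDATA, and `>` inside attribute values):

```rust
#[derive(Debug, PartialEq)]
enum Node {
    Text(String),
    Markup(String),
}

/// Naive splitter: everything from '<' up to the next '>' is markup,
/// everything else is text.
fn annotate(html: &str) -> Vec<Node> {
    let mut nodes = Vec::new();
    let mut rest = html;
    while !rest.is_empty() {
        let end = if rest.starts_with('<') {
            // Markup runs up to and including the next '>' (or to the end if unclosed).
            let end = rest.find('>').map_or(rest.len(), |i| i + 1);
            nodes.push(Node::Markup(rest[..end].to_string()));
            end
        } else {
            // Text runs up to the next '<' (or to the end of the input).
            let end = rest.find('<').unwrap_or(rest.len());
            nodes.push(Node::Text(rest[..end].to_string()));
            end
        };
        rest = &rest[end..];
    }
    nodes
}
```

Calling `annotate("A <b>test</b>")` yields the first four nodes of the example annotation from earlier in this thread.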

@mishushakov
Author

the workflow would look like this: convert into annotated text using a converter > check with nlprule

even better, one could take it a step further and build an HTTP server on top of nlprule
the HTTP server could then be used as a drop-in replacement for LanguageTool, which i think is a good thing, because it would drive more people towards this project

@bminixhofer
Owner

OK, thanks, I've had a look.

If there is a clean, simple implementation we can support AnnotatedText in the main library. I am currently not actively working on nlprule so a PR would be very welcome.

Regarding the HTTP server. That would be a nice tool but it's not something that should be in the main library, and not something I currently want to work on / maintain - but it would be a good fit for a separate package!

@mishushakov
Author

i'd start with an annotatedtext crate for building and parsing AnnotatedText
then i'd try spell-checking with nlprule and think about how to add it to the library

totally agree that http server shouldn't be included in the library
anyways, will report the progress here

@mishushakov
Author

mishushakov commented May 1, 2022

I have finished my AnnotatedText library for Rust and am ready to test it with nlprule
i will publish it as soon as i get them working together

i overlooked a bug in my original implementation:

texts[currentLine] = {text: node.text, offset}

should actually be

texts[currentLine].push({text: node.text, offset})

Result

texts = [
 [{ text: 'A ', offset: 0 }, { text: 'test', offset: 5 }],
 [{ text: 'Interpret as new line', offset: 16 }]
]

then you can get your sentences line-by-line

texts.map(line => line.map(part => part.text).join(''))

(i have updated my reference code above)

here's how you'd do the same thing using the Rust library

use std::str::FromStr;

fn main() {
    let example = r#"
    {"annotation": [
        {"text": "A "},
        {"markup": "<b>"},
        {"text": "test"},
        {"markup": "</b>"},
        {"markup": "<p>", "interpretAs": "\n\n"},
        {"text": "Interpret as new line"},
        {"markup": "</p>"}
    ]}"#;

    let annotation = lib::Annotation::from_str(example).unwrap();
    let result = annotation.to_text_tree();

    // join the text parts of the first line into one sentence
    let r = result[0]
        .iter()
        .cloned()
        .map(|r| r.text)
        .collect::<Vec<_>>()
        .join("");

    println!("{}", r)
}

also, i still don't know whether the offset should be expressed in bytes or in chars (currently it's in bytes), maybe you have an opinion on that?

@bminixhofer
Owner

Hi, sorry for the late response.

It is probably best to keep track of both and return a Position. That way it is compatible with the Python bindings / LT API (where counting in characters is natural) and with the Rust API (where counting in bytes is natural).
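
One way to read this suggestion, as a minimal sketch: the `Position` struct and the `position_at` helper below are hypothetical stand-ins written for illustration, not nlprule's actual types. The idea is to keep the byte offset for slicing Rust strings and derive the character offset alongside it.

```rust
/// A location expressed in both units (hypothetical stand-in type).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Position {
    byte_offset: usize, // natural for slicing a Rust &str
    char_offset: usize, // natural for Python bindings / the LT API
}

/// Builds a `Position` for a byte offset into `text`.
/// Returns `None` if the offset is out of range or splits a UTF-8 character.
fn position_at(text: &str, byte_offset: usize) -> Option<Position> {
    if !text.is_char_boundary(byte_offset) {
        return None;
    }
    Some(Position {
        byte_offset,
        char_offset: text[..byte_offset].chars().count(),
    })
}
```

Counting characters from the start of the text is linear in the offset; when mapping many positions it would be cheaper to track both counters while walking the annotation once.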

@mishushakov
Author

mishushakov commented May 16, 2022

Hey,
here's the code so far:

use std::{str::FromStr, collections::HashMap, ops::Range};
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Debug)]
pub struct Annotation {
    pub annotation: Vec<AnnotatedText>
}

#[derive(Serialize, Deserialize, Debug)]
pub struct AnnotatedText {
    pub text: Option<String>,
    pub markup: Option<String>,
    pub interpretAs: Option<String>
}

pub type AnnotatedTextMap = HashMap<usize, String>;

impl FromStr for Annotation {
    type Err = serde_json::error::Error;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        return serde_json::from_str(&s);
    }
}

impl ToString for Annotation {
    fn to_string(&self) -> String {
        let mut result: String = "".to_string();

        self.annotation.iter().for_each(|annotated_text| {
            match annotated_text {
                AnnotatedText { text: Some(text), ..} => result += &text,
                AnnotatedText { interpretAs: Some(interpretAs), ..} => result += &interpretAs,
                _ => ()
            }
        });

        return result;
    }
}

impl Annotation {
    pub fn to_original(&self) -> String {
        let mut result: String = "".to_string();

        self.annotation.iter().for_each(|annotated_text| {
            match annotated_text {
                AnnotatedText { text: Some(text), ..} => result += &text,
                AnnotatedText { markup: Some(markup), ..} => result += &markup,
                _ => ()
            }
        });

        return result;
    }

    pub fn to_text_map(&self) -> AnnotatedTextMap {
        let mut offset: usize = 0;
        let mut map = HashMap::new();
        let _terminator = String::from("\n\n");

        self.annotation.iter().for_each(|annotated_text| {
            match annotated_text {
                AnnotatedText { text: Some(text), ..} => {
                    map.insert( offset, text.clone());
                    offset += text.len()
                },
                // AnnotatedText { interpretAs: Some(_terminator), ..} => offset += _terminator.len(),
                AnnotatedText { markup: Some(markup), ..} => offset += markup.len(),
                _ => ()
            }
        });

        return map
    }

    pub fn find_original_position(&self, text_position: Range<usize>) -> Range<usize> {
        let text_map = self.to_text_map();
        let mut min_distance = usize::MAX;
        let mut best_match: usize = 0;

        text_map.iter().for_each(|(key, _value)| {
            let closest_position = *key;
            if text_position.start <= closest_position {
                let distance = closest_position - text_position.start;
                if distance < min_distance {
                    best_match = *key;
                    min_distance = distance;
                }
            }
        });

        return best_match + text_position.start..text_position.end
    }
}

i decided the best approach would be to just port the Java implementation to Rust

i'll leave the code here, so maybe someone can take it and reimplement the find_original_position function to find the correct range

the function converts the AnnotatedText into an offset map and, given a position in the plain text, returns the corresponding position in the original document

LanguageTool reference source can be found here: https://github.com/languagetool-org/languagetool/blob/b5f85984ea2fcbce8b64da1d88fc701528810a13/languagetool-core/src/main/java/org/languagetool/markup/AnnotatedText.java#L109-L141

the issue with find_original_position right now is that the end position is incorrect

i can't really fix it right now, because i don't know Java and don't really understand the algorithm; on top of that, i haven't completely figured out how borrowing works in Rust

@mishushakov
Author

I decided to give it another try yesterday

Changelog

  • B-tree instead of HashMap (sorted)
  • Closest position takes the sentence length into account for more precision
  • Some progress in end range calculation

What doesn't work

End range is still incorrect, example program output:

HTML: <h1>She was <span>not been here since </span><b>Monday</b></h1>
Text: She was not been here since Monday
Text Range: 4..16
Original Range: 8..20
Snippet: "was <span>no"

LanguageTool calculates the end range correctly taking <span> into account

Code

lib.rs

use std::{str::FromStr, collections::BTreeMap, ops::Range};
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Debug)]
pub struct Annotation {
    pub annotation: Vec<AnnotatedText>
}

#[derive(Serialize, Deserialize, Debug)]
pub struct AnnotatedText {
    pub text: Option<String>,
    pub markup: Option<String>,
    pub interpretAs: Option<String>
}

pub type AnnotatedTextMap = BTreeMap<usize, String>;

impl FromStr for Annotation {
    type Err = serde_json::error::Error;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        return serde_json::from_str(&s);
    }
}

impl ToString for Annotation {
    fn to_string(&self) -> String {
        let mut result: String = "".to_string();

        self.annotation.iter().for_each(|annotated_text| {
            match annotated_text {
                AnnotatedText { text: Some(text), ..} => result += &text,
                AnnotatedText { interpretAs: Some(interpretAs), ..} => result += &interpretAs,
                _ => ()
            }
        });

        return result;
    }
}

impl Annotation {
    pub fn to_original(&self) -> String {
        let mut result: String = "".to_string();

        self.annotation.iter().for_each(|annotated_text| {
            match annotated_text {
                AnnotatedText { text: Some(text), ..} => result += &text,
                AnnotatedText { markup: Some(markup), ..} => result += &markup,
                _ => ()
            }
        });

        return result;
    }

    pub fn to_text_map(&self) -> AnnotatedTextMap {
        let mut offset: usize = 0;
        let mut map = BTreeMap::new();
        let _terminator = String::from("\n\n");

        self.annotation.iter().for_each(|annotated_text| {
            match annotated_text {
                AnnotatedText { text: Some(text), ..} => {
                    map.insert(offset, text.clone());
                    offset += text.len()
                },
                // AnnotatedText { interpretAs: Some(_terminator), ..} => offset += _terminator.len(),
                AnnotatedText { markup: Some(markup), ..} => offset += markup.len(),
                _ => ()
            }
        });

        return map
    }

    pub fn find_original_position(&self, text_position: &Range<usize>) -> Range<usize> {
        let text_map = self.to_text_map();
        let mut min_distance = usize::MAX;
        let mut best_match: usize = 0;

        text_map.iter().for_each(|(key, value)| {
            let closest_position = *key + value.len();
            if text_position.start <= closest_position {
                let distance = closest_position - text_position.start;
                if distance < min_distance {
                    best_match = *key;
                    min_distance = distance;
                }
            }
        });

        return best_match + text_position.start..best_match + text_position.end
    }
}

main.rs

mod lib;
use std::str::FromStr;
use nlprule::{Tokenizer, Rules};

fn main () {
    // This example doesn't work correctly
    let example = r#"
    {"annotation": [
      {"markup": "<h1>"},
      {"text": "She was "},
      {"markup": "<span>"},
      {"text": "not been here since "},
      {"markup": "</span>"},
      {"markup": "<b>"},
      {"text": "Monday"},
      {"markup": "</b>"},
      {"markup": "</h1>"}
    ]}"#;

    // let example = r#"
    // {"annotation": [
    //   {"text": "She was "},
    //   {"text": "not been here since "},
    //   {"markup": "<b>"},
    //   {"text": "Monday"},
    //   {"markup": "</b>"}
    // ]}"#;

    let tokenizer = Tokenizer::new("./en_tokenizer.bin").unwrap();
    let rules = Rules::new("./en_rules.bin").unwrap();

    let annotation = lib::Annotation::from_str(&example).unwrap();
    let text = annotation.to_string();
    let original = annotation.to_original();
    let suggestions = rules.suggest(&text, &tokenizer);

    let original_range = suggestions[0].span().byte();
    let result = annotation.find_original_position(&original_range);
    println!("HTML: {}", &original);
    println!("Text: {}", &text);
    println!("Text Range: {:?}", &original_range);
    println!("Original Range: {:?}", &result);
    println!("Snippet: {:?}", &original[result])
}

The only unsolved problem right now is the end of the range
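
One possible way to get the end right, sketched under assumptions rather than ported from the LanguageTool source: map the start and the end of the range independently, each through the text node it falls into, instead of shifting both by the same amount. The `Node`, `Segment`, `segments`, and `to_original` names below are hypothetical and std-only (no serde), and all offsets are in bytes.

```rust
use std::ops::Range;

/// One annotation node: plain text, or markup with an optional `interpretAs`.
enum Node<'a> {
    Text(&'a str),
    Markup { markup: &'a str, interpret_as: Option<&'a str> },
}

/// Where a text node starts in the plain text and in the original document.
struct Segment {
    text_start: usize,
    orig_start: usize,
    len: usize,
}

fn segments(nodes: &[Node<'_>]) -> Vec<Segment> {
    let (mut text_pos, mut orig_pos) = (0, 0);
    let mut out = Vec::new();
    for node in nodes {
        match node {
            Node::Text(t) => {
                out.push(Segment { text_start: text_pos, orig_start: orig_pos, len: t.len() });
                text_pos += t.len();
                orig_pos += t.len();
            }
            Node::Markup { markup, interpret_as } => {
                // Markup only exists in the original; `interpretAs` only in the plain text.
                text_pos += interpret_as.map_or(0, |s| s.len());
                orig_pos += markup.len();
            }
        }
    }
    out
}

/// Maps a plain-text byte range to a byte range in the original document.
/// Returns `None` if either bound does not fall inside a text node
/// (for example, inside an `interpretAs` replacement).
fn to_original(nodes: &[Node<'_>], text_range: Range<usize>) -> Option<Range<usize>> {
    let segs = segments(nodes);
    // The start belongs to the segment containing the first byte of the range...
    let s = segs.iter().find(|s| {
        s.text_start <= text_range.start && text_range.start < s.text_start + s.len
    })?;
    // ...while the exclusive end belongs to the segment containing the last byte.
    let e = segs.iter().find(|s| {
        s.text_start < text_range.end && text_range.end <= s.text_start + s.len
    })?;
    let start = s.orig_start + (text_range.start - s.text_start);
    let end = e.orig_start + (text_range.end - e.text_start);
    Some(start..end)
}
```

For the `<h1>She was <span>not been here since </span>...` example, the text range 4..16 maps to 8..26 in the original, which covers `was <span>not been`. Empty ranges and ranges touching `interpretAs` text are left unhandled in this sketch.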
