Support for AnnotatedText #79
I've managed to actually crack the code (I think), and it's really easy. The only thing remaining is that you should iterate over the texts line by line and add each text's offset to nlprule's offset. Here it is implemented in JS:

```js
let annotatedText = {"annotation": [
  {"text": "A "},
  {"markup": "<b>"},
  {"text": "test"},
  {"markup": "</b>"},
  {"markup": "<p>", "interpretAs": "\n\n"},
  {"text": "Interpret as new line"},
  {"markup": "</p>"}
]}

let offset = 0
let currentLine = 0
let texts = []
let result = ''

annotatedText.annotation.forEach((node) => {
  // result is only for debugging
  result += node.text ? node.text : node.markup
  // a paragraph break starts a new line group
  if (node.interpretAs === '\n\n') {
    currentLine++
  }
  if (node.text) {
    if (!texts[currentLine]) texts[currentLine] = []
    texts[currentLine].push({text: node.text, offset})
  }
  // offsets count positions in the original (marked-up) document
  offset += node.text ? node.text.length : node.markup.length
})

console.log(texts)
console.log(result)
```

Here's the result; you can check that the offsets are correct:

```js
[
  [{ text: 'A ', offset: 0 }, { text: 'test', offset: 5 }],
  [{ text: 'Interpret as new line', offset: 16 }]
]
```
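The same grouping could be sketched in Rust roughly like this. This is only a sketch of the algorithm above; `Node`, `Piece`, and `group_by_line` are made-up names, not part of nlprule or any crate:

```rust
// Sketch of the JS line-grouping algorithm in Rust. All names here
// are hypothetical, invented for this illustration.

#[derive(Debug)]
struct Node {
    text: Option<&'static str>,
    markup: Option<&'static str>,
    interpret_as: Option<&'static str>,
}

#[derive(Debug)]
struct Piece {
    text: String,
    offset: usize,
}

fn group_by_line(nodes: &[Node]) -> Vec<Vec<Piece>> {
    let mut lines: Vec<Vec<Piece>> = vec![Vec::new()];
    let mut offset = 0;
    for node in nodes {
        // A paragraph break in the markup starts a new line group.
        if node.interpret_as == Some("\n\n") {
            lines.push(Vec::new());
        }
        if let Some(text) = node.text {
            lines.last_mut().unwrap().push(Piece {
                text: text.to_string(),
                offset,
            });
        }
        // Offsets count bytes in the *original* document,
        // so markup lengths are added too.
        offset += node.text.or(node.markup).map_or(0, |s| s.len());
    }
    lines
}

fn main() {
    let nodes = [
        Node { text: Some("A "), markup: None, interpret_as: None },
        Node { text: None, markup: Some("<b>"), interpret_as: None },
        Node { text: Some("test"), markup: None, interpret_as: None },
        Node { text: None, markup: Some("</b>"), interpret_as: None },
        Node { text: None, markup: Some("<p>"), interpret_as: Some("\n\n") },
        Node { text: Some("Interpret as new line"), markup: None, interpret_as: None },
        Node { text: None, markup: Some("</p>"), interpret_as: None },
    ];
    let lines = group_by_line(&nodes);
    assert_eq!(lines[0][1].offset, 5);  // "test" starts after "A <b>"
    assert_eq!(lines[1][0].offset, 16); // second line after "</b><p>"
    println!("{} line groups", lines.len());
}
```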
I'm thinking of translating this into Rust now, but I'd need to make sure about the edge cases. |
Hi, I'm not familiar with the AnnotatedText format, can you link some resource? Is this specific to LanguageTool? In principle this does sound like a good feature though. Thanks for the sample implementation, it does seem pretty straightforward. |
hi! https://languagetool.org/http-api/#!/default/post_check

The annotated text feature in LanguageTool allows you to check documents with markup (HTML/Word/Markdown) without writing parsers. You only have to convert the text into the annotated text format (using tools already available). Annotated text is just a nice abstraction to allow that. |
The workflow would look like this: convert into annotated text using a converter > check with nlprule. What's even better is that one could take it one step further and build an HTTP server on top of nlprule. |
OK, thanks, I've had a look. If there is a clean, simple implementation we can support AnnotatedText in the main library. I am currently not actively working on nlprule, so a PR would be very welcome. Regarding the HTTP server: that would be a nice tool, but it's not something that should be in the main library, and not something I currently want to work on / maintain - but it would be a good fit for a separate package! |
I'd start an annotatedtext crate for building and parsing AnnotatedText. Totally agree that the HTTP server shouldn't be included in the library. |
I have finished my AnnotatedText library for Rust and am ready to make tests using nlprule. In the original implementation I did overlook the code a bit:

```js
texts[currentLine] = {text: node.text, offset}
```

should actually be

```js
texts[currentLine].push({text: node.text, offset})
```

Result:

```js
texts = [
  [{ text: 'A ', offset: 0 }, { text: 'test', offset: 5 }],
  [{ text: 'Interpret as new line', offset: 16 }]
]
```

Then you can get your sentences line by line.
(I have updated my reference code above.) Here's how you'd do the same thing using the Rust library:

```rust
fn main() {
    let example = r#"
    {"annotation": [
        {"text": "A "},
        {"markup": "<b>"},
        {"text": "test"},
        {"markup": "</b>"},
        {"markup": "<p>", "interpretAs": "\n\n"},
        {"text": "Interpret as new line"},
        {"markup": "</p>"}
    ]}"#;
    let annotation = lib::Annotation::from_str(example).unwrap();
    let result = annotation.to_text_tree();
    let r = result[0]
        .iter()
        .cloned()
        .map(|r| r.text)
        .collect::<Vec<_>>()
        .join("");
    println!("{}", r)
}
```

Also, I still don't know whether the offset should be expressed in bytes or in chars (currently it's in bytes); maybe you have an opinion on that? |
Hi, sorry for the late response. It is probably best to keep track of both and return a Position. That way it is compatible with the Python bindings / LT API (where counting in characters is natural) and with the Rust API (where counting in bytes is natural). |
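Tracking both could look something like this. A minimal sketch under my own assumptions; the `Position` struct and `advance` helper here are hypothetical illustrations, not the actual nlprule types:

```rust
// Hypothetical sketch: track byte and char offsets together while
// walking the annotation, so both the Rust API (bytes) and the
// LT/Python API (chars) can be served from the same position.

#[derive(Debug, Clone, Copy, PartialEq)]
struct Position {
    byte: usize,
    char: usize,
}

fn advance(pos: Position, s: &str) -> Position {
    Position {
        byte: pos.byte + s.len(),           // UTF-8 bytes
        char: pos.char + s.chars().count(), // Unicode scalar values
    }
}

fn main() {
    let start = Position { byte: 0, char: 0 };
    // "naïve" is 6 bytes in UTF-8 but only 5 chars.
    let end = advance(start, "naïve");
    assert_eq!(end, Position { byte: 6, char: 5 });
    println!("{:?}", end);
}
```

The two counts only diverge on non-ASCII input, which is exactly where a chars-only implementation would silently misalign byte-based Rust ranges.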
Hey, I decided the best approach would be to just port the Java implementation to Rust. I'll leave the code here so maybe someone could take it and reimplement it. The function takes an `AnnotatedText`, converts it to an offset map, and returns the original position relative to the plain-text position.

```rust
use std::{str::FromStr, collections::HashMap, ops::Range};
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Debug)]
pub struct Annotation {
    pub annotation: Vec<AnnotatedText>,
}

#[allow(non_snake_case)] // keep the JSON field name `interpretAs`
#[derive(Serialize, Deserialize, Debug)]
pub struct AnnotatedText {
    pub text: Option<String>,
    pub markup: Option<String>,
    pub interpretAs: Option<String>,
}

pub type AnnotatedTextMap = HashMap<usize, String>;

impl FromStr for Annotation {
    type Err = serde_json::error::Error;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        serde_json::from_str(s)
    }
}

impl ToString for Annotation {
    // The plain text as the checker sees it: text nodes plus `interpretAs`.
    fn to_string(&self) -> String {
        let mut result = String::new();
        self.annotation.iter().for_each(|annotated_text| {
            match annotated_text {
                AnnotatedText { text: Some(text), .. } => result += text,
                AnnotatedText { interpretAs: Some(interpret_as), .. } => result += interpret_as,
                _ => (),
            }
        });
        result
    }
}

impl Annotation {
    // The original document: text nodes plus markup.
    pub fn to_original(&self) -> String {
        let mut result = String::new();
        self.annotation.iter().for_each(|annotated_text| {
            match annotated_text {
                AnnotatedText { text: Some(text), .. } => result += text,
                AnnotatedText { markup: Some(markup), .. } => result += markup,
                _ => (),
            }
        });
        result
    }

    // Maps each text node's offset in the original document to its content.
    pub fn to_text_map(&self) -> AnnotatedTextMap {
        let mut offset: usize = 0;
        let mut map = HashMap::new();
        let _terminator = String::from("\n\n");
        self.annotation.iter().for_each(|annotated_text| {
            match annotated_text {
                AnnotatedText { text: Some(text), .. } => {
                    map.insert(offset, text.clone());
                    offset += text.len()
                }
                // AnnotatedText { interpretAs: Some(_terminator), .. } => offset += _terminator.len(),
                AnnotatedText { markup: Some(markup), .. } => offset += markup.len(),
                _ => (),
            }
        });
        map
    }

    pub fn find_original_position(&self, text_position: Range<usize>) -> Range<usize> {
        let text_map = self.to_text_map();
        let mut min_distance = usize::MAX;
        let mut best_match: usize = 0;
        text_map.iter().for_each(|(key, _value)| {
            let closest_position = *key;
            if text_position.start <= closest_position {
                let distance = closest_position - text_position.start;
                if distance < min_distance {
                    best_match = *key;
                    min_distance = distance;
                }
            }
        });
        best_match + text_position.start..text_position.end
    }
}
```

The LanguageTool reference source can be found here: https://github.com/languagetool-org/languagetool/blob/b5f85984ea2fcbce8b64da1d88fc701528810a13/languagetool-core/src/main/java/org/languagetool/markup/AnnotatedText.java#L109-L141

The issue is I can't really fix it right now, because I don't know Java and don't really understand the algorithm; on top of that, I haven't figured out completely how borrowing works in Rust. |
I've decided to give it another try yesterday.

Changelog (compared to the previous version): the offset map is now a `BTreeMap`, `find_original_position` takes the range by reference, and the closest position now accounts for each text segment's length.

What doesn't work: the end range is still incorrect (see the example program output); LanguageTool calculates the end range correctly.

Code:
```rust
// lib.rs
use std::{str::FromStr, collections::BTreeMap, ops::Range};
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Debug)]
pub struct Annotation {
    pub annotation: Vec<AnnotatedText>,
}

#[allow(non_snake_case)] // keep the JSON field name `interpretAs`
#[derive(Serialize, Deserialize, Debug)]
pub struct AnnotatedText {
    pub text: Option<String>,
    pub markup: Option<String>,
    pub interpretAs: Option<String>,
}

pub type AnnotatedTextMap = BTreeMap<usize, String>;

impl FromStr for Annotation {
    type Err = serde_json::error::Error;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        serde_json::from_str(s)
    }
}

impl ToString for Annotation {
    // The plain text as the checker sees it: text nodes plus `interpretAs`.
    fn to_string(&self) -> String {
        let mut result = String::new();
        self.annotation.iter().for_each(|annotated_text| {
            match annotated_text {
                AnnotatedText { text: Some(text), .. } => result += text,
                AnnotatedText { interpretAs: Some(interpret_as), .. } => result += interpret_as,
                _ => (),
            }
        });
        result
    }
}

impl Annotation {
    // The original document: text nodes plus markup.
    pub fn to_original(&self) -> String {
        let mut result = String::new();
        self.annotation.iter().for_each(|annotated_text| {
            match annotated_text {
                AnnotatedText { text: Some(text), .. } => result += text,
                AnnotatedText { markup: Some(markup), .. } => result += markup,
                _ => (),
            }
        });
        result
    }

    // Maps each text node's offset in the original document to its content.
    pub fn to_text_map(&self) -> AnnotatedTextMap {
        let mut offset: usize = 0;
        let mut map = BTreeMap::new();
        let _terminator = String::from("\n\n");
        self.annotation.iter().for_each(|annotated_text| {
            match annotated_text {
                AnnotatedText { text: Some(text), .. } => {
                    map.insert(offset, text.clone());
                    offset += text.len()
                }
                // AnnotatedText { interpretAs: Some(_terminator), .. } => offset += _terminator.len(),
                AnnotatedText { markup: Some(markup), .. } => offset += markup.len(),
                _ => (),
            }
        });
        map
    }

    pub fn find_original_position(&self, text_position: &Range<usize>) -> Range<usize> {
        let text_map = self.to_text_map();
        let mut min_distance = usize::MAX;
        let mut best_match: usize = 0;
        text_map.iter().for_each(|(key, value)| {
            let closest_position = *key + value.len();
            if text_position.start <= closest_position {
                let distance = closest_position - text_position.start;
                if distance < min_distance {
                    best_match = *key;
                    min_distance = distance;
                }
            }
        });
        best_match + text_position.start..best_match + text_position.end
    }
}
```
```rust
// main.rs
mod lib;

use std::str::FromStr;
use nlprule::{Tokenizer, Rules};

fn main() {
    // This example doesn't work correctly
    let example = r#"
    {"annotation": [
        {"markup": "<h1>"},
        {"text": "She was "},
        {"markup": "<span>"},
        {"text": "not been here since "},
        {"markup": "</span>"},
        {"markup": "<b>"},
        {"text": "Monday"},
        {"markup": "</b>"},
        {"markup": "</h1>"}
    ]}"#;
    // let example = r#"
    // {"annotation": [
    //     {"text": "She was "},
    //     {"text": "not been here since "},
    //     {"markup": "<b>"},
    //     {"text": "Monday"},
    //     {"markup": "</b>"}
    // ]}"#;
    let tokenizer = Tokenizer::new("./en_tokenizer.bin").unwrap();
    let rules = Rules::new("./en_rules.bin").unwrap();
    let annotation = lib::Annotation::from_str(example).unwrap();
    let text = annotation.to_string();
    let original = annotation.to_original();
    let suggestions = rules.suggest(&text, &tokenizer);
    let original_range = suggestions[0].span().byte();
    let result = annotation.find_original_position(&original_range);
    println!("HTML: {}", original);
    println!("Text: {}", text);
    println!("Text Range: {:?}", original_range);
    println!("Original Range: {:?}", result);
    println!("Snippet: {:?}", &original[result])
}
```

The only unsolved question right now is the end range. |
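One possible direction for the end-range problem, sketched under my own assumptions (this is not nlprule or annotatedtext API; `Segment`, `map_offset`, and `map_range` are invented names): record, for every text node, both its plain-text offset and its original offset, then map the start and end of a range independently instead of shifting both ends by the same `best_match`.

```rust
// Hypothetical sketch: a segment table mapping plain-text offsets
// to original-document offsets, with start and end of a range
// mapped separately.

use std::ops::Range;

/// One text node: where it starts in the plain text and in the original.
struct Segment {
    plain: usize,
    original: usize,
}

fn map_offset(segments: &[Segment], plain_pos: usize) -> usize {
    // Use the last segment starting at or before plain_pos; the
    // remainder is an offset within that segment. (A full version
    // would also clamp to the segment's length.)
    let mut best = 0;
    for seg in segments {
        if seg.plain <= plain_pos {
            best = seg.original + (plain_pos - seg.plain);
        }
    }
    best
}

fn map_range(segments: &[Segment], r: Range<usize>) -> Range<usize> {
    map_offset(segments, r.start)..map_offset(segments, r.end)
}

fn main() {
    // Plain text: "A test"
    // Original:   "A <b>test</b>"
    let segments = [
        Segment { plain: 0, original: 0 }, // "A "
        Segment { plain: 2, original: 5 }, // "test"
    ];
    // "test" is 2..6 in the plain text and 5..9 in the original.
    let mapped = map_range(&segments, 2..6);
    assert_eq!(mapped, 5..9);
    println!("{:?}", mapped);
}
```

Because the end offset is looked up on its own, a match whose end falls in a later segment than its start still lands on the right original position, which is the case the single-`best_match` version gets wrong.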
hey, thanks for this awesome project!
Would you consider adding AnnotatedText support?
This would allow nlprule to be used to spell-check Markdown/Word/HTML/etc. documents converted to the AnnotatedText format (supported by LanguageTool).
Right now I'm thinking about how it could be done, but I can't quite figure out how LanguageTool can spell-check while ignoring the markup but then map the ranges back to the original document.