Ignore nodes? #210

mishushakov · 2022-04-27T22:43:37Z

hi, thanks for creating the library!
the issue i'm having is that the annotation sometimes includes the contents of <script>, <iframe> and <style> nodes as text

how could i interpret these nodes as markup, not text?

davidlday · 2022-04-28T08:06:13Z

Thanks for opening an issue on this. It's a good question. Unfortunately, I don't have a quick answer, but would be happy to investigate. Do you happen to have a sample HTML file you can share that I can use for testing? The underlying parser (rehype) is responsible for identifying what's text vs. markup, so I'll have to see if it's capable of what you're asking.

mishushakov · 2022-04-28T13:06:16Z

thanks for the quick reply!

here's a snippet

import * as annotated from 'annotatedtext-rehype'

const body = `
  <h1>Hello world!</h1>
  <script>alert("hello")</script>
  <style>h1 { color: red; }</style>
`

const a = annotated.build(body)
console.log(a)

produces the following annotation:

{
  annotation: [
    { interpretAs: '\n', markup: '\n  <h1>', offset: [Object] },
    { offset: [Object], text: 'Hello world!' },
    { interpretAs: '\n\n', markup: '</h1>', offset: [Object] },
    { offset: [Object], text: '\n  ' },
    { interpretAs: '', markup: '<script>', offset: [Object] },
    { offset: [Object], text: 'alert("hello")' },
    { interpretAs: '', markup: '</script>', offset: [Object] },
    { offset: [Object], text: '\n  ' },
    { interpretAs: '', markup: '<style>', offset: [Object] },
    { offset: [Object], text: 'h1 { color: red; }' },
    { interpretAs: '', markup: '</style>', offset: [Object] },
    { offset: [Object], text: '\n' },
    { interpretAs: '', markup: '', offset: [Object] }
  ]
}
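
One possible stopgap, until the parser itself can tell the two apart, is to post-process the annotation array and turn any text node that sits between an opening and closing <script>, <style> or <iframe> tag into markup. This is only a sketch: `demoteRawText` and `openedRawTag` are hypothetical helpers, not part of annotatedtext-rehype, and it assumes each raw tag's opening and closing markup arrive as separate annotation nodes, as in the output above. The `offset` fields are left out of the sample because their exact shape is not shown in the thread; any extra fields on a node are passed through unchanged.

```javascript
// Hypothetical helpers; not part of annotatedtext-rehype.

// Returns the tag name if `markup` ends with an opening <script>, <style>
// or <iframe> tag, otherwise null.
function openedRawTag(markup) {
  const match = /<(script|style|iframe)\b[^>]*>\s*$/i.exec(markup);
  return match ? match[1].toLowerCase() : null;
}

// Walks the annotation in order and converts text nodes found inside a
// raw tag into markup nodes with an empty interpretAs.
function demoteRawText(annotation) {
  let inside = null; // name of the raw tag we are currently inside, if any
  return annotation.map((node) => {
    if (typeof node.markup === 'string') {
      if (inside && new RegExp(`</${inside}\\s*>`, 'i').test(node.markup)) {
        inside = null; // the matching closing tag ends the raw region
      }
      const opened = openedRawTag(node.markup);
      if (opened) inside = opened;
      return node;
    }
    if (inside && typeof node.text === 'string') {
      const { text, ...rest } = node; // keep offset and any other fields
      return { ...rest, markup: text, interpretAs: '' };
    }
    return node;
  });
}

// Same nodes as the output above, minus the offset fields.
const sample = [
  { interpretAs: '\n', markup: '\n  <h1>' },
  { text: 'Hello world!' },
  { interpretAs: '\n\n', markup: '</h1>' },
  { text: '\n  ' },
  { interpretAs: '', markup: '<script>' },
  { text: 'alert("hello")' },
  { interpretAs: '', markup: '</script>' },
  { text: '\n  ' },
  { interpretAs: '', markup: '<style>' },
  { text: 'h1 { color: red; }' },
  { interpretAs: '', markup: '</style>' },
  { text: '\n' },
  { interpretAs: '', markup: '' },
];

const result = demoteRawText(sample);
console.log(result);
```

With the real library this would be applied to the `annotation` array returned by `annotated.build(body)`. The whitespace between `</script>` and `<style>` stays as text, since it is outside both raw regions.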

mishushakov · 2022-04-28T15:11:56Z

tree-sitter actually handles this correctly! they differentiate between text and raw_text

so, i'm thinking maybe i should write my own annotatedtext converter but using tree-sitter
it doesn't seem like a big task and the same code could probably work on markdown, yaml, etc

mishushakov · 2022-04-28T19:15:13Z

after 10 minutes of playing around i have made my own annotated-text converter using tree-sitter
example code:

const Parser = require('tree-sitter')
const HTML = require('tree-sitter-html')

const parser = new Parser()
parser.setLanguage(HTML)

const html = `<h1>Hello <b><i>W</i>rld</b></h1>
<p>This is a new line</p>
<style>.style {color: red} </style>`

const tree = parser.parse(html)

const annotated = {
  annotation: []
}

const recursive = (children) => {
  children.forEach(child => {
    if (child.type === 'start_tag' || child.type === 'end_tag'){
      annotated.annotation.push({markup: child.text})
    }

    if (child.type === 'text') {
      annotated.annotation.push({text: child.text})
    }

    if (child.type === 'element') {
      recursive(child.children)
    }
  })
}

recursive(tree.rootNode.children)
console.log(JSON.stringify(annotated))

produces

{"annotation":[{"markup":"<h1>"},{"text":"Hello "},{"markup":"<b>"},{"markup":"<i>"},{"text":"W"},{"markup":"</i>"},{"text":"rld"},{"markup":"</b>"},{"markup":"</h1>"},{"text":"\n"},{"markup":"<p>"},{"text":"This is a new line"},{"markup":"</p>"},{"text":"\n"}]}
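
Note that the <style> block is missing from this output entirely, tags included, because the walker only descends into nodes of type `element`. If the dropped span should instead be kept as markup (for example so that offsets into the original string still line up), a variant along these lines might work. It is a sketch under assumptions: the node type names `document`, `fragment`, `script_element` and `style_element` are my understanding of the tree-sitter-html grammar and are not confirmed in this thread (only `text` and `raw_text` are), and `annotate` and `mockTree` are hypothetical names. To keep the example runnable without native bindings, it walks a hand-built object with the same `type` / `text` / `children` fields the code above reads from tree-sitter nodes.

```javascript
// Node types that only group other nodes; the walker recurses into them.
// Names are assumed from the tree-sitter-html grammar.
const CONTAINERS = new Set([
  'document',
  'fragment',
  'element',
  'script_element',
  'style_element',
]);

// Flattens a tree into annotation nodes: `text` nodes stay text, every
// other non-container node (tags, raw_text, comments, ...) becomes markup.
function annotate(node, out = []) {
  if (CONTAINERS.has(node.type)) {
    for (const child of node.children) annotate(child, out);
  } else if (node.type === 'text') {
    out.push({ text: node.text });
  } else {
    out.push({ markup: node.text });
  }
  return out;
}

// Hand-built stand-in for a parsed tree, only for illustration.
const mockTree = {
  type: 'document',
  children: [
    {
      type: 'element',
      children: [
        { type: 'start_tag', text: '<h1>' },
        { type: 'text', text: 'Hello' },
        { type: 'end_tag', text: '</h1>' },
      ],
    },
    {
      type: 'style_element',
      children: [
        { type: 'start_tag', text: '<style>' },
        { type: 'raw_text', text: '.style {color: red} ' },
        { type: 'end_tag', text: '</style>' },
      ],
    },
  ],
};

const result = annotate(mockTree);
console.log(JSON.stringify({ annotation: result }));
```

With the real parser the call would presumably be `annotate(tree.rootNode)`. Treating everything that is not `text` as markup is a deliberate default: unknown node types get hidden from the checker rather than spell-checked by accident.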

davidlday · 2022-04-29T11:11:35Z

Very cool! Perhaps a set of annotatedtext packages based on tree-sitter would help, starting with annotatedtext-tree-sitter-html? Is that something you'd be interested in taking on? If not, I can add it to my to-do list.

mishushakov · 2022-04-29T15:56:22Z

yeah, sure!
do you want to start the repo?

i have also extended my code a little to add default new-line rules, but some things (like code) are still not handled

another cool possibility would be to add spelling correction for text strings in programming languages

prosemd is using that approach
although they skipped LanguageTool and annotations altogether in favour of nlprule

davidlday · 2022-05-08T23:30:40Z

I created a new repo: https://github.com/prosegrinder/annotatedtext-tree-sitter-html

Feel free to do a PR, but don't feel obligated. I have a pretty busy month, but will try to make time to at least get the repo prepped in the next day or two, including Actions.

Wasn't aware of nlprule or prosemd - both look very interesting. While I like LanguageTool, it feels a bit heavy. I've considered looking for a nice lightweight alternative. Will check them out.

mishushakov · 2022-05-29T14:56:31Z

hey,
i have finished the parser, but i'm not sure i want to continue developing on top of LanguageTool (for performance reasons)

thus, i'm currently trying to add AnnotatedText support to NLPRule

if you are willing to learn some Rust, feel free to check out the issue:
bminixhofer/nlprule#79

davidlday added enhancement New feature or request question Further information is requested labels Apr 28, 2022

davidlday self-assigned this Apr 28, 2022
