Ignore nodes? #210

Open
mishushakov opened this issue Apr 27, 2022 · 8 comments
Assignees
Labels
enhancement New feature or request question Further information is requested

Comments

@mishushakov

mishushakov commented Apr 27, 2022

hi, thanks for creating the library!
the issue i'm having is that the annotation sometimes includes the contents of <script>, <iframe> and <style> nodes as text

Screenshot 2022-04-28 at 00 41 03

how could i interpret these nodes as markup, not text?

@davidlday
Contributor

Thanks for opening an issue on this. It's a good question. Unfortunately, I don't have a quick answer, but would be happy to investigate. Do you happen to have a sample HTML file you can share that I can use for testing? The underlying parser (rehype) is responsible for identifying what's text vs. markup, so I'll have to see if it's capable of what you're asking.

@davidlday davidlday added enhancement New feature or request question Further information is requested labels Apr 28, 2022
@davidlday davidlday self-assigned this Apr 28, 2022
@mishushakov
Author

thanks for the quick reply!

here's a snippet

import * as annotated from 'annotatedtext-rehype'

const body = `
  <h1>Hello world!</h1>
  <script>alert("hello")</script>
  <style>h1 { color: red; }</style>
`

const a = annotated.build(body)
console.log(a)

This produces the following annotation:

{
  annotation: [
    { interpretAs: '\n', markup: '\n  <h1>', offset: [Object] },
    { offset: [Object], text: 'Hello world!' },
    { interpretAs: '\n\n', markup: '</h1>', offset: [Object] },
    { offset: [Object], text: '\n  ' },
    { interpretAs: '', markup: '<script>', offset: [Object] },
    { offset: [Object], text: 'alert("hello")' },
    { interpretAs: '', markup: '</script>', offset: [Object] },
    { offset: [Object], text: '\n  ' },
    { interpretAs: '', markup: '<style>', offset: [Object] },
    { offset: [Object], text: 'h1 { color: red; }' },
    { interpretAs: '', markup: '</style>', offset: [Object] },
    { offset: [Object], text: '\n' },
    { interpretAs: '', markup: '', offset: [Object] }
  ]
}
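
One possible workaround, sketched under assumptions: post-process the annotation array shown above and re-tag any text node that sits between an opening <script>, <style> or <iframe> tag and its closing tag as markup. The helper name `hideRawText` is hypothetical and is not part of annotatedtext-rehype. The sketch relies only on the `text`, `markup` and `interpretAs` fields visible in the output above and passes `offset` through untouched.

```javascript
// Hypothetical helper, not part of annotatedtext-rehype.
// Tags whose contents should never be treated as checkable text.
const RAW_TAGS = ['script', 'style', 'iframe'];

// Matches markup that ends with an opening raw tag, e.g. '<script>'.
const OPENING_RAW_TAG = new RegExp(
  '<(' + RAW_TAGS.join('|') + ')\\b[^>]*>\\s*$',
  'i'
);

function hideRawText(annotated) {
  let insideRaw = null; // name of the raw element we are currently inside

  const annotation = annotated.annotation.map((node) => {
    if (node.markup !== undefined) {
      // A closing tag for the current raw element ends the raw region.
      if (insideRaw && new RegExp('</' + insideRaw + '\\s*>', 'i').test(node.markup)) {
        insideRaw = null;
      }
      // An opening raw tag at the end of this markup starts a raw region.
      const opened = OPENING_RAW_TAG.exec(node.markup);
      if (opened) {
        insideRaw = opened[1].toLowerCase();
      }
      return node;
    }

    if (insideRaw) {
      // Re-tag the text as markup so a checker skips it.
      const { text, ...rest } = node;
      return { ...rest, markup: text, interpretAs: '' };
    }

    return node;
  });

  return { ...annotated, annotation };
}
```

Calling `hideRawText(annotated.build(body))` on the snippet above would leave 'Hello world!' and the whitespace nodes as text, while 'alert("hello")' and 'h1 { color: red; }' become markup entries with an empty `interpretAs`.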

@mishushakov
Author

tree-sitter actually handles this correctly! they differentiate between text and raw_text

Screenshot 2022-04-28 at 17 06 27

so, i'm thinking maybe i should write my own annotatedtext converter but using tree-sitter
it doesn't seem like a big task and the same code could probably work on markdown, yaml, etc

@mishushakov
Author

mishushakov commented Apr 28, 2022

after 10 minutes of playing around i have made my own annotated-text converter using tree-sitter
example code:

const Parser = require('tree-sitter')
const HTML = require('tree-sitter-html')

// set up a parser with the HTML grammar
const parser = new Parser()
parser.setLanguage(HTML)

const html = `<h1>Hello <b><i>W</i>rld</b></h1>
<p>This is a new line</p>
<style>.style {color: red} </style>`

const tree = parser.parse(html)

const annotated = {
  annotation: []
}

// walk the syntax tree and collect annotations in document order
const recursive = (children) => {
  children.forEach(child => {
    // opening and closing tags are markup
    if (child.type === 'start_tag' || child.type === 'end_tag'){
      annotated.annotation.push({markup: child.text})
    }

    // plain text is what should get checked
    if (child.type === 'text') {
      annotated.annotation.push({text: child.text})
    }

    // only descend into regular elements; other node types
    // (like the <style> element here) are skipped entirely
    if (child.type === 'element') {
      recursive(child.children)
    }
  })
}

recursive(tree.rootNode.children)
console.log(JSON.stringify(annotated))

produces

{"annotation":[{"markup":"<h1>"},{"text":"Hello "},{"markup":"<b>"},{"markup":"<i>"},{"text":"W"},{"markup":"</i>"},{"text":"rld"},{"markup":"</b>"},{"markup":"</h1>"},{"text":"\n"},{"markup":"<p>"},{"text":"This is a new line"},{"markup":"</p>"},{"text":"\n"}]}
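
In the output above the <style> element is dropped completely, tags included. A variation that keeps it in the annotation as markup might look like the sketch below. It assumes the tree-sitter-html grammar uses the node types `script_element`, `style_element`, `raw_text`, `self_closing_tag`, `comment` and `doctype`; only `raw_text` is confirmed in this thread, so check the others against the grammar version you install. The function name `annotate` is hypothetical. It only reads `type`, `text` and `children` from each node, so it should work with any node object that exposes those fields.

```javascript
// Node types assumed from the tree-sitter-html grammar (see the note above).
const CONTAINER_TYPES = new Set(['element', 'script_element', 'style_element']);
const MARKUP_TYPES = new Set([
  'start_tag',
  'end_tag',
  'self_closing_tag',
  'raw_text', // contents of <script> and <style>
  'comment',
  'doctype'
]);

// Hypothetical walker: collects annotations in document order.
function annotate(node, annotation = []) {
  for (const child of node.children) {
    if (MARKUP_TYPES.has(child.type)) {
      // Tags and raw script/style contents are hidden from the checker.
      annotation.push({ markup: child.text, interpretAs: '' });
    } else if (child.type === 'text') {
      annotation.push({ text: child.text });
    } else if (CONTAINER_TYPES.has(child.type)) {
      annotate(child, annotation);
    }
  }
  return annotation;
}
```

With the snippet above, this would be called as `const annotated = { annotation: annotate(tree.rootNode) }`.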

@davidlday
Contributor

Very cool! Perhaps a set of annotatedtext packages based on tree-sitter would help, starting with annotatedtext-tree-sitter-html? Is that something you'd be interested in taking on? If not, I can add it to my to-do list.

@mishushakov
Author

mishushakov commented Apr 29, 2022

yeah, sure!
do you want to start the repo?

i have also extended my code a little to add default new-line rules, but some things (like code) are still not handled
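
The extended code is not shown in this thread. As a rough illustration only, default new-line rules could be expressed as a function that maps each markup string to an `interpretAs` value, similar to the '\n\n' that annotatedtext-rehype emits after '</h1>' in the earlier output. The names `interpretMarkup` and `withNewlineRules` and the list of block-level tags are hypothetical choices, not the author's implementation.

```javascript
// Hypothetical rule table: closing one of these tags ends a paragraph.
const BLOCK_TAGS = new Set([
  'p', 'div', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
  'li', 'ul', 'ol', 'blockquote', 'pre', 'table', 'tr'
]);

// Returns the string a checker should see in place of the markup.
function interpretMarkup(markup) {
  const match = /^<(\/?)([a-zA-Z][a-zA-Z0-9]*)/.exec(markup);
  if (!match) {
    return ''; // not a tag (for example raw script/style contents)
  }
  const isClosing = match[1] === '/';
  const tag = match[2].toLowerCase();
  if (tag === 'br') {
    return '\n'; // line break
  }
  if (isClosing && BLOCK_TAGS.has(tag)) {
    return '\n\n'; // paragraph break after a block-level element
  }
  return ''; // inline tags and opening tags add nothing
}

// Adds interpretAs to every markup node, leaving text nodes untouched.
function withNewlineRules(annotation) {
  return annotation.map((node) =>
    node.markup === undefined
      ? node
      : { ...node, interpretAs: interpretMarkup(node.markup) }
  );
}
```

Applied to an annotation array of `{markup}` and `{text}` nodes, `withNewlineRules` returns a new array where only the markup nodes gain an `interpretAs` field.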

another cool possibility would be to add spelling correction for text strings in programming languages

prosemd is using that approach
although they skipped LanguageTool and annotations altogether in favour of nlprule

@davidlday
Contributor

I created a new repo: https://github.com/prosegrinder/annotatedtext-tree-sitter-html

Feel free to do a PR, but don't feel obligated. I have a pretty busy month, but will try to make time to at least get the repo prepped in the next day or two, including Actions.

Wasn't aware of nlprule or prosemd - both look very interesting. While I like LanguageTool, it feels a bit heavy. I've considered looking for a nice lightweight alternative. Will check them out.

@mishushakov
Author

hey,
i have finished the parser, but i'm not sure i want to continue developing on top of LanguageTool (for performance reasons)

thus, i'm currently trying to add AnnotatedText support to NLPRule

if you are willing to learn some Rust, feel free to check out the issue:
bminixhofer/nlprule#79
