Ignore nodes? #210
Thanks for opening an issue on this. It's a good question. Unfortunately, I don't have a quick answer, but would be happy to investigate. Do you happen to have a sample HTML file you can share that I can use for testing? The underlying parser (rehype) is responsible for identifying what's text vs. markup, so I'll have to see if it's capable of what you're asking.
Thanks for the quick reply! Here's a snippet:
import * as annotated from 'annotatedtext-rehype'
const body = `
<h1>Hello world!</h1>
<script>alert("hello")</script>
<style>h1 { color: red; }</style>
`
const a = annotated.build(body)
console.log(a)
This produces the following annotation:
{
annotation: [
{ interpretAs: '\n', markup: '\n <h1>', offset: [Object] },
{ offset: [Object], text: 'Hello world!' },
{ interpretAs: '\n\n', markup: '</h1>', offset: [Object] },
{ offset: [Object], text: '\n ' },
{ interpretAs: '', markup: '<script>', offset: [Object] },
{ offset: [Object], text: 'alert("hello")' },
{ interpretAs: '', markup: '</script>', offset: [Object] },
{ offset: [Object], text: '\n ' },
{ interpretAs: '', markup: '<style>', offset: [Object] },
{ offset: [Object], text: 'h1 { color: red; }' },
{ interpretAs: '', markup: '</style>', offset: [Object] },
{ offset: [Object], text: '\n' },
{ interpretAs: '', markup: '', offset: [Object] }
]
}
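One possible workaround, without touching the parser, is to post-process the annotation and relabel any text that sits between an opening and closing `<script>`, `<style>`, or `<iframe>` tag as markup. This is only a sketch: `hideRawText`, `openingRawTag`, and `RAW_TAGS` are hypothetical names, not part of annotatedtext-rehype, and it assumes the annotation has the shape shown above (nodes with either `text` or `markup`).

```javascript
// Tags whose contents should be hidden from the checker (illustrative list).
const RAW_TAGS = ['script', 'style', 'iframe']

// Returns the lower-cased tag name if `markup` ends with an opening raw tag, else null.
const openingRawTag = (markup) => {
  const pattern = new RegExp(`<(${RAW_TAGS.join('|')})\\b[^>]*>\\s*$`, 'i')
  const match = pattern.exec(markup)
  return match ? match[1].toLowerCase() : null
}

// Returns a new annotation array in which text inside raw tags becomes markup.
const hideRawText = (annotation) => {
  let insideTag = null
  return annotation.map((node) => {
    if (node.markup !== undefined) {
      if (insideTag && new RegExp(`</${insideTag}\\s*>`, 'i').test(node.markup)) {
        insideTag = null // the matching closing tag ends the raw section
      } else if (!insideTag) {
        insideTag = openingRawTag(node.markup)
      }
      return node
    }
    if (insideTag && node.text !== undefined) {
      // Keep offset and other fields, but move the text into `markup`.
      const { text, ...rest } = node
      return { ...rest, markup: text, interpretAs: '' }
    }
    return node
  })
}
```

With the snippet above, this would be applied as `hideRawText(annotated.build(body).annotation)`.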
After 10 minutes of playing around, I have made my own annotated-text converter using tree-sitter:
const Parser = require('tree-sitter')
const HTML = require('tree-sitter-html')
const parser = new Parser()
parser.setLanguage(HTML)
const html = `<h1>Hello <b><i>W</i>rld</b></h1>
<p>This is a new line</p>
<style>.style {color: red} </style>`
const tree = parser.parse(html)
const annotated = {
  annotation: []
}
// Walk the syntax tree: tags become markup, text nodes become text.
const recursive = (children) => {
  children.forEach(child => {
    if (child.type === 'start_tag' || child.type === 'end_tag') {
      annotated.annotation.push({markup: child.text})
    }
    if (child.type === 'text') {
      annotated.annotation.push({text: child.text})
    }
    if (child.type === 'element') {
      recursive(child.children)
    }
  })
}
recursive(tree.rootNode.children)
console.log(JSON.stringify(annotated))
This produces:
{"annotation":[{"markup":"<h1>"},{"text":"Hello "},{"markup":"<b>"},{"markup":"<i>"},{"text":"W"},{"markup":"</i>"},{"text":"rld"},{"markup":"</b>"},{"markup":"</h1>"},{"text":"\n"},{"markup":"<p>"},{"text":"This is a new line"},{"markup":"</p>"},{"text":"\n"}]}
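Note that the `<style>` block is missing from the output entirely, presumably because the walker only descends into nodes of type `element`. A variant that keeps those blocks but marks their contents as markup could look like the sketch below. It assumes the tree-sitter-html grammar exposes `script_element`, `style_element`, `self_closing_tag`, and `raw_text` node types; `toAnnotation` is a hypothetical name. It only relies on `type`, `text`, and `children`, so it can be tried on plain objects as well as real tree-sitter nodes.

```javascript
// Node types that wrap tags and children.
// 'script_element' and 'style_element' are assumed from the tree-sitter-html grammar.
const CONTAINER_TYPES = new Set(['element', 'script_element', 'style_element'])

// Node types emitted as markup. 'raw_text' is assumed to be the body of
// <script>/<style>, so it is hidden from the text checker.
const MARKUP_TYPES = new Set(['start_tag', 'end_tag', 'self_closing_tag', 'raw_text'])

// Recursively flattens syntax nodes into an annotation array.
const toAnnotation = (nodes, annotation = []) => {
  for (const node of nodes) {
    if (MARKUP_TYPES.has(node.type)) {
      annotation.push({ markup: node.text })
    } else if (node.type === 'text') {
      annotation.push({ text: node.text })
    } else if (CONTAINER_TYPES.has(node.type)) {
      toAnnotation(node.children, annotation)
    }
    // Other node types (comments, doctype, ...) are skipped in this sketch.
  }
  return annotation
}
```

With the parser set up as above, it would be called as `toAnnotation(tree.rootNode.children)`. Returning a fresh array instead of pushing to a shared `annotated` object makes the function reusable across documents.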
Very cool! Perhaps a set of
Yeah, sure! I have also extended my code a little to add default new-line rules, but some things (like code) are still not handled. Another cool possibility would be to add spelling correction for text strings in programming languages; prosemd is using that approach.
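For illustration, default new-line rules of this kind could be sketched as below. This is not the code referred to in the comment above, and it does not reproduce the rules annotatedtext-rehype uses; `BLOCK_TAGS`, `interpretMarkup`, and `withNewlines` are hypothetical names, and the tag list is an assumption. Closing block-level tags are interpreted as a paragraph break, `<br>` as a single line break, and everything else as nothing.

```javascript
// Block-level tags whose closing tag should read as a paragraph break (illustrative list).
const BLOCK_TAGS = new Set(['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p', 'div', 'li', 'blockquote'])

// Maps a single tag string to the text it should be interpreted as.
const interpretMarkup = (markup) => {
  const match = /^<(\/?)([a-zA-Z][a-zA-Z0-9]*)/.exec(markup.trim())
  if (!match) return ''
  const [, slash, name] = match
  const tag = name.toLowerCase()
  if (tag === 'br') return '\n'
  if (slash && BLOCK_TAGS.has(tag)) return '\n\n'
  return ''
}

// Adds an `interpretAs` field to every markup node; text nodes are left as they are.
const withNewlines = (annotation) =>
  annotation.map((node) =>
    node.markup === undefined
      ? node
      : { ...node, interpretAs: interpretMarkup(node.markup) }
  )
```

Applied to the tree-sitter output above, `withNewlines(annotated.annotation)` would mark `</h1>` and `</p>` as paragraph breaks while inline tags such as `<b>` and `<i>` stay empty.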
I created a new repo: https://github.com/prosegrinder/annotatedtext-tree-sitter-html
Feel free to open a PR, but don't feel obligated. I have a pretty busy month, but will try to make time to at least get the repo prepped in the next day or two, including Actions. I wasn't aware of nlprule or prosemd; both look very interesting. While I like LanguageTool, it feels a bit heavy. I've considered looking for a nice lightweight alternative. Will check them out.
Hey, I'm currently trying to add AnnotatedText support to NLPRule. If you are willing to learn some Rust, feel free to check out the issue:
Hi, thanks for creating the library!
The issue I'm having is that the annotation sometimes includes the contents of <script>, <iframe>, and <style> nodes.
How could I interpret these nodes as markup, not text?