Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make available the XPath path of the node where a SyntaxError occurred during Schema validation #3316

Merged

Conversation

ryanong
Copy link
Contributor

@ryanong ryanong commented Oct 15, 2024

What problem is this PR intended to solve?

When validating an XML document with a schema it is difficult to impossible to know where the error happened in the document. With the path on the error, we can identify where in the document the error occurred.

Have you included adequate test coverage?

I believe so. I updated the test in the schema to check to see if path errors were being returned correctly.

Does this change affect the behavior of either the C or the Java implementations?

This change the C implementation gently. Setting the path during error creation.
I looked if xerces-j could do it, but I couldn't find anything on the SAXException that would help.

@ryanong ryanong changed the title add syntax error node path return path where Nokogiri::XML::SyntaxError occurred Oct 15, 2024
@flavorjones
Copy link
Member

@ryanong Thanks. Looks like some rubocop failures and appveyor is showing test failures.

test/xml/test_schema.rb Show resolved Hide resolved
lib/nokogiri/xml/syntax_error.rb Outdated Show resolved Hide resolved
@ryanong ryanong force-pushed the support-syntax-error-node-path branch from 238a586 to d524c8d Compare October 15, 2024 20:37
@ryanong
Copy link
Contributor Author

ryanong commented Oct 15, 2024

It looks like path only returns when validating an in memory document and not a file. Should I document this in the path description?

Copy link
Member

@flavorjones flavorjones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like some DOM parser tests for SyntaxError#path, too. Specifically I'd like separate tests for XML and HTML4 to make sure those parsers' behavior is captured.

Don't worry about JRuby, I can spike on support and if I can't make it work I'll make the decision about whether to just document it as "libxml only".

@flavorjones
Copy link
Member

flavorjones commented Oct 15, 2024

It looks like path only returns when validating an in memory document and not a file. Should I document this in the path description?

That's strange, if you can commit a test for each (one passing, one failing) I will investigate to check if that's intended libxml2 behavior.

@flavorjones
Copy link
Member

I'd like some DOM parser tests for SyntaxError#path, too. Specifically I'd like separate tests for XML and HTML4 to make sure those parsers' behavior is captured.

Can you please also exercise in-memory and from-file parsing for these two parsers as well, just to capture the current behavior. Thanks.

@ryanong
Copy link
Contributor Author

ryanong commented Oct 16, 2024

It appears that html doesn't raise Nokogiri::XML::SyntaxError It uses Nokogiri::SyntaxError

I'll add some tests for doc parsing for xml and html

@ryanong
Copy link
Contributor Author

ryanong commented Oct 17, 2024

I think node is returning null because validate_file uses xmlSchemaValidateFile which uses sax parsing and doesn't have a node. There might be a way to get the path from the node->ctxt which is a xmlParserCtxt in sax parsing mode https://gnome.pages.gitlab.gnome.org/libxml2/devhelp/libxml2-tree.html#xmlParserCtxt

But this is way over might head

@flavorjones
Copy link
Member

@ryanong Can we reset for a minute and talk about your original motivation for this change? What does the path get you that the line/column info won't suffice? Is there another simpler way we can solve this problem?

@ryanong
Copy link
Contributor Author

ryanong commented Oct 17, 2024

I'm attempting to validate an XML file with an XSD and then associate the errors with the inputs that create the file.

users = [
  { id: "1", name: "bob" },
  { id: "2", name: "boboboboboboboboboobobobbob"},
]
doc = build_doc(users)
# Nokogiri::XML
# <users>
#   <user id="1">
#     <name>bob</name>
#   </user>
#   <user id="2">
#     <name>boboboboboboboboboobobobbob</name>
#   </user>
# </users>

xsd_errors = validate(doc)
# [Nokigiri::XML::SyntaxError.new(path: "/users/user[2]/name")] # name has too many characters 
mapped_errors = []
xsd_errors.each do |error|
  id = doc.at_xpath(error.path).xpath("ancestor::user")["id"]
  user = users.find { |u| u[:id] == id }
  mapped_errors << { error: error, parent: user }
end

That way I can return each error to associated user.

Is there a way to know what node there is at a given line a column? I imagine I could do that by walking the entire tree and indexing it and matching against that but this feels cleaner

Copy link
Member

@flavorjones flavorjones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. Sorry, it's been a long week for me and I'm just now understanding fully what you're trying to do and why it's not working for input files. All of this makes sense.

I'm fine with documenting that SyntaxError#path is populated only when there is an actual DOM object to reference ("best efforts") and that it doesn't work in JRuby.

In other places I've used a string like this to document differences between xerces/neko and libxml2/gumbo:

⚠ HTML5 functionality is not available when running JRuby.

so something like that is probably fine for this.

lib/nokogiri/xml/syntax_error.rb Outdated Show resolved Hide resolved
test/html4/test_document.rb Outdated Show resolved Hide resolved
test/xml/test_schema.rb Outdated Show resolved Hide resolved
@flavorjones flavorjones changed the title return path where Nokogiri::XML::SyntaxError occurred Make available the XPath path of the node where a SyntaxError occurred during Schema validation Oct 18, 2024
@flavorjones flavorjones force-pushed the support-syntax-error-node-path branch from 2018e8e to 5e579ee Compare October 18, 2024 16:58
@flavorjones flavorjones enabled auto-merge (squash) October 18, 2024 16:58
@flavorjones flavorjones self-requested a review October 18, 2024 16:59
auto-merge was automatically disabled October 18, 2024 17:24

Head branch was pushed to by a user without write access

@flavorjones flavorjones enabled auto-merge (squash) October 18, 2024 20:42
@flavorjones flavorjones merged commit cda444f into sparklemotion:main Oct 18, 2024
130 checks passed
@flavorjones
Copy link
Member

@ryanong Thank you again! Will be in the v1.17 release (hopefully soon).

flavorjones added a commit that referenced this pull request Oct 20, 2024
@flavorjones
Copy link
Member

Changelog updated in 1e06cd1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants