Updates HTML5::Document.parse with keyword arguments; #3334

Merged
merged 3 commits into from
Dec 6, 2024
28 changes: 17 additions & 11 deletions lib/nokogiri/html5.rb
@@ -46,16 +46,22 @@ def self.HTML5(...)
# The document and fragment parsing methods support options that are different from
# Nokogiri::HTML4::Document or Nokogiri::XML::Document.
#
# - <tt>Nokogiri.HTML5(html, url = nil, encoding = nil, **options)</tt>
# - <tt>Nokogiri::HTML5.parse(html, url = nil, encoding = nil, **options)</tt>
# - <tt>Nokogiri::HTML5::Document.parse(html, url = nil, encoding = nil, **options)</tt>
# - <tt>Nokogiri::HTML5.fragment(html, encoding = nil, **options)</tt>
# - <tt>Nokogiri::HTML5::DocumentFragment.parse(html, encoding = nil, **options)</tt>
# - <tt>Nokogiri.HTML5(html, url:, encoding:, **parse_options)</tt>
# - <tt>Nokogiri::HTML5.parse(html, url:, encoding:, **parse_options)</tt>
# - <tt>Nokogiri::HTML5::Document.parse(html, url:, encoding:, **parse_options)</tt>
# - <tt>Nokogiri::HTML5.fragment(html, encoding = nil, **parse_options)</tt>
# - <tt>Nokogiri::HTML5::DocumentFragment.parse(html, encoding = nil, **parse_options)</tt>
#
# The four currently supported options are +:max_errors+, +:max_tree_depth+, +:max_attributes+,
# and +:parse_noscript_content_as_text+ described below.
# The four currently supported parse options are
#
# === Error reporting
# - +max_errors:+ (Integer, default 0) Maximum number of parse errors to report in HTML5::Document#errors.
# - +max_tree_depth:+ (Integer, default +Nokogiri::Gumbo::DEFAULT_MAX_TREE_DEPTH+) Maximum tree depth to parse.
# - +max_attributes:+ (Integer, default +Nokogiri::Gumbo::DEFAULT_MAX_ATTRIBUTES+) Maximum number of attributes to parse per element.
# - +parse_noscript_content_as_text:+ (Boolean, default false) When enabled, parse +noscript+ tag content as text, mimicking the behavior of web browsers.
#
# These options are explained in the following sections.
#
# === Error reporting: +max_errors:+
#
# Nokogiri contains an experimental HTML5 parse error reporting facility. By default, no parse
# errors are reported but this can be configured by passing the +:max_errors+ option to
@@ -112,7 +118,7 @@ def self.HTML5(...)
# are not part of Nokogiri's public API. That is, these are subject to change without Nokogiri's
# major version number changing. These may be stabilized in the future.
#
# === Maximum tree depth
# === Maximum tree depth: +max_tree_depth:+
#
# The maximum depth of the DOM tree parsed by the various parsing methods is configurable by the
# +:max_tree_depth+ option. If the depth of the tree would exceed this limit, then an
@@ -126,7 +132,7 @@ def self.HTML5(...)
# # raises ArgumentError: Document tree depth limit exceeded
# doc = Nokogiri.HTML5(html, max_tree_depth: -1)
#
# === Attribute limit per element
# === Attribute limit per element: +max_attributes:+
#
# The maximum number of attributes per DOM element is configurable by the +:max_attributes+
# option. If a given element would exceed this limit, then an +ArgumentError+ is thrown.
@@ -142,7 +148,7 @@ def self.HTML5(...)
# doc = Nokogiri.HTML5(html, max_attributes: -1)
# # parses successfully
#
# === Parse +noscript+ elements' content as text
# === Parse +noscript+ elements' content as text: +parse_noscript_content_as_text:+
#
# By default, the content of +noscript+ elements is parsed as HTML elements. Browsers that
# support scripting parse the content of +noscript+ elements as raw text.
60 changes: 37 additions & 23 deletions lib/nokogiri/html5/document.rb
@@ -43,41 +43,54 @@ class Document < Nokogiri::HTML4::Document

# Get the parser's quirks mode value. See HTML5::QuirksMode.
#
# This method returns `nil` if the parser was not invoked (e.g., `Nokogiri::HTML5::Document.new`).
# This method returns +nil+ if the parser was not invoked (e.g., Nokogiri::HTML5::Document.new).
#
# Since v1.14.0
attr_reader :quirks_mode

class << self
# :call-seq:
# parse(input)
# parse(input, url=nil, encoding=nil, **options)
# parse(input, url=nil, encoding=nil) { |options| ... }
# parse(input) { |parse_options| ... }
# parse(input, url:, encoding:, **parse_options)
#
# Parse HTML5 input.
# Parse \HTML input with a parser compliant with the HTML5 spec. This method uses the
# encoding of +input+ if it can be determined, or else falls back to the +encoding:+
# parameter.
#
# [Parameters]
# - +input+ may be a String, or any object that responds to _read_ and _close_ such as an
# IO, or StringIO.
# [Required Parameters]
# - +input+ (String | IO) the \HTML content to be parsed.
#
# - +url+ (optional) is a String indicating the canonical URI where this document is located.
# [Optional Parameters]
# - +url:+ (String) the base URI of the document.
# - +encoding+ (Encoding) The encoding that should be used when processing the
# document. This option is only used as a fallback when the encoding of +input+ cannot be
# determined.
# - +parse_options+ (Hash) represents keyword arguments that control the behavior of the
# parser. See rdoc-ref:HTML5@Parsing+options for a list of available options.
#
# - +encoding+ (optional) is the encoding that should be used when processing
# the document.
# [Yields]
# If present, the block will be passed a Hash object to modify with parse options before the
# input is parsed. See rdoc-ref:HTML5@Parsing+options for a list of available options.
#
# - +options+ (optional) is a configuration Hash (or keyword arguments) to set options
# during parsing. The three currently supported options are +:max_errors+,
# +:max_tree_depth+ and +:max_attributes+, described at Nokogiri::HTML5.
# ⚠ Note that +url:+ and +encoding:+ cannot be set by the configuration block.
#
# ⚠ Note that these options are different than those made available by
# Nokogiri::XML::Document and Nokogiri::HTML4::Document.
# [Returns] Nokogiri::HTML5::Document
#
# - +block+ (optional) is passed a configuration Hash on which parse options may be set. See
# Nokogiri::HTML5 for more information and usage.
# *Example:* Parse a string with a specific encoding and custom max errors limit.
#
# [Returns] Nokogiri::HTML5::Document
# Nokogiri::HTML5::Document.parse(socket, encoding: "ISO-8859-1", max_errors: 10)
#
# *Example:* Parse a string setting the +:parse_noscript_content_as_text+ option using the
# configuration block parameter.
#
# Nokogiri::HTML5::Document.parse(input) { |c| c[:parse_noscript_content_as_text] = true }
#
def parse(string_or_io, url = nil, encoding = nil, **options, &block)
def parse(
string_or_io,
url_ = nil, encoding_ = nil,
url: url_, encoding: encoding_,
**options, &block
)
yield options if block
string_or_io = "" unless string_or_io

@@ -98,7 +111,7 @@ def parse(string_or_io, url = nil, encoding = nil, **options, &block)
# Create a new document from an IO object.
#
# 💡 Most users should prefer Document.parse to this method.
def read_io(io, url = nil, encoding = nil, **options)
def read_io(io, url_ = nil, encoding_ = nil, url: url_, encoding: encoding_, **options)
raise ArgumentError, "io object doesn't respond to :read" unless io.respond_to?(:read)

do_parse(io, url, encoding, **options)
@@ -107,7 +120,7 @@ def read_io(io, url = nil, encoding = nil, **options)
# Create a new document from a String.
#
# 💡 Most users should prefer Document.parse to this method.
def read_memory(string, url = nil, encoding = nil, **options)
def read_memory(string, url_ = nil, encoding_ = nil, url: url_, encoding: encoding_, **options)
raise ArgumentError, "string object doesn't respond to :to_str" unless string.respond_to?(:to_str)

do_parse(string, url, encoding, **options)
@@ -144,7 +157,8 @@ def initialize(*args) # :nodoc:
# - +markup+ (String) The HTML5 markup fragment to be parsed
#
# [Returns]
# Nokogiri::HTML5::DocumentFragment. This object's children will be empty if `markup` is not passed, is empty, or is `nil`.
# Nokogiri::HTML5::DocumentFragment. This object's children will be empty if +markup+ is not
# passed, is empty, or is +nil+.
#
def fragment(markup = nil)
DocumentFragment.new(self, markup)
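
The signature change to `parse`, `read_io`, and `read_memory` above relies on Ruby allowing a keyword argument's default to refer to an earlier positional parameter. The following is a minimal sketch of that pattern only, not Nokogiri's implementation: `parse_sketch` is a hypothetical stand-in that does no parsing and simply echoes what it received, and the URLs are placeholder values.

```ruby
# Hypothetical stand-in for the new signature; it does no parsing and is not
# Nokogiri's implementation. It only echoes back what it received.
def parse_sketch(input, url_ = nil, encoding_ = nil, url: url_, encoding: encoding_, **options)
  # The config block receives the options Hash before any "parsing" happens.
  yield options if block_given?
  { input: input, url: url, encoding: encoding, options: options }
end

html = "<!DOCTYPE html><html>"

# Legacy positional style and the new keyword style resolve to the same values.
positional = parse_sketch(html, "http://example.com", "UTF-8", max_errors: 1)
keyword    = parse_sketch(html, url: "http://example.com", encoding: "UTF-8", max_errors: 1)

# A block can add parse options, but it has no access to url: or encoding:.
with_block = parse_sketch(html) { |c| c[:max_errors] = 1 }

# If both styles are given, the explicit keyword wins over the positional value.
mixed = parse_sketch(html, "http://positional.example", url: "http://keyword.example")
```

Because the positional parameters are kept and only feed the keyword defaults, existing callers that pass `url` and `encoding` positionally should keep working, which is consistent with the tests in this PR exercising both call styles.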
21 changes: 21 additions & 0 deletions test/html5/test_api.rb
@@ -42,6 +42,22 @@ def test_url

doc = Nokogiri::HTML5(html, url, max_errors: 1)
assert_equal(url, doc.errors[0].file)

# with keyword args
doc = Nokogiri::HTML5::Document.parse(html, url: nil)
assert_nil(doc.url)

doc = Nokogiri::HTML5::Document.parse(html, url: url)
assert_equal(url, doc.url)

doc = Nokogiri::HTML5::Document.parse(html, url: url, max_errors: 1)
assert_equal(url, doc.errors[0].file)

doc = Nokogiri::HTML5.parse(html, url: url, max_errors: 1)
assert_equal(url, doc.errors[0].file)

doc = Nokogiri::HTML5(html, url: url, max_errors: 1)
assert_equal(url, doc.errors[0].file)
end

def test_parse_encoding
@@ -57,6 +73,11 @@ def test_parse_encoding
assert_match(/おはようございます/, Nokogiri::HTML5(raw, nil, Encoding::SHIFT_JIS).to_s)
assert_match(/おはようございます/, Nokogiri::HTML5.parse(raw, nil, Encoding::SHIFT_JIS).to_s)
assert_match(/おはようございます/, Nokogiri::HTML5::Document.parse(raw, nil, Encoding::SHIFT_JIS).to_s)

# with kwargs
assert_match(/おはようございます/, Nokogiri::HTML5(raw, encoding: Encoding::SHIFT_JIS).to_s)
assert_match(/おはようございます/, Nokogiri::HTML5.parse(raw, encoding: Encoding::SHIFT_JIS).to_s)
assert_match(/おはようございます/, Nokogiri::HTML5::Document.parse(raw, encoding: Encoding::SHIFT_JIS).to_s)
end

def test_fragment_encoding
12 changes: 10 additions & 2 deletions test/html5/test_nokogumbo.rb
@@ -205,21 +205,29 @@ def test_fragment_default_max_attributes
assert_raises(ArgumentError) { Nokogiri::HTML5.fragment(html) }
end

TWO_ERROR_DOC = "<!DOCTYPE html><html><!-- <!-- --></a>"

def test_parse_errors
doc = Nokogiri::HTML5("<!DOCTYPE html><html><!-- <!-- --></a>", max_errors: 10)
doc = Nokogiri::HTML5(TWO_ERROR_DOC, max_errors: 10)
assert_equal(2, doc.errors.length)
doc = Nokogiri::HTML5("<!DOCTYPE html><html>", max_errors: 10)
assert_empty(doc.errors)
end

def test_max_errors
# This document contains 2 parse errors, but we force limit to 1.
doc = Nokogiri::HTML5("<!DOCTYPE html><html><!-- -- --></a>", max_errors: 1)
doc = Nokogiri::HTML5(TWO_ERROR_DOC, max_errors: 1)
assert_equal(1, doc.errors.length)
doc = Nokogiri::HTML5("<!DOCTYPE html><html>", max_errors: 1)
assert_empty(doc.errors)
end

def test_max_errors_with_config_block
# This document contains 2 parse errors, but we force limit to 1.
doc = Nokogiri::HTML5(TWO_ERROR_DOC) { |c| c[:max_errors] = 1 }
assert_equal(1, doc.errors.length)
end

def test_default_max_errors
# This document contains 200 parse errors, but default limit is 0.
doc = Nokogiri::HTML5("<!DOCTYPE html><html>" + "</p>" * 200)