Conform encoding handling to Encoding spec #48

sideshowbarker · 2020-08-20T08:46:47Z

This makes the parser’s handling of encodings conformant with current requirements in the Encoding spec and HTML spec.

To that end, this PR branch includes the following important changes:

81c40ef Conform MetaSniffer "x-user-defined" handling
f9b3796 Sync w/ “Changing the encoding while parsing” algo
ceb9fe3 Drop UTF-32 & “UTF-16”; use UTF-16BE and UTF-16LE
9d2d39d Drop Encoding.toAsciiLowerCase; use toLowerCase()
beb060b Conform encoding-label matching to Encoding spec
8c62d2d Conform supported encodings to Encoding spec
5cfabbe Drop superseded encoding-checking code
343d86f Drop “actual HTML encoding”-related code
03954fb Drop/replace “is ASCII superset”-checking code

It also includes some related minor changes to get things working smoothly.

The beb060b change is particularly notable in that it fixes the parser’s encoding-name matching to conform to the current Encoding spec at https://encoding.spec.whatwg.org/#concept-encoding-get — which requires that only leading and trailing whitespace be removed from a string before checking if it matches any valid encoding names.

Otherwise, without that change, the parser instead implements https://www.unicode.org/reports/tr22/tr22-8.html#Charset_Alias_Matching — which requires deleting “all characters except a-z, A-Z, and 0-9” from a string before checking if it matches any valid encoding names. That difference makes us fail two html5lib-tests cases.

The html5lib-tests cases that we fail without that beb060b change are the following:

src/nu/validator/htmlparser/io/Encoding.java

sideshowbarker · 2020-08-20T16:00:58Z

I’m seeing some possible regressions in the HTML checker as a result of this —

error: The encoding “utf-8” is not an IANA-registered encoding and did not use the “x-” prefix. (Charmod C023)
info warning: The character encoding “utf-8” is not widely supported. Better interoperability may be achieved by using “UTF-8”.
info warning: Using “x-iso-2022-cn-gb” instead of the declared encoding “utf-8”.

That is, this change seems to have caused the parser to fail to parse utf-8 case-insensitively.

And then for some reason, the parser seems to randomly picking some other encoding to use — the encoding name in the last error message above is non-deterministic. When I re-run the tests, different encoding names show up:

info warning: Using “x-machebrew” instead of the declared encoding “utf-8”.

or:

info warning: Using “x-maccroatian” instead of the declared encoding “utf-8”.

… etc.

sideshowbarker · 2020-08-20T16:02:29Z

d’oh — never mind my previous comment; I see now that I just neglected to negate the condition. Will push an update

sideshowbarker · 2020-08-20T18:00:42Z

OK, I found that the existing code had effectively been using, as the internal canonical name, names with all dashes and underscores removed. That results in a conformance violation, because it means that, for example, we would treat the string ut-f8 (invalid as an encoding name) the same as utf-8 or utf8.

So with 92275d9, I’ve updated the checking to conform to exactly what the Encoding spec requires — which is to do an exact comparison of the lower-cased input string to the set of labels in https://encoding.spec.whatwg.org/#names-and-labels

src/nu/validator/htmlparser/io/Encoding.java

sideshowbarker · 2020-09-13T01:19:30Z

OK, I’ve now pushed a large number of commits to this branch that together make the parser’s handling of encodings conformant with current requirements in the Encoding spec and HTML spec. The key changes are the following:

81c40ef Conform MetaSniffer "x-user-defined" handling
f9b3796 Sync w/ “Changing the encoding while parsing” algo
ceb9fe3 Drop UTF-32 & “UTF-16”; use UTF-16BE and UTF-16LE
9d2d39d Drop Encoding.toAsciiLowerCase; use toLowerCase()
beb060b Conform encoding-label matching to Encoding spec
8c62d2d Conform supported encodings to Encoding spec
5cfabbe Drop superseded encoding-checking code
343d86f Drop “actual HTML encoding”-related code
03954fb Drop/replace “is ASCII superset”-checking code

The branch also includes some related minor changes to get things working smoothly.

I broke the changes out into separate commits in order to facilitate easier review.

So in reviewing this, you probably want to be starting from the https://github.com/validator/htmlparser/pull/48/commits view rather than the https://github.com/validator/htmlparser/pull/48/files view.

sideshowbarker · 2020-09-13T10:37:02Z

OK, I’m now a bit embarrassed to admit: it hadn’t dawned on me until now that the changes to the Encoding class in this patch are completely superseded by the code in the encodings branch of the repo.

So I’m now wondering whether we should just not spend any more time at all on this patch but instead try to get the code on the encodings branch finished and ready to merge to the main branch and to release.

I’m certainly willing to put time into helping with that.

carlosame · 2020-09-17T16:12:03Z

get the code on the encodings branch finished and ready to merge

My understanding was that encodings is an archive branch kept for reference, not initially intended for merging back. Perhaps I'm missing something.

This change renames the private encodingByCookedName map in the Encoding class to encodingByLabel (for consistency with Encoding spec terminology).

This change drops or replaces all code for checking whether a particular encoding is an ASCII superset — because the code no longer corresponds to actual requirements in the Encoding spec (which instead now requires checking only whether an encoding is utf-16be or utf-16le).

This change adds some error-message strings to the Encoding class — for shared used by the htmlparser.io.Driver and htmlparser.io.MetaSniffer classes, which need to emit the same error messages in a number of cases.

This change removes all code related to checking and using the “actual HTML encoding”, which no longer corresponds to current spec requirements.

This change drops some code which performs various encoding checks that no longer correspond to any current requirements in the Encoding spec.

This change conforms the set of supported encodings for the parser to the requirements in the Encoding spec. Specifically, this restricts the set of encoding names and labels to those listed in the table at https://encoding.spec.whatwg.org/#names-and-labels, and the statement: > The table below lists all encodings and their labels user agents must > support. User agents must not support any other encodings or labels.

This change drops the Encoding.toAsciiLowerCase() method and replaces calls to it with calls to string.toLowerCase() — because none of the conforming encoding names have any special characters for which the behavior of Encoding.toAsciiLowerCase() would produce any different results than string.toLowerCase() does.

This change renames the whineAboutEncodingAndReturnActual() method to whineAboutEncodingAndReturnCanonical() — in order to make more clear it’s reporting that the canonical name of the encoding is preferred over any particular label for that encoding.

This change drops all handling for UTF-32 (which is a completely invalid/ unsupported encoding per the Encoding spec), as well as replacing handling for “UTF-16” (which also isn’t a valid/supported encoding) with, instead, handling for the valid/supported encodings UTF-16BE and UTF-16LE.

This change brings the parser into conformance with current requirements in the “Changing the encoding while parsing” algorithm in the HTML spec.

This change just updates HtmlInputStreamReader to use shared message strings from the Encoding class for some of its error messages.

This change just updates MetaSniffer to use shared message strings from the Encoding class for some of its error messages.

This change makes the parser’s meta-prescan code conform to the requirements in the spec for handling of the "x-user-defined" encoding.

This change corrects the code to set the right encoding in the case when the external encoding has been determined to be non-UTF8.

sideshowbarker requested a review from hsivonen August 20, 2020 08:47

hsivonen requested changes Aug 20, 2020

View reviewed changes

src/nu/validator/htmlparser/io/Encoding.java Outdated Show resolved Hide resolved

sideshowbarker force-pushed the sideshowbarker/encoding-label-matching branch from 593fab4 to bf69e45 Compare August 20, 2020 15:46

sideshowbarker force-pushed the sideshowbarker/encoding-label-matching branch 2 times, most recently from 818e72f to 92275d9 Compare August 20, 2020 17:51

sideshowbarker force-pushed the sideshowbarker/encoding-label-matching branch from 92275d9 to 39f050a Compare August 21, 2020 06:19

sideshowbarker mentioned this pull request Aug 21, 2020

Investigate html5lib-tests encoding cases that are failing #47

Open

sideshowbarker force-pushed the validator-nu branch from 61325c3 to 5ae26d8 Compare August 21, 2020 09:53

sideshowbarker changed the base branch from validator-nu to master August 21, 2020 17:22

hsivonen reviewed Aug 25, 2020

View reviewed changes

src/nu/validator/htmlparser/io/Encoding.java Outdated Show resolved Hide resolved

sideshowbarker force-pushed the sideshowbarker/encoding-label-matching branch from 39f050a to 622c90b Compare September 13, 2020 01:04

sideshowbarker changed the title ~~Conform encoding-label matching to Encoding spec~~ Conform encoding handling to Encoding spec Sep 13, 2020

sideshowbarker added 12 commits September 2, 2021 20:56

Private encodingByCookedName → encodingByLabel

208fd51

This change renames the private encodingByCookedName map in the Encoding class to encodingByLabel (for consistency with Encoding spec terminology).

Add shared error-message strings to Encoding class

6413b44

This change adds some error-message strings to the Encoding class — for shared used by the htmlparser.io.Driver and htmlparser.io.MetaSniffer classes, which need to emit the same error messages in a number of cases.

Drop “actual HTML encoding”-related code

ff4258b

This change removes all code related to checking and using the “actual HTML encoding”, which no longer corresponds to current spec requirements.

Drop superseded encoding-checking code

2453891

This change drops some code which performs various encoding checks that no longer correspond to any current requirements in the Encoding spec.

Drop obsoleted params from Encoding constructor

1b06e4f

Sync w/ “Changing the encoding while parsing” algo

0cfe6d2

This change brings the parser into conformance with current requirements in the “Changing the encoding while parsing” algorithm in the HTML spec.

Use shared messages in HtmlInputStreamReader

76e6d62

This change just updates HtmlInputStreamReader to use shared message strings from the Encoding class for some of its error messages.

sideshowbarker added 3 commits September 2, 2021 20:59

Use shared message strings in MetaSniffer

2e31f9c

This change just updates MetaSniffer to use shared message strings from the Encoding class for some of its error messages.

Conform MetaSniffer "x-user-defined" handling

0de3e55

This change makes the parser’s meta-prescan code conform to the requirements in the spec for handling of the "x-user-defined" encoding.

Correct non-UTF8 handling in HtmlInputStreamReader

c5a88d6

This change corrects the code to set the right encoding in the case when the external encoding has been determined to be non-UTF8.

sideshowbarker force-pushed the sideshowbarker/encoding-label-matching branch from 622c90b to c5a88d6 Compare September 2, 2021 12:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Conform encoding handling to Encoding spec #48

Conform encoding handling to Encoding spec #48

sideshowbarker commented Aug 20, 2020 •

edited

Loading

sideshowbarker commented Aug 20, 2020

sideshowbarker commented Aug 20, 2020

sideshowbarker commented Aug 20, 2020 •

edited

Loading

sideshowbarker commented Sep 13, 2020

sideshowbarker commented Sep 13, 2020

carlosame commented Sep 17, 2020

Conform encoding handling to Encoding spec #48

Are you sure you want to change the base?

Conform encoding handling to Encoding spec #48

Conversation

sideshowbarker commented Aug 20, 2020 • edited Loading

sideshowbarker commented Aug 20, 2020

sideshowbarker commented Aug 20, 2020

sideshowbarker commented Aug 20, 2020 • edited Loading

sideshowbarker commented Sep 13, 2020

sideshowbarker commented Sep 13, 2020

carlosame commented Sep 17, 2020

sideshowbarker commented Aug 20, 2020 •

edited

Loading

sideshowbarker commented Aug 20, 2020 •

edited

Loading