Problems with special characters and php8.4 \Dom\HTMLDocument #17785

Closed
momala454 opened this issue Feb 13, 2025 · 14 comments

@momala454

momala454 commented Feb 13, 2025

Description

The following code:

<?php

$text = <<<TEXT
 <html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
</head>
<body>
C'est à test
</body>
</html>
TEXT;

$dom = \Dom\HTMLDocument::createFromString($text, options: LIBXML_NOERROR);

var_dump($dom->saveHtml());

$dom = new \DOMDocument();
$dom->loadHTML($text, LIBXML_NOERROR);
var_dump($dom->saveHtml());

Resulted in this output:

string(137) "<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
</head>
<body>
C'est �&nbsp; test

</body></html>"
string(241) "<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
</head>
<body>
C'est à test
</body>
</html>
"

But I expected this output instead:

string(137) "<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
</head>
<body>
C'est à test

</body></html>"
string(241) "<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
</head>
<body>
C'est à test
</body>
</html>
"

This HTML is taken from a real email I received. Outlook displays the à character correctly, and so does Firefox.
But for some reason, PHP 8.4's \Dom\HTMLDocument replaces à with �&nbsp;.
I am parsing received emails, so I can't really control the correctness of the initial HTML.

The previous DOMDocument parsed it correctly.

PHP Version

8.4.4

Operating System

No response

@cmb69
Member

cmb69 commented Feb 13, 2025

Confirmed: https://3v4l.org/AQUlB. Possibly, Lexbor does not support <meta http-equiv> at all, but only <meta charset> (https://3v4l.org/Aodlf). I don't know the exact rules of HTML5 parsing, but that might conform to the specs. @nielsdos likely knows better than me.

Anyway, you might be better off sticking with the classic ::loadHTML() when parsing emails.

@momala454
Author

The problem occurs with the value Windows-1252, even with <meta charset> https://3v4l.org/eAKOB

I don't know if à is invalid in this charset

@cmb69
Copy link
Member

cmb69 commented Feb 13, 2025

Oops!

The problem is actually that the HTML is UTF-8 encoded, but the charset states Windows-1252.
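A byte-level sketch of that mismatch may help. It assumes the mbstring extension is available and that the source string is stored as UTF-8. The variable names are illustrative only.

```php
<?php
// "à" (U+00E0) as a UTF-8 encoded source stores it: two bytes, C3 A0.
$utf8 = "\u{E0}";

// The same character in Windows-1252 is the single byte E0.
$cp1252 = mb_convert_encoding($utf8, 'Windows-1252', 'UTF-8');

// Decoding the UTF-8 bytes as if they were Windows-1252 produces two
// characters instead: U+00C3 (Ã) followed by U+00A0 (no-break space).
$misread = mb_convert_encoding($utf8, 'UTF-8', 'Windows-1252');

echo bin2hex($utf8), "\n";    // c3a0
echo bin2hex($cp1252), "\n";  // e0
echo bin2hex($misread), "\n"; // c383c2a0
```

So a parser that trusts the Windows-1252 label sees Ã plus a no-break space where the author meant a single à.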

@momala454
Author

momala454 commented Feb 13, 2025

Are you sure? à is &#xE0;, I don't think it's UTF-8, and it doesn't work: https://3v4l.org/cTBvh

https://fr.wikipedia.org/wiki/Windows-1252

The old DOMDocument even replaces it with &agrave;

@cmb69
Member

cmb69 commented Feb 13, 2025

Character encoding mess! See https://3v4l.org/lPBhV. The first output string contains the hex sequence 20e020746573740, which is what we expect ( à test encoded as Windows-1252). So does the second line. Without the bin2hex: https://3v4l.org/C1733

Try again with UTF-8 encoded input: https://3v4l.org/NCVS1 (without bin2hex(): https://3v4l.org/AD4p3; that is what you have reported). The first string encodes as c3266e6273703b, but the second as c3a0.

Regarding https://3v4l.org/cTBvh; trying with bin2hex: https://3v4l.org/eYt6Y; the first output string encodes as e0 which is exactly what we expect. The problem is that outputting the string may apply conversion to another encoding (or may assume another encoding in the first place). Using mb_convert_encoding() yields: https://3v4l.org/EqaCU

So there appears to be a relevant difference between the ::saveHtml() calls; when using the new DOM API, no further conversion is applied, but when using the classic DOM API, it is, possibly depending on the respective INI settings.

@momala454
Author

Thanks for your help. I think my problem is actually with my mail parser library, which seems to convert to UTF-8.

@momala454
Author

momala454 commented Feb 13, 2025

@cmb69
The library I use will convert the email to utf8, but keep the meta tag that contains the charset windows-1252.

So when doing this

$dom = \Dom\HTMLDocument::createFromString($html, options: LIBXML_NOERROR);
var_dump($dom->saveHtml());

it breaks the à and converts it to �&nbsp;.
It looks like doing the following prevents the charset in the HTML from being parsed:

$dom = \Dom\HTMLDocument::createFromString('', options: LIBXML_NOERROR);
var_dump($dom->documentElement->innerHTML);
$dom->documentElement->innerHTML = $html;
$html = $dom->saveHtml();
var_dump($html);

However, I then need to either remove or modify the meta tag to change the charset, or fix all characters that could be incorrectly parsed.

Is there any way to tell PHP to encode all special characters when calling saveHtml(), so that any future parsing of the HTML already sees à as an HTML entity (&agrave;) and I don't have to modify the charset inside the HTML?

@cmb69
Member

cmb69 commented Feb 13, 2025

We should probably keep this ticket open, until my comment is resolved:

So there appears to be a relevant difference between the ::saveHtml() calls; when using the new DOM API, no further conversion is applied, but when using the classic DOM API, it is, possibly depending on the respective INI settings.

@cmb69 cmb69 reopened this Feb 13, 2025
@nielsdos
Member

nielsdos commented Feb 13, 2025

@momala454 Please don't use innerHTML for that as that's context sensitive. You want to set the encoding at parse time, and ignore the meta charset, therefore you can use the $overrideEncoding argument like so:

$dom = \Dom\HTMLDocument::createFromString($html, LIBXML_NOERROR, 'UTF-8');

See

public static function createFromString(string $source, int $options = 0, ?string $overrideEncoding = null): HTMLDocument {}

@cmb69 Both the old DOM and new DOM API's saveHtml output the html string in the encoding set in $dom->charset, which is taken from the meta charset attribute. (EDIT: unless $overrideEncoding is given, then the encoding comes from there)
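
To make the effect of $overrideEncoding concrete, here is a minimal sketch that parses the same bytes with and without it. It assumes PHP 8.4 with the DOM extension and a source string that is UTF-8 while its meta tag still claims Windows-1252, as in the report. The variable names are illustrative.

```php
<?php
// UTF-8 bytes, but the meta tag still claims Windows-1252 (as in the report).
$html = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=Windows-1252"></head>'
      . "<body>C'est \u{E0} test</body></html>";

// No override: the meta charset wins, so the bytes C3 A0 are decoded as
// Windows-1252 (Ã followed by a no-break space).
$sniffed = \Dom\HTMLDocument::createFromString($html, LIBXML_NOERROR);

// Override: the input is decoded as UTF-8 and the meta charset is ignored.
$forced = \Dom\HTMLDocument::createFromString($html, LIBXML_NOERROR, 'UTF-8');

// The first result shows the broken byte C3 followed by &nbsp;,
// the second keeps the original UTF-8 "à".
var_dump(str_contains($sniffed->saveHtml(), "\xC3&nbsp;"));
var_dump(str_contains($forced->saveHtml(), "C'est \u{E0} test"));
```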

@momala454
Author

how can I force saveHtml() to encode the html special chars like à ?
Because I am parsing multiple times the html, and if I don't save somewhere the forced charset, the next time I parse the html it will be broken again

@cmb69
Member

cmb69 commented Feb 13, 2025

@nielsdos, I still don't understand https://3v4l.org/lv0mi (why is there a visible difference in the output, although both report Windows-1252).

@momala454:

The library I use will convert the email to utf8, but keep the meta tag that contains the charset windows-1252.

That behavior doesn't make any sense. Either they don't convert to UTF-8, or they also change the charset tag.

@nielsdos
Member

@momala454

how can I force saveHtml() to encode the html special chars like à ?
Because I am parsing multiple times the html, and if I don't save somewhere the forced charset, the next time I parse the html it will be broken again

You can't, as the HTML standard doesn't define such a way and new DOM follows the HTML standard strictly.
What you can do instead, is use $overrideEncoding as I showed, and then change the meta tag:

$meta_charset = $dom->head->querySelector('meta[http-equiv="Content-Type"]');
$meta_charset->setAttribute('content', 'text/html; charset=utf-8');
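
Put together, the two steps could look like the sketch below. It is not an official recipe: the helper name normalizeHtmlToUtf8 is hypothetical, it assumes the input bytes really are UTF-8, and it adds null checks because querySelector() returns null when no matching meta tag exists. Rewriting a bare <meta charset> tag as well is an extra assumption on top of what is shown above.

```php
<?php
// Hypothetical helper: parse UTF-8 input whose meta tag may lie about the
// charset, then rewrite the meta tag so later parses need no override.
function normalizeHtmlToUtf8(string $utf8Html): string
{
    // Decode as UTF-8 regardless of what the meta tag claims.
    $dom = \Dom\HTMLDocument::createFromString($utf8Html, LIBXML_NOERROR, 'UTF-8');

    $head = $dom->head;
    if ($head !== null) {
        // <meta http-equiv="Content-Type" content="...; charset=...">
        $head->querySelector('meta[http-equiv="Content-Type"]')
            ?->setAttribute('content', 'text/html; charset=utf-8');
        // <meta charset="..."> (assumption: treat this form the same way)
        $head->querySelector('meta[charset]')
            ?->setAttribute('charset', 'utf-8');
    }

    return $dom->saveHtml();
}

$input = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=Windows-1252"></head>'
       . "<body>C'est \u{E0} test</body></html>";

$fixed = normalizeHtmlToUtf8($input);

// A later parse without any override should now pick up UTF-8 from the meta tag.
$again = \Dom\HTMLDocument::createFromString($fixed, LIBXML_NOERROR);
var_dump(str_contains($again->saveHtml(), "C'est \u{E0} test"));
```

Storing the rewritten HTML means the forced charset travels with the document, so it does not have to be remembered separately between parses.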

@nielsdos
Member

@cmb69

First output:
In Windows-1252, C3A0 is the letter Ã followed by a non-breaking space. When you call saveHtml() you get, in Windows-1252 encoding, the letter Ã back with &nbsp;. The reason the non-breaking space is converted to an entity is because this is the only auto-html-entity-encoding exception IIRC in the serialization algorithm for HTML5.
So it seems right to me. Unless I misunderstand what you mean?

Second output:
The output is again in Windows-1252 encoding, and the output is literally C3A0, which is the same as the input.
I don't see the issue here either?
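
The non-breaking space exception described above can be observed in isolation. This is a small sketch assuming PHP 8.4. The input is forced to UTF-8 so that no charset sniffing is involved, and the markup is made up for illustration.

```php
<?php
// A paragraph containing a no-break space (U+00A0) and an "à" (U+00E0).
$dom = \Dom\HTMLDocument::createFromString(
    "<p>a\u{A0}b \u{E0}</p>",
    LIBXML_NOERROR,
    'UTF-8'
);

$out = $dom->saveHtml();

// The no-break space comes back as the entity &nbsp;,
// while "à" is written out as a literal character.
var_dump(str_contains($out, "<p>a&nbsp;b \u{E0}</p>"));
```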

@cmb69
Member

cmb69 commented Feb 13, 2025

The reason the non-breaking space is converted to an entity is because this is the only auto-html-entity-encoding exception IIRC in the serialization algorithm for HTML5.

Ah, now I understand!

Anyway, I think this ticket can be closed, since there is no bug.

@cmb69 cmb69 closed this as not planned Feb 13, 2025