Problems with special characters and php8.4 \Dom\HTMLDocument #17785
Comments
Confirmed: https://3v4l.org/AQUlB. Possibly, Lexbor does not support Anyway, you might be better off sticking with the classic |
The problem occurs with the value I don't know if |
Oops! The problem is actually that the HTML is UTF-8 encoded, but the charset states Windows-1252. |
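To make the mismatch described above concrete, here is a minimal sketch of what happens when UTF-8 bytes are read as Windows-1252. It is an illustration only, not the reporter's actual input: it assumes the `mbstring` extension is available, and the variable names are hypothetical.

```php
<?php
// "à" encoded as UTF-8 is the two bytes 0xC3 0xA0.
$utf8Bytes = "\xC3\xA0";

// Correct reading: the bytes are already valid UTF-8.
$isValidUtf8 = mb_check_encoding($utf8Bytes, 'UTF-8');

// Wrong reading: treat every byte as one Windows-1252 character and
// convert the result to UTF-8. 0xC3 becomes "Ã" (U+00C3) and 0xA0
// becomes a no-break space (U+00A0).
$misread = mb_convert_encoding($utf8Bytes, 'UTF-8', 'Windows-1252');

echo bin2hex($utf8Bytes), "\n"; // c3a0
echo bin2hex($misread), "\n";   // c383c2a0
```

The single character turns into two unrelated characters, which is why a wrong charset declaration is enough to corrupt otherwise valid text.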
Are you sure? https://fr.wikipedia.org/wiki/Windows-1252 The old DOMDocument even replaces it with |
Character encoding mess! See https://3v4l.org/lPBhV. The first output string contains the hex sequence Try again with UTF-8 encoded input: https://3v4l.org/NCVS1 (without Regarding https://3v4l.org/cTBvh; trying with So there appears to be a relevant difference between the |
Thanks for your help. I think my problem is actually with my mail parser library, which seems to convert to UTF-8 |
@cmb69 So when doing this
    $dom = \Dom\HTMLDocument::createFromString($html, options: LIBXML_NOERROR);
    var_dump($dom->saveHtml());
it breaks the
    $dom = \Dom\HTMLDocument::createFromString('', options: LIBXML_NOERROR);
    var_dump($dom->documentElement->innerHTML);
    $dom->documentElement->innerHTML = $html;
    $html = $dom->saveHtml();
    var_dump($html);
However, I then need to either remove or modify the meta tag to change the charset, or fix all characters that could be incorrectly parsed. Is there any way to tell PHP to encode all special characters so when doing |
We should probably keep this ticket open, until my comment is resolved:
|
@momala454 Please don't use innerHTML for that, as it is context-sensitive. You want to set the encoding at parse time and ignore the meta charset, therefore you can use the
    $dom = \Dom\HTMLDocument::createFromString($html, LIBXML_NOERROR, 'UTF-8');
See php-src/ext/dom/php_dom.stub.php, line 2045 in 78d934a
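As a rough sketch of the parse-time approach suggested above, the snippet below passes 'UTF-8' as the third argument so that the document is decoded as UTF-8 regardless of the meta charset. The sample HTML is a hypothetical stand-in for the email body (UTF-8 bytes with a Windows-1252 declaration), and PHP 8.4 or later is assumed.

```php
<?php
// Hypothetical sample: the "à" is stored as UTF-8 bytes (0xC3 0xA0),
// but the meta tag claims the document is Windows-1252.
$html = '<html><head>'
    . '<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">'
    . '</head><body><p>' . "\xC3\xA0" . '</p></body></html>';

// The third argument sets the encoding at parse time, as in the call
// quoted in the comment above.
$dom = \Dom\HTMLDocument::createFromString($html, LIBXML_NOERROR, 'UTF-8');

// Read the paragraph text back and show its bytes.
$text = $dom->querySelector('p')->textContent;
echo bin2hex($text), "\n";
```

Note that this only changes how the input is decoded; the meta tag in the document still states the old charset.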
@cmb69 Both the old DOM and new DOM API's |
how can I force |
@nielsdos, I still don't understand https://3v4l.org/lv0mi (why is there a visible difference in the output, although both report
That behavior doesn't make any sense. Either they don't convert to UTF-8, or they also change the charset tag. |
You can't, as the HTML standard doesn't define such a way and new DOM follows the HTML standard strictly.
    $meta_charset = $dom->head->querySelector('meta[http-equiv="Content-Type"]');
    $meta_charset->setAttribute('content', 'text/html; charset=utf-8'); |
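The meta tag update above can be made a little more defensive. This is a sketch under assumptions, not the maintainers' recommended code: the sample HTML and variable names are hypothetical, PHP 8.4 or later is assumed, and the null check is there because real emails may not contain such a meta tag at all.

```php
<?php
// Hypothetical sample input with a Windows-1252 declaration.
$html = '<html><head>'
    . '<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">'
    . '</head><body><p>' . "\xC3\xA0" . '</p></body></html>';

// Decode as UTF-8 at parse time, as suggested earlier in the thread.
$dom = \Dom\HTMLDocument::createFromString($html, LIBXML_NOERROR, 'UTF-8');

// querySelector() returns null when nothing matches, so guard before use.
$metaCharset = $dom->querySelector('meta[http-equiv="Content-Type"]');
if ($metaCharset !== null) {
    $metaCharset->setAttribute('content', 'text/html; charset=utf-8');
}

// Serialize the document with the updated declaration.
$output = $dom->saveHtml();
echo $output, "\n";
```

Emails that declare their charset with a different tag, such as `<meta charset="...">`, would need their own selector; that case is not covered here.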
First output: Second output: |
Ah, now I understand! Anyway, I think this ticket can be closed, since there is no bug. |
Description
The following code:
Resulted in this output:
But I expected this output instead:
This HTML is taken from a real email I received. Outlook correctly displays the à character, and Firefox also displays it correctly. But for some reason, PHP 8.4 \Dom\HTMLDocument replaces à with �.
I am parsing received emails, so I can't really control the correctness of the initial HTML.
The previous DOMDocument was parsing it correctly.
PHP Version
8.4.4
Operating System
No response