UTF8 characters handling in input edi file #82
My EDI file starts with this:
I succeeded with a brute-force workaround: Line 139 in 6fc2fe5
Hmmm, UNOB is not supposed to contain UTF-8 characters, right?
Could you send the file to my email so I can try it? :)
Ok, the producer of my input file fixed the UNB to include the UNOY instead of UNOB and fixed some other smaller issues. So, as a summary:
Question: Lines 136 to 139 in 6fc2fe5
What if stripchars were set only after the UNB line has been processed, and then applied depending on the declared encoding?
Yeah, I don't actually test files with different encodings. Thanks for the file; I'll look at it more whenever I can.
Depending on the UNB settings in the file, I think one scenario should work:
At the moment it seems the basic sanitization always happens. Maybe I will have time to play with it at the weekend.
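A minimal sketch of the idea above: read the syntax identifier from the UNB segment first, then pick the strip pattern from it. The function names (`syntaxIdentifier`, `stripPatternFor`, `sanitizeSegment`) are hypothetical and not part of the parser; the sketch assumes the default `+` and `:` separators and covers only a few identifiers.

```php
<?php
// Hypothetical helpers, not part of the parser discussed in this thread.

/**
 * Extract the syntax identifier (e.g. "UNOC") from the start of an interchange.
 * Assumes the default "+" and ":" separators; returns null if no UNB is found.
 */
function syntaxIdentifier(string $edi): ?string
{
    if (preg_match('/UNB\+(UNO[A-Z]):\d/', $edi, $m) === 1) {
        return $m[1];
    }
    return null;
}

/**
 * Choose a byte-level strip pattern for the declared character set.
 * Returns null when this sketch has no rule for the identifier.
 */
function stripPatternFor(?string $syntaxId): ?string
{
    switch ($syntaxId) {
        case 'UNOA':
        case 'UNOB':
            // 7-bit sets: drop everything outside printable ASCII.
            // Assumption: the file uses printable separators, not control characters.
            return '/[^\x20-\x7E]/';
        case 'UNOC':
        case 'UNOD':
        case 'UNOE':
        case 'UNOF':
            // Single-byte ISO 8859 parts: keep 0xA0-0xFF, drop control bytes only.
            return '/[\x00-\x1F\x7F-\x9F]/';
        default:
            return null;
    }
}

/** Apply the pattern for the declared set; leave the segment untouched if none is known. */
function sanitizeSegment(string $segment, ?string $syntaxId): string
{
    $pattern = stripPatternFor($syntaxId);
    if ($pattern === null) {
        return $segment;
    }
    return preg_replace($pattern, '', $segment) ?? $segment;
}
```

With this shape, a UNOC file keeps bytes such as 0xE9 (é in ISO-8859-1), while the same byte is still stripped when the header declares UNOB.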
Hey all. If somebody needs UNOE for Cyrillic:
This isn't actually fixed, is it? Please keep this open if it isn't, if only to warn others of expected problems.
I'm trying to parse a file (Header starts with
Line 191 in c577183
replaces them with non-printable characters and then complains about it. Characters like
What exactly is missing for this to work as expected?
I've found this on the web: https://blog.sandro-pereira.com/2009/08/15/edifact-encoding-edi-character-set-support/ and according to https://en.wikipedia.org/wiki/ISO/IEC_8859
https://groups.google.com/g/botsmail/c/B6V5mwdcFts/m/DRcsE_K7BgAJ claims that UNOW is for UTF-8 while UNOY seems to be the whole UTF-32, or something… But apparently they only exist as part of syntax version 4, for which I've found this: https://www.gefeg.com/jswg/v4/data/v4_docs.htm
OK, I'll reopen it... I never had a file to actually test with; if you can send me one via email I can look... Perhaps we should use the multibyte versions of the various functions?
Do your classes support syntax v4? If so, one could simply set UNOW/Y, pass UTF-8 encoded strings and skip the char replacement. Should theoretically come out fine. But fixing it so UNOC v3 actually allows the supported characters to come out correctly makes sense, too. Right now it seems like your EDI classes don't concern themselves with text encoding whatsoever, do they?
Actually I don't know, I usually process EDI v3 only so I never looked in detail... Future work :)
I have to partially retract my statements. It works fine if you actually feed it text in one of the explicitly supported character encodings. I was feeding it UTF-8. 🤦♂️ My fault for not knowing enough about EDI. Sorry. Now I think there should be a warning somewhere that UTF-8 isn't supported. Maybe a check like this:

    // Unicode is only supported starting with UNOW syntax which requires syntax v4
    // neither are currently (fully?) supported
    if (mb_check_encoding($string, 'UTF-8') === true)
        trigger_error('UTF-8 encoded text found', E_USER_WARNING);

After all, PHP itself usually assumes UTF-8 encoding everywhere, if you don't specify something else. But I think any kind of text conversion should be left to the user.
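For anyone landing here with UTF-8 input, a rough sketch of what "leave the conversion to the user" could look like: convert to the declared single-byte set before parsing, and back to UTF-8 before `json_encode`. The helper names `utf8ToDeclared` and `declaredToUtf8` are made up for illustration, ISO-8859-1 (UNOC) is assumed as the declared set, and `explode()` merely stands in for the real parser call.

```php
<?php
// Hypothetical user-side helpers; the parser itself is not involved here.

/** Convert UTF-8 text to the single-byte set the interchange declares (ISO-8859-1 assumed). */
function utf8ToDeclared(string $utf8, string $target = 'ISO-8859-1'): string
{
    return mb_convert_encoding($utf8, $target, 'UTF-8');
}

/** Convert parsed values back to UTF-8, e.g. before json_encode(). Recurses into arrays. */
function declaredToUtf8($value, string $source = 'ISO-8859-1')
{
    if (is_array($value)) {
        return array_map(fn($v) => declaredToUtf8($v, $source), $value);
    }
    return is_string($value) ? mb_convert_encoding($value, 'UTF-8', $source) : $value;
}

// "\u{E9}" is é written as an escape, so the example does not depend on the file's own encoding.
$raw = utf8ToDeclared("FTX+AAI+++caf\u{E9}'");

// ... feed $raw to the parser here; a plain explode() stands in for it ...
$parsed = explode('+', $raw);

// json_encode() expects UTF-8 strings, so convert back first.
$json = json_encode(declaredToUtf8($parsed), JSON_UNESCAPED_UNICODE);
```

Characters that the target set cannot represent (ű and ő are not in ISO-8859-1, for example) will not survive the first conversion, so this only helps when the text really fits the declared set.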
I would like to only populate the errors array; the user should break execution explicitly by checking that... Is there some way of silently detecting and converting the file?
A UTF-8 file might still come out fine as long as it contains only characters within the first 128 code points (plain ASCII), as in English text for example. So breaking execution in that case is overly strict; that's why I chose a mere warning instead. Like I said, I'm not a fan of doing things silently in the background. Detecting character encoding with 100% certainty is impossible anyway. Only the creator of the text can know its encoding for sure. So better not mess with it and use it as is instead. You can only check whether a text matches a certain encoding. So you'd have to write code in the sense of "if the file says it's UNOC, check if it passes an ISO-8859-1 encoding check" and so on.
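A sketch of such a check that only collects messages instead of raising anything, so the caller decides what to do with them. One caveat, as far as I know: a check against a single-byte encoding such as ISO-8859-1 passes for nearly any byte string, so the sketch instead flags content that has non-ASCII bytes and also happens to be valid UTF-8. The function name `encodingWarnings` and the list of identifiers are assumptions for illustration.

```php
<?php
// Hypothetical check, not part of the parser. Returns messages instead of raising errors.

function encodingWarnings(string $raw, ?string $syntaxId): array
{
    $errors = [];
    // Identifiers this sketch treats as one-byte-per-character sets.
    $oneBytePerChar = ['UNOA', 'UNOB', 'UNOC', 'UNOD', 'UNOE', 'UNOF'];
    $hasHighBytes = preg_match('/[\x80-\xFF]/', $raw) === 1;

    // 7-bit sets should not contain any byte above 0x7F.
    if (in_array($syntaxId, ['UNOA', 'UNOB'], true) && $hasHighBytes) {
        $errors[] = "$syntaxId declared, but bytes above 0x7F are present";
    }

    // Non-ASCII bytes that also form valid UTF-8 are a strong hint of mislabelled input.
    // Pure ASCII is valid UTF-8 too, which is why the high-byte test comes first.
    if (in_array($syntaxId, $oneBytePerChar, true)
        && $hasHighBytes
        && mb_check_encoding($raw, 'UTF-8')
    ) {
        $errors[] = "$syntaxId declared, but the content looks like UTF-8";
    }

    return $errors;
}
```

This is a heuristic only: genuine ISO-8859-1 text can, in rare cases, contain byte pairs that also decode as UTF-8, so the result is better treated as a warning than as proof.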
I have an EDI file with UTF-8 characters like éáűúőí, and even characters such as ØÆ.
When I load it into the parser, I get error messages (non-printable chars...), and the error message contains the UTF-8 encoded character, like: \u00c6 (Æ)
My problem: when I print the result (echo json_encode or var_dump), these characters are missing.
Am I missing something? I read #64, but the characters still don't come out correctly in var_dump; for example, á is simply missing.
Could you help me keep the original characters during the parsing process?
Relevant code: