UTF8 characters handling in input edi file #82

pongraczi · 2018-08-08T11:44:17Z

I have an EDI file with UTF-8 characters, like éáűúőí, and even characters such as ØÆ.
When I load the file into the parser, I get error messages (non-printable chars...), and the error message contains the character in escaped form, like: \u00c6 (Æ)

My problem: when I print the result (echo json_encode or var_dump), these characters are missing.

Am I missing something? I read #64, but I don't get the correct characters in var_dump; for example, á is simply missing.

Could you help me keep the original characters during the parsing process?

Relevant code:

$edifile = utf8_decode(file_get_contents("example.edi")); //it's a path!
// $edifile = file_get_contents("example.edi"); //it's a path!
// $p = new EDI\Parser($edifile);
$p = new EDI\Parser();
// $p->setStripRegex("//");
$p->loadString($edifile);

if (count($p->errors()) > 0) {
        echo "Error: ";
        echo json_encode($p->errors());
        // return;
}

echo "JSON:";
echo json_encode($p->get());

pongraczi · 2018-08-08T11:45:27Z

My EDI file starts with this:

UNA:+.? '
UNB+UNOB:2+TECHNIQ+TERMIQUE+180720:1105+957'

pongraczi · 2018-08-08T11:50:59Z

I succeeded with a brute-force workaround:
I commented out the following line so that no characters are removed:

edifact/src/EDI/Parser.php

Line 139 in 6fc2fe5

$line = preg_replace($this->stripChars, '', trim($line));

pongraczi · 2018-08-08T12:00:27Z

Hmmm, UNOB is not supposed to contain UTF-8 characters, right?
That is why the parser wants to strip them, as defined by the standard. So technically your parser does the right thing, but the file itself is really UNOY.
So, my example file has the wrong UNB header content...

sabas · 2018-08-08T12:10:56Z

Could you send the file to my email, so I can try? :)

pongraczi · 2018-08-08T19:05:18Z

OK, the producer of my input file fixed the UNB to declare UNOY instead of UNOB and fixed some other smaller issues.

So, as a summary:

Using UTF-8 encoded messages is possible, but the UNB segment must declare UNOY.
It seems that hacking the code a little can force UTF-8 support, but the result must be checked.

Question:
As far as I can see, the "Basic sanitization, remove non printable chars" step always processes the input line, even when UNOY is set and legitimate UTF-8 characters exist.
Here is the code, which runs before the UNB segment (and thus the encoding) is processed:

edifact/src/EDI/Parser.php

Lines 136 to 139 in 6fc2fe5

    
/**
 * Basic sanitization, remove non printable chars
 */
$line = preg_replace($this->stripChars, '', trim($line));

What if stripChars were set after the UNB line has been processed, and applied depending on the declared encoding?
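A minimal sketch of that idea, written as standalone functions rather than as a patch to the parser. `detectSyntaxId()` and `stripRegexFor()` are hypothetical names, the byte ranges are illustrative assumptions, and treating UNOW/UNOY as UTF-8 is an assumption taken from this thread, not from the standard:

```php
<?php
declare(strict_types=1);

// Hypothetical helper, not part of php-edifact: read the syntax identifier
// (UNOA, UNOB, UNOC, ...) from the first UNB segment of a raw interchange.
// Assumes the default "+" and ":" separators.
function detectSyntaxId(string $edi): ?string
{
    return preg_match('/UNB\+(UNO[A-Z]):\d/', $edi, $m) === 1 ? $m[1] : null;
}

// Hypothetical mapping from syntax identifier to a strip pattern.
// The byte ranges are illustrative assumptions, not copied from the standard.
function stripRegexFor(?string $syntaxId): string
{
    if ($syntaxId === 'UNOW' || $syntaxId === 'UNOY') {
        // Assumed to carry UTF-8 here: remove control characters only.
        // With the /u flag, preg_replace() returns null on invalid UTF-8.
        return '/[\x00-\x1F\x7F]/u';
    }
    if (in_array($syntaxId, ['UNOC', 'UNOD', 'UNOE', 'UNOF'], true)) {
        // Assumed single-byte ISO 8859 sets: keep printable ASCII
        // and the bytes 0xA0-0xFF, drop everything else.
        return '/[^\x20-\x7E\xA0-\xFF]/';
    }
    // UNOA, UNOB, or unknown: keep printable ASCII only.
    return '/[^\x20-\x7E]/';
}

// Usage: pick the pattern from the header, then sanitize a line with it.
$header = "UNA:+.? '\nUNB+UNOC:3+SENDER+RECEIVER+230809:1000+1'";
$line = "NAD+BY+++Stra\xDFe 1\x07'"; // ISO-8859-1 bytes plus a stray BEL
$clean = (string) preg_replace(stripRegexFor(detectSyntaxId($header)), '', $line);
echo bin2hex($clean), "\n";
```

The resulting pattern could then be handed to setStripRegex() before loadString(), provided the header is inspected first.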

sabas · 2018-08-08T20:21:33Z

Yeah, I don't actually test files with different encodings. Thanks for the file; I'll look at it more when I can.
The stripChars variable could theoretically be set before parsing with setStripRegex(), but when I checked with your file I didn't find a regex that works (I tried /[\x01-\x09\x0B-\x0C\x0E-\x1F\x7F-\x9F]/).
This surely needs changing; if you find a nice solution before I do, it's welcome :-)

pongraczi · 2018-08-09T19:21:16Z

Depending on the UNB settings in the file, I think one scenario should work:

new Reader/Parser without actual file or string
setStripRegex() --- force ignore UNB settings
parser->loadString() --- if stripregex is empty, practically preg_replace will not harm the string

At the moment it seems the basic sanitization always happens. Maybe I will have time to play with it at the weekend.
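On the "empty strip regex" step: a small standalone check of how preg_replace() itself behaves, independent of the library. It assumes the pattern reaches preg_replace() unchanged, as the quoted Parser.php line suggests. The sample line is made up for illustration:

```php
<?php
declare(strict_types=1);

$line = "NAD+BY+++\u{00C6}r\u{00F8}sk\u{00F8}bing'"; // UTF-8 bytes

// '//' is an empty pattern inside delimiters: it matches the empty string at
// every position, and replacing those matches with '' leaves the input as is.
$unchanged = preg_replace('//', '', $line);

// '' has no delimiters at all: PHP raises a warning and returns null.
// The @ only silences the warning for this demonstration.
$failed = @preg_replace('', '', $line);

var_dump($unchanged === $line, $failed);
```

So "empty" has to mean a pattern like "//" (as in the commented-out setStripRegex("//") call in the first post), not an empty string.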

k0mar12 · 2020-06-02T09:28:45Z

Hey all.

If somebody needs UNOE for Cyrillic:

$parser->setStripRegex("/[\x20-\x7E]\xA0-\xFF/");

gaxweb · 2023-08-09T09:54:14Z

This isn't actually fixed, is it? Please keep this open, if it isn't. If only to warn others of expected problems.

I'm trying to parse a file (Header starts with UNB+UNOC:3) with German characters like ß in them, and

edifact/src/EDI/Parser.php

Line 191 in c577183

$line = (string) \preg_replace($this->stripChars, '', $lineTrim);

replaces them with non-printable characters and then complains about it. Characters like äöü already arrive broken in the method. The funny part is that those ORDERS files were generated by the php-edifact/edifact-generator with those letters seemingly intact.

What exactly is missing for this to work as expected?

I've found this on the web: https://blog.sandro-pereira.com/2009/08/15/edifact-encoding-edi-character-set-support/ and according to https://en.wikipedia.org/wiki/ISO/IEC_8859 UNOC should allow for those characters to exist because it refers to ISO-8859-1 (Latin-1) encoding.

https://groups.google.com/g/botsmail/c/B6V5mwdcFts/m/DRcsE_K7BgAJ claims that UNOW is for UTF-8 while UNOY seems to be the whole UTF-32, or something… But apparently they only exist as part of syntax version 4, for which I've found this: https://www.gefeg.com/jswg/v4/data/v4_docs.htm

sabas · 2023-08-09T14:52:54Z

OK, I'm reopening it... I never had a file to actually test with; if you can send me one via email, I can look...

Perhaps we should use the multibyte versions of the various functions?

gaxweb · 2023-08-09T14:58:48Z

Do your classes support syntax v4? If so, one could simply set UNOW/Y, pass UTF-8 encoded strings and skip the char replacement. Should theoretically come out fine.

But fixing it so UNOC v3 actually allows the supported characters to come out correctly makes sense, too.

Right now it seems like your EDI classes don't concern themselves with text encoding whatsoever, do they?

sabas · 2023-08-10T17:48:25Z

Actually I don't know, I usually process EDI v3 only so I never looked in detail... Future work :)
For encoding, I only made sure to strip out invalid chars according to the required set (although, for example, I don't bother about uppercase, and one supplier once complained that I sent lowercase characters haha)
Someone tried to use utf8_encode if I recall, although that function is deprecated as of PHP 8.2

gaxweb · 2023-08-15T13:24:36Z

I have to partially retract my statements. It works fine, if you actually feed it text in one of the explicitly supported character encodings. I was feeding it UTF-8. 🤦‍♂️ My fault for not knowing enough about EDI. Sorry.

Now I think there should be a warning somewhere that UTF-8 isn't supported. Maybe a trigger_error() in the Parser::loadString() method? Something like:

// Unicode is only supported starting with UNOW syntax which requires syntax v4
// neither are currently (fully?) supported
if (mb_check_encoding($string, 'UTF-8') === true)
    trigger_error('UTF-8 encoded text found', E_USER_WARNING);

After all, PHP itself usually assumes UTF-8 encoding everywhere, if you don't specify something else. But I think any kind of text conversion should be left to the user.

sabas · 2023-08-15T14:09:07Z

I would prefer to only populate the errors array; the user should break execution explicitly by checking it... Is there some way of silently detecting and converting the file?
I would like to test it if you can share a sample...
Currently the Parser detects some metadata on the file (although that potentially won't work properly when an interchange contains multiple messages)
I can probably look at it in the next few days

gaxweb · 2023-08-15T14:41:51Z

A UTF-8 file might still come out fine as long as it contains only characters within the first 128 code points (plain ASCII), as in English text for example. So breaking execution in that case is overly strict; that's why I chose a mere warning instead.

Like I said, I'm not a fan of doing things silently in the background. Detecting character encoding with 100% certainty is impossible anyway. Only the creator of the text can know its encoding for sure. So better not mess with it and use it as is instead.

You can only check whether a text matches a certain encoding. So you'd have to write code in the sense of "if the file says it's UNOC, check if it passes an ISO-8859-1 encoding check" and so on.
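A rough sketch of such a check, collecting warnings in an array instead of calling trigger_error(). `encodingWarnings()` is a hypothetical name, not an existing method of the library, and it needs the mbstring extension. One caveat: any byte string is formally valid ISO-8859-1, so for UNOC the sketch looks for the opposite signal instead, namely non-ASCII bytes that happen to form valid UTF-8:

```php
<?php
declare(strict_types=1);

// Hypothetical check, not part of php-edifact: compare the raw bytes with the
// syntax identifier declared in UNB and return human-readable warnings.
function encodingWarnings(string $edi, string $syntaxId): array
{
    $warnings = [];
    // Any byte >= 0x80 means the content is not plain ASCII.
    $hasHighBytes = preg_match('/[\x80-\xFF]/', $edi) === 1;
    $validUtf8 = mb_check_encoding($edi, 'UTF-8');

    if (in_array($syntaxId, ['UNOA', 'UNOB'], true) && $hasHighBytes) {
        $warnings[] = "$syntaxId declared, but non-ASCII bytes found";
    }
    if ($syntaxId === 'UNOC' && $hasHighBytes && $validUtf8) {
        // Heuristic only: real ISO-8859-1 text rarely forms valid UTF-8
        // multi-byte sequences, but it is not impossible.
        $warnings[] = 'UNOC declared, but content looks like UTF-8';
    }
    return $warnings;
}

// Usage: the same word once as ISO-8859-1 bytes, once as UTF-8.
print_r(encodingWarnings("UNB+UNOC:3+Stra\xDFe'", 'UNOC'));
print_r(encodingWarnings("UNB+UNOC:3+Stra\u{00DF}e'", 'UNOC'));
```

Pure ASCII input produces no warning here, which matches the point that such files come out fine either way.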

sabas · 2024-10-16T08:07:10Z

@gaxweb could you check if the contribution from @feyst helps in this case?

sabas closed this as completed Aug 6, 2023

sabas reopened this Aug 9, 2023

gaxweb mentioned this issue Aug 16, 2023

Encoding check and restructured file loading #122

Merged
