[Bug]: deal with UTF-16(with surrogate pair) file error #276
Comments
this issue like come on! Best regards to you! |
bit7z must reverse this work on the UString u it gets from 7-Zip: when communicating with 7-Zip, do this for every UString obtained from 7-Zip. Can this work? |
Hi! Yeah, as I mentioned in that other issue, this is not so much a bit7z bug as incorrect string encoding handling in 7-Zip. Bit7z expects UTF-32 wide strings on Linux and macOS, as it is basically the standard encoding used in 99% of cases for this kind of string. Unfortunately, 7-Zip doesn't return valid UTF-32 strings when the original string contains UTF-16 surrogate pairs (and also in other cases). This is actually the major issue blocking the release of v4.1-beta, as finding a good solution is not easy.
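To make the surrogate-pair case concrete, here is a minimal sketch of how such a string could be repaired. It assumes each UTF-16 code unit arrives in its own 32-bit element. `merge_surrogates` is a hypothetical helper, not part of bit7z or 7-Zip, and it uses `std::u32string` instead of 7-Zip's `UString` to stay self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not a bit7z or 7-Zip API): merges UTF-16 surrogate
// pairs that were stored one code unit per 32-bit element into single
// UTF-32 code points. Everything else is copied unchanged.
std::u32string merge_surrogates(const std::u32string& in) {
    std::u32string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char32_t hi = in[i];
        const bool is_high = hi >= 0xD800 && hi <= 0xDBFF;
        if (is_high && i + 1 < in.size()) {
            const char32_t lo = in[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                // Standard UTF-16 decoding of a high/low surrogate pair.
                out.push_back(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
                ++i;  // the low surrogate has been consumed
                continue;
            }
        }
        out.push_back(hi);  // BMP character or unpaired surrogate, kept as-is
    }
    return out;
}
```

For the file name in this report, the emoji 😊 (U+1F60A) would arrive as the two elements 0xD83D and 0xDE0A, which the sketch merges back into one code point. This only covers strings that really contain UTF-16 code units, so it does not address the other cases mentioned above.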
Initially, this was my idea, but unfortunately it is not so simple. 7-Zip's behavior changes according to the archive format: 7z archives use UTF-16 strings, so your fix should work, but zip and other archives use other encodings, and 7-Zip seems to decode such strings incorrectly as well. For example, if the zip archive stores UTF-8 strings, 7-Zip will provide the UTF-8 code units as 32-bit wide characters, which is not a UTF-32 encoded string. The main problem is then that bit7z's code for the string conversion doesn't have access to the information about the archive format from which the string was read, so it doesn't know which input encoding it should convert from. Also, some formats might use different encodings (e.g., the zip format can use encodings other than UTF-8). I'm working on a solution, which will come in the next v4.1-beta, but it will require some time. |
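As an illustration of the zip case, the sketch below narrows a wide string whose elements are really UTF-8 code units back into a byte string. `narrow_code_units` is a hypothetical name, not a bit7z function, and the sketch assumes the caller already knows the elements are UTF-8 code units. That is exactly the information the conversion code currently lacks, since elements up to 0xFF could just as well be Latin-1 code points.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not a bit7z API): copies each 32-bit element into one
// byte, assuming the elements are UTF-8 code units widened by 7-Zip.
// Returns false (and leaves `out` empty) if any element does not fit in a
// byte. It does NOT check that the resulting bytes form valid UTF-8.
bool narrow_code_units(const std::u32string& in, std::string& out) {
    out.clear();
    out.reserve(in.size());
    for (const char32_t unit : in) {
        if (unit > 0xFF) {
            out.clear();
            return false;  // a real code point, not a widened code unit
        }
        out.push_back(static_cast<char>(static_cast<unsigned char>(unit)));
    }
    return true;
}
```

For example, the character 你 (U+4F60) is stored in UTF-8 as the bytes E4 BD A0. If 7-Zip hands these over as three 32-bit elements, the sketch recovers the original three bytes, whereas a genuine UTF-32 string containing 0x4F60 is rejected.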
Yeah, the problem is that all these options like
As I said, your solution should work for 7z archives. A similar solution could be used for zip archives, if we just make UTF-8 the default encoding, but then we have to allow the user to specify a different encoding (basically implementing the Also, I wouldn't use the Finally, I still need to evaluate the performance impact of such string conversions, since you are essentially calling the same iconv conversion function for each character of the string, rather than calling it once over a string. I'll have to do some analysis in that sense to figure out the best approach. As I said, on bit7z side there's some restructuring to do as the information on the archive format doesn't reach the string conversion functions. |
I understand now: you want to create an interface that lets the user set the charset. Without such an interface everything is simple: just deal with the UString going in and out. |
If you provide this interface to set the charset, it is only used by 7-Zip internally and does not affect how you read and write UString, because you always read and write UString, and the UString format is well defined. Am I getting it right? |
Sorry for the late reply.
I'm not sure, as I'm not quite sure what you're trying to say in your last two comments, sorry. Bit7z cannot change the behavior of 7-Zip's internal code, so it has to deal with the wrongly encoded wide strings (UString). If the wide characters in such strings are actually UTF-8 code units, or they're simply ASCII characters, then converting them to narrow UTF-8 strings (i.e., the strings used in the bit7z API) is easy. All other cases will require some thought about how to do the conversion without unnecessary overhead. Also, the lack of support for the Bit7z could provide the raw UStrings directly to the user, but wide strings are unusual and difficult to handle on Linux and macOS, so I will not consider this solution, as it would make the library harder to use. |
bit7z version
4.0.8
Compilation options
No response
7-zip version
v24.09
7-zip shared library used
lib7zip.dylib
Compilers
Clang
Compiler versions
No response
Architecture
x86_64
Operating system
macOS
Operating system versions
No response
Bug description
item.name throws an exception on macOS
Steps to reproduce
on macOS
git clone https://github.com/debugee/bit7zip-test.git
cd bit7zip-test
cmake -B build .
cmake --build build --target install
./build/test ./😊123你好.zip
./😊123你好.zip
Processing archive: ./😊123你好.zip
Archive properties
Items count: 2
Folders count: 0
Files count: 2
Size: 220
Packed size: 92
Archived items
Item index: 0
Name: Exception: wstring_convert: to_bytes error
Expected behavior
No response
Relevant compilation output