
fix: emoji and unicode decoding bug #2404

Open · wants to merge 3 commits into main
Conversation

@yanfanvip

Describe your changes

[Fix]: emoji and Unicode decoding bug
[Fix]: fix the TypeScript build failure on the dlhandle

Issue ticket number and link

Checklist before requesting a review

  • [√] I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • I have added thorough documentation for my code.
  • I have tagged PR with relevant project labels. I acknowledge that a PR without labels may be dismissed.
  • If this PR addresses a bug, I have provided both a screenshot/video of the original bug and the working solution.

Demo

Before:
image

After:
image

Steps to Reproduce

  1. Choose a Llama model.
  2. Generate a reply that contains emoji, e.g.: Please use emoji to reply to me: hello
  3. The streamed tokens print [����] instead of the emoji (see the snippet below).
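
A minimal stand-alone illustration of what goes wrong (this is not the binding code, just a sketch of the failure mode): decoding each streamed chunk on its own breaks any multibyte character whose UTF-8 bytes span a chunk boundary.

    data = "hello 👋".encode("utf-8")       # the emoji is a 4-byte UTF-8 sequence
    chunk1, chunk2 = data[:8], data[8:]      # split in the middle of the emoji
    print(chunk1.decode("utf-8", errors="replace"))   # 'hello �'  (truncated lead bytes)
    print(chunk2.decode("utf-8", errors="replace"))   # '��'       (orphaned continuation bytes)
    print((chunk1 + chunk2).decode("utf-8"))          # 'hello 👋' (correct once reassembled)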

Notes

@cebtenzzre requested a review from jacoobes on June 4, 2024 at 16:04
@cebtenzzre (Member)

@jacoobes @iimez Curious if you think it would be better to implement this kind of thing in the C++ layer like this instead of in the bindings?

This is how the llama.cpp server does it: https://github.com/ggerganov/llama.cpp/blob/adc9ff384121f4d550d28638a646b336d051bf42/examples/server/server.cpp#L1147-L1167
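
For anyone who doesn't want to click through: the idea in the linked server code is roughly to check, before emitting a chunk, whether its trailing bytes are the start of a multibyte UTF-8 sequence that hasn't finished yet, and to hold those bytes back until the next token arrives. A rough sketch of that check, written in Python purely for illustration (the linked code is C++, and the helper name here is made up):

    def split_incomplete_utf8(data: bytes) -> tuple[bytes, bytes]:
        """Split data into (emit_now, hold_back); trailing bytes that start a
        multibyte UTF-8 sequence but do not complete it are held back."""
        for i in range(1, min(4, len(data)) + 1):
            byte = data[-i]
            if (byte & 0b1100_0000) == 0b1000_0000:
                continue  # 10xxxxxx continuation byte, keep scanning backwards
            # Found a lead byte; how long does it claim the sequence is?
            if (byte & 0b1000_0000) == 0:
                expected = 1  # 0xxxxxxx, plain ASCII
            elif (byte & 0b1110_0000) == 0b1100_0000:
                expected = 2  # 110xxxxx
            elif (byte & 0b1111_0000) == 0b1110_0000:
                expected = 3  # 1110xxxx
            else:
                expected = 4  # 11110xxx
            if expected > i:
                # The sequence claims more bytes than we have so far:
                # hold back everything from the lead byte onward.
                return data[:-i], data[-i:]
            break  # tail is complete (or malformed), safe to emit everything
        return data, b""

    # Usage sketch: emit, pending = split_incomplete_utf8(pending + token_bytes)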

@cebtenzzre (Member) left a comment

As-is this PR is in conflict with #2403.

@jacoobes (Collaborator) commented Jun 4, 2024

This would be great; all the bindings would benefit. We can remove the extra code from JavaScript and Python bindings!

@iimez (Collaborator) commented Jun 4, 2024

100% better in C++. From a maintenance point of view it's better for everybody, and I can't think of anything useful to do with the partial UTF-8 within the bindings or further downstream (or with the ability to stop generation on a partial character).

fritz.f.yan added 3 commits on June 6, 2024 at 08:48 (one reverts commit 64906b2; all signed off by fritz.f.yan <fritz.f.yan@newegg.com>)

@jacoobes (Collaborator) commented Jun 8, 2024

I'll try to test this when I'm free!

@cosmic-snow (Collaborator)

Quite some time ago I was involved when this was fixed in the Python bindings, although I don't really remember all the details. Note: this is not against this PR or even the idea; it's just to give more context on the problem itself.

If you really, truly wanted to fix this on the side of the backend, you'd expose the response directly in Unicode. However, there are several issues to consider:

  • The first is how to expose this through the C API, or rather, how to deal with representation. You're not talking C++ when using the bindings, and there is no notion of UTF-8 in C strings, so typically it's done with raw bytes instead, on an "it's-UTF-8, trust me" basis. That does make some sense, because other languages can typically turn those raw bytes into their own Unicode representations. But if you really wanted to do it properly, you'd use a library for Unicode handling on the C++ side instead of rolling your own.

  • Right now, the API is based on tokens, which are the "actual atoms" of a model. If you do the handling before the bindings, that is no longer true (a breaking change, although not too severe).

  • Typically, tokens form valid UTF-8 byte sequences, but there isn't actually a 100% guarantee that they do. It depends on the model's vocabulary (which could technically be inspected) and on how it was trained (which results in a black box, so no guarantees). I have never seen a model produce malformed UTF-8 in my tests (but I prefer English and no emojis in my responses, so that doesn't say much). In any case, you definitely would not want the backend to crash in the rare case when it does happen; bindings languages are usually better equipped to deal with turning raw bytes into Unicode.

  • It's not just emojis; it's also multibyte characters, like in Japanese, Chinese, or Korean.

  • Emojis (and maybe others? I'm not an expert) come with the added "fun" that some of them consist of several Unicode codepoints, i.e. several valid raw UTF-8 byte sequences. So handling the multibyte problem is a very important step, but it still doesn't address the Unicode problem in full (see the small example after this list).
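
To make that last point concrete, here is a small self-contained example (not from any of the bindings, just a demonstration of the grapheme-cluster issue):

    # "👍🏽" renders as one glyph but is two codepoints forming a grapheme cluster.
    thumbs_up_medium = "\U0001F44D\U0001F3FD"      # U+1F44D THUMBS UP SIGN + U+1F3FD skin tone modifier
    print(len(thumbs_up_medium))                   # 2 codepoints
    print(len(thumbs_up_medium.encode("utf-8")))   # 8 bytes: two valid 4-byte UTF-8 sequences
    # A stream can split between the two sequences; each half decodes as valid
    # UTF-8, yet a consumer that renders chunks immediately first shows a bare
    # thumbs-up and only afterwards the skin-tone modifier.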

For reference, here's where the decoding happens in the Python bindings:

def _raw_callback(token_id: int, response: bytes) -> bool:
    nonlocal self, callback
    decoded = []
    for byte in response:
        bits = "{:08b}".format(byte)
        (high_ones, _, _) = bits.partition('0')
        if len(high_ones) == 1:
            # continuation byte
            self.buffer.append(byte)
            self.buff_expecting_cont_bytes -= 1
        else:
            # beginning of a byte sequence
            if len(self.buffer) > 0:
                decoded.append(self.buffer.decode(errors='replace'))
                self.buffer.clear()
            self.buffer.append(byte)
            self.buff_expecting_cont_bytes = max(0, len(high_ones) - 1)
        if self.buff_expecting_cont_bytes <= 0:
            # received the whole sequence or an out of place continuation byte
            decoded.append(self.buffer.decode(errors='replace'))
            self.buffer.clear()
            self.buff_expecting_cont_bytes = 0
    if len(decoded) == 0 and self.buff_expecting_cont_bytes > 0:
        # wait for more continuation bytes
        return True
    # forward whatever was fully decoded to the user-facing callback
    return callback(token_id, ''.join(decoded))
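
In short: it counts each byte's leading 1-bits to know how many continuation bytes are still expected, buffers until the sequence is complete (or obviously broken), and only then decodes with errors='replace'; while it is still waiting for continuation bytes it just returns True so generation continues without emitting anything.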

@jacoobes (Collaborator)

Keep in mind that C++ can compile C files; while there technically is a C layer, it's there for downstream interop. So even though this implementation is in C++, the bindings should be able to use it (I may be missing something). Also keep in mind that this implementation is used in the llama.cpp completion server.

I still need to get CUDA working on Windows before I can test this.

@cosmic-snow (Collaborator)

> Keep in mind that C++ can compile C files; while there technically is a C layer, it's there for downstream interop. So even though this implementation is in C++, the bindings should be able to use it (I may be missing something). Also keep in mind that this implementation is used in the llama.cpp completion server.

Just to clarify: I mean yes, the C++ code is exposed through a C interface for interop with the bindings. C is basically the lowest common denominator in all of this. So everything you do in C++ you somehow have to squeeze into a C-compatible shape to be able to expose it to the bindings. And C by itself is notoriously weak in terms of expressiveness or "batteries included" (as they say in Python), even compared to C++. Also, C++ by itself (without libraries) doesn't solve the Unicode problem. And on a side note, different OSs have their own APIs.
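
To illustrate what "squeezing it into a C-compatible shape" looks like from the binding side, here is a minimal ctypes sketch. The names and signature are hypothetical, not the actual gpt4all C API; the point is that a C header can only promise "a pointer to bytes", and whether those bytes form complete UTF-8 is a convention rather than something the C type system can express.

    import ctypes

    # Hypothetical callback type: return value = keep generating?, arguments =
    # token id and the raw token bytes ("it's-UTF-8-trust-me").
    ResponseCallback = ctypes.CFUNCTYPE(ctypes.c_bool, ctypes.c_int32, ctypes.c_char_p)

    def on_token(token_id: int, raw: bytes) -> bool:
        # Binding-side decoding: the higher-level language is better equipped
        # to turn raw bytes into its own Unicode representation.
        print(raw.decode("utf-8", errors="replace"), end="", flush=True)
        return True

    callback = ResponseCallback(on_token)  # handed to the C API by the binding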

@cosmic-snow (Collaborator)

Alright, in any case and as I said initially: I didn't write here to interfere with this. In a nutshell, I wanted to give more context to the overall Unicode problem.

So, assuming there are no further discussions and you're happy with going this route, it's not my intention to stand in the way of this PR getting merged.
