
fix: emoji and unicode decoding bug #2404

Open · wants to merge 3 commits into main
Conversation

@yanfanvip

Describe your changes

[Fix]: emoji and Unicode decoding bug
[Fix]: fix the TypeScript build failure on the dlhandle

Issue ticket number and link

Checklist before requesting a review

  • [√] I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • I have added thorough documentation for my code.
  • I have tagged PR with relevant project labels. I acknowledge that a PR without labels may be dismissed.
  • If this PR addresses a bug, I have provided both a screenshot/video of the original bug and the working solution.

Demo

Before:
image

After:
image

Steps to Reproduce

  1. Choose a Llama model.
  2. Generate a reply that contains emoji, e.g.: Please use emoji to reply to me: hello
  3. The streamed tokens print [����] instead of the emoji (see the snippet below).
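
A minimal stand-alone illustration of what goes wrong (this is not the binding code, just a sketch of the failure mode): decoding each streamed chunk on its own breaks any multibyte character whose UTF-8 bytes span a chunk boundary.

    data = "hello 👋".encode("utf-8")       # the emoji is a 4-byte UTF-8 sequence
    chunk1, chunk2 = data[:8], data[8:]      # split in the middle of the emoji
    print(chunk1.decode("utf-8", errors="replace"))   # 'hello �'  (truncated lead bytes)
    print(chunk2.decode("utf-8", errors="replace"))   # '��'       (orphaned continuation bytes)
    print((chunk1 + chunk2).decode("utf-8"))          # 'hello 👋' (correct once reassembled)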

Notes

@cebtenzzre requested a review from jacoobes on June 4, 2024 at 16:04
@cebtenzzre (Member)

@jacoobes @iimez Curious if you think it would be better to implement this kind of thing in the C++ layer like this instead of in the bindings?

This is how the llama.cpp server does it: https://github.com/ggerganov/llama.cpp/blob/adc9ff384121f4d550d28638a646b336d051bf42/examples/server/server.cpp#L1147-L1167
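
For anyone who doesn't want to click through: the idea in the linked server code is roughly to check, before emitting a chunk, whether its trailing bytes are the start of a multibyte UTF-8 sequence that hasn't finished yet, and to hold those bytes back until the next token arrives. A rough sketch of that check, written in Python purely for illustration (the linked code is C++, and the helper name here is made up):

    def split_incomplete_utf8(data: bytes) -> tuple[bytes, bytes]:
        """Split data into (emit_now, hold_back); trailing bytes that start a
        multibyte UTF-8 sequence but do not complete it are held back."""
        for i in range(1, min(4, len(data)) + 1):
            byte = data[-i]
            if (byte & 0b1100_0000) == 0b1000_0000:
                continue  # 10xxxxxx continuation byte, keep scanning backwards
            # Found a lead byte; how long does it claim the sequence is?
            if (byte & 0b1000_0000) == 0:
                expected = 1  # 0xxxxxxx, plain ASCII
            elif (byte & 0b1110_0000) == 0b1100_0000:
                expected = 2  # 110xxxxx
            elif (byte & 0b1111_0000) == 0b1110_0000:
                expected = 3  # 1110xxxx
            else:
                expected = 4  # 11110xxx
            if expected > i:
                # The sequence claims more bytes than we have so far:
                # hold back everything from the lead byte onward.
                return data[:-i], data[-i:]
            break  # tail is complete (or malformed), safe to emit everything
        return data, b""

    # Usage sketch: emit, pending = split_incomplete_utf8(pending + token_bytes)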

@cebtenzzre (Member) left a comment

As-is this PR is in conflict with #2403.

@jacoobes (Collaborator) commented Jun 4, 2024

This would be great; all the bindings would benefit. We can remove the extra code from JavaScript and Python bindings!

@iimez (Collaborator) commented Jun 4, 2024

100% better in C++. From a maintenance point of view it's better for everybody, and I can't think of anything useful to do with the partial UTF-8 within the bindings or further downstream (or with the ability to stop generation on a partial character).

fritz.f.yan added 3 commits on June 6, 2024 at 08:48 (one reverts commit 64906b2; all signed off by fritz.f.yan <fritz.f.yan@newegg.com>)

@jacoobes (Collaborator) commented Jun 8, 2024

I'll try to test this when I'm free!

@cosmic-snow (Collaborator)

Quite some time ago I was involved when this was fixed in the Python bindings, although I don't really remember all the details. Note: this is not against this PR or even the idea; it's just to give more context on the problem itself.

If you really, truly wanted to fix this on the side of the backend, you'd expose the response directly in Unicode. However, there are several issues to consider:

  • The first is how to expose this through the C API, or rather, how to deal with representation. You're not talking C++ when using the bindings, and there is no notion of UTF-8 in C strings, so typically it's done with raw bytes instead, on an "it's-UTF-8, trust me" basis. That does make some sense, because other languages can typically turn those raw bytes into their own Unicode representations. But if you really wanted to do it properly, you'd use a library for Unicode handling on the C++ side instead of rolling your own.

  • Right now, the API is based on tokens, which are the "actual atoms" of a model. If you do the handling before the bindings, that is no longer true (a breaking change, although not too severe).

  • Typically, tokens form valid UTF-8 byte sequences, but there isn't actually a 100% guarantee that they do. It depends on the model's vocabulary (which could technically be inspected) and on how it was trained (which results in a black box, so no guarantees). I have never seen a model produce malformed UTF-8 in my tests (but I prefer English and no emojis in my responses, so that doesn't say much). In any case, you definitely would not want the backend to crash in the rare case when it does happen; bindings languages are usually better equipped to deal with turning raw bytes into Unicode.

  • It's not just emojis; it's also multibyte characters, like in Japanese, Chinese, or Korean.

  • Emojis (and maybe others? I'm not an expert) come with the added "fun" that some of them consist of several Unicode codepoints, i.e. several valid raw UTF-8 byte sequences. So handling the multibyte problem is a very important step, but it still doesn't address the Unicode problem in full (see the small example after this list).
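
To make that last point concrete, here is a small self-contained example (not from any of the bindings, just a demonstration of the grapheme-cluster issue):

    # "👍🏽" renders as one glyph but is two codepoints forming a grapheme cluster.
    thumbs_up_medium = "\U0001F44D\U0001F3FD"      # U+1F44D THUMBS UP SIGN + U+1F3FD skin tone modifier
    print(len(thumbs_up_medium))                   # 2 codepoints
    print(len(thumbs_up_medium.encode("utf-8")))   # 8 bytes: two valid 4-byte UTF-8 sequences
    # A stream can split between the two sequences; each half decodes as valid
    # UTF-8, yet a consumer that renders chunks immediately first shows a bare
    # thumbs-up and only afterwards the skin-tone modifier.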

For reference, here's where the decoding happens in the Python bindings:

def _raw_callback(token_id: int, response: bytes) -> bool:
    nonlocal self, callback
    decoded = []
    for byte in response:
        bits = "{:08b}".format(byte)
        (high_ones, _, _) = bits.partition('0')
        if len(high_ones) == 1:
            # continuation byte
            self.buffer.append(byte)
            self.buff_expecting_cont_bytes -= 1
        else:
            # beginning of a byte sequence
            if len(self.buffer) > 0:
                decoded.append(self.buffer.decode(errors='replace'))
                self.buffer.clear()
            self.buffer.append(byte)
            self.buff_expecting_cont_bytes = max(0, len(high_ones) - 1)
        if self.buff_expecting_cont_bytes <= 0:
            # received the whole sequence or an out of place continuation byte
            decoded.append(self.buffer.decode(errors='replace'))
            self.buffer.clear()
            self.buff_expecting_cont_bytes = 0
    if len(decoded) == 0 and self.buff_expecting_cont_bytes > 0:
        # wait for more continuation bytes
        return True
    # forward whatever was fully decoded to the user-facing callback
    return callback(token_id, ''.join(decoded))
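
In short: it counts each byte's leading 1-bits to know how many continuation bytes are still expected, buffers until the sequence is complete (or obviously broken), and only then decodes with errors='replace'; while it is still waiting for continuation bytes it just returns True so generation continues without emitting anything.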

@jacoobes (Collaborator)

Keep in mind that C++ can compile C files; while there technically is a C layer, it's there for downstream interop. So even though this implementation is in C++, the bindings should be able to use it (I may be missing something). Also keep in mind that this implementation is used in the llama.cpp completion server.

I still need to get CUDA working on Windows before I can test this.

@cosmic-snow (Collaborator)

> Keep in mind that C++ can compile C files; while there technically is a C layer, it's there for downstream interop. So even though this implementation is in C++, the bindings should be able to use it (I may be missing something). Also keep in mind that this implementation is used in the llama.cpp completion server.

Just to clarify: I mean yes, the C++ code is exposed through a C interface for interop with the bindings. C is basically the lowest common denominator in all of this. So everything you do in C++ you somehow have to squeeze into a C-compatible shape to be able to expose it to the bindings. And C by itself is notoriously weak in terms of expressiveness or "batteries included" (as they say in Python), even compared to C++. Also, C++ by itself (without libraries) doesn't solve the Unicode problem. And on a side note, different OSs have their own APIs.
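
To illustrate what "squeezing it into a C-compatible shape" looks like from the binding side, here is a minimal ctypes sketch. The names and signature are hypothetical, not the actual gpt4all C API; the point is that a C header can only promise "a pointer to bytes", and whether those bytes form complete UTF-8 is a convention rather than something the C type system can express.

    import ctypes

    # Hypothetical callback type: return value = keep generating?, arguments =
    # token id and the raw token bytes ("it's-UTF-8-trust-me").
    ResponseCallback = ctypes.CFUNCTYPE(ctypes.c_bool, ctypes.c_int32, ctypes.c_char_p)

    def on_token(token_id: int, raw: bytes) -> bool:
        # Binding-side decoding: the higher-level language is better equipped
        # to turn raw bytes into its own Unicode representation.
        print(raw.decode("utf-8", errors="replace"), end="", flush=True)
        return True

    callback = ResponseCallback(on_token)  # handed to the C API by the binding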

@cosmic-snow (Collaborator)

Alright, in any case and as I said initially: I didn't write here to interfere with this. In a nutshell, I wanted to give more context to the overall Unicode problem.

So, assuming there are no further discussions and you're happy with going this route, it's not my intention to stand in the way of this PR getting merged.
