Give error when initializing tokenizer with too high stride #1306
Conversation
Thanks for the PR.
The code is good.
It's a breaking change, but it's only on the Rust side.
However, I'm very curious when this is actually an issue. Having such large strides really seems like a bug in the first place to me; it almost looks like you're trying to get a sliding window out of it. Is that it?
```diff
     Ok(())
 }

 /// Disable truncation
 #[pyo3(text_signature = "(self)")]
 fn no_truncation(&mut self) {
-    self.tokenizer.with_truncation(None);
+    let _ = self.tokenizer.with_truncation(None);
```
Can you `expect` here? `no_truncation` should never fail, but it's better to actually fail than be in the dark.
```rust
/// Fails if `stride` is too high relative to `max_length` and `post_processor.added_tokens()`
pub fn with_truncation(&mut self, trunc: Option<TruncationParams>) -> Result<&mut Self> {
    if let Some(trunc_params) = &trunc {
        let n_added_tokens = self.get_n_added_tokens(false);
```
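For illustration, here is a minimal Python sketch of the condition this check enforces (the function name and error message are my own, not the library's): after reserving room for the special tokens that post-processing adds, the stride must be strictly smaller than what remains of `max_length`, or overflowing windows could never advance.

```python
def check_truncation_params(max_length, stride, n_added_tokens):
    """Sketch of the validity check: the stride must be strictly
    smaller than the number of content tokens that fit in one window
    (max_length minus the added special tokens)."""
    effective_len = max_length - n_added_tokens
    if stride >= effective_len:
        raise ValueError(
            f"stride ({stride}) must be less than max_length ({max_length}) "
            f"minus the number of added special tokens ({n_added_tokens})"
        )

# Fine: BERT-style single sequence adds 2 special tokens ([CLS], [SEP])
check_truncation_params(max_length=512, stride=128, n_added_tokens=2)

# Invalid: stride 3 with max_length 5 and 2 added tokens leaves only a
# 3-token window, so consecutive windows would never advance
try:
    check_truncation_params(max_length=5, stride=3, n_added_tokens=2)
except ValueError as e:
    print("error:", e)
```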
Why `false`? It depends.
I assumed that fewer tokens were added when `is_pair=false` (I'm not actually sure if this is a good assumption). So, when `with_truncation()` is called, we check if there is any possible case where this is a valid combination of `max_length` and `stride` for this tokenizer, and fail if it isn't.

Another reason why I kept the original `assert!()` from the inference part of the code is that it's possible that any additional special token(s) when `is_pair=true` will push it over the edge and make the `stride` too large for the `max_length`.
@Narsil can you confirm whether my assumption in the above comment is a good one (i.e. is `added_tokens(false)` always less than `added_tokens(true)`?)
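To make the assumption concrete, here is a toy model of a BERT-style template post-processor (purely illustrative; this is not the actual `added_tokens()` implementation): a single sequence is wrapped as `[CLS] A [SEP]` (2 added tokens), a pair as `[CLS] A [SEP] B [SEP]` (3), so under that template the single-sequence count is indeed the smaller one.

```python
# Toy model of a BERT-style template post-processor; purely
# illustrative, not the tokenizers crate's added_tokens() API.
def added_tokens(is_pair: bool) -> int:
    # single: [CLS] A [SEP]          -> 2 special tokens
    # pair:   [CLS] A [SEP] B [SEP]  -> 3 special tokens
    return 3 if is_pair else 2

assert added_tokens(False) < added_tokens(True)
```

A custom post-processor could in principle violate this ordering, which is why the question matters.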
You are right that this should probably never happen if someone is reading the documentation carefully (and properly understands HuggingFace's idiosyncratic definition of "stride"). However, this issue might come up if the user is using some toy examples. E.g. if you're playing around with tokenizers and you want to look at example outputs, you might try something like:

```python
tokenizer = tokenizers.Tokenizer.from_pretrained('bert-base-cased')
tokenizer.enable_truncation(5, stride=3)
print(tokenizer.encode("Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."))
```

which would lead to this error. I think users would be particularly inclined to "play around" with tokenizers like this (I certainly did), since the interplay of "stride", "max length", and "added tokens" is hard to wrap your head around without examples.
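Since that interplay is hard to grasp, here is a standalone sketch of how overflow windows relate to stride (following HuggingFace's definition, where `stride` is the number of tokens two consecutive windows share; the helper function is illustrative, not the library API):

```python
def sliding_windows(tokens, max_length, stride, n_added_tokens):
    # Each overflow window holds (max_length - n_added_tokens) content
    # tokens; consecutive windows overlap by `stride` tokens, so each
    # window's start advances by (window - stride).
    window = max_length - n_added_tokens
    step = window - stride
    assert step > 0, "stride too large: windows would never advance"
    return [tokens[i:i + window] for i in range(0, len(tokens), step)]

# Fine: 4-token content windows overlapping by 2 tokens
print(sliding_windows(list(range(10)), max_length=6, stride=2, n_added_tokens=2))

# The toy case above: max_length=5, stride=3, and 2 special tokens leave
# a 3-token window and a step of 0, so the assertion fires
try:
    sliding_windows(list(range(10)), max_length=5, stride=3, n_added_tokens=2)
except AssertionError as e:
    print("error:", e)
```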
Related to #1269. Implements improvements requested here.

Enabling truncation with bad parameters (specifically `stride` and `max_length`) on a tokenizer now gives an error (in the form of a proper Python exception explaining the relation of added special tokens and max length to stride).

Note that I didn't remove the original `assert!()` for a few reasons:
- The new check assumes that the fewest special tokens are added when `is_pair = false`. I'm not sure if this is true/a justified assumption.
- Some call sites currently discard the `Err` output of the function. It might be nice to implement that functionality as part of this PR.