-
Notifications
You must be signed in to change notification settings - Fork 641
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use DecoderFallback.ExceptionFallback to match Java's CodingErrorAction.REPORT, #1076 #1089
base: master
Are you sure you want to change the base?
Conversation
…fy returning arrays that might be null
…arget frameworks that support System.Text.Unicode.Utf8. Added tests to verify fallback is working.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR.
I went ahead and pushed an optimization to the Term.ToString()
method, since System.Text.Unicode.Utf8
supports fallback without an exception.
I also discovered that the default fallback character for encodings in .NET doesn't match that in Java, so we need to adjust the value set to IOUtils.ENCODING_UTF_8_NO_BOM
to:
private static UTF8Encoding CreateUtf8Encoding()
{
// Create a UTF8Encoding instance
UTF8Encoding encoding = new UTF8Encoding(
encoderShouldEmitUTF8Identifier: false, // No BOM, similar to Java
throwOnInvalidBytes: false // Match Java's silent handling of invalid bytes
);
// Set the DecoderFallback to use '\uFFFD' as the replacement character
encoding.DecoderFallback = new DecoderReplacementFallback("\uFFFD");
// Set the EncoderFallback to use '\uFFFD' for invalid character encoding
encoding.EncoderFallback = new EncoderReplacementFallback("\uFFFD");
return encoding;
}
For the Encoding
-derived classes, the default fallback character is ?
, but in Java it is \uFFFD
. The new System.Text.Unicode.Utf8
class uses the same default fallback character as in Java.
Note that the fallback character has been defined in other places in the codebase as ?
, as well. Those should be reviewed.
Since this PR is encoding-related, I think it would be fine to put them here, but feel free to open separate issues if you want to keep them separate.
/// </remarks> | ||
public static Encoding WithDecoderExceptionFallback(this Encoding encoding) | ||
{ | ||
Encoding newEncoding = (Encoding)encoding.Clone(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the syntax, but it does a heap allocation every time the extension method is called. Let's change it to use a ConcurrentDictionary
as a cache.
internal static class EncodingExtensions
{
private static readonly ConcurrentDictionary<Encoding, Encoding> exceptionFallbackCache = new ConcurrentDictionary<Encoding, Encoding>();
public static Encoding WithDecoderExceptionFallback(this Encoding encoding)
{
return exceptionFallbackCache.GetOrAdd(encoding, e =>
{
Encoding newEncoding = (Encoding)e.Clone();
newEncoding.DecoderFallback = DecoderFallback.ExceptionFallback;
return newEncoding;
});
}
}
GetOrAdd()
isn't atomic so there is a slight chance we will get an extra instance if 2 threads call it at the same time, but it doesn't affect correctness and isn't very likely to happen.
According to ChatGPT, the GetHashCode()
method of Encoding
is not particularly well optimized, so it might be worth it to make a key based on the CodePage
and DecoderFallback
setting to avoid that overhead. But it would need benchmarking to see whether that makes much of a difference.
Use DecoderFallback.ExceptionFallback to match Java's
CodingErrorAction.REPORT
Fixes #1076
Description
This adds an extension method to make it easy to match Java's
CodingErrorAction.REPORT
use inCharSet
where it will throw an exception upon encountering invalid character sequences. All places where this is used in Lucene have been updated to use this extension method.This should improve reliability of ensuring valid input in cases where Lucene would throw on invalid input, particularly with
IOUtils.GetDecodingReader
which, per its docs, was intended to throw on invalid input, but was doing a character replacement in Lucene.NET previously.