Use DecoderFallback.ExceptionFallback to match Java's CodingErrorAction.REPORT, #1076 #1089

paulirwin · 2025-01-06T19:36:11Z

You've read the Contributor Guide and Code of Conduct.
You've included unit or integration tests for your change, where applicable.
You've included inline docs for your change, where applicable.
There's an open issue for the PR that you are making. If you'd like to propose a change, please open an issue to discuss the change or find an existing issue.

Use DecoderFallback.ExceptionFallback to match Java's CodingErrorAction.REPORT

Description

This adds an extension method to make it easy to match Java's CodingErrorAction.REPORT use in CharSet where it will throw an exception upon encountering invalid character sequences. All places where this is used in Lucene have been updated to use this extension method.

This should improve reliability of ensuring valid input in cases where Lucene would throw on invalid input, particularly with IOUtils.GetDecodingReader which, per its docs, was intended to throw on invalid input, but was doing a character replacement in Lucene.NET previously.

…1076

…fy returning arrays that might be null

…arget frameworks that support System.Text.Unicode.Utf8. Added tests to verify fallback is working.

NightOwl888

Thanks for the PR.

I went ahead and pushed an optimization to the Term.ToString() method, since System.Text.Unicode.Utf8 supports fallback without an exception.

I also discovered that the default fallback character for encodings in .NET doesn't match that in Java, so we need to adjust the value set to IOUtils.ENCODING_UTF_8_NO_BOM to:

    private static UTF8Encoding CreateUtf8Encoding()
    {
        // Create a UTF8Encoding instance
        UTF8Encoding encoding = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false, // No BOM, similar to Java
            throwOnInvalidBytes: false             // Match Java's silent handling of invalid bytes
        );

        // Set the DecoderFallback to use '\uFFFD' as the replacement character
        encoding.DecoderFallback = new DecoderReplacementFallback("\uFFFD");

        // Set the EncoderFallback to use '\uFFFD' for invalid character encoding
        encoding.EncoderFallback = new EncoderReplacementFallback("\uFFFD");

        return encoding;
    }

For the Encoding-derived classes, the default fallback character is ?, but in Java it is \uFFFD. The new System.Text.Unicode.Utf8 class uses the same default fallback character as in Java.

Note that the fallback character has been defined in other places in the codebase as ?, as well. Those should be reviewed.

Since this PR is encoding-related, I think it would be fine to put them here, but feel free to open separate issues if you want to keep them separate.

NightOwl888 · 2025-01-08T14:04:01Z

src/Lucene.Net/Support/Text/EncodingExtensions.cs

+        /// </remarks>
+        public static Encoding WithDecoderExceptionFallback(this Encoding encoding)
+        {
+            Encoding newEncoding = (Encoding)encoding.Clone();


I like the syntax, but it does a heap allocation every time the extension method is called. Let's change it to use a ConcurrentDictionary as a cache.

internal static class EncodingExtensions { private static readonly ConcurrentDictionary<Encoding, Encoding> exceptionFallbackCache = new ConcurrentDictionary<Encoding, Encoding>(); public static Encoding WithDecoderExceptionFallback(this Encoding encoding) { return exceptionFallbackCache.GetOrAdd(encoding, e => { Encoding newEncoding = (Encoding)e.Clone(); newEncoding.DecoderFallback = DecoderFallback.ExceptionFallback; return newEncoding; }); } }

GetOrAdd() isn't atomic so there is a slight chance we will get an extra instance if 2 threads call it at the same time, but it doesn't affect correctness and isn't very likely to happen.

According to ChatGPT, the GetHashCode() method of Encoding is not particularly well optimized, so it might be worth it to make a key based on the CodePage and DecoderFallback setting to avoid that overhead. But it would need benchmarking to see whether that makes much of a difference.

paulirwin added 2 commits January 6, 2025 12:23

Use DecoderFallback.ExceptionFallback to match Java behavior, apache#…

961d3e1

…1076

Add unit test for WithDecoderExceptionFallback

ba468c1

paulirwin added the notes:improvement An enhancement to an existing feature label Jan 6, 2025

paulirwin requested a review from NightOwl888 January 6, 2025 19:36

paulirwin and others added 3 commits January 6, 2025 16:50

Fix unit test namespace and doc comment

fa73676

Lucene.Net.Support.Buffers: Added ArrayPoolExtensions class to simpli…

f7a9186

…fy returning arrays that might be null

Lucene.Net.Index.Term::ToString(): Optimized writing UTF8 string on t…

fcc412e

…arget frameworks that support System.Text.Unicode.Utf8. Added tests to verify fallback is working.

NightOwl888 requested changes Jan 8, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use DecoderFallback.ExceptionFallback to match Java's CodingErrorAction.REPORT, #1076 #1089

Use DecoderFallback.ExceptionFallback to match Java's CodingErrorAction.REPORT, #1076 #1089

paulirwin commented Jan 6, 2025

NightOwl888 left a comment

NightOwl888 Jan 8, 2025

Use DecoderFallback.ExceptionFallback to match Java's CodingErrorAction.REPORT, #1076 #1089

Are you sure you want to change the base?

Use DecoderFallback.ExceptionFallback to match Java's CodingErrorAction.REPORT, #1076 #1089

Conversation

paulirwin commented Jan 6, 2025

Description

NightOwl888 left a comment

Choose a reason for hiding this comment

NightOwl888 Jan 8, 2025

Choose a reason for hiding this comment