Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use DecoderFallback.ExceptionFallback to match Java's CodingErrorAction.REPORT, #1076 #1089

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

paulirwin
Copy link
Contributor

  • You've read the Contributor Guide and Code of Conduct.
  • You've included unit or integration tests for your change, where applicable.
  • You've included inline docs for your change, where applicable.
  • There's an open issue for the PR that you are making. If you'd like to propose a change, please open an issue to discuss the change or find an existing issue.

Use DecoderFallback.ExceptionFallback to match Java's CodingErrorAction.REPORT

Fixes #1076

Description

This adds an extension method to make it easy to match Java's CodingErrorAction.REPORT use in CharSet where it will throw an exception upon encountering invalid character sequences. All places where this is used in Lucene have been updated to use this extension method.

This should improve reliability of ensuring valid input in cases where Lucene would throw on invalid input, particularly with IOUtils.GetDecodingReader which, per its docs, was intended to throw on invalid input, but was doing a character replacement in Lucene.NET previously.

@paulirwin paulirwin added the notes:improvement An enhancement to an existing feature label Jan 6, 2025
@paulirwin paulirwin requested a review from NightOwl888 January 6, 2025 19:36
paulirwin and others added 3 commits January 6, 2025 16:50
…arget frameworks that support System.Text.Unicode.Utf8. Added tests to verify fallback is working.
Copy link
Contributor

@NightOwl888 NightOwl888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR.

I went ahead and pushed an optimization to the Term.ToString() method, since System.Text.Unicode.Utf8 supports fallback without an exception.

I also discovered that the default fallback character for encodings in .NET doesn't match that in Java, so we need to adjust the value set to IOUtils.ENCODING_UTF_8_NO_BOM to:

    private static UTF8Encoding CreateUtf8Encoding()
    {
        // Create a UTF8Encoding instance
        UTF8Encoding encoding = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false, // No BOM, similar to Java
            throwOnInvalidBytes: false             // Match Java's silent handling of invalid bytes
        );

        // Set the DecoderFallback to use '\uFFFD' as the replacement character
        encoding.DecoderFallback = new DecoderReplacementFallback("\uFFFD");

        // Set the EncoderFallback to use '\uFFFD' for invalid character encoding
        encoding.EncoderFallback = new EncoderReplacementFallback("\uFFFD");

        return encoding;
    }

For the Encoding-derived classes, the default fallback character is ?, but in Java it is \uFFFD. The new System.Text.Unicode.Utf8 class uses the same default fallback character as in Java.

Note that the fallback character has been defined in other places in the codebase as ?, as well. Those should be reviewed.

Since this PR is encoding-related, I think it would be fine to put them here, but feel free to open separate issues if you want to keep them separate.

/// </remarks>
public static Encoding WithDecoderExceptionFallback(this Encoding encoding)
{
Encoding newEncoding = (Encoding)encoding.Clone();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the syntax, but it does a heap allocation every time the extension method is called. Let's change it to use a ConcurrentDictionary as a cache.

internal static class EncodingExtensions
{
    private static readonly ConcurrentDictionary<Encoding, Encoding> exceptionFallbackCache = new ConcurrentDictionary<Encoding, Encoding>();
    
    public static Encoding WithDecoderExceptionFallback(this Encoding encoding)
    {
        return exceptionFallbackCache.GetOrAdd(encoding, e =>
        {
            Encoding newEncoding = (Encoding)e.Clone();
            newEncoding.DecoderFallback = DecoderFallback.ExceptionFallback;
            return newEncoding;
        });
    }
}

GetOrAdd() isn't atomic so there is a slight chance we will get an extra instance if 2 threads call it at the same time, but it doesn't affect correctness and isn't very likely to happen.

According to ChatGPT, the GetHashCode() method of Encoding is not particularly well optimized, so it might be worth it to make a key based on the CodePage and DecoderFallback setting to avoid that overhead. But it would need benchmarking to see whether that makes much of a difference.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
notes:improvement An enhancement to an existing feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Review use of Encoding fallback handling
2 participants