Skip to content

Commit

Permalink
[StringUtils::indexOfAnyBut] redesign due to inconsistent/faulty beha…
Browse files Browse the repository at this point in the history
…viour regarding UTF-16 surrogates (#1327)

* [StringUtils::indexOfAnyBut] redesign due to inconsistent/faulty…
…behaviour regarding UTF-16 surrogates

Both signatures of StringUtils::indexOfAnyBut currently behave
inconsistently in matching UTF-16 supplementary characters and single
UTF-16 surrogate characters (i.e. paired and unpaired surrogates), since
they differ unnecessarily in their algorithmic implementations, use
their own incomplete and faulty interpretation of UTF-16 and don't take
full advantage of the standard library.

The example cases below show that they may yield contradictory results
or correct results for the wrong reasons.

This proposal gives a unified algorithmic implementation of both
signatures that
a) is much easier to grasp due to a clear mathematical set approach and
   safe iteration and doesn't become entangled in index arithmetic;
   stresses the set semantics of the 2nd argument
b) fully relies on the standard library for defined UTF-16
   handling/interpretation;
   paired surrogates are merged into one codepoint, unpaired surrogates
   are left as they are
c) scales much better with input sizes and result index position
d) can benefit from current and future improvements in the standard
   library and JVM
   (streams implementation, parallelization, JIT optimization, JEP 218,
   ???…)

The algorithm boils down to:
find index i of first char in cs such that
(cs.codePointAt(i) ∈ {x ∈ codepoints(cs) ∣ x ∉
codepoints(searchChars) })

Examples:
---------

<H>: high-surrogate character
<L>: low-surrogate character
(<H><L>): valid supplementary character
signature 1: StringUtils::indexOfAnyBut(final CharSequence seq,
final CharSequence searchChars)
signature 2: StringUtils::indexOfAnyBut(final CharSequence cs,
final char... searchChars)

Case 1: matching of unpaired high-surrogate
---------seq/cs-------searchChars------exp./new-----sig.1-------sig.2---

 1.1     <H>aaaa      <H>abcd          !found       !found      !found
  sig.2: 'a' happens to follow <H> in searchChars;
  sig.1: 'a' is somewhere in searchChars

 1.2     <H>baaa      <H>abcd          !found       !found      0
  sig.1: 'b' is somewhere in searchChars

 1.3     <H>aaaa      (<H><L>)abcd     0            !found      0
  sig.1: 'a' is somewhere in searchChars

 1.4     aaaa<H>      (<H><L>)abcd     4            !found      !found
  sig.1+2 don't interpret suppl. character

Case 2: matching of unpaired low-surrogate
---------seq/cs-------searchChars------exp./new-----sig.1-------sig.2---

 2.1     <L>aaaa      (<H><L>)abcd     0            !found      !found
  sig.1+2 don't interpret suppl. character

 2.2     aaaa<L>      (<H><L>)abcd     4            !found      !found
  sig.1+2 don't interpret suppl. character

Case 3: matching of supplementary character
---------seq/cs-------------searchChars-----exp./new----sig.1-----sig.2-

 3.1     (<H><L>)aaaa       <L>ab<H>cd      0           !found    0
  sig.1: <L> is somewhere in searchChars

 3.2     (<H><L>)aaaa       abcd            0           1         0
  sig.1 always points to low-surrogate of (fully) unmatched
  suppl. character

 3.3     (<H><L>)aaaa       abcd<H>         0           0         1
 3.4     (<H><L>)aaaa       abcd<L>         0           !found    0
  sig.1: <H> skipped by algorithm

* [StringUtils::indexOfAnyBut] further reduction of algorithm

by simplifying set consideration:
find index i of first char in seq such that (seq.codePointAt(i) ∉ { x ∈
codepoints(searchChars) })

* [StringUtils::indexOfAnyBut] simplify input-sequence iteration

by transforming ListIterator loop into index-based loop,
advancing by Character.charCount(codepoint);
enabling short-circuit processing, avoiding full in-advance processing of
input-sequence

* [StringUtils:indexOfAnyBut] parameterization of test functions

providing a single source-of-truth (arguments stream) for the two
function variants

* [StringUtils:indexOfAnyBut] remove comment

Set::contains of immutable Set has unclear desastrous performance issues
when searching for large values (here: >0xffff) in a set of smaller
values (including JDK 23)

---------

Co-authored-by: IBue <>
  • Loading branch information
IBue authored Jan 6, 2025
1 parent 1f83019 commit 665f047
Show file tree
Hide file tree
Showing 2 changed files with 42 additions and 40 deletions.
7 changes: 5 additions & 2 deletions src/main/java/org/apache/commons/lang3/StringUtils.java
Original file line number Diff line number Diff line change
Expand Up @@ -26,8 +26,10 @@
import java.util.List;
import java.util.Locale;
import java.util.Objects;
import java.util.Set;
import java.util.function.Supplier;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

import org.apache.commons.lang3.function.Suppliers;
import org.apache.commons.lang3.stream.LangCollectors;
Expand Down Expand Up @@ -2853,11 +2855,12 @@ public static int indexOfAnyBut(final CharSequence seq, final CharSequence searc
if (isEmpty(seq) || isEmpty(searchChars)) {
return INDEX_NOT_FOUND;
}
final int[] codePoints = searchChars.codePoints().sorted().toArray();
final Set<Integer> searchSetCodePoints = searchChars.codePoints()
.boxed().collect(Collectors.toSet());
// advance character index from one interpreted codepoint to the next
for (int curSeqCharIdx = 0; curSeqCharIdx < seq.length();) {
final int curSeqCodePoint = Character.codePointAt(seq, curSeqCharIdx);
if (Arrays.binarySearch(codePoints, curSeqCodePoint) < 0) {
if (!searchSetCodePoints.contains(curSeqCodePoint)) {
return curSeqCharIdx;
}
curSeqCharIdx += Character.charCount(curSeqCodePoint); // skip indices to paired low-surrogates
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -24,12 +24,17 @@
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;
import static org.junit.jupiter.params.provider.Arguments.arguments;

import java.nio.CharBuffer;
import java.util.Locale;
import java.util.stream.Stream;

import org.hamcrest.core.IsNot;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.Arguments;
import org.junit.jupiter.params.provider.MethodSource;

/**
* Tests {@link StringUtils} - Equals/IndexOf methods
Expand Down Expand Up @@ -90,6 +95,30 @@ public String toString() {

private static final String[] FOOBAR_SUB_ARRAY = {"ob", "ba"};

static Stream<Arguments> indexOfAnyBut_withSurrogateChars() {
// @formatter:off
return Stream.of(
arguments(CharU20000 + CharU20001, CharU20000, 2),
arguments(CharU20000 + CharU20001, CharU20001, 0),
arguments(CharU20000 + CharU20001, "abcd" + CharUSuppCharLow, 0),
arguments(CharU20000 + CharU20001, "abcd" + CharUSuppCharHigh, 0),
arguments(CharU20000, CharU20000, -1),
arguments(CharU20000, CharU20001, 0),
arguments(CharUSuppCharHigh + "aaaa", CharUSuppCharHigh + "abcd", -1),
arguments(CharUSuppCharHigh + "baaa", CharUSuppCharHigh + "abcd", -1),
arguments(CharUSuppCharHigh + "aaaa", CharU20000 + "abcd", 0),
arguments("aaaa" + CharUSuppCharHigh, CharU20000 + "abcd", 4),
arguments(CharUSuppCharLow + "aaaa", CharU20000 + "abcd", 0),
arguments("aaaa" + CharUSuppCharLow, CharU20000 + "abcd", 4),
arguments(CharU20000 + "aaaa", CharUSuppCharLow + "ab" + CharUSuppCharHigh + "cd", 0),
arguments(CharU20000 + "aaaa", "abcd", 0),
arguments(CharU20000 + "aaaa", "abcd" + CharUSuppCharHigh, 0),
arguments(CharU20000 + "aaaa", "abcd" + CharUSuppCharLow, 0),
arguments("aaaa" + CharU20000, CharU20000 + "abcd", -1)
);
// @formatter:on
}

@Test
public void testCompare_StringString() {
assertEquals(0, StringUtils.compare(null, null));
Expand Down Expand Up @@ -445,25 +474,10 @@ public void testIndexOfAnyBut_StringCharArray() {
assertEquals(0, StringUtils.indexOfAnyBut("aba", 'z'));
}

@Test
public void testIndexOfAnyBut_StringCharArrayWithSurrogateChars() {
assertEquals(2, StringUtils.indexOfAnyBut(CharU20000 + CharU20001, CharU20000.toCharArray()));
assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + CharU20001, CharU20001.toCharArray()));
assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + CharU20001, ("abcd" + CharUSuppCharLow).toCharArray()));
assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + CharU20001, ("abcd" + CharUSuppCharHigh).toCharArray()));
assertEquals(-1, StringUtils.indexOfAnyBut(CharU20000, CharU20000.toCharArray()));
assertEquals(0, StringUtils.indexOfAnyBut(CharU20000, CharU20001.toCharArray()));
assertEquals(-1, StringUtils.indexOfAnyBut(CharUSuppCharHigh + "aaaa", (CharUSuppCharHigh + "abcd").toCharArray()));
assertEquals(-1, StringUtils.indexOfAnyBut(CharUSuppCharHigh + "baaa", (CharUSuppCharHigh + "abcd").toCharArray()));
assertEquals(0, StringUtils.indexOfAnyBut(CharUSuppCharHigh + "aaaa", (CharU20000 + "abcd").toCharArray()));
assertEquals(4, StringUtils.indexOfAnyBut("aaaa" + CharUSuppCharHigh, (CharU20000 + "abcd").toCharArray()));
assertEquals(0, StringUtils.indexOfAnyBut(CharUSuppCharLow + "aaaa", (CharU20000 + "abcd").toCharArray()));
assertEquals(4, StringUtils.indexOfAnyBut("aaaa" + CharUSuppCharLow, (CharU20000 + "abcd").toCharArray()));
assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + "aaaa", (CharUSuppCharLow + "ab" + CharUSuppCharHigh + "cd").toCharArray()));
assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + "aaaa", "abcd".toCharArray()));
assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + "aaaa", ("abcd" + CharUSuppCharHigh).toCharArray()));
assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + "aaaa", ("abcd" + CharUSuppCharLow).toCharArray()));
assertEquals(-1, StringUtils.indexOfAnyBut("aaaa" + CharU20000, (CharU20000 + "abcd").toCharArray()));
@ParameterizedTest
@MethodSource("indexOfAnyBut_withSurrogateChars")
public void testIndexOfAnyBut_StringCharArrayWithSurrogateChars(final CharSequence seq, final String searchChars, final int expected) {
assertEquals(expected, StringUtils.indexOfAnyBut(seq, searchChars.toCharArray()));
}

@Test
Expand All @@ -483,25 +497,10 @@ public void testIndexOfAnyBut_StringString() {
assertEquals(0, StringUtils.indexOfAnyBut("ab", "z"));
}

@Test
public void testIndexOfAnyBut_StringStringWithSurrogateChars() {
assertEquals(2, StringUtils.indexOfAnyBut(CharU20000 + CharU20001, CharU20000));
assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + CharU20001, CharU20001));
assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + CharU20001, "abcd" + CharUSuppCharLow));
assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + CharU20001, "abcd" + CharUSuppCharHigh));
assertEquals(-1, StringUtils.indexOfAnyBut(CharU20000, CharU20000));
assertEquals(0, StringUtils.indexOfAnyBut(CharU20000, CharU20001));
assertEquals(-1, StringUtils.indexOfAnyBut(CharUSuppCharHigh + "aaaa", CharUSuppCharHigh + "abcd"));
assertEquals(-1, StringUtils.indexOfAnyBut(CharUSuppCharHigh + "baaa", CharUSuppCharHigh + "abcd"));
assertEquals(0, StringUtils.indexOfAnyBut(CharUSuppCharHigh + "aaaa", CharU20000 + "abcd"));
assertEquals(4, StringUtils.indexOfAnyBut("aaaa" + CharUSuppCharHigh, CharU20000 + "abcd"));
assertEquals(0, StringUtils.indexOfAnyBut(CharUSuppCharLow + "aaaa", CharU20000 + "abcd"));
assertEquals(4, StringUtils.indexOfAnyBut("aaaa" + CharUSuppCharLow, CharU20000 + "abcd"));
assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + "aaaa", CharUSuppCharLow + "ab" + CharUSuppCharHigh + "cd"));
assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + "aaaa", "abcd"));
assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + "aaaa", "abcd" + CharUSuppCharHigh));
assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + "aaaa", "abcd" + CharUSuppCharLow));
assertEquals(-1, StringUtils.indexOfAnyBut("aaaa" + CharU20000, CharU20000 + "abcd"));
@ParameterizedTest
@MethodSource("indexOfAnyBut_withSurrogateChars")
public void testIndexOfAnyBut_StringStringWithSurrogateChars(final CharSequence seq, final CharSequence searchChars, final int expected) {
assertEquals(expected, StringUtils.indexOfAnyBut(seq, searchChars));
}

@Test
Expand Down

0 comments on commit 665f047

Please sign in to comment.