A library for comparing two text inputs. The inputs are first normalized, then compared using one or more algorithms already well implemented in tdebatty/java-string-similarity, and finally the result is evaluated with a user-defined strategy.
The rules that prepare the text for comparison are defined with the PhraseNormalizerFactory class.
The following rules can be set:
- skip words with length < n
- lowercase the whole input
- split the input into chunks using the splitterDelimiter
- replace one or more words with another
- discard one or more words
PhraseNormalizerFactory.newOne()
    .withMinWordLength(3)
    .withApplyLowerCase(true)
    .withSplitterDelimiter("/")
    .addReplacement(word, replace)
    .addDiscardWord(word)
    .build()
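To make the effect of these rules concrete, here is a minimal, self-contained sketch of what such a normalization could do. It does not use the library: the class NormalizationSketch, its normalize method and the order in which the rules are applied (lower casing, splitting, replacement, length filter, discard) are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical stand-in for the normalization step; NOT the library's code.
class NormalizationSketch {

    // Returns one list of words per chunk, keeping the original order.
    static List<List<String>> normalize(String input,
                                        int minWordLength,
                                        boolean lowerCase,
                                        String splitterDelimiter,
                                        Map<String, String> replacements,
                                        Set<String> discardWords) {
        String text = lowerCase ? input.toLowerCase(Locale.ROOT) : input;
        List<List<String>> chunks = new ArrayList<>();
        // Pattern.quote makes the delimiter a literal, not a regex.
        for (String chunk : text.split(Pattern.quote(splitterDelimiter))) {
            List<String> words = new ArrayList<>();
            for (String word : chunk.trim().split("\\s+")) {
                // Assumed order: replace first, then filter.
                String replaced = replacements.getOrDefault(word, word);
                if (replaced.length() < minWordLength) continue; // too short
                if (discardWords.contains(replaced)) continue;   // discarded
                words.add(replaced);
            }
            chunks.add(words);
        }
        return chunks;
    }

    public static void main(String[] args) {
        // prints [[soccer], [athlete], [ball]]
        System.out.println(normalize("Soccer/The Player/ball", 3, true, "/",
                Map.of("player", "athlete"), Set.of("the")));
    }
}
```

In this sketch "the" survives the length filter (its length is exactly 3) and is removed only because it is listed as a discard word.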
After PhraseNormalizerFactory is applied, the output is an organized object in which the text is subdivided into Phrase, Chunks and Words.
Each of these objects belongs to the CharacterSequence family and has three main fields that tell where it comes from, what position it had in its context, and what it contains:
- parent: the element it was extracted from
- sort index: its position inside its root context
- sequence: its content as a string
These fields can be important when comparing two CharacterSequence objects.
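The three fields can be pictured with a small, self-contained model. The class SequenceSketch, its children list and its splitInto helper are hypothetical names introduced only for this illustration; the library's real CharacterSequence classes may be shaped differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of parent / sort index / sequence; NOT the library's classes.
class SequenceSketch {
    final SequenceSketch parent; // element it was extracted from (null for the phrase)
    final int sortIndex;         // position inside the parent
    final String sequence;       // content as a string
    final List<SequenceSketch> children = new ArrayList<>();

    SequenceSketch(SequenceSketch parent, int sortIndex, String sequence) {
        this.parent = parent;
        this.sortIndex = sortIndex;
        this.sequence = sequence;
    }

    // Splits this sequence on a regex and registers each piece as a child.
    SequenceSketch splitInto(String regex) {
        String[] parts = sequence.split(regex);
        for (int i = 0; i < parts.length; i++) {
            children.add(new SequenceSketch(this, i, parts[i]));
        }
        return this;
    }

    public static void main(String[] args) {
        SequenceSketch phrase = new SequenceSketch(null, 0, "soccer player/ball").splitInto("/");
        SequenceSketch firstChunk = phrase.children.get(0).splitInto(" ");
        SequenceSketch word = firstChunk.children.get(1);
        // prints: player @1 in "soccer player"
        System.out.println(word.sequence + " @" + word.sortIndex
                + " in \"" + word.parent.sequence + "\"");
    }
}
```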
The comparison process compares, in every combination, the normalized CharacterSequence objects of the same type: Phrase with Phrase, Chunk with Chunk, Word with Word. Every comparison uses the distance algorithms defined in the StringDistanceAlgorithms class and produces a CharacterSequenceComparison, an object containing the two compared CharacterSequence objects and a score map generated by the algorithms. Through the compared elements it is also possible to access the sort index property and the parent length, to understand the position of each element in its root context and use this for a better evaluation of the result.
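The all-pairs comparison with a score map per pair can be sketched as follows. This is a rough illustration under stated assumptions: PairwiseComparisonSketch, Comparison and compareAll are hypothetical names, plain strings stand in for CharacterSequence objects, and a hand-written Levenshtein distance replaces the algorithms the library delegates to java-string-similarity.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical sketch of the all-pairs comparison; NOT the library's code.
class PairwiseComparisonSketch {

    // Rough counterpart of CharacterSequenceComparison: a pair plus its scores.
    static final class Comparison {
        final String first;
        final String second;
        final Map<String, Double> scoreMap; // algorithm name -> score

        Comparison(String first, String second, Map<String, Double> scoreMap) {
            this.first = first;
            this.second = second;
            this.scoreMap = scoreMap;
        }
    }

    // Classic dynamic-programming Levenshtein distance, keeping two rows.
    static double levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // current row becomes previous
        }
        return prev[b.length()];
    }

    // Compares every element of the first list with every element of the second.
    static List<Comparison> compareAll(List<String> first, List<String> second,
            Map<String, BiFunction<String, String, Double>> algorithms) {
        List<Comparison> result = new ArrayList<>();
        for (String a : first) {
            for (String b : second) {
                Map<String, Double> scores = new LinkedHashMap<>();
                algorithms.forEach((name, fn) -> scores.put(name, fn.apply(a, b)));
                result.add(new Comparison(a, b, scores));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, BiFunction<String, String, Double>> algorithms = new LinkedHashMap<>();
        algorithms.put("levenshtein", PairwiseComparisonSketch::levenshtein);
        for (Comparison c : compareAll(List.of("soccer", "player"),
                                       List.of("soccer", "play"), algorithms)) {
            System.out.println(c.first + " vs " + c.second + " -> " + c.scoreMap);
        }
    }
}
```

With distance-based algorithms such as this one, a score of 0 means the two strings are identical.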
Each CharacterSequenceComparison is collected in a ComparingResult, which is then passed as input to the defined strategies that evaluate whether the comparison has passed or not.
We can define one or more comparison-result evaluation strategies and then relate them to each other through the logical operators OR or AND. By implementing the ComparationPassedStrategy interface we can define how the ComparingResult object is evaluated in order to consider the comparison passed or not.
public class OneWordIsEnoughStrategy implements ComparationPassedStrategy {
    @Override
    public boolean isPassed(ComparingResult r) {
        // Passed as soon as any algorithm reports a score of 0 for any pair of words.
        return r.getByUnit(CharacterSequenceUnit.WORD)
                .stream()
                .flatMap(c -> c.getScoreMap().values().stream())
                .anyMatch(score -> score == 0);
    }
}
Similarity s = Similarity.Builder
        .newOne()
        .withFirstFactorNormalizationRules(PhraseNormalizerFactory.newOne()
                .withMinWordLength(3)
                .withApplyLowerCase(true)
                .withSplitterDelimiter("/")
                .build())
        .withSecondFactorNormalizationRules(PhraseNormalizerFactory.newOne()
                .withMinWordLength(3)
                .withApplyLowerCase(true)
                .withSplitterDelimiter("-")
                .build())
        .addPassedStrategy(new OneWordIsEnoughStrategy(), Similarity.Builder.LogicalOperator.OR)
        .build();
boolean willBeTrue = s.compare("soccer/player/ball", "soccer and play-shoes-bombastic");
boolean willBeFalse = s.compare("tennis/player/field", "rock and base-shoes-bombastic");
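How several strategies might be related through OR and AND can be pictured with plain java.util.function.Predicate objects. This is only a sketch: StrategyCombinationSketch and combine are hypothetical names, a list of scores stands in for ComparingResult, and folding the strategies left to right (with the operator of the first strategy ignored) is an assumption, not a description of how Similarity.Builder actually chains them.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of chaining strategies with AND / OR; NOT the library's code.
class StrategyCombinationSketch {

    enum LogicalOperator { AND, OR }

    // Folds the next strategy into the chain built so far.
    // Assumption: for the first strategy (chain == null) the operator is ignored.
    static <T> Predicate<T> combine(Predicate<T> chain, Predicate<T> next, LogicalOperator op) {
        if (chain == null) {
            return next;
        }
        return op == LogicalOperator.AND ? chain.and(next) : chain.or(next);
    }

    public static void main(String[] args) {
        // Each "strategy" evaluates a list of distance scores.
        Predicate<List<Double>> oneExactMatch = scores -> scores.stream().anyMatch(v -> v == 0.0);
        Predicate<List<Double>> allClose = scores -> scores.stream().allMatch(v -> v <= 2.0);

        Predicate<List<Double>> either = combine(
                combine(null, oneExactMatch, LogicalOperator.OR), allClose, LogicalOperator.OR);
        Predicate<List<Double>> both = combine(
                combine(null, oneExactMatch, LogicalOperator.OR), allClose, LogicalOperator.AND);

        List<Double> scores = List.of(0.0, 5.0);
        System.out.println(either.test(scores)); // true: one score is 0
        System.out.println(both.test(scores));   // false: 5.0 is not <= 2.0
    }
}
```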
TODO
- PhraseNormalizerFactory: add an option to apply trim
- handle the case where a replacement replaces the entire input, so that only whitespace remains