GitHub - fulmicotone/fulmicotone-strings-similarity: Simple library to compare two phrases through different algorithms, already well implemented in debatty / java-string-similarity. The library compares two texts first as a whole, then piece by piece, and finally as single words in every combination. Each user can then implement their own strategy to decide whether a comparison is valid or not.

Strings-similarity

A library to compare two text inputs. First the inputs are normalized; then they are compared through one or more algorithms, already well implemented in debatty / java-string-similarity; finally the result is evaluated with a custom strategy.

Normalization

To define the rules that prepare the text for comparison, use the PhraseNormalizerFactory class.

We can set the following rules:

skip words with length < n
lower-case all the input
split the input into chunks using the splitterDelimiter
replace one or more words with another
discard one or more words

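  PhraseNormalizerFactory.newOne()
      .withMinWordLength(3)            // skip words shorter than 3 characters
      .withApplyLowerCase(true)        // lower-case all the input
      .withSplitterDelimiter("/")      // split the input into chunks on "/"
      .addReplacement(word, replace)   // word and replace are String values
      .addDiscardWord(word)            // word is the String to discard
      .build();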

After applying PhraseNormalizerFactory, the output is an organised object in which the text is subdivided into Phrase, Chunks and Words.
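The rules listed above can be sketched in plain Java, independently of the library. This is only an illustration under stated assumptions: the class `NormalizationSketch`, its `normalize` method, and the order in which the rules are applied are hypothetical and are not taken from PhraseNormalizerFactory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration; NOT the library's PhraseNormalizerFactory.
final class NormalizationSketch {

    /** Returns one list of words per chunk of the phrase. */
    static List<List<String>> normalize(String phrase,
                                        String delimiter,
                                        int minWordLength,
                                        Map<String, String> replacements,
                                        Set<String> discardWords) {
        List<List<String>> chunks = new ArrayList<>();
        // Rule: lower-case all the input.
        String lowered = phrase.toLowerCase(Locale.ROOT);
        // Rule: split into chunks on the delimiter (quoted so it is taken literally).
        for (String chunk : lowered.split(Pattern.quote(delimiter))) {
            List<String> words = new ArrayList<>();
            for (String word : chunk.trim().split("\\s+")) {
                // Rule: replace a word with another one, if a replacement is registered.
                String replaced = replacements.getOrDefault(word, word);
                // Rules: skip empty or too-short words and explicitly discarded words.
                if (replaced.isEmpty()
                        || replaced.length() < minWordLength
                        || discardWords.contains(replaced)) {
                    continue;
                }
                words.add(replaced);
            }
            chunks.add(words);
        }
        return chunks;
    }
}
```

With the delimiter "/", a minimum word length of 3 and "and" as a discarded word, the input "Soccer and Play/Shoes" yields two chunks: one with the words soccer and play, one with the word shoes.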

Each of these objects belongs to the CharacterSequence family and has three main fields that tell where it comes from, what place it had in its context, and what it contains:

parent: the element from which it was extracted
sort index: its position inside its root context
sequence: its content as a string

These fields can be important in a comparison between two CharacterSequence.
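To make the three fields concrete, here is a minimal sketch of such a hierarchy. The class `SequenceNode` and its `words` helper are hypothetical names, not the library's CharacterSequence types, and the sketch assumes that the sort index counts positions inside the direct parent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical model for illustration; NOT the library's CharacterSequence classes.
final class SequenceNode {
    final SequenceNode parent; // element this one was extracted from (null for the phrase)
    final int sortIndex;       // position inside the parent (assumption of this sketch)
    final String sequence;     // textual content

    SequenceNode(SequenceNode parent, int sortIndex, String sequence) {
        this.parent = parent;
        this.sortIndex = sortIndex;
        this.sequence = sequence;
    }

    /** Splits a phrase into chunks and words, returning the word-level nodes. */
    static List<SequenceNode> words(String phrase, String delimiter) {
        SequenceNode root = new SequenceNode(null, 0, phrase);
        List<SequenceNode> result = new ArrayList<>();
        String[] chunks = phrase.split(Pattern.quote(delimiter));
        for (int c = 0; c < chunks.length; c++) {
            SequenceNode chunk = new SequenceNode(root, c, chunks[c]);
            String[] words = chunks[c].trim().split("\\s+");
            for (int w = 0; w < words.length; w++) {
                result.add(new SequenceNode(chunk, w, words[w]));
            }
        }
        return result;
    }
}
```

For the phrase "soccer/player ball" split on "/", the word ball has sort index 1, its parent is the chunk "player ball" with sort index 1, and that chunk's parent is the whole phrase.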

Comparison

This process compares, in every combination, the normalized CharacterSequence objects of the same type: Phrase with Phrase, Chunk with Chunk, Word with Word. Every comparison uses the distance algorithms defined in the StringDistanceAlgorithms class, and each one produces a CharacterSequenceComparison, an object containing the two compared CharacterSequence objects and a scoring map generated by the algorithms. Through the compared elements it is also possible to access the sort index property and the parent length, to understand the place of each element in its root context and use this for a better evaluation of the result.

Each CharacterSequenceComparison is collected in a ComparingResult, which is then passed as input to the defined strategies that evaluate whether the comparison passed or not.
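The all-combinations comparison can be sketched as follows for the word level. Everything here is an assumption made for illustration: the class `PairwiseComparisonSketch`, its nested `Comparison` type and the "levenshtein" map key are hypothetical, and the Levenshtein distance is hand-written rather than taken from java-string-similarity.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; NOT the library's CharacterSequenceComparison or ComparingResult.
final class PairwiseComparisonSketch {

    /** One compared pair together with its score map (algorithm name -> score). */
    static final class Comparison {
        final String first;
        final String second;
        final Map<String, Double> scoreMap;

        Comparison(String first, String second, Map<String, Double> scoreMap) {
            this.first = first;
            this.second = second;
            this.scoreMap = scoreMap;
        }
    }

    /** Classic two-row dynamic-programming Levenshtein edit distance. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] curr = new int[b.length() + 1];
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = curr;
        }
        return prev[b.length()];
    }

    /** Compares every word of the first input with every word of the second. */
    static List<Comparison> compareAll(List<String> firstWords, List<String> secondWords) {
        List<Comparison> out = new ArrayList<>();
        for (String f : firstWords) {
            for (String s : secondWords) {
                Map<String, Double> scores = new LinkedHashMap<>();
                scores.put("levenshtein", (double) levenshtein(f, s));
                out.add(new Comparison(f, s, scores));
            }
        }
        return out;
    }
}
```

Two lists of two words each produce four comparisons. A distance of 0 means the two words are identical, which is the condition a strategy can look for when deciding whether the comparison passed.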

Compare Result Evaluation

We can define one or more compare result evaluation strategies and then relate them to each other through the logical operators OR and AND. By implementing the ComparationPassedStrategy interface we define how the ComparingResult object is evaluated in order to consider the comparison passed or not.

public class OneWordIsEnoughStrategy implements ComparationPassedStrategy {
    @Override
    public boolean isPassed(ComparingResult r) {
        // Passes as soon as any algorithm reports a score of 0 for any pair of words.
        return r.getByUnit(CharacterSequenceUnit.WORD)
                .stream()
                .flatMap(c -> c.getScoreMap().values().stream())
                .anyMatch(score -> score == 0);
    }
}
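How several strategies might be related through AND and OR can be sketched with plain predicates. This is a guess at the idea, not the library's implementation: `StrategyChainSketch` is a hypothetical class, and the sketch assumes the strategies are folded left to right with no operator precedence, the operator of each strategy joining it to the result accumulated so far.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch; NOT how Similarity.Builder is known to combine strategies.
final class StrategyChainSketch<T> {

    enum LogicalOperator { AND, OR }

    private final List<Predicate<T>> strategies = new ArrayList<>();
    private final List<LogicalOperator> operators = new ArrayList<>();

    /** Registers a strategy and the operator that joins it to the previous ones. */
    StrategyChainSketch<T> add(Predicate<T> strategy, LogicalOperator operator) {
        strategies.add(strategy);
        operators.add(operator);
        return this;
    }

    /** Folds the strategies left to right; the first strategy's operator is ignored. */
    boolean isPassed(T result) {
        if (strategies.isEmpty()) {
            return false; // nothing registered, so nothing can pass
        }
        boolean passed = strategies.get(0).test(result);
        for (int i = 1; i < strategies.size(); i++) {
            boolean next = strategies.get(i).test(result);
            passed = operators.get(i) == LogicalOperator.AND
                    ? passed && next
                    : passed || next;
        }
        return passed;
    }
}
```

With two strategies joined by OR, the result passes when either of them accepts it; joined by AND, both must accept it.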

EXAMPLE

  Similarity s = Similarity.Builder
      .newOne()
      .withFirstFactorNormalizationRules(PhraseNormalizerFactory.newOne()
          .withMinWordLength(3)
          .withApplyLowerCase(true)
          .withSplitterDelimiter("/")
          .build())
      .withSecondFactorNormalizationRules(PhraseNormalizerFactory.newOne()
          .withMinWordLength(3)
          .withApplyLowerCase(true)
          .withSplitterDelimiter("-")
          .build())
      .addPassedStrategy(new OneWordIsEnoughStrategy(), Similarity.Builder.LogicalOperator.OR)
      .build();

  boolean willbetrue = s.compare("soccer/player/ball", "soccer and play-shoes-bombastic");

  boolean willbefalse = s.compare("tennis/player/field", "rock and base-shoes-bombastic");

TODO

PhraseNormalizerFactory: add an option to apply trim
handle the case where a replacement completely replaces the entire input and therefore only whitespace remains
