The purpose of this example is to get familiar with text processing and information retrieval by:
- Calculating Term Frequency,
- Tokenizing Text into Vectors,
- Calculating Cosine Similarity,
- and the Vector (Dot) Product.
In this project, we'll:
- Read two lines of text from two files, and
- Tokenize them;
- Read a list of stop words from another file, and
- Filter them out;
- Compute the cosine similarity of the two lines of text (using frequencies), and
- Write the result into a file.
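The reading, tokenizing, and stop word filtering steps above can be sketched roughly as follows. This is a minimal sketch, not the project's actual code: the function names (`tokenize`, `load_stop_words`, `term_frequencies`) are hypothetical, and the tokenizer assumes that lowercasing and splitting on non-alphanumeric characters is good enough for the input.

```python
import re
from collections import Counter
from pathlib import Path


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    # Assumption: punctuation and whitespace are the only separators we care about.
    return re.findall(r"[a-z0-9]+", text.lower())


def load_stop_words(path):
    """Read stop words from a file and return them as a set of tokens."""
    # Assumption: the file is UTF-8 text; it is tokenized the same way as the input lines.
    return set(tokenize(Path(path).read_text(encoding="utf-8")))


def term_frequencies(text, stop_words):
    """Count how often each token occurs, skipping stop words."""
    return Counter(tok for tok in tokenize(text) if tok not in stop_words)
```

For instance, `term_frequencies("The cat sat on the mat", {"the", "on"})` counts `cat`, `sat`, and `mat` once each. The resulting `Counter` objects serve as the term-frequency vectors for the similarity step.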
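Cosine Similarity measures how similar two vectors are in terms of the angle separating them. It is calculated as the dot product of the two vectors divided by the product of their magnitudes, which gives a similarity ranging from -1 to 1, where:
- 1 means the vectors point in the same direction (an exact match),
- -1 means the vectors point in exactly opposite directions,
- 0 means the vectors are orthogonal (no match; for text, no terms in common).
Because term frequencies are never negative, the similarity of two lines of text always falls between 0 and 1.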
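As an illustration of the calculation, here is a small sketch that computes cosine similarity from two term-frequency counts. The function name `cosine_similarity` and the sample tokens are made up for this example, and returning 0.0 for an empty vector is a convention chosen here, since the angle is undefined in that case.

```python
import math
from collections import Counter


def cosine_similarity(freq_a, freq_b):
    """Cosine of the angle between two term-frequency vectors."""
    # Dot product: only terms present in both vectors contribute.
    dot = sum(count * freq_b[term] for term, count in freq_a.items() if term in freq_b)
    # Magnitude (Euclidean length) of each vector.
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat an empty vector as "no match"
    return dot / (norm_a * norm_b)


a = Counter(["cat", "sat", "mat"])
b = Counter(["cat", "sat", "hat"])
print(round(cosine_similarity(a, b), 4))  # 2 shared terms / (sqrt(3) * sqrt(3)) -> 0.6667
```

The two sample vectors share two of their three terms, so the dot product is 2 and each magnitude is the square root of 3, giving 2/3.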