Name - TCD Id
Sujit Jadhav - 19310363
/LuceneIndexing
/LuceneIndexing/data/results
/LuceneIndexing/data/trec_eval-9.0.7
/LuceneIndexing/rawdata
query_results_[SimilarityFunctionNamePassedAsArgument]
TrecEval_Result_[SimilarityFunctionNamePassedAsArgument].txt
PRGraph_[SimilarityFunctionNamePassedAsArgument].jpeg
cd to the above project path and run: mvn clean install
The above command cleans the target folder, downloads all the dependencies, and then presents interactive options.
- Financial Times Limited (1991, 1992, 1993, 1994)
- Federal Register (1994)
- Foreign Broadcast Information Service (1996)
- Los Angeles Times (1989, 1990)
Started with the StandardAnalyzer first, which gave a MAP score of 0.38. Trying various other analyzers did not improve the MAP value by much, so a custom Analyzer was built:
- Tried a StandardTokenizer with ClassicFilter first, which gave a MAP score of approximately 0.20.
- Used a ClassicTokenizer along with ClassicFilter to reach a MAP score of 0.30. The filters were applied in the following order: ASCIIFoldingFilter, LengthFilter (min length = 3, max length = 25), LowerCaseFilter, SynonymFilter, StopFilter (with a manually created stopword list), KStemFilter, and PorterStemFilter. A sketch of this chain is shown after this list.
- The MAP value was not much better when filters such as EnglishPossessiveFilter, ASCIIFoldingFilter, and WordDelimiterFilter were used.
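A minimal sketch of this analyzer chain, assuming Lucene 8.x class names and that the SynonymMap and the stopword CharArraySet are built elsewhere (the class name and constructor here are illustrative, not the project's actual ones):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.en.KStemFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.miscellaneous.LengthFilter;
import org.apache.lucene.analysis.standard.ClassicFilter;
import org.apache.lucene.analysis.standard.ClassicTokenizer;
import org.apache.lucene.analysis.synonym.SynonymFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;

public class NewsAnalyzer extends Analyzer {
    private final SynonymMap synonyms;    // built elsewhere from a synonym list
    private final CharArraySet stopWords; // manually created stopword list

    public NewsAnalyzer(SynonymMap synonyms, CharArraySet stopWords) {
        this.synonyms = synonyms;
        this.stopWords = stopWords;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new ClassicTokenizer();
        TokenStream stream = new ClassicFilter(tokenizer);
        stream = new ASCIIFoldingFilter(stream);
        stream = new LengthFilter(stream, 3, 25);        // min length = 3, max length = 25
        stream = new LowerCaseFilter(stream);
        stream = new SynonymFilter(stream, synonyms, true);
        stream = new StopFilter(stream, stopWords);
        stream = new KStemFilter(stream);
        stream = new PorterStemFilter(stream);
        return new TokenStreamComponents(tokenizer, stream);
    }
}
```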
For indexing the various news publication documents, the folder name is taken as a parameter and passed to a function that reads each news document file to be indexed.
The most important fields that were indexed are: HEADLINE/TITLE/DOCTITLE, TEXT, SUMMARY, GRAPHICS, SUPPLEMENTARY.
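A minimal sketch of how such an indexing function could look, using Lucene's IndexWriter API; the NewsIndexer class, its method names, and the combined ALL field are illustrative assumptions (the ALL field is the one queried in the searching section below):

```java
import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class NewsIndexer {

    // Open an IndexWriter over the index directory, using the custom analyzer above.
    public static IndexWriter openWriter(String indexPath, Analyzer analyzer) throws IOException {
        Directory dir = FSDirectory.open(Paths.get(indexPath));
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        return new IndexWriter(dir, config);
    }

    // Add one parsed news document; the field values come from the collection parser.
    public static void addDoc(IndexWriter writer, String docNo, String headline,
                              String text, String summary) throws IOException {
        Document doc = new Document();
        doc.add(new StringField("DOCNO", docNo, Field.Store.YES)); // identifier, not tokenized
        doc.add(new TextField("HEADLINE", headline, Field.Store.YES));
        doc.add(new TextField("TEXT", text, Field.Store.NO));
        doc.add(new TextField("SUMMARY", summary, Field.Store.NO));
        // Single combined field that the searcher queries against.
        doc.add(new TextField("ALL", headline + " " + text + " " + summary, Field.Store.NO));
        writer.addDocument(doc);
    }
}
```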
The queries were built from topics containing a topic number, title, text, and narrative. The narrative contains statements about what is relevant and what is irrelevant. Although a multi-field search could be used, the search was run against only a single field, ALL, which is created during indexing. Boosting was applied to the query parts; the relevant and irrelevant portions of the narrative have to be handled carefully when choosing boost values. A hedged sketch of such a boosted query follows.
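A sketch of how the boosted query could be assembled with Lucene's classic QueryParser over the ALL field; the class name, method, and boost values are placeholders, not the project's actual ones:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.Query;

public class TopicQueryBuilder {

    // Build one boosted query over the single ALL field from the parts of a topic.
    public static Query build(Analyzer analyzer, String title, String text, String narrative)
            throws ParseException {
        QueryParser parser = new QueryParser("ALL", analyzer);
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        builder.add(new BoostQuery(parser.parse(QueryParser.escape(title)), 4.0f),
                BooleanClause.Occur.SHOULD);
        builder.add(new BoostQuery(parser.parse(QueryParser.escape(text)), 2.0f),
                BooleanClause.Occur.SHOULD);
        // The narrative mixes relevant and irrelevant statements, so it gets a low boost here;
        // splitting it into relevant/irrelevant parts with separate boosts is the careful option.
        builder.add(new BoostQuery(parser.parse(QueryParser.escape(narrative)), 0.5f),
                BooleanClause.Occur.SHOULD);
        return builder.build();
    }
}
```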
The following Similarities have been used:
MultiSimilarity = (BM25Similarity + LMJSimilarity)
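A minimal sketch of this combination using Lucene's similarities package; the LM Jelinek-Mercer lambda value below is a placeholder assumption, not necessarily the value used:

```java
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.search.similarities.LMJelinekMercerSimilarity;
import org.apache.lucene.search.similarities.MultiSimilarity;
import org.apache.lucene.search.similarities.Similarity;

public class Similarities {

    // MultiSimilarity sums the scores of its sub-similarities for each matching document.
    public static Similarity bm25PlusLMJ() {
        return new MultiSimilarity(new Similarity[] {
                new BM25Similarity(),
                new LMJelinekMercerSimilarity(0.7f) // lambda is a placeholder value
        });
    }

    // The same Similarity should be set on the IndexWriterConfig and on the IndexSearcher.
    public static void apply(IndexSearcher searcher) {
        searcher.setSimilarity(bm25PlusLMJ());
    }
}
```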
trec_eval runs as part of the mvn command above. To run it separately from the project path, use the command below:
data/trec_eval-9.0.7/trec_eval data/cran/cranqrel data/results/query_results_[SimilarityFunctionNamePassedAsArgument]
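For reference, the query_results file has to be in the six-column run format that trec_eval expects (topic, Q0, docno, rank, score, run tag). A minimal sketch of writing it, assuming up to 1000 hits per topic as in the summary below; the class, method, and run tag are illustrative:

```java
import java.io.PrintWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;

public class ResultWriter {

    // Search with up to 1000 hits per topic and write one TREC-format line per hit:
    // topicNo Q0 docno rank score runTag
    public static void writeResults(IndexSearcher searcher, PrintWriter out,
                                    String topicNo, Query query, String runTag) throws Exception {
        ScoreDoc[] hits = searcher.search(query, 1000).scoreDocs;
        for (int rank = 0; rank < hits.length; rank++) {
            String docNo = searcher.doc(hits[rank].doc).get("DOCNO");
            out.println(topicNo + " Q0 " + docNo + " " + (rank + 1) + " "
                    + hits[rank].score + " " + runTag);
        }
    }
}
```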
To get a proper picture of how the indexer and searcher are performing, a Recall vs. Precision graph is plotted and saved in the results folder. Along with my graph, I have also plotted the ideal Recall vs. Precision curve. Below is my best graph.
Analyzer = (ClassicTokenizer + ClassicFilter + ASCIIFoldingFilter + LengthFilter + LowerCaseFilter + SynonymFilter + StopFilter + KStemFilter + PorterStemFilter)
IndexedDocs = Indexer(Analyzer)
Query Results = Searcher(IndexedDocs + BM25 Function + Analyzer + Maximum Hits = 1000) (Query Ranking file is saved)
Recall vs. Precision graph is created.