-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathREADME.txt
377 lines (331 loc) · 13.3 KB
/
README.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
FOLDERS:
---------------------------------------------------------------------------------------------------------------------------------------------------------
CORPUS
INPUT_FOLDER
PHASE 1
PHASE 2
PHASE 3
---------------------------------------------------------------------------------------------------------------------------------------------------------
GeneratingCorpus.py :
A program for prasing and tokenizing the input documents
It takes the given 3204 raw HTML documents as input and the output is generated
in the CORPUS folder
Procedure to run:
python GeneratingCorpus.py
--------------------------
CORPUS:
The 3204 output parsed text documents generated for GeneratingCorpus.py
---------------------------------------------------------------------------------------------------------------------------------------------------------
INPUT_FOLDER:
Consists of all the documents given as part of the question.
--------------------------
INPUT_FOLDER -> cacm:
Consists of the 3204 raw HTML input documents
---------------------------------------------------------------------------------------------------------------------------------------------------------
PHASE 1 -> TASK 1 -> BM25:
Consists of all the documents relating to BM25 Baseline run
--------------------------
CORPUS:
The 3204 input parsed documents for BM25 scoring
--------------------------
BM25.py:
The program is used for ranking documents based on BM25 algorithm
Procedure to run:
python BM25.py
--------------------------
BM25_SCORE_LIST.txt:
The output of ranked documents based on BM25
--------------------------
cacm.rel.txt :
The set of relevance information
--------------------------
query.txt:
Contains the list of queries submitted
--------------------------------------------------------------------------------
PHASE 1 -> TASK 1 -> Lucene:
Consists of all the documents relating to Lucene Baseline run
--------------------------
CORPUS:
The 3204 input parsed documents for Lucene scoring
--------------------------
JAR FILES:
Consists of the JAR files to be imported for Lucene (Version- 4.7.2)
--------------------------
LUCENE_SCORE_LIST.txt:
The output file generated for Lucene ranking
--------------------------
LuceneImplementation.java
The java code implementing the ranking of documents based on Lucene
Procedure to run:
javac HW4.java
java HW4
When prompted;
Enter the FULL path where the index will be created: (e.g. /Usr/index or c:\temp\index)
**Enter the path where the output file should be located**
When prompted;
Enter the FULL path to add into the index (q=quit): (e.g. /home/mydir/docs or c:\Users\mydir\docs)
[Acceptable file types: .xml, .html, .html, .txt]
** Enter the path of parsed_docs folder **
** Enter "q" to quit **
--------------------------
query.txt:
Contains the list of queries submitted
--------------------------------------------------------------------------------
PHASE 1 -> TASK 1 -> SMOOTHED_QUERY_LIKELIHOOD:
Consists of all the documents relating to Smoothed Query Likelihood Baseline run
--------------------------
CORPUS:
The 3204 input parsed documents for SmoothedQueryLikelihood scoring
--------------------------
query.txt:
Contains the list of queries submitted
--------------------------
SMOOTHED_MODEL_SCORE_LIST.txt
The output file generated for Smoothed query likelihood ranking
--------------------------
SmoothedQueryLikelihood.py:
The python program to rank the documents based on Smoothed query likelihood model
Procedure to run:
python SmoothedQueryLikelihood.py
--------------------------------------------------------------------------------
PHASE 1 -> TASK 1 -> TF-IDF:
Consists of all the documents relating to TF-IDF Baseline run
--------------------------
CORPUS:
The 3204 input parsed documents for TF-IDF scoring
--------------------------
query.txt:
Contains the list of queries submitted
--------------------------
TF_IDF_SCORE_LIST.txt :
The output file generated of ranked documents based on TF_IDF mechanism
--------------------------
TF-IDF.py:
The python programm for ranking documents based on TF-IDF algorithm
Procedure to run:
python TF-IDF.py
--------------------------------------------------------------------------------
PHASE 1 -> TASK 2 -> PSEUDO_RELEVANCE_FEEDBACK:
Consists of all the documents relating to Pseudo relevance feedback run
--------------------------
CORPUS:
The 3204 input parsed documents for PseudoRelevanceFeedback
--------------------------
query.txt:
Contains the list of queries submitted
--------------------------
cacm.rel.txt:
Consists of the relevance information
--------------------------
PSEUDO_RELEVANCE_BM25_SCORE_LIST.txt:
Consists of the output of the pesudo relevance ranking algorithm
--------------------------
PseudoRelevanceFeedbackBM25.py:
The python program for pesudo relevance feedback algorithm
Procedure to run:
python PseudoRelevanceFeedbackBM25.py
--------------------------------------------------------------------------------
PHASE 1 -> TASK 3 -> TASK 3A:
Consists of all the files relating to the 3 basseline runs for Stopped corpus
--------------------------
CORPUS:
The 3204 input parsed documents
--------------------------
STOP_CORPUS:
The 3204 documents generated after stopping operation
--------------------------
common_words.txt
The list of given stop words
--------------------------
common_words_parse.py:
The python program for removing the stopped words from the parsed corpus
Procedure to run:
python common_words_parse.py
--------------------------
BM25.py:
The python program for BM25 algorithm for ranking the documents for which Stopping
operation had been performed.
Procedure to run:
As mentioned before
--------------------------
BM25_SCORE_STOPPED_LIST.txt:
The BM25 ouput file generated for stopped corpus
--------------------------
cacm.rel.txt:
The relevance information provided
--------------------------
LUCENE_SCORE_STOPPED_LIST.txt:
The ouput ranked documents generated for Stopped words using Lucene
--------------------------
LuceneImplementation.java:
The Lucene program for ranking the stopped list of documents using Lucene
Procedure to run:
As mentioned before
--------------------------
TF_IDF_SCORE_STOPPED_LIST.txt:
The output file generated for stopped list of documents using TF-IDF algorithm
--------------------------
TF-IDF.py:
The python program for ranking documents using TF-IDF
Procedure to run:
As mentioned before
--------------------------------------------------------------------------------
PHASE 1 -> TASK 3 -> TASK 3B:
Consists of all the files relating to the 3 basseline runs for stemmed corpus
--------------------------
CORPUS:
The 3204 input parsed documents
--------------------------
cacm.rel.txt:
The relevance information provided
--------------------------
cacm_stem.query.txt:
The set of stemmed query provided
--------------------------
cacm_stem.txt:
It consists of the raw unparsed stemmed documents
--------------------------
GeneratingCorpusUsingStemming.py:
The python program for parsing the documents using the cacm_stem.txt
--------------------------
BM25_SCORE_STEMMED_LIST.txt:
The outpur file generated after BM25 ranking on stemmed corpus
--------------------------
BM25ForStemming.py:
The program for ranking the documents based on BM25 algorithm for stemmed corpus
Procedure to run:
As mentioned before
--------------------------
LUCENE_SCORE_STEMMED_LIST.txt:
The output file generated for stmmed corpus using Lucene
--------------------------
LuceneImplementationStemmedList.java
The program for ranking stemmed documents using Lucene
Procedure to run:
As mentioned before
--------------------------
TF_IDF_SCORE_STEMMED_LIST.txt:
The outpur file generated for ranked documents using TF-IDF
--------------------------
TF-IDF-ForStemming.py:
The program for ranking stemmed documents using TF-IDF
Procedure to run:
As mentioned before
---------------------------------------------------------------------------------------------------------------------------------------------------------
PHASE 2 -> SNIPPET_GENERATION :
Consists of the documents relating to implementation of Snippet generation
--------------------------
cacm:
The set of 3204 unprocessed documents
--------------------------
lucene_score -> LUCENE_SCORE_LIST.txt:
The ranked documents generated from Lucene program
--------------------------
cacm.query:
The set of queries submitted
--------------------------
SnippetGeneration.py:
The program for generated snippet based from Lucene output of ranked documents
Procedure to run:
python SnippetGeneration.py
Requirements:
import colorama (for Windows and run in command prompt)
Output generated:
The output is generated in command prompt and the query terms in documents
are highlighted
--------------------------
unprocessed_query.txt:
The set of queries generated just removing the tags
---------------------------------------------------------------------------------------------------------------------------------------------------------
PHASE 3 -> SNIPPET_GENERATION :
Consists the files relating to the evaluation of eight distinct runs of 64 queries
--------------------------
cacm.rel.txt:
Consists of the relevance information
--------------------------
Evaluation.py:
Program for computing the various evaluation measure:
1- MAP
2- MRR
3- P@K, K=5 and K=20
4- Precision & Recall
--------------------------
MAP_SCORES.txt:
The ouput file consisting of the Mean average Precision scores of the documents
--------------------------
MRR_SCORES.txt:
The output file consisting of the Mean Reciprocal Rank scores of the documents
--------------------------
BM25_SCORE_LIST_FOR_EVALUATION.txt:
The ranked list of documents generated for BM25 ranking algorithm
--------------------------
BM25_SCORE_LIST_FOR_EVALUATION_P5_SCORE.txt:
The precision score generated using BM25 at K=5
--------------------------
BM25_SCORE_LIST_FOR_EVALUATION_P20_SCORE.txt:
The precision score generated using BM25 at K=20
--------------------------
BM25_SCORE_LIST_FOR_EVALUATION_PRECISION-RECALL.txt:
The output generated contains the Precision and Recall of documents ranked using BM25
--------------------------
BM25_SCORE_STOPPED_LIST_FOR_EVALUATION.txt:
The ranked list of stopped documents generated for BM25 ranking algorithm
--------------------------
BM25_SCORE_STOPPED_LIST_FOR_EVALUATION_P5_SCORE.txt:
The precision score generated using BM25 for stopped corpus at K=5
--------------------------
BM25_SCORE_STOPPED_LIST_FOR_EVALUATION_P20_SCORE.txt:
The precision score generated using BM25 for stopped corpus at K=20
--------------------------
BM25_SCORE_STOPPED_LIST_FOR_EVALUATION_PRECISION-RECALL.txt:
The output generated for stopped corpus contains the Precision and Recall of
documents ranked using BM25
--------------------------
LUCENE_SCORE_LIST_FOR_EVALUATION.txt:
The ranked list of documents generated using Lucene
--------------------------
LUCENE_SCORE_LIST_FOR_EVALUATION_P5_SCORE.txt:
The precision score generated using Lucene at K=5
--------------------------
LUCENE_SCORE_LIST_FOR_EVALUATION_P20_SCORE.txt:
The precision score generated using Lucene at K=20
--------------------------
LUCENE_SCORE_LIST_FOR_EVALUATION_PRECISION-RECALL.txt
The precision and recall scores generated for Lucene
--------------------------
LUCENE_SCORE_STOPPED_LIST_FOR_EVALUATION.txt:
The ranked list of stopped documents dgenerated using Lucene
--------------------------
LUCENE_SCORE_STOPPED_LIST_FOR_EVALUATION_P5_SCORE.txt:
The precision score for stopped corpus using Lucene at K=5
--------------------------
LUCENE_SCORE_STOPPED_LIST_FOR_EVALUATION_P20_SCORE.txt:
The precision score for stopped corpus using Lucene at K=20
--------------------------
LUCENE_SCORE_STOPPED_LIST_FOR_EVALUATION_PRECISION-RECALL.txt:
The precision and recall scores generated for stopped corpus using Lucene
--------------------------
PSEUDO_RELEVANCE_BM25_SCORE_LIST_FOR_EVALUATION.txt
PSEUDO_RELEVANCE_BM25_SCORE_LIST_FOR_EVALUATION_P5_SCORE.txt
PSEUDO_RELEVANCE_BM25_SCORE_LIST_FOR_EVALUATION_P20_SCORE.txt
PSEUDO_RELEVANCE_BM25_SCORE_LIST_FOR_EVALUATION_PRECISION-RECALL.txt
The evaluation measures generated for Peseudo relevance algorithm
--------------------------
SMOOTHED_MODEL_SCORE_LIST_FOR_EVALUATION.txt
SMOOTHED_MODEL_SCORE_LIST_FOR_EVALUATION_P5_SCORE.txt
SMOOTHED_MODEL_SCORE_LIST_FOR_EVALUATION_P20_SCORE.txt
SMOOTHED_MODEL_SCORE_LIST_FOR_EVALUATION_PRECISION-RECALL.txt
The evaluation measures generated for Smoothed query likelihood model
--------------------------
TF_IDF_SCORE_LIST_FOR_EVALUATION.txt
TF_IDF_SCORE_LIST_FOR_EVALUATION_P5_SCORE.txt
TF_IDF_SCORE_LIST_FOR_EVALUATION_P20_SCORE.txt
TF_IDF_SCORE_LIST_FOR_EVALUATION_PRECISION-RECALL.txt
The evaluation measures generated for TF-IDF
--------------------------
TF_IDF_SCORE_STOPPED_LIST_FOR_EVALUATION.txt
TF_IDF_SCORE_STOPPED_LIST_FOR_EVALUATION_P5_SCORE.txt
TF_IDF_SCORE_STOPPED_LIST_FOR_EVALUATION_P20_SCORE.txt
TF_IDF_SCORE_STOPPED_LIST_FOR_EVALUATION_PRECISION-RECALL.txt
The evaluation measures generated for TF-IDF on stopped corpus
---------------------------------------------------------------------------------------------------------------------------------------------------------