# Anserini: Regressions for MS MARCO Document Ranking
This page documents regression experiments for the [MS MARCO document ranking task](https://github.com/microsoft/MSMARCO-Document-Ranking), which is integrated into Anserini's regression testing framework.
Note that there are four different regression conditions for this task, and this page describes the following:
+ **Indexing Condition:** each MS MARCO document is first segmented into passages, and each passage is then treated as a unit of indexing
+ **Expansion Condition:** doc2query-T5
In the passage indexing condition, we select the score of the highest-scoring passage from a document as the score for that document to produce a document ranking; this is known as the MaxP technique.
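
As a rough illustration of MaxP, the sketch below collapses a list of passage hits into a document ranking by keeping each document's best passage score. It is not Anserini's implementation: the function name `maxp`, the sample hits, and the assumption that passage ids look like `<docid>#<segment>` are all hypothetical and chosen only for illustration.

```python
def maxp(passage_hits, delimiter="#", k=1000):
    """Collapse passage hits into a document ranking (MaxP).

    passage_hits: iterable of (passage_id, score) pairs. Passage ids are
    ASSUMED to look like "<docid><delimiter><segment>", e.g. "D1#2".
    Returns the top-k (docid, score) pairs, highest score first.
    """
    best = {}
    for passage_id, score in passage_hits:
        # Everything before the first delimiter is taken as the document id.
        docid = passage_id.split(delimiter, 1)[0]
        # Keep only the highest passage score seen for each document.
        if docid not in best or score > best[docid]:
            best[docid] = score
    # Sort by descending score; break ties by docid for determinism.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Hypothetical passage hits for a single query.
hits = [("D1#0", 7.2), ("D2#3", 9.1), ("D1#2", 8.4), ("D3#0", 5.0), ("D2#0", 6.3)]
doc_ranking = maxp(hits, k=2)
print(doc_ranking)  # [('D2', 9.1), ('D1', 8.4)]
```

Because several passages can map to the same document, more passage hits than the desired number of documents have to be retrieved before collapsing.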
All four conditions are described in detail [here](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini), in the context of doc2query-T5.
The exact configurations for these regressions are stored in [this YAML file](../src/main/resources/regression/msmarco-doc-docTTTTTquery-per-passage.yaml).
Note that this page is automatically generated from [this template](../src/main/resources/docgen/templates/msmarco-doc-docTTTTTquery-per-passage.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead.
## Indexing
Typical indexing command:
```
${index_cmds}
```
The directory `/path/to/msmarco-doc-docTTTTTquery-per-passage/` should contain the expanded document collection; see [this link](https://github.com/castorini/docTTTTTquery#reproducing-ms-marco-document-ranking-results-with-anserini) for how to prepare this collection.
For additional details, see the explanation of [common indexing options](common-indexing-options.md).
## Retrieval
Topics and qrels are stored in [`src/main/resources/topics-and-qrels/`](../src/main/resources/topics-and-qrels/).
The regression experiments here evaluate on the 5193 dev set questions.
After indexing has completed, you should be able to perform retrieval as follows:
```
${ranking_cmds}
```
Evaluation can be performed using `trec_eval`:
```
${eval_cmds}
```
## Effectiveness
With the above commands, you should be able to reproduce the following results:
${effectiveness}
Explanation of settings:
+ The setting "default" refers to the default BM25 settings of `k1=0.9`, `b=0.4`.
+ The setting "tuned" refers to `k1=2.56`, `b=0.59`, tuned to optimize for recall@100 (i.e., for first-stage retrieval) on 2019/12.
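
To give a feel for what `k1` and `b` control, here is a minimal sketch of the textbook BM25 term weight. It is an assumption-laden illustration rather than Lucene's exact scoring code (which differs in details such as document length encoding), and the function name and example statistics are hypothetical.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, num_docs, doc_freq, k1=0.9, b=0.4):
    """Textbook BM25 weight of a single query term in a single document.

    k1 controls how quickly repeated occurrences of a term saturate;
    b controls how strongly scores are normalized by document length.
    """
    idf = math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + length_norm)


# Hypothetical statistics: a passage twice as long as average, term seen 3 times.
stats = dict(tf=3, doc_len=200, avg_doc_len=100, num_docs=1_000_000, doc_freq=1_000)
default_score = bm25_term_score(**stats, k1=0.9, b=0.4)
tuned_score = bm25_term_score(**stats, k1=2.56, b=0.59)
print(f"default: {default_score:.4f}  tuned: {tuned_score:.4f}")
```

A larger `k1` lets additional term occurrences keep contributing for longer, while a larger `b` penalizes longer-than-average passages more heavily.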
In these runs, we are retrieving the top 1000 hits for each query and using `trec_eval` to evaluate all 1000 hits.
Since we're in the passage condition, we fetch the top 10000 passages and select the top 1000 documents using MaxP.
This lets us measure R@100 and R@1000; the latter is particularly important when these runs are used as first-stage retrieval.
Beware that an official MS MARCO document ranking task leaderboard submission comprises only 100 hits per query.
See [this page](experiments-msmarco-doc-leaderboard.md) for details on Anserini baseline runs that were submitted to the official leaderboard.
The MaxP passage retrieval functionality is only available in `SearchCollection`; we use a simple script to convert the output back into the MS MARCO format for evaluation.
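
The format conversion itself is simple. The sketch below is not the code of the conversion script; it assumes the usual six-column TREC run format (`qid Q0 docid rank score tag`) and a tab-separated `qid`, `docid`, `rank` layout for MS MARCO runs, and the function name and sample lines are hypothetical.

```python
def trec_to_msmarco(trec_lines):
    """Convert TREC run lines into MS MARCO run lines.

    ASSUMED input format:  "qid Q0 docid rank score tag" (whitespace-separated)
    ASSUMED output format: "qid<TAB>docid<TAB>rank"
    """
    converted = []
    for line in trec_lines:
        parts = line.split()
        if len(parts) != 6:
            # Skip blank or malformed lines rather than guessing their meaning.
            continue
        qid, _q0, docid, rank, _score, _tag = parts
        converted.append(f"{qid}\t{docid}\t{rank}")
    return converted


# Hypothetical TREC-format lines for one query.
sample = [
    "174249 Q0 D3126539 1 14.2 Anserini",
    "174249 Q0 D978773 2 13.9 Anserini",
]
msmarco_lines = trec_to_msmarco(sample)
print("\n".join(msmarco_lines))
```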
To generate an MS MARCO submission with the BM25 tuned parameters, corresponding to "BM25 (tuned)" above:
```bash
$ target/appassembler/bin/SearchCollection -topicreader TsvString \
-topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \
-index indexes/lucene-index.msmarco-doc-docTTTTTquery-per-passage.pos+docvectors+raw \
-output runs/run.msmarco-doc-docTTTTTquery-per-passage.bm25-tuned.trec \
-bm25 -bm25.k1 2.56 -bm25.b 0.59 -hits 1000 \
-selectMaxPassage -selectMaxPassage.delimiter "#" -selectMaxPassage.hits 100
$ python tools/scripts/msmarco/convert_trec_to_msmarco_run.py \
--input runs/run.msmarco-doc-docTTTTTquery-per-passage.bm25-tuned.trec \
--output runs/run.msmarco-doc-docTTTTTquery-per-passage.bm25-tuned.txt
$ python tools/scripts/msmarco/msmarco_doc_eval.py \
--judgments src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt \
--run runs/run.msmarco-doc-docTTTTTquery-per-passage.bm25-tuned.txt
#####################
MRR @100: 0.32081861579183746
QueriesRanked: 5193
#####################
```
This run corresponds to the MS MARCO document ranking leaderboard entry "Anserini's BM25 + doc2query-T5 expansion (per passage), parameters tuned for recall@100 (k1=2.56, b=0.59)" dated 2020/12/11, and is reported in the Lin et al. (SIGIR 2021) Pyserini paper.
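
For readers unfamiliar with the metric, the following is a minimal sketch of MRR@k, the mean reciprocal rank of the first relevant document within the top k hits. It is not the evaluation script's code: the function name and the toy data are hypothetical, and averaging over the queries that appear in the judgments is an assumption of this sketch.

```python
def mrr_at_k(run, qrels, k=100):
    """Mean reciprocal rank of the first relevant document in the top k.

    run:   dict mapping qid -> list of docids in rank order
    qrels: dict mapping qid -> set of relevant docids
    Averages over the queries in qrels (an ASSUMPTION of this sketch);
    queries with no relevant document in the top k contribute 0.
    """
    if not qrels:
        return 0.0
    total = 0.0
    for qid, relevant in qrels.items():
        ranking = run.get(qid, [])
        for rank, docid in enumerate(ranking[:k], start=1):
            if docid in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(qrels)


# Toy data: reciprocal ranks are 1/2, 1, and 0, so the mean is 0.5.
qrels = {"q1": {"D2"}, "q2": {"D9"}, "q3": {"D5"}}
run = {"q1": ["D1", "D2"], "q2": ["D9"], "q3": ["D7"]}
score = mrr_at_k(run, qrels, k=100)
print(score)  # 0.5
```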