Skip to content

Latest commit

 

History

History
124 lines (87 loc) · 6.27 KB

README.md

File metadata and controls

124 lines (87 loc) · 6.27 KB

OSIRRC Docker Image for OldDog

Docker Build Status DOI

Chris Kamphuis and Arjen de Vries

This is the docker image for the OldDog project (based on work by Mühleisen et al.) conforming to the OSIRRC jig for the Open-Source IR Replicability Challenge (OSIRRC) at SIGIR 2019. This image is available on Docker Hub has been tested with the jig at commit e51fe7a (06/25/2019).

  • Supported test collections: robust04, core18
  • Supported hooks: init, index, search, interact

Quick Start

The following jig command can be used to index TREC disks 4/5 for robust04:

python run.py prepare \
  --repo osirrc2019/olddog \
  --tag v1.0.0 \
  --collections robust04=/path/to/disk45=trectext

The following jig command can be used to perform a retrieval run on the collection with the robust04 test collection.

python run.py search \
  --repo osirrc2019/olddog \
  --tag v1.0.0 \
  --output $(pwd)/out \
  --qrels qrels/qrels.robust04.txt \
  --topic topics/topics.robust04.txt \
  --collection robust04 \
  --opts out_file_name="run.bm25.robust04"

The --opts argument can be extended by adding mode='disjunctive' for disjunctive query processing.

The following jig command can be used to start an interactive session:

python run.py interact \
  --repo osirrc2019/olddog \
  --tag v1.0.0 \

Retrieval Methods

The OldDog image supports the following retrieval models:

  • BM25 (optionally conjunctive variant): k1=1.2, b=0.75 (Robertson et al., 1995)

Expected Results

The following results should be able to be re-produced using the jig search command.

robust04

MAP conjunctive BM25 disjunctive BM25
TREC 2004 Robust Track Topics 0.1736 0.2434
P@30 conjunctive BM25 disjunctive BM25
TREC 2004 Robust Track Topics 0.2526 0.2985

core18

MAP conjunctive BM25 disjunctive BM25
TREC 2018 Common Core Track Topics 0.1802 0.2381
P@30 conjunctive BM25 disjunctive BM25
TREC 2018 Common Core Track Topics 0.3167 0.3313

Implementation

The following is a quick breakdown of what happens in each of the scripts in the repo.

Dockerfile

The Dockerfile installs dependencies (python3, monetdb, etc.), copies scripts to the root dir, and sets the working dir to /work

init

The init script is a bash script (via the #!/bin/bash she-bang) that invokes wget to download an anserini JAR from Maven Central. Then it downloads the OldDog project from github, which then is build using maven.

index

The index Python script (via the #!/usr/bin/python3 she-bang) reads a JSON string (see here) containing at least one collection to index (including the name, path, and format). The collection is indexed using Anserini (Yang et al., 2017) and placed in a directory, with the same name as the collection, in the working dir (i.e., /work/robust04). After the Lucene index has been created, the OldDog software uses this index to creates csv files from it that can be loaded in the monetdb (Boncz, 2002) column store. A monetDB databse is created and the csv-files are loaded into the database. It is possible to index muliple collection using one container. This is followed by removing the Lucene index so commiting the image takes less time. At this point, jig takes a snapshot and the indexed collections are persisted for the search hook.

search

The search script reads a JSON string (see here) containing the collection name (to map back to the index directory from the index hook) and topic path, among other options. The retrieval run is performed and output is placed in /output for the jig to evaluate using trec_eval.

interact

The interact script starts the monetdb deamon. After this deamon has been started it is possible to open mclient to issue SQL queries. The following command can be used to open mclient:

docker exec -it $(docker ps -aql) mclient -d robust04

Notes

  • re:v0.1.0 We can not guarantee that version v0.1.0 still works. This version cloned the OldDog github repository. New versions download a released version and should keep working.

References

  • Hannes Mühleisen, Thaer Samar, Jimmy Lin, Arjen de Vries (2014) Old Dogs Are Great at New Tricks: Column Stores for IR Prototyping. SIGIR
  • Stephen E. Robertson, Steve Walker, Micheline Hancock-Beaulieu, Mike Gatford, and A. Payne. (1995) Okapi at TREC-4. TREC
  • Peilin Yang, Hui Fang, and Jimmy Lin (2017) Anserini: Enabling the Use of Lucene for Information Retrieval Research. SIGIR
  • Peter Boncz (2002) Monet: A Next-Generation DBMS Kernel For Query-Intensive Applications. PhD Thesis

Reviews