irc_bm25_altmetric:
This run submission combines a BM25 baseline with altmetrics. The baseline run is retrieved with the default ranker of Elasticsearch/Lucene (BM25) and queries built from the contents of the `<query>`, `<question>`, and `<narrative>` tags. We rerank the baseline by adding the logarithmized Altmetric Attention Score.
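The additive log-score reranking can be sketched as follows; `baseline` and `attention_scores` are illustrative names, and the exact combination used in the run may differ:

```python
import math

def rerank_with_altmetrics(baseline, attention_scores):
    """Add the logarithmized Altmetric Attention Score to each BM25 score.

    `baseline` maps doc ids to BM25 scores; `attention_scores` maps doc ids
    to Altmetric Attention Scores (0 if unknown). Both names are illustrative.
    """
    combined = {
        doc_id: bm25 + math.log(1 + attention_scores.get(doc_id, 0))
        for doc_id, bm25 in baseline.items()
    }
    # Sort descending by combined score to obtain the reranked run.
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
```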
irc_logreg_tfidf:
This run submission combines a BM25 baseline with a logistic-regression reranker trained on tf-idf features in combination with relevance judgments from the first round. The baseline run is retrieved with the default ranker of Elasticsearch/Lucene (BM25) and queries built from the contents of the `<query>`, `<question>`, and `<narrative>` tags. Documents are reranked for the topics with available relevance judgments (1-30); for the remaining topics (31-35) the baseline ranking stays unaltered.
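A minimal sketch of such a reranker, assuming scikit-learn and made-up round-1 judgments (the real training data comes from the TREC-COVID qrels, and the actual feature setup may differ):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Illustrative round-1 judgments: (document text, relevance label).
judged = [
    ("coronavirus transmission in households", 1),
    ("influenza vaccine efficacy in adults", 0),
    ("covid-19 spread between family members", 1),
    ("crop yield under drought conditions", 0),
]
texts, labels = zip(*judged)

# Train a logistic regression on tf-idf features of the judged documents.
vectorizer = TfidfVectorizer()
clf = LogisticRegression().fit(vectorizer.fit_transform(texts), labels)

# Rerank baseline candidates by predicted probability of relevance.
candidates = ["virus transmission among family members", "drought and crop yield"]
probs = clf.predict_proba(vectorizer.transform(candidates))[:, 1]
ranking = sorted(zip(candidates, probs), key=lambda kv: kv[1], reverse=True)
```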
As part of TREC-COVID, we submit automatic runs based on (pseudo) relevance feedback in combination with a reranking approach.
The reranker is trained on relevance feedback data that is retrieved from PubMed/PubMed Central (PMC).
The training data is retrieved with queries using the contents of the `<query>` tags only.
For each topic a new reranker is trained. We consider the documents retrieved by the specific topic query as relevant training data, and the documents of the other 29 topics as non-relevant training data. Given a baseline run, the trained system reranks its documents.
The baseline run is retrieved with the default ranker of Elasticsearch/Lucene (BM25) and queries using the contents of the `<query>` tags only.
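The per-topic training split described above can be sketched as follows (names are illustrative):

```python
def training_data_for_topic(topic_id, feedback):
    """Build the per-topic training sets from pseudo relevance feedback.

    `feedback` maps topic ids to the documents retrieved for that topic's
    <query> contents. Documents retrieved for the topic itself are treated
    as relevant, documents retrieved for the other topics as non-relevant.
    """
    relevant = list(feedback[topic_id])
    non_relevant = [
        doc
        for other_topic, docs in feedback.items()
        if other_topic != topic_id
        for doc in docs
    ]
    return relevant, non_relevant
```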
For our reranker we use GloVe embeddings in combination with the Deep Relevance Matching Model (DRMM).
Our three run submissions differ by the way training data is retrieved from PubMed/PMC.
irc_entrez:
The first run is trained on titles and abstracts retrieved from the Entrez Programming Utilities API with "type=relevance".
irc_pubmed:
The second run is trained on titles and abstracts retrieved from PubMed's search interface with "best match". We scrape the PMIDs and retrieve the titles and abstracts afterwards.
irc_pmc:
The third run is trained on full-text documents retrieved from PMC.
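For illustration, a request URL for the E-utilities esearch endpoint might be built as below. The parameter names follow the public E-utilities documentation (relevance sorting is `sort=relevance` there); the exact parameters used by the runs may differ:

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(term, retmax=100):
    """Build an esearch request URL for a topic's <query> contents."""
    params = {
        "db": "pubmed",
        "term": term,
        "sort": "relevance",   # relevance ranking, per the E-utilities docs
        "retmode": "json",
        "retmax": retmax,
    }
    return ESEARCH + "?" + urlencode(params)
```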
Our retrieval pipeline relies on the following dependencies:

- docker
- elasticsearch
- requests
- beautifulsoup
- matchzoo
- Install docker. When running on SciComp (Ubuntu VM):
  `sudo usermod -aG docker $USER`
- Make a virtual environment and activate it:
  `python3 -m venv venv`
  `source venv/bin/activate`
- Install the requirements:
  `pip3 install -r requirements.txt`
- Install the nltk data:
  `python3 -m nltk.downloader punkt`
- Download the data from Semantic Scholar, extract it, and place it in `./data/`:
  `./scripts/getDataSets.sh`
- Fetch data for the 30 topics from PubMed (written to the `artifact` directory with a timestamp):
  `python3 scripts/fetchPubmedData.py`
- Convert the embeddings from bin to txt:
  `python3 scripts/convert_word2vec.py`
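The text format this step produces can be illustrated with a small writer. The actual script may instead rely on a library such as gensim (`KeyedVectors.load_word2vec_format(..., binary=True)` followed by `save_word2vec_format(..., binary=False)`); the function below is only a sketch of the output format:

```python
def write_word2vec_txt(vectors, path):
    """Write embeddings in the plain-text word2vec format: a header line
    "<vocab_size> <dim>" followed by one "<word> <v1> <v2> ..." line
    per word. `vectors` maps words to equal-length float lists."""
    dim = len(next(iter(vectors.values())))
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"{len(vectors)} {dim}\n")
        for word, vec in vectors.items():
            f.write(word + " " + " ".join(f"{x:.6f}" for x in vec) + "\n")
```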
- Optional: Adapt the settings in `config.py`
- Download the image and run the Elasticsearch container:
  `python3 scripts/docker-run.py`
- Index the data:
  `python3 scripts/index.py`
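When `BULK` is enabled, indexing can go through `elasticsearch.helpers.bulk`. A sketch of the action format, assuming an illustrative `cord_uid` id field (the field names used by `scripts/index.py` may differ):

```python
def bulk_actions(index_name, documents):
    """Yield one bulk action per document, in the format expected by
    elasticsearch.helpers.bulk."""
    for doc in documents:
        yield {
            "_index": index_name,   # target index, e.g. a key of DOCS
            "_id": doc["cord_uid"], # document id (illustrative field name)
            "_source": doc,         # the document body itself
        }
```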
- Write the baseline run file:
  `python3 scripts/base.py`
- Optional: Delete the docker container and remove the image:
  `python3 scripts/docker-rm.py`
- Train a model for each of the 30 topics and save the models to `./artifact/model/<model-type>`:
  `python3 scripts/train.py`
- Rerank the baseline ranking:
  `python3 scripts/rerank.py`
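One common way to combine the two scores, with `weight` playing the role of `RERANK_WEIGHT` (the exact formula in `scripts/rerank.py` may differ):

```python
def interpolate(baseline_score, reranker_score, weight=0.5):
    """Linearly interpolate the baseline and reranker scores.

    weight=0 keeps the baseline ranking, weight=1 uses only the
    reranker; 0.5 matches the RERANK_WEIGHT default.
    """
    return (1 - weight) * baseline_score + weight * reranker_score
```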
Settings in `config.py`:

param | comment |
---|---|
DOCS | dictionary with index names as keys and paths to data as values |
BULK | if set to `True`, data is indexed in bulk |
SINGLE_IDX | if not `None`, all data is indexed into a single index |
TOPIC | path to topic file |
BASELINE | name of the baseline run |
DATA | path to directory with subsets |
META | path to metadata.csv |
VALID_ID | path to xml file with valid doc ids |
ESEARCH | pubmed eutils api to retrieve pmids given a query term |
EFETCH | pubmed eutils to retrieve document data given one or more pmids |
RETMODE | datatype of pubmed eutils results |
PUBMED_FETCH | directory to fetched data from pubmed |
PUBMED_DUMP_DATE | specify date of pubmed data for training |
MODEL_DUMP | path to directory where model weights are stored |
MODEL_TYPE | model type; currently `dense` and `drmm` are supported |
RUN_DIR | path to the output runs |
RERANKED_RUN | name of the reranked run |
PUBMED_SCRAPE | bool; if set to `True`, PMIDs are scraped from the PubMed frontend |
PUBMED_FRONT | URL of the pubmed frontend |
RESULT_SIZE | number of results to be retrieved from PUBMED_FRONT |
RERANK_WEIGHT | weight param for reranker score. default: 0.5 |
IMAGE_TAG | |
CONTAINER_NAME | |
FULLTEXT_PMC | |
RUN_TAG | |
ESEARCH_PMC | pmc eutils api to retrieve pmcids given a query term |
EFETCH_PMC | pmc eutils to retrieve document data given one or more pmcids |
EMBEDDING | |
EMBED_DIR | |
BIOWORDVEC | |
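An illustrative `config.py` fragment; the names mirror the table above, but the values are examples rather than the repository defaults:

```python
# Example settings for config.py (values are illustrative).
TOPIC = "./data/topics.xml"        # path to topic file
BASELINE = "baseline-run.txt"      # name of the baseline run
DATA = "./data/"                   # directory with the data subsets
META = "./data/metadata.csv"       # path to metadata.csv
BULK = True                        # index data in bulk
SINGLE_IDX = None                  # None: one index per subset
MODEL_TYPE = "drmm"                # "dense" and "drmm" are supported
RUN_DIR = "./runs/"                # output directory for run files
RERANKED_RUN = "reranked-run.txt"  # name of the reranked run
RERANK_WEIGHT = 0.5                # weight of the reranker score
```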
name | link |
---|---|
comm | commercial use subset |
noncomm | non-commercial use subset |
custom | custom license subset |
biorxiv | bioRxiv/medRxiv subset |