Source code and dataset for the ACL 2023 paper Structure-Aware Language Model Pretraining Improves Dense Retrieval on Structured Data.
(1) Install the following packages with pip or conda in your environment:
transformers==4.22.2
nltk==3.7
numpy==1.23.2
datasets>=2.4.0
tree-sitter==0.0.5
faiss==1.7.4
scikit-learn>=1.1.2
pandas==1.5.0
pytrec-eval==0.5
tensorboard
(2) Install OpenMatch. Clone and install OpenMatch as a library to obtain openmatch-thunlp-0.0.1:
git clone https://github.com/OpenMatch/OpenMatch.git
cd OpenMatch
pip install .
(1) The checkpoint of the SANTA model pretrained on Python data is here.
(2) The checkpoint of the SANTA model pretrained on ESCI (large) data is here.
learning_rate=5e-5
num_train_epochs=6
train_n_passages=1
per_device_train_batch_size=16
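These are standard HuggingFace-style training hyperparameters, plus OpenMatch's train_n_passages (the number of passages paired with each query). As a rough sketch under that assumption, the standard ones would map onto transformers.TrainingArguments as follows; output_dir is a placeholder, and the actual flags are set inside the shell scripts below:

```python
from transformers import TrainingArguments

# Rough sketch only: OpenMatch's training driver builds on HuggingFace's Trainer,
# so the standard hyperparameters above map onto TrainingArguments directly.
# train_n_passages is an OpenMatch-specific data argument (passages per query),
# not a TrainingArguments field. output_dir is a placeholder path.
training_args = TrainingArguments(
    output_dir="checkpoints/santa",
    learning_rate=5e-5,
    num_train_epochs=6,
    per_device_train_batch_size=16,
)
```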
(1) CodeSearchNet
git clone https://github.com/github/CodeSearchNet.git
cd CodeSearchNet/
script/setup
(2) Adv
wget https://github.com/microsoft/CodeXGLUE/raw/main/Text-Code/NL-code-search-Adv/dataset.zip
unzip dataset.zip && rm -r dataset.zip && mv dataset AdvTest && cd AdvTest
wget https://zenodo.org/record/7857872/files/python.zip
unzip python.zip && python preprocess.py && rm -r python && rm -r *.pkl && rm python.zip
(3) CodeSearch
wget https://github.com/microsoft/CodeBERT/raw/master/GraphCodeBERT/codesearch/dataset.zip
unzip dataset.zip && rm -r dataset.zip && mv dataset CSN && cd CSN
bash run.sh
You can download the ESCI data from here: ESCI
(1) Collect pretraining code data.
The code pretraining data is sourced from the downloaded CodeSearchNet, which covers six programming languages. The collection of pretraining data follows the data statistics reported by CodeT5 and CodeRetriever. For the five programming languages other than Python, the train split from CodeSearchNet is merged to create the pretraining data ${PRETRAIN_RAW_PATH}/${pretrain_raw_data}. For Python, both the train and test splits from CodeSearchNet are merged as the pretraining data. When selecting a pretraining checkpoint, the valid split from CodeSearchNet is used as the dev set for all programming languages.
(2) Process pretraining code data.
Process the raw code pretraining data into the pretraining input format <query, positive, label>. Enter the shell folder and run the shell script:
bash process-pretrain-code.sh
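For reference, below is a minimal sketch of the kind of <query, positive, label> record this step is meant to produce, assuming CodeSearchNet's standard jsonl fields (docstring, code). The authoritative logic lives in process-pretrain-code.sh and the scripts it calls; the output keys and file names here are only illustrative.

```python
import json

# Illustrative sketch only: turn a CodeSearchNet record into a <query, positive, label>
# pretraining example. "docstring"/"code" are CodeSearchNet's field names; the output
# keys and file names are assumptions, the real format is defined by process-pretrain-code.sh.
def to_pretrain_example(record: dict) -> dict:
    return {
        "query": record["docstring"],  # natural-language description as the query
        "positive": record["code"],    # the paired code snippet as the positive
        "label": 1,                    # relevance label for the positive pair
    }

with open("python_train_0.jsonl") as fin, open("pretrain.jsonl", "w") as fout:  # example file names
    for line in fin:
        fout.write(json.dumps(to_pretrain_example(json.loads(line))) + "\n")
```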
(3) Process finetuning code data.
For the two code retrieval tasks, Adv and CodeSearch, you can process the raw training file train into the input path ${FINETUNE_RAW_PATH}/${finetune_raw_data}:
bash process-finetune-code.sh
(1) Collect pretraining product data.
To use the ESCI (large) data for pretraining, please ensure that the following two files, shopping_queries_dataset_examples.parquet and shopping_queries_dataset_products.parquet, are downloaded and available in the pretraining path ${PRETRAIN_RAW_PATH}.
(2) Process pretraining product data.
Process the raw product pretraining data into the pretraining input format <query, positive, label>, and save the dev set into ${PRETRAIN_PATH}/${pretrain_eval_data} for selecting the pretraining checkpoint:
bash process-pretrain-product.sh
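The two parquet files hold the query-product examples and the product metadata respectively, and they need to be joined on the product id before building pretraining pairs. A minimal pandas sketch of that join, with column names assumed from the public ESCI release; the authoritative preprocessing is in process-pretrain-product.sh:

```python
import pandas as pd

# Illustrative sketch: join ESCI examples with product metadata before building
# <query, positive, label> pretraining pairs. Column names are assumed from the
# public ESCI release; the official ESCI example additionally joins on product_locale.
examples = pd.read_parquet("shopping_queries_dataset_examples.parquet")
products = pd.read_parquet("shopping_queries_dataset_products.parquet")

merged = examples.merge(products, how="left", on="product_id")
print(merged[["query", "product_title", "esci_label"]].head())
```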
(3) Process finetuning product data.
For the product search task, we use the ESCI (small) data for finetuning. You can process the raw training files, shopping_queries_dataset_examples.parquet and shopping_queries_dataset_products.parquet, into the input path ${FINETUNE_PATH}/${finetune_data} for finetuning and the eval path ${FINETUNE_PATH}/${finetune_eval_data} for selecting the finetuning checkpoint:
bash process-finetune-product.sh
P.S. Some items in the ESCI dataset have no product description. When processing the product data for fine-tuning, we concatenate the product title and product description in the following format:
title: + product title + text: + product description.
If a product has no description, we concatenate it as follows: title: + product title + text:.
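A minimal sketch of that concatenation rule (the field names product_title and product_description are taken from the ESCI parquet schema):

```python
from typing import Optional

# Concatenate title and (possibly missing) description as described above:
#   with description:    "title: <product title> text: <product description>"
#   without description: "title: <product title> text:"
def build_product_text(product_title: str, product_description: Optional[str]) -> str:
    if product_description:
        return f"title: {product_title} text: {product_description}"
    return f"title: {product_title} text:"
```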
To continue pretraining CodeT5 for different programming languages, utilize the corresponding processed code pretraining data. For instance, if you want to train a Python code retrieval model, only use Python pretraining data for training.
bash pretrain-code.sh
Continue pretraining T5 on the processed product pretraining data to get a product retrieval model.
bash pretrain-product.sh
The pretrained checkpoint is fine-tuned on the downstream code retrieval tasks, Adv and CodeSearch, using the processed finetuning data. For example, if you have pretrained on Python code, you should also fine-tune on Python code.
bash finetune-code.sh
For the product retrieval task, the pretrained checkpoint is fine-tuned on ESCI (small), using the processed finetuning data. You can find more details about this task here.
bash finetune-product.sh
P.S. If you want to use hard negatives, set the parameter train_n_passages to n+1, where n is the number of hard negatives; for example, with 4 hard negatives per query, set train_n_passages=5.
The pretraining and finetuning steps above save checkpoints in their respective directories; this section selects the best checkpoint on the corresponding dev sets.
Evaluate all the saved pretraining checkpoints on the previously processed dev data and keep the best one:
bash dev_code_pretrain.sh
Evaluate all the saved finetuning checkpoints on the previously processed dev data and keep the best one:
bash dev_code_finetune.sh
Evaluate all the saved pretraining checkpoints on the previously processed dev data located at ${PRETRAIN_PATH}/${pretrain_eval_data} and keep the best one:
bash dev_product_pretrain.sh
Evaluate all the saved finetuning checkpoints on the previously processed dev data located at ${FINETUNE_PATH}/${finetune_eval_data} and keep the best one:
bash dev_product_finetune.sh
Before evaluating the code and product retrieval tasks, it is necessary to download OpenMatch.
git clone https://github.com/OpenMatch/OpenMatch.git
For the code retrieval tasks, you need to generate test data that conforms to OpenMatch's input format:
bash build-code-test.sh
Then, you need to build the Faiss index and obtain the necessary files for inference.
bash index-code.sh
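index-code.sh drives this step through OpenMatch; for intuition, here is a standalone Faiss sketch of the underlying operation (exact inner-product search over the encoded embeddings). The file names and top-k value are purely illustrative.

```python
import faiss
import numpy as np

# Illustrative only: the real indexing/retrieval is handled by index-code.sh via OpenMatch.
# Dense retrieval scores query-passage pairs by the inner product of their embeddings.
passage_embeddings = np.load("passage_embeddings.npy").astype("float32")  # hypothetical file
query_embeddings = np.load("query_embeddings.npy").astype("float32")      # hypothetical file

index = faiss.IndexFlatIP(passage_embeddings.shape[1])   # exact inner-product index
index.add(passage_embeddings)
scores, ranked_ids = index.search(query_embeddings, 100)  # top-100 passages per query
```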
Evaluate using the obtained inference files.
bash evaluate_code.sh
For the product retrieval task, you need to generate test data that conforms to OpenMatch's input format:
bash build-product-test.sh
Encode the queries and product descriptions as embeddings and save them:
bash index-product.sh
Calculate scores for the encoded embeddings and sort them to obtain two files, hypothesis.results and test.qrels, which will be used to calculate the NDCG score:
bash evaluate_product.sh
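evaluate_product.sh produces the final number; equivalently, once hypothesis.results and test.qrels exist, the NDCG can be recomputed with pytrec-eval (already listed in the requirements), assuming the two files follow the standard TREC run/qrels formats:

```python
import pytrec_eval

# Recompute NDCG from the run/qrels files produced above. Assumes standard
# TREC formats, which is what pytrec_eval's parsers expect.
with open("test.qrels") as f:
    qrels = pytrec_eval.parse_qrel(f)
with open("hypothesis.results") as f:
    run = pytrec_eval.parse_run(f)

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg"})
per_query = evaluator.evaluate(run)
mean_ndcg = sum(q["ndcg"] for q in per_query.values()) / len(per_query)
print(f"NDCG: {mean_ndcg:.4f}")
```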