This project was developed for the SoICT Legal Document Retrieval Challenge. It focuses on building an efficient and accurate system for retrieving Vietnamese legal documents based on semantic similarity. Our team achieved a Top 8 position in SoICT Hackathon 2024.
The Legal Document Retrieval Challenge addresses the problem of querying a corpus of Vietnamese legal documents, with an emphasis on semantic understanding and accurate retrieval.
The competition centers on a single task: developing an efficient retrieval system for Vietnamese legal documents.
The competition data provided by the organizers includes three sets:
- Training data: 119,456 labeled query-document pairs for model training
- Public test: 10,000 queries for model evaluation
- Private test: 50,000 queries for final evaluation
All datasets share a common repository of legal documents.
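System performance is evaluated using MRR@10 (Mean Reciprocal Rank at 10). For each query, the score is the reciprocal of the rank of the first relevant document among the top 10 results retrieved from the shared document repository, or 0 if no relevant document appears there. These scores are averaged over all queries. See the Evaluation section for details.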
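To illustrate the metric, here is a minimal sketch of MRR@10. It assumes the standard definition: the reciprocal rank of the first relevant document within the top 10, averaged over queries. The function name `mrr_at_k` and the dictionary-based input layout are hypothetical choices for this example, not the organizers' official scoring script.

```python
from typing import Dict, List, Set


def mrr_at_k(
    rankings: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int = 10,
) -> float:
    """Mean reciprocal rank of the first relevant document within the top k.

    rankings: query id -> document ids ordered from best to worst (assumed layout).
    relevant: query id -> set of document ids labeled relevant (assumed layout).
    """
    if not rankings:
        return 0.0
    total = 0.0
    for query_id, ranked_ids in rankings.items():
        gold = relevant.get(query_id, set())
        # Only the first k results count; a query with no hit contributes 0.
        for rank, doc_id in enumerate(ranked_ids[:k], start=1):
            if doc_id in gold:
                total += 1.0 / rank
                break
    return total / len(rankings)
```

A query whose first relevant document sits at rank 2 contributes 0.5, so pushing relevant documents toward the top of the list matters more than merely including them in the top 10.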
Our solution combines multiple advanced techniques:
- BM25 for initial document retrieval
- Bi-encoder for semantic encoding
- Cross-encoder for result reranking
- Data chunking for efficient text processing
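The components above can be sketched as a retrieve-then-rerank pipeline. This is an illustrative sketch, not the team's implementation: the BM25 scorer is a small pure-Python version of the Okapi formula, whitespace tokenization stands in for proper Vietnamese word segmentation, and `rerank_score` is a hypothetical hook where a neural scorer (for example a cross-encoder) would be plugged in. The exact way the bi-encoder and cross-encoder were combined is not specified here, so the sketch exposes only a single scoring hook. All names and default parameters are assumptions.

```python
import math
from collections import Counter
from typing import Callable, List


def chunk_text(text: str, max_words: int = 200, overlap: int = 50) -> List[str]:
    """Split text into overlapping word windows (sizes are assumed values)."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    chunks = []
    for start in range(0, len(words), max_words - overlap):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


class BM25:
    """Minimal Okapi BM25 over whitespace-tokenized, lowercased text."""

    def __init__(self, corpus: List[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.docs = [doc.lower().split() for doc in corpus]
        self.doc_len = [len(doc) for doc in self.docs]
        self.avgdl = sum(self.doc_len) / max(len(self.docs), 1)
        self.tf = [Counter(doc) for doc in self.docs]
        doc_freq = Counter(term for doc in self.docs for term in set(doc))
        n = len(self.docs)
        # The "+ 1" inside the log keeps every IDF value positive.
        self.idf = {
            term: math.log(1 + (n - freq + 0.5) / (freq + 0.5))
            for term, freq in doc_freq.items()
        }

    def score(self, query: str, idx: int) -> float:
        tf, dl = self.tf[idx], self.doc_len[idx]
        total = 0.0
        for term in query.lower().split():
            freq = tf.get(term, 0)
            if freq == 0:
                continue
            norm = freq + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[term] * freq * (self.k1 + 1) / norm
        return total

    def top_k(self, query: str, k: int) -> List[int]:
        scored = [(self.score(query, i), i) for i in range(len(self.docs))]
        scored.sort(key=lambda pair: -pair[0])  # stable sort keeps index order on ties
        return [i for _, i in scored[:k]]


def retrieve_and_rerank(
    query: str,
    chunks: List[str],
    rerank_score: Callable[[str, str], float],
    first_stage_k: int = 100,
    final_k: int = 10,
) -> List[int]:
    """BM25 picks candidates; rerank_score reorders them; returns chunk indices."""
    candidates = BM25(chunks).top_k(query, first_stage_k)
    reranked = sorted(
        candidates, key=lambda i: rerank_score(query, chunks[i]), reverse=True
    )
    return reranked[:final_k]


def token_overlap(query: str, passage: str) -> float:
    """Placeholder reranker (Jaccard overlap); a real system would use a model."""
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / len(q | p) if q | p else 0.0
```

The cheap lexical stage narrows the corpus to a short candidate list, so the more expensive scorer only runs on `first_stage_k` chunks per query rather than on every chunk in the repository.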
Download the BM25 model checkpoint:
pip install gdown
gdown "1VFT7UiMXgoJzGKqWxId4LORq9v0VcW7T" -O bm25/bm25_model.pkl
Access our fine-tuned BGE-M3 model:
Download the dataset:
pip install gdown
gdown --folder 1LO4wmj54lWgQvYiGKKAUSLSv5ypMjcDA
Our team achieved a Top 8 position using the combined approach of BM25, bi-encoder, and cross-encoder reranking with data chunking techniques.
Watch our project report video here: Project Report Video