Legal Document Retrieval System

Overview

This project was developed for the SoICT Legal Document Retrieval Challenge, which asks participants to build an efficient and accurate system for retrieving Vietnamese legal documents by semantic similarity. Our team placed in the Top 8 at SoICT Hackathon 2024.

Project Description

Topic

The Legal Document Retrieval Challenge addresses querying a corpus of Vietnamese legal documents, with an emphasis on semantic understanding and accurate retrieval.

Task

The competition centers on a single task: developing an efficient retrieval system for Vietnamese legal documents.

Data

The competition data provided by the organizers includes three sets:

Training data: 119,456 labeled query-document pairs for model training
Public test: 10,000 queries for model evaluation
Private test: 50,000 queries for final evaluation

All datasets share a common repository of legal documents.

Evaluation Metric

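System performance is evaluated using MRR@10 (Mean Reciprocal Rank at 10). For each query, the score is the reciprocal of the rank of the first relevant document among the top 10 results retrieved from the shared document repository, or 0 if no relevant document appears in the top 10. MRR@10 is the mean of these scores over all queries. See the Evaluation section for details.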
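
To make the metric concrete, here is a minimal sketch of an MRR@10 computation. The function names, the dictionary layout, and the toy document IDs are illustrative assumptions; they are not taken from the competition's scoring script.

```python
from typing import Dict, List, Set


def reciprocal_rank_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int = 10) -> float:
    """Return 1/rank of the first relevant document in the top k, or 0.0 if none."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mrr_at_k(predictions: Dict[str, List[str]], gold: Dict[str, Set[str]], k: int = 10) -> float:
    """Average the reciprocal rank over every query that has gold labels."""
    if not gold:
        return 0.0
    total = sum(
        reciprocal_rank_at_k(predictions.get(query_id, []), relevant, k)
        for query_id, relevant in gold.items()
    )
    return total / len(gold)


if __name__ == "__main__":
    # Hypothetical toy data: q1 hits at rank 2, q2 at rank 1, q3 misses.
    predictions = {"q1": ["d3", "d1", "d7"], "q2": ["d5", "d2"], "q3": ["d9"]}
    gold = {"q1": {"d1"}, "q2": {"d5"}, "q3": {"d4"}}
    print(mrr_at_k(predictions, gold))  # (1/2 + 1 + 0) / 3 = 0.5
```

A query with no prediction is counted as a miss, so the denominator is always the number of labeled queries.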

Technical Approach

Our solution combines four techniques:

BM25 for initial document retrieval
Bi-encoder for semantic encoding
Cross-encoder for result reranking
Data chunking for efficient text processing
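
The first-stage retrieval and chunking steps listed above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in this repository: the chunk size, overlap, BM25 parameters (k1, b), whitespace tokenization, and the choice to score a document by its best chunk are all hypothetical. Vietnamese text would normally need proper word segmentation rather than a whitespace split.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def chunk_words(text: str, size: int = 4, overlap: int = 1) -> List[str]:
    """Split text into overlapping word windows (size and overlap are illustrative)."""
    words = text.split()
    step = size - overlap  # requires size > overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks


class BM25:
    """Okapi BM25 over a tokenized, non-empty corpus."""

    def __init__(self, corpus: List[List[str]], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.term_freqs = [Counter(doc) for doc in corpus]
        self.doc_lens = [len(doc) for doc in corpus]
        self.avg_len = sum(self.doc_lens) / len(corpus)
        doc_freq = Counter(term for tf in self.term_freqs for term in tf)
        n = len(corpus)
        # The +1 inside the log keeps every IDF value positive.
        self.idf = {t: math.log(1 + (n - df + 0.5) / (df + 0.5)) for t, df in doc_freq.items()}

    def score(self, query: List[str], index: int) -> float:
        tf, length = self.term_freqs[index], self.doc_lens[index]
        norm = self.k1 * (1 - self.b + self.b * length / self.avg_len)
        return sum(
            self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
            for t in query
            if t in tf
        )


def retrieve(query: str, docs: Dict[str, str], top_k: int = 10) -> List[Tuple[str, float]]:
    """Score every chunk with BM25 and keep each document's best chunk score."""
    owners, chunks = [], []
    for doc_id, text in docs.items():
        for chunk in chunk_words(text.lower()):
            owners.append(doc_id)
            chunks.append(chunk.split())
    bm25 = BM25(chunks)
    query_terms = query.lower().split()
    best: Dict[str, float] = {}
    for i, doc_id in enumerate(owners):
        best[doc_id] = max(best.get(doc_id, 0.0), bm25.score(query_terms, i))
    return sorted(best.items(), key=lambda item: item[1], reverse=True)[:top_k]


if __name__ == "__main__":
    # Hypothetical two-document corpus.
    docs = {
        "land": "land use rights transfer requires a notarized contract",
        "tax": "personal income tax is declared annually by residents",
    }
    print(retrieve("land transfer contract", docs, top_k=2))
```

In a full pipeline, the candidates returned by this stage would be passed to the bi-encoder and cross-encoder for semantic scoring and reranking.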

Model Checkpoints

BM25 Model

Download the BM25 model checkpoint:

pip install gdown
gdown "1VFT7UiMXgoJzGKqWxId4LORq9v0VcW7T" -O bm25/bm25_model.pkl

Fine-tuned Language Model

Access our fine-tuned BGE-M3 model:

Model: Quintu/bge-m3-legal_retrieval
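
The sketch below shows one way a bi-encoder checkpoint like this could be used to rank documents by cosine similarity. It assumes the checkpoint loads through the sentence-transformers library, which this README does not confirm; the helper names (`cosine`, `rank_by_similarity`, `embed`) are hypothetical. The ranking helpers are pure Python, and the model is imported lazily so they can be tried without downloading anything.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_similarity(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    top_k: int = 10,
) -> List[Tuple[int, float]]:
    """Return (document index, similarity) pairs, most similar first."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


def embed(texts: List[str], model_name: str = "Quintu/bge-m3-legal_retrieval") -> List[List[float]]:
    """Encode texts with the fine-tuned model.

    Assumption: the checkpoint is compatible with sentence-transformers.
    """
    # Imported lazily so the ranking helpers work without this dependency.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(texts, normalize_embeddings=True).tolist()


if __name__ == "__main__":
    # Hypothetical 2-dimensional vectors standing in for real embeddings.
    query_vec = [1.0, 0.0]
    doc_vecs = [[0.0, 1.0], [1.0, 0.0], [0.7, 0.7]]
    print(rank_by_similarity(query_vec, doc_vecs, top_k=2))
```

With real data, `embed` would be called once on the document chunks and once per query, and the resulting vectors passed to `rank_by_similarity`.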

Dataset Setup

Download the dataset:

pip install gdown
gdown --folder 1LO4wmj54lWgQvYiGKKAUSLSv5ypMjcDA

Performance

Our team achieved a Top 8 position using the combined approach of BM25, bi-encoder, and cross-encoder reranking with data chunking techniques.

Video

Watch our project report video here: Project Report Video

