Ria Hari Gusmita, Asep Fajar Firmansyah, Hamada Zahera, Axel-Cyrille Ngonga Ngomo
We present ELEVATE-ID, a framework for evaluating Large Language Models (LLMs) on Entity Linking (EL) tasks for Low-resource Languages (LrLs) using IndEL. This study extends our prior work, in which we developed IndEL and evaluated five traditional EL systems. In this work, we assess multilingual and Indonesian monolingual LLMs in both zero-shot and fine-tuning settings. The multilingual LLMs are GPT-3.5, GPT-4, and LLaMA-3; the Indonesian monolingual LLMs are Komodo and Merak. Our evaluation consists of three main analyses: accuracy analysis, generalization analysis, and error analysis. Accuracy analysis measures how effectively LLMs identify and link entities to their corresponding entries in Wikidata, using precision, recall, and F1-score as metrics. Generalization analysis evaluates the LLMs' ability to transfer knowledge across domains (cross-domain evaluation) and to perform in mixed-domain settings, where models are fine-tuned on combined general- and specific-domain data and tested on the specific domain. Finally, error analysis investigates common types of mistakes made by LLMs, such as misidentifying entities or failing to establish correct links. Together, these analyses provide critical insights into the strengths, limitations, and potential improvements of LLMs in EL tasks for LrLs.
IndEL is the first Indonesian EL benchmark dataset, covering both general and specific domains. It uses Wikidata as its knowledge base and is manually annotated following meticulous guidelines. The entities in the general domain are sourced from the Indonesian NER benchmark dataset NER UI, while those in the specific domain are gathered from IndQNER, an Indonesian NER benchmark dataset built on the Indonesian translation of the Quran. IndEL has been used to evaluate five multilingual EL systems, namely Babelfy, DBpedia Spotlight, MAG, OpenTapioca, and WAT, via the GERBIL framework. Details on the dataset as well as the experiment results can be found here.
Similar to human annotation, where annotation guidelines ensure standardized and correct results, we define dedicated prompts for the LLMs. Each prompt comprises two parts, a task description and the desired output format, as shown below; a sketch of assembling the prompt programmatically follows the table.
| Instruction Template | |
| --- | --- |
| Task Description | Find entities and their corresponding entry links in Wikidata within the following sentence. Use the context of the sentence to determine the correct entries in Wikidata. |
| Output Format | The output should be formatted as: `[entity1=link1, entity2=link2]`. No explanations are needed. |
| Sample Sentence | Pria kelahiran Bogor, 16 Maret 60 tahun silam itu juga ditunjuk sebagai salah satu direktur Indofood dalam RUPS Juni 2008 silam. (A man born in Bogor, 60 years ago on March 16, was also appointed as one of the directors of Indofood in the General Meeting of Shareholders in June 2008.) |
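The full prompt is simply the task description and output format concatenated with the input sentence. The following is a minimal sketch of assembling it in Python; the helper name `build_prompt` and the exact layout are illustrative, not taken from the repository:

```python
# Minimal sketch of assembling the zero-shot instruction prompt.
# The helper name and layout are illustrative, not the repository's code.
TASK_DESCRIPTION = (
    "Find entities and their corresponding entry links in Wikidata within "
    "the following sentence. Use the context of the sentence to determine "
    "the correct entries in Wikidata."
)
OUTPUT_FORMAT = (
    "The output should be formatted as: [entity1=link1, entity2=link2]. "
    "No explanations are needed."
)

def build_prompt(sentence: str) -> str:
    """Concatenate task description, output format, and the input sentence."""
    return f"{TASK_DESCRIPTION}\n{OUTPUT_FORMAT}\nSentence: {sentence}"
```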
In the zero-shot setting, we prompt the LLMs using an instruction format, where the prompt includes only the task description and the output format. In the fine-tuning setting, the LLMs are provided with detailed prompts and example sentences from the dataset. To support the zero-shot and fine-tuning experiments, we split IndEL into training, validation, and test sets at an 8:1:1 ratio. The details of the split are shown below, followed by a sketch of how such a split can be reproduced:
| Domain | Total Sentences | Train | Validation | Test |
| --- | --- | --- | --- | --- |
| General Domain | 2114 | 1673 | 229 | 212 |
| Specific Domain | 2621 | 2075 | 283 | 263 |
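The repository ships the pre-split files, so re-splitting is not required. Still, as a sketch of how an 8:1:1 split can be reproduced, assuming the sentences are held in a Python list and using scikit-learn purely for illustration:

```python
# Illustrative 8:1:1 split; the repository already provides the split files.
from sklearn.model_selection import train_test_split

def split_8_1_1(sentences, seed=42):
    """Split a list of sentences into ~80/10/10 train/validation/test."""
    train, rest = train_test_split(sentences, test_size=0.2, random_state=seed)
    validation, test = train_test_split(rest, test_size=0.5, random_state=seed)
    return train, validation, test
```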
- Model: GPT-4
- Datasets preparation: Remember to update the domain manually in `preparing_dataset.py` before running it:

  ```python
  domain = "general-domain"  # change to "specific-domain" to generate the test dataset for the specific domain, and vice versa
  ```

  ```bash
  cd scripts/gpt
  python preparing_dataset.py
  ```
- Execute zero-shot-based prediction (a minimal sketch of the underlying API call follows this list):

  ```bash
  cd scripts/gpt
  python run_predictions.py
  ```
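In essence, `run_predictions.py` sends each test sentence, wrapped in the instruction prompt above, to the model. Below is a minimal sketch assuming the official `openai` Python client; the model name and prompt wiring are assumptions, not the script's actual code:

```python
# Minimal sketch of zero-shot prediction via the OpenAI API.
# Model name and prompt handling are assumptions, not the script's code.
from openai import OpenAI

INSTRUCTION = (
    "Find entities and their corresponding entry links in Wikidata within "
    "the following sentence. Use the context of the sentence to determine "
    "the correct entries in Wikidata. The output should be formatted as: "
    "[entity1=link1, entity2=link2]. No explanations are needed."
)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def predict(sentence: str) -> str:
    """Send one test sentence to the model and return the raw prediction."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": f"{INSTRUCTION}\n{sentence}"}],
        temperature=0,
    )
    return response.choices[0].message.content
```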
- Model: Komodo-7b-base, Llama-3-8B-Instruct, Merak-7B-v4
- Datasets preparation: The datasets are available in the `datasets` directory.
- Execute zero-shot-based prediction: the values of `domain` and `base_model_name` in the script are subject to change. A minimal generation sketch follows this list.

  ```bash
  cd scripts/llama
  python run_predictions.py
  ```
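For the open-weights models, prediction boils down to greedy generation from the instruction prompt. Here is a minimal sketch with Hugging Face `transformers`; the model ID and generation settings are assumptions, not the script's actual configuration:

```python
# Minimal sketch of zero-shot prediction with an open-weights model.
# base_model_name and the generation settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # or the Komodo/Merak checkpoints

tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(
    base_model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

def predict(prompt: str) -> str:
    """Greedily generate a prediction for one prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```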
- Datasets preparation: To prepare the datasets for fine-tuning the GPT model, refer to step 2 in the LLaMA Family section. Pay attention to the path of the source dataset, the domain, and the name of the file in which the processed data will be stored.
- Fine-tuning process
- GPT-3.5: We fine-tune GPT-3.5 because, unlike GPT-4, it was available to us for fine-tuning. To perform this process, follow the procedure on OpenAI's fine-tuning platform; an API-based sketch with the same settings follows this list. In this experiment, we set the hyperparameters as follows: number of epochs = 3, batch size = 8, and learning rate multiplier = 2.
- LLaMA Family: The values of `domain` and `base_model_name` in the script are subject to change. A rough sketch of a typical fine-tuning recipe also follows this list.

  ```bash
  cd scripts/llama
  python llm-finetuning.py
  ```
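The GPT-3.5 settings above can also be submitted programmatically instead of through the web platform. Here is a sketch assuming the official `openai` client and an already-uploaded JSONL training file; the file ID is a placeholder:

```python
# Sketch of launching the GPT-3.5 fine-tuning job with the reported
# hyperparameters; the training-file ID is a placeholder.
from openai import OpenAI

client = OpenAI()

job = client.fine_tuning.jobs.create(
    model="gpt-3.5-turbo",
    training_file="file-XXXXXXXX",  # ID returned by client.files.create(...)
    hyperparameters={
        "n_epochs": 3,
        "batch_size": 8,
        "learning_rate_multiplier": 2,
    },
)
print(job.id)
```

For the open-weights models, `llm-finetuning.py` performs the training. As a rough illustration only, a common recipe is parameter-efficient fine-tuning with LoRA via `peft`; the hyperparameters and target modules below are assumptions, not the script's actual configuration:

```python
# Rough illustration of a LoRA-style setup with peft/transformers.
# All settings here are assumptions, not those of llm-finetuning.py.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(base_model_name, device_map="auto")

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```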
We evaluate four LLMs, GPT-4, Komodo, LLaMA-3, and Merak, on the EL task with the IndEL dataset in the zero-shot setting. In the fine-tuning setting, we evaluate GPT-3.5 (we did not have access to fine-tune GPT-4), Komodo, LLaMA-3, and Merak. The following are the results, measured in precision, recall, and F1-score; a sketch of how these metrics can be computed over (entity, link) pairs follows the tables.
**General Domain with Zero-shot**

| Metrics | GPT-4 | Komodo | LLaMA-3 | Merak |
| --- | --- | --- | --- | --- |
| Precision | 0.083 | 0.000 | 0.003 | 0.000 |
| Recall | 0.089 | 0.000 | 0.003 | 0.000 |
| F1 | 0.083 | 0.000 | 0.003 | 0.000 |

**Specific Domain with Zero-shot**

| Metrics | GPT-4 | Komodo | LLaMA-3 | Merak |
| --- | --- | --- | --- | --- |
| Precision | 0.010 | 0.000 | 0.000 | 0.000 |
| Recall | 0.016 | 0.000 | 0.000 | 0.000 |
| F1 | 0.012 | 0.000 | 0.000 | 0.000 |
**General Domain with Fine-tuning**

| Metrics | GPT-3.5 | Komodo | LLaMA-3 | Merak |
| --- | --- | --- | --- | --- |
| Precision | 0.385 | 0.018 | 0.084 | 0.045 |
| Recall | 0.373 | 0.026 | 0.117 | 0.039 |
| F1 | 0.373 | 0.021 | 0.093 | 0.041 |

**Specific Domain with Fine-tuning**

| Metrics | GPT-3.5 | Komodo | LLaMA-3 | Merak |
| --- | --- | --- | --- | --- |
| Precision | 0.616 | 0.221 | 0.415 | 0.446 |
| Recall | 0.610 | 0.471 | 0.444 | 0.393 |
| F1 | 0.611 | 0.285 | 0.409 | 0.407 |
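Precision, recall, and F1 above are computed over predicted (entity, Wikidata link) pairs. Here is a minimal sketch of micro-averaged scoring, assuming gold and predicted annotations are given as one set of pairs per sentence; this is an illustration, not the repository's actual evaluation script:

```python
# Sketch of micro-averaged precision/recall/F1 over (entity, link) pairs.
# Illustrative only; not the repository's evaluation script.
def score(gold: list[set], pred: list[set]) -> tuple[float, float, float]:
    """gold/pred: one set of (entity, link) pairs per sentence."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # correctly linked pairs
    n_pred = sum(len(p) for p in pred)                # all predicted pairs
    n_gold = sum(len(g) for g in gold)                # all gold pairs
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```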
If you have any questions or feedback, feel free to contact us at ria.hari.gusmita@uni-paderborn.de or ria.gusmita@uinjkt.ac.id.