Slovenian Language Technologies Overview

A collaborative overview of knowledge on large language models (LLMs), speech technologies, and other natural-language-processing (NLP) technologies for the Slovenian language. The overview is curated by CLASSLA, the CLARIN Knowledge Centre for South Slavic languages.

For an overview of freely available datasets for the Slovenian language, including general text collections and training and test datasets for various NLP tasks, see the Frequently Asked Questions for Slovenian, provided by CLASSLA. The FAQ also covers resources and technologies for the linguistic annotation of Slovenian texts.

Main sites where you can find language technologies for Slovenian:

Content:

Generative Models (LLMs) for Slovenian

Open-Source Instruction-Tuned Models:

  • specialised for Slovenian: the recently released instruction-tuned GaMS model by CJVT, GaMS-1B-Chat (Vreš et al., 2024): a 1B-parameter model developed as part of the POVEJMO project; larger models will follow as the final products of the project
  • multilingual models that performed well on Slovenian and South Slavic languages (and dialects) on the COPA task (see the paper by Ljubešić et al., 2024):
  • other open-source instruction-tuned and base models that are often used by researchers for fine-tuning experiments in Slovenian language:
    • the Llama model families Llama 3.1 and Llama 3.2: Llama 3.1 provides good results for Slovenian summarization,
    • the Gemma and instruction-tuned Gemma-it models: slightly worse results for Slovenian summarization.
  • based on experience (e.g., the paper by Ljubešić et al., 2024, which used its predecessor GPT-4), the closed-source GPT-4o by OpenAI still performs best on Slovenian classification tasks
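As a concrete starting point, an instruction-tuned model such as GaMS-1B-Chat can be queried through the Hugging Face `transformers` pipeline API. This is a minimal sketch, assuming the model is published on the Hugging Face Hub under the ID `cjvt/GaMS-1B-Chat` (verify the exact ID on the Hub); the message format is the standard role/content chat format used by `transformers` chat pipelines:

```python
def build_chat(user_message: str) -> list[dict]:
    """Wrap a user message in the single-turn chat format used by HF chat pipelines."""
    return [{"role": "user", "content": user_message}]


def generate(model_id: str, user_message: str, max_new_tokens: int = 256) -> str:
    """Run one chat turn through a text-generation pipeline (downloads the model on first use)."""
    from transformers import pipeline  # lazy import: transformers is only needed here

    generator = pipeline("text-generation", model=model_id)
    out = generator(build_chat(user_message), max_new_tokens=max_new_tokens)
    # With chat-format input, generated_text holds the whole conversation;
    # the last message is the assistant's reply.
    return out[0]["generated_text"][-1]["content"]
```

A call such as `generate("cjvt/GaMS-1B-Chat", "Kdo je France Prešeren?")` downloads the model weights on first use; the lazy import keeps the helper importable even without `transformers` installed.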

Other Decoder-Style Models:

  • t5-sl-small and t5-sl-large (Ulčar and Robnik-Šikonja, 2023): Slovene T5 models that can be used for generative tasks (summarization, text simplification, etc.). The smaller model performs comparably to the larger one; however, when extensive fine-tuning data is available, the larger model is expected to surpass it.
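T5-style models have a limited input window (typically 512 tokens), so longer documents are usually split into chunks before summarization. A hedged sketch, assuming the models are available on the Hugging Face Hub under `cjvt/t5-sl-small` / `cjvt/t5-sl-large` (verify the IDs) and have been fine-tuned for summarization first; the base checkpoints will not summarize out of the box:

```python
def chunk_words(text: str, max_words: int = 400) -> list[str]:
    """Split text into word-bounded chunks so each roughly fits the T5 input window."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize(text: str, model_id: str = "cjvt/t5-sl-small",
              max_new_tokens: int = 128) -> list[str]:
    """Summarize each chunk with a (summarization-fine-tuned) Slovene T5 model."""
    from transformers import pipeline  # lazy import: only needed when running the model

    summarizer = pipeline("summarization", model=model_id)
    return [out["summary_text"]
            for out in summarizer(chunk_words(text), max_new_tokens=max_new_tokens)]
```

The word-based chunking is a crude proxy for the tokenizer's true limit; for production use, chunk by tokenizer output instead.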

Benchmarks:

  • SloBench evaluation for generative models: a framework that supports evaluation of generative models on SloBench tasks (using Slovene SuperGLUE and SI-NLI datasets).
  • Slovenian LLM Evaluation: a framework that supports evaluation of generative models on the Slovenian LLM Evaluation Dataset. The dataset comprises several common English benchmarks (ARC Challenge, ARC Easy, BoolQ, HellaSwag, NQ Open, OpenBookQA, PIQA, TriviaQA, Winogrande) that were machine-translated into Slovenian.
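Whichever framework is used, multiple-choice benchmarks like the translated ARC or BoolQ ultimately reduce to comparing predicted answer keys against gold labels. A minimal, framework-agnostic accuracy computation (the function names are illustrative, not part of either framework):

```python
def normalize(answer: str) -> str:
    """Case- and whitespace-insensitive comparison key for answer strings."""
    return answer.strip().lower()


def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of predictions that match the gold answers after normalization."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    return sum(normalize(p) == normalize(g)
               for p, g in zip(predictions, gold)) / len(gold)
```

For example, `accuracy(["A", "b", "C"], ["a", "B", "D"])` scores two of three items correct. Real harnesses add per-task prompt templates and log-likelihood scoring of options, but the final metric is computed exactly like this.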

Papers:

Embedding Models & RAG for Slovenian

Open-Source Embedding Models:

Benchmarks:

Papers:

Machine Translation for Slovenian

Open-Source Models:

Benchmarks:

Papers:

BERT-Like Pretrained Models for Slovenian

Monolingual / Smaller Multilingual Models:

Massively Multilingual Models:

Papers:

Fine-Tuned Models for Slovenian

Models & Papers:

Benchmarks:

Papers:

Speech Technologies for Slovenian

Automatic Speech Recognition (ASR) Models:

Other technologies:

Benchmarks:

Other Language Technologies for Slovenian

Tools:

Authors

This document is supported by CLASSLA, the CLARIN knowledge centre for South Slavic languages. For any questions or suggestions related to this document, write to the CLASSLA helpdesk: helpdesk.classla@clarin.si.

To be informed of new resources, technologies, events and projects for South Slavic languages:

The main author and curator of this document is Taja Kuzman (Department of Knowledge Technologies, Jožef Stefan Institute).

Special thanks to the other contributors:

  • Peter Rupnik (Department of Knowledge Technologies, Jožef Stefan Institute)
  • Matej Martinc (Department of Knowledge Technologies, Jožef Stefan Institute)
  • Erik Novak (Department for Artificial Intelligence, Jožef Stefan Institute)
  • Domen Vreš (Faculty of Computer and Information Science, University of Ljubljana)
  • Aleš Žagar (Faculty of Computer and Information Science, University of Ljubljana)
  • Simon Dobrišek (Faculty of Electrical Engineering, University of Ljubljana)
