A collaboratively maintained overview of large language models (LLMs), speech technologies, and other natural language processing (NLP) technologies for the Slovenian language. The overview is curated by CLASSLA, the CLARIN Knowledge Centre for South Slavic languages.
For an overview of freely available datasets for the Slovenian language, including general text collections and training and test datasets for various NLP tasks, see the Frequently Asked Questions for Slovenian, provided by CLASSLA. The FAQ also provides information about resources and technologies for linguistic annotation of Slovenian texts.
Main sites where you can find language technologies for Slovenian:
- CLARIN.SI repository
- CJVT organization profile at Hugging Face
- CLASSLA organization profile at Hugging Face
Contents:
- Generative models (LLMs) for Slovenian
- Embedding models & RAG for Slovenian
- Machine translation for Slovenian
- BERT-like pretrained models for Slovenian
- Fine-tuned models for Slovenian
- Speech technologies for Slovenian
- Other language technologies for Slovenian
- Authors
Open-Source Instruction-Tuned Models:
- specialised for Slovenian: the recently released instruction-tuned GaMS model by CJVT: GaMS-1B-Chat (Vreš et al., 2024), a 1-billion-parameter model developed as part of the POVEJMO project; larger models will follow as the project's final deliverables
- multilingual models that performed well on Slovenian and other South Slavic languages (and dialects) on the COPA task (see the paper by Ljubešić et al., 2024):
- other open-source instruction-tuned and base models that are often used by researchers for fine-tuning experiments on Slovenian:
- based on experience (e.g., the paper by Ljubešić et al., 2024, which used its predecessor GPT-4), the closed-source GPT-4o by OpenAI still performs best on Slovenian classification tasks
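As a minimal sketch of how an instruction-tuned model such as GaMS-1B-Chat can be queried with the Hugging Face transformers library: the model id "cjvt/GaMS-1B-Chat" and the availability of a chat template are assumptions here, so check the model card for the exact prompt format before relying on this.

```python
# Hedged sketch: querying an instruction-tuned Slovenian model via transformers.
# The model id "cjvt/GaMS-1B-Chat" and the chat template are assumptions;
# consult the model card for the exact usage.

def build_chat(user_message, history=None):
    """Turn a user message plus optional (user, assistant) history into the
    role/content message list expected by tokenizer.apply_chat_template()."""
    messages = []
    for user_turn, assistant_turn in (history or []):
        messages.append({"role": "user", "content": user_turn})
        messages.append({"role": "assistant", "content": assistant_turn})
    messages.append({"role": "user", "content": user_message})
    return messages

def main():
    # Heavy imports kept inside main() so the helper above stays dependency-free.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "cjvt/GaMS-1B-Chat"  # assumed id; verify on Hugging Face
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    messages = build_chat("Na kratko opiši mesto Ljubljana.")
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    )
    outputs = model.generate(inputs, max_new_tokens=128, do_sample=False)
    # Decode only the newly generated continuation, not the prompt.
    print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))

if __name__ == "__main__":
    main()
```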
Other Generative Models:
- t5-sl-small and t5-sl-large (Ulčar and Robnik-Šikonja, 2023): Slovene T5 models that can be used for generative tasks (summarization, text simplification, etc.). The smaller model performs comparably to the larger one; however, when extensive fine-tuning data is available, the larger model is expected to outperform it.
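A short sketch of loading a Slovene T5 model for generation: the Hugging Face id "cjvt/t5-sl-small" is an assumption, and since t5-sl is a pretrained (not task-tuned) model, it would normally be fine-tuned before being used for summarization or simplification. The truncation helper is illustrative, not part of any library.

```python
# Hedged sketch: loading a Slovene T5 model for a generative task.
# The id "cjvt/t5-sl-small" is assumed; t5-sl is a pretrained model,
# so for tasks like summarization it should first be fine-tuned.

def truncate_words(text, max_words=400):
    """Crude whitespace-based truncation so very long inputs do not blow past
    the encoder's context window (a rough pre-filter, not token-accurate)."""
    return " ".join(text.split()[:max_words])

def main():
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    model_id = "cjvt/t5-sl-small"  # assumed id; verify on Hugging Face
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

    text = truncate_words("Dolg slovenski članek, ki ga želimo povzeti ...")
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

if __name__ == "__main__":
    main()
```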
Benchmarks:
- SloBench evaluation for generative models: a framework that supports evaluation of generative models on SloBench tasks (using Slovene SuperGLUE and SI-NLI datasets).
- Slovenian LLM Evaluation: a framework that supports evaluation of generative models on the Slovenian LLM Evaluation Dataset. The dataset comprises multiple common English benchmarks (ARC Challenge, ARC Easy, BoolQ, HellaSwag, NQ Open, OpenBookQA, PIQA, TriviaQA, Winogrande) that were machine-translated to Slovenian.
Papers:
- Generative Model for Less-Resourced Language with 1 Billion Parameters (Vreš et al., 2024)
- JSI and WüNLP at the DIALECT-COPA Shared Task: In-Context Learning From Just a Few Dialectal Examples Gets You Quite Far (Ljubešić et al., 2024)
- Sequence-to-sequence pretraining for a less-resourced Slovenian language (Ulčar and Robnik-Šikonja, 2023)
Open-Source Embedding Models:
- based on a paper evaluating retrieval capabilities (Kuzman et al., 2024), the best smaller open-source embedding models for Slovenian are BGE-M3 and Multilingual-E5-large
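The retrieval step behind such a RAG pipeline can be sketched as follows, assuming the sentence-transformers library and the model id "BAAI/bge-m3" (Multilingual-E5-large can be swapped in the same way, but note that E5 expects "query: "/"passage: " prefixes per its model card). The cosine-similarity helpers are plain illustrative Python, not library code.

```python
# Hedged sketch: dense retrieval over Slovenian documents with a multilingual
# embedding model. The model id "BAAI/bge-m3" is an assumption; verify it
# on Hugging Face before use.
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (plain lists)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(query_vec, doc_vecs, k=3):
    """Indices of the k most similar document vectors, best first."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

def main():
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("BAAI/bge-m3")  # assumed id
    docs = ["Ljubljana je glavno mesto Slovenije.",
            "Triglav je najvišja gora v Sloveniji."]
    query = "Katero je glavno mesto Slovenije?"
    doc_vecs = [v.tolist() for v in model.encode(docs)]
    query_vec = model.encode(query).tolist()
    for i in top_k(query_vec, doc_vecs, k=1):
        print(docs[i])  # the passage most relevant to the query

if __name__ == "__main__":
    main()
```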
Benchmarks:
- PandaChat-RAG Benchmark: a benchmark for evaluating the retrieval capabilities of RAG pipelines
Papers:
- PandaChat-RAG: Towards the Benchmark for Slovenian RAG Applications (Kuzman et al., 2024)
Open-Source Models:
- No Language Left Behind (NLLB) massively multilingual models are frequently used for large-scale machine translation.
- Within the ParlaMint project, which deals with parliamentary texts, the OPUS-MT models, used through the EasyNMT library, proved the most useful for our purposes. For Slovenian-to-English translation, we used the opus-mt-sla-en model.
- RSDO-DS4-NMT 1.2.6: a neural machine translation model for the Slovene-English language pair, developed within the RSDO project. A demo is available. Code for the API service is available here.
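As a sketch of large-scale translation with an NLLB model through the Hugging Face translation pipeline: NLLB addresses languages via FLORES-200 codes ("slv_Latn" for Slovenian, "eng_Latn" for English), and the small code-mapping helper below is purely illustrative, not part of any library.

```python
# Hedged sketch: Slovenian-to-English translation with an NLLB checkpoint via
# the Hugging Face pipeline. The FLORES map below is a tiny demo table, not a
# complete or official mapping.

FLORES = {"sl": "slv_Latn", "en": "eng_Latn", "hr": "hrv_Latn", "sr": "srp_Cyrl"}

def flores_code(iso_code):
    """Map a two-letter ISO 639-1 code to its FLORES-200 code (demo map only)."""
    try:
        return FLORES[iso_code]
    except KeyError:
        raise ValueError(f"no FLORES-200 code configured for {iso_code!r}")

def main():
    from transformers import pipeline

    translator = pipeline(
        "translation",
        model="facebook/nllb-200-distilled-600M",  # smallest NLLB checkpoint
        src_lang=flores_code("sl"),
        tgt_lang=flores_code("en"),
    )
    result = translator("Jezikovne tehnologije za slovenščino se hitro razvijajo.")
    print(result[0]["translation_text"])

if __name__ == "__main__":
    main()
```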
Benchmarks:
- SloBench Machine Translation benchmarks: Slovenian-to-English and English-to-Slovenian
Monolingual / Smaller Multilingual Models:
- SloBERTa: monolingual Slovenian BERT-like model, available also on the CLARIN.SI repository (Ulčar and Robnik-Šikonja, 2021)
- CroSloEngual BERT: a trilingual model trained on Croatian, Slovenian, and English corpora (Ulčar and Robnik-Šikonja, 2020)
- SloBERTa-SlEng: a Slovenian-English model based on SloBERTa, further pretrained on conversational English and Slovene corpora. The model is especially suitable for tasks involving conversational, non-standard, and slang language (Yadav et al., 2024).
- sloberta-finetuned-dlib-1850-1919: a SloBERTa model, fine-tuned on Slovenian texts from the period 1850-1919. The texts were collected from the Slovenian Digital Library (https://dlib.si).
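A BERT-like model such as SloBERTa can be tried out directly with the fill-mask pipeline before any fine-tuning. The model id "EMBEDDIA/sloberta" and its "<mask>" token are assumptions here; check the model card for both. The ranking helper is illustrative.

```python
# Hedged sketch: masked-token prediction with SloBERTa via the Hugging Face
# fill-mask pipeline. Model id "EMBEDDIA/sloberta" and the "<mask>" token
# are assumptions; verify on the model card.

def best_tokens(predictions, k=3):
    """From a fill-mask result (a list of dicts with "token_str" and "score"
    keys), return the k highest-scoring token strings, stripped of padding."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [p["token_str"].strip() for p in ranked[:k]]

def main():
    from transformers import pipeline

    fill = pipeline("fill-mask", model="EMBEDDIA/sloberta")  # assumed id
    preds = fill("Ljubljana je glavno <mask> Slovenije.")
    print(best_tokens(preds))

if __name__ == "__main__":
    main()
```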
Massively Multilingual Models:
- Massively multilingual XLM-RoBERTa model: frequently used for fine-tuning on Slovenian and multilingual data for various NLP tasks (Conneau et al., 2019)
- Multilingual parliamentary model XLM-R-parla: XLM-RoBERTa model, additionally pretrained on parliamentary data, including Slovenian, to be used for NLP tasks applied on parliamentary texts (Mochtak et al., 2024)
Papers:
- FinEst BERT and CroSloEngual BERT: less is more in multilingual models (Ulčar and Robnik-Šikonja, 2020)
- Evaluation of contextual embeddings on less-resourced languages (Ulčar et al., 2021)
Models & Papers:
- Sentiment in parliamentary texts: Multilingual parliament sentiment regression model XLM-R-ParlaSent (Mochtak et al., 2024)
- Text genre prediction: X-GENRE classifier - multilingual text genre classifier (Kuzman et al., 2023)
- News topic prediction: Text classification model SloBERTa-Trendi-Topics 1.0 (Kosem et al., 2023)
- Hate speech classification in social media content: Multilingual Hate Speech Classifier for Social Media Content (Pelicon et al., 2021)
- Summarization of Slovenian texts: SloSummarizer (Žagar and Robnik-Šikonja, 2021). Summarization models are available here. Demo is available here.
- Slovenian Question-Answering models: SloQA
- Named Entity Recognition: PyTorch model for Slovenian Named Entity Recognition SloNER. Demo is available here. The source code is available on GitHub.
- Coreference Resolution for Slovenian: PyTorch model for Slovenian Coreference Resolution (Klemen and Žitnik, 2022). Demo is available here. The source code is available on GitHub.
- Relation Extraction for Slovenian language: SloREL tool
- Word-sense disambiguation: SloWSD model
- Annotation of incorrect spelling in Slovenian language: SloBERTa Incorrect Spelling Annotator
- Prediction of commonsense descriptions in natural language: Slovenian commonsense reasoning model SloMET-ATOMIC 2020 (also available on GitHub) (Mladenić Grobelnik et al., 2022)
- Fine-tuned BERT model for semantic frame extraction in olfactory events (Menini, 2024)
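Fine-tuned classifiers like those above are typically used through the text-classification pipeline. The sketch below assumes the X-GENRE classifier under the Hugging Face id "classla/xlm-roberta-base-multilingual-text-genre-classifier"; the label-counting helper is illustrative only.

```python
# Hedged sketch: genre prediction for Slovenian texts with the X-GENRE
# classifier. The model id is an assumption; verify it on Hugging Face.
from collections import Counter

def label_counts(predictions):
    """Count predicted labels across documents; predictions is a list of
    dicts with a "label" key, as returned by the pipeline."""
    return Counter(p["label"] for p in predictions)

def main():
    from transformers import pipeline

    classifier = pipeline(
        "text-classification",
        model="classla/xlm-roberta-base-multilingual-text-genre-classifier",
        truncation=True,  # long documents are cut to the model's max length
    )
    texts = [
        "Recept za potico: najprej zamesimo testo ...",
        "Vlada je danes sprejela nov zakon o medijih.",
    ]
    preds = classifier(texts)
    print(label_counts(preds))  # distribution of predicted genres

if __name__ == "__main__":
    main()
```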
Benchmarks:
- Natural language inference benchmark at SloBench
- Slovene SuperGLUE benchmark at SloBench (Žagar and Robnik-Šikonja, 2022)
- Named Entity Recognition benchmark at SloBench
- Universal Dependency Parsing benchmark at SloBench
- Semantic Change Detection Evaluation Dataset (Pranjić et al., 2024)
Papers:
- Code-mixed Sentiment and Hate-speech Prediction (Yadav et al., 2024)
- The ParlaSent Multilingual Training Dataset for Sentiment Identification in Parliamentary Proceedings (Mochtak et al., 2024)
- Automatic Genre Identification for Robust Enrichment of Massive Text Collections: Investigation of Classification Methods in the Era of Large Language Models (Kuzman et al., 2024)
- Investigating cross-lingual training for offensive language detection (Pelicon et al., 2021)
- Zero-Shot Learning for Cross-Lingual News Sentiment Classification (Pelicon et al., 2020)
Automatic Speech Recognition (ASR) Models:
- RSDO-DS2-ASR-E2E 2.0: a Slovene Conformer CTC BPE end-to-end automatic speech recognition model, developed within the RSDO project, available on the CLARIN.SI repository and GitHub (demo). Note: the maximum accepted audio duration is 300 s.
- Whisper: a massively multilingual open-source model by OpenAI.
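Whisper can be run on Slovenian audio through the automatic-speech-recognition pipeline, sketched below under the assumption that a recent transformers version accepts the language via generate_kwargs; the file path is a placeholder. The chunking helper is illustrative: it splits a long recording into windows that respect a duration cap such as the 300 s limit of the RSDO ASR model above.

```python
# Hedged sketch: transcribing Slovenian audio with Whisper. The local file
# "posnetek.wav" is a placeholder, and passing the language through
# generate_kwargs assumes a recent transformers version.

def chunk_spans(total_seconds, max_seconds=300.0):
    """Split a duration into consecutive (start, end) spans of at most
    max_seconds each, covering the whole recording."""
    spans, start = [], 0.0
    while start < total_seconds:
        end = min(start + max_seconds, total_seconds)
        spans.append((start, end))
        start = end
    return spans

def main():
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
    result = asr(
        "posnetek.wav",  # placeholder path to a local recording
        generate_kwargs={"language": "slovenian", "task": "transcribe"},
    )
    print(result["text"])

if __name__ == "__main__":
    main()
```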
Other technologies:
- Detection of Filled Pauses in Speech
- Slovenian Text Normalizer RSDO-DS2-NORM, also available on GitHub
- Slovenian Text Denormalizer RSDO-DS2-DENORM, also available on GitHub
- Slovenian Grapheme-to-Phoneme Converter
- Slovenian Punctuation and Capitalisation model RSDO-DS2-P&C, code for an API service available on GitHub
- Speech Denoising Tool
- Slovenian Speech and Transcription Alignment using Montreal Forced Aligner
- Slovenian Speech Anonymization
- Speech Audio Validation
Tools:
- Linguistic Processing Pipeline CLASSLA: the CLASSLA pipeline provides processing of standard and non-standard (Internet) Slovene at the levels of tokenization and sentence splitting, part-of-speech tagging, lemmatisation, dependency parsing, and named entity recognition (Ljubešić et al., 2024). Demo is available here. More information on tools for linguistic annotation of Slovenian texts is available here.
- Diacritic restoration tool for Croatian, Serbian and Slovene (Ljubešić et al., 2016)
- Corpus extraction tool LIST (Ključevšek et al., 2018) for extraction of lists of characters, sub-words, words and word sets from text corpora (the instruction manual is available here)
- A System for Semantic Change Detection for Slovenian (Montariol et al., 2021)
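Basic use of the CLASSLA pipeline can be sketched as below, assuming the classla Python package; the exact option for switching to non-standard Slovene (e.g. a `type` argument) should be checked against the CLASSLA documentation. The CoNLL-U lemma extractor is an illustrative helper, not part of the package.

```python
# Hedged sketch: annotating standard Slovene with the CLASSLA pipeline.
# Assumes the classla package; check its documentation for the exact
# initialisation options (e.g. for non-standard Internet Slovene).

def lemmas_from_conllu(conllu):
    """Pull the LEMMA column (index 2) out of CoNLL-U formatted text,
    skipping comment lines and multi-word-token ranges."""
    out = []
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) > 2 and cols[0].isdigit():
            out.append(cols[2])
    return out

def main():
    import classla

    classla.download("sl")  # fetch the Slovene models once
    nlp = classla.Pipeline("sl", processors="tokenize,pos,lemma,depparse,ner")
    doc = nlp("Ljubljana je glavno mesto Slovenije.")
    print(lemmas_from_conllu(doc.to_conll()))

if __name__ == "__main__":
    main()
```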
This document is supported by CLASSLA, the CLARIN knowledge centre for South Slavic languages. For any questions or suggestions related to this document, write to the CLASSLA helpdesk: helpdesk.classla@clarin.si.
To be informed of new resources, technologies, events and projects for South Slavic languages:
- you can subscribe to the mailing list
- follow CLARIN.SI on X and LinkedIn
- join the Discord group "Slovenska skupnost za jezikovne vire in tehnologije" (Slovenian community for language resources and technologies)
The main author and curator of this document is: Taja Kuzman (Department of Knowledge Technologies, Jožef Stefan Institute).
Special thanks also to other contributors:
- Peter Rupnik (Department of Knowledge Technologies, Jožef Stefan Institute)
- Matej Martinc (Department of Knowledge Technologies, Jožef Stefan Institute)
- Erik Novak (Department for Artificial Intelligence, Jožef Stefan Institute)
- Domen Vreš (Faculty of Computer and Information Science, University of Ljubljana)
- Aleš Žagar (Faculty of Computer and Information Science, University of Ljubljana)
- Simon Dobrišek (Faculty of Electrical Engineering, University of Ljubljana)