A collaboratively maintained overview of large language models (LLMs), speech technologies, and other natural language processing (NLP) technologies for the Slovenian language. The overview is curated by CLASSLA, the CLARIN Knowledge Centre for South Slavic languages.
For an overview of freely available datasets for the Slovenian language, including general text collections and training and test datasets for various NLP tasks, see the Frequently Asked Questions for Slovenian, provided by CLASSLA. The FAQ also provides information about resources and technologies for linguistic annotation of Slovenian texts.
Main sites where you can find language technologies for Slovenian:
- CLARIN.SI repository
- CJVT organization profile at Hugging Face
- CLASSLA organization profile at Hugging Face
Contents:
- Generative models (LLMs) for Slovenian
- Embedding models & RAG for Slovenian
- Machine translation for Slovenian
- BERT-like pretrained models for Slovenian
- Fine-tuned models for Slovenian
- Speech technologies for Slovenian
- Other language technologies for Slovenian
- Authors
Open-Source Instruction-Tuned Models:
- specialised for Slovenian: the recently released instruction-tuned GaMS model by CJVT: GaMS-1B-Chat (Vreš et al., 2024), a 1-billion-parameter model developed as part of the POVEJMO project; larger models will follow as the project's final deliverables
- multilingual models that performed well on Slovenian and other South Slavic languages (and dialects) on the COPA task (see the paper by Ljubešić et al., 2024):
- other open-source instruction-tuned and base models that are often used by researchers for fine-tuning experiments on Slovenian:
- based on experience (e.g., the paper by Ljubešić et al., 2024, which used its predecessor GPT-4), the closed-source GPT-4o by OpenAI still performs best on Slovenian classification tasks
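As a minimal sketch of how an instruction-tuned model such as GaMS-1B-Chat can be queried with the Hugging Face transformers library: the model id "cjvt/GaMS-1B-Chat" and the availability of a chat template are assumptions here, so check the model card for the exact prompt format before relying on this.

```python
# Hedged sketch: querying an instruction-tuned Slovenian model via transformers.
# The model id "cjvt/GaMS-1B-Chat" and the chat template are assumptions;
# consult the model card for the exact usage.

def build_chat(user_message, history=None):
    """Turn a user message plus optional (user, assistant) history into the
    role/content message list expected by tokenizer.apply_chat_template()."""
    messages = []
    for user_turn, assistant_turn in (history or []):
        messages.append({"role": "user", "content": user_turn})
        messages.append({"role": "assistant", "content": assistant_turn})
    messages.append({"role": "user", "content": user_message})
    return messages

def main():
    # Heavy imports kept inside main() so the helper above stays dependency-free.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "cjvt/GaMS-1B-Chat"  # assumed id; verify on Hugging Face
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    messages = build_chat("Na kratko opiši mesto Ljubljana.")
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    )
    outputs = model.generate(inputs, max_new_tokens=128, do_sample=False)
    # Decode only the newly generated continuation, not the prompt.
    print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))

if __name__ == "__main__":
    main()
```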
Other Generative Models:
- t5-sl-small and t5-sl-large (Ulčar and Robnik-Šikonja, 2023): Slovene T5 models that can be used for generative tasks (summarization, text simplification, etc.). The smaller model performs comparably to the larger one; however, when extensive fine-tuning data is available, the larger model is expected to outperform it.
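A short sketch of loading a Slovene T5 model for generation: the Hugging Face id "cjvt/t5-sl-small" is an assumption, and since t5-sl is a pretrained (not task-tuned) model, it would normally be fine-tuned before being used for summarization or simplification. The truncation helper is illustrative, not part of any library.

```python
# Hedged sketch: loading a Slovene T5 model for a generative task.
# The id "cjvt/t5-sl-small" is assumed; t5-sl is a pretrained model,
# so for tasks like summarization it should first be fine-tuned.

def truncate_words(text, max_words=400):
    """Crude whitespace-based truncation so very long inputs do not blow past
    the encoder's context window (a rough pre-filter, not token-accurate)."""
    return " ".join(text.split()[:max_words])

def main():
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    model_id = "cjvt/t5-sl-small"  # assumed id; verify on Hugging Face
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

    text = truncate_words("Dolg slovenski članek, ki ga želimo povzeti ...")
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

if __name__ == "__main__":
    main()
```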
Benchmarks:
- SloBench evaluation for generative models: a framework that supports evaluation of generative models on SloBench tasks (using Slovene SuperGLUE and SI-NLI datasets).
- Slovenian LLM Evaluation: a framework that supports evaluation of generative models on the Slovenian LLM Evaluation Dataset. The dataset comprises multiple common English benchmarks (ARC Challenge, ARC Easy, BoolQ, HellaSwag, NQ Open, OpenBookQA, PIQA, TriviaQA, Winogrande) that were machine-translated to Slovenian.
Papers:
- Generative Model for Less-Resourced Language with 1 Billion Parameters (Vreš et al., 2024)
- JSI and WüNLP at the DIALECT-COPA Shared Task: In-Context Learning From Just a Few Dialectal Examples Gets You Quite Far (Ljubešić et al., 2024)
- Sequence-to-sequence pretraining for a less-resourced Slovenian language (Ulčar and Robnik-Šikonja, 2023)
Open-Source Embedding Models:
- based on a paper evaluating retrieval capabilities (Kuzman et al., 2024), the best smaller open-source embedding models for Slovenian are BGE-M3 and Multilingual-E5-large
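The retrieval step behind such a RAG pipeline can be sketched as follows, assuming the sentence-transformers library and the model id "BAAI/bge-m3" (Multilingual-E5-large can be swapped in the same way, but note that E5 expects "query: "/"passage: " prefixes per its model card). The cosine-similarity helpers are plain illustrative Python, not library code.

```python
# Hedged sketch: dense retrieval over Slovenian documents with a multilingual
# embedding model. The model id "BAAI/bge-m3" is an assumption; verify it
# on Hugging Face before use.
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (plain lists)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(query_vec, doc_vecs, k=3):
    """Indices of the k most similar document vectors, best first."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

def main():
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("BAAI/bge-m3")  # assumed id
    docs = ["Ljubljana je glavno mesto Slovenije.",
            "Triglav je najvišja gora v Sloveniji."]
    query = "Katero je glavno mesto Slovenije?"
    doc_vecs = [v.tolist() for v in model.encode(docs)]
    query_vec = model.encode(query).tolist()
    for i in top_k(query_vec, doc_vecs, k=1):
        print(docs[i])  # the passage most relevant to the query

if __name__ == "__main__":
    main()
```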
Benchmarks:
- PandaChat-RAG Benchmark: a benchmark for evaluating the retrieval capabilities of RAG pipelines
Papers:
- PandaChat-RAG: Towards the Benchmark for Slovenian RAG Applications (Kuzman et al., 2024)
Open-Source Models:
- No Language Left Behind (NLLB) massively multilingual models are frequently used for large-scale machine translation.
- Within the ParlaMint project, which deals with parliamentary texts, the OPUS-MT models, used through the EasyNMT library, proved the most useful for our purposes. For Slovenian-to-English translation, we used the opus-mt-sla-en model.
- RSDO-DS4-NMT 1.2.6: a neural machine translation model for the Slovene-English language pair, developed within the RSDO project. A demo is available. Code for the API service is available here.
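As a sketch of large-scale translation with an NLLB model through the Hugging Face translation pipeline: NLLB addresses languages via FLORES-200 codes ("slv_Latn" for Slovenian, "eng_Latn" for English), and the small code-mapping helper below is purely illustrative, not part of any library.

```python
# Hedged sketch: Slovenian-to-English translation with an NLLB checkpoint via
# the Hugging Face pipeline. The FLORES map below is a tiny demo table, not a
# complete or official mapping.

FLORES = {"sl": "slv_Latn", "en": "eng_Latn", "hr": "hrv_Latn", "sr": "srp_Cyrl"}

def flores_code(iso_code):
    """Map a two-letter ISO 639-1 code to its FLORES-200 code (demo map only)."""
    try:
        return FLORES[iso_code]
    except KeyError:
        raise ValueError(f"no FLORES-200 code configured for {iso_code!r}")

def main():
    from transformers import pipeline

    translator = pipeline(
        "translation",
        model="facebook/nllb-200-distilled-600M",  # smallest NLLB checkpoint
        src_lang=flores_code("sl"),
        tgt_lang=flores_code("en"),
    )
    result = translator("Jezikovne tehnologije za slovenščino se hitro razvijajo.")
    print(result[0]["translation_text"])

if __name__ == "__main__":
    main()
```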
Benchmarks:
- SloBench Machine Translation benchmarks: Slovenian-to-English and English-to-Slovenian
Monolingual / Smaller Multilingual Models:
- SloBERTa: monolingual Slovenian BERT-like model, available also on the CLARIN.SI repository (Ulčar and Robnik-Šikonja, 2021)
- CroSloEngual BERT: a trilingual model trained on Croatian, Slovenian, and English corpora (Ulčar and Robnik-Šikonja, 2020)
- SloBERTa-SlEng: a Slovenian-English model based on SloBERTa, further pretrained on conversational English and Slovene corpora. The model is especially suitable for tasks involving conversational, non-standard, and slang language (Yadav et al., 2024).
- sloberta-finetuned-dlib-1850-1919: a SloBERTa model, fine-tuned on Slovenian texts from the period 1850-1919. The texts were collected from the Slovenian Digital Library (https://dlib.si).
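A BERT-like model such as SloBERTa can be tried out directly with the fill-mask pipeline before any fine-tuning. The model id "EMBEDDIA/sloberta" and its "<mask>" token are assumptions here; check the model card for both. The ranking helper is illustrative.

```python
# Hedged sketch: masked-token prediction with SloBERTa via the Hugging Face
# fill-mask pipeline. Model id "EMBEDDIA/sloberta" and the "<mask>" token
# are assumptions; verify on the model card.

def best_tokens(predictions, k=3):
    """From a fill-mask result (a list of dicts with "token_str" and "score"
    keys), return the k highest-scoring token strings, stripped of padding."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [p["token_str"].strip() for p in ranked[:k]]

def main():
    from transformers import pipeline

    fill = pipeline("fill-mask", model="EMBEDDIA/sloberta")  # assumed id
    preds = fill("Ljubljana je glavno <mask> Slovenije.")
    print(best_tokens(preds))

if __name__ == "__main__":
    main()
```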
Massively Multilingual Models:
- Massively multilingual XLM-RoBERTa model: frequently used for fine-tuning on Slovenian and multilingual data for various NLP tasks (Conneau et al., 2019)
- Multilingual parliamentary model XLM-R-parla: XLM-RoBERTa model, additionally pretrained on parliamentary data, including Slovenian, to be used for NLP tasks applied on parliamentary texts (Mochtak et al., 2024)
Papers:
- FinEst BERT and CroSloEngual BERT: less is more in multilingual models (Ulčar and Robnik-Šikonja, 2020)
- Evaluation of contextual embeddings on less-resourced languages (Ulčar et al., 2021)
Models & Papers:
- Sentiment in parliamentary texts: Multilingual parliament sentiment regression model XLM-R-ParlaSent (Mochtak et al., 2024)
- Text genre prediction: X-GENRE classifier - multilingual text genre classifier (Kuzman et al., 2023)
- News topic prediction: Text classification model SloBERTa-Trendi-Topics 1.0 (Kosem et al., 2023)
- Hate speech classification in social media content: Multilingual Hate Speech Classifier for Social Media Content (Pelicon et al., 2021)
- Summarization of Slovenian texts: SloSummarizer (Žagar and Robnik-Šikonja, 2021). Summarization models are available here. Demo is available here.
- Slovenian Question-Answering models: SloQA
- Named Entity Recognition: PyTorch model for Slovenian Named Entity Recognition SloNER. Demo is available here. The source code is available on GitHub.
- Coreference Resolution for Slovenian: PyTorch model for Slovenian Coreference Resolution (Klemen and Žitnik, 2022). Demo is available here. The source code is available on GitHub.
- Relation Extraction for Slovenian language: SloREL tool
- Word-sense disambiguation: SloWSD model
- Annotation of incorrect spelling in Slovenian language: SloBERTa Incorrect Spelling Annotator
- Prediction of commonsense descriptions in natural language: Slovenian commonsense reasoning model SloMET-ATOMIC 2020 (also available on GitHub) (Mladenić Grobelnik et al., 2022)
- Fine-tuned BERT model for semantic frame extraction in olfactory events (Menini, 2024)
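Fine-tuned classifiers like those above are typically used through the text-classification pipeline. The sketch below assumes the X-GENRE classifier under the Hugging Face id "classla/xlm-roberta-base-multilingual-text-genre-classifier"; the label-counting helper is illustrative only.

```python
# Hedged sketch: genre prediction for Slovenian texts with the X-GENRE
# classifier. The model id is an assumption; verify it on Hugging Face.
from collections import Counter

def label_counts(predictions):
    """Count predicted labels across documents; predictions is a list of
    dicts with a "label" key, as returned by the pipeline."""
    return Counter(p["label"] for p in predictions)

def main():
    from transformers import pipeline

    classifier = pipeline(
        "text-classification",
        model="classla/xlm-roberta-base-multilingual-text-genre-classifier",
        truncation=True,  # long documents are cut to the model's max length
    )
    texts = [
        "Recept za potico: najprej zamesimo testo ...",
        "Vlada je danes sprejela nov zakon o medijih.",
    ]
    preds = classifier(texts)
    print(label_counts(preds))  # distribution of predicted genres

if __name__ == "__main__":
    main()
```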
Benchmarks:
- Natural language inference benchmark at SloBench
- Slovene SuperGLUE benchmark at SloBench (Žagar and Robnik-Šikonja, 2022)
- Named Entity Recognition benchmark at SloBench
- Universal Dependency Parsing benchmark at SloBench
- Semantic Change Detection Evaluation Dataset (Pranjić et al., 2024)
Papers:
- Code-mixed Sentiment and Hate-speech Prediction (Yadav et al., 2024)
- The ParlaSent Multilingual Training Dataset for Sentiment Identification in Parliamentary Proceedings (Mochtak et al., 2024)
- Automatic Genre Identification for Robust Enrichment of Massive Text Collections: Investigation of Classification Methods in the Era of Large Language Models (Kuzman et al., 2024)
- Investigating cross-lingual training for offensive language detection (Pelicon et al., 2021)
- Zero-Shot Learning for Cross-Lingual News Sentiment Classification (Pelicon et al., 2020)
Automatic Speech Recognition (ASR) Models:
- RSDO-DS2-ASR-E2E 2.0: a Slovene Conformer CTC BPE end-to-end automatic speech recognition model, developed within the RSDO project, available on the CLARIN.SI repository and GitHub (demo). Note: the maximum accepted audio duration is 300 s.
- Whisper: a massively multilingual open-source model by OpenAI.
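Whisper can be run on Slovenian audio through the automatic-speech-recognition pipeline, sketched below under the assumption that a recent transformers version accepts the language via generate_kwargs; the file path is a placeholder. The chunking helper is illustrative: it splits a long recording into windows that respect a duration cap such as the 300 s limit of the RSDO ASR model above.

```python
# Hedged sketch: transcribing Slovenian audio with Whisper. The local file
# "posnetek.wav" is a placeholder, and passing the language through
# generate_kwargs assumes a recent transformers version.

def chunk_spans(total_seconds, max_seconds=300.0):
    """Split a duration into consecutive (start, end) spans of at most
    max_seconds each, covering the whole recording."""
    spans, start = [], 0.0
    while start < total_seconds:
        end = min(start + max_seconds, total_seconds)
        spans.append((start, end))
        start = end
    return spans

def main():
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
    result = asr(
        "posnetek.wav",  # placeholder path to a local recording
        generate_kwargs={"language": "slovenian", "task": "transcribe"},
    )
    print(result["text"])

if __name__ == "__main__":
    main()
```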
Other technologies:
- Detection of Filled Pauses in Speech
- Slovenian Text Normalizer RSDO-DS2-NORM, also available on GitHub
- Slovenian Text Denormalizer RSDO-DS2-DENORM, also available on GitHub
- Slovenian Grapheme-to-Phoneme Converter
- Slovenian Punctuation and Capitalisation model RSDO-DS2-P&C, code for an API service available on GitHub
- Speech Denoising Tool
- Slovenian Speech and Transcription Alignment using Montreal Forced Aligner
- Slovenian Speech Anonymization
- Speech Audio Validation
Tools:
- Linguistic Processing Pipeline CLASSLA: the CLASSLA pipeline provides processing of standard and non-standard (Internet) Slovene at the levels of tokenization and sentence splitting, part-of-speech tagging, lemmatisation, dependency parsing, and named entity recognition (Ljubešić et al., 2024). Demo is available here. More information on tools for linguistic annotation of Slovenian texts is available here.
- Diacritic restoration tool for Croatian, Serbian and Slovene (Ljubešić et al., 2016)
- Corpus extraction tool LIST (Ključevšek et al., 2018) for extraction of lists of characters, sub-words, words and word sets from text corpora (the instruction manual is available here)
- A System for Semantic Change Detection for Slovenian (Montariol et al., 2021)
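Basic use of the CLASSLA pipeline can be sketched as below, assuming the classla Python package; the exact option for switching to non-standard Slovene (e.g. a `type` argument) should be checked against the CLASSLA documentation. The CoNLL-U lemma extractor is an illustrative helper, not part of the package.

```python
# Hedged sketch: annotating standard Slovene with the CLASSLA pipeline.
# Assumes the classla package; check its documentation for the exact
# initialisation options (e.g. for non-standard Internet Slovene).

def lemmas_from_conllu(conllu):
    """Pull the LEMMA column (index 2) out of CoNLL-U formatted text,
    skipping comment lines and multi-word-token ranges."""
    out = []
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) > 2 and cols[0].isdigit():
            out.append(cols[2])
    return out

def main():
    import classla

    classla.download("sl")  # fetch the Slovene models once
    nlp = classla.Pipeline("sl", processors="tokenize,pos,lemma,depparse,ner")
    doc = nlp("Ljubljana je glavno mesto Slovenije.")
    print(lemmas_from_conllu(doc.to_conll()))

if __name__ == "__main__":
    main()
```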
This document is supported by CLASSLA, the CLARIN knowledge centre for South Slavic languages. For any questions or suggestions related to this document, write to the CLASSLA helpdesk: helpdesk.classla@clarin.si.
To be informed of new resources, technologies, events and projects for South Slavic languages:
- you can subscribe to the mailing list
- follow CLARIN.SI on X and LinkedIn
- join the Discord group "Slovenska skupnost za jezikovne vire in tehnologije" (Slovenian community for language resources and technologies)
The main author and curator of this document is: Taja Kuzman (Department of Knowledge Technologies, Jožef Stefan Institute).
Special thanks also to other contributors:
- Peter Rupnik (Department of Knowledge Technologies, Jožef Stefan Institute)
- Matej Martinc (Department of Knowledge Technologies, Jožef Stefan Institute)
- Erik Novak (Department for Artificial Intelligence, Jožef Stefan Institute)
- Domen Vreš (Faculty of Computer and Information Science, University of Ljubljana)
- Aleš Žagar (Faculty of Computer and Information Science, University of Ljubljana)
- Simon Dobrišek (Faculty of Electrical Engineering, University of Ljubljana)