Adding Latvian language support? #682

Open
Gn3s opened this issue Dec 12, 2024 · 5 comments
Labels
languages Dictionary or language related issues

Comments

@Gn3s

Gn3s commented Dec 12, 2024

Hi! Is there a way I could help add Latvian language support to T9? I see that Lithuanian is already supported but, unfortunately, our languages are quite different, so I can't really use that. What files are required to make these changes?

@sspanak
Owner

sspanak commented Dec 12, 2024

Hi!

Is there a respected academy, university, or institute that regulates the language? In many countries, such academic bodies issue the "big dictionary of X". If you have such a dictionary of Latvian in a downloadable format, it would be great to share it here. These dictionaries are spell-checked and contain many different word forms, which results in very good word predictions.

I have already developed a strategy for Latin- and Cyrillic-based languages, so my only problem is finding a good dictionary. Since I don't speak that many foreign languages, I can't search foreign websites effectively, so I really need a hand with this. I'll take care of the rest of the technical work, don't worry.

@sspanak sspanak added the languages Dictionary or language related issues label Dec 12, 2024
@Gn3s
Author

Gn3s commented Dec 13, 2024

@sspanak I'm using a source from the language department of the local university. It contains a literary language dictionary, a modern language dictionary, and a general dictionary. I also found this one for spell checking.

I have these files, but I'm not sure how or where to get the utf8.csv dictionary file from (assuming most people don't write thousands of table cells by hand).
(two images attached)

@sspanak
Owner

sspanak commented Dec 13, 2024

The dictionaries link to here. I guess I can download and extract all words from that website. I'll check it out when I have more free time.

As for wooorm's dictionaries, I was initially optimistic about them too, but over time I've noticed that they contain a lot of misspelled words and words from other languages, even though they are meant for spell checking. I'd rather not use them, or use only small subsets of their data.

Anyway, thanks for sharing tezaurs.lv. I am currently busy adding East Asian language support, but when I am done, I'll probably get back to the good old European languages, including Latvian. Meanwhile, if you come across another good source of words, feel free to share it here.
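To illustrate the "words from different languages" problem, here is a minimal sketch of an alphabet check. It is not the project's actual tooling; the function names are hypothetical. It only flags words containing characters outside the 33-letter Latvian alphabet, so it cannot detect misspellings, and it would also reject legitimate hyphenated words or unadapted loanwords.

```python
import unicodedata

# The 33 letters of the standard Latvian alphabet (lowercase).
LATVIAN_LETTERS = set("aābcčdeēfgģhiījkķlļmnņoprsštuūvzž")


def looks_latvian(word: str) -> bool:
    """Return True if every character of `word` is a Latvian letter.

    Hypothetical helper: the word is NFC-normalized first, so that letters
    written as base + combining mark compare equal to precomposed ones.
    """
    normalized = unicodedata.normalize("NFC", word).lower()
    return bool(normalized) and all(ch in LATVIAN_LETTERS for ch in normalized)


def split_by_alphabet(words):
    """Split `words` into (kept, rejected) lists based on looks_latvian()."""
    kept, rejected = [], []
    for w in words:
        (kept if looks_latvian(w) else rejected).append(w)
    return kept, rejected


if __name__ == "__main__":
    kept, rejected = split_by_alphabet(["māja", "quiz", "Rīga", "wifi"])
    print("kept:", kept)
    print("rejected:", rejected)
```

A check like this is only a first pass: a word can pass it and still be misspelled or foreign, so a curated source remains preferable.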

@Gn3s
Author

Gn3s commented Dec 16, 2024

@sspanak I think tezaurs.lv is the main legitimate one available in our country. I have found some others, but they either require payment or clearly state that the language data was gathered from media (subtitles). The link you shared has an option to email them and request a PostgreSQL database dump instead of the available TEI/XML and LMF/XML formats. If that makes things easier, I could message them to get it for you.

@sspanak
Owner

sspanak commented Dec 19, 2024

XML format should be fine. I'll let you know if I need anything else.
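For reference, extracting a word list from a TEI/XML dictionary could look roughly like the sketch below. It rests on assumptions not confirmed in this thread: that headwords sit in `<entry>/<form>/<orth>` elements under the standard TEI namespace, and that a plain one-word-per-line list is a useful intermediate output. The actual tezaurs.lv export and the project's dictionary format may differ, and the sample data is made up.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace; whether the real export uses it is an assumption.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def extract_headwords(tei_xml: str) -> list[str]:
    """Collect unique single-word headwords from a TEI dictionary string.

    Assumes the entry/form/orth layout of the TEI dictionaries module.
    Multi-word expressions are skipped, since they are not single words.
    """
    root = ET.fromstring(tei_xml)
    words = set()
    for orth in root.iterfind(".//tei:entry/tei:form/tei:orth", TEI_NS):
        text = (orth.text or "").strip()
        if text and " " not in text:
            words.add(text)
    return sorted(words)


# Tiny made-up sample, only to show the assumed structure.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <entry><form><orth>māja</orth></form></entry>
    <entry><form><orth>ābols</orth></form></entry>
    <entry><form><orth>labs cilvēks</orth></form></entry>
    <entry><form><orth>māja</orth></form></entry>
  </body></text>
</TEI>"""

if __name__ == "__main__":
    for word in extract_headwords(SAMPLE):
        print(word)
```

For a full-size dump, `ET.iterparse` would be a better fit than loading the whole document at once, but the element lookup would stay the same.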
