clean-up of the import directory
merisiga committed May 6, 2024
1 parent 34740ba commit bc68294
Showing 7 changed files with 80 additions and 722 deletions.
54 changes: 51 additions & 3 deletions src/import/README
@@ -1,8 +1,56 @@
Filosoft lexicon into GT infra
Re-convert Filosoft's Vabamorf lexicon into GT infra

source: https://github.com/Filosoft/vabamorf
Use with care!
(if you want to incorporate some changes made in the Vabamorf lexicon, or make big changes in how the files are built)

Source: https://github.com/Filosoft/vabamorf

Lexicon files originate from:

lexicon file:
https://github.com/Filosoft/vabamorf/tree/master/dct/data/mrf/fs_lex
https://github.com/Filosoft/vabamorf/tree/master/dct/data/mrf/fs_suf

Run the following scripts in this directory to overwrite the .lexc files in ../fst/morphology/stems
(plus one or two .xfscript files in ../fst/filters)
NB! This will NOT result in replacing all .lexc files; some .lexc files come from other sources.

./fs_lex2gt.sh; ./fsgt2final.sh; ./fs_lex2pluraletantum.sh
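Note that chaining with `;` runs every script even if an earlier one fails. A more cautious variant could look like the sketch below. It assumes the three scripts exit non-zero on failure, which this commit does not confirm; `run_steps` is a hypothetical helper name, not part of the repository.

```shell
# run_steps (hypothetical helper): run each given command in order
# and stop at the first one that fails.
run_steps() {
    for step in "$@"; do
        echo "running $step" >&2
        "$step" || { echo "failed: $step" >&2; return 1; }
    done
}

# Intended use from src/import (not executed here):
# run_steps ./fs_lex2gt.sh ./fsgt2final.sh ./fs_lex2pluraletantum.sh
```

Stopping early avoids overwriting the .lexc files with output built from a half-finished earlier step.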

--------------------
fs_lex2gt.sh

Converts entries of Vabamorf lexicon into LEXC-type entries
1. Convert original inflectional type classification.
2. Convert original stems into surface and lexical side representation.

--------------------
fsgt2final.sh

Creates .lexc files
1. Create a name for a lexicon, and create a name for a continuation class sublexicon.
2. Copy every word to a proper (sub)lexicon. This involves classifying the word, based on some tag in the original Vabamorf lexicon, or some phonological pattern of the word itself.
3. Add flag diacritics to some words.
Both sublexicons and flag diacritics are designed to restrict word compounding. The general compounding rule relies on word class (noun, adverb, etc.), but finer restrictions require dividing words into sublexicons or adding flag diacritics to individual words.

4. Add weights to lexicons. The weight of a lemma is based on its frequency rank in a frequency dictionary. The most frequent words have rank 0.

5. Create a filter for eliminating words from a normative speller lexicon.
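
The rank-based weighting in step 4 can be illustrated with a small sketch. It assumes a hypothetical frequency list with one lemma per line, most frequent first; the actual format of 15miljon.astak and the logic of insert_weights.py are not shown in this commit, so `rank_weights` is only an illustration of "most frequent word gets rank 0".

```shell
# rank_weights (hypothetical helper): read lemmas ordered from most to
# least frequent and print "lemma<TAB>rank", with the first lemma at rank 0.
rank_weights() {
    awk '{ print $1 "\t" (NR - 1) }' "$@"
}

# Example: printf 'ja\non\nolema\n' | rank_weights
```

The resulting rank could then be written into a lexicon entry as its weight; how ranks are scaled or capped in the real script is not stated here.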

--------------------
fs_suf2gt.sh

(called by fsgt2final.sh)
Creates final_components.lexc. This file contains continuation classes, each containing a few words that participate in compound forming more freely than other words.

--------------------
classify_names.sh

(called by fsgt2final.sh)
Groups proper names into geo, persons and other, thus passing the info from Vabamorf lexicon to propernouns.lexc.

--------------------
fs_lex2pluraletantum.sh
Create a filter for removing singular wordforms of plurale tantum words. This filter is applied while constructing a simplex word Estonian transducer.
If you want to use this filter in the transducer building process, you have to copy it yourself to ../fst/filters


129 changes: 0 additions & 129 deletions src/import/fs_pref2gt.sh

This file was deleted.

2 changes: 1 addition & 1 deletion src/import/fs_suf2gt.sh
@@ -14,7 +14,7 @@ cat fs_suf \
| LC_COLLATE=C sort | sed 's/ :/_/' \
> suf.tmp1

grep ':' ../fst/stems/*.lexc \
grep ':' ../fst/morphology/stems/*.lexc \
| grep -v '<.*>' \
| grep -v '\/pref' \
| grep -v '\/final' \
54 changes: 27 additions & 27 deletions src/import/fsgt2final.sh
@@ -714,47 +714,47 @@ cat fs_gt.inflecting.tmp1 | grep '+V:' | grep '...eer[iu]ma+' \
| sed '/^seerima+V/s/^\([^:]*\):\([^;]*;\)\(.*\)/@P.Stem.Single@\1:@P.Stem.Single@\2\3/' \
>> verbs.protolexc

# create final_components.lexc
./fs_suf2gt.sh

# NB! this relies on the dir structure being the same as in Giellatekno
#cp *.lexc ../fst/stems
#cp *.lexc ../fst/morphology/stems

# ../fst/stems/abbreviations.lexc
# ../fst/stems/acronyms.lexc
# ../fst/morphology/stems/abbreviations.lexc
# ../fst/morphology/stems/acronyms.lexc
cat adjectives.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - \
> ../fst/stems/adjectives.lexc
> ../fst/morphology/stems/adjectives.lexc

cat adpositions.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/stems/adpositions.lexc
cat adverbs.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/stems/adverbs.lexc
cat cardinalnumerals.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/stems/cardinalnumerals.lexc
cat comparative_adjectives.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/stems/comparative_adjectives.lexc
cat conjunctions.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/stems/conjunctions.lexc
# final_components.lexc was made by ./fs_suf2gt.sh; should contain no weights
cat final_components.lexc | sed 's/"weight:[^"]*"//' | ./special_chars.sh > ../fst/stems/final_components.lexc
cat adpositions.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/morphology/stems/adpositions.lexc
cat adverbs.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/morphology/stems/adverbs.lexc
cat cardinalnumerals.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/morphology/stems/cardinalnumerals.lexc
cat comparative_adjectives.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/morphology/stems/comparative_adjectives.lexc
cat conjunctions.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/morphology/stems/conjunctions.lexc

cat genitive_attributes.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - \
| sed -f badfinal_N.sed \
> ../fst/stems/genitive_attributes.lexc
> ../fst/morphology/stems/genitive_attributes.lexc

cat interjections.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/stems/interjections.lexc
cat noninflecting_adjectives.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/stems/noninflecting_adjectives.lexc
cat noninflecting_verbs.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/stems/noninflecting_verbs.lexc
cat interjections.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/morphology/stems/interjections.lexc
cat noninflecting_adjectives.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/morphology/stems/noninflecting_adjectives.lexc
cat noninflecting_verbs.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/morphology/stems/noninflecting_verbs.lexc

cat nouns.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - \
> ../fst/stems/nouns.lexc
> ../fst/morphology/stems/nouns.lexc

# ../fst/stems/numbers.lexc
cat ordinalnumerals.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/stems/ordinalnumerals.lexc
# ../fst/morphology/stems/numbers.lexc
cat ordinalnumerals.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/morphology/stems/ordinalnumerals.lexc
# prefixes.lexc was made by ./fs_pref2gt.sh; no weights
cat prefixes.lexc | ./special_chars.sh > ../fst/stems/prefixes.lexc
cat pronouns.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/stems/pronouns.lexc
cat propernouns.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/stems/propernouns.lexc
cat superlative_adjectives.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/stems/superlative_adjectives.lexc
# cat prefixes.lexc | ./special_chars.sh > ../fst/morphology/stems/prefixes.lexc
cat pronouns.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/morphology/stems/pronouns.lexc
cat propernouns.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/morphology/stems/propernouns.lexc
cat superlative_adjectives.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/morphology/stems/superlative_adjectives.lexc

cat verbs.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - \
| sed '/Pref+/s/"weight:[^"]*"//' \
> ../fst/stems/verbs.lexc
> ../fst/morphology/stems/verbs.lexc

# create final_components.lexc
./fs_suf2gt.sh
cat final_components.lexc | sed 's/"weight:[^"]*"//' | ./special_chars.sh > ../fst/morphology/stems/final_components.lexc


# words that should be filtered out of the speller lexicon
# make them into a filter
@@ -773,7 +773,7 @@ cat fs_gt.nosp \
echo '] ;' >> nosp
echo 'regex ~[words ?*] ;' >> nosp

cp nosp ../filters/remove-nospell-words.est.xfscript
cp nosp ../fst/filters/remove-nospell-words.est.xfscript



102 changes: 0 additions & 102 deletions src/import/leia_osad.sh

This file was deleted.
