clean-up of the import directory

giellalt · May 6, 2024 · bc68294
1 parent 34740ba
commit bc68294

Showing 7 changed files with 80 additions and 722 deletions.
diff --git a/src/import/README b/src/import/README
@@ -1,8 +1,56 @@
-Filosoft lexicon into GT infra
+Re-convert Filosoft's Vabamorf lexicon into GT infra
 
-source: https://github.com/Filosoft/vabamorf
+Use with care!
+(if you want to incorporate some changes made in the Vabamorf lexicon, or make big changes in how the files are built) 
+
+Source: https://github.com/Filosoft/vabamorf
+
+Lexicon files originate from:
 
-lexicon file:
 https://github.com/Filosoft/vabamorf/tree/master/dct/data/mrf/fs_lex
+https://github.com/Filosoft/vabamorf/tree/master/dct/data/mrf/fs_suf
+
+Run the following scripts in this directory to overwrite the .lexc files in ../fst/morphology/stems
+(plus one or two .xfscript files in ../fst/filters) 
+NB! This will NOT result in replacing all .lexc files; some .lexc files come from other sources.
+
+./fs_lex2gt.sh; ./fsgt2final.sh; ./fs_lex2pluraletantum.sh
+
+--------------------
+fs_lex2gt.sh
+
+Converts entries of Vabamorf lexicon into LEXC-type entries
+1. Convert original inflectional type classification.  
+2. Convert original stems into surface and lexical side representation.
+
+--------------------
+fsgt2final.sh
+
+Creates .lexc files
+1. Create a name for a lexicon, and create a name for a continuation class sublexicon.  
+2. Copy every word to a proper (sub)lexicon. This involves classifying the word, based on some tag in the original Vabamorf lexicon, or some phonological pattern of the word itself. 
+3. Add flag diacritics to some words.
+Both sublexicons and flag diacritics are designed to restrict word compounding. A general rule of compounding relies on word class (noun, adverb etc.), but in order to have finer restrictions, one must divide words into sub-lexicons, or add flag diacritics to individual words.
+
+4. Add weights to lexicons. The weight of a lemma is based on its frequency rank in a frequency dictionary. The most frequent words have rank 0. 
+
+5. Create a filter for eliminating words from a normative speller lexicon. 
+
+--------------------
+fs_suf2gt.sh
+
+(called by fsgt2final.sh)
+Creates final_components.lexc. This file contains continuation classes, each containing a few words that participate in compound forming more freely than other words.
+
+--------------------
+classify_names.sh
+
+(called by fsgt2final.sh)
+Groups proper names into geo, persons and other, thus passing the info from Vabamorf lexicon to propernouns.lexc. 
+
+--------------------
+fs_lex2pluraletantum.sh
+Creates a filter for removing singular word forms of plurale tantum words. This filter is applied while constructing a simplex word Estonian transducer.
+If you want to use this filter in the transducer building process, you have to copy it yourself to ../fst/filters
 
 
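Note on the run order given in the README: chaining the three scripts with `;` keeps going even if an earlier script fails, so a later step may overwrite .lexc files from stale intermediate data. A minimal sketch of a stricter wrapper follows, assuming each script returns a non-zero exit status on failure (the commit does not confirm this). `run_steps` is a hypothetical helper, not part of the repository.

```shell
#!/bin/sh
# Hypothetical helper: run each command in order and stop at the first failure.
run_steps() {
    for step in "$@"; do
        printf 'running %s\n' "$step"
        if ! "$step"; then
            printf 'failed: %s\n' "$step" >&2
            return 1
        fi
    done
}

# Intended use, from src/import (left commented out so that reading or
# sourcing this sketch never overwrites any .lexc file):
# run_steps ./fs_lex2gt.sh ./fsgt2final.sh ./fs_lex2pluraletantum.sh
```

The plurale tantum filter would still have to be copied to ../fst/filters by hand afterwards, as the README states.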
diff --git a/src/import/fs_pref2gt.sh b/src/import/fs_pref2gt.sh
diff --git a/src/import/fs_suf2gt.sh b/src/import/fs_suf2gt.sh
@@ -14,7 +14,7 @@ cat fs_suf \
 | LC_COLLATE=C sort | sed 's/ :/_/' \
 > suf.tmp1
 
-grep ':' ../fst/stems/*.lexc \
+grep ':' ../fst/morphology/stems/*.lexc \
 | grep -v '<.*>' \
 | grep -v '\/pref' \
 | grep -v '\/final' \

diff --git a/src/import/fsgt2final.sh b/src/import/fsgt2final.sh
@@ -714,47 +714,47 @@ cat fs_gt.inflecting.tmp1 | grep '+V:' | grep '...eer[iu]ma+' \
 | sed '/^seerima+V/s/^\([^:]*\):\([^;]*;\)\(.*\)/@P.Stem.Single@\1:@P.Stem.Single@\2\3/' \
  >> verbs.protolexc
 
-# create final_components.lexc
-./fs_suf2gt.sh
-
 # NB! this relies on the dir structure being the same as in Giellatekno
-#cp *.lexc ../fst/stems
+#cp *.lexc ../fst/morphology/stems
 
-# ../fst/stems/abbreviations.lexc
-# ../fst/stems/acronyms.lexc
+# ../fst/morphology/stems/abbreviations.lexc
+# ../fst/morphology/stems/acronyms.lexc
 cat adjectives.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - \
-> ../fst/stems/adjectives.lexc
+> ../fst/morphology/stems/adjectives.lexc
 
-cat adpositions.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - -  > ../fst/stems/adpositions.lexc
-cat adverbs.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/stems/adverbs.lexc
-cat cardinalnumerals.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/stems/cardinalnumerals.lexc
-cat comparative_adjectives.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/stems/comparative_adjectives.lexc
-cat conjunctions.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/stems/conjunctions.lexc
-# final_components.lexc was made by ./fs_suf2gt.sh; should contain no weights
-cat final_components.lexc | sed 's/"weight:[^"]*"//' | ./special_chars.sh  > ../fst/stems/final_components.lexc
+cat adpositions.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - -  > ../fst/morphology/stems/adpositions.lexc
+cat adverbs.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/morphology/stems/adverbs.lexc
+cat cardinalnumerals.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/morphology/stems/cardinalnumerals.lexc
+cat comparative_adjectives.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/morphology/stems/comparative_adjectives.lexc
+cat conjunctions.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/morphology/stems/conjunctions.lexc
 
 cat genitive_attributes.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - \
 | sed -f badfinal_N.sed \
-> ../fst/stems/genitive_attributes.lexc
+> ../fst/morphology/stems/genitive_attributes.lexc
 
-cat interjections.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/stems/interjections.lexc
-cat noninflecting_adjectives.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/stems/noninflecting_adjectives.lexc
-cat noninflecting_verbs.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/stems/noninflecting_verbs.lexc
+cat interjections.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/morphology/stems/interjections.lexc
+cat noninflecting_adjectives.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/morphology/stems/noninflecting_adjectives.lexc
+cat noninflecting_verbs.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/morphology/stems/noninflecting_verbs.lexc
 
 cat nouns.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - \
-> ../fst/stems/nouns.lexc
+> ../fst/morphology/stems/nouns.lexc
 
-# ../fst/stems/numbers.lexc
-cat ordinalnumerals.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/stems/ordinalnumerals.lexc
+# ../fst/morphology/stems/numbers.lexc
+cat ordinalnumerals.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/morphology/stems/ordinalnumerals.lexc
 # prefixes.lexc was made by ./fs_pref2gt.sh; no weights
-cat prefixes.lexc | ./special_chars.sh > ../fst/stems/prefixes.lexc
-cat pronouns.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/stems/pronouns.lexc
-cat propernouns.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/stems/propernouns.lexc
-cat superlative_adjectives.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/stems/superlative_adjectives.lexc
+# cat prefixes.lexc | ./special_chars.sh > ../fst/morphology/stems/prefixes.lexc
+cat pronouns.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/morphology/stems/pronouns.lexc
+cat propernouns.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/morphology/stems/propernouns.lexc
+cat superlative_adjectives.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > ../fst/morphology/stems/superlative_adjectives.lexc
 
 cat verbs.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - \
 | sed '/Pref+/s/"weight:[^"]*"//' \
-> ../fst/stems/verbs.lexc
+> ../fst/morphology/stems/verbs.lexc
+
+# create final_components.lexc
+./fs_suf2gt.sh
+cat final_components.lexc | sed 's/"weight:[^"]*"//' | ./special_chars.sh  > ../fst/morphology/stems/final_components.lexc
+
 
 # words that should be filtered out of the speller lexicon
 # make them into a filter
@@ -773,7 +773,7 @@ cat fs_gt.nosp \
 echo '] ;' >> nosp
 echo 'regex ~[words ?*] ;' >> nosp
 
-cp nosp ../filters/remove-nospell-words.est.xfscript
+cp nosp ../fst/filters/remove-nospell-words.est.xfscript
 
 
 
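Note on the fsgt2final.sh hunk above: the same three-stage pipeline is written out once per word class, which is why a directory rename touches so many lines. A minimal sketch of generating those pipelines from a list is given here as one possible refactoring, not as what the commit does. `build_lexc`, `STEMS_DIR` and `DRY_RUN` are hypothetical names; the script names, the `15miljon.astak` argument and the target directory are taken from the diff. With `DRY_RUN=1` (the default in this sketch) the commands are only printed, never executed.

```shell
#!/bin/sh
# Sketch only: one place to change the target directory.
STEMS_DIR="../fst/morphology/stems"
# Hypothetical switch: 1 = print the command, anything else = execute it.
DRY_RUN="${DRY_RUN:-1}"

# Build (or print) the pipeline for one word class.
build_lexc() {
    name="$1"
    cmd="cat $name.protolexc | ./special_chars.sh | ./insert_weights.py 15miljon.astak - - > $STEMS_DIR/$name.lexc"
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$cmd"
    else
        sh -c "$cmd"
    fi
}

# Word classes that use the plain pipeline in the diff.
# genitive_attributes, verbs and final_components need extra sed steps
# and are deliberately left out of this loop.
for name in adjectives adpositions adverbs cardinalnumerals \
            comparative_adjectives conjunctions interjections \
            noninflecting_adjectives noninflecting_verbs nouns \
            ordinalnumerals pronouns propernouns superlative_adjectives
do
    build_lexc "$name"
done
```

Running the sketch as-is prints one command line per word class, which can be compared against the explicit lines in fsgt2final.sh before switching `DRY_RUN` off.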

diff --git a/src/import/leia_osad.sh b/src/import/leia_osad.sh