GitHub - Coder1400/LanguageIdentification: Identify between English, French, and Italian with 99% accuracy. Uses language modeling techniques including LaPlace and Good-Turing smoothing.


Repository files:

.DS_Store
LangId.sol
LangId.test
LangId.train.English
LangId.train.French
LangId.train.Italian
README.txt
dicts.py
letterLangId.out
letterLangId.py
letterModel.py
wordLangId.out
wordLangId.py
wordLangId2.out
wordLangId2.py


Language Modeling: letter & word bi-grams for language identification.



========================== SETUP INSTRUCTIONS ===============================

1.) Clone this repository from https://github.com/Arken94/LanguageIdentification.git

2.) Within the newly cloned repository on your local machine, there should be three Python scripts to run: “letterLangId.py”, “wordLangId.py”, and “wordLangId2.py”.

3.) letterLangId.py is the letter bigram implementation. wordLangId.py is the word bigram implementation. wordLangId2.py is the word bigram implementation with an advanced smoothing technique (extra credit). 

4.) To run any of these scripts, invoke it with the python command, for example:

	python wordLangId.py

The script opens the training and test data files itself (the paths are hardcoded, so no arguments are needed) and runs the language model for that implementation.

5.) NOTE: when you run any of the scripts, MAKE SURE that the training and test data files are in the directory you run the script from. They are included in the GitHub repository, so they should already be there.

6.) The output of each program is written to an output file with the same name as the script, but with a “.out” extension. For example, wordLangId.py writes its output to wordLangId.out. The repository already contains these files with the output of each program.
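
For readers who want a feel for the letter bigram approach named in step 3, here is a minimal sketch of language identification with add-one (Laplace) smoothing. It is not the code in letterLangId.py: the function names (train_letter_bigrams, log_prob, identify) and the toy training sentences are assumptions made up for this illustration, and it targets Python 3.

```python
import math
from collections import Counter


def train_letter_bigrams(text):
    """Count letter bigrams and the contexts (first letters) they start from."""
    bigrams = Counter(zip(text, text[1:]))
    contexts = Counter(text[:-1])
    vocab = set(text)
    return bigrams, contexts, vocab


def log_prob(sentence, model, vocab_size):
    """Log-probability of a sentence under an add-one smoothed bigram model."""
    bigrams, contexts, _ = model
    total = 0.0
    for a, b in zip(sentence, sentence[1:]):
        # Add-one smoothing: (count(a, b) + 1) / (count(a) + V)
        total += math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
    return total


def identify(sentence, models):
    """Return the language whose model gives the sentence the highest score."""
    # Share one vocabulary size across languages so the scores are comparable.
    vocab_size = len(set().union(*(m[2] for m in models.values())))
    return max(models, key=lambda lang: log_prob(sentence, models[lang], vocab_size))


# Toy training text, invented for this sketch (not the repository's data).
models = {
    "English": train_letter_bigrams("the quick brown fox jumps over the lazy dog and the cat"),
    "French": train_letter_bigrams("le renard brun saute par dessus le chien paresseux et le chat"),
    "Italian": train_letter_bigrams("la volpe marrone salta sopra il cane pigro e il gatto"),
}

print(identify("the dog and the fox", models))  # prints "English"
```

Scores are summed in log space so that long sentences do not underflow, and the add-one term keeps an unseen bigram from driving a language's probability to zero. A word bigram variant would follow the same shape with lists of words in place of strings of letters.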