An NLP (natural language processing) app that predicts the next word you type. It was trained on millions of records from US news articles, blogs, and Twitter, using an n-gram model with back-off smoothing.
If you haven't tried the app yet, visit https://biomystery.shinyapps.io/predNextWord/ and give it a try!
- Predicts the next word as you type
- Shows the top 5 candidate words, each with its predicted probability
- Responds quickly once the model data is loaded
It was built in the following steps:

- Tokenize: a `get_tokens` function takes the input files and returns tokens
- Get n-gram frequencies: the `data.table` library processes the data quickly
- From the n-gram frequency data:
  - Apply Good-Turing discounting to 1-, 2-, and 3-grams with frequency < 10
  - Use Katz back-off to calculate P_kz(w3|w1,w2) and P_kz(w2|w1)
  - Store the model in the ARPA format
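The ARPA format mentioned in the last step is a plain-text n-gram model format: a `\data\` header listing n-gram counts, then one section per order with lines of the form `log10-probability  n-gram  [back-off weight]`. A tiny illustrative fragment (the numbers below are made up, not from the app's model):

```
\data\
ngram 1=3
ngram 2=2

\1-grams:
-0.4771	the	-0.30
-0.6532	cat	-0.30
-0.9542	</s>

\2-grams:
-0.1761	the cat
-0.3010	cat </s>

\end\
```

Storing the model this way keeps it readable and lets other n-gram toolkits load it directly.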
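The first two steps (tokenizing and building n-gram frequency tables) can be sketched as follows. The app itself uses an R `get_tokens` function and `data.table`; this is a minimal Python sketch of the same idea, and the tokenization rule here is an assumption, not the app's actual one.

```python
from collections import Counter
import re

def get_tokens(text):
    # Lowercase and keep runs of letters/apostrophes; the app's
    # actual get_tokens function (in R) may tokenize differently.
    return re.findall(r"[a-z']+", text.lower())

def ngram_counts(tokens, n):
    # Frequency table mapping each n-gram (as a tuple) to its count.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = get_tokens("the cat sat on the mat the cat ran")
bigrams = ngram_counts(tokens, 2)
print(bigrams[("the", "cat")])  # 2
```

In the real pipeline these tables are built once over the full corpus and reused, which is why `data.table`'s fast grouping matters there.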
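Good-Turing discounting replaces each raw count c with an adjusted count c* = (c+1) · N_{c+1} / N_c, where N_c is the number of distinct n-grams seen exactly c times; as in the steps above, only low counts (here c < 10) are discounted. A minimal sketch:

```python
from collections import Counter

def good_turing(counts, k=10):
    # Nc[c] = number of distinct n-grams observed exactly c times.
    Nc = Counter(counts.values())
    out = {}
    for gram, c in counts.items():
        if c < k and Nc[c + 1] > 0:
            # Good-Turing adjusted count: c* = (c+1) * N_{c+1} / N_c
            out[gram] = (c + 1) * Nc[c + 1] / Nc[c]
        else:
            out[gram] = float(c)  # high counts are reliable; keep as-is
    return out

counts = {"a": 1, "b": 1, "c": 1, "d": 2}
print(good_turing(counts)["a"])  # 2 * 1 / 3 ≈ 0.667
```

Discounting low counts frees probability mass, which the back-off step then redistributes to n-grams never seen in training.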
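Katz back-off uses the discounted higher-order estimate when the n-gram was observed, and otherwise falls back to the lower-order model scaled by the leftover probability mass α. The sketch below shows the bigram case P_kz(w2|w1) with a fixed absolute discount `d` for simplicity; the app derives its discounts from the Good-Turing counts instead, so treat this as an illustration of the back-off logic only.

```python
def p_katz(w2, w1, bigrams, unigrams, d=0.5):
    # P_kz(w2 | w1): discounted estimate if (w1, w2) was seen,
    # otherwise back off to the unigram distribution scaled by
    # the reserved mass alpha(w1).
    c1 = unigrams[(w1,)]
    c12 = bigrams.get((w1, w2), 0)
    if c12 > 0:
        return (c12 - d) / c1                  # discounted ML estimate
    # mass reserved by discounting the seen bigrams starting with w1
    seen = {b for (a, b) in bigrams if a == w1}
    reserved = d * len(seen) / c1
    # redistribute it over unseen continuations, weighted by unigram counts
    unseen_total = sum(c for (w,), c in unigrams.items() if w not in seen)
    return reserved * unigrams[(w2,)] / unseen_total

unigrams = {("the",): 3, ("cat",): 2, ("sat",): 1, ("dog",): 1}
bigrams = {("the", "cat"): 2, ("the", "sat"): 1}
print(p_katz("cat", "the", bigrams, unigrams))  # 0.5
```

The trigram case P_kz(w3|w1,w2) is the same pattern one level up: back off from trigrams to this bigram estimate.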