An NLP (natural language processing) app that predicts the next word you type. It was trained on millions of records from US news articles, blogs, and Twitter, using an n-gram model with back-off smoothing.
If you haven't tried the app yet, visit https://biomystery.shinyapps.io/predNextWord/ and give it a try!
- Predicts the next word as you type
- Shows the top 5 candidate words, each with its predicted probability
- Responds quickly once the model data is loaded
It was built in the following steps:

- Tokenize: a `get_tokens` function takes the input files and returns tokens
- Get n-gram frequencies: the `data.table` library processes the data quickly
- From the n-gram frequency data:
  - Apply Good-Turing discounting to 1-, 2-, and 3-grams with frequency < 10
  - Use Katz back-off to calculate P_kz(w3|w1,w2) and P_kz(w2|w1)
  - Store the model in the ARPA format
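The ARPA format mentioned in the last step is a plain-text n-gram model format: a `\data\` header listing n-gram counts, then one section per order with lines of the form `log10-probability  n-gram  [back-off weight]`. A tiny illustrative fragment (the numbers below are made up, not from the app's model):

```
\data\
ngram 1=3
ngram 2=2

\1-grams:
-0.4771	the	-0.30
-0.6532	cat	-0.30
-0.9542	</s>

\2-grams:
-0.1761	the cat
-0.3010	cat </s>

\end\
```

Storing the model this way keeps it readable and lets other n-gram toolkits load it directly.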
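The first two steps (tokenizing and building n-gram frequency tables) can be sketched as follows. The app itself uses an R `get_tokens` function and `data.table`; this is a minimal Python sketch of the same idea, and the tokenization rule here is an assumption, not the app's actual one.

```python
from collections import Counter
import re

def get_tokens(text):
    # Lowercase and keep runs of letters/apostrophes; the app's
    # actual get_tokens function (in R) may tokenize differently.
    return re.findall(r"[a-z']+", text.lower())

def ngram_counts(tokens, n):
    # Frequency table mapping each n-gram (as a tuple) to its count.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = get_tokens("the cat sat on the mat the cat ran")
bigrams = ngram_counts(tokens, 2)
print(bigrams[("the", "cat")])  # 2
```

In the real pipeline these tables are built once over the full corpus and reused, which is why `data.table`'s fast grouping matters there.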
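Good-Turing discounting replaces each raw count c with an adjusted count c* = (c+1) · N_{c+1} / N_c, where N_c is the number of distinct n-grams seen exactly c times; as in the steps above, only low counts (here c < 10) are discounted. A minimal sketch:

```python
from collections import Counter

def good_turing(counts, k=10):
    # Nc[c] = number of distinct n-grams observed exactly c times.
    Nc = Counter(counts.values())
    out = {}
    for gram, c in counts.items():
        if c < k and Nc[c + 1] > 0:
            # Good-Turing adjusted count: c* = (c+1) * N_{c+1} / N_c
            out[gram] = (c + 1) * Nc[c + 1] / Nc[c]
        else:
            out[gram] = float(c)  # high counts are reliable; keep as-is
    return out

counts = {"a": 1, "b": 1, "c": 1, "d": 2}
print(good_turing(counts)["a"])  # 2 * 1 / 3 ≈ 0.667
```

Discounting low counts frees probability mass, which the back-off step then redistributes to n-grams never seen in training.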
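Katz back-off uses the discounted higher-order estimate when the n-gram was observed, and otherwise falls back to the lower-order model scaled by the leftover probability mass α. The sketch below shows the bigram case P_kz(w2|w1) with a fixed absolute discount `d` for simplicity; the app derives its discounts from the Good-Turing counts instead, so treat this as an illustration of the back-off logic only.

```python
def p_katz(w2, w1, bigrams, unigrams, d=0.5):
    # P_kz(w2 | w1): discounted estimate if (w1, w2) was seen,
    # otherwise back off to the unigram distribution scaled by
    # the reserved mass alpha(w1).
    c1 = unigrams[(w1,)]
    c12 = bigrams.get((w1, w2), 0)
    if c12 > 0:
        return (c12 - d) / c1                  # discounted ML estimate
    # mass reserved by discounting the seen bigrams starting with w1
    seen = {b for (a, b) in bigrams if a == w1}
    reserved = d * len(seen) / c1
    # redistribute it over unseen continuations, weighted by unigram counts
    unseen_total = sum(c for (w,), c in unigrams.items() if w not in seen)
    return reserved * unigrams[(w2,)] / unseen_total

unigrams = {("the",): 3, ("cat",): 2, ("sat",): 1, ("dog",): 1}
bigrams = {("the", "cat"): 2, ("the", "sat"): 1}
print(p_katz("cat", "the", bigrams, unigrams))  # 0.5
```

The trigram case P_kz(w3|w1,w2) is the same pattern one level up: back off from trigrams to this bigram estimate.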