- Directory
- DataSet
- Optimizing Vector Similarity using the Annoy Library
- Visualizing Vector Data with t-SNE
- Word Filtering using BERT for Korean Text Classification
- Category Morphological Analysis and Translation using konlpy and googletrans
- Building a Word2Vec Korean Word Embedding Database
- Category/Keyword Recommendation using Semantic Word Similarity
- License
- `USER_CTGY`: User-based Collaborative Filtering for Category Recommendation
- `USER_MODL`: Item-based Collaborative Filtering for Module-specific Feature Recommendation
- `SMLR_RECO`: Semantic Keyword and Category Recommendation Algorithm using Morphological Analysis and Tagging Libraries
- Analyzing explicit user data and implicit feedback through category/module behavior records to create category/module vector data
- Dynamic Vector Weights: points spent, "load more" requests, searches, dwell time over 30 seconds, save/update activation, likes/comments (see the sketch after this list)
- Static Vector Weights: Module, category, keyword, annual keyword
- Korean Basic Dictionary, Standard Korean Dictionary, Woori-mal-saem based Korean Spelling Dictionary provided by spellcheck-ko
- Korean word vectors from a Korean Word2Vec model that captures semantic relationships between words, represented with Facebook's 300-dimensional FastText vectors
- Naver category corpus built by segmenting Naver categories through morphological analysis
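The dynamic weights above suggest a natural way to fold implicit feedback into a category/module vector. The following is a minimal sketch of that idea; the action names, weight values, and the `build_user_vector` helper are invented for illustration and do not come from the project code.

```python
import numpy as np

# Assumed per-action weights; the project's real values are not published here
ACTION_WEIGHTS = {
    "point_spent": 3.0,
    "load_more": 1.5,
    "search": 2.0,
    "dwell_30s": 1.0,
    "save_update": 2.5,
    "like_comment": 2.0,
}

def build_user_vector(events, item_vectors):
    """Weighted average of item vectors, weighted by the action that
    touched each item. events: list of (item_id, action) pairs."""
    total, weight_sum = 0.0, 0.0
    for item_id, action in events:
        w = ACTION_WEIGHTS.get(action, 0.0)
        total = total + w * item_vectors[item_id]
        weight_sum += w
    return total / weight_sum if weight_sum else None

# Toy usage with random 300-dimensional stand-in vectors
vecs = {"cat_travel": np.random.rand(300), "cat_food": np.random.rand(300)}
events = [("cat_travel", "search"), ("cat_food", "dwell_30s")]
print(build_user_vector(events, vecs)[:5])
```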
Optimizing vector similarity using the `Annoy` library and `Bayesian Optimization`.
- Annoy Library: Efficient library for calculating and searching vector similarity
- Bayesian Optimization: Efficient algorithm for optimizing objective functions
- pandas: Library for data manipulation and calculation
- numpy: Library for handling multi-dimensional arrays
- matplotlib: Library for data visualization
- Install Dependencies:
pip install annoy pandas numpy scikit-learn bayesian-optimization matplotlib
- Run the Code:
python *_optimizeAnnModel.py
Run the above command to perform vector similarity optimization using the Annoy library.
- `evaluate_n_trees(n_trees)`: Function for optimizing the accuracy of the Annoy index; computes vector similarities for a given number of trees and returns the average distance of the nearest neighbors
- `BayesianOptimization`: Tunes the number of trees (`n_trees`) against the objective function (`evaluate_n_trees`), searching for the optimal average distance
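To make the loop concrete, here is a minimal sketch of how `evaluate_n_trees` and `BayesianOptimization` can be wired together. The vector data, dimensionality, `angular` metric, and search bounds are illustrative assumptions, not the project's actual configuration.

```python
import numpy as np
from annoy import AnnoyIndex
from bayes_opt import BayesianOptimization

DIM = 300  # assumed vector dimension (e.g., FastText vectors)
vectors = np.random.rand(1000, DIM).astype("float32")  # placeholder data

def evaluate_n_trees(n_trees):
    """Build an index with n_trees and return the negated mean
    nearest-neighbor distance, so the optimizer maximizes accuracy."""
    index = AnnoyIndex(DIM, "angular")
    for i, v in enumerate(vectors):
        index.add_item(i, v)
    index.build(int(n_trees))
    distances = []
    for i in range(100):  # sample queries
        _, dists = index.get_nns_by_item(i, 10, include_distances=True)
        distances.extend(dists[1:])  # skip the item's distance to itself
    return -float(np.mean(distances))  # maximize => minimize mean distance

optimizer = BayesianOptimization(
    f=evaluate_n_trees,
    pbounds={"n_trees": (10, 200)},  # assumed search range
    random_state=42,
)
optimizer.maximize(init_points=3, n_iter=10)
print(optimizer.max)  # best n_trees found and its objective value
```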
Using scikit-learn's `t-SNE` algorithm to visualize vector data.
- t-SNE: Algorithm used to visualize high-dimensional data by reducing it to lower dimensions while preserving the structure
- matplotlib: Data visualization library
- scikit-learn: Library for implementing machine learning models
- pandas: Library for data manipulation and calculation
- numpy: Library for handling multi-dimensional arrays
- Install Dependencies:
pip install scikit-learn matplotlib pandas numpy
- Run the Code:
python visualize_vectors.py
The 2D and 3D t-SNE results are generated as the image files TSNE_2D.png and TSNE_3D.png.
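A minimal sketch of the 2D projection step, assuming the vectors are already loaded as a NumPy array (random data stands in here):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render to file without a display
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

vectors = np.random.rand(500, 300)  # placeholder for the real vector data

# Reduce 300-dimensional vectors to 2D while preserving local structure
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
points = tsne.fit_transform(vectors)

plt.figure(figsize=(8, 8))
plt.scatter(points[:, 0], points[:, 1], s=5)
plt.title("t-SNE projection of vector data")
plt.savefig("TSNE_2D.png", dpi=150)
```

Setting `n_components=3` and plotting on a `projection="3d"` axis yields the TSNE_3D.png variant in the same way.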
Using the BERT pre-trained language model for natural language processing (NLP) tasks to filter clean words from text.
- BERT: Pre-trained language model based on the bidirectional transformer architecture; uses the `kor_unsmile` model provided by Smilegate-ai
- Hugging Face Transformers: Library providing many pre-trained models across languages; used to load the model and perform text classification with BERT
- Install Dependencies:
pip install transformers tqdm
- Download Pre-trained BERT Model and Tokenizer:
```python
from transformers import BertForSequenceClassification, AutoTokenizer

model_name = 'smilegate-ai/kor_unsmile'
model = BertForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
- Perform Word Filtering:
python filter_words.py
Categorizes Korean words provided by spellcheck-ko, extracts clean words, and saves the results in the data/ko_filtered.txt file.
- `get_predicated_label(output_labels, min_score)`: Function that returns only the labels from the BERT model's output whose score is at or above the specified minimum score
- `TextClassificationPipeline`: Initializes and configures the text classification pipeline; uses the BERT model to classify the input text and returns the results
Utilizing various tagging libraries from `KoNLPy` for Korean morphological analysis to extract meaningful words from categories, then translating the extracted words with `googletrans` and normalizing them to obtain new similar words.
- konlpy: Library for Korean morphological analysis; uses the Okt, Hannanum, Kkma, and Komoran taggers
- googletrans: Library using the Google Translate API for word translation
- re: Library for regular expressions used to filter words
- Install Dependencies:
pip install konlpy googletrans
- Perform Category Morphological Analysis and Translation:
python category_corpus.py
Extracts new similar words from categories and saves the results in output.json and output_oneElement.txt.
- `tokenize_and_join(input_file: str) -> Tuple[List[int], List[str]]`: Reads each line of the input file, performs morphological analysis and translation to extract meaningful words, and saves them to a file
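A rough sketch of the extract-translate-normalize flow using the Okt tagger; the noun filter, normalization regex, and sample category are assumptions for illustration.

```python
import re
from konlpy.tag import Okt
from googletrans import Translator

okt = Okt()
translator = Translator()

def extract_candidates(category: str):
    """Morph-analyze a category name and keep meaningful nouns."""
    nouns = okt.nouns(category)
    return [n for n in nouns if len(n) > 1]  # drop single-syllable noise

def translate_and_normalize(words):
    """Translate Korean words to English and normalize the output."""
    results = []
    for word in words:
        translated = translator.translate(word, src="ko", dest="en").text
        normalized = re.sub(r"[^a-zA-Z ]", "", translated).lower().strip()
        if normalized:
            results.append(normalized)
    return results

candidates = extract_candidates("프로그래밍 언어 교육")
print(translate_and_normalize(candidates))  # new candidate similar words
```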
Extracting word vectors using the Korean Word2Vec embedding model and saving them.
- Word2Vec: Technique for learning distributed representations of words, using Facebook's Word2Vec model to extract and use word vectors
- SQLite: Lightweight database management system, used to store words and their corresponding vectors
- unicodedata: Library providing a database for Unicode characters
- pickle: Library for serializing and deserializing Python objects
- numpy: Library for handling multi-dimensional arrays
- Install Dependencies:
pip install numpy tqdm
- Build Korean Word2Vec Database:
python process_vecs_*.py
Extracts word vectors from the Korean Word2Vec model and saves them in *_guesses_ko.db and *_nearest_ko.dat.
- `is_hangul(text) -> bool`: Function that checks whether the given text is Hangul (Korean)
- `load_dic(path: str) -> Set[str]`: Function that reads the dictionary file from the given path and returns it as a set, normalizing the Korean words it contains
- `blocks(files, size=65536)`: Generator function that splits a file into blocks
- `count_lines(filepath)`: Function that counts the total number of lines in a given file
- Extracts word vectors from the Word2Vec model and stores them in the database
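A minimal sketch of the build step, assuming the vectors come as a fastText-style text file (one "word v1 ... v300" per line); the file name, table schema, and Hangul check are illustrative assumptions.

```python
import pickle
import sqlite3
import unicodedata
import numpy as np

def is_hangul(text: str) -> bool:
    """Rough check: every character's Unicode name mentions HANGUL."""
    return all("HANGUL" in unicodedata.name(ch, "") for ch in text)

con = sqlite3.connect("guesses_ko.db")
con.execute("CREATE TABLE IF NOT EXISTS word2vec (word TEXT PRIMARY KEY, vec BLOB)")

with open("cc.ko.300.vec", encoding="utf-8") as f:  # assumed FastText file
    next(f)  # skip the "count dim" header line
    for line in f:
        word, *values = line.rstrip().split(" ")
        if not is_hangul(word):
            continue  # keep only Hangul entries
        vec = np.asarray(values, dtype=np.float32)
        con.execute(
            "INSERT OR REPLACE INTO word2vec VALUES (?, ?)",
            (word, pickle.dumps(vec)),  # store the pickled vector as a BLOB
        )
con.commit()
con.close()
```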
Using stored word vectors to measure similarity between words, find similar words for a specific word, and recommend categories based on those words.
- numpy: Library for handling multi-dimensional arrays
- pickle: Library for serializing and deserializing Python objects
- pymysql: Library for connecting to and interacting with MySQL databases
- Install Dependencies:
pip install pymysql numpy
- Perform Keyword-based Category/Keyword Recommendation and Category-based Category Recommendation:
python process_smilar_*.py
- `relCategory.json`: Stores related-category information from category-based recommendation in JSON format
- `keyword/*.dat`: Stores related-keyword information from keyword-based recommendation in dat format
- `category/*.json`: Stores related-category information from keyword-based recommendation in JSON format
- `most_similar(mat: array, idx: int, k: int) -> Tuple[array, array]`: Returns the k words most similar to a given word in the matrix, together with their similarity scores
- `dump_nearest(title: str, values: List[str], words: List[str], mat: array, k: int = 100) -> List[str]`: Computes word similarities and saves the similar words to a file; if results were already computed, loads and returns them from the file
- `get_nearest(title: str, values: List[str], words: List[str], mat: array) -> List[str]`: Computes word similarities after checking for previously computed results; loads and returns them if present, otherwise recomputes
- `get_word_vector(word: str, model: Word2Vec) -> Optional[array]`: Function that retrieves the vector representation of a given word from the Word2Vec model
- `recommend_by_category(category: str, k: int = 5) -> List[str]`: Recommends related categories based on the semantic similarity of words within the given category
- `recommend_by_keyword(keyword: str, k: int = 5) -> List[str]`: Recommends related keywords based on the semantic similarity of words within the given keyword
- `dump_json(data: Any, filepath: str)`: Serializes the given data to a JSON file
- `load_json(filepath: str) -> Any`: Deserializes data from a JSON file
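As one concrete reference point, a cosine-similarity version of `most_similar` consistent with the signature above could look like this (a sketch, not the project's actual implementation; it returns row indices rather than the words themselves):

```python
from typing import Tuple
import numpy as np

def most_similar(mat: np.ndarray, idx: int, k: int) -> Tuple[np.ndarray, np.ndarray]:
    """Return indices and scores of the k rows most similar to row idx."""
    norms = np.linalg.norm(mat, axis=1)
    sims = mat @ mat[idx] / (norms * norms[idx] + 1e-12)  # cosine similarity
    order = np.argsort(-sims)          # descending by similarity
    order = order[order != idx][:k]    # drop the query word itself
    return order, sims[order]

# Toy usage with random vectors standing in for the stored embeddings
words = ["여행", "음식", "프로그래밍", "요리", "캠핑"]
mat = np.random.rand(len(words), 300).astype(np.float32)
idx, scores = most_similar(mat, words.index("음식"), k=3)
print([(words[i], round(float(s), 3)) for i, s in zip(idx, scores)])
```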
This project is licensed under the GPL-3.0 License - see the LICENSE file for details.