
Recommendation-Algorithm


Index


Directory

  • USER_CTGY: User-based Collaborative Filtering for Category Recommendation
  • USER_MODL: Item-based Collaborative Filtering for Module-specific Feature Recommendation
  • SMLR_RECO: Semantic Keyword and Category Recommendation Algorithm using Morphological Analysis and Tagging Libraries
    1. ByCTGY: Category-based Related Category Recommendation Algorithm
    2. ByKYWD: Keyword-based Related Category and Keyword Recommendation Algorithm
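To make the two collaborative-filtering styles concrete, here is a minimal user-based CF sketch in Python. The toy interaction matrix and the function name are hypothetical illustrations, not the repository's actual code:

```python
import numpy as np

# Toy user x category interaction matrix (hypothetical data)
ratings = np.array([
    [5.0, 0.0, 3.0, 0.0],
    [4.0, 0.0, 3.0, 1.0],
    [0.0, 5.0, 0.0, 4.0],
])

def recommend_user_based(ratings, user, k=1, top_n=2):
    """User-based CF: score unseen categories via the k most similar users."""
    norms = np.linalg.norm(ratings, axis=1)
    sims = ratings @ ratings[user] / (norms * norms[user] + 1e-12)
    sims[user] = -1.0                   # exclude the user themself
    neighbours = np.argsort(-sims)[:k]  # k nearest users by cosine similarity
    scores = sims[neighbours] @ ratings[neighbours]
    scores[ratings[user] > 0] = -1.0    # only rank categories the user has not touched
    return np.argsort(-scores)[:top_n]

print(list(recommend_user_based(ratings, 0)))  # [3, 1]
```

Item-based CF (USER_MODL) follows the same pattern with the matrix transposed, comparing module columns instead of user rows.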

DataSet

  1. Category/module vector data built from users' explicit data and from implicit feedback obtained by analyzing per-category/module behavior records
    • Dynamic vector weights: points used, "view more" requests, searches, dwell time of 30 seconds or more, save/update activations, likes/comments
    • Static vector weights: module, category, keyword, annual keyword
  2. A Korean spelling dictionary provided by spellcheck-ko, based on the Basic Korean Dictionary, the Standard Korean Language Dictionary, and Woori-mal-saem
  3. Korean word vectors from a Korean Word2Vec model that reflects semantic relationships between words, represented as 300-dimensional FastText vectors provided by Facebook
  4. A corpus of Naver categories split by morphological analysis
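The weighted-feedback scheme in item 1 can be sketched as follows; the event names and weight values below are illustrative assumptions, not the values used in the repository:

```python
import numpy as np

# Hypothetical weights for implicit-feedback events (illustrative, not the repo's values)
DYNAMIC_WEIGHTS = {
    "point_use": 3.0, "view_more": 1.5, "search": 2.0,
    "dwell_30s": 1.0, "save_update": 2.5, "like_comment": 2.0,
}

def user_category_vector(events, n_categories):
    """Accumulate weighted implicit feedback into a normalized per-category vector."""
    vec = np.zeros(n_categories)
    for category_id, event in events:
        vec[category_id] += DYNAMIC_WEIGHTS[event]
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

v = user_category_vector([(0, "search"), (0, "like_comment"), (2, "dwell_30s")], 4)
print(v.round(3))
```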

Optimizing Vector Similarity using the Annoy Library

Optimizing vector similarity using the Annoy library and Bayesian Optimization.

Key Technologies and Libraries Used

  • Annoy Library: Efficient library for calculating and searching vector similarity
  • Bayesian Optimization: Efficient algorithm for optimizing objective functions
  • pandas: Library for data manipulation and calculation
  • numpy: Library for handling multi-dimensional arrays
  • matplotlib: Library for data visualization

Usage

  1. Install Dependencies:
    pip install annoy pandas numpy scikit-learn bayesian-optimization matplotlib
    
  2. Run the Code:
    python *_optimizeAnnModel.py

Run the above command to perform vector similarity optimization using the Annoy library.

Explanation of Python Code File (*_similarity_optimization.py)

  • evaluate_n_trees(n_trees): Function to optimize the accuracy of the Annoy index, calculates vector similarity for a given number of trees, and returns the average distance of the nearest neighbors
  • BayesianOptimization: Optimizes the number of trees by adjusting n_trees and searching for the value of the objective function (evaluate_n_trees) that yields the best average distance

Visualizing Vector Data with t-SNE

Using scikit-learn's t-SNE algorithm to visualize vector data.

Key Technologies and Libraries Used

  • t-SNE: Algorithm used to visualize high-dimensional data by reducing it to lower dimensions while preserving the structure
  • matplotlib: Data visualization library
  • scikit-learn: Library for implementing machine learning models
  • pandas: Library for data manipulation and calculation
  • numpy: Library for handling multi-dimensional arrays

Usage

  1. Install Dependencies:
    pip install scikit-learn matplotlib pandas numpy
  2. Run the Code:
    python visualize_vectors.py

The 2D and 3D t-SNE results are generated as the image files TSNE_2D.png and TSNE_3D.png.
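A minimal sketch of what visualize_vectors.py does, using random stand-in vectors (the real script loads the stored vector data; plot styling here is an assumption):

```python
import matplotlib
matplotlib.use("Agg")  # write PNGs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for the category/module vectors (random 50-d data)
rng = np.random.default_rng(0)
vectors = rng.normal(size=(100, 50))

# 2D projection
emb2 = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(vectors)
plt.figure()
plt.scatter(emb2[:, 0], emb2[:, 1], s=8)
plt.title("t-SNE (2D)")
plt.savefig("TSNE_2D.png")

# 3D projection
emb3 = TSNE(n_components=3, perplexity=10, random_state=0).fit_transform(vectors)
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(emb3[:, 0], emb3[:, 1], emb3[:, 2], s=8)
ax.set_title("t-SNE (3D)")
plt.savefig("TSNE_3D.png")
```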


Word Filtering using BERT for Korean Text Classification

Using the BERT pre-trained language model for natural language processing (NLP) tasks to filter clean words from text.

Key Technologies and Libraries Used

  • BERT: Pre-trained language model based on the bidirectional transformer model, using the kor_unsmile model provided by Smilegate-ai
  • Hugging Face Transformers: Library providing various pre-trained models for different languages, loads the model and performs text classification using BERT

Usage

  1. Install Dependencies:
    pip install transformers tqdm
    
  2. Download Pre-trained BERT Model and Tokenizer:
    from transformers import BertForSequenceClassification, AutoTokenizer
    
    model_name = 'smilegate-ai/kor_unsmile'
    model = BertForSequenceClassification.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
  3. Perform Word Filtering:
    python filter_words.py

Categorizes Korean words provided by spellcheck-ko, extracts clean words, and saves the results in the data/ko_filtered.txt file.

Explanation of Python Code File (filter_words.py)

  • get_predicated_label(output_labels, min_score): Function to return only labels from the BERT model's output that have a score greater than or equal to the specified minimum score

  • TextClassificationPipeline: Initializes and configures the text classification pipeline, uses the BERT model to perform classification on the input text, and returns the results
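The score-threshold step can be sketched without loading the model. The example below assumes the pipeline returns a list of {'label', 'score'} dicts, as Hugging Face text-classification pipelines do when configured to return all scores; the sample labels are hypothetical:

```python
from typing import Dict, List

def get_predicated_label(output_labels: List[Dict], min_score: float) -> List[str]:
    """Keep only labels whose score is at least min_score."""
    return [item["label"] for item in output_labels if item["score"] >= min_score]

# Example output shape of a text-classification pipeline returning all scores
scores = [
    {"label": "clean", "score": 0.91},
    {"label": "abuse", "score": 0.06},
    {"label": "hate", "score": 0.03},
]
print(get_predicated_label(scores, 0.5))  # ['clean']
```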


Category Morphological Analysis and Translation using konlpy and googletrans

Utilizing various tagging libraries from KoNLPy for Korean morphological analysis to extract meaningful words from categories. Translates extracted words using googletrans, then normalizes them to obtain new similar words.

Key Technologies and Libraries Used

  • konlpy: Library for Korean morphological analysis, using Okt, Hannanum, Kkma, and Komoran for morphological analysis
  • googletrans: Library using the Google Translate API for word translation
  • re: Library for regular expressions used to filter words

Usage

  1. Install Dependencies:
    pip install konlpy googletrans
    
  2. Perform Category Morphological Analysis and Translation:
    python category_corpus.py

Extracts new similar words from categories and saves the results in output.json and output_oneElement.txt.

Explanation of Python Code File (category_corpus.py)

  • tokenize_and_join(input_file: str) -> Tuple[List[int], List[str]]: Reads each line from the input file, performs morphological analysis and translation to extract meaningful words, and saves them to a file
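The regular-expression normalization step can be sketched as follows; the function name and the Hangul-only filtering rule are illustrative assumptions, and category_corpus.py may filter differently:

```python
import re

HANGUL_ONLY = re.compile(r"[가-힣]+")  # words made up of Hangul syllables only

def normalize_words(words):
    """Strip whitespace, drop non-Hangul tokens, de-duplicate while keeping order."""
    seen, result = set(), []
    for word in words:
        word = word.strip()
        if HANGUL_ONLY.fullmatch(word) and word not in seen:
            seen.add(word)
            result.append(word)
    return result

print(normalize_words(["여행 ", "Travel", "여행", "맛집"]))  # ['여행', '맛집']
```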

Building a Word2Vec Korean Word Embedding Database

Extracting word vectors using the Korean Word2Vec embedding model and saving them.

Key Technologies and Libraries Used

  • Word2Vec: Technique for learning distributed representations of words, using Facebook's Word2Vec model to extract and use word vectors
  • SQLite: Lightweight database management system, used to store words and their corresponding vectors
  • unicodedata: Library providing a database for Unicode characters
  • pickle: Library for serializing and deserializing Python objects
  • numpy: Library for handling multi-dimensional arrays

Usage

  1. Install Dependencies:
    pip install numpy tqdm
    
  2. Build Korean Word2Vec Database:
    python process_vecs_*.py

Extracts word vectors from the Korean Word2Vec model and saves them in *_guesses_ko.db and *_nearest_ko.dat.

Explanation of Python Code File (process_vecs_*.py)

  • is_hangul(text) -> bool: Function to check if the given text is in Hangul (Korean)
  • load_dic(path: str) -> Set[str]: Function to read the dictionary file from the specified path and return it as a set, normalizing Korean words included in the dictionary
  • blocks(files, size=65536): Generator function to divide a file into blocks
  • count_lines(filepath): Function to count the total number of lines in a given file
  • Extracts word vectors from the Word2Vec model and stores them in the database
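A compact sketch of the helpers and the SQLite storage pattern described above. The table name and BLOB serialization are assumptions; the repository's actual schema may differ:

```python
import sqlite3
import unicodedata
import numpy as np

def is_hangul(text: str) -> bool:
    """True if every character is a Hangul character (by Unicode name)."""
    return all("HANGUL" in unicodedata.name(ch, "") for ch in text)

def blocks(fileobj, size=65536):
    """Yield a file in fixed-size blocks (used for fast line counting)."""
    while True:
        block = fileobj.read(size)
        if not block:
            break
        yield block

def count_lines(filepath: str) -> int:
    with open(filepath, "r", encoding="utf-8") as f:
        return sum(block.count("\n") for block in blocks(f))

# Store word -> vector rows in SQLite, serializing vectors as raw bytes
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE guesses (word TEXT PRIMARY KEY, vec BLOB)")
vec = np.random.default_rng(0).normal(size=300).astype("float32")
conn.execute("INSERT INTO guesses VALUES (?, ?)", ("여행", vec.tobytes()))
row = conn.execute("SELECT vec FROM guesses WHERE word = ?", ("여행",)).fetchone()
restored = np.frombuffer(row[0], dtype="float32")
print(np.allclose(restored, vec))  # True
```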

Category/Keyword Recommendation using Semantic Word Similarity

Using stored word vectors to measure similarity between words, find similar words for a specific word, and recommend categories based on those words.

Key Technologies and Libraries Used

  • numpy: Library for handling multi-dimensional arrays
  • pickle: Library for serializing and deserializing Python objects
  • pymysql: Library for connecting to and interacting with MySQL databases

Usage

  1. Install Dependencies:
    pip install pymysql numpy
    
  2. Perform Keyword-based Category/Keyword Recommendation, Category-based Category Recommendation:
    python process_smilar_*.py

  • relCategory.json: related-category information recommended from category-based recommendation, stored in JSON format
  • keyword/*.dat: related-keyword information recommended from keyword-based recommendation, stored as .dat files
  • category/*.json: related-category information recommended from keyword-based recommendation, stored as .json files

Explanation of Python Code File (process_smilar_*.py)

  • most_similar(mat: array, idx: int, k: int) -> Tuple[array, array]: Returns the k words most similar to a given word in the matrix, along with their similarity scores
  • dump_nearest(title: str, values: List[str], words: List[str], mat: array, k: int = 100) -> List[str]: Computes word similarities and saves the similar words to a file; if results were already computed, loads and returns them from the file
  • get_nearest(title: str, values: List[str], words: List[str], mat: array) -> List[str]: Computes word similarities, first checking whether results already exist; loads and returns them if so, otherwise recomputes them
  • get_word_vector(word: str, model: Word2Vec) -> Optional[array]: Function to retrieve the vector representation of a given word from the Word2Vec model
  • recommend_by_category(category: str, k: int = 5) -> List[str]: Recommends related categories based on the semantic similarity of words within the given category
  • recommend_by_keyword(keyword: str, k: int = 5) -> List[str]: Recommends related keywords based on the semantic similarity of words within the given keyword
  • dump_json(data: Any, filepath: str): Serializes the given data to a JSON file
  • load_json(filepath: str) -> Any: Deserializes the data from a JSON file
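A minimal cosine-similarity version of most_similar, as a self-contained sketch with a toy matrix; the repository's implementation may differ in normalization and tie-breaking:

```python
from typing import Tuple
import numpy as np

def most_similar(mat: np.ndarray, idx: int, k: int) -> Tuple[np.ndarray, np.ndarray]:
    """Return indices and cosine similarities of the k rows most similar to row idx."""
    norms = np.linalg.norm(mat, axis=1)
    sims = (mat @ mat[idx]) / (norms * norms[idx])
    order = np.argsort(-sims)
    order = order[order != idx][:k]  # drop the query word itself
    return order, sims[order]

# Toy vocabulary and 2-d word vectors (hypothetical data)
words = ["travel", "trip", "food", "restaurant"]
mat = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
nearest, scores = most_similar(mat, 0, 2)
print([words[i] for i in nearest])  # ['trip', 'restaurant']
```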

License

This project is licensed under the GPL-3.0 License - see the LICENSE file for details.