Try to use methods of KNN and DNN to classify categories of text files
I am trying to classify Chinese Text Files into specified categories(total 7 categories).
- Feature_Extractions : TF-IDF is used as features of each texts
1.1 Jieba module is used to split text in files
1.1.1 Jieba.cut_for_search() is well performance in this trial
1.1.2 Stopping_Words Dictionary is built by own
1.1.3 Select top-K TF-IDF results (here, K = 10) - Word2Vec model is used
- Classifiers : KNN & DNN model
- Total 6804 files and split into training / validation / testing datasets with ratio 0.75:0.25
1.1 Samples of files(here, category : 2社會):
1.2 Categories of files : ['0體育', '1房產', '2社會', '3星座', '4科技', '5娛樂', '6時尚']
- KNN --> n = 5
- DNN --> { epochs : 100 , batch_size : 16 , optimizer : Adam(lr = 1e-4) , loss_fn : category_crossentropy }
2.1 Early_Stopping : { monitor : val_loss , patience : 5 }
2.2 ReduceOnLr : { monitor : val_loss , patience : 5 , factor : 0.1 , min_lr : 1e-8 }
1.1 accuracy for testing sets : ~ 66.7 %
Problems : For those failed to be classied,the word vectors results are worst and found that the TF-IDF results
are strange too.
For example , we get ['寧波', '志願', '願者', '志願者', '交通', '記者', '市民', '活動', '交通規則', '一邊'] TF-IDF <br>
result and this file is belong to 0體育.But we can found that none of the elements in features are related to this category! <br>