
chusheng0505/Classifications-Of-Text-Files-By-Using-DNN-and-KNN


Classifications-Of-Chinese-Text-Files-By-Using-DNN-and-KNN

This project tries KNN and DNN classifiers for sorting Chinese text files into 7 specified categories.

Steps of Processing:

  1. Feature extraction: TF-IDF is used to select the features of each text
    1.1 The Jieba module is used to segment the text in each file
    1.1.1 jieba.cut_for_search() performed well in this trial
    1.1.2 A custom stop-word dictionary was built for this project
    1.1.3 The top-K TF-IDF terms are selected (here, K = 10)
  2. A Word2Vec model is used to obtain word vectors
  3. Classifiers: KNN and DNN models
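
The feature-extraction step can be sketched as follows. This is a minimal pure-Python illustration, not the repository's code. It assumes each file has already been tokenized (for instance with jieba.cut_for_search()) and uses a smoothed IDF variant chosen for this example. The names `STOP_WORDS` and `top_k_tfidf` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical stop-word set; the project's own dictionary is not shown here.
STOP_WORDS = {"的", "了", "在"}


def top_k_tfidf(docs, k=10, stop_words=STOP_WORDS):
    """Return the k highest-scoring TF-IDF terms for each tokenized document.

    `docs` is a list of token lists, one list per file.
    """
    # Drop stop words and whitespace-only tokens before scoring.
    cleaned = [[t for t in doc if t.strip() and t not in stop_words] for doc in docs]
    n_docs = len(cleaned)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for doc in cleaned for term in set(doc))

    results = []
    for doc in cleaned:
        tf = Counter(doc)
        total = len(doc) or 1
        # Smoothed IDF (an assumption): log((1 + N) / (1 + df)) + 1.
        scores = {
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        # Sort by descending score; break ties by term for determinism.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        results.append(ranked[:k])
    return results
```

Calling `top_k_tfidf(docs, k=10)` on the tokenized corpus gives one keyword list per file, which can then be passed to the Word2Vec step.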

Data:

  1. 6804 files in total, split into training / validation / testing datasets with a ratio of 0.75:0.25
    1.1 Sample file (here, category: 2社會):
    [image: sample file]
    1.2 File categories: ['0體育', '1房產', '2社會', '3星座', '4科技', '5娛樂', '6時尚']
        (sports, real estate, society, horoscopes, technology, entertainment, fashion)
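
One way to produce a 0.75:0.25 split is sketched below. It is an illustration under assumptions, not the project's procedure: it splits per category (stratified) so that each label keeps roughly the same ratio, and it does not show how a validation set is carved out, since the text above does not specify that. `stratified_split` is a hypothetical helper. As plain arithmetic, 0.75:0.25 of 6804 files is 5103 to 1701.

```python
import random
from collections import defaultdict


def stratified_split(samples, train_ratio=0.75, seed=0):
    """Split (item, label) pairs so each label keeps roughly the same ratio."""
    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append((item, label))

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, held_out = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        cut = int(round(len(group) * train_ratio))
        train.extend(group[:cut])
        held_out.extend(group[cut:])
    return train, held_out
```

Splitting per category avoids a held-out set that under-represents one of the 7 categories by chance.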

Parameters used:

  1. KNN --> n = 5 (number of neighbors)
  2. DNN --> { epochs : 100 , batch_size : 16 , optimizer : Adam(lr = 1e-4) , loss_fn : categorical_crossentropy }
    2.1 EarlyStopping : { monitor : val_loss , patience : 5 }
    2.2 ReduceLROnPlateau : { monitor : val_loss , patience : 5 , factor : 0.1 , min_lr : 1e-8 }
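
To make the KNN setting concrete, here is a minimal sketch of a 5-nearest-neighbor majority vote over feature vectors. The Euclidean distance is an assumption, because the distance metric is not stated above, and `knn_predict` is a hypothetical name rather than the project's code.

```python
import math
from collections import Counter


def knn_predict(train_vectors, train_labels, query, k=5):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Rank training points by Euclidean distance to the query (assumed metric).
    ranked = sorted(
        range(len(train_vectors)),
        key=lambda i: math.dist(train_vectors[i], query),
    )
    votes = Counter(train_labels[i] for i in ranked[:k])
    # most_common(1) returns the label with the highest vote count.
    return votes.most_common(1)[0][0]
```

In this project the vectors would be the Word2Vec-based features of each file, and the labels would be the 7 categories.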

Results:

  1. DNN: training accuracy of 1.0
    [image]
    1.1 Testing accuracy: ~66.7 %

Problems: For the files that were misclassified, the word vectors are poor, and the TF-IDF results look strange as well.

For example, one file that belongs to 0體育 yields the TF-IDF result ['寧波', '志願', '願者', '志願者', '交通', '記者', '市民', '活動', '交通規則', '一邊'] (terms about Ningbo, volunteers, traffic, reporters, citizens and activities). None of these features is related to the sports category.

In the future, we may consider building a dedicated dictionary for each category instead of one dictionary shared by all files.
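
One possible reading of this idea is sketched below. It is an assumption for illustration only, not a planned implementation: all tokens of a category are merged into one bag, and terms are scored so that words appearing in every category (such as 記者) score zero and are dropped. `category_keywords` is a hypothetical helper.

```python
import math
from collections import Counter


def category_keywords(tokens_by_category, k=10):
    """Rank terms per category, down-weighting terms shared across categories."""
    n_cats = len(tokens_by_category)
    # Category frequency: the number of categories each term appears in.
    cat_freq = Counter(
        term for tokens in tokens_by_category.values() for term in set(tokens)
    )

    keywords = {}
    for cat, tokens in tokens_by_category.items():
        tf = Counter(tokens)
        total = len(tokens) or 1
        # Unsmoothed IDF: a term present in every category gets a score of 0.
        scores = {
            term: (count / total) * math.log(n_cats / cat_freq[term])
            for term, count in tf.items()
        }
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        keywords[cat] = [term for term in ranked if scores[term] > 0][:k]
    return keywords
```

The resulting per-category term lists could serve as a starting point for category-specific dictionaries, subject to manual review.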
