===
As used for the Weibo age profiling task reported at the Language Resources & Evaluation Conference 2016 (Zhang, Caines, Alikaniotis & Buttery, 'Predicting author age from Weibo microblog posts')
- normalises Weibo posts and extracts linguistic and non-linguistic features in the process;
- requires previously obtained Weibo files: ours were Excel files with one row per user and one column per post;
- requires the resources listed below;
- passes the normalised texts to the Stanford NLP word segmenter and part-of-speech tagger;
- requires a (free) download of the Stanford NLP segmenter and POS tagger from the Stanford NLP Group website;
- look for 'CHECK PATHS' comments, where you should adapt the file paths to your filesystem.
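
To give a rough idea of what normalising a post while extracting features can look like, here is a minimal sketch. It is not the code used for the paper: the function name `normalise_post`, the regular expressions and the four feature names are all assumptions made for illustration, and the real scripts may normalise differently and extract other features.

```python
import re

# Hypothetical patterns (assumptions, not taken from the original scripts).
# Order matters: URLs are removed first because they may contain '#' or '@'.
PATTERNS = [
    ("urls", re.compile(r"https?://\S+")),
    ("hashtags", re.compile(r"#[^#\s]+#")),            # Weibo topics are wrapped in a pair of '#'
    ("mentions", re.compile(r"@[\w\-]+")),             # assumes a mention ends at whitespace or punctuation
    ("emoticons", re.compile(r"\[[^\[\]\s]{1,8}\]")),  # bracketed emoticon codes such as [哈哈]
]


def normalise_post(text):
    """Strip non-linguistic items from one post and count them.

    Returns (normalised_text, features), where features maps each
    pattern name to the number of items removed.
    """
    features = {}
    for name, pattern in PATTERNS:
        # subn returns the new string and the number of substitutions made
        text, n = pattern.subn(" ", text)
        features[name] = n
    # Collapse the whitespace left behind by the removals
    text = re.sub(r"\s+", " ", text).strip()
    return text, features
```

The normalised text is what would then be handed on for segmentation and tagging, while the counts can be kept as non-linguistic features for each user.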
dictClassicalModernCharacters.csv
: list of 50 classical characters with definitions and modern equivalents, where appropriate (i.e. where unambiguous); source http://lingua.mtsu.edu/chinese-computing/statistics/char/list.php?Which=CL

listKaomoji.csv
: list of 748 Kaomojis separated by '$' signs; source http://kaomoji.ru/en

listPopularExpressions.csv
: list of 14 popular expressions; source http://zh.wikipedia.org/zh/中国网络流行语列表
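
As an illustration of how the '$'-separated Kaomoji list might be used, the sketch below parses such a list and counts Kaomoji in a post. The function names (`parse_kaomoji`, `load_kaomoji`, `count_kaomoji`), the UTF-8 encoding and the longest-first matching strategy are assumptions for this example, not a description of the original scripts.

```python
def parse_kaomoji(raw):
    """Split '$'-separated text into unique kaomoji, longest first."""
    items = {k.strip() for k in raw.split("$") if k.strip()}
    # Longest first, so that a short kaomoji contained in a longer one
    # is not counted twice; ties are broken alphabetically.
    return sorted(items, key=lambda k: (-len(k), k))


def load_kaomoji(path):
    """Read the list from a file (UTF-8 is an assumption)."""
    with open(path, encoding="utf-8") as handle:
        return parse_kaomoji(handle.read())


def count_kaomoji(text, kaomoji):
    """Return (count, text_without_kaomoji) for one post."""
    count = 0
    for k in kaomoji:
        n = text.count(k)
        if n:
            count += n
            # Remove the matches so shorter kaomoji cannot match inside them
            text = text.replace(k, " ")
    return count, text
```

For example, with the list `(^_^)$(T_T)$^_^`, a post containing `(^_^)`, `^_^` and `(T_T)` once each is counted as three Kaomoji; without the longest-first ordering, the `^_^` inside `(^_^)` would be counted a second time.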