NGramTool (http://www.nlplab.cn/zhangle/ngram.html)
===================================================
This package contains a set of tools for manipulating N-gram statistics
extracted from a large raw corpus. The tools provided here distinguish
themselves from other N-gram extraction programs in that:
1. Unicode code base.
All text is represented internally as Unicode (UCS16) for easy
handling of East Asian languages such as Chinese, Korean or Japanese.
2. Both word N-grams and character N-grams can be extracted.
This feature is especially useful for East Asian language processing:
in languages such as Chinese there are no explicit word boundaries,
and the basic processing units are usually single characters.
3. Statistical Substring Reduction (SSR) algorithms (see below).
Statistical Substring Reduction is a powerful procedure for removing
redundant N-grams from an N-gram set using N-gram frequency
information. For example, if both "People's republic of China" and
"People's republic" occur 10 times in the corpus, the latter will be
removed, since it is a substring of the former. The details can be
found in the following paper:
"Statistical Substring Reduction in Linear Time"
Lv Xue-qiang, Zhang Le and Hu Junfeng
IJCNLP-04, Hainan Island, P.R. China, 2004
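The reduction rule at the heart of SSR can be illustrated with a naive
O(n^2) sketch in Python (not part of this package; the linear-time
algorithms it ships are considerably more involved): an N-gram is dropped
when it occurs inside a longer N-gram with exactly the same frequency.

```python
def ssr_naive(ngrams):
    """ngrams: dict mapping N-gram string -> frequency.
    Remove every N-gram that is a substring of a longer N-gram
    with the same frequency (naive O(n^2) illustration)."""
    keep = {}
    for g, f in ngrams.items():
        redundant = any(g != h and g in h and f == ngrams[h]
                        for h in ngrams)
        if not redundant:
            keep[g] = f
    return keep

counts = {"People's republic of China": 10,
          "People's republic": 10,
          "republic of": 12}
print(ssr_naive(counts))
```

Here "People's republic" is removed (same frequency as its superstring),
while "republic of" survives because its frequency differs.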
Support Platforms
=================
Tested platforms include: GNU/Linux, FreeBSD, NetBSD, MinGW32, Cygwin.
NOTICE:
If you use these programs on win32 platforms, please specify the file
name as the last argument on the command line. All options given
after the file name will be ignored! Therefore use `foo -a2 file'
instead of `foo file -a2'.
Currently the following programs are provided:
text2ngram
=========================================================================
Extract N-grams from a raw corpus using an improved version of the
Nagao 1994 algorithm. Optionally, the corpus can be parsed and saved to
disk (in Unicode) so that N-grams can be extracted later. Type
text2ngram -h for a short help message.
Here are some examples:
1. text2ngram -n3 file
Extract all word trigrams from a file (words are separated by blanks and tabs).
The output is sent to stdout.
2. text2ngram -n3 -m10 -f5 file
Extract word 3- to 10-grams whose frequency is >= 5 from a text file
and write the word N-grams to stdout.
3. text2ngram -c -n3 -m10 -f5 -F gbk -T gbk file
The same as above, except that this time we extract *character* N-grams
(-c option) from a Chinese file encoded in GBK. The output is also GBK. If you
do not specify the -F gbk or -T gbk options, the input/output encodings are
assumed to be UTF-8.
4. text2ngram -o corpus file
Parse the text file and generate a parsed corpus (corpus.ptable and
corpus.ltable) for later word N-gram extraction (using the `extractngram'
utility). For more on what happens, please refer to Nagao (1994).
5. text2ngram -F gbk -c -o corpus file
The same as 4, but prepares for *character* N-gram extraction from a file
encoded in GBK (Chinese).
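As a rough illustration of what examples 1 and 3 above compute, word and
character N-gram counting can be sketched in Python (a toy sketch only;
the actual tool uses the Nagao algorithm and handles encodings such as GBK):

```python
from collections import Counter

def word_ngrams(text, n):
    """Count word n-grams; words are split on blanks and tabs."""
    words = text.split()
    return Counter(" ".join(words[i:i + n])
                   for i in range(len(words) - n + 1))

def char_ngrams(text, n):
    """Count character n-grams; useful for Chinese text, where
    no explicit word boundaries exist."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

print(word_ngrams("the cat sat on the cat sat", 3))
print(char_ngrams("中华人民共和国", 2))
```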
extractngram
=========================================================================
Extract N-grams from the parsed table files generated by the `text2ngram' program.
Example:
1. extractngram -n3 -m5 -i corpus
will extract word 3-grams, 4-grams and 5-grams from corpus.ngram,
corpus.ptable and corpus.ltable previously saved with `text2ngram -o
corpus' and write the results to stdout. If -m5 is omitted, only
3-grams are extracted from the ngram table.
2. extractngram -c -T gbk -n3 -m5 -i corpus
The same as the above example, except that this time we extract *character*
N-grams (-c option) and output them in GBK encoding. The corpus.* files must
have been saved with `text2ngram -c -o corpus' beforehand.
strreduction
=========================================================================
Implements four Statistical Substring Reduction (SSR) algorithms.
The first algorithm has O(n^2) complexity and is of little use in real-world
applications; it is included for completeness. The latter three SSR algorithms
all have an ideal time complexity of O(n), so they can be used on very large
N-gram sets.
Examples:
1. strreduction -a2 < input > output
Perform word-level SSR on the input stream using algorithm 2 (the number
can be 1-4), and write the result to output. The input format is one
"word frequency" pair per line.
2. strreduction -a2 -c -F gbk -T gbk < input > output
The same as the above example, but this time we use character-level SSR on a
Chinese input stream encoded in GBK (-F gbk). The output is GBK too (-T gbk).
The input format is one "word frequency" pair per line.
3. strreduction -a2 -c -F gbk -T gbk -s -t -f 3 < input > output
The same as the above example, but with a reduction threshold of 3 instead
of the default value of 1 (-f 3). The output is also sorted (-s). Finally,
the SSR processing time is printed on stderr (-t).
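For illustration, the "word frequency" lines above can be parsed by taking
the frequency as the last whitespace-separated field, so that word N-grams
may themselves contain spaces (a sketch; the exact separator strreduction
expects is an assumption here):

```python
def read_pairs(lines):
    """Parse 'ngram frequency' lines; the frequency is the last
    whitespace-separated field, so the N-gram part may contain spaces."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        gram, freq = line.rsplit(None, 1)
        pairs.append((gram, int(freq)))
    return pairs

sample = ["People's republic of China 10", "People's republic 10"]
print(read_pairs(sample))
```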
For a detailed introduction to the SSR operation, please refer to:
"Statistical Substring Reduction in Linear Time"
Lv Xue-qiang, Zhang Le and Hu Junfeng
IJCNLP-04, Hainan Island, P.R. China, 2004
For building and installation instructions please see the INSTALL file.
Author:
Zhang Le <ejoy@xinhuanet.com>