A Kotlin project which extracts ngram counts from Wikipedia data dumps.
Download the latest jar from releases.
You can also clone the repository and build with Maven:
$ git clone https://github.com/TomerAberbach/wikipedia-ngrams.git
$ cd wikipedia-ngrams
$ mvn package
A fat jar called wikipedia-ngrams-VERSION-jar-with-dependencies.jar will be in a newly created target directory.
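The commands below refer to the jar simply as wikipedia-ngrams.jar. Assuming you copy or rename the fat jar, that could look like the following (VERSION is a placeholder for the actual version number):
$ cp target/wikipedia-ngrams-VERSION-jar-with-dependencies.jar wikipedia-ngrams.jar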
DISCLAIMER: Many of these commands will take a very long time to run.
Download the latest Wikipedia data dump using wget:
$ wget -np -nd -c -A 7z https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
Or using axel:
$ axel --num-connections=3 https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
To speed up the download, replace https://dumps.wikimedia.org with the mirror closest to you.
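For example, with a hypothetical mirror (MIRROR_HOST is a placeholder, and some mirrors use a different path layout), the wget command would become:
$ wget -np -nd -c -A 7z https://MIRROR_HOST/enwiki/latest/enwiki-latest-pages-articles.xml.bz2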
Once downloaded, extract the compressed data using a tool like lbzip2 and feed the resulting enwiki-latest-pages-articles.xml file into WikiExtractor.
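For the decompression step, assuming lbzip2 is installed, the command might look like:
$ lbzip2 -d enwiki-latest-pages-articles.xml.bz2
Then run WikiExtractor on the extracted XML file: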
$ python3 WikiExtractor.py --no_templates --json enwiki-latest-pages-articles.xml
This will output a large directory structure with root directory text.
Finally, run wikipedia-ngrams.jar with the desired ngram "n" (2 in this example) and the path to the directory output by WikiExtractor:
$ java -jar wikipedia-ngrams.jar 2 text
Note that you may need to increase the maximum heap size and/or disable the GC overhead limit.
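For example, the run above might be launched with a larger maximum heap and the GC overhead limit check disabled (the 8g heap size is only illustrative):
$ java -Xmx8g -XX:-UseGCOverheadLimit -jar wikipedia-ngrams.jar 2 text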
The contexts.txt and 2-grams.txt files will be in an out directory. contexts.txt caches the "sentences" in the Wikipedia data dump. To use this cache in your next run (with n = 3, for example), run the following command:
$ java -jar wikipedia-ngrams.jar 3 out/contexts.txt
The output files will not be sorted. Use a command-line tool like sort to do so.
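For example, a simple lexicographic sort of the 2-gram output (no particular column layout is assumed here) could be:
$ sort out/2-grams.txt > out/2-grams-sorted.txt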
Note that an OutOfMemoryError is not a legitimate issue to report. The burden is on the user to allocate enough heap space and to have enough RAM (consider allocating a larger swap file).
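On Linux, for example, a swap file might be added roughly like this (the 16G size is only illustrative):
$ sudo fallocate -l 16G /swapfile
$ sudo chmod 600 /swapfile
$ sudo mkswap /swapfile
$ sudo swapon /swapfile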