From 26f6e18eb117ca7b080d01acb453fd1c9742418d Mon Sep 17 00:00:00 2001
From: Christopher Manning
Date: Sat, 24 Oct 2015 09:52:28 -0700
Subject: [PATCH] Mention need for whitespace tokenization

---
 README | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README b/README
index 7100343..2f5c7ae 100644
--- a/README
+++ b/README
@@ -9,7 +9,7 @@ http://nlp.stanford.edu/projects/glove/
 This package includes four main tools:
 
 1) vocab_count
-Constructs unigram counts from a corpus, and optionally thresholds the resulting vocabulary based on total vocabulary size or minimum frequency count.
+Constructs unigram counts from a corpus, and optionally thresholds the resulting vocabulary based on total vocabulary size or minimum frequency count. This file should already consist of whitespace-separated tokens. Use something like the Stanford Tokenizer (http://nlp.stanford.edu/software/tokenizer.shtml) first on raw text.
 2) cooccur
 Constructs word-word cooccurrence statistics from a corpus. The user should supply a vocabulary file, as produced by 'vocab_count', and may specify a variety of parameters, as described by running './cooccur'.
 3) shuffle