From 26f6e18eb117ca7b080d01acb453fd1c9742418d Mon Sep 17 00:00:00 2001
From: Christopher Manning
Date: Sat, 24 Oct 2015 09:52:28 -0700
Subject: [PATCH] Mention need for whitespace tokenization

---
 README | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README b/README
index 7100343..2f5c7ae 100644
--- a/README
+++ b/README
@@ -9,7 +9,7 @@ http://nlp.stanford.edu/projects/glove/
 This package includes four main tools:
 
 1) vocab_count
-Constructs unigram counts from a corpus, and optionally thresholds the resulting vocabulary based on total vocabulary size or minimum frequency count.
+Constructs unigram counts from a corpus, and optionally thresholds the resulting vocabulary based on total vocabulary size or minimum frequency count. This file should already consist of whitespace-separated tokens. Use something like the Stanford Tokenizer (http://nlp.stanford.edu/software/tokenizer.shtml) first on raw text.
 2) cooccur
 Constructs word-word cooccurrence statistics from a corpus. The user should supply a vocabulary file, as produced by 'vocab_count', and may specify a variety of parameters, as described by running './cooccur'.
 3) shuffle