Instructions
https://www.evernote.com/shard/s103/sh/0dcaa647-07bb-4920-b32c-d89011a3a001/7ad9d6653b414e69
There is now a pre-computed word sense inventory with context clues available at frink:/home/simon/wikidata/wiki-processed3-coocs-malt_deps-collapsed_preps-nounswc__w10000__s0__twf2__tw1000__tf2__LMI__r3__p1000-SimPruned__Clusters__e0__MCL__N100__n005__gamma1.4__loopGain0.0__100polysem-words__WithCoocs__twf2
for the 100 words listed in frink:/home/simon/wikidata/100polysem-words
Additional dependency clues (the previous context clues are all sentence-wide co-occurrences) can be found at frink:/home/simon/wikidata/wiki-processed3-coocs-malt_deps-collapsed_preps-nounswc__w10000__s0__twf2__tw1000__tf2__LMI__r3__p1000-SimPruned__Clusters__e0__MCL__N100__n005__gamma1.4__loopGain0.0__100polysem-words__WithDeps__twf2
The format is as follows:
word sense-id cluster-label sense-freq cluster-word1:sim cluster-word2:sim ... clue1:score clue2:score ...
where "sense-id" is simply the index of the sense (starting from 0), "cluster-label" is the dominating cluster word according to Chinese Whispers or MCL, "sense-freq" is the average frequency of the cluster words, "sim" is the similarity of the cluster word to the head word. The last column is the list of context clues with scores for this sense.
With this information, it is straightforward to build your own classifier / sense tagger: for a given context (a lemmatized set of sentence words) and each sense, multiply the sense-freq of the sense with the scores of all context words according to this sense entry. If a context word does not appear in the context clue list, its score is 0; therefore, I added a smoothing of 0.00001 to every context clue score. Instead of multiplying, you can also sum up the logarithms of the values. The sense with the highest score is then assigned.
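A minimal sketch of such a tagger in Python (not the project's own code), assuming tab-separated major columns and space-separated "item:score" pairs inside the cluster-word and clue columns; adjust the splitting to the actual delimiters of the files:

```python
import math
from collections import defaultdict

SMOOTHING = 0.00001  # added to every clue score, as described above

def parse_inventory(path):
    """Parse inventory entries into word -> list of (sense_id, sense_freq, {clue: score})."""
    senses = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Assumed column layout: word, sense-id, cluster-label,
            # sense-freq, cluster words, context clues (tab-separated).
            word, sense_id, _label, freq, _cluster, clues = line.rstrip("\n").split("\t")
            clue_scores = {}
            for pair in clues.split():
                clue, _, score = pair.rpartition(":")  # rpartition: clues may contain ":" themselves
                clue_scores[clue] = float(score)
            senses[word].append((sense_id, float(freq), clue_scores))
    return senses

def tag(word, context_lemmas, senses, use_prior=True):
    """Assign the sense with the highest score for the given context,
    summing logarithms instead of multiplying raw scores."""
    best_sense, best_score = None, float("-inf")
    for sense_id, freq, clues in senses.get(word, []):
        score = math.log(freq) if use_prior else 0.0
        for lemma in set(context_lemmas):  # the context is a *set* of lemmas
            score += math.log(clues.get(lemma, 0.0) + SMOOTHING)
        if score > best_score:
            best_sense, best_score = sense_id, score
    return best_sense
```

Summing logarithms rather than multiplying raw scores avoids floating-point underflow for long contexts; the `use_prior` flag corresponds to the "y" argument of the WSD class described below.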
To achieve higher precision, dependency features can also be extracted for the context, with their scores taken from the second clue file above (...WithDeps__twf2). They must be in the format "amod(@@,wild)" (e.g. for "wild chicken"), without whitespace.
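A small helper for producing such features, assuming you already have (relation, governor, dependent) triples from some dependency parser (the triple representation, and marking the head when it appears in the dependent slot, are assumptions for illustration):

```python
def dep_features(head, triples):
    """Turn (relation, governor, dependent) triples into clue strings such as
    "amod(@@,wild)", with the head word replaced by "@@" (no whitespace)."""
    feats = []
    for rel, gov, dep in triples:
        if gov == head:
            feats.append(f"{rel}(@@,{dep})")
        elif dep == head:  # assumption: the head is marked in either slot
            feats.append(f"{rel}({gov},@@)")
    return feats
```

For example, `dep_features("chicken", [("amod", "chicken", "wild")])` returns `["amod(@@,wild)"]`; these strings can then be looked up in the ...WithDeps__twf2 clue lists exactly like the co-occurrence clues above.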
You can of course re-use my existing code, but it is meant to sense-tag a large number of instances at once (for the evaluations) and cannot be run one instance at a time, e.g. from a web interface or the command line. Like I said, building a classifier that fits your needs is straightforward and a matter of parsing the relevant files (in the format above), running a lemmatizer and optionally a dependency parser, and multiplying a few numbers. To quickly find a word entry in the relatively large files, you can either compute an index (word to byte offset) first or sort the file and use binary search, as sketched below.
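A sketch of the index variant, again under the assumption that the word is the first tab-separated field of each line:

```python
def build_offset_index(path):
    """Map each word to the byte offsets of its lines (one per sense),
    so entries can be fetched with seek() instead of a full scan."""
    index = {}
    offset = 0
    with open(path, "rb") as f:
        for line in f:
            word = line.split(b"\t", 1)[0].decode("utf-8")
            index.setdefault(word, []).append(offset)
            offset += len(line)
    return index

def read_entries(path, index, word):
    """Yield all inventory lines for `word` via direct seeks."""
    with open(path, "rb") as f:
        for offset in index.get(word, []):
            f.seek(offset)
            yield f.readline().decode("utf-8").rstrip("\n")
```

The index is built in a single pass over the file; afterwards each lookup costs one seek plus one readline per sense entry.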
If you want to use my code for sense-tagging, it is the "WSD" class in the noun-sense-induction-scala project. You can run it as follows:
spark-submit --num-executors 20 --queue shortrunning --master yarn-cluster --class WSD --driver-memory 7g --executor-memory 7g --driver-java-options "-Dspark.storage.memoryFraction=0.1 -Dspark.shuffle.memoryFraction=0.1 -Dspark.core.connection.auth.wait.timeout=3600 -Dspark.core.connection.ack.wait.timeout=3600 -Dspark.akka.timeout=3600 -Dspark.storage.blockManagerSlaveTimeoutMs=360000 -Dspark.worker.timeout=360000 -Dspark.akka.retry.wait=360000 -Dspark.task.maxFailures=1 -Dspark.serializer=org.apache.spark.serializer.KryoSerializer" target/scala-2.10/noun-sense-induction_2.10-0.0.1.jar <clues-coocs> <clues-deps> <instances> <output> 0.00001 Product y
where <clues-coocs> is a path on HDFS to the first file (...WithCoocs__twf2) and <clues-deps> a path to the second file (...WithDeps__twf2). <output> is the output path to write the result to (also on HDFS). "0.00001" is the smoothing, "Product" indicates that scores are multiplied, and the "y" (for yes) tells the classifier to take the "prior" score into account, i.e. the average cluster word frequency.
<instances> is the path of a file on HDFS containing the instances (to be sense-tagged) in the following format:
word instance-id coocs deps
where instance-id is simply a unique ID for every instance, coocs is the sentence/context as a lemmatized set of words, and deps is the comma-separated list of dependency features of the head word (e.g. "amod(@@,wild)").
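An invented example instance (tab-separated, tabs shown here as spaces):

```
chicken  i42  the wild chicken run across the road  amod(@@,wild),nsubj(run,@@)
```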