- I can't process Arabic / Vietnamese with the standalone version. Why?
- Switching the domain to "colloquial" does not improve the results when processing colloquial texts compared to using "news" as document type. What am I doing wrong?
- HeidelTime is too slow for my application. Anything I can do?
- Chinese processing under Windows
# Frequently asked questions
This page contains frequently asked questions as well as notes about how HeidelTime functions.
I can't process Arabic / Vietnamese with the standalone version. Why?
The command to run the HeidelTime standalone version for Arabic and Vietnamese differs from the command for the other languages, so specifying the language with the "-l" parameter is not enough. First, the required tagger has to be installed: the Stanford POS Tagger for Arabic or JVnTextPro for Vietnamese. The respective paths have to be specified in the config.props file. Then, a classpath variable has to be set and passed to the command that runs HeidelTime:
Under Unix/Linux/Mac OS X:
- "export HT_CP="<$1>:<$2>:<$3>:$CLASSPATH""
Under Windows:
- "set HT_CP=<$1>;<$2>;<$3>;%CLASSPATH%"
where
- <$1> is the path to JVnTextPro’s bin folder, e.g., /opt/jvntextpro/bin/,
- <$2> is the path to StanfordPOSTagger’s .jar file, e.g., /opt/stanfordpostagger/stanford-postagger.jar and
- <$3> is de.unihd.dbs.heideltime.standalone.jar
Then, the following command has to be used (under Windows, write %HT_CP% instead of $HT_CP):
- "java -cp $HT_CP de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone [options]"
This is also explained in Manual.pdf, Section 4.1 (Command Line Usage), under "Extra steps for Arabic and Vietnamese tagging".
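The Unix steps above can be sketched as a small shell snippet. The function name build_ht_cp is hypothetical, and the paths are the example locations from the text. Unlike the plain export shown above, this variant appends an existing CLASSPATH only when it is non-empty, which avoids a trailing ":" in the result. The final command is printed rather than executed.

```shell
# Join the three entries with ":" and append an existing CLASSPATH only if it is non-empty.
# $1: JVnTextPro bin folder, $2: Stanford POS Tagger jar, $3: HeidelTime standalone jar
build_ht_cp() {
  printf '%s:%s:%s%s\n' "$1" "$2" "$3" "${CLASSPATH:+:$CLASSPATH}"
}

# Example locations taken from the text above; adjust them to your installation.
HT_CP="$(build_ht_cp /opt/jvntextpro/bin/ /opt/stanfordpostagger/stanford-postagger.jar de.unihd.dbs.heideltime.standalone.jar)"
export HT_CP

# Printed rather than executed, so the sketch runs without the tools installed.
echo "java -cp $HT_CP de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone [options]"
```

Replace the echo with the actual java call once config.props points to the installed taggers.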
Switching the domain to "colloquial" does not improve the results when processing colloquial texts compared to using "news" as document type. What am I doing wrong?
To process English colloquial text such as short messages or tweets, the domain (document type, "-t" parameter) has to be set to "colloquial". In addition, the language ("-l" parameter) has to be set to "englishcoll" instead of "english". Only then are the colloquial synonyms for English temporal terms used for temporal tagging, e.g., "tmr" for "tomorrow".
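A minimal sketch of such a call is shown below. The input file name tweets.txt is hypothetical, and starting the standalone jar with "java -jar" from the current directory, with the input file as first argument, is an assumption. The parameter values are spelled as in the text above; check the Manual for the exact spelling the standalone version accepts. The command is printed rather than executed.

```shell
# Hypothetical input file; replace it with your own document.
INPUT_FILE="tweets.txt"

# Both settings are needed: the colloquial language resources (-l)
# and the colloquial document type (-t).
# Assumption: the jar is in the current directory and takes the input file first.
CMD="java -jar de.unihd.dbs.heideltime.standalone.jar $INPUT_FILE -l englishcoll -t colloquial"

# Printed rather than executed, so the sketch runs without HeidelTime installed.
echo "$CMD"
```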
HeidelTime is too slow for my application. Anything I can do?
Instead of full linguistic preprocessing, the AllLanguagesTokenizer can be used to create sentence and token annotations (without part-of-speech information). This increases the processing speed, but will probably lead to worse temporal tagging results because all part-of-speech constraints then evaluate to "false". In the standalone version, the pos parameter has to be set to "no". In the UIMA version, the Analysis Engine AllLanguagesTokenizer has to be used instead of, e.g., the TreeTagger wrapper.
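For the standalone version, this could look like the following sketch. The file name document.txt is hypothetical, and writing the pos parameter as "-pos" (by analogy with "-l" and "-t") as well as the "java -jar" invocation are assumptions. The command is printed rather than executed.

```shell
# Hypothetical input file; replace it with your own document.
INPUT_FILE="document.txt"

# Setting the pos parameter to "no" skips part-of-speech tagging:
# faster, but part-of-speech constraints in the rules can no longer match.
# Assumption: the parameter is passed as "-pos" on the command line.
CMD="java -jar de.unihd.dbs.heideltime.standalone.jar $INPUT_FILE -pos no"

# Printed rather than executed, so the sketch runs without HeidelTime installed.
echo "$CMD"
```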
Chinese processing under Windows
Due to improper codepage support in the Windows command prompt, our standard processing method (TreeTagger) for documents in Chinese will not work under Windows with the HeidelTime Kit or the Standalone version.
As an alternative, you can set up the StanfordPOSTaggerWrapper as a preprocessing engine for both the Standalone and the Kit version. The way to do that for the Standalone version is described in Section 4 of the Manual.pdf under the "Extra steps" headline.
For a kit pipeline, you simply need to tie in the StanfordPOSTaggerWrapper Analysis Engine supplied with the kit ahead of the HeidelTime AE.
For both of these variants, you will need to specify one of the Chinese model files that ship with the StanfordPOSTaggerWrapper full version.