Jannik Strötgen edited this page Sep 20, 2016 · 3 revisions

# Frequently asked questions

This page contains frequently asked questions as well as notes about how HeidelTime functions.

I can't process Arabic / Vietnamese with the standalone version. Why?

The command to run the HeidelTime standalone version for Arabic and Vietnamese differs from the command for the other languages, so specifying the language with the "-l" parameter is not enough. In addition, the Stanford POS Tagger (for Arabic) or JVnTextPro (for Vietnamese) has to be installed, and the respective paths have to be specified in the config.props file. Then, a classpath variable has to be set and passed to the command that runs HeidelTime:

Under Unix/Linux/Mac OS X:

  • "export HT_CP="<$1>:<$2>:<$3>:$CLASSPATH""

Under Windows:

  • "set HT_CP=<$1>;<$2>;<$3>;%CLASSPATH%"

where

  • <$1> is the path to JVnTextPro’s bin folder, e.g., /opt/jvntextpro/bin/,
  • <$2> is the path to StanfordPOSTagger’s .jar file, e.g., /opt/stanfordpostagger/stanford-postagger.jar, and
  • <$3> is de.unihd.dbs.heideltime.standalone.jar.

Then, the following command has to be used (under Windows, write %HT_CP% instead of $HT_CP):

  • "java -cp $HT_CP de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone [options]"

This is also explained in the Manual.pdf, Section 4.1 (Command Line Usage), under "Extra steps for Arabic and Vietnamese tagging".
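The steps above can be sketched as a small Unix shell snippet. It is only an illustration: the two install paths are the example paths from the list above and will differ on your system, the variable names other than HT_CP are made up for readability, and the HeidelTime jar is assumed to sit in the current directory. The snippet prints the resulting command instead of running it, so it works without Java installed.

```shell
# Example locations taken from the text above; adjust them to your installation.
JVNTEXTPRO_BIN="/opt/jvntextpro/bin/"
STANFORD_JAR="/opt/stanfordpostagger/stanford-postagger.jar"
# Assumption: the standalone jar is in the current working directory.
HEIDELTIME_JAR="de.unihd.dbs.heideltime.standalone.jar"

# Build the classpath variable. ${CLASSPATH:-} expands to an empty string
# if CLASSPATH is not set, so the snippet also works in a fresh shell.
export HT_CP="${JVNTEXTPRO_BIN}:${STANFORD_JAR}:${HEIDELTIME_JAR}:${CLASSPATH:-}"

# Print the command that would be run; replace [options] with your own
# options (e.g., the "-l" language parameter) before executing it.
echo "java -cp \"$HT_CP\" de.unihd.dbs.heideltime.standalone.HeidelTimeStandalone [options]"
```

Quoting "$HT_CP" in the final command protects against paths that contain spaces.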

Switching the domain to "colloquial" does not improve the results when processing colloquial texts compared to using "news" as document type. What am I doing wrong?

To process English colloquial text such as short messages or tweets, the domain (document type, "-t" parameter) has to be set to "colloquial". In addition, the language ("-l" parameter) has to be set to "englishcoll" instead of "english". Only then are the colloquial synonyms for English temporal terms used for temporal tagging, e.g., "tmr" for "tomorrow".
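As a rough illustration, the two settings can be combined in one call as shown below. The input file name is hypothetical, and the invocation form (jar, then input file, then options) is an assumption that should be checked against the Manual.pdf; only the "-l" and "-t" parameters and their values come from the answer above. The snippet prints the command rather than executing it.

```shell
# Hypothetical jar location and input file; adjust both to your setup.
HEIDELTIME_JAR="de.unihd.dbs.heideltime.standalone.jar"
INPUT_FILE="tweets.txt"

# Both settings have to change together: the language AND the document type.
HT_LANG="englishcoll"
HT_TYPE="colloquial"

# Assumed invocation form; verify against the manual before running.
HT_CMD="java -jar $HEIDELTIME_JAR $INPUT_FILE -l $HT_LANG -t $HT_TYPE"
echo "$HT_CMD"
```

Setting only "-t colloquial" while leaving the language at "english" is the configuration described in the question, which is why it shows no improvement.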

HeidelTime is too slow for my application. Anything I can do?

Instead of full preprocessing with a part-of-speech tagger, the AllLanguagesTokenizer can be used to create sentence and token annotations (without part-of-speech information). This increases the processing speed, but will probably lead to worse temporal tagging results because all part-of-speech constraints in the rules evaluate to "false". In the standalone version, the pos parameter has to be set to "no". In the UIMA version, the Analysis Engine AllLanguagesTokenizer has to be used instead of, e.g., the TreeTagger wrapper.
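For the standalone version, a call without part-of-speech tagging might look like the sketch below. The flag spelling "-pos" is inferred from the "pos parameter" mentioned above and is an assumption, as are the input file name and the invocation form; please confirm them in the Manual.pdf. The command is printed, not executed.

```shell
# Hypothetical jar location and input file; adjust both to your setup.
HEIDELTIME_JAR="de.unihd.dbs.heideltime.standalone.jar"
INPUT_FILE="document.txt"

# Assumption: the pos parameter is passed as "-pos"; "no" disables POS tagging.
HT_POS="no"

HT_CMD="java -jar $HEIDELTIME_JAR $INPUT_FILE -pos $HT_POS"
echo "$HT_CMD"
```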

Chinese processing under Windows

Due to improper codepage support in the Windows command prompt, our standard processing method (TreeTagger) for documents in Chinese will not work under Windows with the HeidelTime Kit or the Standalone version.

As an alternative, you can set up the StanfordPOSTaggerWrapper as a preprocessing engine for both the Standalone and the Kit version. How to do that for the Standalone version is described in Section 4 of the Manual.pdf, under the "Extra steps" headline.

For a kit pipeline, you only need to add the StanfordPOSTaggerWrapper Analysis Engine supplied with the kit to the pipeline ahead of the HeidelTime AE.

For both of these variants, you will need to specify one of the Chinese model files that ship with the StanfordPOSTaggerWrapper full version.