The Lucene Documents of the Corpus Index are composed of the following Lucene Fields: URI, URI COUNT, CONTEXT, TYPE, TITLE, IMAGE, and SAMEAS.

In this section we describe how to build the Corpus Index.

Download DBpedia and Wikipedia datasets

The first step is to download the following files:

  • DBpedia Labels: used to obtain a complete set of all DBpedia URIs.
  • DBpedia Redirects: used to identify the original URI of a DBpedia entity, excluding the Wikipedia redirection pages.
  • DBpedia Disambiguation: used to generate the list of "bad" URIs, i.e. disambiguation pages that must not be indexed as entities.
  • DBpedia Ontology Types: used to create type-specific datasets useful for the enrichment stage of TellMeFirst.
  • DBpedia Interlanguage (required for the Italian version): used in the enrichment stage, since English URIs are needed to retrieve new information from other Linked Data repositories.
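The DBpedia datasets listed above are typically distributed as N-Triples (.nt) files. As a rough illustration, the small Scala sketch below parses a redirects entry into a (source, target) pair; the URIs in the sample line are examples only, since the exact contents depend on the DBpedia release you download.

// Sketch: parsing one line of a DBpedia redirects dump (N-Triples).
// The URIs below are examples, not taken from an actual dump file.
object NTriplesPeek {
  private val Triple = """<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s+\.""".r

  def parseRedirect(line: String): Option[(String, String)] = line match {
    case Triple(source, _, target) => Some((source, target))
    case _                         => None
  }

  def main(args: Array[String]): Unit = {
    val sample = "<http://dbpedia.org/resource/USA> " +
      "<http://dbpedia.org/ontology/wikiPageRedirects> " +
      "<http://dbpedia.org/resource/United_States> ."
    println(parseRedirect(sample)) // Some((.../USA, .../United_States))
  }
}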

In addition to these datasets you need to download the Wikipedia pages dump. The paragraphs of the Wikipedia pages are stored in the CONTEXT field of a Lucene Document and are used in the classification process. For building the TellMeFirst indexes we chose the following versions of the Wikipedia dump.

  • enwiki-20130805-pages-articles-multistream.xml.bz2 (English version);
  • itwiki-20131205-pages-articles-multistream.xml.bz2 (Italian version).

The second step is to copy the downloaded files (the DBpedia datasets and the Wikipedia dumps) into the following directories, according to the specific language:

/data/tellmefirst/dbpedia/en/original
/data/tellmefirst/dbpedia/it/original

If you want to change the original path of these input datasets, you can modify the following configuration files:

/conf/indexing.tmf.en.properties
/conf/indexing.tmf.it.properties

With these configuration files you can also set the paths of the intermediate outputs (necessary to produce the final Lucene index) and other parameters, such as the type of Analyzer to use in the indexing phase.
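Since these are plain Java .properties files, they can be inspected with the standard Properties API, as the sketch below shows. The property key used here is hypothetical and only illustrates the idea; the real key names are the ones expected by the indexing launchers.

import java.io.FileInputStream
import java.util.Properties

// Sketch: loading one of the indexing configuration files listed above.
object ReadIndexingConfig {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.load(new FileInputStream("conf/indexing.tmf.en.properties"))
    // Hypothetical key, for illustration only: the actual property names
    // are those defined by the TellMeFirst/Spotlight indexing launchers.
    println(props.getProperty("org.dbpedia.spotlight.data.wikipediaDump"))
  }
}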

Scripts for building the Corpus Index

Due to a bug in the current version of DBpedia Spotlight, if you want to build the English version of the TellMeFirst indexes you have to make some small changes to the code. In particular, you should uncomment the lines starting from line 127 in

core/src/main/scala/org/dbpedia/spotlight/io/FileOccurrenceSource.scala

and line 69 in

core/src/main/scala/org/dbpedia/spotlight/model/SpotlightConfiguration

Once these changes have been made, in order to build the Corpus Index you have to run the following scripts:

/bin/index.tmf.en.sh
/bin/index.tmf.it.sh

Each script contains a set of Maven commands for progressively building the Corpus Index of TellMeFirst. We recommend running one command at a time, making sure that each job completes successfully. However, with the development work carried out by the Nexa Center and the datasets specified in this guide, the process should not run into problems during the intermediate stages.

Below is a brief description of the Maven commands available in these scripts:

mvn exec:java -e -Dexec.mainClass="org.dbpedia.spotlight.lucene.index.external.utils.TMFRedirectCleaner" -Dexec.args=$INDEX_CONFIG_FILE

The English dataset of redirection pages contains entries in which two pages redirect to each other (and the 2016 English version of the redirects dataset has a similar problem). This bidirectional redirection creates a loop that prevents the extractor from identifying the root URI. The TMFRedirectCleaner class removes these entries from the redirects dataset in order to avoid the loop (required for the English version).
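The following is a minimal sketch of the idea behind this cleaning step, not the actual TMFRedirectCleaner implementation: assuming the redirects have already been parsed into (source, target) pairs, every pair whose inverse is also present in the dataset is dropped.

// Sketch only: removing bidirectional ("looping") redirects from an
// in-memory list of (source, target) pairs.
object RedirectLoopFilter {
  def removeLoops(redirects: Seq[(String, String)]): Seq[(String, String)] = {
    val asSet = redirects.toSet
    // Drop every pair (a, b) for which (b, a) also exists: such entries
    // make it impossible to follow the chain up to a root URI.
    redirects.filterNot { case (a, b) => asSet.contains((b, a)) }
  }

  def main(args: Array[String]): Unit = {
    val sample = Seq(
      ("Foo", "Bar"), // Foo redirects to Bar ...
      ("Bar", "Foo"), // ... and Bar redirects back to Foo: a loop
      ("Baz", "Bar")  // a normal redirect, kept as-is
    )
    removeLoops(sample).foreach(println) // prints only (Baz,Bar)
  }
}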

mvn exec:java -e -Dexec.mainClass="org.dbpedia.spotlight.lucene.index.external.utils.TMFWikiDumpCleaner" -Dexec.args=$INDEX_CONFIG_FILE

The DBpedia Extraction Framework used in version 0.6 of DBpedia Spotlight is not able to parse some Wikipedia namespaces, which identify different types of pages created by the MediaWiki software. The methods implemented in the TMFWikiDumpCleaner class correct this issue.
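As a sketch of the idea, the cleaner can be thought of as a filter on page titles; the namespace prefixes below are purely illustrative, since the actual namespaces handled by TMFWikiDumpCleaner are not listed in this page.

// Sketch only: dropping pages whose titles belong to namespaces that the
// extraction framework cannot parse. The prefixes are examples.
object NamespaceFilter {
  private val unsupported = Seq("Module:", "Book:", "TimedText:")

  def keepPage(title: String): Boolean =
    !unsupported.exists(prefix => title.startsWith(prefix))

  def main(args: Array[String]): Unit = {
    println(keepPage("Turin"))       // true: a regular article
    println(keepPage("Module:Math")) // false: unsupported namespace
  }
}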

mvn scala:run -Dlauncher=ExtractCandidateMap "-DjavaOpts.Xmx=$JAVA_XMX" "-DaddArgs=$INDEX_CONFIG_FILE"

By combining the labels, redirects, and disambiguation pages datasets, the ExtractCandidateMap class extracts the entities to be stored in the Lucene index.
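Conceptually, the candidate entities are the URIs found in the labels dataset minus the redirect pages and minus the disambiguation pages. The sketch below illustrates this set operation on in-memory data; it is not the actual ExtractCandidateMap launcher.

// Sketch only: the "good" entity URIs as a set difference.
object CandidateEntities {
  def conceptUris(labelled: Set[String],
                  redirectSources: Set[String],
                  disambiguationPages: Set[String]): Set[String] =
    labelled -- redirectSources -- disambiguationPages

  def main(args: Array[String]): Unit = {
    val labelled  = Set("United_States", "USA", "Mercury_(disambiguation)")
    val redirects = Set("USA")                      // USA redirects elsewhere
    val disambig  = Set("Mercury_(disambiguation)") // not a real entity
    println(conceptUris(labelled, redirects, disambig)) // Set(United_States)
  }
}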

mvn exec:java -e -Dexec.mainClass="org.dbpedia.spotlight.lucene.index.external.utils.TMFUriDecoder" -Dexec.args=$INDEX_CONFIG_FILE

The URIs extracted from the DBpedia datasets are represented using percent-encoding, while the Wikipedia dump stores its titles in a different form. For this reason, it is necessary to normalize the URIs extracted from DBpedia so that they can be correctly matched against the entries of the Wikipedia dump (required for the English version).
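A minimal sketch of this kind of normalization, using the standard java.net.URLDecoder and assuming UTF-8 percent-encoding (this is not the actual TMFUriDecoder class):

import java.net.URLDecoder

// Sketch only: turning a percent-encoded DBpedia resource name into a form
// comparable with the one used in the Wikipedia dump.
object UriNormalizer {
  def normalize(dbpediaUri: String): String =
    URLDecoder.decode(dbpediaUri, "UTF-8")

  def main(args: Array[String]): Unit = {
    println(normalize("Bras%C3%ADlia")) // prints "Brasília"
  }
}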

mvn scala:run -Dlauncher=ExtractOccsFromWikipedia "-DjavaOpts.Xmx=$JAVA_XMX" "-DaddArgs=$INDEX_CONFIG_FILE|$DBPEDIA_WORKSPACE/output/occs.tsv"

By mapping the DBpedia entities to the entities of the Wikipedia dump, this command extracts the Wikipedia paragraphs associated with each DBpedia entity (to be saved in the Lucene Field CONTEXT).

mvn scala:run -Dlauncher=IndexMergedOccurrences "-DjavaOpts.Xmx=$JAVA_XMX" "-DaddArgs=$INDEX_CONFIG_FILE|$DBPEDIA_WORKSPACE/output/occs.uriSorted.tsv"

Builds the intermediate Corpus Index, starting from the occurrences created in the previous step; this index contains the following Lucene Fields: URI, URI COUNT, and CONTEXT.
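For reference, assuming a Lucene 3.x API (the line used by the DBpedia Spotlight Lucene indexers), a document with these fields can be created roughly as follows. This is a sketch only: the exact field names, stores, and analyzers of the real index may differ.

import org.apache.lucene.document.{Document, Field}

// Sketch only: building a Lucene Document with the fields mentioned above.
object CorpusDocumentSketch {
  def build(uri: String, uriCount: Int, context: String): Document = {
    val doc = new Document()
    doc.add(new Field("URI", uri, Field.Store.YES, Field.Index.NOT_ANALYZED))
    doc.add(new Field("URI_COUNT", uriCount.toString, Field.Store.YES, Field.Index.NOT_ANALYZED))
    doc.add(new Field("CONTEXT", context, Field.Store.YES, Field.Index.ANALYZED))
    doc
  }
}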

mvn scala:run -Dlauncher=AddTypesToIndex "-DjavaOpts.Xmx=$JAVA_XMX" "-DaddArgs=$INDEX_CONFIG_FILE|$DBPEDIA_WORKSPACE/output/index"

Adds the Lucene Field TYPE to the Corpus Index.

mvn scala:run -Dlauncher=AddTitlesToIndex "-DjavaOpts.Xmx=$JAVA_XMX" "-DaddArgs=$INDEX_CONFIG_FILE|$DBPEDIA_WORKSPACE/output/index-withTypes"

Adds the Lucene Field TITLE to the Corpus Index.

mvn scala:run -Dlauncher=AddImagesToIndex "-DjavaOpts.Xmx=$JAVA_XMX" "-DaddArgs=$INDEX_CONFIG_FILE|$DBPEDIA_WORKSPACE/output/index-withTypesTitles"

Adds the Lucene Field IMAGE to the Corpus Index.

mvn scala:run -Dlauncher=AddSameAsToIndex "-DjavaOpts.Xmx=$JAVA_XMX" "-DaddArgs=$INDEX_CONFIG_FILE|$DBPEDIA_WORKSPACE/output/index-withTypesTitlesImages"

Adds the Lucene Field SAMEAS to the Corpus Index (required for the Italian version).