Domain Corpus Index

Giuseppe Futia edited this page Dec 31, 2015 · 35 revisions

The Domain Corpus Index contains a subset of the DBpedia entities that are indexed in the Corpus Index. It can be used to classify documents related to a specific domain (cinema, literature, tourism, etc.).

In this section I will explain how to configure the TellMeFirst Index Builder in order to create an index useful for your target domain.

Same fields as the Corpus Index

The Domain Corpus Index has the same structure as the Corpus Index: each Lucene Document is composed of the same fields.

The number of entities indexed in the Domain Corpus Index is lower than the number of entities in the Corpus Index, because they cover only a specific domain. To generate this index, start by following the instructions available in the Download DBpedia and Wikipedia datasets section.

In this documentation I will describe the components (configuration properties and scripts) that you should define in order to create a domain index for classifying your documents.

The main part of the code is contained in the TMFDomainEngine class, which is driven by domain SPARQL queries and domain REST services and creates a new URI list file.

SPARQL queries configuration

A suitable way to identify the DBpedia entities that belong to a specific domain is to run ad hoc SPARQL queries. These queries exploit the ontology classes used in DBpedia (e.g., YAGO and UMBEL) to describe the features of an entity.

You need to specify the SPARQL queries in a JSON configuration file for each language, at the following paths:

/dbpedia-spotlight/conf/external/domain.en.json
/dbpedia-spotlight/conf/external/domain.it.json

You can change the location of these JSON files in the properties files available here:

/dbpedia-spotlight/conf/indexing.tmf.domain.en.properties
/dbpedia-spotlight/conf/indexing.tmf.domain.it.properties

The property that you need to update is:

tellmefirst.domain.conf
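As a minimal sketch of how this property could be looked up, the following Java snippet loads a properties source with java.util.Properties and returns the value of tellmefirst.domain.conf. The DomainConfLocator class and the sample value are hypothetical and are not part of the TellMeFirst code base; the actual Index Builder may read its configuration differently.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical helper (not part of TellMeFirst): looks up the domain JSON location. */
final class DomainConfLocator {

    /** Name of the property documented above. */
    static final String DOMAIN_CONF_KEY = "tellmefirst.domain.conf";

    /** Reads a properties source and returns the configured JSON path. */
    static String domainConfPath(Reader propertiesSource) {
        Properties props = new Properties();
        try {
            props.load(propertiesSource);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String path = props.getProperty(DOMAIN_CONF_KEY);
        if (path == null || path.trim().isEmpty()) {
            throw new IllegalStateException("Missing property: " + DOMAIN_CONF_KEY);
        }
        return path.trim();
    }

    public static void main(String[] args) {
        // Inline stand-in for indexing.tmf.domain.en.properties; the value is an assumption.
        Reader source = new StringReader(
                "tellmefirst.domain.conf = conf/external/domain.en.json\n");
        System.out.println(domainConfPath(source));
    }
}
```

Failing fast on a missing property is a design choice of this sketch: it surfaces a misconfigured properties file before any query is run.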

The JSON file has the following structure (the SPARQL queries available below cover the tourism domain):

[{
	"description": "Get all places from DBpedia",
	"endpoint": "http://dbpedia.org/sparql",
	"query": "SELECT DISTINCT ?entity WHERE { ?entity a <http://dbpedia.org/ontology/Place>} LIMIT 10000",
	"domain": "POIs",
	"language": "en",
	"baseuri": "http://dbpedia.org/resource/"
}, {
	"description": "Get all spatial things from DBpedia",
	"endpoint": "http://dbpedia.org/sparql",
	"query": "SELECT DISTINCT ?entity WHERE { ?entity a <http://www.w3.org/2003/01/geo/wgs84_pos#SpatialThing>} LIMIT 10000",
	"domain": "POIs",
	"language": "en",
	"baseuri": "http://dbpedia.org/resource/"
}]

The JSON file that configures the SPARQL queries is an array of objects, each composed of the following fields:

  • description: the description of the results obtained through the query;
  • endpoint: the endpoint on which the query is performed;
  • query: the actual query;
  • domain: the domain covered by the query;
  • language: the language of the resources obtained through the query;
  • baseuri: the base URI of the resources obtained through the query.
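To make the role of these fields more concrete, here is a hedged Java sketch that models one configuration entry and turns it into a GET request URI, passing the query in the "query" parameter defined by the SPARQL Protocol. The DomainQuery class and its toRequestUri method are hypothetical names introduced for illustration; the source does not show how TMFDomainEngine actually sends its queries, and JSON parsing is left out because it would need an external library.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLEncoder;

/** Hypothetical value class mirroring one entry of the JSON array. */
final class DomainQuery {
    final String description;
    final String endpoint;
    final String query;
    final String domain;
    final String language;
    final String baseUri;

    DomainQuery(String description, String endpoint, String query,
                String domain, String language, String baseUri) {
        this.description = description;
        this.endpoint = endpoint;
        this.query = query;
        this.domain = domain;
        this.language = language;
        this.baseUri = baseUri;
    }

    /** Builds the GET request URI: the endpoint plus the URL-encoded "query" parameter. */
    URI toRequestUri() {
        try {
            return URI.create(endpoint + "?query=" + URLEncoder.encode(query, "UTF-8"));
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        DomainQuery places = new DomainQuery(
                "Get all places from DBpedia",
                "http://dbpedia.org/sparql",
                "SELECT DISTINCT ?entity WHERE { ?entity a "
                        + "<http://dbpedia.org/ontology/Place>} LIMIT 10000",
                "POIs", "en", "http://dbpedia.org/resource/");
        // A client would send this URI with a suitable Accept header for the result format.
        System.out.println(places.toRequestUri());
    }
}
```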

The repository contains an example of this JSON file covering the tourism domain in English and in Italian.

The output of each query is available as a list of URIs in the directory:

/data/tellmefirst/dbpedia/en/output/domain

Finally, the TMFDomainEngine merges the results of the SPARQL queries.
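The merge step can be pictured with the following Java sketch, which combines several URI lists into one while dropping duplicates and blank entries. UriListMerger is a hypothetical class written for this page; whether TMFDomainEngine deduplicates or preserves order in this way is an assumption, since the source only says that the results are merged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not the TMFDomainEngine code): merges URI lists. */
final class UriListMerger {

    /** Returns all URIs in first-seen order, without duplicates or blank entries. */
    static List<String> merge(List<List<String>> uriLists) {
        // LinkedHashSet removes duplicates while keeping insertion order.
        Set<String> merged = new LinkedHashSet<String>();
        for (List<String> uris : uriLists) {
            for (String uri : uris) {
                String trimmed = uri.trim();
                if (!trimmed.isEmpty()) {
                    merged.add(trimmed);
                }
            }
        }
        return new ArrayList<String>(merged);
    }

    public static void main(String[] args) {
        List<String> places = Arrays.asList(
                "http://dbpedia.org/resource/Turin",
                "http://dbpedia.org/resource/Mole_Antonelliana");
        List<String> spatialThings = Arrays.asList(
                "http://dbpedia.org/resource/Turin",
                "http://dbpedia.org/resource/Po_(river)");
        for (String uri : merge(Arrays.asList(places, spatialThings))) {
            System.out.println(uri);
        }
    }
}
```

Deduplication matters here because the two example queries overlap: an entity typed as a Place is often also a SpatialThing.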

Domain Services configuration

SPARQL queries represent a fundamental tool to cut your graph and identify entities that belong to a specific domain. Nevertheless, SPARQL queries have limits related to the ontologies used in the knowledge base: it is not always possible to identify entities that may potentially be included in the chosen domain.

For these reasons, a good strategy is to identify services and mechanisms that wrap complex SPARQL queries, neighbor algorithms based on different metrics, etc., in order to complement the results obtained through the SPARQL queries set up by the user.

A first implementation of the domain service exploits the Linked Data Recommender (LDR) developed by the SoftEng group of the Politecnico di Torino. More information on LDRs is available in the paper entitled "A systematic literature review of Linked Data-based recommender systems".

The LDR currently exposes two REST services:

  • Get all DBpedia categories from a DBpedia entity.
  • Get DBpedia entities related to a specific DBpedia entity and a DBpedia category.

By applying these two services to the DBpedia resources retrieved with the SPARQL queries defined in the previous step, you can get new entities to enrich your Domain Corpus Index.

To integrate a domain service in the TellMeFirst Index Builder, take the following steps:

  • Configure /dbpedia-spotlight/conf/indexing.tmf.domain.en.properties (or the Italian version) with the following parameters:
tellmefirst.domain.LDRPath = ../data/tellmefirst/dbpedia/en/output/domain/domainURIsfromLDR.list
tellmefirst.domain.LDRService = http://localhost:8080/LDRecommenderWeb/rest/service/recommendations

The first parameter defines the file to which the list of URIs retrieved through the domain service is written, while the second parameter defines the endpoint of the domain service.

  • Create a Java class that implements this interface:
package org.dbpedia.spotlight;

import java.util.List;

/**
 * Interface for a basic request to a domain service
 *
 */
public interface DomainServiceClient {

    public List getDomainEntities(String[] parameters) throws Exception;
}

The getDomainEntities method must return a list of entities (java.util.List) that can be written to a file (in our case domainURIsfromLDR.list). The TMFDomainEngine component will merge this list of entities with the URIs of the resources retrieved through the SPARQL queries mentioned before.

  • Then, write the code that performs the actual request to the domain service (in the case of the LDR service developed by SoftEng you have to combine the results of two different requests).

  • Finally, integrate the requests to the domain services in the TMFDomainEngine so that their results are merged. In the future, the integration in the TMFDomainEngine component will be made more configurable, in order to easily include new services.
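The steps above can be sketched as a small implementation of the DomainServiceClient interface that combines two lookups, in the spirit of the two LDR services (categories of an entity, then entities related to an entity and a category). This is only an illustration under stated assumptions: the TwoStepDomainClient class is hypothetical, the interface is repeated without its package so the snippet compiles on its own, the parameters array is assumed to hold seed entity URIs, and the two lookups are injected as functions because the LDR request and response formats are not described here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiFunction;
import java.util.function.Function;

/** Copy of the interface shown above, without its package, for self-containment. */
interface DomainServiceClient {
    List getDomainEntities(String[] parameters) throws Exception;
}

/** Hypothetical client that chains a category lookup and a related-entity lookup. */
final class TwoStepDomainClient implements DomainServiceClient {
    // Assumed stand-ins for the two REST calls; real code would perform HTTP requests.
    private final Function<String, List<String>> categoriesOf;
    private final BiFunction<String, String, List<String>> relatedTo;

    TwoStepDomainClient(Function<String, List<String>> categoriesOf,
                        BiFunction<String, String, List<String>> relatedTo) {
        this.categoriesOf = categoriesOf;
        this.relatedTo = relatedTo;
    }

    /** Treats each parameter as a seed entity URI and collects its related entities. */
    @Override
    public List<String> getDomainEntities(String[] parameters) {
        Set<String> entities = new LinkedHashSet<String>();
        for (String seed : parameters) {
            for (String category : categoriesOf.apply(seed)) {
                entities.addAll(relatedTo.apply(seed, category));
            }
        }
        return new ArrayList<String>(entities);
    }

    public static void main(String[] args) {
        // Stub lookups with made-up data, used only to exercise the control flow.
        TwoStepDomainClient client = new TwoStepDomainClient(
                seed -> Arrays.asList("Category:Museums_in_Turin"),
                (seed, category) -> Arrays.asList(seed + "_neighbour"));
        System.out.println(client.getDomainEntities(
                new String[] {"http://dbpedia.org/resource/Museo_Egizio"}));
    }
}
```

Injecting the two lookups keeps the combination logic testable without a running LDR instance; a real client would replace the stubs with calls to the endpoint configured in tellmefirst.domain.LDRService.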

Data processing

To create the Domain Corpus Index, launch one of the following scripts (for the English and the Italian index, respectively):

index.tmf.domain.en.sh 
index.tmf.domain.it.sh 

In particular, a new command has been integrated in the TellMeFirst Index Builder pipeline:

mvn compile
mvn exec:java -e -Dexec.mainClass="org.dbpedia.spotlight.lucene.index.external.domain.TMFDomainEngine" -Dexec.args=$INDEX_CONFIG_FILE

This command launches the TMFDomainEngine, which integrates the results of the SPARQL queries and the domain services to produce the new URI list from which the Domain Corpus Index is created.