Domain Corpus Index
The Domain Corpus Index contains a subset of the DBpedia entities that are indexed in the Corpus Index. It can be used to classify documents related to a specific domain (cinema, literature, tourism, etc.).
In this section I will explain how to configure the TellMeFirst Index Builder in order to create an index useful for your target domain.
The Domain Corpus Index has the same structure as the Corpus Index. Each Lucene Document is composed of the following fields:
- URI: the DBpedia entity URI;
- URI COUNT: number of times that an entity appears as a [wikilink](http://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Linking) within the Wikipedia corpus;
- TITLE: title of the Wikipedia page identified by a URI;
- TYPE: ontology class of the entity (each entity can have more than one TYPE);
- IMAGE: image URL retrieved from the foaf:depiction property of the DBpedia entity. For instance, the foaf:depiction value of Giacomo Leopardi is http://upload.wikimedia.org/wikipedia/commons/c/c6/Leopardi,_Giacomo_%281798-1837%29_-_ritr._A_Ferrazzi,_Recanati,_casa_Leopardi.jpg;
- CONTEXT: paragraph of Wikipedia that contains the specific entity as a wikilink.
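To make the field list concrete, here is a minimal plain-Java view of one index entry. It is only a sketch: the record name `DomainIndexEntry`, its component names, and the `hasType` helper are hypothetical and are not part of the TellMeFirst Index Builder, which stores these values as Lucene fields. Records require Java 16 or later.

```java
import java.util.List;

/** Hypothetical in-memory view of one Domain Corpus Index entry. */
record DomainIndexEntry(
        String uri,          // URI: the DBpedia entity URI
        int uriCount,        // URI COUNT: occurrences as a wikilink in the corpus
        String title,        // TITLE: title of the Wikipedia page
        List<String> types,  // TYPE: ontology classes (possibly more than one)
        String image,        // IMAGE: foaf:depiction URL
        String context) {    // CONTEXT: paragraph that links to the entity

    // Defensive copy so the entry cannot be changed through the caller's list.
    DomainIndexEntry {
        types = List.copyOf(types);
    }

    /** True if the entity carries the given ontology class. */
    boolean hasType(String ontologyClass) {
        return types.contains(ontologyClass);
    }

    public static void main(String[] args) {
        // All values below are made up for illustration.
        DomainIndexEntry entry = new DomainIndexEntry(
                "http://dbpedia.org/resource/Giacomo_Leopardi", 3,
                "Giacomo Leopardi",
                List.of("http://dbpedia.org/ontology/Person"),
                "", "sample paragraph");
        System.out.println(entry.hasType("http://dbpedia.org/ontology/Person"));
    }
}
```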
The Domain Corpus Index contains fewer entities than the Corpus Index, because it covers a single domain. To generate this index, start by following the instructions in the Download DBpedia and Wikipedia datasets section.
In this documentation I will describe the components (configuration properties and scripts) that you should define in order to create a domain index for classifying your documents.
The main part of the code is contained in the TMFDomainEngine class, which uses domain SPARQL queries and domain REST services to create a new URI list file.
A suitable way to identify the DBpedia entities that belong to a specific domain is to run ad hoc SPARQL queries. These queries exploit the ontology classes used in DBpedia (e.g., YAGO and UMBEL) to describe the features of an entity.
You need to specify the SPARQL queries in a JSON configuration file for each language, at the following locations:
/dbpedia-spotlight/conf/external/domain.en.json
/dbpedia-spotlight/conf/external/domain.it.json
You can change the location of these JSONs in the properties files available here:
/dbpedia-spotlight/conf/indexing.tmf.domain.en.properties
/dbpedia-spotlight/conf/indexing.tmf.domain.it.properties
The property that you need to update is:
tellmefirst.domain.conf
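As a quick check of this setting, the sketch below reads the property with `java.util.Properties`. It assumes the file follows the standard Java properties syntax. The class `DomainConfigLocation`, the method `domainConfPath`, and the sample value are hypothetical and only illustrate the lookup; in practice you would pass the content of one of the properties files listed above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class DomainConfigLocation {

    /** Returns the value of tellmefirst.domain.conf, failing if it is missing. */
    static String domainConfPath(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String path = props.getProperty("tellmefirst.domain.conf");
        if (path == null || path.isBlank()) {
            throw new IllegalStateException("tellmefirst.domain.conf is not set");
        }
        return path.trim();
    }

    public static void main(String[] args) {
        // Sample value, chosen for illustration only.
        String sample = "tellmefirst.domain.conf = conf/external/domain.en.json\n";
        System.out.println(domainConfPath(sample));
    }
}
```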
The JSON file has the following structure (the SPARQL queries available below cover the tourism domain):
[{
"description": "Get all places from DBpedia",
"endpoint": "http://dbpedia.org/sparql",
"query": "SELECT DISTINCT ?entity WHERE { ?entity a <http://dbpedia.org/ontology/Place>} LIMIT 10000",
"domain": "POIs",
"language": "en",
"baseuri": "http://dbpedia.org/resource/"
}, {
"description": "Get all spatial things from DBpedia",
"endpoint": "http://dbpedia.org/sparql",
"query": "SELECT DISTINCT ?entity WHERE { ?entity a <http://www.w3.org/2003/01/geo/wgs84_pos#SpatialThing>} LIMIT 10000",
"domain": "POIs",
"language": "en",
"baseuri": "http://dbpedia.org/resource/"
}]
Each entry of the JSON file is composed of the following fields:
- description: the description of the results obtained through the query;
- endpoint: the endpoint used to perform the query;
- query: the actual query;
- domain: the domain covered by the query;
- language: the language of the resources obtained through the query;
- baseuri: the base URI of the resources obtained through the query.
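To illustrate how one configuration entry maps to an HTTP request, the following sketch models an entry as a Java record and builds the GET URL defined by the SPARQL protocol (the query text goes in a URL-encoded `query` parameter). The record `DomainQuery` and its method `requestUri` are hypothetical names, not the classes used by the TMFDomainEngine. The sketch also assumes the endpoint URL carries no query string of its own, and it leaves out the HTTP call and the parsing of the results.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical model of one entry of the domain JSON file. */
record DomainQuery(
        String description,
        String endpoint,
        String query,
        String domain,
        String language,
        String baseuri) {

    /** Builds the SPARQL protocol GET URL: endpoint + "?query=" + encoded query. */
    URI requestUri() {
        String encoded = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return URI.create(endpoint + "?query=" + encoded);
    }

    public static void main(String[] args) {
        DomainQuery places = new DomainQuery(
                "Get all places from DBpedia",
                "http://dbpedia.org/sparql",
                "SELECT DISTINCT ?entity WHERE { ?entity a "
                        + "<http://dbpedia.org/ontology/Place>} LIMIT 10000",
                "POIs", "en", "http://dbpedia.org/resource/");
        System.out.println(places.requestUri());
    }
}
```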
In the repository you have an example of this JSON file to cover the tourism domain in English and in Italian.
The output of each query is written as a list of URIs in the directory:
/data/tellmefirst/dbpedia/en/output/domain
Finally, the TMFDomainEngine merges the results of the SPARQL queries.
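The merge step can be pictured as a union of the per-query URI lists. The sketch below assumes that merging means removing duplicates while keeping the order of first appearance. The actual behavior of the TMFDomainEngine is not described here, and the class `UriListMerger` is a hypothetical name.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class UriListMerger {

    /** Unions several URI lists, dropping blanks and duplicates. */
    static Set<String> merge(Collection<? extends Collection<String>> lists) {
        Set<String> merged = new LinkedHashSet<>(); // keeps first-seen order
        for (Collection<String> list : lists) {
            for (String uri : list) {
                String trimmed = uri.trim();
                if (!trimmed.isEmpty()) {
                    merged.add(trimmed);
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> places = List.of(
                "http://dbpedia.org/resource/Turin",
                "http://dbpedia.org/resource/Rome");
        List<String> spatialThings = List.of(
                "http://dbpedia.org/resource/Rome",
                "http://dbpedia.org/resource/Recanati");
        merge(List.of(places, spatialThings)).forEach(System.out::println);
    }
}
```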
SPARQL queries are a fundamental tool for cutting your graph and identifying the entities that belong to a specific domain. Nevertheless, they are limited by the ontologies used in the knowledge base: it is not always possible to identify all the entities that could be included in the chosen domain.
For this reason, a good strategy is to use services and mechanisms that wrap complex SPARQL queries, neighbor algorithms based on different metrics, etc., in order to complement the results obtained through the SPARQL queries set up by the user.
A first implementation of the domain service exploits the Linked Data Recommender (LDR) developed by the SoftEng group of the Politecnico di Torino. More information on LDRs is available in the paper entitled "A systematic literature review of Linked Data-based recommender systems".
The LDR currently exposes 2 REST services:
- Get all DBpedia categories from a DBpedia entity.
- Get DBpedia entities related to a specific DBpedia entity and a DBpedia category.
By applying these two services to the DBpedia resources retrieved with the SPARQL queries defined in the previous step, you can get new entities to enrich your Domain Corpus Index.
To integrate a domain service in the TellMeFirst Index Builder, take the following steps:
- Configure the /dbpedia-spotlight/conf/indexing.tmf.domain.en.properties (or the Italian version) properly, with the following parameters:
tellmefirst.domain.LDRPath = ../data/tellmefirst/dbpedia/en/output/domain/domainURIsfromLDR.list
tellmefirst.domain.LDRService = http://localhost:8080/LDRecommenderWeb/rest/service/recommendations
The first parameter defines the output list of URIs retrieved through the domain service, while the second parameter defines the endpoint of the domain service.
- Create a Java class that implements this interface:
package org.dbpedia.spotlight;

import java.util.List;

/**
 * Interface for a basic request to a domain service
 */
public interface DomainServiceClient {
    public List getDomainEntities(String[] parameters) throws Exception;
}
The getDomainEntities method must return a list of entities (java.util.List) that can be written to a file (in our case domainURIsfromLDR.list). The TMFDomainEngine component will merge this list of entities with the URIs of the resources retrieved through the SPARQL queries mentioned before.
- Then, write the code that performs the actual request to the domain service (in the case of the LDR service developed by SoftEng you have to combine the results of two different requests).
- Finally, integrate the requests to the domain services in the TMFDomainEngine to merge the results. In the future, the integration in the TMFDomainEngine component will be made more configurable, in order to easily include new services.
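The steps above can be sketched as a client that chains the two LDR requests: for each seed entity it asks for the categories, then asks for the entities related to each (entity, category) pair. This is a minimal sketch under several assumptions. The class `TwoStepDomainClient` is hypothetical. The two HTTP calls are injected as plain functions because the request and response formats of the LDR services are not described here. The `parameters` array is assumed to hold seed entity URIs. The interface is repeated without its package declaration and `public` modifier so that the example compiles on its own.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiFunction;
import java.util.function.Function;

/** Local copy of the interface shown above, so the sketch is self-contained. */
interface DomainServiceClient {
    List getDomainEntities(String[] parameters) throws Exception;
}

/** Hypothetical client that combines a "categories" and a "related" request. */
final class TwoStepDomainClient implements DomainServiceClient {

    private final Function<String, List<String>> categoriesOf;
    private final BiFunction<String, String, List<String>> relatedTo;

    TwoStepDomainClient(Function<String, List<String>> categoriesOf,
                        BiFunction<String, String, List<String>> relatedTo) {
        this.categoriesOf = categoriesOf;
        this.relatedTo = relatedTo;
    }

    /** Assumption: each parameter is the URI of a seed DBpedia entity. */
    @Override
    public List<String> getDomainEntities(String[] parameters) {
        Set<String> found = new LinkedHashSet<>(); // no duplicates, stable order
        for (String seed : parameters) {
            for (String category : categoriesOf.apply(seed)) {
                found.addAll(relatedTo.apply(seed, category));
            }
        }
        return new ArrayList<>(found);
    }

    public static void main(String[] args) {
        // Stub functions stand in for the real HTTP requests.
        TwoStepDomainClient client = new TwoStepDomainClient(
                entity -> List.of("Category:Example"),
                (entity, category) -> List.of(entity + "_related"));
        System.out.println(client.getDomainEntities(
                new String[] {"http://dbpedia.org/resource/Turin"}));
    }
}
```

Injecting the two requests as functions keeps the merging logic testable without a running LDR instance; a real implementation would replace the stubs with calls to the endpoint configured in tellmefirst.domain.LDRService.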
To create the Domain Corpus Index, launch one of the following scripts (for the English and the Italian index, respectively):
index.tmf.domain.en.sh
index.tmf.domain.it.sh
In particular, a new command has been integrated in the TellMeFirst Index Builder pipeline:
mvn compile
mvn exec:java -e -Dexec.mainClass="org.dbpedia.spotlight.lucene.index.external.domain.TMFDomainEngine" -Dexec.args=$INDEX_CONFIG_FILE
This command launches the TMFDomainEngine, which combines the results of the SPARQL queries and of the domain services into the new URI list used to build the Domain Corpus Index.