Home
Shedding Light on the Web of Documents
This is the main page for the DBpedia Spotlight documentation. Some shortcuts are presented below:
If you are a user, you may want to check:
- Web Application: we prepared an HTML+JavaScript interface to show you how DBpedia Spotlight works: http://spotlight.dbpedia.org/demo/
- Web Service: RESTful interface that takes text as input and returns annotated text as output.
- Would you like to install it on your server? See: Installation
If you are a developer:
- See some tips on how to join the community
- Read the guidelines on how to contribute
- How to set up your dev environment with IntelliJ or Eclipse
- More Installation instructions.
- Want to use the tool with your own data or help with internationalization?
If you are a researcher:
- Publications
- Known uses
- Related work
- If you use this work in your research, please see the citation note below.
If you are a Google Summer of Code student:
DBpedia Spotlight is a tool for automatically annotating mentions of DBpedia resources in text, providing a solution for linking unstructured information sources to the Linked Open Data cloud through DBpedia. DBpedia Spotlight recognizes that names of concepts or entities have been mentioned (e.g. "Michael Jordan"), and subsequently matches these names to unique identifiers (e.g. dbpedia:Michael_I._Jordan, the machine learning professor, or dbpedia:Michael_Jordan, the basketball player). It can also be used to build your own solution for Named Entity Recognition, Keyphrase Extraction, Tagging and other information extraction tasks.
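The two steps described above (recognizing a mention, then matching it to an identifier) can be illustrated with a deliberately tiny sketch. This is not how DBpedia Spotlight is implemented: the candidate table, the context words and the word-overlap score are all invented for illustration, and only the two identifiers are taken from the example above.

```python
import re

# Toy candidate table: surface form -> {identifier: typical context words}.
# The context words are made up for this example.
CANDIDATES = {
    "Michael Jordan": {
        "dbpedia:Michael_I._Jordan": {"machine", "learning", "professor", "statistics"},
        "dbpedia:Michael_Jordan": {"basketball", "player", "nba", "bulls"},
    }
}


def spot(text):
    """Spotting: return the known surface forms that occur in the text."""
    return [name for name in CANDIDATES if name in text]


def disambiguate(name, text):
    """Disambiguation: pick the candidate sharing the most words with the text."""
    words = set(re.findall(r"\w+", text.lower()))
    scores = {uri: len(ctx & words) for uri, ctx in CANDIDATES[name].items()}
    return max(scores, key=scores.get)


def annotate_toy(text):
    """Map each spotted surface form to its best-scoring identifier."""
    return {name: disambiguate(name, text) for name in spot(text)}
```

Calling `annotate_toy` on a sentence about a machine learning professor and on a sentence about a basketball player resolves the same surface form to different identifiers, which is the behaviour the paragraph above describes.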
Text annotation has the potential of enhancing a wide range of applications, including search, faceted browsing and navigation. By connecting text documents with DBpedia, our system enables a range of interesting use cases. For instance, the ontology can be used as background knowledge to display complementary information on web pages or to enhance information retrieval tasks. Moreover, faceted browsing over documents and customization of web feeds based on semantics become feasible. Finally, by following links from DBpedia into other data sources, the Linked Open Data cloud is pulled closer to the Web of Documents.
Take a look at our [Known Uses](http://dbpedia.org/spotlight/knownuses) page for other examples of how DBpedia Spotlight can be used. If you use DBpedia Spotlight in your project, please add a link to http://spotlight.dbpedia.org. If you use it in a paper, please use the citation available at the end of this page.
You can try out DBpedia Spotlight through our Web Application or Web Service endpoints. The Web Application is a user interface that allows you to enter text in a form and generates an HTML annotated version of the text with links to DBpedia. The Web Service endpoints provide programmatic access to the demo, allowing you to retrieve data also in XML or JSON.
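As a rough illustration of programmatic access, the sketch below builds a request for an annotation endpoint and extracts the linked resources from a JSON reply. The endpoint path `/rest/annotate`, the `text` and `confidence` parameters, and the `Resources`, `@URI` and `@surfaceForm` field names are assumptions that this page does not specify; check the Web Service documentation for the actual interface before relying on them.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed endpoint path; this page only names the host http://spotlight.dbpedia.org.
ANNOTATE_URL = "http://spotlight.dbpedia.org/rest/annotate"


def build_annotate_request(text, confidence=0.5, url=ANNOTATE_URL):
    """Build (but do not send) a GET request that asks for JSON output."""
    query = urlencode({"text": text, "confidence": confidence})
    return Request(f"{url}?{query}", headers={"Accept": "application/json"})


def extract_resources(json_body):
    """Return (surface form, DBpedia URI) pairs from a JSON response body.

    The field names used here are assumed, not taken from this page.
    """
    payload = json.loads(json_body)
    return [
        (r.get("@surfaceForm"), r.get("@URI"))
        for r in payload.get("Resources", [])
    ]


def annotate(text, confidence=0.5):
    """Send the request and return the extracted pairs (needs network access)."""
    with urlopen(build_annotate_request(text, confidence), timeout=10) as resp:
        return extract_resources(resp.read().decode("utf-8"))
```

Keeping request construction and response parsing separate from the network call makes both halves easy to check offline, for example against a saved response.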
Our latest implementation is based on statistical methods and is available in a number of languages. Data collection can be performed on a Hadoop cluster using our version of PigNLProc. More details on the indexing process of this implementation can be found here and a fully automated indexing tool can be found here.
There are still several open issues with this implementation; see our Issue tracker.
Q: Can the memory footprint be reduced?
A: The memory footprint of this implementation is mainly due to context words. There are three ways to reduce it:
1. Use disk-based context lookup instead of memory-based context lookup (see Issue #187).
2. Do not consider context (en_small.tar.gz).
3. Prune context data (see Issue #167).
Q: I want to pass a parameter to show more or fewer entities depending on their score.
A: See Issue #188.
You can also use Spotlight out of the box on a Linux machine by following this guide.
For the memory requirements of the models, see our paper. Because the English model is fairly big, en_small.tar.gz is provided as a low-memory alternative that does not consider context words and hence provides lower accuracy.
If you use this work in your research, please cite:
Joachim Daiber, Max Jakob, Chris Hokamp, Pablo N. Mendes. Improving Efficiency and Accuracy in Multilingual Entity Extraction. Proceedings of the 9th International Conference on Semantic Systems (I-Semantics). Graz, Austria, 4–6 September 2013.
@inproceedings{isem2013daiber,
title = {Improving Efficiency and Accuracy in Multilingual Entity Extraction},
author = {Joachim Daiber and Max Jakob and Chris Hokamp and Pablo N. Mendes},
year = {2013},
booktitle = {Proceedings of the 9th International Conference on Semantic Systems (I-Semantics)}
}
The original DBpedia Spotlight implementation uses Apache Lucene for disambiguation and LingPipe for spotting. Pre-built indexes and spotter models are available for English.
DBpedia Spotlight looks for ~3.5M things of ~320 types in text and tries to disambiguate them to their globally unique identifiers in DBpedia. It uses the entire Wikipedia to learn how to annotate DBpedia resources. Because the entire dataset cannot be distributed alongside the code, it can be downloaded in varied sizes from the download page. A tiny dataset is included in the distribution for demonstration purposes only. After you have downloaded the files, you need to modify the configuration in server.properties with the correct path to the files. More info here.
If you use this work in your research, please cite:
Pablo N. Mendes, Max Jakob, Andrés García-Silva and Christian Bizer. DBpedia Spotlight: Shedding Light on the Web of Documents. Proceedings of the 7th International Conference on Semantic Systems (I-Semantics). Graz, Austria, 7–9 September 2011.
@inproceedings{isem2011mendesetal,
title = {DBpedia Spotlight: Shedding Light on the Web of Documents},
author = {Pablo N. Mendes and Max Jakob and Andres Garcia-Silva and Christian Bizer},
year = {2011},
booktitle = {Proceedings of the 7th International Conference on Semantic Systems (I-Semantics)},
abstract = {Interlinking text documents with Linked Open Data enables the Web of Data to be used as background knowledge within document-oriented applications such as search and faceted browsing. As a step towards interconnecting the Web of Documents with the Web of Data, we developed DBpedia Spotlight, a system for automatically annotating text documents with DBpedia URIs. DBpedia Spotlight allows users to configure the annotations to their specific needs through the DBpedia Ontology and quality measures such as prominence, topical pertinence, contextual ambiguity and disambiguation confidence. We compare our approach with the state of the art in disambiguation, and evaluate our results in light of three baselines and six publicly available annotation systems, demonstrating the competitiveness of our system. DBpedia Spotlight is shared as open source and deployed as a Web Service freely available for public use.}
}
The corpus used to evaluate DBpedia Spotlight in this work is described here.
The best way to get help with DBpedia Spotlight is to send a message to our mailing list at dbp-spotlight-users@lists.sourceforge.net.
You can also join the #dbpedia-spotlight IRC channel on Freenode. We also listen to Tweets.
We'd love it if you gave us some feedback.
The DBpedia Spotlight team includes the names cited below. Individual contributions are acknowledged in the source code and publications.
- Pablo Mendes (Freie Universität Berlin), Jun 2010-present.
- Max Jakob (Freie Universität Berlin), Jun 2010-Sep 2011, Apr 2012-present.
- Jo Daiber (Charles University in Prague), Mar 2011-present.
- Prof. Dr. Chris Bizer (Freie Universität Berlin), supervisor, Jun 2010-present.
- Andrés García-Silva (Universidad Politécnica de Madrid), Jul-Dec 2010.
- Rohana Rajapakse (Goss Interactive Ltd.), Oct 2011, Jun-Jul 2012.
- Iavor Jelev, May-Jun 2012.
Google Summer of Code Students
- Chris Hokamp (University of North Texas)
- Hector
- Dirk Weissenborn
- Jo Daiber
This work has been partially funded by:
- Neofonie GmbH, a Berlin-based company offering leading technologies in the area of Web search, social media and mobile applications. (Jun 2010-Jun 2011)
- The European Commission through the projects:
  - LOD2 - Creating Knowledge out of Linked Data. (Jun 2010-Oct 2012)
  - IKS - Interactive Knowledge Stack via the Early Adopters Program. (Jun 2011)
  - Dicode (July 2011-present)
- Google Summer of Code 2012, with 4 students. (May 2012-Sep 2012)