CompNet/TranspoloSearch

TranspoloSearch v2

Web-based information extraction for political science

  • Copyright 2015-18 Vincent Labatut

TranspoloSearch is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation. For source availability and license information see licence.txt


Description

This software takes the name of a public person and a period, and retrieves all events available online involving this person during this period. It first performs a web search using various engines, then retrieves the corresponding Web pages, performs NER (named entity recognition), uses the detected entities to cluster the articles, and treats each cluster as the description of a specific event. It is designed to handle Web pages in French, but should also work for English. It has been used in references [MLE'15] and [ML'17].
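To make the clustering step more concrete, here is a minimal sketch of how articles could be grouped by the entities they share. It is not the project's actual implementation: the class name `EventClusteringSketch`, the Jaccard similarity measure, the greedy assignment strategy, the threshold value and the sample entities are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: not a class of TranspoloSearch.
class EventClusteringSketch {

    /** Jaccard similarity between two entity sets (0 when both are empty). */
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    /**
     * Greedy clustering: each article joins the first cluster that contains
     * an article whose similarity reaches the threshold, otherwise it starts
     * a new cluster. Clusters are returned as lists of article indices.
     */
    static List<List<Integer>> cluster(List<Set<String>> articles, double threshold) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < articles.size(); i++) {
            List<Integer> target = null;
            for (List<Integer> candidate : clusters) {
                for (int j : candidate) {
                    if (jaccard(articles.get(i), articles.get(j)) >= threshold) {
                        target = candidate;
                        break;
                    }
                }
                if (target != null) {
                    break;
                }
            }
            if (target == null) {
                target = new ArrayList<>();
                clusters.add(target);
            }
            target.add(i);
        }
        return clusters;
    }

    public static void main(String[] args) {
        // Made-up entity sets standing in for the NER output of three articles.
        List<Set<String>> articles = List.of(
                Set.of("Paris", "Conseil municipal", "Budget"),
                Set.of("Paris", "Budget", "Mairie"),
                Set.of("Marseille", "Festival"));
        // Each resulting cluster is read as one candidate event.
        System.out.println(cluster(articles, 0.4)); // prints [[0, 1], [2]]
    }
}
```

The first two articles share two of their four distinct entities (similarity 0.5), so they end up in the same cluster, while the third shares none and forms its own. A real pipeline would probably also weight entities by type (person, location, date) and use a more robust clustering method.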

If you use this software, please cite reference [MLE'15]:

@InProceedings{Marrel2015,
  author           = {Marrel, Guillaume and Labatut, Vincent and El Bèze, Marc},
  title            = {Le {Web} comme miroir du travail politique quotidien~? Reconstituer l'écho médiatique en ligne des événements d'un agenda d'élu},
  booktitle        = {13ème Congrès de l'Association Française de Science Politique},
  year             = {2015},
  pages            = {25},
  address          = {Aix-en-Provence, FR},
  url              = {http://www.congres-afsp.fr/st/st7/st7marrellabatutelbeze.pdf},
}

Organization

The source code takes the form of an Eclipse project. It is organized as follows:

  • Package data contains all the classes used to represent data: articles, entities, etc.
  • Package evaluation contains classes used to measure the performance of the retrieval tool.
  • Package processing contains classes related to named entity recognition (NER).
  • Package retrieval contains classes used to get the web pages.
  • Package search contains classes used to perform the web search.
  • Package tools contains various classes used throughout the software.

The rest of the files are resources:

  • Folder lib contains the external libraries, especially the NER-related ones (cf. the Dependencies section).
  • Folder log contains the log generated during the processing.
  • Folder out contains the articles and the files generated during the process.
  • Folder res contains the XML schemas (XSD files), as well as the configuration files required by certain NER tools.

Installation

First, get the latest version of the project. Second, download the additional data files it requires.

Most of the data files are too large to be compatible with GitHub constraints. For this reason, they are hosted on FigShare. Before using Nerwip, you need to retrieve these archives and unzip them in the Eclipse project.

  1. Go to our FigShare page.
  2. Retrieve the data related to the different NER tools (models, dictionaries, etc.). You can ignore the corpus files (used for another project).
       • Download all 4 Zip files containing the NER data.
       • Extract the res folder.
       • Copy it into the Eclipse project, over the existing res folder. Do not delete the existing folder first: overwrite it in place, because its existing folders and files are still needed.

Finally, some of the NER tools integrated in Nerwip require a key or password to work. This is the case for:

  • Subee: our Wikipedia/Freebase-based NER tool requires a Freebase key to work correctly.
  • OpenCalais: this NER tool takes the form of a Web service. All keys are set up in the dedicated XML file keys.xml, which is located in res/misc.
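As an illustration of how such a key file could be read, here is a minimal sketch using the standard Java DOM parser. The layout of keys.xml is not described here, so the `<keys>`/`<key name="..." value="..."/>` structure below is purely an assumption, as are the class name `KeyFileSketch` and the method `readKeys`. Check the actual file in res/misc and its XSD schema before relying on any of this.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical illustration only: not a class of TranspoloSearch.
class KeyFileSketch {

    /**
     * Reads every <key name="..." value="..."/> element from the stream
     * and returns a map from key name to key value.
     * The element and attribute names are assumed, not taken from the project.
     */
    static Map<String, String> readKeys(InputStream in) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(in);
            NodeList nodes = doc.getElementsByTagName("key");
            Map<String, String> keys = new HashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element element = (Element) nodes.item(i);
                keys.put(element.getAttribute("name"), element.getAttribute("value"));
            }
            return keys;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read the key file", e);
        }
    }

    public static void main(String[] args) {
        // Inline sample standing in for res/misc/keys.xml (assumed layout).
        String xml = "<keys><key name=\"OpenCalais\" value=\"YOUR-KEY-HERE\"/></keys>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(readKeys(in).get("OpenCalais")); // prints YOUR-KEY-HERE
    }
}
```

To read the real file, the in-memory stream would be replaced with a `java.io.FileInputStream` opened on res/misc/keys.xml.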

Use

For now, there is no interface, not even a command-line one. All the processes need to be launched programmatically, as illustrated by class fr.univavignon.transpolosearch.Test. I advise importing the project into Eclipse and directly editing the source code of this class. A more appropriate interface will be added once the software is more stable. The output folder is out.

Dependencies

Here are the dependencies for TranspoloSearch:

Todo

References

  • [ML'17] V. Labatut & G. Marrel. La visibilité politique en ligne : Contribution à la mesure de l’e-reputation politique d’un maire urbain, Big Data et visibilité en ligne - Un enjeu pluridisciplinaire de l’économie numérique, 32p, 2017. ⟨hal-01904352⟩
  • [MLE'15] G. Marrel, V. Labatut & M. El Bèze. Le Web comme miroir du travail politique quotidien ? : Reconstituer l'écho médiatique en ligne des événements d'un agenda d'élu, 13ème Congrès de l'Association Française de Science Politique (AFSP), 25p, 2015. ⟨hal-01904338⟩
