Jupyter notebook for geoparsing historical encyclopedia texts in French using the PERDIDO Geoparser.
This notebook is proposed by L. Moncla (INSA Lyon) and K. McDonough (The Alan Turing Institute) as part of the GEODE project.
In this tutorial, we demonstrate how to use a custom version of the Perdido geoparser Python library developed in the GEODE project. We will use texts from Diderot and d’Alembert’s Encyclopédie as a case study for querying a corpus and wrangling geoparsed data. We will also compare Perdido’s NER annotations (i.e. its output) to the results of other well-known Python NER libraries (spaCy and Stanza).
This tutorial covers the following:
- Load data from TEI-XML files into a Python dataframe
- Use the dataframe for simple data analysis
- Test the PERDIDO API for preprocessing French texts (part-of-speech tagging)
- Test the PERDIDO API for geoparsing (geotagging + geocoding) Encyclopédie articles
- Display custom geotagging results (PERDIDO TEI-XML) with the displaCy Named Entity Visualizer
- Display geocoding results on a map
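As a rough illustration of the first two items above, the sketch below parses a small TEI-like XML string with Python's standard library and collects one record per article. The sample document, its element names (`div`, `head`, `p`), and the `load_articles` helper are assumptions made for this example; the actual ARTFL files and the notebook's own loading code may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified TEI-like sample; real ARTFL files may differ.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div type="article" n="1">
        <head>LYON</head>
        <p>Grande ville de France, au confluent du Rhône et de la Saône.</p>
      </div>
      <div type="article" n="2">
        <head>GENEVE</head>
        <p>Ville située sur le lac du même nom.</p>
      </div>
    </body>
  </text>
</TEI>"""

# Prefix-to-URI mapping used in the XPath-style queries below.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def load_articles(xml_string):
    """Return one dict per <div type="article"> found in the TEI string."""
    root = ET.fromstring(xml_string)
    records = []
    for div in root.iterfind(".//tei:div[@type='article']", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        # Join all text inside each <p> and normalise whitespace.
        paragraphs = [
            " ".join("".join(p.itertext()).split())
            for p in div.iterfind("tei:p", TEI_NS)
        ]
        text = " ".join(paragraphs)
        records.append({
            "id": div.get("n"),
            "head": head.text.strip() if head is not None and head.text else "",
            "text": text,
            "n_words": len(text.split()),  # naive whitespace token count
        })
    return records


records = load_articles(SAMPLE_TEI)
for r in records:
    print(r["id"], r["head"], r["n_words"])
```

If pandas is installed, `pandas.DataFrame(records)` turns this list of dicts into a dataframe, on which simple analyses such as sorting by `n_words` become one-liners.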
You can open this notebook in a remote executable environment, or run it locally by cloning the repository and following the steps below:
git clone https://github.com/GEODE-project/perdido-geoparsing-notebook.git
- Create a new environment called tutorial-geoparsing-py39
conda create -n tutorial-geoparsing-py39 python=3.9
- Activate the environment
conda activate tutorial-geoparsing-py39
- Install the fiona package with conda (this avoids an issue with pip)
conda install fiona==1.8.21
- Install dependencies with pip
pip install -r requirements.txt
- Launch Jupyter Notebook
jupyter notebook
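After installation, a short check like the one below can help confirm that the active interpreter and the pinned fiona release match the steps above. This is only a sketch: the helper names `installed_version` and `check_environment` are made up for this example and are not part of the repository.

```python
import sys
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment(expected_python=(3, 9), pins=None):
    """Compare the running interpreter and pinned packages with expectations."""
    # Default pin taken from the conda install step; extend as needed.
    pins = pins or {"fiona": "1.8.21"}
    report = {
        "python_ok": sys.version_info[:2] == expected_python,
        "packages": {},
    }
    for name, wanted in pins.items():
        found = installed_version(name)
        report["packages"][name] = {
            "wanted": wanted,
            "found": found,
            "ok": found == wanted,
        }
    return report


report = check_environment()
print(report)
```

Run it with the tutorial-geoparsing-py39 environment activated; an entry with `"ok": False` points to a version that differs from the one used in the installation steps.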
Data courtesy of the ARTFL Encyclopédie Project, University of Chicago.
The authors are grateful to the ASLAN project (ANR-10-LABX-0081) of the Université de Lyon, for its financial support within the French program "Investments for the Future" operated by the National Research Agency (ANR).