This repository provides scripts for the segmentation, normalization and lemmatization of XML-TEI encoded files.
- clone or download this repository
git clone git@github.com:e-ditiones/Annotator.git
cd Annotator
- create a first virtual environment and activate it
python3 -m venv env
source env/bin/activate
- install dependencies
pip install -r requirements.txt
- if you want to split your text
python3 scripts/segment_text.py path/to/file
- You will get filename_segmented.xml.
- The virtual environment to use is env.
- install lemmatisation models
PIE_EXTENDED_DOWNLOADS=~/MesModelsPieExtended pie-extended download fr
- if you want to lemmatize your segmented file
PIE_EXTENDED_DOWNLOADS=~/MesModelsPieExtended python3 scripts/lemmatize.py path/to/file_segmented.xml
- You will find the results of the lemmatisation in output/data.csv.
- First, deactivate the previous virtual env, using:
deactivate
- create a second virtual environment and activate it
python3 -m venv norm_lstm
source norm_lstm/bin/activate
- install dependencies
pip install -r NORM17-LSTM/requirements.txt
- download the model
cd NORM17-LSTM
bash download_model.sh
- if you want to normalize your segmented file
python3 ../scripts/normalize_lstm.py ../path/to/file_segmented
- The file output/data.csv is updated with the results of the normalisation.
- First, deactivate the previous virtual env, using:
deactivate
- Then, activate the first virtual env
source env/bin/activate
- install the model
Download https://sharedocs.huma-num.fr/wl/?id=hNkFbpu7qU4uQsvRaPWM3mm8SEK5CypU&fmode=download and uncompress it into the data folder.
Download https://sharedocs.huma-num.fr/wl/?id=Kq2woXBVoUv8BIyEQrIP0L0dv6XysWO3&fmode=download and uncompress it into the logs folder.
cd presto-tagger
bash prepare.sh
- if you want to do NER on your file:
python3 ../scripts/ner.py ../output/data.csv
- The file output/data.csv is updated with the results of the NER.
- In the first environment, csv_to_xml.py builds an XML file from the CSV file created in the previous steps.
2. Get the annotated XML file
python3 scripts/csv_to_xml.py path/to/file_segmented
- You will get file_annotated.xml.
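Before running the conversion, it can help to look at what the previous steps wrote to output/data.csv. The sketch below is only an illustration: the column layout of that file is not described here, so the helper (a hypothetical `preview_csv`, not part of this repository) makes no assumption about column names and simply assumes a comma-separated, UTF-8 encoded file.

```python
import csv
from itertools import islice


def preview_csv(path, limit=5):
    """Return the header and the first `limit` rows of a CSV file.

    Assumption: the file is comma-separated and UTF-8 encoded.
    No column names are assumed.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])          # empty list if the file is empty
        rows = list(islice(reader, limit))  # at most `limit` data rows
    return header, rows


if __name__ == "__main__":
    # Path taken from the steps above; adjust it to your working directory.
    header, rows = preview_csv("output/data.csv")
    print(header)
    for row in rows:
        print(row)
```

If a step failed silently, a column expected from that step will typically be missing or empty in this preview.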
Using the Level-2_to_level-3.xsl XSL stylesheet, the script adds XML-TEI tags to split the text into segments (<seg>).
For each <p> (paragraph) and <l> (line), the script level2to3.py uses some punctuation marks (.;:!?) to split the text into segments captured in <seg> elements.
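The splitting rule described above can be sketched in a few lines of Python. This is a simplified illustration, not the code of level2to3.py or of the XSL stylesheet: it assumes that the elements are in the TEI namespace, that each <p> or <l> contains plain text only (elements with children are skipped), and that a segment ends at one of the listed punctuation marks followed by whitespace. The function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Cut at whitespace that follows one of . ; : ! ?
# The punctuation mark stays attached to the segment it closes.
BOUNDARY = re.compile(r"(?<=[.;:!?])\s+")


def split_segments(text):
    """Return the non-empty pieces of `text`, cut after . ; : ! or ?"""
    return [part.strip() for part in BOUNDARY.split(text) if part.strip()]


def segment_tree(root):
    """Wrap the text of each text-only <p> and <l> in <seg> children."""
    for tag in ("p", "l"):
        # Materialise the list first so the tree is not modified while iterating.
        for elem in list(root.iter(f"{{{TEI_NS}}}{tag}")):
            if len(elem) > 0:
                continue  # simplification: mixed content is left untouched
            parts = split_segments(elem.text or "")
            elem.text = None
            for part in parts:
                seg = ET.SubElement(elem, f"{{{TEI_NS}}}seg")
                seg.text = part


if __name__ == "__main__":
    ET.register_namespace("", TEI_NS)
    demo = ET.fromstring(
        f'<TEI xmlns="{TEI_NS}"><p>Un; deux: trois.</p><l>Vers un!</l></TEI>'
    )
    segment_tree(demo)
    print(ET.tostring(demo, encoding="unicode"))
```

Real TEI paragraphs often contain inline elements (abbreviations, line breaks, and so on), which is one reason a stylesheet-based approach is better suited to production use than this sketch.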
For lemmatisation, we use Pie-extended and the "fr" model.
The original version, and not the normalised version, is lemmatised.
This repository is developed by Alexandre Bartz with the help of Simon Gabay, as part of the project e-ditiones.
Our work is licenced under a Creative Commons Attribution 4.0 International Licence.
Pie-extended is under the Mozilla Public License 2.0.
Morphalou is under the LGPL-LR.
Alexandre Bartz, Simon Gabay. 2020. Lemmatization and normalization of French modern manuscripts and printed documents. Retrieved from https://github.com/e-ditiones/SEG17.