
SEG17

This script performs segmentation, normalisation and lemmatisation of XML-TEI encoded files.

Getting started

To install SEG17 from the command line:

  • clone or download this repository
git clone git@github.com:e-ditiones/Annotator.git
cd Annotator

Segmentation

  1. Create a first virtual environment and activate it:
python3 -m venv env
source env/bin/activate
  2. Install the dependencies:
pip install -r requirements.txt
  3. If you want to split your text:
python3 scripts/segment_text.py path/to/file
  4. You will get filename_segmented.xml.

Lemmatisation

  1. The virtual environment to use is env.

  2. Install the lemmatisation models:

PIE_EXTENDED_DOWNLOADS=~/MesModelsPieExtended pie-extended download fr
  3. If you want to lemmatise your segmented file:
PIE_EXTENDED_DOWNLOADS=~/MesModelsPieExtended python3 scripts/lemmatize.py path/to/file_segmented.xml
  4. In output/data.csv, you will find the results of the lemmatisation (a quick way to inspect that file is sketched below).
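A minimal sketch for inspecting output/data.csv from Python (illustrative only; the exact columns and delimiter are defined by the scripts of this repository):

# inspect_csv.py -- print the first rows of output/data.csv
# (illustrative sketch; column layout depends on the pipeline)
from pathlib import Path

csv_path = Path("output/data.csv")
with csv_path.open(encoding="utf-8") as handle:
    for line_number, line in enumerate(handle):
        if line_number >= 5:  # only show the first five rows
            break
        print(line.rstrip("\n"))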

Normalisation LSTM

  1. First, deactivate the previous virtual environment:
deactivate
  2. Create a second virtual environment and activate it:
python3 -m venv norm_lstm
source norm_lstm/bin/activate
  3. Install the dependencies:
pip install -r NORM17-LSTM/requirements.txt
  4. Download the model:
cd NORM17-LSTM
bash download_model.sh
  5. If you want to normalise your segmented file:
python3 ../scripts/normalize_lstm.py ../path/to/file_segmented
  6. The file output/data.csv will be updated to contain the result of the normalisation.

NER

  1. First, deactivate the previous virtual environment:
deactivate
  2. Then, activate the first virtual environment:
source env/bin/activate
  3. Install the model:

Download https://sharedocs.huma-num.fr/wl/?id=hNkFbpu7qU4uQsvRaPWM3mm8SEK5CypU&fmode=download and uncompress it in the data folder.

Download https://sharedocs.huma-num.fr/wl/?id=Kq2woXBVoUv8BIyEQrIP0L0dv6XysWO3&fmode=download and uncompress it in the logs folder.

cd presto-tagger
bash prepare.sh
  4. If you want to do NER on your file:
python3 ../scripts/ner.py ../output/data.csv
  5. The file output/data.csv will be updated to contain the result of the NER.

Get an XML file

  1. In the first environment, csv_to_xml.py uses the created CSV file to build an XML file.
  2. Get the annotated XML file:

python3 scripts/csv_to_xml.py path/to/file_segmented
  3. You will get file_annotated.xml (the general idea of the conversion is sketched below).
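As an illustration of the general idea (not the actual csv_to_xml.py code), the sketch below wraps each CSV row in a TEI <w> element; the column names form, lemma and pos are hypothetical and may differ from the real output/data.csv:

# csv_to_tei_sketch.py -- illustrative only: the column names (form, lemma, pos)
# are hypothetical and the real script defines its own TEI structure
import csv
from xml.sax.saxutils import escape, quoteattr

with open("output/data.csv", encoding="utf-8") as handle:
    rows = list(csv.DictReader(handle))

words = []
for row in rows:
    words.append(
        "<w lemma={} pos={}>{}</w>".format(
            quoteattr(row.get("lemma", "")),
            quoteattr(row.get("pos", "")),
            escape(row.get("form", "")),
        )
    )
print("<seg>" + " ".join(words) + "</seg>")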

How it works

The segmentation

Using the Level-2_to_level-3.xsl XSL stylesheet, the script adds XML-TEI tags to split the text into segments (<seg>). For each <p> (paragraph) and <l> (line), the script level2to3.py splits the text into segments at punctuation marks (.;:!?) and captures them in <seg> elements.
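A minimal sketch of this splitting logic (an illustrative re-implementation, not the actual level2to3.py or XSLT code):

# segmentation_sketch.py -- illustrative only
import re

def split_into_segments(paragraph_text):
    # Cut after each of the punctuation marks .;:!? followed by whitespace
    parts = re.split(r"(?<=[.;:!?])\s+", paragraph_text.strip())
    return [part for part in parts if part]

text = "Premier segment. Deuxième segment ; troisième segment ! Dernier ?"
for segment in split_into_segments(text):
    print("<seg>{}</seg>".format(segment))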

The lemmatisation

For lemmatisation, we use Pie-extended and the "fr" model.

The original version, and not the normalised version, is lemmatised.
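A minimal sketch for driving this step from Python rather than the shell; it simply wraps the documented command line and assumes the models were downloaded to ~/MesModelsPieExtended as shown above:

# lemmatise_wrapper_sketch.py -- illustrative only; reproduces the documented
# command, assuming the Pie-extended models live in ~/MesModelsPieExtended
import os
import subprocess

env = dict(os.environ, PIE_EXTENDED_DOWNLOADS=os.path.expanduser("~/MesModelsPieExtended"))
subprocess.run(
    ["python3", "scripts/lemmatize.py", "path/to/file_segmented.xml"],
    env=env,
    check=True,
)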

Credits (to be updated)

This repository is developed by Alexandre Bartz with the help of Simon Gabay, as part of the e-ditiones project.

Licences

Our work is licensed under a Creative Commons Attribution 4.0 International Licence.

Pie-extended is under the Mozilla Public License 2.0.

Morphalou is under the LGPL-LR.

Cite this repository (to be updated)

Alexandre Bartz, Simon Gabay. 2020. Lemmatization and normalization of French modern manuscripts and printed documents. Retrieved from https://github.com/e-ditiones/SEG17.
