OBVILCorpusImporter

This project is intended to ease the mass import of the OBVIL Library into the OBVIL OAI-PMH repository.

What this script does

Once launched with the proper command, for instance
python3 scrap_obvil_corpora.py -s "crawled_data" -c ../configs/config_omeka.json
this script crawls the specified¹ OBVIL corpora available in the OBVIL Library.

It will:

save the XML/TEI version of the texts in the specified directory (i.e. "crawled_data");
extract the relevant header metadata to be exposed in the OAI-PMH repository (e.g. dc:creator, dc:relation, dc:rights, dc:format, dc:identifier, dc:title, dc:contributor...);
create a thumbnail ("vignette") for each document. All the thumbnails have been generated once and are stored here. If some are missing, you may consider copying them over directly with scp, using your admin privileges;
build one Omeka CSV import file per specified project, containing all the necessary information, in the specified directory (i.e. "crawled_data").
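To make the metadata extraction and CSV steps above more concrete, here is a minimal sketch using only the Python standard library. It is not the project's actual code: the TEI header paths in DC_PATHS, the function names extract_dc and to_csv, the sample document, and the CSV column layout are all assumptions chosen for illustration, and the real script may map fields differently.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Standard TEI namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical mapping from Dublin Core fields to TEI header paths
# (an assumption for this sketch, not the project's real mapping).
DC_PATHS = {
    "dc:title": ".//tei:titleStmt/tei:title",
    "dc:creator": ".//tei:titleStmt/tei:author",
    "dc:rights": ".//tei:publicationStmt/tei:availability",
    "dc:identifier": ".//tei:publicationStmt/tei:idno",
}


def extract_dc(tei_xml):
    """Return a dict of Dublin Core fields read from a TEI header."""
    root = ET.fromstring(tei_xml)
    header = root.find("tei:teiHeader", TEI_NS)
    record = {}
    for field, path in DC_PATHS.items():
        node = header.find(path, TEI_NS) if header is not None else None
        # Collapse whitespace; use an empty string when the element is absent.
        text = "".join(node.itertext()) if node is not None else ""
        record[field] = " ".join(text.split())
    return record


def to_csv(records):
    """Serialize the records as CSV text, one row per document."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(DC_PATHS))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Invented sample document, used only to exercise the sketch.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Sample title</title>
        <author>Sample Author</author>
      </titleStmt>
      <publicationStmt>
        <idno>sample-001</idno>
        <availability><p>All rights reserved</p></availability>
      </publicationStmt>
    </fileDesc>
  </teiHeader>
</TEI>"""

if __name__ == "__main__":
    print(to_csv([extract_dc(SAMPLE)]))
```

Keeping the field-to-path mapping in a single dictionary means a new dc: field only needs one extra entry, and missing elements yield empty cells instead of errors.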

Tl;dr:

python3 scrap_obvil_corpora.py -s "crawled_data" -c ../configs/config_omeka.json
All you need is in the folder crawled_data.

What it does not do (i.e. DIY)

To successfully import the documents into the OAI-PMH repository, you will need to:

Run this script with the right options and configuration.
Put the generated vignettes in the right place on the server if they are missing.
Manually import the generated CSV file into Omeka, with proper rights and mappings.
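Before copying vignettes to the server, it can help to know which ones are actually missing. The sketch below is one possible way to check, under several assumptions that are not confirmed by this repository: that each thumbnail shares its base name with the corresponding XML file, that thumbnails use the .jpg extension, and that both sets of files sit in flat directories. The function name missing_thumbnails is hypothetical.

```python
from pathlib import Path


def missing_thumbnails(xml_dir, thumb_dir, thumb_ext=".jpg"):
    """Return the sorted base names of XML files that have no thumbnail.

    Assumes (hypothetically) that a thumbnail is named after its XML file,
    e.g. "some_text.xml" pairs with "some_text.jpg".
    """
    xml_stems = {p.stem for p in Path(xml_dir).glob("*.xml")}
    thumb_stems = {p.stem for p in Path(thumb_dir).glob("*" + thumb_ext)}
    return sorted(xml_stems - thumb_stems)
```

For example, calling missing_thumbnails("crawled_data", "vignettes") would list the documents whose vignette still has to be put on the server; the "vignettes" directory name here is only a placeholder.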

Disclaimer

Should you run these spiders, you are going to scrape A LOT of data. Use at your own risk!
The texts provided by OBVIL are copyrighted.

1 To specify which corpora should be imported, you will need to customize a configuration file. See the "configs" directory of this repo.
