A Python Scrapy spider intended to retrieve .xml and .epub files from OBVIL corpora

OBVILCorpusImporter

This project is intended to ease the mass import of the OBVIL Library into the OBVIL OAI-PMH repository.

What this script does

Once launched with the proper command, for instance
python3 scrap_obvil_corpora.py -s "crawled_data" -c ../configs/config_omeka.json
this script crawls the specified[1] OBVIL corpora available in the OBVIL Library.

It will:

  • save the XML/TEI version of the texts in the specified directory (i.e. "crawled_data");
  • extract the relevant header metadata to be exposed in the OAI-PMH repository (e.g. dc:creator, dc:relation, dc:rights, dc:format, dc:identifier, dc:title, dc:contributor...);
  • create a thumbnail ("vignette") for each document. All the thumbnails have already been generated once and are stored here. If some are missing, you may consider copying them to the server directly with scp, using your admin privileges;
  • build one Omeka CSV import file per specified project, containing all the necessary information, in the specified directory (i.e. "crawled_data").

TL;DR:
  • python3 scrap_obvil_corpora.py -s "crawled_data" -c ../configs/config_omeka.json
  • All you need is in the folder crawled_data.
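
The metadata extraction and CSV steps listed above can be sketched roughly as follows. This is only an illustration, not the code of this repository: the XPath locations inside the TEI header, the subset of Dublin Core fields, the " | " separator for repeated values, and the CSV column layout are all assumptions, and the function names extract_dc and records_to_csv are hypothetical. Only the TEI namespace URI is standard.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical mapping: Dublin Core field -> XPath relative to <teiHeader>.
# The real script may read other elements or more fields.
DC_XPATHS = {
    "dc:title": "tei:fileDesc/tei:titleStmt/tei:title",
    "dc:creator": "tei:fileDesc/tei:titleStmt/tei:author",
    "dc:rights": "tei:fileDesc/tei:publicationStmt/tei:availability/tei:licence",
}


def extract_dc(tei_xml):
    """Return a dict of Dublin Core fields read from a TEI document string."""
    root = ET.fromstring(tei_xml)
    header = root.find("tei:teiHeader", TEI_NS)
    record = {}
    for field, xpath in DC_XPATHS.items():
        nodes = header.findall(xpath, TEI_NS) if header is not None else []
        # Flatten nested markup and collapse whitespace in each value.
        texts = [" ".join("".join(node.itertext()).split()) for node in nodes]
        # Repeated elements (e.g. several authors) are joined with " | ".
        record[field] = " | ".join(text for text in texts if text)
    return record


def records_to_csv(records):
    """Serialize extracted records to CSV text, one row per document."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(DC_XPATHS))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

Calling extract_dc on the content of each saved XML/TEI file and passing the resulting list to records_to_csv yields one CSV text per project. The column names expected by the Omeka import depend on the mapping chosen at import time, so they may differ from the ones used here.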

What it does not do (i.e. DIY)

To successfully import the documents into the OAI-PMH repository, you will need to:

  • Run this script with the right options and configuration.
  • Put the generated vignettes in the right place on the server if they are missing.
  • Manually import the generated CSV file into Omeka, with proper rights and mappings.
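
Before copying vignettes to the server, it can help to know which documents lack one. The sketch below is a hypothetical helper, not part of this repository: it assumes that each thumbnail shares the base name of its XML file and uses a .jpg extension, and that both sets of files sit in flat local directories. Adjust these assumptions to the actual layout.

```python
from pathlib import Path


def missing_thumbnails(xml_dir, thumb_dir, thumb_ext=".jpg"):
    """List the base names of XML files that have no matching thumbnail.

    Assumes (hypothetically) that "foo.xml" pairs with "foo" + thumb_ext.
    """
    xml_stems = {path.stem for path in Path(xml_dir).glob("*.xml")}
    thumb_stems = {path.stem for path in Path(thumb_dir).glob(f"*{thumb_ext}")}
    return sorted(xml_stems - thumb_stems)
```

For example, missing_thumbnails("crawled_data", "vignettes") would return the sorted base names of documents in crawled_data that have no .jpg counterpart in a local vignettes directory (the latter directory name is itself an assumption).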

Disclaimer

  • Should you run these spiders, you are going to scrape A LOT of data. Use at your own risk!

  • The texts provided by OBVIL are copyrighted.

[1] To specify which corpora should be imported, you will need to customize a configuration file. See the "configs" directory of this repo.
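
As a rough illustration of how the -s and -c options from the example command might be handled, here is a minimal sketch using argparse and json. It is not taken from scrap_obvil_corpora.py: the destination names save_dir and config_path, the default output directory, and the helper names are assumptions. The sketch deliberately assumes nothing about the keys inside the configuration file; see the files in the "configs" directory for the actual schema.

```python
import argparse
import json


def build_parser():
    """Build a parser for the two options shown in the example command."""
    parser = argparse.ArgumentParser(
        description="Crawl the configured OBVIL corpora (illustrative sketch)."
    )
    # -s: directory where the crawled files and CSV are saved.
    parser.add_argument("-s", dest="save_dir", default="crawled_data",
                        help="output directory (default is an assumption)")
    # -c: path to the JSON configuration file selecting the corpora.
    parser.add_argument("-c", dest="config_path", required=True,
                        help="path to a JSON configuration file")
    return parser


def load_config(path):
    """Parse the JSON configuration file and return its content as is."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

A caller would run build_parser().parse_args() and then pass args.config_path to load_config, failing early with a clear error if the file is missing or is not valid JSON.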
