CDS-Videos Transferring Service

Introduction

This project retrieves old video files from CDS, programmatically extracts their metadata, converts the metadata to a new format, and uploads both the video and its metadata to CDS-Videos.

It is implemented as a local web server with three main pages. The first page converts a record's metadata from MARCXML to JSON. Once you have the JSON metadata, the second page lets you either upload the old record to the new platform or download the converted JSON metadata to your system. The third page shows the progress of your uploads and warns you if any of them fails.
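To make the MARCXML to JSON step more concrete, here is a minimal sketch that flattens one MARCXML record into a JSON-friendly dictionary using only the standard library. It is not the conversion performed by cds-dojson, which applies its own rules. The SAMPLE record, the marcxml_to_dict name and the tag+indicator+code key layout are assumptions made for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Standard MARC21 "slim" namespace used by MARCXML documents.
MARC_NS = "{http://www.loc.gov/MARC21/slim}"

# Hypothetical record, made up for this example.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <controlfield tag="001">12345</controlfield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">Example video title</subfield>
  </datafield>
</record>"""


def marcxml_to_dict(xml_text):
    """Flatten one MARCXML record into keys such as "245__a"."""
    record = ET.fromstring(xml_text)
    result = {}
    for field in record.iter(MARC_NS + "controlfield"):
        result.setdefault(field.get("tag"), []).append(field.text or "")
    for field in record.iter(MARC_NS + "datafield"):
        # Blank indicators are written as "_" (as in 980__b).
        ind1 = (field.get("ind1") or " ").replace(" ", "_")
        ind2 = (field.get("ind2") or " ").replace(" ", "_")
        for sub in field.iter(MARC_NS + "subfield"):
            key = field.get("tag") + ind1 + ind2 + sub.get("code")
            result.setdefault(key, []).append(sub.text or "")
    return result


if __name__ == "__main__":
    print(json.dumps(marcxml_to_dict(SAMPLE), indent=2))
```

Values are kept as lists because a MARC tag can repeat within a record.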

Installation

The project is a simple Flask application with only a few PyPI dependencies and one repository dependency.

Create a Python environment, with pyenv or conda for example, and run:

(your_environment)$ git clone https://github.com/Luizerko/cds-video-transfer
(your_environment)$ cd cds-video-transfer
(your_environment)$ pip install -r requirements.txt

Now do the cds-dojson installation by running:

(your_environment)$ cd ..
(your_environment)$ git clone https://github.com/CERNDocumentServer/cds-dojson
(your_environment)$ cd cds-dojson
(your_environment)$ pip install -e .[tests]
(your_environment)$ cd ../cds-video-transfer
(your_environment)$ pip install -e ../cds-dojson

Finally, change the default configuration of cds-dojson so that it works with Flask: on line 108 of cds-dojson/cds_dojson/overdo.py, change if HAS_FLASK: to if not HAS_FLASK:. You should now have a working environment.
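If you prefer not to edit the file by hand, the same one-line change can be scripted. The sketch below is a hypothetical helper (flip_flask_guard is not part of either repository). It only rewrites the line if it contains exactly the expected guard, so a different cds-dojson version in which line 108 has moved is left untouched.

```python
from pathlib import Path


def flip_flask_guard(path, line_no=108):
    """Rewrite `if HAS_FLASK:` to `if not HAS_FLASK:` on a single line.

    Returns True if the file was changed, False if the line did not match.
    """
    file_path = Path(path)
    lines = file_path.read_text(encoding="utf-8").splitlines(keepends=True)
    index = line_no - 1
    if index >= len(lines) or lines[index].strip() != "if HAS_FLASK:":
        return False  # Unexpected content or already patched: do nothing.
    lines[index] = lines[index].replace("if HAS_FLASK:", "if not HAS_FLASK:")
    file_path.write_text("".join(lines), encoding="utf-8")
    return True


if __name__ == "__main__":
    # Path taken from the instructions above, relative to cds-video-transfer.
    print(flip_flask_guard("../cds-dojson/cds_dojson/overdo.py"))
```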

Adding Dependencies:

Dependencies are managed with the pip-tools package. To add new dependencies to the repo, edit the requirements.in file and then run:

(your_environment)$ pip install pip-tools
(your_environment)$ pip-compile requirements.in

The requirements.txt file will be updated automatically. pip-tools was chosen because it resolves and pins the versions of all sub-dependencies, which avoids version conflicts between them.

Note:

To migrate your videos, you need authorization from the CDS team, which means you need an access token to interact with the platform programmatically. Get your access token and save it in a file named access_token, right outside the cds-video-transfer folder.

Also, because of many problems with the tags (legacy information, inconsistency and redundancy, for example), the project is still in an experimental phase. This means that the videos still need to be migrated to CDS-Videos once all the decisions about tags have been taken and CDS-Dojson has been properly updated. It also means that you need a local instance of CDS-Videos running to test the migration process, or you must change the code appropriately to test it on sandbox/production.
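As an illustration of where the token file is expected to live, the following sketch reads access_token from the directory that contains the project folder and builds a request header from it. The function names are hypothetical, and the Bearer header format is an assumption about how the API accepts tokens; check the project code for the exact scheme.

```python
from pathlib import Path


def load_access_token(project_dir):
    """Read the token from the access_token file next to the project folder."""
    token_path = Path(project_dir).resolve().parent / "access_token"
    token = token_path.read_text(encoding="utf-8").strip()
    if not token:
        raise ValueError(f"{token_path} is empty")
    return token


def auth_headers(token):
    # Assumption: the API accepts an OAuth2-style bearer token header.
    return {"Authorization": f"Bearer {token}"}
```

Keeping the token outside the repository folder means it cannot be committed by accident.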

Usage

Start by activating your Python environment, creating the database if you don't have one and then running the project locally with:

(your_environment)$ python3 init_db.py
(your_environment)$ python3 video_extractor.py

Once the web server is running locally, open your browser and go to localhost:5555. You'll find a plug-and-play website ready to transfer old video records from CDS to CDS-Videos.

Single Record: Enter your record ID in the 'Record' section and press submit to convert its MARCXML metadata to JSON metadata.

Multiple Records or Queries: Enter the record IDs separated by commas (first_number,second_number,third_number for three records, for example), or search for a query such as 'physics'. If your query fetches more than 10 videos, they will be migrated in chunks of 10 records.

All Records: Migrate all the records from the Digital Memory Project in chunks of 10 records.
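The comma-separated input and the chunks of 10 described above can be sketched as follows. Both function names are hypothetical and the project's own parsing may differ (for example in how it handles invalid input).

```python
def parse_record_ids(text):
    """Turn "123, 456,789" into [123, 456, 789], skipping empty parts."""
    return [int(part) for part in text.split(",") if part.strip()]


def chunked(items, size=10):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


if __name__ == "__main__":
    record_ids = parse_record_ids("101,102, 103")
    for batch in chunked(record_ids):
        print(batch)
```

With 25 records, this yields two batches of 10 and a final batch of 5.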

After you have migrated all the records you want, remember to generate a file to update the CDS records properly, marking migrated records as migrated_2023 using tag 980__b:

(your_environment)$ python3 updating_cds.py
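For reference, tag 980__b corresponds to a MARCXML datafield with tag 980, two blank indicators and a subfield with code b. The sketch below builds such a record with the standard library. The build_update_record name is hypothetical and the actual output of updating_cds.py may be structured differently.

```python
import xml.etree.ElementTree as ET


def build_update_record(recid, marker="migrated_2023"):
    """Build a minimal MARCXML record that sets 980__b for one record ID."""
    record = ET.Element("record")
    ET.SubElement(record, "controlfield", tag="001").text = str(recid)
    field = ET.SubElement(record, "datafield", tag="980", ind1=" ", ind2=" ")
    ET.SubElement(field, "subfield", code="b").text = marker
    return ET.tostring(record, encoding="unicode")


if __name__ == "__main__":
    # 12345 is a made-up record ID.
    print(build_update_record(12345))
```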

Additional Content:

Inside the moving_images_data folder, you can also find some preprocessed files:

  • update_cds -> MARCXML generated code to update CDS records.
  • migration_database.db -> Database that stores migration state for each processed record.
  • moving_images_<number>.xml -> Processed MARCXMLs with records from a query to CDS. It is numbered because of pagination of requests, since the maximum number of fetched records from a query is 200.
  • <recid>.xml -> Individual processed MARCXML from a specific record.
  • missing_tags -> All tags that were found in the records' MARCXMLs for a query, but were not processed.
  • missing_tags_examples -> All the missing tags for each individual queried record. This file is primarily used to find examples of the missing tags.
  • missing_tags_values -> All the unique values for each individual missing tag of the queried records.
  • moving_images_fails -> All the fails and their errors when generating the JSON file for each individual queried record.
  • moving_images_json -> All the generated JSONs for each individual queried record.
  • persistent_data -> Folder with similar moving_images_<number>.xml, missing_tags, missing_tags_examples, missing_tags_values, moving_images_fails and moving_images_json but for all the Digital Memory Project and non-migrated files.
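The pagination mentioned for moving_images_<number>.xml follows from the limit of 200 records per request. A minimal sketch of how the starting position of each page could be computed is shown below. The page_starts name is hypothetical, and how the project numbers its files or passes these positions to CDS is not covered here.

```python
def page_starts(total_records, page_size=200):
    """Return the 1-based position of the first record on each page."""
    return list(range(1, total_records + 1, page_size))


if __name__ == "__main__":
    # A query matching 450 records needs three requests.
    print(page_starts(450))
```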
