A script to extract node and edge lists from the BELTRANS corpus Excel file. It is deliberately not a command-line script with parameters, because its users prefer to edit variables in the script instead.
(1) Create and activate a Python virtual environment
# Create a new Python virtual environment
python3 -m venv py-gephi-extraction-env
# Activate the virtual environment
source py-gephi-extraction-env/bin/activate
# Install dependencies
pip install -r requirements.txt
(2) Adapt the necessary parameters in the script.
inputFile: the full path to the input Excel file
genrePrefixes: a string used to filter Belgian Bibliography genre classifications. Examples are provided in the script
minYear and maxYear: values used to filter the publications of the input data by year
considerImprintRelation: True or False, indicating whether the listed publishers (imprints) are used as nodes or whether they should be replaced with their main publisher as indicated in the isImprintOf column
imprintMappingExceptions: identifiers of imprints that should not be replaced with their main publisher
namesInsteadOfIdentifiers: use the names of nodes as identifiers (both in nodes and in edges); if duplicate names are found, a warning is printed
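As a rough illustration, the parameter block in the script might look like the sketch below. The variable names are the ones listed above; every value is a placeholder, and using a list for imprintMappingExceptions is an assumption. The helpers resolve_publisher and duplicate_names are hypothetical and are not taken from the script: they only sketch the imprint replacement and the duplicate-name check described above. The replace_imprints argument is deliberately separate from considerImprintRelation, because the description above does not say which of True or False triggers the replacement.

```python
from collections import Counter

# All values below are placeholders for illustration, not the script's defaults.
inputFile = "/path/to/corpus.xlsx"         # full path to the input Excel file
genrePrefixes = "PREFIX"                   # placeholder; see the examples in the script
minYear = 1970                             # placeholder lower bound
maxYear = 2020                             # placeholder upper bound
considerImprintRelation = True
imprintMappingExceptions = ["imprint-2"]   # assumed to be a list of identifiers
namesInsteadOfIdentifiers = False


def resolve_publisher(publisher_id, is_imprint_of, replace_imprints, exceptions):
    """Hypothetical helper: decide which identifier becomes the node.

    is_imprint_of maps an imprint identifier to its main publisher,
    mirroring the isImprintOf column.
    """
    if not replace_imprints or publisher_id in exceptions:
        return publisher_id  # keep the imprint itself as the node
    # Otherwise use the main publisher, if one is recorded.
    return is_imprint_of.get(publisher_id, publisher_id)


def duplicate_names(names):
    """Hypothetical helper: return the names that occur more than once."""
    return sorted(name for name, count in Counter(names).items() if count > 1)
```

A check like duplicate_names matters when namesInsteadOfIdentifiers is enabled, since two distinct nodes sharing a name would otherwise collapse into one.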
Note
Such settings are normally passed via command-line parameters or a config file. However, users of this script preferred to change variables instead.
(3) Execute the script
python gephi-extraction.py
All dependencies are listed in the file requirements.txt.
This script uses Pandas. More specifically, it uses the function read_excel, which requires the other dependencies.
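A minimal sketch of how read_excel and a year filter could fit together is shown below. It is not the script's actual code: the function names, the column name "year" and the inclusive bounds are assumptions made for illustration. Reading .xlsx files with read_excel relies on a separate engine package such as openpyxl being installed.

```python
import pandas as pd


def filter_by_year(df, min_year, max_year, year_column="year"):
    """Keep rows whose year lies between min_year and max_year (inclusive).

    year_column is a hypothetical column name; the real corpus may differ.
    """
    # Non-numeric or missing years become NaN and are dropped by the filter.
    years = pd.to_numeric(df[year_column], errors="coerce")
    return df[years.between(min_year, max_year)]


def load_corpus(input_file, min_year, max_year):
    """Read the Excel file into a DataFrame and apply the year filter."""
    # read_excel delegates .xlsx parsing to an engine such as openpyxl,
    # which must be installed alongside pandas.
    df = pd.read_excel(input_file)
    return filter_by_year(df, min_year, max_year)
```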