After logging into your CSD3 account (on Icelake node), first load the correct Python module:
module load python/3.9.12/gcc/pdcqf4o5
Clone the repository and fetch data files (Git LFS must be installed):
git clone git@github.com:ukaea/fair-mast-ingestion.git
cd fair-mast-ingestion
git lfs pull
Create a virtual environment:
python -m venv fair-mast-ingestion
source fair-mast-ingestion/bin/activate
Update pip and install required packages:
python -m pip install -U pip
python -m pip install -e .
The final step to installation is to have mastcodes:
git clone git@git.ccfe.ac.uk:MAST-U/mastcodes.git
cd mastcodes
Edit uda/python/setup.py
and change the "version" to 1.3.9.
python -m pip install uda/python
cd ..
source ~/rds/rds-ukaea-ap002-mOlK9qn0PlQ/fairmast/uda-ssl.sh
Finally, for uploading to S3 we need to install s5cmd
and make sure it is on the path:
wget https://github.com/peak/s5cmd/releases/download/v2.2.2/s5cmd_2.2.2_Linux-64bit.tar.gz
tar -xvzf s5cmd_2.2.2_Linux-64bit.tar.gz
PATH=$PWD:$PATH
And add a config file for the bucket keys, by creating a file called .s5cfg.stfc
:
[default]
aws_access_key_id=<access-key>
aws_secret_access_key=<secret-key>
You should now be able to run the following commands.
This will ingest data into the test folder in S3. The small_ingest script allows you to put one file of shots into the ingestion.
- First submit a job to collect all the metadata:
sbatch ./jobs/metadata.csd3.slurm.sh
- Then submit an ingestion job
Argument 1 (e.g. s3://mast/test/shots/) is where the data will ingest to, and argument 2 is the file of shots to ingest (e.g. campaign_shots/tiny_campaign.csv), arguments 3 and greater are the sources (e.g. amc)
sbatch ./jobs/small_ingest.csd3.slurm.sh s3://mast/test/shots/ campaign_shots/tiny_campaign.csv amc
This ingestion job runs through all shots for the specified source (e.g. amc)
sbatch ./jobs/small_ingest.csd3.slurm.sh s3://mast/test/shots/ amc
The following section details how to ingest data into a local folder on freia with UDA.
- Parse the metadata for all signals and sources for a list of shots with the following command
mpirun -n 16 python3 -m src.create_uda_metadata data/uda campaign_shots/tiny_campaign.csv
mpirun -np 16 python3 -m src.main data/local campaign_shots/tiny_campaign.csv --metadata_dir data/uda --source_names amc xsx --file_format nc
Files will be output in the NetCDF format to data/local
.
The following section details how to ingest data into the s3 storage on freia with UDA.
- Parse the metadata for all signals and sources for a list of shots with the following command
mpirun -n 16 python3 -m src.create_uda_metadata data/uda campaign_shots/tiny_campaign.csv
This will create the metadata for the tiny campaign. You may do the same for full campaigns such as M9
.
- Run the ingestion pipleline by submitting the following job:
mpirun -np 16 python3 -m src.main data/local campaign_shots/tiny_campaign.csv --bucket_path s3://mast/test/shots --source_names amc xsx --file_format zarr --upload --force
This will submit a job to the freia job queue that will ingest all of the shots in the tiny campaign and push them to the s3 bucket.
To parse CPF metadata we can use the following script (only on Friea):
qsub ./jobs/freia_write_cpf.qsub campaign_shots/tiny_campaign.csv