Add CPF metadata script #4

Merged
merged 1 commit on Sep 11, 2024
8 changes: 8 additions & 0 deletions README.md
@@ -117,3 +117,11 @@ mpirun -np 16 python3 -m src.main data/local campaign_shots/tiny_campaign.csv --

This will submit a job to the freia job queue that will ingest all of the shots in the tiny campaign and push them to the s3 bucket.

## CPF Metadata

To parse CPF metadata, use the following script (Freia only):

```sh
qsub ./jobs/freia_write_cpf.qsub campaign_shots/tiny_campaign.csv
```
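The submitted job runs `src.create_cpf_metadata`, which looks up each CPF column per shot and falls back to NaN when no entry exists. That lookup pattern can be sketched in isolation with a stand-in for `pycpf.query` (the stub data, column names, and shot number below are illustrative, not real CPF values):

```python
import numpy as np


def fake_query(name, where):
    """Stand-in for pycpf.query: returns an empty dict for unknown columns."""
    stub = {"ip_max": {"ip_max": [1.2e6]}}
    return stub.get(name, {})


def read_cpf_for_shot(shot, columns, query=fake_query):
    # Build one row of CPF values for a shot, NaN where no entry exists.
    cpf_data = {}
    for name in columns:
        entry = query(name, f"shot = {shot}")
        cpf_data[name] = entry[name][0] if name in entry else np.nan
    cpf_data["shot_id"] = shot
    return cpf_data


row = read_cpf_for_shot(30420, ["ip_max", "missing_col"])
print(row)  # -> {'ip_max': 1200000.0, 'missing_col': nan, 'shot_id': 30420}
```

Keeping the query function injectable, as above, makes the fallback logic testable without a Freia/CPF environment.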

11 changes: 4 additions & 7 deletions jobs/freia_write_cpf.qsub
@@ -1,20 +1,17 @@
#!/bin/bash

# Verify options and abort if there is an error
#$ -w e

# Choose parallel environment
#$ -pe mpi 16

# Specify the job name in the queue system
#$ -N fairmast-dataset-writer
#$ -N fairmast-cpf-writer

# Start the script in the current working directory
#$ -cwd

# Time requirements
#$ -l h_rt=120:00:00
#$ -l s_rt=120:00:00
#$ -l h_rt=48:00:00
#$ -l s_rt=48:00:00

# Activate your environment here!
module load python/3.9
@@ -28,4 +25,4 @@ shot_file=$1
export PATH="/home/rt2549/dev/:$PATH"

# Run script
python3 -m src.metadata.create_cpf_metadata $shot_file
python3 -m src.create_cpf_metadata $shot_file
5 changes: 5 additions & 0 deletions jobs/submit.cpf.sh
@@ -0,0 +1,5 @@
qsub jobs/freia_write_cpf.qsub campaign_shots/M9.csv
qsub jobs/freia_write_cpf.qsub campaign_shots/M8.csv
qsub jobs/freia_write_cpf.qsub campaign_shots/M7.csv
qsub jobs/freia_write_cpf.qsub campaign_shots/M6.csv
qsub jobs/freia_write_cpf.qsub campaign_shots/M5.csv
56 changes: 56 additions & 0 deletions src/create_cpf_metadata.py
@@ -0,0 +1,56 @@
import argparse
import numpy as np
import pandas as pd
import multiprocessing as mp
from functools import partial
from pathlib import Path
from rich.progress import track
from pycpf import pycpf


def read_cpf_for_shot(shot, columns):
    cpf_data = {}
    for name in columns:
        entry = pycpf.query(name, f"shot = {shot}")
        value = entry[name][0] if name in entry else np.nan
        cpf_data[name] = value

    cpf_data['shot_id'] = shot
    return cpf_data


def main():
    parser = argparse.ArgumentParser(
        prog="FAIR MAST Ingestor",
        description="Parse CPF metadata from the MAST archive and write it to Parquet files",
    )

    parser.add_argument("shot_file")
    args = parser.parse_args()

    shot_file = args.shot_file
    shot_ids = pd.read_csv(shot_file)
    shot_ids = shot_ids['shot_id'].values

    columns = pycpf.columns()
    columns = pd.DataFrame(columns, columns=['name', 'description'])
    columns.to_parquet(f'data/{Path(shot_file).stem}_cpf_columns.parquet')

    pool = mp.Pool(16)
    column_names = columns['name'].values
    func = partial(read_cpf_for_shot, columns=column_names)
    mapper = pool.imap_unordered(func, shot_ids)
    rows = [item for item in track(mapper, total=len(shot_ids))]
    cpf_data = pd.DataFrame(rows)

    # Convert object columns to strings so they can be written to Parquet
    for column in cpf_data.columns:
        dtype = cpf_data[column].dtype
        if dtype == object:
            cpf_data[column] = cpf_data[column].astype(str)

    cpf_data.to_parquet(f'data/{Path(shot_file).stem}_cpf_data.parquet')
    print(cpf_data)


if __name__ == "__main__":
    main()
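The object-to-string conversion at the end of the script can be exercised on its own. The sketch below uses made-up data: mixed-type CPF values land in pandas `object` columns, which Parquet writers cannot store directly, so they are coerced to strings. Note one side effect worth knowing about: `astype(str)` turns NaN into the literal string `"nan"`.

```python
import numpy as np
import pandas as pd

# Illustrative frame: shot_id is numeric, comment is a mixed object column.
cpf_data = pd.DataFrame({"shot_id": [30001, 30002], "comment": ["ok", np.nan]})

# Coerce object columns to strings; numeric columns are left untouched.
for column in cpf_data.columns:
    if cpf_data[column].dtype == object:
        cpf_data[column] = cpf_data[column].astype(str)

print(cpf_data["comment"].tolist())  # -> ['ok', 'nan']
```

If the `"nan"` strings are unwanted downstream, `fillna("")` before the cast would be one alternative.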