This project augments OpenAlex works with additional metadata prepared by CSET. The dataset is available at Zenodo.
For the complete list of metadata fields and their types, see schemas/metadata.json. Below, we describe in more detail how each field was collected:
Our article linkage pipeline generates language ID labels for titles and abstracts using PYCLD2. We include a language ID only when PYCLD2 returned a language and marked the result as reliable.
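The reliability filter described above can be sketched as follows. This is a minimal illustration, not the linkage pipeline's actual code: the helper name `reliable_language` is hypothetical, and returning the language code (rather than the language name) is an assumption. It relies on `pycld2.detect` returning an `(is_reliable, bytes_found, details)` tuple, where each entry in `details` starts with the language name and code.

```python
from typing import Callable, Optional


def reliable_language(text: str, detect: Optional[Callable] = None) -> Optional[str]:
    """Return the top detected language code, or None if detection is unreliable.

    `detect` is a hypothetical injection point so the function can be
    exercised without pycld2 installed; by default it uses pycld2.detect.
    """
    if detect is None:
        # Imported lazily so a stub detector can be used in its place.
        import pycld2
        detect = pycld2.detect
    try:
        is_reliable, _bytes_found, details = detect(text)
    except Exception:
        # Detection can fail on malformed input; treat that as "no label".
        return None
    if not is_reliable or not details:
        return None
    code = details[0][1]
    # CLD2 reports the code "un" when it cannot identify a language.
    if code == "un":
        return None
    return code
```

With pycld2 installed, `reliable_language(title)` would return a code only when the detector flags its result as reliable; otherwise the field would be left empty.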
We share outputs for subject classifiers (for more information on how these classifiers were trained and deployed, see our documentation) in the following fields:
- is_cv - True if a computer vision classifier predicted the work was relevant
- is_nlp - True if a natural language processing classifier predicted the work was relevant
- is_robotics - True if a robotics classifier predicted the work was relevant
- is_ai - True if an artificial intelligence classifier predicted the work was relevant, or if any of the computer vision, natural language processing, or robotics classifiers predicted the work was relevant
- is_cyber - True if a cybersecurity classifier predicted the work was relevant
- is_ai_safety - True if an AI safety classifier predicted the work was relevant
- is_chip_design_fabrication - True if a chip design and fabrication classifier predicted the work was relevant
- is_llm - True if a large language model classifier predicted the work was relevant
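Unlike the other fields, is_ai combines several classifier outputs. A minimal sketch of that rule, assuming each prediction is available as a boolean (the function and parameter names are hypothetical and not taken from the pipeline):

```python
def derive_is_ai(ai_pred: bool, is_cv: bool, is_nlp: bool, is_robotics: bool) -> bool:
    """Combine classifier predictions into the is_ai flag.

    is_ai is True if the AI classifier itself fired, or if any of the
    computer vision, NLP, or robotics classifiers fired.
    """
    return ai_pred or is_cv or is_nlp or is_robotics
```

One consequence is that a work can have is_ai set to True even when the AI classifier alone did not predict it as relevant, for example when only is_robotics is True.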
The dataset is updated monthly through the pipeline in cset_openalex_augmentation_dag.py. This pipeline runs the query in sql/metadata.sql to aggregate CSET metadata associated with each OpenAlex work, backs the results up within our internal data warehouse, and updates the data on Zenodo.
(For CSET staff) To update the artifacts used by this pipeline, run bash push_to_airflow.sh.