Notion: https://elianna.notion.site/elianna/Automating-Metadata-a8acd0a54e05497dad0faf3ada5f1708
The goal of this project is to ignore the current metadata system and start over. We want to know
- What metadata is necessary for science?
- Resources
- elicit.org - This search engine is already scraping metadata through GPT-3 including customization of which data you the user wants. Sometimes it doesn’t return results.
- I’m not sure if they use GPT for the base metadata (authors, etc) or if they use Semantic Scholar’s API.
- https://www.researchobject.org/ro-crate/1.1/ -> a metadata standard.
- https://research.manchester.ac.uk/en/publications/ro-crate-metadata-specification-111 - the documentation for the project above.
- Schema.org
- FAIR data standards
- Data standards: (
- Data origin: experimental, observational, raw or derived, physical collections, models, images, etc.
- Data type: integer, Boolean, character, floating point, etc.
- Instrument(s) used
- Data acquisition details: sensor deployment methods, experimental design, sensor calibration methods, etc.
- File type: CSV, mat, xlsx, tiff, HDF, NetCDF, etc.
- Data processing methods, software used
- Data processing scripts or codes
- Dataset parameter list, including
- Variable names
- Description of each variable
- Units
- elicit.org - This search engine is already scraping metadata through GPT-3 including customization of which data you the user wants. Sometimes it doesn’t return results.
- Elaboration
- We should explore this at a granular level at the research object (instead of just research node). What is the metadata for code, pdfs, etc?
- We’re in the exploratory phase -> we want to see what data is in these papers. What can we get out of this?
- Error in GPT isn’t a super huge problem at the moment -
- We’re trying to find ways to make things better!
- Resources
- How can we eliminate the need for scientists to enter that metadata?
- Hypothesis: GPT!
- Other?
- What else?
What we have
- A section for metadata on the node platform.
- Build out/explore what the Nodes team has discovered through scraping PDFs
- later on down the road -> integrated into search. But shouldn't think of this as integrating into the nodes platform.
- Emerging Problem statements - (To be edited!)
- There is no understanding of what metadata is prevalent across all research. There is no standard for that.
- Connected to ontology.
- Longer term problem statement -> researchers don't have the knowledge processes/workflows to correctly enter metadata in a standardized format.
- metadata that is unstandardized and it is difficult to read and use.
- Difficult to seamlessly tie one piece of research to another (what are the relations between research) ununified scientific record.
- How might we unify the scientific record w metadata standards that can be accessed by different platforms/gateways so that they can use the research in any way they want.
- There is no understanding of what metadata is prevalent across all research. There is no standard for that.