The Materials Data Facility Connect service provides the ETL flow that deeply indexes datasets into MDF Search. It is not intended to be run by end users. To submit data to the MDF, visit the Materials Data Facility.
The MDF Connect service is a serverless REST service deployed on AWS. It consists of an AWS API Gateway that uses a Lambda function to authenticate requests against Globus Auth. If the request is authorized, the endpoint triggers an AWS Lambda function. Each endpoint is implemented as a Lambda function contained in a Python file in the aws/ directory. The Lambda functions are deployed via GitHub Actions, as described in a later section.
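The authorizer pattern can be sketched roughly as below. This is an illustrative assumption about how the real authorizer works, not the service's actual code: the client credentials are placeholders, and the Globus SDK introspection call stands in for whatever the deployed Lambda does.

```python
# Sketch of an API Gateway Lambda authorizer checking a bearer token
# against Globus Auth. CLIENT_ID/CLIENT_SECRET are placeholders.
CLIENT_ID = "YOUR_GLOBUS_CLIENT_ID"      # placeholder
CLIENT_SECRET = "YOUR_GLOBUS_SECRET"     # placeholder

def build_policy(principal_id, effect, method_arn):
    """Return the IAM policy document API Gateway expects from an authorizer."""
    return {
        "principalId": principal_id,
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [{
                "Action": "execute-api:Invoke",
                "Effect": effect,
                "Resource": method_arn,
            }],
        },
    }

def lambda_handler(event, context):
    import globus_sdk  # imported lazily so build_policy stays dependency-free
    token = event["authorizationToken"].replace("Bearer ", "")
    client = globus_sdk.ConfidentialAppAuthClient(CLIENT_ID, CLIENT_SECRET)
    introspection = client.oauth2_token_introspect(token)
    if introspection.get("active"):
        return build_policy(introspection["sub"], "Allow", event["methodArn"])
    return build_policy("anonymous", "Deny", event["methodArn"])
```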
The API Endpoints are:
- POST /submit: Submits a dataset to the MDF Connect service. This triggers a Globus Automate flow
- GET /status: Returns the status of a dataset submission
- POST /submissions: Forms a query and returns a list of submissions
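A client-side sketch of calling these endpoints follows; the base URL and the shape of the status path are hypothetical assumptions, and the requests are built but not sent so the example stays self-contained.

```python
import json
import urllib.request

# Hypothetical base URL; the real service URL may differ.
MDF_CONNECT_URL = "https://api.materialsdatafacility.org"

def build_submit_request(token, submission):
    """Build an authenticated POST /submit request (not yet sent)."""
    return urllib.request.Request(
        url=f"{MDF_CONNECT_URL}/submit",
        data=json.dumps(submission).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )

def build_status_request(token, source_id):
    """Build a GET /status request for one submission (path shape is assumed)."""
    return urllib.request.Request(
        url=f"{MDF_CONNECT_URL}/status/{source_id}",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )

# Sending is then a one-liner:
# response = urllib.request.urlopen(build_status_request(token, "my_dataset"))
```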
The Globus Automate flow is a series of steps triggered by the POST /submit endpoint. The flow is defined using a Python DSL that can be found in automate/minimus_mdf_flow.py. At a high level, the flow:
- Notifies the admin that a dataset has been submitted
- Checks whether the data files have been updated or if this is a metadata-only submission
- If there is a dataset, it starts a Globus transfer
- Once the transfer is complete it may trigger a curation step if the organization is configured to do so
- A DOI is minted if the organization is configured to do so
- The dataset is indexed in MDF Search
- The user is notified of the completion of the submission
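The branching above can be sketched as plain Python. Every helper name here is a hypothetical stand-in for a real flow state; each one just records the step so the ordering and conditionals are visible.

```python
# Illustrative sketch of the Automate flow's control flow.
log = []

def notify_admin(sub): log.append("notify_admin")
def wait_for_transfer(task_id): log.append("wait")
def run_curation(sub): log.append("curation")
def index_in_search(sub): log.append("index")
def notify_user(sub): log.append("notify_user")

def start_globus_transfer(sub):
    log.append("transfer")
    return "transfer-task-id"

def mint_doi(sub):
    log.append("mint_doi")
    return "10.xxxx/example"  # placeholder DOI

def run_flow(submission, org):
    """Mirror the high-level branching described above."""
    notify_admin(submission)
    if not submission.get("metadata_only"):
        task = start_globus_transfer(submission)
        wait_for_transfer(task)
    if org.get("curation_enabled"):
        run_curation(submission)
    if org.get("mint_doi"):
        submission["doi"] = mint_doi(submission)
    index_in_search(submission)
    notify_user(submission)
    return log
```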
Changes should be made in a feature branch based off of the dev branch. Create a PR and have a colleague review your changes. Once the PR is approved, merge it into the dev branch, which is automatically deployed to the dev environment. Once the changes have been tested in the dev environment, create a PR from dev to main. Once that PR is approved, merge it into main; the main branch is automatically deployed to the prod environment.
The MDF Connect service is deployed on AWS into development and production environments. The automate flow is deployed into the Globus Automate service via a second GitHub action.
Changes to the automate flow are deployed via a GitHub action, triggered by the push of a new GitHub release. If the release is tagged as "pre-release" it will be deployed to the dev environment, otherwise it will be deployed to the prod environment.
The flow IDs for dev and prod are stored in automate/mdf_dev_flow_info.json and automate/mdf_prod_flow_info.json respectively, under the flow_id key.
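A minimal sketch of reading the deployed flow ID from these files (the helper name is hypothetical):

```python
import json

def load_flow_id(environment):
    """Read the deployed flow ID for 'dev' or 'prod' from its JSON info file."""
    path = f"automate/mdf_{environment}_flow_info.json"
    with open(path) as fh:
        return json.load(fh)["flow_id"]
```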
- Merge your changes into the dev branch.
- On the GitHub website, click the Releases link on the repo home page.
- Click the "Draft a new release" button.
- Fill in the tag version as X.Y.Z-alpha.1, where X.Y.Z is the version number. You can use subsequent alpha tags if you need to make further changes.
- Fill in the release title and description.
- Select dev as the target branch.
- Check the "Set as a pre-release" checkbox.
- Click the "Publish release" button.
- Merge your changes into the main branch.
- On the GitHub website, click the Releases link on the repo home page.
- Click the "Draft a new release" button.
- Fill in the tag version as X.Y.Z, where X.Y.Z is the version number.
- Fill in the release title and description.
- Select main as the target branch.
- Check the "Set as the latest release" checkbox.
- Click the "Publish release" button.
You can verify deployment of the flows in the Globus Automate Console.
The MDF Connect service is deployed via a GitHub action. The action is triggered by a push to the dev or main branch. The action will deploy the service to the dev or prod environment respectively.
Schemas and the MDF organization database are managed in the automate branch of the Data Schemas Repo.
The schemas are packaged into the Docker images used to serve the Lambda functions.
To run the tests, first make sure that you are running Python 3.7.10. Then install the dependencies:
$ cd aws/tests
$ pip3 install -r requirements-test.txt
Now you can run the tests using the command:
$ PYTHONPATH=.. python -m pytest --ignore schemas
This work was performed under financial assistance awards 70NANB14H012 and 70NANB19H005 from the U.S. Department of Commerce, National Institute of Standards and Technology, as part of the Center for Hierarchical Materials Design (CHiMaD). This work was also supported by the National Science Foundation as part of the Midwest Big Data Hub under NSF Award Number 1636950, "BD Spokes: SPOKE: MIDWEST: Collaborative: Integrative Materials Design (IMaD): Leverage, Innovate, and Disseminate".