This is a demonstration of the MIP Data Factory focusing on its workflow application, Airflow.
It runs inside a Vagrant virtual machine and showcases ETL pipelines for medical data.
- Install Ansible version 2.2.0 or later. On Ubuntu, you can use the script `./common/scripts/bootstrap.sh`.
- Install VirtualBox version 5.0 or later.
- Install Vagrant version 1.8.5 or later.
- Install the vagrant-hostmanager plugin: `vagrant plugin install vagrant-hostmanager`
- Start the virtual machine with Vagrant: `vagrant up`. You will need at least 5 GB of RAM available for the VM.
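Before running `vagrant up`, the minimum versions listed above can be checked with a small script. This is a minimal sketch and not part of the demo repository: the function names `version_ge`, `check_tool`, `first_version` and `check_prerequisites` are hypothetical, and the comparison assumes a `sort` that supports `-V` (version sort), as GNU coreutils on Ubuntu does.

```shell
# Hypothetical helper, not part of the demo repository.

# version_ge INSTALLED REQUIRED: succeeds when INSTALLED >= REQUIRED.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# first_version: extracts the first dotted version number read from stdin.
first_version() {
  grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1
}

# check_tool NAME REQUIRED INSTALLED: prints one status line for a tool.
check_tool() {
  if [ -z "$3" ]; then
    echo "$1: not found (need >= $2)"
  elif version_ge "$3" "$2"; then
    echo "$1: $3 ok (need >= $2)"
  else
    echo "$1: $3 too old (need >= $2)"
  fi
}

# check_prerequisites: compares installed versions with the README minimums.
check_prerequisites() {
  check_tool ansible    2.2.0 "$(ansible --version 2>/dev/null | first_version)"
  check_tool virtualbox 5.0   "$(VBoxManage --version 2>/dev/null | first_version)"
  check_tool vagrant    1.8.5 "$(vagrant --version 2>/dev/null | first_version)"
}
```

Calling `check_prerequisites` prints one line per tool. The script only reports; it does not install or upgrade anything.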

After upgrading the Linux kernel on your system, you may encounter this message when running a Vagrant command:

    The provider 'virtualbox' that was requested to back the machine
    'airflow' is reporting that it isn't usable on this system. The
    reason is shown below:

    VirtualBox is complaining that the installation is incomplete. Please
    run `VBoxManage --version` to see the error message which should contain
    instructions on how to fix this error.

To fix it, rebuild the VirtualBox kernel module with this command:

    sudo apt-get install --reinstall virtualbox-dkms linux-headers-generic

The virtual machine should start and install Airflow.
You can see Airflow running at localhost:14080.
Marathon can be accessed at localhost:15080.
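Once the VM is up, a quick way to confirm that both web interfaces answer is to probe the two ports with `curl`. The sketch below is illustrative only: the function names are hypothetical, and the `http://` scheme is an assumption, since the text above gives only host and port.

```shell
# Hypothetical helper, not part of the demo repository.

# endpoint_url PORT: URL of a service forwarded from the VM to the host.
# The http:// scheme is an assumption; only host and port are documented.
endpoint_url() {
  echo "http://localhost:$1/"
}

# check_service NAME PORT: reports whether anything answers on the port.
# Any HTTP response counts as reachable, whatever its status code.
check_service() {
  if curl --silent --output /dev/null --max-time 5 "$(endpoint_url "$2")"; then
    echo "$1: reachable at $(endpoint_url "$2")"
  else
    echo "$1: not reachable at $(endpoint_url "$2")"
    return 1
  fi
}

# check_demo_services: probes both interfaces; fails if either is down.
check_demo_services() {
  status=0
  check_service Airflow 14080 || status=1
  check_service Marathon 15080 || status=1
  return "$status"
}
```

Run `check_demo_services` after `vagrant up` has finished. Provisioning can take a while, so a "not reachable" result shortly after startup does not necessarily mean the installation failed.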
Example data is provided in the `/data/demo` folder inside the VM, but you need MATLAB installed in the virtual machine to run the SPM 12-based preprocessing pipelines.
Deployment:
- ansible-airflow

Data Factory:
- data-tracking
- mri-meta-db
- mri-preprocessing-pipeline
- airflow-imaging-plugins
- data-factory-airflow-dags

The Ansible inventory controls what software is installed and how it is configured.
It is organised by hosts (servers) and groups.
Here, we have the following organisation:
- demo: the target host, running inside a Vagrant virtual machine
- managed: a group containing demo, indicating that the server is managed by Ansible and should receive a default configuration and a set of base software packages
- control: a group containing demo, indicating that this server is used to perform operations affecting the whole cluster (here we have a 'cluster' of one machine)
- zookeeper, mesos-mixed: groups that are used to define where and how the Mesos stack is deployed
- airflow: a group that is used to define which applications should be deployed by Marathon
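
The organisation above could be written as an INI-style Ansible inventory along these lines. This is a sketch under assumptions, not the demo's actual inventory file: the host and group names come from the list above, while the placement of demo in the zookeeper, mesos-mixed and airflow groups is assumed from the fact that the cluster has a single machine.

```ini
# Hypothetical inventory sketch; the demo's real inventory may differ.
# Host and group names are taken from the description above.

[managed]
demo

[control]
demo

# Assumption: with a single-machine cluster, demo also hosts the Mesos stack.
[zookeeper]
demo

[mesos-mixed]
demo

# Assumption: demo is the host on which Marathon deploys Airflow.
[airflow]
demo
```

Variables attached to each group (for example in `group_vars`) would then carry the configuration that the corresponding roles apply to every host in that group.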