Python interface for NEEMhub and OpenEASE.
- tutorial / shell script for installing dependencies on Debian-based systems
- Which steps are necessary for Automatic Data Download?
- Which steps are necessary for Automatic Data Upload?
- How to use the DVC Python API? See https://dvc.org/doc/api-reference
- Look at: https://gitpython.readthedocs.io/en/stable/
- Define an interface for the dataset creator!
- Add file meta-data to the ontology / triple store
- Get access to MongoDB
- Get a hidden test dataset in OpenEASE
- Which backend for build-hook execution?
- query to OpenEASE -> event-ids + time-intervals
- get trial-ids (top-level NEEM) for the event-ids
- filter files by data properties and meta-data
- download the needed files
- create dataset: cut dataset snippets (a skeleton of this pipeline is sketched below)
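The pipeline above could hang together roughly as follows. This is a minimal skeleton only: every class and function in it is a hypothetical placeholder for the interface the dataset creator still has to define, not an existing NEEMhub or OpenEASE API.

```python
"""Hypothetical skeleton of the dataset-creation pipeline; all names are
placeholders for the interface that still needs to be defined."""
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class EventSelection:
    trial_id: str                  # top-level NEEM the event belongs to
    event_id: str
    interval: Tuple[float, float]  # (start, end) of the event


def query_openease(query: str) -> List[EventSelection]:
    """Query OpenEASE for event-ids + time-intervals and resolve their
    trial-ids (stub)."""
    raise NotImplementedError


def filter_files(selections: List[EventSelection]) -> List[str]:
    """Filter candidate files by data properties and meta-data (stub)."""
    raise NotImplementedError


def download_files(file_uris: List[str]) -> List[str]:
    """Download only the needed files, return local paths (stub)."""
    raise NotImplementedError


def create_dataset(query: str) -> List[str]:
    """query -> trial-ids -> filter -> download -> cut snippets."""
    selections = query_openease(query)
    local_paths = download_files(filter_files(selections))
    # cutting the snippets out of the downloaded files is format-specific
    # and left open here
    return local_paths
```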
- Install hadoop: https://linuxconfig.org/ubuntu-20-04-hadoop
- make sure you added the exports for the Hadoop path to your local user's .bashrc (otherwise only the hadoop user can pull NEEMs with dvc)
- pip3 install dvc[hdfs]
- git clone https://neemgit.informatik.uni-bremen.de/neems/ease-2020-pr2-setting-up-table
- go to the new folder and run dvc pull
Synchronization encapsulates the git and dvc operations. It either clones the remote repository or synchronizes the existing local copy with it:
    if remote not exists:
        FAIL
    if local exists:
        if is git-repo:
            fetch remote data from neemgit
        else:
            clone remote data
        if conflicts: FAIL
        else: merge
        if local changes:
            if not dvc repo:
                dvc init
            dvc add all
            dvc push all
            git commit -am
            git push
    else:
        clone neemgit
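A minimal sketch of that logic, assuming GitPython (https://gitpython.readthedocs.io/en/stable/) for the git side and the dvc CLI via subprocess; the repository URL, local path, branch name master and the list of data files are placeholder assumptions:

```python
"""Sketch of the synchronization pseudocode above; URL, path, branch and
data-file list are placeholders."""
import os
import subprocess

import git


def dvc(path, *args):
    """Run a dvc command inside the local repository."""
    subprocess.run(["dvc", *args], cwd=path, check=True)


def synchronize(remote_url, local_path, data_files):
    if not os.path.exists(local_path):
        git.Repo.clone_from(remote_url, local_path)  # else: clone neemgit
        return

    try:
        repo = git.Repo(local_path)                  # if is git-repo
    except git.exc.InvalidGitRepositoryError:
        # local folder exists but is no git repo -> clone remote data
        # (in practice the old folder has to be moved aside first)
        repo = git.Repo.clone_from(remote_url, local_path)

    repo.remotes.origin.fetch()
    try:
        repo.git.merge("origin/master")              # if conflicts: FAIL
    except git.exc.GitCommandError as err:
        raise RuntimeError("merge conflict, resolve manually") from err

    if repo.is_dirty(untracked_files=True):          # if local changes
        if not os.path.isdir(os.path.join(local_path, ".dvc")):
            dvc(local_path, "init")                  # dvc init
        dvc(local_path, "add", *data_files)          # dvc add all
        dvc(local_path, "push")                      # dvc push all
        repo.git.add(A=True)
        repo.index.commit("synchronize NEEM data")   # git commit -am
        repo.remotes.origin.push()                   # git push
```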
- Create a query to OpenEASE + specify the datasets
- Transform the query into a file-requesting query
- For each dataset: submit the query to OpenEASE
- For each dataset in the reply: checkout the dvc-files
- For each file-URI in the query: pull the file from NEEM-Hub
- Transform the query result into a proper format (possibly CSV or pd.DataFrame); see the sketch below
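For the checkout/pull steps, the DVC Python API documented at https://dvc.org/doc/api-reference can fetch single dvc-tracked files without a full checkout. A minimal sketch, assuming a placeholder repository URL and that the requested files are CSV:

```python
"""Sketch of pulling single files with the DVC Python API (dvc.api);
the repository URL is a placeholder."""
import dvc.api
import pandas as pd

NEEM_REPO = "https://neemgit.informatik.uni-bremen.de/neems/<neem>"  # placeholder


def pull_file(file_uri, rev=None):
    """Read one dvc-tracked file from the NEEM-Hub remote, at an optional
    git revision, without cloning the whole repository."""
    return dvc.api.read(file_uri, repo=NEEM_REPO, rev=rev, mode="rb")


def reply_to_dataframes(file_uris):
    """Transform the query reply into pd.DataFrames (assuming CSV files)."""
    frames = {}
    for uri in file_uris:
        with dvc.api.open(uri, repo=NEEM_REPO) as f:
            frames[uri] = pd.read_csv(f)
    return frames
```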
- Check which files have been changed since the last file pull
- For these files: is there a newer version online? If yes, cancel the upload; the conflict needs to be solved manually
- Commit the changed data into the local HDFS
- Create new dvc-files
- Push the upload to the remote NEEM-Hub HDFS
- Push into NeemGit (sketched below)
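A sketch of the upload guard and push sequence, again with GitPython plus the dvc CLI; the branch name master and the commit message are assumptions:

```python
"""Sketch of the upload workflow: abort if the remote moved ahead,
otherwise refresh the dvc-files, push the data to the NEEM-Hub HDFS
remote, and push the metadata into NeemGit."""
import subprocess

import git


def upload(local_path):
    repo = git.Repo(local_path)
    repo.remotes.origin.fetch()

    # newer version online? -> cancel; conflict must be solved manually
    if list(repo.iter_commits("HEAD..origin/master")):
        raise RuntimeError("remote has newer commits; resolve manually")

    # files changed since the last pull (working tree vs. index)
    changed = [d.a_path for d in repo.index.diff(None)]
    if not changed:
        return

    # refresh the dvc-files for the changed data, then push to HDFS
    subprocess.run(["dvc", "commit"], cwd=local_path, check=True)
    subprocess.run(["dvc", "push"], cwd=local_path, check=True)

    # push the updated dvc-files into NeemGit
    repo.git.add(A=True)
    repo.index.commit("update NEEM data")
    repo.remotes.origin.push()
```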
Table: filename
- The associated research field seems to be ontology alignment / semantic integration
- The term "ontology mapping" also exists; it is not clearly defined, but seems to fit better in this context, since we are not mapping onto-to-onto but (annotation-schema)-to-onto
- there are three kinds of properties in an ELAN file: fully mapped properties, ignored properties (not mapped), and undefined properties
- by converting ELAN to a DataFrame, we get a relational representation
- the most promising candidates at the moment are (see the sketch after this list for the mapping itself):
  - rdflib.csv2rdf → not documented and therefore obscure
  - rdfpandas → not powerful enough
  - pyTARQL, where TARQL is SPARQL for tables
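Whatever tool is chosen, the core of the (annotation-schema)-to-onto step on the relational representation is small. A minimal sketch with plain rdflib and pandas, where the namespace, the column names, and the column-to-property mapping are made-up examples:

```python
"""Sketch of mapping an ELAN-derived DataFrame to RDF triples with plain
rdflib; namespace, columns and the mapping are illustrative assumptions."""
import pandas as pd
from rdflib import RDF, Graph, Literal, Namespace

EX = Namespace("http://example.org/neem#")  # placeholder namespace

# fully mapped properties: annotation-schema column -> ontology property;
# ignored properties are simply absent from this mapping
COLUMN_TO_PROPERTY = {
    "tier": EX.tierName,
    "start": EX.startTime,
    "end": EX.endTime,
    "value": EX.annotationValue,
}


def dataframe_to_graph(df: pd.DataFrame) -> Graph:
    g = Graph()
    g.bind("ex", EX)
    for i, row in df.iterrows():
        subject = EX[f"annotation_{i}"]
        g.add((subject, RDF.type, EX.Annotation))
        for column, prop in COLUMN_TO_PROPERTY.items():
            if column in df.columns:  # undefined columns stay unmapped
                g.add((subject, prop, Literal(row[column])))
    return g


if __name__ == "__main__":
    df = pd.DataFrame({"tier": ["speech"], "start": [0.0],
                       "end": [1.2], "value": ["hello"]})
    print(dataframe_to_graph(df).serialize(format="turtle"))
```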