introduce different dataloader categories #8

Open
motey opened this issue Apr 22, 2020 · 2 comments
Labels
Status: Suggested This issue is a suggestion for doing something new or different in CovidGraph Tag: Documentation About CovidGraph Documentation Tag: Help Wanted Extra attention is needed

Comments

@motey
Member

motey commented Apr 22, 2020

To prevent messed-up data and to enable possible new features, we need to categorize dataloaders:

non-idempotent dataloaders

Dataloaders that run only once, initially. These are for static data like gene databases.

idempotent dataloaders

Dataloaders that will evolve and whose data will probably change, like the publication data in the CORD19 dataset, which is revised from time to time.

Whether a rerun is necessary could be decided by a changed Docker Hub hash (i.e. a changed dataloader image).

service dataloaders

Dataloaders for data that will change regularly in any case, like COVID case statistics.
These dataloaders should run periodically.
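
A rough sketch of how the three categories above could drive a "should this loader run now?" decision. Everything in it is hypothetical and not part of CovidGraph: the names `Category`, `LoaderState` and `should_run`, the one-day default interval, and the assumption that the current image digest has already been looked up somewhere else.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Category(Enum):
    NON_IDEMPOTENT = "non-idempotent"  # runs only once, initially
    IDEMPOTENT = "idempotent"          # reruns when the dataloader image changes
    SERVICE = "service"                # runs periodically


@dataclass
class LoaderState:
    category: Category
    image_digest: str                        # digest of the currently published image (assumed input)
    last_run_digest: Optional[str] = None    # digest used by the last run, None if never run
    last_run_at: Optional[datetime] = None   # time of the last run, None if never run
    interval: timedelta = timedelta(days=1)  # hypothetical default, only used by SERVICE loaders


def should_run(state: LoaderState, now: datetime) -> bool:
    if state.last_run_at is None:
        return True  # every loader runs at least once
    if state.category is Category.NON_IDEMPOTENT:
        return False  # already ran, never again
    if state.category is Category.IDEMPOTENT:
        return state.image_digest != state.last_run_digest
    # SERVICE: run again once the interval has elapsed
    return now - state.last_run_at >= state.interval
```

The digest comparison is only a stand-in for "the dataloader image changed"; how the digest is actually obtained is left open here.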

@motey motey added Tag: Documentation About CovidGraph Documentation Status: Suggested This issue is a suggestion for doing something new or different in CovidGraph Tag: Help Wanted Extra attention is needed labels Apr 22, 2020
This was referenced Apr 22, 2020
@mpreusse
Member

We should also consider that not all data loaders have simple update logic, i.e. some have to perform complex operations to determine the updates.

Example: The loading script that generates :Fragment nodes with sentences from full-text nodes (:BodyText, :PatentAbstract, etc.). This has to rerun whenever we have new text. But the text fragments have no primary key except for the sentence itself. The script would need to either check every existing full text and verify that all its sentences exist (costly), or create :Fragment nodes only for full-text nodes that have no :Fragment nodes yet (error-prone if the content of the full-text node changes).

Btw gene databases are not static 😄

@motey
Member Author

motey commented Apr 23, 2020

Btw gene databases are not static 😄

I was already afraid that's the case, but my brain just wouldn't come up with a good example at 1am :D

Example: The loading script that generates :Fragment nodes with sentences from full-text nodes (:BodyText, :PatentAbstract, etc.) [...]

IMHO the dataloader is the problem in this case :) What about a flag for already-fragmented text, or simple logic like "when text fragments are on the node, no fragmenting is needed anymore"?
Changes to full-text nodes should be rather rare (and if they happen, they should be rather subtle).
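
To make the flag idea a bit more concrete, here is a minimal sketch in plain Python with dicts standing in for graph nodes. It is a variation on the flag, not the existing loader: instead of a boolean it stores a hash of the text that was fragmented, so a changed full text would also trigger a rerun. The property names `fragmented_hash` and `fragments`, the helper functions and the naive sentence splitter are all made up for illustration.

```python
import hashlib
import re
from typing import Dict, List


def text_hash(text: str) -> str:
    # Fingerprint of the full text at the time it was fragmented
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def split_sentences(text: str) -> List[str]:
    # Naive splitter for illustration only: breaks after ., ! or ?
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def fragment_node(node: Dict) -> bool:
    """(Re)create fragments only if needed. Returns True if work was done."""
    current = text_hash(node["text"])
    if node.get("fragmented_hash") == current:
        return False  # flag matches the current text, nothing to do
    node["fragments"] = split_sentences(node["text"])
    node["fragmented_hash"] = current  # hypothetical flag property
    return True
```

With a plain boolean flag the check would be cheaper still, but it would miss the case where the content of the full-text node changes after fragmenting.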
