Automatic Database Migration #129

Open
lkiesow opened this issue May 20, 2017 · 3 comments

lkiesow (Member) commented May 20, 2017

With 2.2 we will change the database structure. That means that we need to re-create the database, which, in turn, means that we will lose data if we do not do anything about it.

This is definitely no problem for the status tables or the buffer, but it might be a problem for the finished recordings. We would not lose any real data, since the recordings are still there and the metadata are in Opencast, but it would still be convenient to keep that table.

Hence, the question is what we want to do about it. The options are:

  1. No migration. Just delete the database.
  2. Manual migration (SQL upgrade file)
  3. Add DB versioning and add a complete automatic migration
  4. Add DB versioning, but only export relevant data, create a new database and re-import the data

I would be ok with 1., I dislike 2., I think 3. is a lot of work and not worth the effort, and finally, I like 4.

My idea for 4. would be to add a database version table, just containing a number. If the internal pyCA database version is different, we would export the recorded_event table as JSON, re-create the database and finally import the JSON again. This should be easier than a complete migration, and we should even be able to import old data since we only need to interpret what information we have about a given event. If at some point a field becomes irrelevant, we can simply drop it. If there is a new one, we set it to null or generate a sensible default.
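
To make this a bit more concrete, here is a rough sketch of what option 4 could look like for an SQLite database. Everything in it (the db_version table layout, the placeholder schema, the example version number) is made up for illustration; the real schema would of course come from pyCA's own setup code:

import json
import sqlite3

DB_VERSION = 1  # example value; bump whenever the schema changes


def migrate(db_path):
    # Sketch of option 4: keep only recorded_event across a schema change.
    # Assumes an existing (old) pyCA database at db_path.
    con = sqlite3.connect(db_path)
    con.row_factory = sqlite3.Row
    try:
        row = con.execute('SELECT version FROM db_version').fetchone()
        version = row[0] if row else 0
    except sqlite3.OperationalError:
        version = 0  # no version table yet: treat this as an old database
    if version != DB_VERSION:
        # 1. Export the one table worth keeping as JSON
        #    (assumes all of its columns are JSON-serializable).
        backup = json.dumps(
            [dict(r) for r in con.execute('SELECT * FROM recorded_event')])
        # 2. Drop everything and re-create the new schema (placeholder DDL).
        con.executescript('''
            DROP TABLE IF EXISTS recorded_event;
            DROP TABLE IF EXISTS upcoming_event;
            DROP TABLE IF EXISTS service_states;
            DROP TABLE IF EXISTS db_version;
            CREATE TABLE recorded_event (
                uid TEXT, "start" INTEGER, "end" INTEGER,
                status TEXT, title TEXT);
            CREATE TABLE db_version (version INTEGER);
        ''')
        # 3. Re-import: keep the fields we still know about,
        #    new fields simply end up as null.
        for event in json.loads(backup):
            con.execute(
                'INSERT INTO recorded_event VALUES (?, ?, ?, ?, ?)',
                (event.get('uid'), event.get('start'), event.get('end'),
                 event.get('status'), event.get('title')))
        con.execute('INSERT INTO db_version VALUES (?)', (DB_VERSION,))
        con.commit()
    con.close()

For MySQL or PostgreSQL, the same approach would have to go through whatever database abstraction pyCA already uses instead of sqlite3 directly.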

Packages could also make use of this method on upgrades by calling the migration script.

Any thoughts?

lkiesow added this to the 2.2 milestone May 20, 2017
JanKoppe (Contributor) commented

I dislike 1.; IMO this is the equivalent of a "just reboot Windows" bugfix.
I would be totally okay with 2., as sqlite is installed on those machines anyway, which makes running an upgrade file a breeze.
3: We could probably use Alembic for that.
And finally 4: You probably have much more experience with that than me, but wouldn't such a migration need to read out the old database properly, serialize it, keep all records in memory, remove the old database, create a new one and write all entries back? This seems like a real pitfall. What happens if pyCA crashes while all entries are only kept in memory? Instead, create a new database with a different name? Do we have the file permissions for that? Isn't that also more error-prone code?

Right now I would favor 2 > 3 > 4 > 1. But this is just after 5 minutes of quick thinking. :)

lkiesow (Member, Author) commented May 21, 2017

The problem with the update scripts is that you cannot easily automate the updates during a package upgrade. That basically means that automatic upgrades may break a running pyCA. For one, you do not know which database is used, since MySQL, PostgreSQL, … can be configured as well. Also, you either need upgrade scripts for each version or need to build a lot of safeguards into the script, which might lead to weird SQL magic.

Although, at the moment, the script would be relatively simple:

ALTER TABLE recorded_event ADD title TEXT;
DROP TABLE upcoming_event;
DROP TABLE service_states;

Alembic would work, but it would create some unnecessary overhead since we do not need to migrate anything but recorded_event. We would need to keep track of all changes and iirc this would add quite a bit of migration code. It has been a while since I had a look at that one, though, and I will take another look later.
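
For comparison, an Alembic revision for the change above would look roughly like this (the revision identifiers and the docstring are placeholders, not existing pyCA migration code):

"""add title to recorded_event, drop obsolete tables (placeholder revision)"""
from alembic import op
import sqlalchemy as sa

# revision identifiers used by Alembic (placeholder values)
revision = '000000000001'
down_revision = None


def upgrade():
    op.add_column('recorded_event', sa.Column('title', sa.Text()))
    op.drop_table('upcoming_event')
    op.drop_table('service_states')


def downgrade():
    # left out in this sketch; removing the new column and re-creating
    # the dropped tables would go here
    pass

The revision itself is small; the overhead is the surrounding Alembic setup (alembic.ini, env.py, and a growing chain of revisions for every future schema change).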

The idea with 4 was to export the relevant data from recorded_event and normalize it while doing so. When updating we would then re-import the data into the table. Thinking about it, we could even write the data to the recording directory. That would make it possible to upgrade at any time without the need to deliberately start the process. In short:

First, add a version table to detect required migrations. Maybe add a field indicating if we really need a data migration or if dropping the tables for non-permanent data would suffice.

Second, write an info.json to the recording directory on recording success and possibly on certain status updates. It should contain the data from the recorded_event table in normalized form, like this:

{
  "uid": "123",
  "start": 1,
  "end": 2,
  "tracks": {
    "presenter/source": "xy.mp4"
  }
}
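
A minimal sketch of writing that file could be (the function name and how the event data is passed in are made up, not existing pyCA code):

import json
import os


def write_event_info(recording_dir, uid, start, end, tracks):
    # Dump the normalized event data next to the recording itself so that
    # a later migration can re-import it without touching the old database.
    info = {'uid': uid, 'start': start, 'end': end, 'tracks': tracks}
    with open(os.path.join(recording_dir, 'info.json'), 'w') as f:
        json.dump(info, f, indent=2)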

If, on start-up, a migration is required, drop all tables, re-initialize the database and import the data from the JSON files. If, for example, a new field title was added, which is not present in the JSON, set a sensible default.
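
Continuing the sketch, the start-up import could then look roughly like this, assuming the tables have already been dropped and re-created; insert_recorded_event is a hypothetical stand-in for whatever function actually writes a row into the new recorded_event table:

import glob
import json
import os


def import_recorded_events(recording_base_dir, insert_recorded_event):
    # Re-import events from the per-recording info.json files written above.
    for path in glob.glob(os.path.join(recording_base_dir, '*', 'info.json')):
        with open(path) as f:
            info = json.load(f)
        insert_recorded_event(
            uid=info['uid'],
            start=info['start'],
            end=info['end'],
            tracks=info.get('tracks', {}),
            title=info.get('title'))  # new field: defaults to null/None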

This is certainly slower than an SQL alter statement but should be no problem with the amount of data we are dealing with.

It has the advantage that, if we modify the database, we only need to bump the database version, and only if we modify the data of recorded events do we have to set a sensible default for the import.

The nice thing about this is that it even works for data migration (say we want to extract data from the data field, which holds all the attachments).
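
Purely as an illustration of that kind of data migration, and assuming (hypothetically) that the old data field held a JSON-serialized list of attachment dicts, the import step could do something like:

import json


def extract_title(info):
    # hypothetical: pull a title out of the old serialized `data` blob
    attachments = json.loads(info.get('data', '[]'))
    return next((a['title'] for a in attachments if a.get('title')), None)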

Please let me know if you think this is an insane idea…

JanKoppe (Contributor) commented May 21, 2017

Ah, yes, I did not notice that getting the version of the old database might be a problem. I've only ever packaged for Arch Linux, where it's possible to run actions depending on the old package version while upgrading. But even that can differ from the actual database version, with some weird edge cases.

Your suggestion to always write out a JSON file on recording success sounds sensible. A nice side effect of this is that recordings are (once again) self-contained in their directories, making it possible to move them around (and maybe back them up?) if someone really wants to do that.

One point, though, is the condition under which this file would be saved to disk. I would even go as far as saving it when starting a recording. This would ensure that even failed recordings are registered during a database migration. Otherwise you might end up with a lot of dead files when upgrading a database that still has a lot of failed/unfinished recordings in it.
