Data management #3385
-
I am designing a data management plan for a new lab using QCoDeS, and I'd like to hear the opinions of other users before committing too much. As I understand it, QCoDeS stores all the data gathered during a run in a SQLite database file. I need a way to aggregate all this data over a large number of runs, but I'm not sure how I'm meant to do this, or whether QCoDeS is even designed for this kind of use.

I considered re-using the same database file for all our experimentation. This is not workable for several reasons. In the first place, it puts the entire database at the mercy of whoever is currently using the bench machine, and we're talking about, hopefully, thousands of experiments over several months; even with regular automatic backups this is a recipe for trouble. Furthermore, it doesn't scale to multiple test setups. I briefly entertained the idea of storing the database on a shared filesystem, but that is just silly from both a data integrity and a performance perspective. Any reuse-the-database solution also raises concerns about data hygiene. The fact that QCoDeS has a concept of a "default experiment", which aggregates all data not otherwise assigned to an experiment, worries me, because I definitely don't want a carelessly written test to barf all over the last guy's data set, making it impossible to determine which experiments came from where. Perhaps I have misunderstood this concept.

So at this point we're talking about some kind of system where, each time we change the physical test setup, we start fresh with an empty database. This means I need a secondary "layer 2" database to complement the SQLite "layer 1" database. The easiest way to do this (and the least likely to accidentally lose data due to my mistakes) is simply to upload the whole SQLite file as a blob to a data server (which database to choose is still unclear; it probably doesn't matter much) with some attached metadata about the test setup provided by the experimenter, so it can be easily retrieved. However, this means there's no way to run queries across all experiments: you have to retrieve each blob from the server and do the analysis on it locally. This may work for a while; however, I'm concerned about scalability and about being able to find the experiment I'm looking for once the database gets large.

The final idea I have had is to somehow get data "out" of the SQLite database and render it into a more native format. At first I thought this would be relatively easy, because SQLite is a SQL database and it should be straightforward to export tables into Postgres or whatever. But the QCoDeS database format does things like storing the names of tables inside other tables and generating names from autoincrement keys, which makes it difficult to merge two databases that may already have tables with the same names. I could probably write a script to do it, but I'd be worried about accidentally missing something and mangling the database irrecoverably. On the other hand, I could simply design my own database for layer 2 and export the data directly from the bench machine; this is what I would do if QCoDeS had no database at all. But it requires taking a lot of care to make sure all the relevant information is included (I've put a rough sketch of what I mean below). How can I minimize the chance that I forget to upload some critical piece of data? And how should I design that database to allow all the flexibility that QCoDeS can support?
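To make that last idea concrete, here's roughly the kind of export script I have in mind, assuming a reasonably recent QCoDeS. The function name and record layout are my own invention; only the QCoDeS calls themselves are real:

```python
# Sketch: walk every run in one QCoDeS .db file and collect everything
# I can think of that a "layer 2" store would need. The record layout
# is my own; the qcodes API calls are real.
from qcodes.dataset import experiments, initialise_or_create_database_at

def dump_all_runs(db_path):
    """Collect data and metadata for every run in one QCoDeS database file."""
    initialise_or_create_database_at(db_path)    # point qcodes at this file
    records = []
    for exp in experiments():                    # every experiment in the file
        for ds in exp.data_sets():               # every run in that experiment
            records.append({
                "guid": ds.guid,                 # globally unique across databases
                "exp_name": exp.name,
                "sample_name": exp.sample_name,
                "run_timestamp": ds.run_timestamp(),
                "snapshot": ds.snapshot,         # instrument settings; may be None
                "data": ds.get_parameter_data(), # dict of numpy arrays per parameter
            })
    return records
```

Even this makes me nervous: am I sure the GUID, the snapshot, and the parameter data are everything worth keeping?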
Basically I'm hoping for feedback on any or all of these ideas, and ideally some stories from people who have used QCoDeS and developed long-term data storage plans to go with it. How did you solve these problems?
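For concreteness, the blob-upload idea could be as small as a content-addressed copy plus a JSON sidecar. Everything below (the archive path, the metadata fields) is a placeholder, not an existing tool:

```python
# Placeholder sketch of "layer 2 as blobs": copy the whole .db file to a
# server share, next to a JSON sidecar describing the test setup.
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path

ARCHIVE = Path(r"\\dataserver\qcodes-archive")  # placeholder destination

def archive_db(db_path, setup_metadata):
    """Upload one SQLite file as an immutable blob plus searchable metadata."""
    src = Path(db_path)
    digest = hashlib.sha256(src.read_bytes()).hexdigest()
    dest = ARCHIVE / f"{digest}.db"             # content-addressed: re-uploads are harmless
    shutil.copy2(src, dest)
    sidecar = {
        "sha256": digest,
        "original_name": src.name,
        "uploaded_at": datetime.now(timezone.utc).isoformat(),
        **setup_metadata,                       # bench id, operator, sample, ...
    }
    dest.with_suffix(".json").write_text(json.dumps(sidecar, indent=2))
    return dest
```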
-
Hello @trentj
What I understand from your explanation is that you need a data warehouse into which you can upload data from the local SQLite database files generated by QCoDeS. If that is the case, it sounds like a big project, and how to implement it doesn't have a definite answer. QCoDeS doesn't provide any infrastructure for this; it would have to be built on top of the QCoDeS-generated database files and the entities (tables) generated in those files.

Regarding the "default experiment": there is no way around having a default experiment for performing QCoDeS measurements and generating datasets, because QCoDeS doesn't care whether a run is a real run, a test run, or an uncompleted run. An experiment is simply a bucket that gathers runs under the same exp_name and sample_name. Of course, QCoDeS didn't have aggregation of data in mind; it only cares about performing measurements on a local computer.
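That said, nothing forces runs into the default experiment. If you load or create a named experiment up front, every new run is registered under it instead. This uses the real qcodes API (in recent QCoDeS versions); the experiment and sample names are just examples:

```python
# Real qcodes API; the experiment and sample names are just examples.
from qcodes.dataset import Measurement, load_or_create_experiment

exp = load_or_create_experiment(
    experiment_name="cooldown_2021_10",  # example name
    sample_name="device_A3",             # example name
)
meas = Measurement(exp=exp)  # runs go under this experiment, not the default one
```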
-
Hi @trentj
I think what you ask is an important question that I have heard asked many times before, but unfortunately I have not seen any good public solutions. I agree with @FarBo that a full-scale data warehouse solution is a lot of work, and it also depends on everybody in the lab following the same definitions and only using QCoDeS. I have tried to make a solution inspired by the method used by most photo editing software, where the photos are stored in a file system and a library database stores the information about where the individual data files live.
The package is called QDataLib and you can read more about it here: https://qdev-dk.github.io/QDataLib/index.html
QDataLib is still in its first iteration and has not been properly battle-tested, so for now it is best suited for inspiration.
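The core of the pattern is tiny. A toy version (NOT QDataLib's actual API, just an illustration of the idea) looks something like this:

```python
# Toy illustration of the "photo library" pattern; not QDataLib's API.
# Raw data files stay where they are; a small index database records where
# each run lives, plus enough metadata to find it again.
import sqlite3

def open_index(index_path):
    """Open (or create) the library index database."""
    con = sqlite3.connect(index_path)
    con.execute("""
        CREATE TABLE IF NOT EXISTS runs (
            guid TEXT PRIMARY KEY,    -- qcodes run GUID
            db_file TEXT NOT NULL,    -- where the raw data actually lives
            exp_name TEXT,
            sample_name TEXT,
            run_timestamp TEXT
        )
    """)
    return con

def register_run(con, guid, db_file, exp_name, sample_name, run_timestamp):
    """Record one run's location and metadata in the index."""
    con.execute(
        "INSERT OR REPLACE INTO runs VALUES (?, ?, ?, ?, ?)",
        (guid, db_file, exp_name, sample_name, run_timestamp),
    )
    con.commit()
```

Because the index only stores locations and metadata, it stays small enough to query across every setup in the lab.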