Data management #3385
-
I am designing a data management plan for a new lab using QCoDeS, and I'd like to hear the opinions of other users before committing too much. As I understand it, QCoDeS stores all the data gathered during a run in a SQLite database file. I need a way to aggregate all this data over a large number of runs, but I'm not sure how I'm meant to do this, or whether QCoDeS is even designed for this kind of use.

I considered re-using the same database file for all our experimentation. This is not workable for several reasons. In the first place, it puts the entire database at the mercy of whoever is currently using the bench machine, and we're talking about, hopefully, thousands of experiments over several months; even with regular automatic backups this is a recipe for trouble. Furthermore, it doesn't scale to multiple test setups. I briefly entertained the idea of storing the database on a shared filesystem, but that is just silly from both a data integrity and a performance perspective. Any reuse-the-database solution also raises concerns about data hygiene. The fact that QCoDeS has a concept of a "default experiment", which aggregates all data not otherwise assigned to an experiment, worries me, because I definitely don't want a carelessly written test to barf all over the last guy's data set, making it impossible to determine which experiments came from where. Perhaps I have misunderstood this concept.

So at this point we're talking about some kind of system where, each time we change the physical test setup, we start fresh with an empty database. This means I need a secondary "layer 2" database to complement the SQLite "layer 1" database. The easiest way to do this (and the least likely to accidentally lose data due to my mistakes) is simply to upload the whole SQLite file as a blob to a data server (which database to choose is still unclear; it probably doesn't matter much) with some attached metadata about the test setup provided by the experimenter, so it can be easily retrieved. However, this means there's no way to run queries across all experiments: you have to retrieve each blob from the server and do the analysis on it locally. This may work for a while; however, I'm concerned about scalability and about being able to find the experiment I'm looking for once the database gets large.

The final idea I have had is to somehow get data "out" of the SQLite database and render it into a more native format. At first I thought this would be relatively easy, because SQLite is a SQL database and it should be straightforward to export tables into Postgres or whatever. But the QCoDeS database format does things like storing the names of tables inside other tables and generating names from autoincrement keys, which makes it difficult to merge two databases that may already have tables with the same names. I could probably write a script to do it, but I'd be worried about accidentally missing something and mangling the database irrecoverably. On the other hand, I could simply design my own database for layer 2 and export the data directly from the bench machine; this is what I would do if QCoDeS had no database at all. But it requires taking a lot of care to make sure all the relevant information is included (I've put a rough sketch of what I mean below). How can I minimize the chance that I forget to upload some critical piece of data? And how should I design that database to allow all the flexibility that QCoDeS can support?
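To make that last idea concrete, here's roughly the kind of export script I have in mind, assuming a reasonably recent QCoDeS. The function name and record layout are my own invention; only the QCoDeS calls themselves are real:

```python
# Sketch: walk every run in one QCoDeS .db file and collect everything
# I can think of that a "layer 2" store would need. The record layout
# is my own; the qcodes API calls are real.
from qcodes.dataset import experiments, initialise_or_create_database_at

def dump_all_runs(db_path):
    """Collect data and metadata for every run in one QCoDeS database file."""
    initialise_or_create_database_at(db_path)    # point qcodes at this file
    records = []
    for exp in experiments():                    # every experiment in the file
        for ds in exp.data_sets():               # every run in that experiment
            records.append({
                "guid": ds.guid,                 # globally unique across databases
                "exp_name": exp.name,
                "sample_name": exp.sample_name,
                "run_timestamp": ds.run_timestamp(),
                "snapshot": ds.snapshot,         # instrument settings; may be None
                "data": ds.get_parameter_data(), # dict of numpy arrays per parameter
            })
    return records
```

Even this makes me nervous: am I sure the GUID, the snapshot, and the parameter data are everything worth keeping?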
Basically I'm hoping for feedback on any or all of these ideas, and ideally some stories from people who have used QCoDeS and developed long-term data storage plans to go with it. How did you solve these problems?
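For concreteness, the blob-upload idea could be as small as a content-addressed copy plus a JSON sidecar. Everything below (the archive path, the metadata fields) is a placeholder, not an existing tool:

```python
# Placeholder sketch of "layer 2 as blobs": copy the whole .db file to a
# server share, next to a JSON sidecar describing the test setup.
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path

ARCHIVE = Path(r"\\dataserver\qcodes-archive")  # placeholder destination

def archive_db(db_path, setup_metadata):
    """Upload one SQLite file as an immutable blob plus searchable metadata."""
    src = Path(db_path)
    digest = hashlib.sha256(src.read_bytes()).hexdigest()
    dest = ARCHIVE / f"{digest}.db"             # content-addressed: re-uploads are harmless
    shutil.copy2(src, dest)
    sidecar = {
        "sha256": digest,
        "original_name": src.name,
        "uploaded_at": datetime.now(timezone.utc).isoformat(),
        **setup_metadata,                       # bench id, operator, sample, ...
    }
    dest.with_suffix(".json").write_text(json.dumps(sidecar, indent=2))
    return dest
```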
-
Hello @trentj
What I understand from your explanation is that you need a data warehouse into which you can upload data from the local SQLite database files generated by QCoDeS. If that is the case, it sounds like a big project, and how to implement it doesn't have a definite answer. QCoDeS doesn't provide any infrastructure for this; it would have to be built on top of the QCoDeS-generated database files and the entities (tables) generated in those files.

Regarding the "default experiment": there is no way around having a default experiment for performing QCoDeS measurements and generating datasets, because QCoDeS doesn't care whether a run is a real run, a test run, or an uncompleted run. An experiment is simply a bucket that gathers runs under the same exp_name and sample_name. Of course, QCoDeS didn't have aggregation of data in mind; it only cares about performing measurements on a local computer.
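That said, nothing forces runs into the default experiment. If you load or create a named experiment up front, every new run is registered under it instead. This uses the real qcodes API (in recent QCoDeS versions); the experiment and sample names are just examples:

```python
# Real qcodes API; the experiment and sample names are just examples.
from qcodes.dataset import Measurement, load_or_create_experiment

exp = load_or_create_experiment(
    experiment_name="cooldown_2021_10",  # example name
    sample_name="device_A3",             # example name
)
meas = Measurement(exp=exp)  # runs go under this experiment, not the default one
```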
-
Hi @trentj
I think what you ask is an important question that I have heard asked many times before, but unfortunately I have not seen any good public solutions. I agree with @FarBo that a full-scale data warehouse solution is a lot of work, and it also depends on everybody in the lab following the same definitions and only using QCoDeS. I have tried to make a solution inspired by the method used by most photo editing software, where the photos are stored in a file system and a library database stores the information about where the individual data files live.
The package is called QDataLib and you can read more about it here: https://qdev-dk.github.io/QDataLib/index.html
QDataLib is still in its first iteration and has not been properly battle-tested, so for now it is best suited for inspiration.
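The core of the pattern is tiny. A toy version (NOT QDataLib's actual API, just an illustration of the idea) looks something like this:

```python
# Toy illustration of the "photo library" pattern; not QDataLib's API.
# Raw data files stay where they are; a small index database records where
# each run lives, plus enough metadata to find it again.
import sqlite3

def open_index(index_path):
    """Open (or create) the library index database."""
    con = sqlite3.connect(index_path)
    con.execute("""
        CREATE TABLE IF NOT EXISTS runs (
            guid TEXT PRIMARY KEY,    -- qcodes run GUID
            db_file TEXT NOT NULL,    -- where the raw data actually lives
            exp_name TEXT,
            sample_name TEXT,
            run_timestamp TEXT
        )
    """)
    return con

def register_run(con, guid, db_file, exp_name, sample_name, run_timestamp):
    """Record one run's location and metadata in the index."""
    con.execute(
        "INSERT OR REPLACE INTO runs VALUES (?, ?, ?, ?, ?)",
        (guid, db_file, exp_name, sample_name, run_timestamp),
    )
    con.commit()
```

Because the index only stores locations and metadata, it stays small enough to query across every setup in the lab.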