How to represent controlled vocabulary terms
To avoid using necromancy, this new thread picks up where #79 ended.
Possible designs
1. Original Design: List of (key, value) pairs with values homogenized as strings. The value must be parsed from a string whenever it is used.
2. Non-viable: List of (key, value) pairs with the value stored as a union.
3. Per-File Struct: A single struct whose fields are strongly typed. No general-purpose code is viable here; the schema depends upon the source data. This is the most compact format.
4. The Alternate Union: List of (key, value) pairs with the value stored in one of n nullable strongly typed lanes.
5. Separate Type Lists: One list of (key, value) pairs per value type.
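To make the candidate layouts concrete, here is a rough sketch of each as DuckDB table definitions. The field names for #4 and #5 (`cv_values`, `int_value`, `float_value`, `cv_ints`, `cv_floats`) match the queries further down; everything else (the extra lanes, the accession/unit fields, and the flattened column names for #3) is an illustrative assumption, not part of the proposal.

```sql
-- Rough schema sketches of the candidate designs (DuckDB type syntax).

-- Design #1: one list, values homogenized as strings.
-- (accession/unit fields are assumed here for illustration)
CREATE TABLE design1 (
    cv_params STRUCT(name VARCHAR, accession VARCHAR, value VARCHAR, unit VARCHAR)[]
);

-- Design #3: per-file struct, one strongly typed column per observed term.
-- (column names are an assumed flattening of the CV term names)
CREATE TABLE design3 (
    base_peak_mz DOUBLE,
    total_ion_current DOUBLE
);

-- Design #4: one list, one nullable strongly typed lane per value type
-- ("mixed.parquet" in the queries below).
CREATE TABLE design4 (
    cv_values STRUCT(
        name VARCHAR,
        int_value BIGINT,
        float_value DOUBLE,
        string_value VARCHAR
    )[]
);

-- Design #5: one (key, value) list per value type ("props.parquet" below).
CREATE TABLE design5 (
    cv_ints    STRUCT(name VARCHAR, value BIGINT)[],
    cv_floats  STRUCT(name VARCHAR, value DOUBLE)[],
    cv_strings STRUCT(name VARCHAR, value VARCHAR)[]
);
```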
To evaluate this I used some examples from mzML spectra, as I did not have a high-density example set of peptide ID CV parameters handy. The Parquet files produced for all of these cases are attached: examples.zip
Discussion
Sizes:

| Design | File size |
| --- | --- |
| #1 | 161.1 KB |
| #3 | 102.4 KB |
| #4 | 111.8 KB |
| #5 | 111.8 KB |
Design #1 is the largest file (161.1 KB), and at first glance should also be the most expensive one to query at scale since
it requires parsing the value.
Design #3 is the smallest file (102.4 KB), but it doesn't permit us to provide units or CURIEs. It will also throw schema errors when a parameter is missing, instead of returning NULL. It's not possible to repeat a parameter a variable number of times either, but that use-case should be pretty rare.
Designs #4 and #5 are comparable in size, weighing in at ~111.8 KB each.
Query efficiency
Supposing you know exactly what you want, and the type it will be, then the cost per row to find it will be:

| Design | Cost per row |
| --- | --- |
| #1 | O(N) |
| #3 | O(1) |
| #4 | O(N) |
| #5 | O(N_type) |

where N is the number of parameters and N_type is the number of parameters of that type, so N_type is expected to be ≤ N.
This means that #5 is slightly more efficient than #4.
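To ground those costs, here is roughly what the known-name, known-type lookup looks like per design. `mixed.parquet` and `props.parquet` match the queries below; `strings.parquet`, `struct.parquet`, the `cv_params` field, and the `base_peak_mz` column are assumed names for illustration.

```sql
-- #1: scan the list, then parse the string value: O(N) plus a cast
-- ('strings.parquet' and the cv_params field are assumed names)
SELECT list_filter(cv_params, x -> x.name == 'base peak m/z')[1].value::double
FROM 'strings.parquet';

-- #3: the term is a dedicated, strongly typed column: O(1)
-- ('struct.parquet' and base_peak_mz are assumed names)
SELECT base_peak_mz FROM 'struct.parquet';

-- #4: scan the single mixed list: O(N)
SELECT list_filter(cv_values, x -> x.name == 'base peak m/z')[1].float_value
FROM 'mixed.parquet';

-- #5: scan only the list holding that type: O(N_type)
SELECT list_filter(cv_floats, x -> x.name == 'base peak m/z')[1].value
FROM 'props.parquet';
```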
If the stored type isn't known but you want to use it as a specific type, then things get more complicated.

For #4:

```sql
SELECT COALESCE(base_peak.int_value::double, base_peak.float_value) AS base_peak
FROM (
    SELECT list_filter(cv_values, x -> x.name == 'base peak m/z')[1] AS base_peak
    FROM 'mixed.parquet'
);
```

For #5:

```sql
SELECT COALESCE(
    list_filter(cv_ints, x -> x.name == 'base peak m/z')[1].value::double,
    list_filter(cv_floats, x -> x.name == 'base peak m/z')[1].value
) AS base_peak
FROM 'props.parquet';
```
Trade-off discussion
Using #3's per-file struct schema is an appealing maximum-efficiency approach, but it means one cannot reasonably assume that a single consistent schema within the quantms.io spec is used for all files in a collection. This may effectively prevent query engines from accepting such collections, though that is an implementation detail of each engine. My tests suggested that DuckDB rejects this without special configuration (passing `union_by_name = true` to the `read_parquet` function), and even then it requires consistent types amongst shared names.
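For reference, a minimal sketch of that configuration, assuming two hypothetical per-file-struct Parquet files with differing columns:

```sql
-- union_by_name lets DuckDB merge differing per-file struct schemas;
-- missing columns become NULL, but shared names must still agree on type.
-- (run_a.parquet / run_b.parquet are hypothetical file names)
SELECT *
FROM read_parquet(['run_a.parquet', 'run_b.parquet'], union_by_name = true);
```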
Using #4 or #5 achieves a compromise in disk space and type-safety. #5's separate-list representation has the advantage of consuming less memory during the marshalling process (17% less than #4). However, #5 is more complicated to use, especially when the caller does not know which type the creator listed something under. Robust readers would essentially need a fallback "search all the lists" code path, in case a parameter is mis-represented as another type, instead of immediately assuming it is absent when not found in the expected type list.
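A sketch of what that fallback path might look like for #5, extending the two-list query above to also probe a hypothetical `cv_strings` list:

```sql
SELECT COALESCE(
    -- expected home for the value
    list_filter(cv_floats, x -> x.name == 'base peak m/z')[1].value,
    -- fallbacks, in case the writer stored it under another type
    list_filter(cv_ints, x -> x.name == 'base peak m/z')[1].value::double,
    TRY_CAST(list_filter(cv_strings, x -> x.name == 'base peak m/z')[1].value AS DOUBLE)
) AS base_peak
FROM 'props.parquet';
```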
As optimality is both an issue of computing/resource efficiency and ease-of-use, I think #4 strikes the best balance. It maintains the originally desired flexibility, makes the ambiguous cases easier to resolve, and does not sacrifice too much in terms of resource efficiency.