How to represent controlled vocabulary parameters, round 2 #85

Open
mobiusklein opened this issue Dec 22, 2024 · 0 comments

How to represent controlled vocabulary terms

To avoid thread necromancy, this new issue picks up where #79 left off.

Possible designs

  1. Original Design: List of (key, value) pairs with values homogenized as strings
  2. Non-viable: List of (key, value) pairs with value stored as a union
  3. Per-File Struct: A single struct whose fields are strongly typed
  4. The Alternate Union: List of (key, value) pairs with value stored in one of n nullable strongly typed lanes
  5. Separate Type Lists: One list of (key, value) pairs per value type

To evaluate these designs, I used some examples from mzML spectra, as I did not have a high-density example set of peptide ID
CV parameters handy. The produced Parquet files are found here: examples.zip

Design #1

import duckdb
import pyarrow as pa

# Design #1: the value lane is always a string, regardless of the term's true type.
cv_str_val = pa.struct([
    pa.field("name", pa.string(), True),
    pa.field("curie", pa.string(), True),
    pa.field("unit_name", pa.string(), True),
    pa.field("unit_curie", pa.string(), True),
    pa.field("value", pa.string(), True)
])

...

duckdb.sql(r"""
SELECT list_filter(cv_values, x -> x.name == 'base peak m/z')[1].value::double as base_peak from 'as_str.parquet';
""")

This makes the value a homogeneous type, but the value must be parsed from a string every time it is used.
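
For context, a minimal sketch of how a table with this layout might be assembled and written, reusing the cv_str_val struct defined above and the cv_values column name from the query; the example row and CURIEs are illustrative:

import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical assembly: a single list-of-struct column holds all CV
# parameters for a row, with every value rendered as a string.
schema = pa.schema([pa.field("cv_values", pa.list_(cv_str_val))])
table = pa.table({
    "cv_values": [[
        {"name": "base peak m/z", "curie": "MS:1000504",
         "unit_name": "m/z", "unit_curie": "MS:1000040",
         "value": "810.415"},
    ]]
}, schema=schema)
pq.write_table(table, "as_str.parquet")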

Design #3

No general-purpose schema code is viable here; the schema depends entirely on the source data (an illustrative sketch follows the query below).

duckdb.sql(r"""
SELECT "base peak m/z" FROM 'exact_props_struct.parquet';
""")

This is the most compact format.
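
For illustration, a per-file schema for data like this might look as follows; the field set is just whatever CV terms occur in that particular file (the terms below are examples), which is exactly why no general-purpose code exists:

import pyarrow as pa

# Hypothetical Design #3 schema: each CV term observed in this specific file
# becomes its own strongly typed, nullable column.
exact_props_struct = pa.schema([
    pa.field("ms level", pa.int64(), True),
    pa.field("base peak m/z", pa.float64(), True),
    pa.field("base peak intensity", pa.float64(), True),
    pa.field("total ion current", pa.float64(), True),
])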

Design #4

# Design #4: one nullable, strongly typed "lane" per value type; for any
# given parameter, exactly one lane is non-null.
cv_any_val = pa.struct([
    pa.field("name", pa.string(), True),
    pa.field("curie", pa.string(), True),
    pa.field("unit_name", pa.string(), True),
    pa.field("unit_curie", pa.string(), True),
    pa.field("float_value", pa.float64(), True),
    pa.field("str_value", pa.string(), True),
    pa.field("int_value", pa.int64(), True),
])

...

duckdb.sql(r"""
SELECT list_filter(cv_values, x -> x.name == 'base peak m/z')[1].float_value as base_peak from 'mixed.parquet';
""")

Design #5

# Design #5: identification fields shared by every typed parameter struct...
cv_ident = [
    pa.field("name", pa.string(), True),
    pa.field("curie", pa.string(), True),
    pa.field("unit_name", pa.string(), True),
    pa.field("unit_curie", pa.string(), True),
]

# ...plus one struct per value type, each stored in its own list column.
cv_str = pa.struct(cv_ident + [pa.field("value", pa.string(), True)])
cv_int = pa.struct(cv_ident + [pa.field("value", pa.int64(), True)])
cv_float = pa.struct(cv_ident + [pa.field("value", pa.float64(), True)])

...

duckdb.sql(r"""
SELECT list_filter(cv_floats, x -> x.name == 'base peak m/z')[1].value as base_peak FROM 'props.parquet';
""")

The associated Parquet files for all of these cases are attached.

Discussion

Sizes:

  • Design #1: 161.1 KB
  • Design #3: 102.4 KB
  • Design #4: 111.8 KB
  • Design #5: 111.8 KB

Design #1 is the largest file (161.1 KB), and at first glance it should also be the most expensive to query at scale,
since every use of a value requires parsing it from a string.

Design #3 is the smallest file (102.4 KB), but it doesn't permit us to provide units or CURIEs. It will also
throw schema errors when a parameter is missing, instead of returning NULL (illustrated below). It's not possible to repeat
a parameter a variable number of times either, but that use-case should be fairly rare.
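
For example, asking such a file for a term it never recorded fails at bind time rather than yielding NULL ("collision energy" here is a hypothetical absent column):

duckdb.sql(r"""
SELECT "collision energy" FROM 'exact_props_struct.parquet';
""")
# Raises a binder error because the column does not exist,
# instead of returning NULL.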

Designs #4 and #5 are comparable in size, weighing in at ~111.8 KB each.

Query efficiency

Supposing you know exactly what you want, and the type it will be, then the cost per row to find it will be

  • Design #1: O(N)
  • Design #3: O(1)
  • Design #4: O(N)
  • Design #5: O(N_type)

where N is the number of parameters and N_type is the number of parameters of that type, so N_type ≤ N is expected.
This means that #5 is slightly more efficient than #4.

If the stored type isn't known but you want to use it as a specific type, then things get more complicated.

For #4:

SELECT COALESCE(base_peak.int_value::double, base_peak.float_value) AS base_peak FROM
    (
        SELECT list_filter(cv_values, x -> x.name == 'base peak m/z')[1] AS base_peak FROM 'mixed.parquet'
    );

For #5:

SELECT COALESCE(
        list_filter(cv_ints, x -> x.name == 'base peak m/z')[1].value::double,
        list_filter(cv_floats, x -> x.name == 'base peak m/z')[1].value
    ) AS base_peak FROM 'props.parquet';

Trade-off discussion

Using #3's per-file struct schema is an appealing maximum-efficiency approach, but it means one cannot reasonably assume
that a single consistent schema within the quantms.io spec holds for all files in a collection. This may effectively prevent
query engines from accepting such collections, though that is an implementation detail of each engine. My tests suggest that DuckDB
rejects this without special configuration (passing union_by_name = true to the read_parquet function), and even then it still
requires consistent types among shared column names.
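
Concretely, reading a heterogeneous collection under Design #3 would need something like the following (file names are placeholders); columns absent from a given file come back as NULL, but shared names must still agree on type:

duckdb.sql(r"""
SELECT "base peak m/z"
FROM read_parquet(['run1.parquet', 'run2.parquet'], union_by_name = true);
""")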

Using #4 or #5 achieves a compromise between disk space and type-safety. #5's separate-list representation benefits from consuming less
memory during the marshalling process (about 17% less than #4). However, #5 is more complicated to use, especially when the
caller does not know which type the creator assigned to a parameter. Robust readers would essentially need a fallback
"search all the lists" code path (sketched below) in case a parameter is mis-represented as another type, instead of
immediately assuming it is absent when it is not found in the expected type list.
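
A sketch of what that fallback might look like for #5, checking every typed list and parsing the string lane as a last resort (the cv_strs column name is assumed by analogy with cv_ints and cv_floats):

duckdb.sql(r"""
SELECT COALESCE(
        list_filter(cv_floats, x -> x.name == 'base peak m/z')[1].value,
        list_filter(cv_ints, x -> x.name == 'base peak m/z')[1].value::double,
        TRY_CAST(list_filter(cv_strs, x -> x.name == 'base peak m/z')[1].value AS DOUBLE)
    ) AS base_peak FROM 'props.parquet';
""")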

As optimality is both a matter of computing/resource efficiency and ease-of-use, I think #4 strikes the best balance. It
maintains the originally desired flexibility and makes the ambiguous cases easier to resolve, while not sacrificing too much
resource efficiency.
