How to represent controlled vocabulary parameters, round 2 #85

Open
mobiusklein opened this issue Dec 22, 2024 · 0 comments

How to represent controlled vocabulary terms

To avoid thread necromancy, this new issue picks up where #79 left off.

Possible designs

  1. Original Design: List of (key, value) pairs with values homogenized as strings
  2. Non-viable: List of (key, value) pairs with value stored as a union
  3. Per-File Struct: A single struct whose fields are strongly typed
  4. The Alternate Union: List of (key, value) pairs with value stored in one of n nullable strongly typed lanes
  5. Separate Type Lists: One list of (key, value) pairs per value type

To evaluate these designs, I used some examples from mzML spectra, as I did not have a high-density example set of peptide ID
CV parameters handy. The produced Parquet files are found here: examples.zip

Design #1

import duckdb
import pyarrow as pa

# Design #1: the value lane is always a string, regardless of the term's true type.
cv_str_val = pa.struct([
    pa.field("name", pa.string(), True),
    pa.field("curie", pa.string(), True),
    pa.field("unit_name", pa.string(), True),
    pa.field("unit_curie", pa.string(), True),
    pa.field("value", pa.string(), True)
])

...

duckdb.sql(r"""
SELECT list_filter(cv_values, x -> x.name == 'base peak m/z')[1].value::double as base_peak from 'as_str.parquet';
""")

This makes the value a homogeneous type, but the value must be parsed from a string every time it is used.
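
For context, a minimal sketch of how a table with this layout might be assembled and written, reusing the cv_str_val struct defined above and the cv_values column name from the query; the example row and CURIEs are illustrative:

import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical assembly: a single list-of-struct column holds all CV
# parameters for a row, with every value rendered as a string.
schema = pa.schema([pa.field("cv_values", pa.list_(cv_str_val))])
table = pa.table({
    "cv_values": [[
        {"name": "base peak m/z", "curie": "MS:1000504",
         "unit_name": "m/z", "unit_curie": "MS:1000040",
         "value": "810.415"},
    ]]
}, schema=schema)
pq.write_table(table, "as_str.parquet")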

Design #3

No general-purpose schema code is viable here; the schema depends entirely on the source data (an illustrative sketch follows the query below).

duckdb.sql(r"""
SELECT "base peak m/z" FROM 'exact_props_struct.parquet';
""")

This is the most compact format.
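
For illustration, a per-file schema for data like this might look as follows; the field set is just whatever CV terms occur in that particular file (the terms below are examples), which is exactly why no general-purpose code exists:

import pyarrow as pa

# Hypothetical Design #3 schema: each CV term observed in this specific file
# becomes its own strongly typed, nullable column.
exact_props_struct = pa.schema([
    pa.field("ms level", pa.int64(), True),
    pa.field("base peak m/z", pa.float64(), True),
    pa.field("base peak intensity", pa.float64(), True),
    pa.field("total ion current", pa.float64(), True),
])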

Design #4

# Design #4: one nullable, strongly typed "lane" per value type; for any
# given parameter, exactly one lane is non-null.
cv_any_val = pa.struct([
    pa.field("name", pa.string(), True),
    pa.field("curie", pa.string(), True),
    pa.field("unit_name", pa.string(), True),
    pa.field("unit_curie", pa.string(), True),
    pa.field("float_value", pa.float64(), True),
    pa.field("str_value", pa.string(), True),
    pa.field("int_value", pa.int64(), True),
])

...

duckdb.sql(r"""
SELECT list_filter(cv_values, x -> x.name == 'base peak m/z')[1].float_value as base_peak from 'mixed.parquet';
""")

Design #5

# Design #5: identification fields shared by every typed parameter struct...
cv_ident = [
    pa.field("name", pa.string(), True),
    pa.field("curie", pa.string(), True),
    pa.field("unit_name", pa.string(), True),
    pa.field("unit_curie", pa.string(), True),
]

# ...plus one struct per value type, each stored in its own list column.
cv_str = pa.struct(cv_ident + [pa.field("value", pa.string(), True)])
cv_int = pa.struct(cv_ident + [pa.field("value", pa.int64(), True)])
cv_float = pa.struct(cv_ident + [pa.field("value", pa.float64(), True)])

...

duckdb.sql(r"""
SELECT list_filter(cv_floats, x -> x.name == 'base peak m/z')[1].value as base_peak FROM 'props.parquet';
""")

The associated Parquet files for all of these cases are attached.

Discussion

Sizes:

  • Design #1: 161.1 KB
  • Design #3: 102.4 KB
  • Design #4: 111.8 KB
  • Design #5: 111.8 KB

Design #1 is the largest file (161.1 KB), and at first glance it should also be the most expensive to query at scale,
since every use of a value requires parsing it from a string.

Design #3 is the smallest file (102.4 KB), but it doesn't permit us to provide units or CURIEs. It will also
throw schema errors when a parameter is missing, instead of returning NULL (illustrated below). It's not possible to repeat
a parameter a variable number of times either, but that use-case should be fairly rare.
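
For example, asking such a file for a term it never recorded fails at bind time rather than yielding NULL ("collision energy" here is a hypothetical absent column):

duckdb.sql(r"""
SELECT "collision energy" FROM 'exact_props_struct.parquet';
""")
# Raises a binder error because the column does not exist,
# instead of returning NULL.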

Designs #4 and #5 are comparable in size, weighing in at ~111.8 KB each.

Query efficiency

Supposing you know exactly what you want, and the type it will be, then the cost per row to find it will be

  • Design #1: O(N)
  • Design #3: O(1)
  • Design #4: O(N)
  • Design #5: O(N_type)

where N is the number of parameters and N_type is the number of parameters of that type, so N_type ≤ N is expected.
This means that #5 is slightly more efficient than #4.

If the stored type isn't known but you want to use it as a specific type, then things get more complicated.

For #4:

SELECT COALESCE(base_peak.int_value::double, base_peak.float_value) AS base_peak FROM
    (
        SELECT list_filter(cv_values, x -> x.name == 'base peak m/z')[1] AS base_peak FROM 'mixed.parquet'
    );

For #5:

SELECT COALESCE(
        list_filter(cv_ints, x -> x.name == 'base peak m/z')[1].value::double,
        list_filter(cv_floats, x -> x.name == 'base peak m/z')[1].value
    ) AS base_peak FROM 'props.parquet';

Trade-off discussion

Using #3's per-file struct schema is an appealing maximum-efficiency approach, but it means one cannot reasonably assume
that a single consistent schema within the quantms.io spec holds for all files in a collection. This may effectively prevent
query engines from accepting such collections, though that is an implementation detail of each engine. My tests suggest that DuckDB
rejects this without special configuration (passing union_by_name = true to the read_parquet function), and even then it still
requires consistent types among shared column names.
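
Concretely, reading a heterogeneous collection under Design #3 would need something like the following (file names are placeholders); columns absent from a given file come back as NULL, but shared names must still agree on type:

duckdb.sql(r"""
SELECT "base peak m/z"
FROM read_parquet(['run1.parquet', 'run2.parquet'], union_by_name = true);
""")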

Using #4 or #5 achieves a compromise between disk space and type-safety. #5's separate-list representation benefits from consuming less
memory during the marshalling process (about 17% less than #4). However, #5 is more complicated to use, especially when the
caller does not know which type the creator assigned to a parameter. Robust readers would essentially need a fallback
"search all the lists" code path (sketched below) in case a parameter is mis-represented as another type, instead of
immediately assuming it is absent when it is not found in the expected type list.
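
A sketch of what that fallback might look like for #5, checking every typed list and parsing the string lane as a last resort (the cv_strs column name is assumed by analogy with cv_ints and cv_floats):

duckdb.sql(r"""
SELECT COALESCE(
        list_filter(cv_floats, x -> x.name == 'base peak m/z')[1].value,
        list_filter(cv_ints, x -> x.name == 'base peak m/z')[1].value::double,
        TRY_CAST(list_filter(cv_strs, x -> x.name == 'base peak m/z')[1].value AS DOUBLE)
    ) AS base_peak FROM 'props.parquet';
""")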

As optimality is both a matter of computing/resource efficiency and ease-of-use, I think #4 strikes the best balance. It
maintains the originally desired flexibility and makes the ambiguous cases easier to resolve, while not sacrificing too much
resource efficiency.
