[CALCITE-4918] Add a VARIANT data type #3947

mihaibudiu · 2024-09-04T06:39:43Z

This PR introduces a VARIANT SQL data type, based on the design in Snowflake: https://docs.snowflake.com/en/sql-reference/data-types-semistructured. VARIANT is a much better base to build support for JSON than the existing functions in Calcite. All the existing Calcite JSON functions assume that JSON is stored as an unparsed string. VARIANT can represent any JSON document in a "binary" form (and much more than JSON!).

The PR is divided into two commits:

the parser and validator support
the runtime support

Turns out that the validator support is almost trivial. Most of the work is in the runtime, where a new kind of value must be introduced, Variant, which carries runtime type information. For this purpose a new class hierarchy has been introduced, rooted at RuntimeTypeInformation.

Currently the runtime type information carries precision and scale, but I am not sure that these are actually necessary. I will study this a bit more and may amend this in a subsequent commit.

There are no functions supporting VARIANT at this point, but VARIANT values can still be created using casts, and accessed using variant.field, variant[index], and variant['key'].
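
To make the description above concrete, here is a minimal sketch of a value that carries a runtime type tag and supports index and key access. It is not the code in this PR: `VariantSketch`, `Kind`, and `item` are hypothetical names, and both the 1-based indexing and the "return null on a type mismatch" behavior are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a dynamically typed value; not Calcite's Variant class. */
final class VariantSketch {
  /** Coarse runtime type tag, standing in for the RuntimeTypeInformation hierarchy. */
  enum Kind { INTEGER, VARCHAR, ARRAY, MAP }

  final Kind kind;
  final Object value;

  private VariantSketch(Kind kind, Object value) {
    this.kind = kind;
    this.value = value;
  }

  static VariantSketch ofInt(int i) {
    return new VariantSketch(Kind.INTEGER, i);
  }

  static VariantSketch ofString(String s) {
    return new VariantSketch(Kind.VARCHAR, s);
  }

  static VariantSketch ofArray(List<VariantSketch> elements) {
    return new VariantSketch(Kind.ARRAY, elements);
  }

  static VariantSketch ofMap(Map<String, VariantSketch> entries) {
    return new VariantSketch(Kind.MAP, entries);
  }

  /** Models variant[index] with 1-based indexing (assumed); null on any mismatch. */
  @SuppressWarnings("unchecked")
  VariantSketch item(int index) {
    if (kind != Kind.ARRAY) {
      return null;
    }
    List<VariantSketch> list = (List<VariantSketch>) value;
    return index >= 1 && index <= list.size() ? list.get(index - 1) : null;
  }

  /** Models variant['key'] and variant.field; null if not a map or the key is absent. */
  @SuppressWarnings("unchecked")
  VariantSketch item(String key) {
    if (kind != Kind.MAP) {
      return null;
    }
    return ((Map<String, VariantSketch>) value).get(key);
  }

  public static void main(String[] args) {
    Map<String, VariantSketch> fields = new LinkedHashMap<>();
    fields.put("name", ofString("calcite"));
    fields.put("tags", ofArray(Arrays.asList(ofInt(10), ofInt(20))));
    VariantSketch v = ofMap(fields);

    System.out.println(v.item("tags").item(2).value); // second element of the array
    System.out.println(v.item("name").item(1));       // indexing a VARCHAR: mismatch
  }
}
```

Because the type tag travels with the value, each accessor can decide at run time whether the operation applies, which is the property the validator cannot provide for a VARIANT column.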

Subsequent pull requests are expected to add more functions, in particular functions to parse JSON into variants and to unparse variants back into JSON.

julianhyde · 2024-09-04T16:23:22Z

Can you move (or at least copy) that description to the Jira case? Jira cases are our feature specifications. Minimal features that I would like to see specified:

How VARIANT interacts with JDBC
Whether there is an 'instanceof' operator
Behavior in CAST (and implicit conversion)
How variants are converted to and from strings

julianhyde

Looks good.

One useful addition would be a variant.iq quidem test. With extensive comments, it could read like a tutorial, showing all the things you can do with variants. As a new language feature, specific to Calcite (albeit based on Snowflake) I think it needs better documentation than we usually provide.

julianhyde · 2024-09-05T00:38:22Z

core/src/main/java/org/apache/calcite/util/rtti/RuntimeTypeInformation.java

+import static java.util.Objects.requireNonNull;
+
+/**
+ * This class represents the type of a SQL expression at runtime.


Remove 'This class represents'

julianhyde · 2024-09-05T00:40:13Z

core/src/main/java/org/apache/calcite/util/rtti/RuntimeTypeInformation.java

+ * a dynamically-typed value, and needs this kind of information.
+ * We cannot use the very similar RelDataType type since it carries extra
+ * baggage, like the type system, which is not available at runtime. */
+public abstract class RuntimeTypeInformation {


I get the impression that RuntimeTypeInformation is connected with either the enumerable engine, or as part of the Java API (which is used to define Java UDFs). If it is either of those, I don't think it belongs in (a subpackage of) the util package.

julianhyde · 2024-09-05T00:40:29Z

core/src/main/java/org/apache/calcite/util/rtti/RuntimeTypeInformation.java

+  }
+
+  /**
+   * Create and return an expression that creates a runtime type that


s/Create/Creates/

... and generally, please follow the best practice of using third-person indicative for method javadoc.

julianhyde · 2024-09-05T00:44:33Z

core/src/main/java/org/apache/calcite/util/Variant.java

+/** This class is the runtime support for values of the VARIANT SQL type. */
+public class Variant {
+  /** Actual value. */
+  final @Nullable Object value;


Making value nullable probably complicates a lot of code. (If the experience with RexLiteral is anything to go by.) Could you have a subclass for null variants (or some other solution)?
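
One way to read this suggestion is the null-object pattern: a dedicated subclass represents the SQL NULL variant, so the value field of every other variant can be declared non-null. The sketch below is an illustration under that assumption; all class names are hypothetical and it does not reproduce the VariantNull class that appears later in this thread.

```java
import java.util.Objects;

/** Hypothetical base class: every variant, including SQL NULL, is a non-null Java object. */
abstract class VariantValueSketch {
  abstract boolean isNull();
}

/** Singleton for the null variant, so no field ever has to hold a Java null. */
final class NullVariantSketch extends VariantValueSketch {
  static final NullVariantSketch INSTANCE = new NullVariantSketch();

  private NullVariantSketch() {}

  @Override boolean isNull() {
    return true;
  }

  @Override public String toString() {
    return "NULL";
  }
}

/** Variant holding an actual value; the constructor rejects null. */
final class NonNullVariantSketch extends VariantValueSketch {
  final Object value;

  NonNullVariantSketch(Object value) {
    this.value = Objects.requireNonNull(value, "value");
  }

  @Override boolean isNull() {
    return false;
  }

  @Override public String toString() {
    return String.valueOf(value);
  }
}

final class NullVariantDemo {
  /** Factory that picks the subclass, keeping the null check in one place. */
  static VariantValueSketch wrap(Object o) {
    return o == null ? NullVariantSketch.INSTANCE : new NonNullVariantSketch(o);
  }

  public static void main(String[] args) {
    System.out.println(wrap(42));
    System.out.println(wrap(null).isNull());
  }
}
```

With this shape, code that handles non-null variants never needs a `@Nullable` annotation on the value, and the null case is handled by dispatch rather than by scattered checks.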

julianhyde · 2024-09-05T00:51:14Z

core/src/main/java/org/apache/calcite/util/Variant.java

+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.calcite.util;


The util package feels like the wrong place (for the reasons I described in rtti).

How about org.apache.calcite.runtime? The classes in that package are intended for use by generated code, rather than user code. But it has a similar goal to have few dependencies.

(DateString, TimeString, TimestampString should have been there too.)

mihaibudiu · 2024-09-06T03:20:17Z

@julianhyde I have hopefully implemented your suggestions. The structure is much better this way.
I have added TYPEOF and VARIANTNULL functions. TYPEOF is very useful for debugging.
The variant.iq file will grow as we add more functions.

suibianwanwank · 2024-09-06T10:18:13Z

Implicit casting is an important part of the VARIANT type. Will it be supported in this PR?

edit: I just noticed you mention this in Jira.

mihaibudiu · 2024-09-06T16:44:45Z

Yes, let's leave implicit casts for later.
I think we can add coercions fairly easily, but other implicit casts may require much more work due to the way Calcite handles them.

julianhyde · 2024-09-06T19:09:43Z

site/_docs/reference.md

-| Type     | Description                | Example type
-|:-------- |:---------------------------|:---------------
-| ANY      | The union of all types |
+| Type     | Description                | Example type |


Our markdown style is to not have a | after the last column. Can you remove it?

(IntelliJ does not respect that style, so its edits need to be reverted.)

Also revert the changes in spacing, which become spurious diffs.

julianhyde · 2024-09-06T19:10:53Z

@mihaibudiu Thanks for all your edits. Much improved.

+1 after you fix the cosmetic issues in reference.md.

mihaibudiu · 2024-09-06T19:59:14Z

I will leave this PR open for a few more days in case people want to comment. Otherwise I have tagged it with "merge soon".

caicancai

@mihaibudiu I reviewed the comments and left some minor comments. I think this PR can be merged.

testkit/src/main/java/org/apache/calcite/test/SqlOperatorTest.java

core/src/main/java/org/apache/calcite/runtime/variant/VariantNull.java

NobiGo

It's really a complicated PR. We may be able to add some optimizations for VARIANT in future PRs.
Some examples:
CAST(e AS VARIANT) IS NOT NULL can be simplified to e IS NOT NULL.
CAST(CAST(e AS VARIANT) AS INT) can be simplified to CAST(e AS INT).
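
The second rewrite can be sketched on a toy expression tree. This is only an illustration: `ExprSketch`, `ColumnSketch`, `CastSketch`, and `VariantCastSimplifier` are hypothetical classes, not Calcite's RexNode or simplifier API, and the sketch assumes the rewrite is valid for the types involved. A real rule would also have to check that the two forms are equivalent for the source type of e.

```java
/** Hypothetical miniature expression tree; not Calcite's RexNode. */
abstract class ExprSketch {}

/** Reference to a column or other leaf expression. */
final class ColumnSketch extends ExprSketch {
  final String name;

  ColumnSketch(String name) {
    this.name = name;
  }

  @Override public String toString() {
    return name;
  }
}

/** CAST(operand AS targetType), with the type kept as a plain string. */
final class CastSketch extends ExprSketch {
  final ExprSketch operand;
  final String targetType;

  CastSketch(ExprSketch operand, String targetType) {
    this.operand = operand;
    this.targetType = targetType;
  }

  @Override public String toString() {
    return "CAST(" + operand + " AS " + targetType + ")";
  }
}

final class VariantCastSimplifier {
  /** Rewrites CAST(CAST(e AS VARIANT) AS T) to CAST(e AS T) when T is not VARIANT. */
  static ExprSketch simplify(ExprSketch e) {
    if (!(e instanceof CastSketch)) {
      return e;
    }
    CastSketch outer = (CastSketch) e;
    ExprSketch operand = simplify(outer.operand);
    if (operand instanceof CastSketch) {
      CastSketch inner = (CastSketch) operand;
      boolean throughVariant = "VARIANT".equals(inner.targetType)
          && !"VARIANT".equals(outer.targetType);
      if (throughVariant) {
        // Skip the intermediate VARIANT and cast the original operand directly.
        return new CastSketch(inner.operand, outer.targetType);
      }
    }
    return new CastSketch(operand, outer.targetType);
  }

  public static void main(String[] args) {
    ExprSketch e = new CastSketch(new CastSketch(new ColumnSketch("e"), "VARIANT"), "INT");
    System.out.println(e + " -> " + simplify(e));
  }
}
```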

core/src/main/java/org/apache/calcite/adapter/enumerable/RexImpTable.java

core/src/main/java/org/apache/calcite/runtime/rtti/BasicSqlTypeRtti.java

core/src/main/java/org/apache/calcite/runtime/rtti/RuntimeTypeInformation.java

NobiGo · 2024-12-04T08:39:49Z

core/src/test/java/org/apache/calcite/test/SqlValidatorTest.java

+        .columnType("VARIANT NOT NULL");
+    expr("cast(TIMESTAMP '2024-09-01 00:00:00' as variant)")
+        .columnType("VARIANT NOT NULL");
+    expr("cast(ARRAY[1,2,3] AS VARIANT)")


If we convert cast(ARRAY[1,2,3, null] AS VARIANT), what's the columnType?

mihaibudiu · 2024-12-04

I plan to rework this PR a bit to bring it more in line with what Snowflake does.
In this case converting an ARRAY to a variant produces a VARIANT whose value is an array of variant. So the elements are all recursively converted to VARIANT. Same thing is true for MAP.

site/_docs/reference.md

julianhyde · 2024-12-04T19:55:29Z

Looks good. Thanks for adding variant.iq - that's the first place I would direct someone who wants to understand this feature.

mihaibudiu · 2024-12-04T19:59:14Z

We are actually using this feature extensively in our Rust-based runtime, but our runtime implementation is slightly different. I plan to make the Java implementation a bit closer.

This is a very powerful feature. Once you have variant you can do many things which are very difficult otherwise, including recursive data types (like JSON). One other thing we use VARIANT for is error reporting. You can have a dynamic error type (a VARIANT MAP) which can include any other type, including any SQL record inside!

julianhyde · 2024-12-04T20:07:42Z

I agree with your remark that variants are difficult to use without TYPEOF. We could also add something like this (apologies for the crappy syntax, hopefully you get the idea):

CASE v
WHENINSTANCEOF number THEN v + 5
WHENINSTANCEOF varchar THEN length(v)
ELSE 0
END

I believe Java has something like this and maybe some SQL dialects do too.
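
For comparison, the Java counterpart is presumably a chain of instanceof tests (recent Java versions also offer pattern matching in switch). The sketch below mirrors the CASE expression above using classic instanceof so that it compiles on older JDKs; `InstanceOfDispatchSketch` and `eval` are hypothetical names, and treating v as a plain `Object` is an assumption.

```java
final class InstanceOfDispatchSketch {
  /** Mirrors the CASE sketch: numbers get 5 added, strings yield their length, else 0. */
  static long eval(Object v) {
    if (v instanceof Number) {
      return ((Number) v).longValue() + 5;
    } else if (v instanceof String) {
      return ((String) v).length();
    }
    return 0;
  }

  public static void main(String[] args) {
    System.out.println(eval(10));      // numeric branch
    System.out.println(eval("hello")); // string branch
    System.out.println(eval(true));    // fallback branch
  }
}
```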

mihaibudiu · 2024-12-07T05:52:47Z

In the last commit I have reworked the runtime implementation of VARIANT.
(We have tested this design in a different backend for a while, and it works pretty well.)
In the previous implementation the runtime type for generic types like ARRAY or MAP would hold information about the element types. In the current implementation this is no longer true: when converting an array to a variant all elements are converted to VARIANT as well. Same for a MAP. Converting a ROW to a VARIANT generates a MAP indexed by the field names.
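
The recursive conversion described above can be sketched as follows. This is an illustration, not the code in the commit: `V`, `Row`, and `toVariant` are hypothetical, only a few scalar types are covered, map keys are left unconverted (an assumption), and null handling is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ToVariantSketch {
  /** Hypothetical variant: a coarse type name plus the wrapped value. */
  static final class V {
    final String type;
    final Object value;

    V(String type, Object value) {
      this.type = type;
      this.value = value;
    }
  }

  /** Hypothetical ROW: parallel lists of field names and field values. */
  static final class Row {
    final List<String> names;
    final List<?> values;

    Row(List<String> names, List<?> values) {
      this.names = names;
      this.values = values;
    }
  }

  /** Converts a value to a variant, converting nested elements recursively. */
  static V toVariant(Object o) {
    if (o instanceof V) {
      return (V) o; // already a variant
    }
    if (o instanceof Integer) {
      return new V("INTEGER", o);
    }
    if (o instanceof String) {
      return new V("VARCHAR", o);
    }
    if (o instanceof List) {
      List<V> elements = new ArrayList<>();
      for (Object e : (List<?>) o) {
        elements.add(toVariant(e));
      }
      return new V("ARRAY", elements); // the tag no longer records the element type
    }
    if (o instanceof Map) {
      Map<Object, V> entries = new LinkedHashMap<>();
      for (Map.Entry<?, ?> en : ((Map<?, ?>) o).entrySet()) {
        entries.put(en.getKey(), toVariant(en.getValue()));
      }
      return new V("MAP", entries);
    }
    if (o instanceof Row) {
      Row row = (Row) o;
      Map<String, V> fields = new LinkedHashMap<>();
      for (int i = 0; i < row.names.size(); i++) {
        fields.put(row.names.get(i), toVariant(row.values.get(i)));
      }
      return new V("MAP", fields); // a ROW becomes a MAP keyed by field name
    }
    throw new IllegalArgumentException("unsupported value: " + o);
  }

  public static void main(String[] args) {
    V v = toVariant(
        new Row(Arrays.asList("id", "tags"),
            Arrays.asList(1, Arrays.asList("a", "b"))));
    System.out.println(v.type);
  }
}
```

The trade-off mentioned in the comment shows up here: the conversion walks every element once up front, but afterwards each element can be accessed as a variant without consulting an element type stored on the container.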

For the user there isn't much of a difference, and, as you may notice, the tests have changed very little. The runtime cost is different, though. Neither scheme dominates the other; which one is preferable depends on the workload.

I believe that this is similar to what Snowflake does, although I could not find a precise description of their exact implementation. But I believe that the TYPEOF function in Snowflake applied to an ARRAY will return just "ARRAY" and not "INT ARRAY" - so it resembles more the current implementation.

VARIANT shines at handling JSON, so in future PRs we should add more JSON support. Unlike the existing Calcite JSON support, VARIANT represents the JSON natively, and does not need to convert back and forth to strings on every operation, saving many resources.

I will give potential readers a few more days to review this.

mihaibudiu · 2024-12-11T18:48:04Z

I will squash the commits in preparation for merging, please let me know if you have objections.

Signed-off-by: Mihai Budiu <mbudiu@feldera.com>

sonarqubecloud · 2024-12-16T23:00:10Z

Quality Gate passed

Issues
27 New issues
0 Accepted issues

Measures
0 Security Hotspots
50.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

mihaibudiu mentioned this pull request Sep 4, 2024

[SQL] JSON support feldera/feldera#2401

Closed

7 tasks

mihaibudiu force-pushed the variant branch 2 times, most recently from 5759a66 to 44e3d35 Compare September 4, 2024 18:34

julianhyde reviewed Sep 5, 2024

mihaibudiu force-pushed the variant branch from 44e3d35 to 1162156 Compare September 6, 2024 03:19

mihaibudiu force-pushed the variant branch from 1162156 to 3ed4604 Compare September 6, 2024 06:43

julianhyde reviewed Sep 6, 2024

mihaibudiu added the LGTM-will-merge-soon label Sep 6, 2024

mihaibudiu force-pushed the variant branch 2 times, most recently from 78bff8d to a35f642 Compare September 13, 2024 20:20

caicancai approved these changes Sep 14, 2024

caicancai reviewed Sep 22, 2024

mihaibudiu force-pushed the variant branch from a35f642 to 9eb0fe7 Compare October 2, 2024 23:50

mihaibudiu force-pushed the variant branch 2 times, most recently from af0ff63 to 5827923 Compare October 23, 2024 22:40

mihaibudiu force-pushed the main branch from dc8c4ff to 855ad83 Compare October 23, 2024 22:46

mihaibudiu force-pushed the variant branch 2 times, most recently from 2ee9768 to 9b88b5b Compare October 30, 2024 18:35

mihaibudiu force-pushed the variant branch from 9b88b5b to 5fc9186 Compare November 15, 2024 21:08

mihaibudiu force-pushed the variant branch from 5fc9186 to 6317e1c Compare December 4, 2024 00:24

NobiGo approved these changes Dec 4, 2024

mihaibudiu force-pushed the variant branch from 6317e1c to 77a2ac6 Compare December 7, 2024 05:46

mihaibudiu force-pushed the variant branch 2 times, most recently from 4f5bc02 to 77044f6 Compare December 7, 2024 18:35

mihaibudiu force-pushed the variant branch from 77044f6 to 69c3e8a Compare December 11, 2024 19:01

mihaibudiu added 3 commits December 16, 2024 14:31

[CALCITE-4918] Add a VARIANT data type - parser and validator

15e01d6

Signed-off-by: Mihai Budiu <mbudiu@feldera.com>

[CALCITE-4918] Add a VARIANT data type - runtime support

21aea94

Signed-off-by: Mihai Budiu <mbudiu@feldera.com>

Implement VARIANT functions TYPEOF, VARIANTNULL; add variant.iq

b9e09d2

Signed-off-by: Mihai Budiu <mbudiu@feldera.com>

mihaibudiu force-pushed the variant branch from 69c3e8a to b9e09d2 Compare December 16, 2024 22:36

mihaibudiu merged commit 87485a9 into apache:main Dec 17, 2024
20 checks passed

mihaibudiu deleted the variant branch December 17, 2024 05:49
