Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add missing numeric precisions in Hive type system #501

Merged
merged 14 commits into from
Jul 25, 2024

Conversation

KevinGe00
Copy link
Contributor

@KevinGe00 KevinGe00 commented Mar 29, 2024

What changes are proposed in this pull request, and why are they necessary?

Previously, unions between a bigint column and int column would result in an int column. Which is incorrect since bigint is a superset of int, i.e bigint is less restrictive than int, and not the other way around. This bug was caused by missing numeric precision valeus in HiveTypeSystem which is used in calcite to determine the least restrictive type between two columns.

There a few changes in this changes, including some subsequent fixes that are needed after adding the precisions in:

  1. Add precision values in HiveTypeSystem for all integer types: tinyint, smallint, int, bigint
  • tinyint and smallint are needed as now that we have concrete precisions for int and bigint, we would mislead Calcite that int and bigint could have a lower precisions than tinyint and smallint if they also don't have concrete precisions to be compared to.
  1. After adding in the precisions, a bug was revealed where we were incorrectly converting types to their respective specs in FromUtcTimestampOperatorTransformer. So replace that method with Calcite's already existing util function to properly convert types to specs for us.

How was this patch tested?

  1. clean build
  2. new UT
  3. expected regressions, example:

Snippet of Input Hive query:

CASE WHEN `table`.`a` = CAST(1 AS BIGINT)

where column a is of type int

Previous Trino Translation:

CASE WHEN "table"."a" = CAST(CAST(1 AS BIGINT) AS INTEGER)

Since Calcite couldn't correctly figure out the least restrictive type because of missing precisions, it would cast the bigint type as int instead of the other way around.

New Trino Translation (regresssion):

CASE WHEN CAST("table"."a" AS BIGINT) = CAST(1 AS BIGINT)

Casting happens on the less restrictive type (int)

@wmoustafa
Copy link
Contributor

Should the precision be patched in Coral or Calcite? Should it go in LHS or IR? Sounds Calcite + IR is the correct answer based on recent discussions?

@KevinGe00
Copy link
Contributor Author

KevinGe00 commented Apr 9, 2024

Should the precision be patched in Coral or Calcite?

There is no error in Calcite code/precision. What's happening is that Calcite uses whatever implementation of the interface RelDataTypeSystem, in the case of Coral using Calcite, we pass in HiveTypeSystem which implements RelDataTypeSystem. Therefore we need to patch in Coral.

Should it go in LHS or IR?

HiveTypeSystem is used on the LHS in HiveToRels SqlToRelConverter so patching HiveTypeSystem would automatically fix it in LHS.

Sounds Calcite + IR is the correct answer based on recent discussions?

When you say IR do you mean that we need to align on a unified Coral type system whether that's using HiveTypeSystem or maybe Calcite's default type system? And once we align on a type system we use it in CoralSqlNode1 -> CoralRelNode?

I may be misunderstanding what you mean by IR here but an IR type system doesn't make sense to me. We need type systems to encapsulate differences concerning type limits and behaviors. For example, in the default SQL system, a DECIMAL can have maximum precision 19, but Hive overrides to 38. We would need this information when converting a Hive input query.

@wmoustafa
Copy link
Contributor

Ok let me clarify. For Coral vs Calcite: I was under the impression that Hive INT and LONG will be converted to Calcite INT and LONG and then Calcite is responsible to internally figure out each's precision. Looking at the class contract, it seems that it does expect us to fill in the precession as part of the contract, and not simply return INT or LONG and that is it. So now it makes sense that we are obligated to fill in this value as part of the contract.

For the second part of the question, what I meant by an IR type system is the "Coral Type system" that all other systems get mapped to (e..g, from LHS). Upon further look, it sounds that this is indeed a Coral type system and not a Hive-specific type system, and the classname is just misleading due to historical reasons. Why is it a Coral type system: because it is in the common module. I now recall I have moved this from the hive module to the common module because it is independent of the LHS. So we are good there. Can you add an item to the backlog to rename this class to CoralTypeSystem?

@KevinGe00 KevinGe00 changed the title [WIP] [Coral-Hive] Port Hive code fix for missing type systen precisions [WIP] [Coral-Hive] Add missing Hive type system precisions for BIGINT and INTEGER Apr 10, 2024
@KevinGe00 KevinGe00 changed the title [WIP] [Coral-Hive] Add missing Hive type system precisions for BIGINT and INTEGER [WIP] Add missing numeric precisions in Hive type system Apr 11, 2024
@KevinGe00 KevinGe00 changed the title [WIP] Add missing numeric precisions in Hive type system Add missing numeric precisions in Hive type system Apr 11, 2024
Copy link
Contributor

@aastha25 aastha25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @KevinGe00!
has this patch been tested against the li-trino-hotfix branch? could you add that in the PR description?

Comment on lines +32 to +35
private static final int DEFAULT_TINYINT_PRECISION = 3;
private static final int DEFAULT_SMALLINT_PRECISION = 5;
private static final int DEFAULT_INTEGER_PRECISION = 10;
private static final int DEFAULT_BIGINT_PRECISION = 19;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you link the reference for these values?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the javadoc to specify where the values are coming from, this entire class was copied from Hive source code but has now deviated. We have a backlog item mentioned above to eventually rename this to CoralTypeSystem.

@@ -622,6 +622,15 @@ public void testNameSakeColumnNamesShouldGetUniqueIdentifiers() {
assertEquals(generated, expected);
}

@Test
public void testUnionIntAndBigInt() {
final String sql = "SELECT a FROM test.tableOne UNION ALL SELECT z FROM test.tableTwo";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(1) can we add a couple more assertions in the unit test -
(int union all bigint), (bigint union all int), (tinyint union all bigint) etc

(2) could you add another UT (here & in HiveToTrino tests) for CASE WHEN table.a = CAST(1 AS BIGINT) type queries as mentioned in the PR description

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(1) The reason I originally didn't add more tests is because Calcite already has tests using the same precision values. But definitely agree more tests in Coral don't hurt. I added tests comparing INT to the other 3 integer types since those are the most common comparisons.

(2) Added.

return SqlStdOperatorTable.CAST.createCall(ZERO, operand, getSqlDataTypeSpecForCasting(relDataType));
}

private SqlDataTypeSpec getSqlDataTypeSpecForCasting(RelDataType relDataType) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you add an example of Coral translation for before & after this code change? Since there's no difference in unit tests, does that mean this is a bug or an unnecessary transformation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you add an example of Coral translation for before & after this code change

After adding in the precisions, the first case in testSubstrWithTimestampOperator in HiveToTrinoConverterTest started to fail because we were creating a faulty type spec that fails this assertion in the Calcite's derive type function for SqlBasicTypeNameSpec. Therefore getSqlDataTypeSpecForCasting is a bug, if you look at SqlTypeUtil.convertTypeToSpec there are more involved steps that involve using precision/scale values.

@KevinGe00
Copy link
Contributor Author

@aastha25

has this patch been tested against the li-trino-hotfix branch?

I did run tests againt the li-trino-hotfix branch but as expected there were regressions outside the scope of the PR since I branched this PR from master.


@Test
public void testIntCastToBigIntDuringComparison() {
// We're testing that a comparison between INT and BIGINT sees a cast on the more restrictive type to the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this required? What happens if you query Trino directly with a comparison operation without the CAST?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not explicitly required (tested in trino to confirm), however, Calcite (during validation) will implicitly type coerce comparisons between 2 types by casting the more restrictive type to the more restrictive type. This PR just makes sure we give Calcite the right precision values to figure out correctly which type is more/less restrictive.

@KevinGe00 KevinGe00 merged commit d1d5b1e into linkedin:master Jul 25, 2024
1 check passed
KevinGe00 added a commit to KevinGe00/coral that referenced this pull request Jul 31, 2024
* add precision for int and bigint for hive type system

* use static final variables for precisions

* spotless check

* fix converting type to typespec bug

* add UT

* add tinyint and smallint precisions

* spotless

* spotless

* spotless

* new UTs

* add tableInt to rel to trino tests

* spotless

* remove redundant test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants