remove update_iceberg_ts from iceberg table when using merge incremental strategy #453 #454

Open
wants to merge 5 commits into base: main

Conversation

@aiss93 aiss93 commented Sep 30, 2024

resolves #452

Description

In the current implementation, when the merge incremental strategy is used with the Iceberg file format, the adapter automatically adds an update_iceberg_ts column that the merge itself never uses. The extra column causes problems in some scenarios, for example by breaking schema comparisons between tables, so hard-coding it is unnecessarily restrictive.

This PR removes the hard-coded update_iceberg_ts column from the adapter. If users require this column for specific use cases, they can now add it directly within their model configuration.

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt-glue next" section.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@moomindani
Collaborator

Hi, thanks for your contribution.

Since this is a breaking change, upgrading dbt-glue would affect the workloads of users who already rely on the update_iceberg_ts column.
Is it possible to find a backward-compatible way? I usually prefer the safer approach of adding a config option to remove that column.

@aiss93
Author

aiss93 commented Oct 3, 2024

Hi @moomindani,
I added a configuration parameter to control this behavior. By default, it will add the update_iceberg_ts column to ensure compatibility with existing workloads and avoid any disruptions.
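
As a rough sketch of how a model might opt out of the column: the parameter name add_iceberg_timestamp below is only a placeholder for illustration, and the model, source, and column names are hypothetical. The actual name and default are whatever the PR diff defines.

```sql
{# Hypothetical model file: models/my_iceberg_model.sql #}
{# add_iceberg_timestamp is a placeholder name for the new parameter. #}
{# Setting it to false is assumed to skip the update_iceberg_ts column; #}
{# leaving it unset is assumed to keep the existing behavior. #}
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    file_format='iceberg',
    unique_key='id',
    add_iceberg_timestamp=false
) }}

select
    id,
    name,
    updated_at
from {{ ref('my_source_model') }}
{% if is_incremental() %}
-- On incremental runs, only pick up rows newer than what the table already holds.
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```

Existing models that do not set the parameter should be unaffected, since the default keeps the column.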

Additionally, I discovered some hard-coded Spark configurations, such as:
{% call statement() %} set spark.sql.autoBroadcastJoinThreshold=-1 {% endcall %}. This setting disables broadcast joins by default, which can hurt performance in some cases.

I believe this should be removed, as such Spark configurations should be left for the user to set according to their specific use case and environment.
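
For illustration, a user who still wants broadcast joins disabled for a particular model could probably set it themselves with a standard dbt pre_hook. This is only a sketch: it assumes the Glue session accepts a SET statement issued through a hook the same way it does through the adapter's own statement call, and the model and source names are hypothetical.

```sql
{# Sketch only: applies the setting for this one model instead of adapter-wide. #}
{# pre_hook is standard dbt model config; whether the SET statement takes effect #}
{# for the subsequent merge in a Glue session is an assumption to verify. #}
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    file_format='iceberg',
    unique_key='id',
    pre_hook="set spark.sql.autoBroadcastJoinThreshold=-1"
) }}

select * from {{ ref('my_source_model') }}
```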

@moomindani
Collaborator

Yes, disabling broadcast joins with spark.sql.autoBroadcastJoinThreshold=-1 can introduce a performance bottleneck. It would be better to override the hard-coded value when spark.sql.autoBroadcastJoinThreshold is set via the --conf parameter, or to remove that setting for future Glue versions. But again, we need to be careful not to impact existing workloads, and this should be discussed in a separate PR/issue.

@moomindani
Collaborator

Can we add a test case for this PR?

Development

Successfully merging this pull request may close these issues.

Remove update_iceberg_ts column from model with merge incremental strategy