Remove default value checks #852

Open
wants to merge 6 commits into
base: main

Conversation

ilongin
Contributor

@ilongin ilongin commented Jan 23, 2025

Companion PR to https://github.com/iterative/studio/pull/11236
Now that ClickHouse (CH) accepts nullable columns, the tests no longer need special checks for default values, because empty values will always be None.

@ilongin ilongin marked this pull request as draft January 23, 2025 00:09

codecov bot commented Jan 23, 2025

Codecov Report

Attention: Patch coverage is 66.66667% with 1 line in your changes missing coverage. Please review.

Project coverage is 87.71%. Comparing base (b275928) to head (a5248d0).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/datachain/data_storage/db_engine.py | 0.00% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #852      +/-   ##
==========================================
- Coverage   87.72%   87.71%   -0.02%     
==========================================
  Files         129      129              
  Lines       11475    11476       +1     
  Branches     1545     1545              
==========================================
- Hits        10067    10066       -1     
- Misses       1020     1022       +2     
  Partials      388      388              

| Flag | Coverage Δ |
|---|---|
| datachain | 87.63% <66.66%> (-0.02%) ⬇️ |



cloudflare-workers-and-pages bot commented Jan 24, 2025

Deploying datachain-documentation with Cloudflare Pages

Latest commit: a5248d0
Status: ✅  Deploy successful!
Preview URL: https://5a2ee850.datachain-documentation.pages.dev
Branch Preview URL: https://ilongin-11161-clickhouse-nul.datachain-documentation.pages.dev


@@ -276,8 +276,9 @@ def has_table(self, name: str) -> bool:
)
return bool(next(self.execute(query))[0])

- def create_table(self, table: "Table", if_not_exists: bool = True) -> None:
+ def create_table(self, table: "Table", if_not_exists: bool = True) -> "Table":
Contributor Author

Changing this because it simplifies the Studio code that modifies table columns to make them nullable.

@ilongin ilongin marked this pull request as ready for review January 27, 2025 10:09
Contributor

@dreadatour dreadatour left a comment

👍

@skshetry
Member

Can we introduce something like a strict mode in the schema, similar to pydantic, to support non-nullable types?

https://docs.pydantic.dev/latest/concepts/strict_mode/

@ilongin
Contributor Author

ilongin commented Feb 4, 2025

> Can we introduce something like strict mode in schema, similar to pydantic to support non-nullable types?
>
> https://docs.pydantic.dev/latest/concepts/strict_mode/

The issue here is not really a NULL vs. NOT NULL constraint. It is about making sure we don't rely on ClickHouse defaults when we insert an empty value (None).
Regarding nulls, ClickHouse is syntactically compatible with other relational databases, but not semantically. If we define a NOT NULL constraint on a column (the default behavior before this PR), it will not reject data that has an empty value; instead it stores the default value for that column type, e.g. int -> 0. This is because CH is built for performance and analytical use cases, and checking every single value against a constraint would be slow.
Because of this, we currently have 2 different behaviors in SaaS vs local:

  1. In SQLite we allow nulls everywhere, which behaves as expected
  2. In ClickHouse we don't allow nulls anywhere, but CH stores default values instead of rejecting the data
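
The difference between the two behaviors can be illustrated with a small sketch. The SQLite part uses the standard `sqlite3` module and shows real behavior: a nullable column returns `None` unchanged. The ClickHouse part is only a toy model in plain Python, not a call into ClickHouse; `TYPE_DEFAULTS` and `store_with_defaults` are hypothetical names introduced here to mimic the "replace empty value with the type default" behavior described above.

```python
import sqlite3

# Hypothetical per-type defaults, used only to model the substitution
# described above (e.g. int -> 0). This is not ClickHouse code.
TYPE_DEFAULTS = {"int": 0, "float": 0.0, "str": ""}


def store_with_defaults(value, column_type):
    """Toy model of a non-nullable column: None becomes the type default."""
    return TYPE_DEFAULTS[column_type] if value is None else value


def sqlite_roundtrip(value):
    """Insert a value into a nullable SQLite column and read it back."""
    conn = sqlite3.connect(":memory:")
    try:
        # Columns are nullable unless declared NOT NULL.
        conn.execute("CREATE TABLE t (num INTEGER)")
        conn.execute("INSERT INTO t (num) VALUES (?)", (value,))
        return conn.execute("SELECT num FROM t").fetchone()[0]
    finally:
        conn.close()


print(sqlite_roundtrip(None))            # None: the empty value is preserved
print(store_with_defaults(None, "int"))  # 0: the empty value is lost
```

In the toy model, a stored `0` can no longer be told apart from a missing value, which is why tests previously needed backend-specific checks for default values.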

The simplest solution is to make all columns nullable in CH, which is what this PR (and its Studio companion) does, since those default values are neither correct nor expected from the user's perspective. The only drawback is the performance implication: according to what people report, it could make queries up to twice as slow.

Your suggestion goes one step further: adding a new feature to the system for setting NULL constraints on user model fields. This is easy to do in SQLite, but in ClickHouse we would need to add special constraints, or maybe use this setting, although it seems to be global and we need it for specific columns.

Regarding joins, which were also mentioned multiple times in our calls: we have the global setting join_use_nulls set to 1, which means CH converts all columns that get an empty value in a join (e.g. after an outer join) to nullable. This is the same thing the Studio companion PR does for ALL columns used in CH.
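
As a rough illustration of the join behavior that join_use_nulls = 1 aims for, here is a sketch using SQLite from the standard library, not ClickHouse. The `files` and `sizes` tables and the `left_join_rows` helper are made up for this example. An outer join leaves the unmatched cell as NULL (`None` in Python) rather than a type default such as 0.

```python
import sqlite3


def left_join_rows():
    """Run a LEFT JOIN where one left-side row has no match on the right."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(
            """
            CREATE TABLE files (id INTEGER, name TEXT);
            CREATE TABLE sizes (file_id INTEGER, size INTEGER);
            INSERT INTO files VALUES (1, 'a.txt'), (2, 'b.txt');
            INSERT INTO sizes VALUES (1, 100);
            """
        )
        return conn.execute(
            "SELECT f.name, s.size FROM files f "
            "LEFT JOIN sizes s ON s.file_id = f.id ORDER BY f.id"
        ).fetchall()
    finally:
        conn.close()


rows = left_join_rows()
print(rows)  # [('a.txt', 100), ('b.txt', None)]
```

With join_use_nulls set to 0, ClickHouse would instead fill the unmatched cell with the default value of the column type, which is the same ambiguity the nullable-columns change removes for regular inserts.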

@ilongin ilongin closed this Feb 4, 2025
@ilongin ilongin reopened this Feb 4, 2025
4 participants