Feature: Implicitly generate null value column when using COPY #4305

Open
prrao87 opened this issue Sep 26, 2024 · 1 comment
Labels
feature: New features or missing components of existing features

Comments

@prrao87
Member

prrao87 commented Sep 26, 2024

API

Other

Description

I have this scenario where I have a CSV/Parquet file with just two columns:

product     price
Laptop      1100.0
Mouse       150.0
Headphones  250.0

In my DDL, I want to add a nullable column historical_sales as follows:

CREATE NODE TABLE IF NOT EXISTS Product(name STRING, price DOUBLE, historical_sales INT32, PRIMARY KEY (name));

I want to initialize the node table with every value in the historical_sales column set to null, and fill those values in via MERGE later, when they become available.

Issue

Because my input file has just 2 columns, and my DDL specifies 3 columns, I cannot use COPY Product FROM 'product.parquet' directly.

I instead have to do this:

COPY Product(name, price) FROM 'data/product.parquet'

Feature request

In the COPY pipeline, because we already know the number of columns in the input prior to importing the data, can we implicitly infer the column names so that we can use the much simpler DDL command below? This would reduce the mental burden on the user, since columns that are absent from the input file are expected to be filled with nulls anyway.

COPY Product FROM 'data/product.parquet'
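To make the requested behaviour concrete, here is a minimal sketch in plain Python of what "fill the absent columns with nulls" could mean if input columns are matched to table columns by position. It is not Kuzu's implementation; `pad_row` is a hypothetical helper, and positional matching is an assumption, since the issue does not say how the columns would be matched.

```python
from typing import Any, Dict, List, Sequence

# Columns declared in the DDL above.
TABLE_COLUMNS = ["name", "price", "historical_sales"]

# Rows as they appear in the two-column input file.
ROWS = [("Laptop", 1100.0), ("Mouse", 150.0), ("Headphones", 250.0)]


def pad_row(table_columns: Sequence[str], row: Sequence[Any]) -> Dict[str, Any]:
    """Map a row onto the table columns by position, padding with None.

    Hypothetical helper: assumes the input columns line up with the
    leading table columns, which the issue does not guarantee.
    """
    if len(row) > len(table_columns):
        raise ValueError(
            f"row has {len(row)} values but the table has only "
            f"{len(table_columns)} columns"
        )
    # Trailing table columns with no input value become null (None).
    padded = list(row) + [None] * (len(table_columns) - len(row))
    return dict(zip(table_columns, padded))


records: List[Dict[str, Any]] = [pad_row(TABLE_COLUMNS, row) for row in ROWS]

for record in records:
    print(record)
```

Under this assumption every record ends up with historical_sales set to None, which is the state the later MERGE would then update.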

@ray6080 I think this seems like a reasonable feature, but if you think it's infeasible, feel free to close.

prrao87 added the feature label Sep 26, 2024
@ray6080
Contributor

ray6080 commented Sep 27, 2024

In the COPY pipeline, because we already know the number of columns in the input prior to importing the data, can we implicitly infer the column names so that we can use the much simpler DDL command below?

I think implicitly inferring column names could introduce some confusing behaviour. For example, what if the column names don't exactly match, or the source has no header information? In that sense, I think COPY Product(name, price) FROM 'source' is much less prone to confusion.
But let's survey how other systems handle such cases before we jump to a conclusion.
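The ambiguity raised here can be sketched in plain Python using the example from the issue, where the file header says product but the table column is called name. This is only an illustration of the two candidate strategies; `match_by_name` and `match_by_position` are hypothetical helpers, not part of Kuzu.

```python
from typing import List, Optional, Sequence

TABLE_COLUMNS = ["name", "price", "historical_sales"]
FILE_HEADER = ["product", "price"]  # header of the example input file


def match_by_name(
    table_columns: Sequence[str], file_header: Optional[Sequence[str]]
) -> List[Optional[int]]:
    """For each table column, return the index of the file column with the
    same name, or None if the file has no such column."""
    if file_header is None:
        # Headerless sources give name-based matching nothing to work with.
        raise ValueError("source has no header; cannot match columns by name")
    positions = {name: index for index, name in enumerate(file_header)}
    return [positions.get(column) for column in table_columns]


def match_by_position(
    table_columns: Sequence[str], num_file_columns: int
) -> List[Optional[int]]:
    """Pair the i-th table column with the i-th file column, if it exists."""
    return [
        index if index < num_file_columns else None
        for index in range(len(table_columns))
    ]


by_name = match_by_name(TABLE_COLUMNS, FILE_HEADER)
by_position = match_by_position(TABLE_COLUMNS, len(FILE_HEADER))

print(by_name)      # [None, 1, None]: the primary key "name" finds no match
print(by_position)  # [0, 1, None]: only historical_sales is left null
```

With name-based matching the primary key column would be left unmatched for this file, while positional matching gives the result the issue asks for but silently ignores the header. An explicit column list such as COPY Product(name, price) avoids having to pick between the two.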
