Skip to content

[Gluten-9254][CH] Support RDDScanExec #9270

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Apr 15, 2025
Merged

[Gluten-9254][CH] Support RDDScanExec #9270

merged 7 commits into from
Apr 15, 2025

Conversation

loneylee
Copy link
Member

@loneylee loneylee commented Apr 9, 2025

What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

(Fixes: #9254)

How was this patch tested?

Test by ut

@github-actions github-actions bot added CORE works for Gluten Core CLICKHOUSE labels Apr 9, 2025
Copy link

github-actions bot commented Apr 9, 2025

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/apache/incubator-gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

Copy link

github-actions bot commented Apr 9, 2025

Run Gluten ClickHouse CI on ARM

Copy link

Run Gluten ClickHouse CI on ARM

1 similar comment
Copy link

Run Gluten ClickHouse CI on ARM

@loneylee loneylee requested a review from Copilot April 11, 2025 03:14
@loneylee
Copy link
Member Author

@CodiumAI-Agent /review

Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 19 out of 27 changed files in this pull request and generated 1 comment.

Files not reviewed (8)
  • backends-clickhouse/src-delta-32/test/scala/org/apache/spark/sql/execution/RDDScanSuite.scala: Language not supported
  • backends-clickhouse/src-kafka/main/scala/org/apache/gluten/component/CHKafkaComponent.scala: Language not supported
  • backends-clickhouse/src-kafka/main/scala/org/apache/gluten/extension/columnar/KafkaMiscColumnarRules.scala: Language not supported
  • backends-clickhouse/src/main/scala/org/apache/gluten/backendsapi/clickhouse/CHRuleApi.scala: Language not supported
  • backends-clickhouse/src/main/scala/org/apache/gluten/backendsapi/clickhouse/CHSparkPlanExecApi.scala: Language not supported
  • backends-clickhouse/src/main/scala/org/apache/gluten/extension/CHRemoveTopmostColumnarToRow.scala: Language not supported
  • backends-clickhouse/src/main/scala/org/apache/spark/sql/execution/CHRDDScanTransformer.scala: Language not supported
  • gluten-core/src/main/scala/org/apache/gluten/extension/caller/CallerInfo.scala: Language not supported
Comments suppressed due to low confidence (1)

cpp-ch/local-engine/Parser/SparkRowToCHColumn.cpp:60

  • The change from inserting null data (using insertData with a nullptr) to insertDefault alters the handling of null values. Please verify that this new behavior is intended and compatible with downstream logic.
columns[i]->insertDefault();

Comment on lines +552 to +553
return reinterpret_cast<jlong>(a);
LOCAL_ENGINE_JNI_METHOD_END(env, -1)
Copy link
Preview

Copilot AI Apr 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The call to LOCAL_ENGINE_JNI_METHOD_END appears after a return statement in Java_org_apache_gluten_vectorized_CHNativeBlock_copyBlock, making it unreachable. Consider moving the macro invocation before the return or restructuring the function to ensure proper cleanup.

Suggested change
return reinterpret_cast<jlong>(a);
LOCAL_ENGINE_JNI_METHOD_END(env, -1)
LOCAL_ENGINE_JNI_METHOD_END(env, -1)
return reinterpret_cast<jlong>(a);

Copilot is powered by AI, so mistakes are possible. Review output carefully before use.

@loneylee
Copy link
Member Author

@CodiumAI-Agent /improve

@CodiumAI-Agent
Copy link

CodiumAI-Agent commented Apr 11, 2025

PR Code Suggestions ✨

CategorySuggestion                                                                                                                                    Impact
Possible issue
Ensure property restoration consistency

Ensure that the restoration of the topmost conversion property is executed even when
early returns occur.

backends-clickhouse/src-delta-32/main/scala/org/apache/spark/sql/delta/commands/merge/MergeIntoMaterializeSource.scala [292-343]

 // --- modified start
 val originalRemoveTopmostC2R = CHRemoveTopmostColumnarToRow.isRemoveTopmostC2R(spark)
-spark.sparkContext.setLocalProperty(
-  CHRemoveTopmostColumnarToRow.REMOVE_TOPMOST_COLUMNAR_TO_ROW,
-  "true")
+try {
+  spark.sparkContext.setLocalProperty(
+    CHRemoveTopmostColumnarToRow.REMOVE_TOPMOST_COLUMNAR_TO_ROW,
+    "true")
+  if (!materialize) {
+    mergeSource = Some(
+      MergeSource(
+        df = Dataset.ofRows(spark, source),
+        isMaterialized = false,
+        materializeReason = materializeReason
+      )
+    )
+    return
+  }
+  // Continue with materialization logic...
+  ...
+} finally {
+  CHRemoveTopmostColumnarToRow.setRemoveTopmostC2R(originalRemoveTopmostC2R, spark)
+}
 // --- modified end
 
-if (!materialize) {
-  // Does not materialize, simply return the dataframe from source plan
-  mergeSource = Some(
-    MergeSource(
-      df = Dataset.ofRows(spark, source),
-      isMaterialized = false,
-      materializeReason = materializeReason
-    )
-  )
-  return
-}
-...
-// --- modified start
-CHRemoveTopmostColumnarToRow.setRemoveTopmostC2R(originalRemoveTopmostC2R, spark)
-// --- modified end
-
Suggestion importance[1-10]: 8

__

Why: The suggestion addresses a potential bug where an early return might skip the restoration of the topmost conversion property. Wrapping the configuration change in a try–finally block ensures that the property is reset regardless of control flow, making the solution both relevant and effective.

Medium
General
Improve streaming detection

Review the string matching logic for detecting streaming calls to ensure it
accurately identifies all relevant call frames.

gluten-core/src/main/scala/org/apache/gluten/extension/caller/CallerInfo.scala [69-71]

 private def inStreamingCall(stack: Seq[StackTraceElement]): Boolean = {
-  stack.exists(_.getClassName.equals(StreamExecution.getClass.getName.split('$').head))
+  // Consider using a more robust matching method, e.g., checking for substring or regex that matches expected streaming classes.
+  stack.exists(elem => elem.getClassName.contains("StreamExecution"))
 }
Suggestion importance[1-10]: 5

__

Why: The suggestion enhances the existing string matching by using a substring check which is more flexible; however, as it is an improvement in detection logic rather than a fix for a critical bug, its impact is moderate and may risk false positives if not carefully reviewed.

Low

Copy link

Run Gluten ClickHouse CI on ARM

@CodiumAI-Agent
Copy link

PR Code Suggestions ✨

CategorySuggestion                                                                                                                                    Impact
Possible issue
Ensure property restoration consistency

Ensure that the restoration of the topmost conversion property is executed even when
early returns occur.

backends-clickhouse/src-delta-32/main/scala/org/apache/spark/sql/delta/commands/merge/MergeIntoMaterializeSource.scala [292-343]

 // --- modified start
 val originalRemoveTopmostC2R = CHRemoveTopmostColumnarToRow.isRemoveTopmostC2R(spark)
-spark.sparkContext.setLocalProperty(
-  CHRemoveTopmostColumnarToRow.REMOVE_TOPMOST_COLUMNAR_TO_ROW,
-  "true")
+try {
+  spark.sparkContext.setLocalProperty(
+    CHRemoveTopmostColumnarToRow.REMOVE_TOPMOST_COLUMNAR_TO_ROW,
+    "true")
+  if (!materialize) {
+    mergeSource = Some(
+      MergeSource(
+        df = Dataset.ofRows(spark, source),
+        isMaterialized = false,
+        materializeReason = materializeReason
+      )
+    )
+    return
+  }
+  // Continue with materialization logic...
+  ...
+} finally {
+  CHRemoveTopmostColumnarToRow.setRemoveTopmostC2R(originalRemoveTopmostC2R, spark)
+}
 // --- modified end
 
-if (!materialize) {
-  // Does not materialize, simply return the dataframe from source plan
-  mergeSource = Some(
-    MergeSource(
-      df = Dataset.ofRows(spark, source),
-      isMaterialized = false,
-      materializeReason = materializeReason
-    )
-  )
-  return
-}
-...
-// --- modified start
-CHRemoveTopmostColumnarToRow.setRemoveTopmostC2R(originalRemoveTopmostC2R, spark)
-// --- modified end
-
Suggestion importance[1-10]: 8

__

Why: The suggestion addresses a potential bug where an early return might skip the restoration of the topmost conversion property. Wrapping the configuration change in a try–finally block ensures that the property is reset regardless of control flow, making the solution both relevant and effective.

Medium
General
Improve streaming detection

Review the string matching logic for detecting streaming calls to ensure it
accurately identifies all relevant call frames.

gluten-core/src/main/scala/org/apache/gluten/extension/caller/CallerInfo.scala [69-71]

 private def inStreamingCall(stack: Seq[StackTraceElement]): Boolean = {
-  stack.exists(_.getClassName.equals(StreamExecution.getClass.getName.split('$').head))
+  // Consider using a more robust matching method, e.g., checking for substring or regex that matches expected streaming classes.
+  stack.exists(elem => elem.getClassName.contains("StreamExecution"))
 }
Suggestion importance[1-10]: 5

__

Why: The suggestion enhances the existing string matching by using a substring check which is more flexible; however, as it is an improvement in detection logic rather than a fix for a critical bug, its impact is moderate and may risk false positives if not carefully reviewed.

Low

Copy link

Run Gluten ClickHouse CI on ARM

Copy link

Run Gluten ClickHouse CI on ARM

Copy link

Run Gluten ClickHouse CI on ARM

Copy link

Run Gluten ClickHouse CI on ARM

@loneylee loneylee merged commit 65ad57a into apache:main Apr 15, 2025
49 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLICKHOUSE CORE works for Gluten Core
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CH] Support CloumnBatch RDDScanExec
3 participants