Change log

Generated on 2024-10-14

Release 24.10

Features


#11525	[FEA] If dump always is enabled dump before decoding the file
#11461	[FEA] Support non-UTC timezone for casting from date to timestamp
#11445	[FEA] Support format 'yyyyMMdd' in GetTimestamp operator
#11442	[FEA] Add in support for setting row group sizes for parquet
#11330	[FEA] Add companion metrics for all nsTiming metrics to measure time elapsed excluding semaphore wait
#5223	[FEA] Support array_join
#10968	[FEA] support min_by function
#10437	[FEA] Add Spark 3.5.2 snapshot support

Performance


#10799	[FEA] Optimize count distinct performance optimization with null columns reuse and post expand coalesce
#8301	[FEA] semaphore prioritization
#11234	Explore swapping build table for left outer joins
#11263	[FEA] Cluster/pack multi_get_json_object paths by common prefixes

Bugs Fixed


#11573	[BUG] very long tail task is observed when many tasks are contending for PrioritySemaphore
#11367	[BUG] Error "table_view.cpp:36: Column size mismatch" when using approx_percentile on a string column
#11543	[BUG] test_yyyyMMdd_format_for_legacy_mode[DATAGEN_SEED=1727619674, TZ=UTC] failed GPU and CPU are not both null
#11500	[BUG] dataproc serverless Integration tests failing in json_matrix_test.py
#11384	[BUG] "rs. shuffle write time" negative values seen in app history log
#11509	[BUG] buildall no longer works
#11501	[BUG] test_yyyyMMdd_format_for_legacy_mode failed in Dataproc Serverless integration tests
#11502	[BUG] IT script failed get jars as we stop deploying intermediate jars since 24.10
#11479	[BUG] spark400 build failed do not conform to class UnaryExprMeta's type parameter
#8558	[BUG] `from_json` generated inconsistent result comparing with CPU for input column with nested json strings
#11485	[BUG] Integration tests failing in join_test.py
#11481	[BUG] non-utc integration tests failing in json_test.py
#10911	from_json: when input is a bad json string, rapids would throw an exception.
#10457	[BUG] ScanJson and JsonToStructs allow unquoted control chars by default
#10479	[BUG] JsonToStructs and ScanJson should return null for non-numeric, non-boolean non-quoted strings
#10534	[BUG] Need Improved JSON Validation
#11436	[BUG] Mortgage unit tests fail with RAPIDS shuffle manager
#11437	[BUG] array and map casts to string tests failed
#11463	[BUG] hash_groupby_approx_percentile failed assert is None
#11465	[BUG] java.lang.NoClassDefFoundError: org/apache/spark/BuildInfo$ in non-databricks environment
#11359	[BUG] a couple of arithmetic_ops_test.py cases failed mismatching cpu and gpu values with [DATAGEN_SEED=1723985531, TZ=UTC, INJECT_OOM]
#11392	[AUDIT] Handle IgnoreNulls Expressions for Window Expressions
#10770	[BUG] Slow/no progress with cascaded pandas udfs/mapInPandas in Databricks
#11397	[BUG] We should not be using copyWithBooleanColumnAsValidity unless we can prove it is 100% safe
#11372	[BUG] spark400 failed compiling datagen_2.13
#11364	[BUG] Missing numRows in the ColumnarBatch created in GpuBringBackToHost
#11350	[BUG] spark400 compile failed in scala213
#11346	[BUG] databrick nightly failing with not able to get spark-version-info.properties
#9604	[BUG] Delta Lake metadata query detection can trigger extra file listing jobs
#11318	[BUG] GPU query is case sensitive on Hive text table's column name
#10596	[BUG] ScanJson and JsonToStructs does not deal with escaped single quotes properly
#10351	[BUG] test_from_json_mixed_types_list_struct failed
#11294	[BUG] binary-dedupe leaves around a copy of "unshimmed" class files in spark-shared
#11183	[BUG] Failed to split an empty string with error "ai.rapids.cudf.CudfException: parallel_for failed: cudaErrorInvalidDevice: invalid device ordinal"
#11008	Fix tests failures in ast_test.py
#11265	[BUG] segfaults seen in cuDF after prefetch calls intermittently
#11025	Fix tests failures in date_time_test.py
#11065	[BUG] Spark Connect Server (3.5.1) Can Not Running Correctly

PRs


#11576	Update rapids JNI and private dependency to 24.10.0
#11582	[DOC] update doc for 24.10 release [skip ci]
#11588	backport fixes of #11573 to branch 24.10
#11569	Have "dump always" dump input files before trying to decode them
#11567	Fix test case unix_timestamp(col, 'yyyyMMdd') failed for Africa/Casablanca timezone and LEGACY mode
#11496	Update test now that code is fixed
#11548	Fix negative rs. shuffle write time
#11545	Update test case related to LEGACY datetime format to unblock nightly CI
#11515	Propagate default DIST_PROFILE_OPT profile to Maven in buildall
#11497	Update from_json to use new cudf features
#11516	Deploy all submodules for default sparkver in nightly [skip ci]
#11484	Fix FileAlreadyExistsException in LORE dump process
#11457	GPU device watermark metrics
#11507	Replace libmamba-solver with mamba command [skip ci]
#11503	Download artifacts via wget [skip ci]
#11490	Use UnaryLike instead of UnaryExpression
#10798	Optimizing Expand+Aggregate in sqls with many count distinct
#11366	Enable parquet suites from Spark UT
#11477	Install cuDF-py against python 3.10 on Databricks
#11462	Support non-UTC timezone for casting from date type to timestamp type
#11449	Support yyyyMMdd in GetTimestamp operator for LEGACY mode
#11456	Enable tests for all JSON white space normalization
#11483	Use reusable auto-merge workflow [skip ci]
#11482	Fix a json test for non utc time zone
#11464	Use improved CUDF JSON validation
#11474	Enable tests after string_split was fixed
#11473	Revert "Skip test_hash_groupby_approx_percentile byte and double test…
#11466	Replace scala.util.Try with a try statement in the DBR buildinfo
#11469	Skip test_hash_groupby_approx_percentile byte and double tests tempor…
#11429	Fixed some of the failing parquet_tests
#11455	Log DBR BuildInfo
#11451	xfail array and map cast to string tests
#11331	Add companion metrics for all nsTiming metrics without semaphore
#11421	[DOC] remove the redundant archive link [skip ci]
#11308	Dynamic Shim Detection for `build` Process
#11427	Update CI scripts to work with the "Dynamic Shim Detection" change [skip ci]
#11425	Update signoff usage [skip ci]
#11420	Add in array_join support
#11418	stop using copyWithBooleanColumnAsValidity
#11411	Fix asymmetric join crash when stream side is empty
#11395	Fix a Pandas UDF slowness issue
#11371	Support MinBy and MaxBy for non-float ordering
#11399	stop using copyWithBooleanColumnAsValidity
#11389	prevent duplicate queueing in the prio semaphore
#11291	Add distinct join support for right outer joins
#11396	Drop cudf-py python 3.9 support [skip ci]
#11393	Revert work-around for empty split-string
#11334	Add support for Spark 3.5.2
#11388	JSON tests for corrected date, timestamp, and mixed types
#11375	Fix spark400 build in datagen and tests
#11376	Create a PrioritySemaphore to back the GpuSemaphore
#11383	Fix nightly snapshots being downloaded in premerge build
#11368	Move SparkRapidsBuildInfoEvent to its own file
#11329	Change reference to `MapUtils` into `JSONUtils`
#11365	Set numRows for the ColumnBatch created in GpuBringBackToHost
#11363	Fix failing test compile for Spark 4.0.0
#11362	Add tests for repeated JSON columns/keys
#11321	conform dependency list in 341db to previous versions style
#10604	Add string escaping JSON tests to the test_json_matrix
#11328	Swap build side for outer joins when natural build side is explosive
#11358	Fix download doc [skip ci]
#11357	Fix auto merge conflict 11354 [skip ci]
#11347	Revert "Fix the mismatching default configs in integration tests (#11283)"
#11323	replace inputFiles with location.rootPaths.toString
#11340	Audit script - Check commits from sql-hive directory [skip ci]
#11283	Fix the mismatching default configs in integration tests
#11327	Make hive column matches not case-sensitive
#11324	Append ustcfy to blossom-ci whitelist [skip ci]
#11325	Fix auto merge conflict 11317 [skip ci]
#11319	Update passing JSON tests after list support added in CUDF
#11307	Safely close multiple resources in RapidsBufferCatalog
#11313	Fix auto merge conflict 10845 11310 [skip ci]
#11312	Add jihoonson as an authorized user for blossom-ci [skip ci]
#11302	Fix display issue of lore.md
#11301	Skip deploying non-critical intermediate artifacts [skip ci]
#11299	Enable get_json_object by default and remove legacy version
#11289	Use the new chunked API from multi-get_json_object
#11295	Remove redundant classes from the dist jar and unshimmed list
#11284	Use distinct count to estimate join magnification factor
#11288	Move easy unshimmed classes to sql-plugin-api
#11285	Remove files under tools/generated_files/spark31* [skip ci]
#11280	Asynchronously copy table data to the host during shuffle
#11258	Explicitly disable ANSI mode for ast_test.py
#11267	Update the rapids JNI and private dependency version to 24.10.0-SNAPSHOT
#11241	Auto merge PRs to branch-24.10 from branch-24.08 [skip ci]
#11231	Cache dependencies for scala 2.13 [skip ci]

Release 24.08

Features


#9259	[FEA] Create Spark 4.0.0 shim and build env
#10366	[FEA] It would be nice if we could support Hive-style write bucketing table
#10987	[FEA] Implement lore framework to support all operators.
#11087	[FEA] Support regex pattern with brackets when rewrite to PrefixRange pattern in rlike
#22	[FEA] Add support for bucketed writes
#9939	[FEA] `GpuInsertIntoHiveTable` supports parquet format

Performance


#8750	[FEA] Rework GpuSubstringIndex to use cudf::slice_strings
#7404	[FEA] explore a hash agg passthrough on partial aggregates
#10976	Rewrite `pattern1

Bugs Fixed


#11287	[BUG] String split APIs on empty string produce incorrect result
#11270	[BUG] test_regexp_replace[DATAGEN_SEED=1722297411, TZ=UTC] hanging there forever in pre-merge CI intermittently
#9682	[BUG] Casting FLOAT64 to DECIMAL(12,7) produces different rows from Apache Spark CPU
#10809	[BUG] cast(9.95 as decimal(3,1)), actual: 9.9, expected: 10.0
#11266	[BUG] test_broadcast_hash_join_constant_keys failed in databricks runtimes
#11243	[BUG] ArrayIndexOutOfBoundsException on a left outer join
#11030	Fix tests failures in string_test.py
#11245	[BUG] mvn verify for the source-javadoc fails and no pre-merge check catches it
#11223	[BUG] Remove unreferenced `CUDF_VER=xxx` in the CI script
#11114	[BUG] Update nightly tests for Scala 2.13 to use JDK 17 only
#11229	[BUG] test_delta_name_column_mapping_no_field_ids fails on Spark
#11031	Fix tests failures in multiple files
#10948	Figure out why `MapFromArrays` appears in the tests for hive parquet write
#11018	Fix tests failures in hash_aggregate_test.py
#11173	[BUG] The `rs. serialization time` metric is misleading
#11017	Fix tests failures in url_test.py
#11201	[BUG] Delta Lake tables with name mapping can throw exceptions on read
#11175	[BUG] Clean up unused and duplicated 'org/roaringbitmap' folder in the spark3xx shims
#11196	[BUG] pipeline failed due to class not found exception: NoClassDefFoundError: com/nvidia/spark/rapids/GpuScalar
#11189	[BUG] regression in NDS after PR #11170
#11167	[BUG] UnsupportedOperationException during delta write with `optimize()`
#11172	[BUG] `get_json_object` returns wrong output with wildcard path
#11148	[BUG] Integration test `test_write_hive_bucketed_table` fails
#11155	[BUG] ArrayIndexOutOfBoundsException in BatchWithPartitionData.splitColumnarBatch
#11152	[BUG] LORE dumping consumes too much memory.
#11029	Fix tests failures in subquery_test.py
#11150	[BUG] hive_parquet_write_test.py::test_insert_hive_bucketed_table failure
#11070	[BUG] numpy2 fail fastparquet cases: numpy.dtype size changed
#11136	UnaryPositive expression doesn't extend UnaryExpression
#11122	[BUG] UT MetricRange failed 651070526 was not less than 1.5E8 in spark313
#11119	[BUG] window_function_test.py::test_window_group_limits_fallback_for_row_number fails in a distributed environment
#11023	Fix tests failures in dpp_test.py
#11026	Fix tests failures in map_test.py
#11020	Fix tests failures in grouping_sets_test.py
#11113	[BUG] Update premerge tests for Scala 2.13 to use JDK 17 only
#11027	Fix tests failures in sort_test.py
#10775	[BUG] Issues found by Spark UT Framework on RapidsStringExpressionsSuite
#11033	[BUG] CICD failed a case: cmp_test.py::test_empty_filter[>]
#11103	[BUG] UCX Shuffle With scala.MatchError
#11007	Fix tests failures in array_test.py
#10801	[BUG] JDK17 nightly build after Spark UT Framework is merged
#11019	Fix tests failures in window_function_test.py
#11063	[BUG] op time for GpuCoalesceBatches is more than actual
#11006	Fix test failures in arithmetic_ops_test.py
#10995	Fallback TimeZoneAwareExpression that only support UTC with zoneId instead of timeZone config
#8652	[BUG] array_item test failures on Spark 3.3.x
#11053	[BUG] Build on Databricks 330 fails
#10925	Concat cannot accept no parameter
#10975	[BUG] regex `^.*literal` cannot be rewritten as `contains(literal)` for multiline strings
#10956	[BUG] hive_parquet_write_test.py: test_write_compressed_parquet_into_hive_table integration test failures
#10772	[BUG] Issues found by Spark UT Framework on RapidsDataFrameAggregateSuite
#10986	[BUG] Cast from string to float using hand-picked values failed in CastOpSuite
#10972	Spark 4.0 compile errors
#10794	[BUG] Incorrect cast of string columns containing various infinity notations with trailing spaces
#10964	[BUG] Improve stability of pre-merge jenkinsfile
#10714	Signature changed for `PythonUDFRunner.writeUDFs`
#10712	[AUDIT] BatchScanExec/DataSourceV2Relation to group splits by join keys if they differ from partition keys
#10673	[AUDIT] Rename plan nodes for PythonMapInArrowExec
#10710	[AUDIT] `uncacheTableOrView` changed in CommandUtils
#10711	[AUDIT] Match DataSourceV2ScanExecBase changes to groupPartitions method
#10669	Supporting broadcast of multiple filtering keys in DynamicPruning

PRs


#11400	[DOC] update notes in download page for the decompressing gzip issue [skip ci]
#11355	Update changelog for the v24.08 release [skip ci]
#11353	Update download doc for v24.08.1 [skip ci]
#11352	Update version to 24.08.1-SNAPSHOT [skip ci]
#11337	Update changelog for the v24.08 release [skip ci]
#11335	Fix Delta Lake truncation of min/max string values
#11304	Update changelog for v24.08.0 release [skip ci]
#11303	Update rapids JNI and private dependency to 24.08.0
#11296	[DOC] update doc for 2408 release [skip CI]
#11309	[DOC] Update lore doc about the range [skip ci]
#11292	Add work around for string split with empty input.
#11278	Fix formatting of advanced configs doc
#10917	Adopt changes from JNI for casting from float to decimal
#11269	Revert "upgrade ucx to 1.17.0"
#11260	Mitigate intermittent test_buckets and shuffle_smoke_test OOM issue
#11268	Fix degenerate conditional nested loop join detection
#11244	Fix ArrayIndexOutOfBoundsException on join counts with constant join keys
#11259	CI Docker to support integration tests with Rocky OS + jdk17 [skip ci]
#11247	Fix `string_test.py` errors on Spark 4.0
#11246	Rework Maven Source Plugin Skip
#11149	Rework on substring index
#11236	Remove the unused vars from the version-def CI script
#11237	Fork jvm for maven-source-plugin
#11200	Multi-get_json_object
#11230	Skip test where Delta Lake may not be fully compatible with Spark
#11220	Avoid failing spark bug SPARK-44242 while generate run_dir
#11226	Fix auto merge conflict 11212
#11129	Spark 4: Fix miscellaneous tests including logic, repart, hive_delimited.
#11163	Support `MapFromArrays` on GPU
#11219	Fix hash_aggregate_test.py to run with ANSI enabled
#11186	from_json Json to Struct Exception Logging
#11180	More accurate estimation for the result serialization time in RapidsShuffleThreadedWriterBase
#11194	Fix ANSI mode test failures in url_test.py
#11202	Fix read from Delta Lake table with name column mapping and missing Parquet IDs
#11185	Fix multi-release jar problem
#11144	Build the Scala2.13 dist jar with JDK17
#11197	Fix class not found error: com/nvidia/spark/rapids/GpuScalar
#11191	Fix dynamic pruning regression in GpuFileSourceScanExec
#10994	Add Spark 4.0.0 Build Profile and Other Supporting Changes
#11192	Append new authorized user to blossom-ci whitelist [skip ci]
#11179	Allow more expressions to be tiered
#11141	Enable some Rapids config in RapidsSQLTestsBaseTrait for Spark UT
#11170	Avoid listFiles or inputFiles on relations with static partitioning
#11159	Drop spark31x shims
#10951	Case when performance improvement: reduce the `copy_if_else`
#11165	Fix some GpuBroadcastToRowExec by not dropping columns
#11126	Coalesce batches after a logical coalesce operation
#11164	fix the bucketed write error for non-utc cases
#11132	Add deletion vector metrics for low shuffle merge.
#11156	Fix batch splitting for partition column size on row-count-only batches
#11153	Fix LORE dump oom.
#11102	Fix ANSI mode failures in subquery_test.py
#11151	Fix the test error of the bucketed write for the non-utc case
#11147	upgrade ucx to 1.17.0
#11138	Update fastparquet to 2024.5.0 for numpy2 compatibility
#11137	Handle the change for UnaryPositive now extending RuntimeReplaceable
#11094	Add `HiveHash` support on GPU
#11139	Improve MetricsSuite to allow more gc jitter
#11133	Fix `test_window_group_limits_fallback`
#11097	Fix miscellaneous integ tests for Spark 4
#11118	Fix issue with DPP and AQE on reused broadcast exchanges
#11043	Dataproc serverless test fixes
#10965	Profiler: Disable collecting async allocation events by default
#11117	Update Scala2.13 premerge CI against JDK17
#11084	Introduce LORE framework.
#11099	Spark 4: Handle ANSI mode in sort_test.py
#11115	Fix match error in RapidsShuffleIterator.scala [scala2.13]
#11088	Support regex patterns with brackets when rewriting to PrefixRange pattern in rlike.
#10950	Add a heuristic to skip second or third agg pass
#11048	Fixed array_tests for Spark 4.0.0
#11049	Fix some cast_tests for Spark 4.0.0
#11066	Replaced spark3xx-common references to spark-shared
#11083	Exclude a case based on JDK version in Spark UT
#10997	Fix some test issues in Spark UT and keep RapidsTestSettings up-to-date
#11073	Disable ANSI mode for window function tests
#11076	Improve the diagnostics for 'conv' fallback explain
#11092	Add GpuBucketingUtils shim to Spark 4.0.0
#11062	fix duplicate counted metrics like op time for GpuCoalesceBatches
#11044	Fixed Failing tests in arithmetic_ops_tests for Spark 4.0.0
#11086	upgrade blossom-ci actions version [skip ci]
#10957	Support bucketing write for GPU
#10979	[FEA] Introduce low shuffle merge.
#10996	Fallback non-UTC TimeZoneAwareExpression with zoneId
#11072	Workaround numpy2 failed fastparquet compatibility tests
#11046	Calculate parallelism to speed up pre-merge CI
#11054	fix flaky array_item test failures
#11051	[FEA] Increase parallelism of deltalake test on databricks
#10993	`binary-dedupe` changes for Spark 4.0.0
#11060	Add in the ability to fingerprint JSON columns
#11059	Revert "Add in the ability to fingerprint JSON columns (#11002)" [skip ci]
#11039	Concat() Exception bug fix
#11002	Add in the ability to fingerprint JSON columns
#10977	Rewrite multiple literal choice regex to multiple contains in rlike
#11035	Fix auto merge conflict 11034 [skip ci]
#11040	Append new authorized user to blossom-ci whitelist [skip ci]
#11036	Update blossom-ci ACL to secure format [skip ci]
#11032	Fix a hive write test failure for Spark 350
#10998	Improve log to print more lines in build [skip ci]
#10992	Addressing the Named Parameter change in Spark 4.0.0
#10943	Fix Spark UT issues in RapidsDataFrameAggregateSuite
#10963	Add rapids configs to enable GPU running in Spark UT
#10978	More compilation fixes for Spark 4.0.0
#10953	Speed up the integration tests by running them in parallel on the Databricks cluster
#10958	Fix a hive write test failure
#10970	Move Support for `RaiseError` to a Shim Excluding Spark 4.0.0
#10966	Add default value for REF of premerge jenkinsfile to avoid bad overwritten [skip ci]
#10959	Add new ID to blossom-ci allow list [skip ci]
#10952	Add shims to take care of the signature change for writeUDFs in PythonUDFRunner
#10931	Add Support for Renaming of PythonMapInArrow
#10949	Change dependency version to 24.08.0-SNAPSHOT
#10857	[Spark 4.0] Account for `PartitionedFileUtil.splitFiles` signature change.
#10912	GpuInsertIntoHiveTable supports parquet format
#10863	[Spark 4.0] Account for `CommandUtils.uncacheTableOrView` signature change.
#10944	Added Shim for BatchScanExec to Support Spark 4.0
#10946	Unarchive Spark test jar for spark.read(ability)
#10945	Add Support for Multiple Filtering Keys for Subquery Broadcast
#10871	Add classloader diagnostics to initShuffleManager error message
#10933	Fixed Databricks build
#10929	Append new authorized user to blossom-ci whitelist [skip ci]

Older Releases

Changelog of older releases can be found at docs/archives
