Merge pull request #10313 from tvegas1/news_1.18.0-rc1

RELEASE: Updated NEWS for v1.18.0-rc1
openucx · Dec 23, 2024 · 9ce35d0
2 parents ed3cebd + b436dac
Showing 1 changed file with 150 additions and 3 deletions.
diff --git a/NEWS b/NEWS
@@ -11,9 +11,156 @@
 ### Features:
 ### Bugfixes:
 
-## 1.18.0-rc2 (December 10, 2024)
-### Features: TBD
-### Bugfixes: TBD
+## 1.18.0-rc3 (December 18, 2024)
+### Features:
+#### UCP
+ * Enabled using CUDA staging buffers for pipeline protocols by default
+ * Added endpoint reconfiguration support for non-reused p2p scenarios
+ * Enabled non-cacheable memory domains, activated for gdr_copy
+ * Added user_data parameter to ucp_ep_query
+ * Added support for host memory pipeline through CUDA buffers for rendezvous protocol
+ * Added global VA infrastructure and memory region in absence of error handling
+ * Made protocol performance node names more informative
+ * Enforced always running on the same thread in single thread mode
+ * Multiple improvements in protocols selection infrastructure
+ * Added UCP_MEM_MAP_LOCK API flag to enforce locked memory mapping
+ * Allowed up to 64 endpoint lanes for systems with many transports or devices
+ * Added usage tracker to worker
+ * Improved various logging messages
+#### RDMA CORE (IB, ROCE, etc.)
+ * Added environment variable to manage DC initiator capacity
+ * Added DC dcs_hybrid policy
+ * Reduced MLX5/DV stack size consumption
+ * Added ODP support for verbs and mlx5dv
+ * Added support of CUDA managed memory on IB when ODP is available
+ * Added support of Adaptive Routing on RoCE
+ * Enabled use of implicit ODP with relaxed ordering
+ * Improved GPU-Direct detection in IB transport
+ * Increased DC initiator default count to 32 for performance optimization
+ * Added ConnectX-8 device support with DDP
+ * Added support for subnet filter list for RoCE interfaces
+ * Enhanced the error message to provide more details when a connection cannot be established due to unreachable transports
+ * Added IB MLX5 as a separate UCX module with separate RPM sub-package
+ * Added initial support for GGA transport, for fast DPU memory access
+ * Set IB DevX atomic mode based on device capabilities
+ * Removed DC keepalive mechanism, since the keepalive is done on UCP layer
+ * Optimized cross-gVMI memory registration using indirect memory keys cache
+ * Improved various logging messages
+#### CUDA
+ * Added multi-node NVlink support
+ * Added CUDA Fabric memory support with detection and allocation
+ * Improved gdr_copy latency estimations on AMD Milan systems
+ * Added check for gdr_copy runtime/build version mismatch
+ * Added handling missing IPC capability when unpacking keys
+ * Added caching for CUDA IPC memory pool import operation
+ * Added gdr_copy variables to optimize performance on Grace Hopper systems
+ * Improved CUDA IPC concurrency for a larger count of reachable peers
+#### UCS
+ * Added support for wildcards in configuration parameter names
+ * Added ASAN protection to several internal data structures
+ * Reduced stack usage in topology detection code
+ * Improved bitmaps configuration parsing with wider bitfield
+ * Added options to set topology distance between devices
+ * Optimized VFS unix socket watch by using user private folder
+ * Added general IP subnet matching infrastructure
+ * Extended array data structure to support user-provided array copy routine
+ * Improved time units description
+#### UCM
+ * Extended CUDA memory hooks to include memory mapping APIs
+#### Tools
+ * Improved performance by increasing window size for put_bw and add get_bw in ucx_perftest
+ * Added multi-send flag for receive operations in bandwidth benchmarks in ucx_perftest
+ * Improved ucx_perftest uni-directional test with added fence
+ * Detailed ucx_perftest batch section of command-line documentation
+#### Documentation
+ * Added a section regarding adaptive routing on RoCE
+#### Architecture
+ * Added CPU Model for MI300A
+ * Added Fujitsu ARM specific values to ucx.conf
+ * Added AMD Turin support
+ * Added an optimized non-temporal memory copy implementation for AMD CPU
+#### Build
+ * Improved compiler error reporting with added flag
+ * Improved coverity script to allow faster turnaround time
+ * Improved Intel Compiler detection and support
+#### GO
+ * Added multi-send flag and user memh support in request params
+#### Packaging
+ * Improved dpkg-buildpackage sample command by explicitly adding mlx5 related arguments
+### Bugfixes:
+#### UCP
+ * Fixed stack overflow in exported rkey unpack
+ * Removed extra remote-cpu overhead from protocol estimation for zcopy
+ * Fixed performance estimation for rndv pipeline protocols
+ * Fixed ATP sending by picking the correct lane
+ * Fixed missing reg_id on memh creation
+ * Fixed repeated invalidations by retaining existing access flags
+ * Fixed abort reason propagation for rendezvous RTR mtype
+ * Do not check transport availability if it is disabled by UCX_TLS environment variable
+ * Fixed wrong flag being used for checking BCOPY capability
+ * Fixed sending too many ATPs for small messages
+ * Enforced 16 bits size for Active Messages identifiers
+ * Fixed unnecessary status check for emulated AMO
+ * Fixed more than one fragment sending in rendezvous pipeline
+ * Fixed crash by using biggest max frag across all lanes
+ * Fixed missing memory handle flags by copying from parent to child
+ * Fixed worker interface activate count
+ * Fixed flush requests by replacing ATP/flush lane map with lane indexes
+ * Fixed lost uct_flags when merging memory regions
+#### UCT
+ * Fixed memory domain UCT flags description
+#### RDMA CORE (IB, ROCE, etc.)
+ * Fixed FETCH_ADD remote access error for ODP/KSM case
+ * Fixed missing conditional compilation checks for DM
+ * Fixed IB MD allocation naming typo
+ * Fixed invalid GIDs filter in IB
+ * Fixed flags usage in MLX5 zcopy_post
+ * Do not limit ODP registration retries
+ * Fixed JUCX failures by considering the number of supported completion vectors
+#### CUDA
+ * Fixed async memory handling using CUDA memory type on Grace
+ * Added rcache overhead in performance estimation
+ * Fixed gdr_copy performance regression by providing maximum estimation between get and put
+ * Fixed CUDA IPC reachability check
+ * Fixed crash in MPI_Finalize when CUDA context is destroyed
+ * Always require rcache by default for gdr_copy
+ * Fixed crash in gdr_copy cleanup when registration cache is disabled
+ * Fixed CUDA copy memory domain allocations
+ * Fixed multiple tests for gdr_copy transport
+ * Fixed race condition in CUDA IPC peer accessible cache
+#### UCS
+ * Fixed a crash by using heap allocation to process expired timers in batch
+ * Fixed allocation issue on memtrack dump
+ * Fixed deletion of the monitored folder in VFS
+ * Fixed unsafe resize for DC initiator array
+ * Fixed function macro invocation to match C standard
+ * Fixed calling async handler on already released resource
+ * Fixed performance by setting higher bandwidth for different NUMA nodes on Grace
+ * Fixed undeclared value error in timer conversion routine
+ * Fixed uninitialized value access in registration cache
+#### UCM
+ * Fixed race condition in parsing proc maps
+ * Fixed mremap failure while parsing /proc/self/maps
+#### ROCM
+ * Fixed ROCM interface reachability test
+ * Fixed memory domain fork test
+#### TCP
+ * Always bind endpoint to interface
+#### Tools
+ * Fixed buffer size potential overflow in ucx_perftest
+ * Fixed missing address when packing memory keys on ucx_perftest
+ * Fixed memory leak for endpoint report in ucx_info
+ * Fixed build without openmp in ucx_perftest
+ * Fixed UCT device override on server side on ucx_perftest
+#### Build
+ * Fixed using correct ASAN version for running tests
+#### Configuration
+ * Used POSIX Bourne shell syntax to check equality
+ * Fixed build failure by using proper flags in compiler.m4
+ * Fixed perftest MAD support default guessing
+#### GO
+ * Added serialized thread mode to avoid subtle races between threads
+ * Fixed make distcheck
 
 ## 1.17.0 (June 13, 2024)
 ### Features: