Merge pull request #10313 from tvegas1/news_1.18.0-rc1
RELEASE: Updated NEWS for v1.18.0-rc1
tvegas1 authored Dec 23, 2024
2 parents ed3cebd + b436dac commit 9ce35d0
Showing 1 changed file with 150 additions and 3 deletions.
153 changes: 150 additions & 3 deletions NEWS
@@ -11,9 +11,156 @@
### Features:
### Bugfixes:

## 1.18.0-rc2 (December 10, 2024)
### Features: TBD
### Bugfixes: TBD
## 1.18.0-rc3 (December 18, 2024)
### Features:
#### UCP
* Enabled using CUDA staging buffers for pipeline protocols by default
* Added endpoint reconfiguration support for non-reused p2p scenarios
* Enabled non-cacheable memory domains, activated for gdr_copy
* Added user_data parameter to ucp_ep_query
* Added support for host memory pipeline through CUDA buffers for rendezvous protocol
* Added global VA infrastructure and memory region in absence of error handling
* Made protocol performance node names more informative
* Enforced always running on the same thread in single thread mode
* Multiple improvements in protocol selection infrastructure
* Added UCP_MEM_MAP_LOCK API flag to enforce locked memory mapping
* Allowed up to 64 endpoint lanes for systems with many transports or devices
* Added usage tracker to worker
* Improved various logging messages
#### RDMA CORE (IB, ROCE, etc.)
* Added environment variable to manage DC initiator capacity
* Added DC dcs_hybrid policy
* Reduced MLX5/DV stack size consumption
* Added ODP support for verbs and mlx5dv
* Added support of CUDA managed memory on IB when ODP is available
* Added support of Adaptive Routing on RoCE
* Enabled use of implicit ODP with relaxed ordering
* Improved GPU-Direct detection in IB transport
* Increased DC initiator default count to 32 for performance optimization
* Added ConnectX-8 device support with DDP
* Added support for subnet filter list for RoCE interfaces
* Enhanced the error message to provide more details when a connection cannot be established due to unreachable transports
* Added IB MLX5 as a separate UCX module with separate RPM sub-package
* Added initial support for GGA transport, for fast DPU memory access
* Set IB DevX atomic mode based on device capabilities
* Removed DC keepalive mechanism, since the keepalive is done on UCP layer
* Optimized cross-gVMI memory registration using indirect memory keys cache
* Improved various logging messages
#### CUDA
* Added multi-node NVLink support
* Added CUDA Fabric memory support with detection and allocation
* Improved gdr_copy latency estimations on AMD Milan systems
* Added check for gdr_copy runtime/build version mismatch
* Added handling of missing IPC capability when unpacking keys
* Added caching for CUDA IPC memory pool import operation
* Added gdr_copy variables to optimize performance on Grace Hopper systems
* Improved CUDA IPC concurrency for a larger count of reachable peers
#### UCS
* Added support for wildcards in configuration parameter names
* Added ASAN protection to several internal data structures
* Reduced stack usage in topology detection code
* Improved bitmaps configuration parsing with wider bitfield
* Added options to set topology distance between devices
* Optimized VFS unix socket watch by using user private folder
* Added general IP subnet matching infrastructure
* Extended array data structure to support user-provided array copy routine
* Improved time units description
#### UCM
* Extended CUDA memory hooks to include memory mapping APIs
#### Tools
* Improved performance by increasing window size for put_bw and add get_bw in ucx_perftest
* Added multi-send flag for receive operations in bandwidth benchmarks in ucx_perftest
* Improved ucx_perftest uni-directional test with added fence
* Expanded the batch section of the ucx_perftest command-line documentation
#### Documentation
* Added a section regarding adaptive routing on RoCE
#### Architecture
* Added CPU Model for MI300A
* Added Fujitsu ARM specific values to ucx.conf
* Added AMD Turin support
* Added an optimized non-temporal memory copy implementation for AMD CPU
#### Build
* Improved compiler error reporting with added flag
* Improved coverity script to allow faster turnaround time
* Improved Intel Compiler detection and support
#### GO
* Added multi-send flag and user memh support in request params
#### Packaging
* Improved dpkg-buildpackage sample command by explicitly adding mlx5 related arguments
### Bugfixes:
#### UCP
* Fixed stack overflow in exported rkey unpack
* Removed extra remote-cpu overhead from protocol estimation for zcopy
* Fixed performance estimation for rndv pipeline protocols
* Fixed ATP sending by picking the correct lane
* Fixed missing reg_id on memh creation
* Fixed repeated invalidations by retaining existing access flags
* Fixed abort reason propagation for rendezvous RTR mtype
* Do not check transport availability if it is disabled by UCX_TLS environment variable
* Fixed wrong flag being used for checking BCOPY capability
* Fixed sending too many ATPs for small messages
* Enforced 16 bits size for Active Messages identifiers
* Fixed unnecessary status check for emulated AMO
* Fixed more than one fragment sending in rendezvous pipeline
* Fixed crash by using biggest max frag across all lanes
* Fixed missing memory handle flags by copying from parent to child
* Fixed worker interface activate count
* Fixed flush requests by replacing ATP/flush lane map with lane indexes
* Fixed lost uct_flags when merging memory regions
#### UCT
* Fixed memory domain UCT flags description
#### RDMA CORE (IB, ROCE, etc.)
* Fixed FETCH_ADD remote access error for ODP/KSM case
* Fixed missing conditional compilation checks for DM
* Fixed IB MD allocation naming typo
* Fixed invalid GIDs filter in IB
* Fixed flags usage in MLX5 zcopy_post
* Do not limit ODP registration retries
* Fixed JUCX failures by considering the number of supported completion vectors
#### CUDA
* Fixed async memory handling using CUDA memory type on Grace
* Added rcache overhead in performance estimation
* Fixed gdr_copy performance regression by providing maximum estimation between get and put
* Fixed CUDA IPC reachability check
* Fixed crash in MPI_Finalize when CUDA context is destroyed
* Always require rcache by default for gdr_copy
* Fixed crash in gdr_copy cleanup when registration cache is disabled
* Fixed CUDA copy memory domain allocations
* Fixed multiple tests for gdr_copy transport
* Fixed race condition in CUDA IPC peer accessible cache
#### UCS
* Fixed a crash by using heap allocation to process expired timers in batch
* Fixed allocation issue on memtrack dump
* Fixed deletion of the monitored folder in VFS
* Fixed unsafe resize for DC initiator array
* Fixed function macro invocation to match C standard
* Fixed calling async handler on already released resource
* Fixed performance by setting higher bandwidth for different NUMA nodes on Grace
* Fixed undeclared value error in timer conversion routine
* Fixed uninitialized value access in registration cache
#### UCM
* Fixed race condition in parsing proc maps
* Fixed mremap failure while parsing /proc/self/maps
#### ROCM
* Fixed ROCM interface reachability test
* Fixed memory domain fork test
#### TCP
* Always bind endpoint to interface
#### Tools
* Fixed buffer size potential overflow in ucx_perftest
* Fixed missing address when packing memory keys in ucx_perftest
* Fixed memory leak for endpoint report in ucx_info
* Fixed build without openmp in ucx_perftest
* Fixed UCT device override on server side in ucx_perftest
#### Build
* Fixed using correct ASAN version for running tests
#### Configuration
* Used POSIX Bourne shell syntax to check equality
* Fixed build failure by using proper flags in compiler.m4
* Fixed perftest MAD support default guessing
#### GO
* Added serialized thread mode to avoid subtle races between threads
* Fixed make distcheck

## 1.17.0 (June 13, 2024)
### Features: