Releases: openucx/ucx
v1.16.0 RC4
1.16.0 RC4 (March 12, 2024)
Bugfixes:
UCP
- Disabled rendezvous pipeline protocol selection when using a non-contiguous buffer
RDMA CORE (IB, ROCE, etc.)
- Fixed mlx5 WQE posting error due to compiler memory copy optimizations
GPU (CUDA, ROCM)
- Fixed cuda_ipc transport being disabled if a CUDA device is not set during initialization
UCM
- Fixed compilation error when building on PPC64
Packaging
- Fixed already existing target error when using cmake find_package(ucx) twice
v1.16.0 RC3
1.16.0 RC3 (February 20, 2024)
Bugfixes:
UCP
- Fixed crash in rendezvous protocol rkey pack after failed memory registration
v1.16.0 RC2
1.16.0 RC2 (January 21, 2024)
Features:
UCP
- Added tag offload rendezvous protocol in new infrastructure
- Added rcache to old protocols infrastructure
- Added multi-fragment protocols for stream API in new infrastructure
- Enabled new protocols infrastructure by default
- Removed context param from ucp_memh_put
- Added assertion if trying to register unsupported memory type
- Adjusted rendezvous latency to improve scalability
- Improved endpoint configuration logging information
- Added check for max length of user-defined Active Message header
- Added rcache support for mem type memory registration
- Enabled error handling for rndv/put_zcopy protocol
- Enabled v2 as default client/server connection establishment packet version
- Enabled rendezvous protocol selection for reachable MDs only
- Added ucp_rkey_compare API to enable rkey comparison
- Added release version to worker address to enable wire compatibility
- Added support for memory invalidation for rendezvous through DC transport
- Enabled the use of strong fence with new protocols infrastructure
UCT
- Added UCS_MEMORY_TYPE_RDMA memory type for better latency on supported devices
- Implemented is_reachable_v2 API for IB transport
- Added ep_is_connected API
RDMA CORE (IB, ROCE, etc.)
- Added Floating LID (FLID) based routing support
- Added latency and min_zcopy configuration variables to ROCm-IPC
- Added support for indirect MR for cross-gvmi mkey instead of direct MR with DEVX UMEM
TCP
- Added filter to eliminate bridge devices from lane selection
GPU (CUDA, ROCM)
- Added support for handling memh with multiple registrations
- Added performance estimation BW based on GPU type
- Adjusted rocm/ipc latency and zcopy threshold parameters
- Improved error message when libnvidia-ml is not installed
- Added profiling to CUDA runtime API calls
- Adjusted gdr_copy estimated BW to improve protocol selection
Shared Memory
- Adjusted FIFO_SIZE to improve scalability
- Removed redundant rcache implementation in knem transport
- Added support for symmetric rkey to improve memory usage
UCS
- Improved scalability of connection establishment flow
- Improved memtype cache performance by replacing the pthread lock with a spinlock
- Added support for VLAN over channel bonding interface
- Added LRU cache and Usage Tracker data structures
- Improved cross-NUMA device detection
Build
- Added LCOV coverage report as a build option
- Added binutils 2.40 library dependencies
- Added development modulefile
Tools
- Added information about sizes of ucp_request_t fields in ucx_info
- Added ucx env to profiling output
- Added MAD RTE in ucx_perftest to support setups without IPoIB
Tests
- Added GTEST_LOG_LEVEL env var to set log level just before test run
- Disabled protov1 and ud_verbs tests for valgrind mode
- Reduced gtest execution time
Documentation
- Added a few details to coding style
Bugfixes:
UCP
- Reverted wireup latency calculation which caused a lane selection issue
- Fixed strong fence to always ensure ordering
- Fixed registration of memh for RNDV protocol
- Fixed rndv_put and rkey_ptr assertion failure
- Fixed performance estimation for multi-fragment protocols
- Fixed memory registration error handling
- Fixed buffer overflow of large log messages
- Fixed progress enabling for selected lanes
- Fixed atomic lanes progress enabling
- Added missing rendezvous schemes to environment variable documentation
- Fixed bcopy BW estimation for AMD
- Fixed lanes information printing for new protocols infrastructure
- Fixed rndv_am protocol thresholds
- Fixed fp8 packing issue
- Fixed Intel OneAPI compilation error
- Fixed CM address packing on server side
- Fixed endpoint reconfiguration issue due to asymmetrical selection
- Fixed asymmetrical selection due to wire compatibility issue
- Fixed potential deadlock with cuda_copy and RTR protocol
- Fixed tag_recv return value on immediate completion
- Fixed memory corruption by proper memh handling in tag offload rendezvous
- Changed default allocator to not use reserved huge pages
- Fixed rndv put protocol to avoid early completion
RDMA CORE (IB, ROCE, etc.)
- Fixed compilation failure when DevX is explicitly disabled
- Fixed crash when using PCIe relaxed ordering
- Fixed remote access error with rc_verbs transport
- Fixed endpoint address management in unified mode
- Fixed assertion failure when configured with UCX_IB_ADDR_TYPE=ib_global
- Fixed overwritten MD attribute capabilities when querying a device
- Fixed ibv_reg_mr error by registering memory in rcache callback
TCP
- Fixed asymmetric lane selection issue due to inconsistent device listing
GPU (CUDA, ROCM)
- Fixed compilation flags to support ROCm 6.0
- Fixed values of D2H_THRESH and latency params
- Fixed CUDA memory support for iov datatype
- Increased max number of agents in ROCm
Shared Memory
- Fixed posix and cma transport selection by enhancing reachability checks
- Fixed UGNI build failure
- Fixed latency overhead for knem and cma transports
- Fixed possible out-of-order issue in mm_iface
UCS
- Fixed a deadlock when forked debugger is attached during an error in rcache operation
- Fixed crash due to passing null pointer to log function
- Fixed crash due to incorrect hashing method
- Fixed crash in configuration parser cleanup by moving it after profiler cleanup
- Fixed floating point division by zero during protocols initialization
UCM
- Fixed occasional crash in bistro hooks by adding a lock before hooking
Java
- Fixed go tests by setting CUDA device before allocating CUDA memory
- Fixed perftest error detection and hanging issue
Tools
- Fixed cpu model type for AMD Genoa in ucx_info
- Enhanced multi-thread test output
Build
- Fixed JUCX package publishing so that it includes support for ARM
- Fixed ROCm building and testing
- Removed libnvidia-compute version dependency
- Removed libibmad/libumad from default build configuration to avoid runtime dependency
v1.16.0-rc1
Merge pull request #9557 from yosefe/topic/uct-ib-add-flid-based-rout…
v1.15.0
1.15.0 (September 28, 2023)
Features:
UCP
- Added 2-stage pipeline protocol in the new protocol infrastructure
- Added reset and abort functionality of rendezvous protocols in the new infrastructure
- Added zero-copy rendezvous data send protocol in the new infrastructure
- Added support for user memory handle in the new protocol infrastructure
- Added option to force ODP registration for certain memory types
- Enabled lock free memory region deregistration
- Updated allow/deny transport list feature to control auxiliary transport selection
- Multiple performance improvements of the new protocol infrastructure
- Multiple improvements in error and debug messages
UCT
- Split UCT_MD_MKEY_PACK_FLAG_INVALIDATE into two flags for RMA and AMO
- Added put_zcopy and get_zcopy scheme support for self transport
- Added base implementation of is_reachable_v2 API using intra/inter flag
- Introduced MD capability for non-blocking registration memory types
RDMA CORE (IB, ROCE, etc.)
- Added implementation of is_reachable_v2 routine to IB interface
- Added option to control CQE zipping per CQ RX/TX direction
- Added option to specify how DCI selects port under RoCE LAG
- Added hw_dcs to the list of policies to select DCI by an endpoint
- Removed implicit on-demand paging
- Added option to set RoCE LAG DCT port for response under queue affinity mode
- Improved IB memlock limit logging
UCS
- Added ucs_string_buffer_rbrk() to split token
GPU (CUDA, ROCM)
- Added support for atomic reply_buffer on GPU memory
- Added system device information for AMD GPUs
- Improved performance estimation of gdr_copy transport
- Added a simplistic implementation of performance estimation of cuda_ipc transport
- Improved performance estimation of cuda_ipc on Hopper architecture
- Added rcache parameters for rocm transports
- Introduced dmabuf support for rocm transports
- Implemented asynchronous progress for the zcopy operations in the rocm_copy transport
- Added option to enable using cross-device dmabuf file descriptor for rocm
Java
- Added Java bindings for exported memh feature
Tests
- Added a rocm docker container for testing
- Added option to send client_id in iodemo test
- Added support for multiple connections to the same server in iodemo test
- Added synchronization before exit to hello world examples
Tools
- Added user-side memcpy option for AM benchmarks in ucx_perftest
- Added wireshark LUA dissectors for some UCX protocols
Build
- Added support for binutils 2.40
- Added versioned dependency to switch between packages with the same names
- Added a separate xpmem deb subpackage
- Added aarch64 support to the binary distribution pipeline
- Removed dependency on libnuma
Bugfixes:
UCP
- Fixed assertion when sending from non-contiguous GPU buffer to managed buffer
- Fixed the race condition on endpoint configurations
- Fixed endpoint reconfiguration issues due to asymmetrical selection
- Fixed endpoint reconfiguration error due to wrong locality detection
- Fixed crash during connection manager cleanup
- Fixed rkey index calculation for rendezvous protocol
- Fixed rcache dump function
- Removed logging from rkey unpack in release mode
- Fixed double free of rkey in rendezvous protocol
- Fixed rendezvous pipeline protocol error flow
- Fixed error handling in rendezvous get zcopy protocol
- Replayed pending requests of wireup EP CM during connection establishment to prevent potential ordering issues and wrong configuration
- Passed user-provided memory type to the function that checks whether the buffer can be sent inline
- Avoided memory registration during UCP context initialization
- Fixed CPU/device atomics selection in the new protocol infrastructure
- Multiple fixes in the new protocol infrastructure information output
UCT
- Added check for dmabuf kernel support in ROCm memory domain
- Fixed exported memh packing
- Fixed an error in checking return status of multi-threaded memory registration function
RDMA CORE (IB, ROCE, etc.)
- Fixed dma-buf based memory region registration
- Fixed memory handle data corruption when PCIe relaxed ordering is enabled
- Fixed performance degradation when indirect atomic key is not supported by the hardware
- Fixed remote access error to strict-order keys because of wrong offset
- Added check for UAR support to memory domain opening
- Fixed updating port counters for devx qp
- Fixed ibv_create_cq error message on nodes without InfiniBand
- Fixed performance degradation due to using 2 paths on NDR400 by default
- Removed unnecessary async lock which otherwise would block UD progress
GPU (CUDA, ROCM)
- Fixed CUDA IPC performance degradation due to libnuma removal
UCS
- Fixed lane selection and added bandwidth estimation for Sapphire Rapids family
- Fixed displaying wrong environment variable suggestions
- Fixed VFS warning output
- Fixed SEGV in ucs_debug_backtrace_next(), upon previous SEGV handling, due to ENOMEM situation
- Fixed memory corruption when using UCX_MPOOL_FIFO=y
UCM
- Fixed conditional jump patching
- Fixed mremap() override
GPU (CUDA, ROCM)
- Fixed usage of dmabuf when the buffer is not page-aligned
- Removed async_cb from cuda_copy to avoid the issue with UCP worker async lock
Java
- Fixed leakage of jucx_request global references
Documentation
- Updated ucp_worker_release_address description
Tests
- Fixed wrong usage of ep_close in examples
Tools
- Fixed memory access flags in perftest
- Removed support for librte from perf
- Fixed worker flush deadlock when using multiple workers in ucx_perftest
Build
- Changed 'unsupported option' ICC command line warning to error
- Removed never used fault-injection configuration option
- Fixed obsolete macro warnings in new autoconf/libtool
- Fixed building UCX with GCC 13
- Fixed UCX RPM build on machines that have libxpmem-devel rpm from MLNX_OFED installation
- Fixed ucx-rdmacm package requirements
- Fixed compilation errors with armcc-22.1
- Fixed passing port number to goperftest
v1.15.0 RC6
1.15.0 RC6 (September 20, 2023)
Bugfixes:
UCP
- Fixed assertion when sending from noncontig GPU buffer to managed buffer
v1.15.0 RC5
1.15.0 RC5 (September 12, 2023)
Bugfixes:
UCP
- Fixed the data race on endpoint configurations
v1.15.0 RC4
1.15.0 RC4 (August 30, 2023)
Bugfixes:
RDMA CORE (IB, ROCE, etc.)
- Fixed dma-buf based memory region registration
- Fixed memory handle data corruption when PCIe relaxed ordering is enabled
UCS
- Fixed lane selection, adding bandwidth estimation for Sapphire Rapids family
v1.15.0 RC3
1.15.0 RC3 (August 8, 2023)
Bugfixes:
UCP
- Fixed endpoint reconfiguration issues because of asymmetrical selection
UCT
- Added check for dmabuf kernel support in ROCm memory domain
UCM
- Fixed conditional jump patching
Tools
- Fixed memory access flags in perftest
v1.15.0 RC2
1.15.0 RC2 (July 27, 2023)
Features:
RDMA CORE (IB, ROCE, etc.)
- Implemented is_reachable_v2 for IB interfaces
Build
- Enabled build with binutils 2.40
- Added versioned dependency to switch between packages with the same names
Bugfixes:
UCP
- Fixed endpoint reconfiguration error due to wrong locality detection
RDMA CORE (IB, ROCE, etc.)
- Fixed performance degradation when indirect atomic key is not supported by the hardware
- Fixed remote access error to strict-order key because of wrong offset
GPU (CUDA, ROCM)
- Fixed CUDA IPC performance degradation after libnuma removal