
Releases: lablup/backend.ai

21.09.0 (08 Nov 07:23)

Key Highlights

  • Hardware platforms
    • Support running on ARM64 platforms (Linux / macOS with Apple Silicon)
• Improve RDMA support with InfiniBand networks (backported to 21.03)
    • NetApp storage integration
  • UI/UX
    • Statistics dashboard integration (Enterprise only)
    • Improved the performance of listing many items by server-side filtering and pagination
    • Display the progress of image pulls while creating a new session when agents do not yet have the image
  • Client SDK
    • Revamp the CLI with JSON-formatted outputs for better scriptability and restructured command hierarchy for consistency
  • API
    • Allow manual assignment of agent(s) when creating a session for ease of node diagnosis
    • Global query filter and query ordering expression support in GraphQL paginated list queries (backported to 21.03, 20.09)
  • Scheduler
    • Fix the HoL blocking issue in the FIFO scheduler with priority adjustments (backported to 21.03, 20.09)
    • Fix lots of database stability issues by adopting SQLAlchemy v1.4 with asyncio support (backported to 21.03, 20.09)
  • Stability
• Explicitly apply TCP keepalive timeouts on every database and RPC connection to avoid implicit and silent connection drops by network middleboxes (backported to 21.03)
    • Adopt aioredis v2 and rewrite the internal event bus with Redis STREAM APIs (backported to 21.03)
    • Adopt aiohttp v3.8 and drop aiojobs
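The keepalive hardening above boils down to enabling OS-level TCP keepalive on each long-lived socket so that middleboxes see periodic traffic on otherwise idle connections. A minimal sketch in Python; the option values here are illustrative, not the timeouts Backend.AI actually configures:

```python
import socket

def enable_keepalive(sock: socket.socket) -> None:
    # Turn on TCP keepalive so the OS sends periodic probes on idle
    # connections, keeping NAT/firewall state alive.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux-specific knobs (absent on some platforms, hence the guards).
    # Values are illustrative examples only.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idleness before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between subsequent probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes before the connection is declared dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)
```

Applying this at connection setup makes a silently dropped connection surface as an explicit error within a bounded time instead of hanging indefinitely.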

21.03.0 (29 Mar 01:55)

Key Highlights

  • All server-side components now run on top of Python 3.9.
  • (BETA) The native support for Windows 10 (and Server 2019) is coming soon.
  • This release has the identical set of features and fixes as the latest v20.09 series.
    You may treat it as an integrated stability update against the v20.09 series.


20.09.0 (26 Dec 19:15)

Key Highlights

  • Bumped the gateway API to version 6
  • Multi-container sessions
    • Single-node multi-container sessions
    • Multi-node multi-container sessions
      • It requires Docker Swarm configuration on the cluster.
    • All containers run on the same base Docker image.
    • Each container has auto-configured SSH access to other containers in the same session, using a keypair randomly generated for each session.
    • The "main" container controls the other "worker" containers that belong to the same session; the user interacts with the main container.
  • CUDA MIG (Multi-instance GPU) support -- Enterprise Edition only
    • Added UNIQUE resource slot types to the open-source edition.
  • Storage Proxy
    • Offloads the vfolder upload/download traffic from the manager
    • Provides abstraction of storage nodes and volumes
    • Accelerates volume-specific fs operations based on storage backend implementations (e.g., PureStorage FlashBlade's RapidFile Tools integration)
    • Enables new vfolder features such as cloning, live volume statistics, per-folder quota (only for supported filesystems such as XFS)
  • Agent Resource Allocation
    • Evenly-distributed fraction allocator which reduces imbalance of GPU slice sizes for multi-GPU sessions running on a single node
      • Known limitation: currently this works only for plain single-container multi-GPU sessions.
  • UI/UX Improvements
    • A brand new theme for the web console
    • backend.ai ssh and backend.ai scp CLI commands for easy access to compute sessions
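The evenly-distributed fraction allocator above can be illustrated with a toy version: split a requested device fraction into per-device shares that differ by at most one allocation step, so GPU slice sizes stay balanced across devices. This is a sketch of the balancing idea only, not the agent's actual allocator:

```python
from decimal import Decimal

def distribute_evenly(request: Decimal, num_devices: int,
                      step: Decimal = Decimal("0.1")) -> list[Decimal]:
    """Split `request` into per-device shares balanced to within one step.

    Assumes `request` is a whole multiple of `step` (the agent's real
    allocator handles capacities and remainders; this sketch does not).
    """
    units = int(request / step)              # total allocation units
    base, extra = divmod(units, num_devices) # even base + leftover units
    # The first `extra` devices get one additional unit each.
    return [step * (base + (1 if i < extra else 0))
            for i in range(num_devices)]
```

For example, a request of 1.0 GPU spread over 4 devices yields shares of 0.3, 0.3, 0.2, 0.2 rather than packing 1.0 onto a single device and leaving the rest idle.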

Terms

  • vfolder: virtualized storage folder


20.03.1 (22 Sep 06:38)

20.03.0 (22 Sep 06:53)

Key highlights

  • Bumped the gateway API to version 5, to clear out sessions and kernels for future multi-kernel cluster session support (to land in v20.09)
  • Session scheduler
    • It is now customizable via extension modules, and the manager package embeds FIFO (default), LIFO, and DRF (dominant resource fairness) schedulers.
    • The scheduler favors CPU-only agents for session creation requests without accelerator slots.
  • Resource management
    • Individual agents no longer need to have the same set of resource slot types, easing deployment on heterogeneous hardware setups.
    • Only the intrinsic slots (cpu and mem) are auto-filled with the minimum requirements of the target image, so containers can run with just a subset of the accelerators available on the agents.
    • Add shmem (shared memory) option to resource presets.
  • VFolder
    • Adopted tus.io protocol for large-file uploads.
  • Kernel images
    • Introduces the service-definition DSL to declare service ports in Docker images without manually updating the agent's kernel runner codebase for new services.
  • Plugins
    • Introduces the plugin API v2.0 with lifecycle and context management for explicit groups of plugins sharing the same interfaces.
    • Extended and generalized the hook plugins and added many PRE/POST action hook points to allow finer-grained customization.
  • Internals
    • Add pre-open service ports, which assume that user programs in containers manage their daemon lifecycles by themselves.
    • Using the new plugin API, the manager ships an embedded intrinsic error monitor plugin that logs exceptions in the manager API handlers and those collected from agents via the event bus into the database.
    • Rewrote manager-agent RPC layer using Callosum.
    • Now the manager and the agent run on Python 3.8 or higher.
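The tus.io adoption above is easiest to see as a plan of sequential PATCH requests: each carries the current Upload-Offset, so an interrupted upload simply resumes from the server-reported offset. This sketch covers only the request planning and the headers mandated by the tus 1.0 spec, not the actual client or server code:

```python
def plan_tus_patches(total_size: int, chunk_size: int, resume_offset: int = 0):
    """Yield (offset, length) pairs, one per tus PATCH request."""
    offset = resume_offset
    while offset < total_size:
        length = min(chunk_size, total_size - offset)
        yield offset, length
        offset += length

def tus_patch_headers(offset: int) -> dict:
    """Headers every tus PATCH request must carry (per the tus 1.0 spec)."""
    return {
        "Tus-Resumable": "1.0.0",
        "Upload-Offset": str(offset),
        "Content-Type": "application/offset+octet-stream",
    }
```

To resume after a failure, a client issues a HEAD request to learn the server's Upload-Offset and restarts the plan from there, instead of re-sending the whole file.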

There are many more changes and fixes.
Please refer to the per-component changelogs in their repositories.

19.09.0 (22 Sep 07:08)

Key highlights

  • Custom image import API which can automatically convert existing Python-based Docker images into runnable Backend.AI kernel images.
  • Batch jobs which execute the given startup command immediately after session creation and terminate immediately once done, with an explicit record of success or failure depending on the command's exit code.
  • High availability support by running multiple manager instances.
  • Job queueing which allows submission of session creation requests even when the resources in the cluster are fully utilized, and automatically starts the oldest pending requests whenever the required amount of resources becomes available.
  • Event monitoring API using HTML5 Server-Sent Events protocol to allow clients to get kernel lifecycle notification without excessive polling.
  • 3-level user privileges: super-admin, domain-admin, and user
  • Customizable new user signup process
  • Authentication support for etcd
  • Fixed SSH keypairs per user keypair, auto-installed into their sessions
  • Support for integration with Harbor docker registries
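The event monitoring API above streams kernel lifecycle notifications in the standard SSE wire format: `event:` and `data:` lines separated by blank lines. A minimal parser for that format (the event names used below are illustrative, not the actual event vocabulary):

```python
def parse_sse(stream_text: str) -> list[tuple[str, str]]:
    """Parse a raw SSE payload into (event, data) tuples."""
    events: list[tuple[str, str]] = []
    event, data_lines = "message", []   # "message" is the SSE default event
    for line in stream_text.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
        elif line == "":
            # A blank line dispatches the accumulated event.
            if data_lines:
                events.append((event, "\n".join(data_lines)))
            event, data_lines = "message", []
    return events
```

Because the server pushes events as they happen, a client can react to session state changes without polling the status API in a loop.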

There are many more changes and fixes.
Please refer to the per-component changelogs in their repositories.

19.03.0 (22 Sep 07:15)

Key highlights

  • This is the first version to support a usable web GUI via the console project.
  • Integration with NGC (NVIDIA GPU Cloud) images
  • Per-keypair resource policies
  • Support for authentication with Redis
  • Resource presets
  • Multiple vfolder hosts to utilize multiple volume mounts
  • Various cleanups related to resource slot definitions and their operational semantics, including renaming of "gpu" slots into "cuda.shares" and "cuda.device"

There are many more changes and fixes.
Please refer to the per-component changelogs in their repositories.

18.12.0 (22 Sep 07:11)

Key highlights

  • Service ports
  • CORS support in the gateway API
  • TPU plugin support

There are many more changes and fixes.
Please refer to the per-component changelogs in their repositories.

v1.4.0 (02 Oct 14:17)

Key highlights: Shared virtual folders and multi-GPU scheduling

Manager

  • Add a new set of virtual folder APIs to invite other users to one's own vfolder and to list/accept invitations from other users. (lablup/backend.ai-manager#80)
  • Improve existing APIs to stream downloads/uploads of virtual folder files, and add an explicit option to recursively delete a directory (lablup/backend.ai-manager#89, lablup/backend.ai-manager#70)
  • Add a new kernel API to list files in the session container (lablup/backend.ai-manager#63)
  • All API endpoints are now available without version prefixes (e.g., /v2/) and in the future only this will be supported. (lablup/backend.ai-manager#78)
  • The user_id field of the keypairs database table is now string instead of integer. You need to provide a manual user_id_map.txt mapping file to run the database schema upgrade using alembic.
  • Upgrade to aiohttp v3.4 series.

Agent

  • Add support for multi-GPU scheduling, where you can allocate multiples of GPU shares to compute sessions so that they can access multiple GPUs. The agent's decimal-based "share" model supports fractional allocations as well, but currently fractional CUDA GPU sharing is highly experimental and only provided to private testers. (lablup/backend.ai-agent#66)
  • Introduces an initial version of accelerator plugins. Currently there is only one plugin: the CUDA accelerator. Now you can easily turn CUDA GPU support on/off by installing/uninstalling this plugin. (lablup/backend.ai-agent#66)
  • Add support for nvidia-docker v2. (lablup/backend.ai-agent#64)
  • Agent restarts now completely preserve the kernel session states. (lablup/backend.ai-agent#35, lablup/backend.ai-agent#73)
  • You may limit an agent's view of available system resources such as CPU cores and GPU devices using a hexadecimal mask, for benchmarks and multi-GPU debugging. (lablup/backend.ai-agent#65)
  • Stability improvements: the agent no longer retries to kill already-terminated kernel containers and instead reports them as "terminated", preventing an infinite loop of kernel creation failures in certain usage scenarios.
  • Improve inner beauty for future support of non-dockerized environments.

Client for Python (v1.4)

  • Add support for new vfolder subcommands to invite and accept invitation of shared virtual folders.
  • Add support for listing and downloading vfolder files.
  • Client library users should now wrap API function invocations in an explicit session, like aiohttp's client APIs. (example)
  • Upgrade to aiohttp v3.4 series.
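The explicit-session requirement above mirrors aiohttp's ClientSession pattern: connection state is acquired and released at well-defined scope boundaries instead of implicitly per call. A toy sketch of the pattern only; the class and method names are illustrative, not the actual Backend.AI client API:

```python
class APISession:
    """Illustrative explicit-session wrapper (hypothetical names)."""

    def __init__(self) -> None:
        self._open = False

    def __enter__(self) -> "APISession":
        self._open = True    # acquire connection pool, auth state, etc.
        return self

    def __exit__(self, *exc) -> bool:
        self._open = False   # release resources deterministically
        return False

    def request(self, path: str) -> str:
        if not self._open:
            raise RuntimeError("use the session inside a 'with' block")
        return f"GET {path}"

# Usage: all API calls live inside the session's scope.
with APISession() as sess:
    result = sess.request("/v2/kernels")
```

The benefit is the same as in aiohttp: leaks are impossible by construction, because the session cannot outlive its `with` block.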

v1.3.0 (14 Mar 08:35)

Key highlight: Improve dockerization support and add a plugin architecture for future extension

Manager

Agent

  • Fix repeated docker event polling even when there are connection or client-side aiohttp errors.
  • Upgrade aiohttp to v3.0 release.
  • Improve dockerization. (lablup/backend.ai-agent#55)
  • Improve inner beauty.

Client for Python (v1.2.1)

  • Improve exception handling (use Exception instead of BaseException as the base class for BackendError)
  • Upgrade aiohttp to v3.0 release.
  • Fix silent swallowing of asyncio.CancelledError and asyncio.TimeoutError
  • Allow uploading multiple files to a virtual folder in a single command (backend.ai vfolder upload)