All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- images:
- Add rocky9
- Add debian13
- Introduce fhpc_namespace extra variable with the name of containers namespace.
- Add bash-completion script for firehpc command (#12).
- Add firehpc list command to list clusters present in state directory (#16).
- load:
- Submit jobs randomly in existing QOS and partitions.
- Submit jobs of various sizes, with a power of 2 number (1, 2, 4, 8…) of cores or nodes, up to the full size of the cluster. A number of nodes is selected when Slurm SelectType plugin is linear, a number of cores is selected otherwise. Small jobs are submitted more often than big jobs.
- Select job partition randomly weighted by their number of resources to favor largest partitions.
- Make some (about 1/10th) submitted jobs randomly fail (#9).
- Submit jobs with random durations and timelimit with low probability for jobs to reach their timelimit (#10).
- Support Slurm configuration without accounting service and QOS.
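As a rough illustration of the job sizing and partition selection rules listed for the load command above, here is a minimal Python sketch. It is not FireHPC's actual code: the function names, the 1/size weighting and the sample partition sizes are assumptions made for this example. Only the power of 2 sizes, the linear versus non-linear SelectType distinction and the resource-weighted partition choice come from the entries above.

```python
import random


def power_of_two_sizes(total):
    """Return the powers of 2 (1, 2, 4, ...) that fit in `total` resources.

    Assumes total >= 1, so the returned list is never empty.
    """
    sizes = []
    size = 1
    while size <= total:
        sizes.append(size)
        size *= 2
    return sizes


def pick_size(total, rng):
    """Pick a job size, favoring small jobs over big ones."""
    sizes = power_of_two_sizes(total)
    # Assumed weighting: a job twice as large is half as likely.
    weights = [1 / size for size in sizes]
    return rng.choices(sizes, weights=weights, k=1)[0]


def pick_partition(partitions, rng):
    """Pick a partition name, weighted by its number of resources.

    `partitions` is a hypothetical mapping of partition name to resource count.
    """
    names = list(partitions)
    weights = [partitions[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def resource_unit(select_type):
    """Nodes are allocated with select/linear, cores with other plugins."""
    return "nodes" if select_type == "select/linear" else "cores"


if __name__ == "__main__":
    rng = random.Random()
    partitions = {"normal": 64, "gpu": 8}  # hypothetical partition sizes
    partition = pick_partition(partitions, rng)
    size = pick_size(partitions[partition], rng)
    print(f"submit to {partition}: {size} {resource_unit('select/cons_tres')}")
```

Passing an explicit `random.Random` instance keeps the selection reproducible in tests when it is seeded.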
- conf:
- Add possibility to define additional QOS and alternative partitions in Slurm.
- Add support for RHEL9 and compatible distributions.
- Add possibility to define custom site file name in nginx role.
- Introduce metrics role to deploy prometheus, alloy and grafana.
- Declare nodes in Slurm configuration with their socket/cores/memory configuration extracted from RacksDB.
- Add params key in slurm_partitions parameter to give possibility to set any arbitrary Slurm partition configuration parameter in inventory.
- Support Slurm native authentication as an alternative to munge (#22).
- Add possibility to disable deployment of SlurmDBD accounting service (#20).
- docs:
- Add sysctl fs.inotify.max_user_instances value increase recommendation in README.md to avoid weird issue when launching many containers.
- Mention Metrics stack and Slurm-web optional features in README.md with URL to access Grafana and Slurm-web interfaces.
- Explain in README.md Ansible core 2.16 requirement for both rocky8 and debian13 clusters with a method to install this version from PyPI repository.
- Mention firehpc list command in manpage.
- Mention firehpc load command in manpage.
- Replace fhpc-emulate-slurm-usage command by firehpc load (#13).
- Transform fhpc_nodes dictionary values from list of nodes to list of dictionaries to group nodes by type in RacksDB.
- firehpc ssh <cluster> now connects to admin host by default (#8).
- load: Change pending jobs limit formula to avoid number of jobs growing as fast as the number of nodes.
- conf:
- Install socat package on all nodes in common role.
- Use packages list instead of loop to install MariaDB packages.
- Enable config_overrides slurmd parameter in Slurm configuration to avoid compute nodes sockets/cores/memory matching configuration check.
- Move maxtime and state Slurm partitions parameters in params sub-dictionary.
- Rename slurm_partitions > node → nodes key.
- Change default Slurm authentication plugin from munge to slurm. This can be changed by setting slurm_with_munge: true in Ansible inventory.
- docs: Explain in manpage ssh command considers admin container by default.
- core:
- Properly handle DBus error when getting containers addresses.
- Potential key conflict in dictionary of SSH clients when multiple users connect to the same host with Paramiko library.
- Set jobs time limit to partition time limit when set to avoid jobs that exceed partition time limit.
- Remove cluster state on cluster clean.
- load:
- Order of partition/qos variables in job submission informational message.
- Support of Slurm 24.05 sacctmgr show qos --json format to retrieve the list of defined QOS.
- conf:
- Install mpi packages in parallel instead of sequential loop.
- Configure system locale to en_US.UTF-8 on rocky8.
- Add SLURMRESTD_SECURITY=disable_user_check environment variable in slurmrestd service to allow running as slurm user.
- Containers namespace missing in Slurm-web gateway [ui] > host.
- Force creation of CA and LDAP certificates to override possibly existing certificates during bootstrap.
- docs: Remove fhpc-emulate-slurm-usage manpage.
1.1.0 - 2024-05-07
- Integration with RacksDB to extract emulated cluster topology (#1).
- Support for debian12 (Debian bookworm) in OS images sources YAML file.
- Introduce fhpc_addresses, fhpc_nodes, fhpc_emulator_mode and fhpc_db extra variables. The first is a hash with containers as keys and the list of IP addresses as values. The second is also a hash with node tags as keys and the list of nodes assigned with the tag in values. The third is a boolean set to true when --slurm-emulator option is set on firehpc command line. The fourth is the local absolute path to RacksDB database.
- Possibility to run command with SSH paramiko library in addition to ssh binary executable.
- Add example RacksDB database.
- Add possibility to deploy users directory extracted from another existing cluster to have the same user accounts on multiple clusters eventually.
- Generate and manage groups tree internally. Groups definitions are exported to Ansible with fhpc_groups extra variable and can be dumped with firehpc status command.
- Support containers namespace to allow multiple users to start the same virtual clusters on the same host without conflict.
- cli:
- Support for tags to filter deployed configuration tasks.
- Report cluster status in JSON format with --json option.
- Add --slurm-emulator option to deploy and configure a cluster with emulated Slurm cluster nodes (only one admin node with up to 64k virtual compute nodes).
- Add --users option on deploy command to extract users directory from another existing cluster.
- Introduce fhpc-emulate-slurm-usage command to emulate random usage of Slurm cluster.
- Add start and stop commands to respectively start and stop all containers of an emulated cluster.
- conf:
- Optional support of Rackslab development Deb and RPM repositories, disabled by default.
- Introduce racksdb role to install RacksDB and deploy database content.
- Introduce slurmweb role to install and setup Slurmweb, optional and disabled by default.
- Support multiple Slurm accounts definitions with hierarchy and control of users membership.
- Add tags on all roles.
- Add variable for slurmrestd socket path in slurm role.
- Support optional additional slurmdbd parameters.
- Deploy SSH root private and public keys on admin.
- Generate /etc/hosts with all cluster IP addresses and hostnames.
- Add nodeset_fold and nodeset_expand Jinja2 filters.
- Support Slurm emulation with fully virtual nodes (up to 64k).
- Support optional secondary groups in LDAP directory.
- Add possibility to deploy Redis server on admin host.
- Use fhpc_groups for default slurm_accounts variable value and to define LDAP groups.
- Use fhpc_db for default racksdb_database variable value and to define RacksDB database content.
- Install bash-completion by default on all nodes with common role.
- Install clustershell on all nodes by default with new clustershell role (#3).
- Introduce nginx role.
- docs:
- Mention conf command --db, --schema and --tags options in firehpc(1) manpage.
- Mention deploy command --db and --schema options in firehpc(1) manpage.
- Mention status command --json option in firehpc(1) manpage.
- Mention new start and stop commands in firehpc(1) manpage.
- Add manpage for fhpc-emulate-slurm-usage.
- Mention conf and deploy commands --slurm-emulator option in firehpc(1) manpage.
- Mention deploy command --users option in firehpc(1) manpage.
- Replaced notion of zone with cluster, both in CLI options and configuration variables names.
- Removed extra directory from source tree. It used to contain ansible machinectl connection plugin as Git submodule. This dependency is now injected in FireHPC as a package supplementary source in packages built by Fatbuildr.
- conf:
- Declare SSH host keys valid for both containers FQDN and short hostname in system known hosts file.
- Split ssh role in 3 steps: localkeys for local bootstrap, bootstrap to initialize files on containers with machinectl and main for normal operations with SSH (known_hosts, SSH root keys).
- Replace hardcoded admin hosts by selection of first admin group member for LDAP server hostname and Slurm server.
- Generate Slurm nodes and partitions based on RacksDB database content.
- Split playbook by sections with hosts targets to avoid many skipped tasks.
- docs: Update after zone→cluster rename in CLI options.
- Check OS images argument in CLI against values available in OS images YAML file instead of hard-coded argparse choices.
- Storage service stop and removal.
- Start storage service with container when cluster is started.
- Retry SSH connections up to 3 times in case of failure.
- Wait some time before starting the second container to finish container private network setup and prevent the following container from erasing everything before completion.
- Handle RacksDB format and schema errors with correct error message.
- Wait for both IPv4 and IPv6 addresses when retrieving container addresses, to avoid retrieving only IPv6 before IPv4 address is finally available.
- Correctly handle and report DNS errors in SSH module.
- conf:
- Open slurmd spool directory permissions to all users for running batch jobs scripts.
- Manage home directories ownership and permissions, in addition to some of their content.
- Add missing common name in LDAP x509 TLS/SSL certificate.
- Do not use cgroups with Slurm in emulator mode.
- Force update of APT repositories metadata.
- Install en_US.UTF-8 locale on Debian, as already done on RHEL by default.
- Set systemd-networkd DHCP client identifier to mac on RHEL to avoid getting a different address than those obtained by NetworkManager at boot, which eventually results in IPv4 addresses in /etc/hosts being removed from network interfaces when initial leases reach their timeout.
- docs: Grammatical error and typos in firehpc(1) manpage
- lib: limit network devices names to 12 characters to avoid network zone name errors with systemd-nspawn.