Support Slurm 24.11 and REST API v0.0.40 #400

Merged on Nov 21, 2024 (25 commits)
43cd29e
refactor(conf): remove unused slurm jobs fields
rezib Nov 20, 2024
a06970c
tests: update assets for slurm API v0.0.40
rezib Nov 19, 2024
1ad2b89
chore(dev): deploy slurm 24.11 on dev cluster
rezib Nov 19, 2024
1d57439
conf: bump slurmrestd > version to 0.0.40
rezib Nov 19, 2024
983cd76
refactor(agent): bump min Slurm version to 23.11.0
rezib Nov 19, 2024
c08e602
refactor(agent): meta>slurm>version are now str
rezib Nov 19, 2024
26e2359
refactor(agent): meta > slurm is lowercase
rezib Nov 19, 2024
76fca38
tests(agent): no unknown jobs/resources in metrics
rezib Nov 20, 2024
4ead10a
tests(agent): reflect slurm key lowercase
rezib Nov 19, 2024
9a83d46
tests(agent): node not found error change
rezib Nov 19, 2024
91e0111
refactor(agent): job_state is now list of states
rezib Nov 19, 2024
539876d
tests(agent): reflect job_state is now a list
rezib Nov 19, 2024
7ac33c8
chore(dev): try to get admin password from env
rezib Nov 20, 2024
608df7c
docs: update conf references
rezib Nov 20, 2024
4ebefeb
refactor(front): update jobs interfaces API 0.0.40
rezib Nov 20, 2024
1b387df
refactor(front): adapt to job_state being a list
rezib Nov 20, 2024
4802556
refactor(front): nodes interfaces for API 0.0.40
rezib Nov 20, 2024
785d7ae
refactor(front): qos/resv interfaces API 0.0.40
rezib Nov 20, 2024
6751709
refactor(front): node {boot_time,last_busy} type
rezib Nov 20, 2024
251bde2
refactor(front): resv {start,end}_time type
rezib Nov 20, 2024
af7c00b
refactor(front): adapt to exit_code type change
rezib Nov 20, 2024
6aea7a7
docs: bump refs to Slurm API version to v0.0.40
rezib Nov 20, 2024
b067dfb
docs: update supported Slurm version
rezib Nov 20, 2024
e1cbb10
docs: update CHANGELOG.md
rezib Nov 20, 2024
3b1d45e
chore(assets): update screenshots
rezib Nov 20, 2024
14 changes: 14 additions & 0 deletions CHANGELOG.md
Expand Up @@ -8,6 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## [unreleased]

### Added
- Support Slurm 24.11 and Slurm REST API v0.0.40 (#366 → #400).
- agent:
- Return RacksDB infrastructure name and a boolean to indicate if metrics
feature is enabled in `/info` endpoint, in addition to the cluster name.
Expand Down Expand Up @@ -64,14 +65,19 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Add requirement on markdown external library for `gateway` extra package.

### Changed
- agent: Bump minimal required Slurm version from 23.02.0 to 23.11.0.
- gateway: Change error message when unable to parse agent info fields.
- docs:
- Update configuration reference documentation.
- Update dashboard screenshot in overview page with example of resource chart.
- Replace mention of Slurm REST API version v0.0.39 by v0.0.40.
- Mention requirement of Slurm >= 23.11 and dropped support of Slurm 23.02.
- conf:
- Convert `[cache]` > `password` agent parameter from string to password type.
- Convert `[ldap]` > `bind_password` gateway parameter from string to password
type.
- Bump `[slurmrestd]` > `version` default value from `0.0.39` to `0.0.40` in
agent configuration for compatibility with Slurm 24.11.
- pkgs:
- Add requirement on RFL.core >= 1.1.0.
- Add requirement on RFL.settings >= 1.1.1.
Expand All @@ -96,6 +102,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Update dependencies to fix CVE-2024-45812 and CVE-2024-45811 (vite),
CVE-2024-47068 (rollup), CVE-2024-21538 (cross-spawn).

### Removed
- Support of Slurm 23.02 and Slurm REST API v0.0.39.
- conf:
- Remove unused `required` from default selected jobs field on `slurmrestd`
`/slurm/*/jobs` endpoint.
- Remove unused `state_reason` from default selected job field on `slurmrestd`
`/slurm/*/job/<id>` endpoint.

## [3.2.0] - 2024-09-05

### Added
Expand Down
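The bump of the minimal required Slurm version to 23.11.0 implies comparing the release reported by `slurmrestd` against a threshold. A minimal sketch, assuming the release is a purely numeric dotted string such as `24.11.0` (the value read from `ping["meta"]["slurm"]["release"]` in `dev/crawl-tests-assets`). `parse_release` and `is_supported` are hypothetical helpers for illustration, not the agent's actual code:

```python
# Hypothetical helpers for illustration; not taken from Slurm-web itself.
MIN_SLURM_RELEASE = (23, 11, 0)


def parse_release(release: str) -> tuple[int, ...]:
    """Convert a dotted release string such as "24.11.0" into a tuple of ints.

    Assumes every component is numeric; a suffix such as "-rc1" would raise
    ValueError.
    """
    return tuple(int(part) for part in release.split("."))


def is_supported(release: str) -> bool:
    """Return True if the release is at least the minimal supported one."""
    # Tuples compare element by element, so (23, 2, 7) < (23, 11, 0).
    return parse_release(release) >= MIN_SLURM_RELEASE
```

Comparing integer tuples rather than raw strings avoids the lexicographic trap where `"23.2"` would sort after `"23.11"`.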
Binary file modified assets/screenshots/raw/screenshot_clusters.png
Binary file modified assets/screenshots/shadowed/screenshot_clusters.png
Binary file modified assets/screenshots/shadowed/screenshot_dashboard_tablet.png
4 changes: 1 addition & 3 deletions conf/vendor/agent.yml
Expand Up @@ -59,7 +59,7 @@ slurmrestd:
doc: Path to slurmrestd UNIX socket
version:
type: str
default: '0.0.39'
default: '0.0.40'
doc: |
Slurm REST API version.

Expand Down Expand Up @@ -101,7 +101,6 @@ filters:
- partition
- priority
- qos
- required
- script
- state
- steps
Expand Down Expand Up @@ -131,7 +130,6 @@ filters:
- standard_error
- standard_input
- standard_output
- state_reason
- tasks
- tres_req_str
doc: |
Expand Down
43 changes: 33 additions & 10 deletions dev/crawl-tests-assets
Expand Up @@ -14,6 +14,7 @@ import getpass
import socket
import shlex
import random
import os
import logging

import requests
Expand All @@ -39,6 +40,7 @@ USER = getpass.getuser()
METRICS_PREFERRED_CLUSTER = "emulator"
# Map between infrastructure names and cluster names that are visible in Slurm-web.
MAP_CLUSTER_NAMES = {"emulator": "atlas"}
ADMIN_PASSWORD_ENV_VAR = "SLURMWEB_DEV_ADMIN_PASSWORD"


def slurmweb_cluster_name(infrastructure: str):
Expand Down Expand Up @@ -172,13 +174,13 @@ def crawl_slurmrestd(socket: Path) -> None:

session = requests.Session()
prefix = "http+unix://slurmrestd/"
api = "0.0.39"
api = "0.0.40"
session.mount(prefix, SlurmrestdUnixAdapter(socket))

# Get Slurm version
text, _, _ = query_slurmrestd(session, prefix, f"/slurm/v{api}/ping")
ping = json.loads(text)
release = ping["meta"]["Slurm"]["release"]
release = ping["meta"]["slurm"]["release"]
version = release.rsplit(".", 1)[0]
logger.info("Slurm version: %s release: %s", version, release)

Expand Down Expand Up @@ -238,7 +240,7 @@ def crawl_slurmrestd(socket: Path) -> None:
)

def dump_job_state(state: str):
if _job["job_state"] == state:
if state in _job["job_state"]:
dump_slurmrestd_query(
session,
requests_statuses,
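The hunk above switches from an equality test to a membership test because `job_state` is now a list of states. A small sketch of a helper that tolerates both representations, assuming the former value was a single string. `job_has_state` is a hypothetical name, not a function from this repository:

```python
def job_has_state(job: dict, state: str) -> bool:
    """Return True if the job is in the given state.

    Accepts both the former scalar form and the list form of job_state.
    """
    job_state = job["job_state"]
    if isinstance(job_state, str):
        # Scalar form: compare for equality. Using `in` here would perform
        # a substring match ("RUN" in "RUNNING" is True), which is wrong.
        return job_state == state
    # List form: a job may carry several states at once.
    return state in job_state
```

The explicit `isinstance` check matters only if assets from both API versions have to be read by the same code; with v0.0.40 alone, `state in job["job_state"]` as in the diff is sufficient.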
Expand Down Expand Up @@ -317,6 +319,8 @@ def crawl_slurmrestd(socket: Path) -> None:
assets_path,
"slurm-nodes",
skip_exist=False,
limit_dump=100,
limit_key="nodes",
)

def dump_node_state():
Expand Down Expand Up @@ -387,7 +391,7 @@ def crawl_slurmrestd(socket: Path) -> None:

# Save resulting status file
with open(status_file, "w+") as fh:
json.dump(requests_statuses, fh, indent=2)
json.dump(requests_statuses, fh, indent=2, sort_keys=True)
fh.write("\n")
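Adding `sort_keys=True` makes the dumped status file independent of the order in which requests were recorded, which presumably keeps the regenerated asset stable between runs. A minimal sketch of the effect. `dump_statuses` and the keys used below are illustrative assumptions, not the actual content of the status file:

```python
import json


def dump_statuses(statuses: dict) -> str:
    """Serialize statuses with sorted keys and a trailing newline."""
    return json.dumps(statuses, indent=2, sort_keys=True) + "\n"


# Same content, different insertion order (keys are illustrative only).
first = {"slurm-jobs": 200, "slurm-nodes": 200}
second = {"slurm-nodes": 200, "slurm-jobs": 200}
```

Without `sort_keys`, `json.dumps(first)` and `json.dumps(second)` differ because dictionaries preserve insertion order. With it, both calls to `dump_statuses` produce identical text.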


Expand Down Expand Up @@ -438,7 +442,15 @@ def gateway_url(dev_tmp_dir):
def user_token(url: str, user: str):
"""Ask user password interactively, authenticate on gateway and return
authentication JWT."""
password = getpass.getpass(prompt=f"Password for {user} on gateway: ")

try:
password = os.environ[ADMIN_PASSWORD_ENV_VAR]
except KeyError:
logger.info(
"Unable to read admin password from environment, opening interactive "
"prompt."
)
password = getpass.getpass(prompt=f"Password for {user} on gateway: ")

response = requests.post(
f"{url}/api/login", json={"user": user, "password": password}
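The fallback introduced in this hunk (environment variable first, interactive prompt otherwise) can be isolated in a small function. The sketch below is an assumption-laden variant for illustration: `read_password` is a hypothetical name, and the `environ` and `prompt` parameters exist only to make the logic testable without a terminal:

```python
import getpass
import os

ADMIN_PASSWORD_ENV_VAR = "SLURMWEB_DEV_ADMIN_PASSWORD"


def read_password(user: str, environ=os.environ, prompt=getpass.getpass) -> str:
    """Return the password from the environment, else ask interactively."""
    password = environ.get(ADMIN_PASSWORD_ENV_VAR)
    if password is None:
        # Variable not set: fall back to the interactive prompt.
        password = prompt(prompt=f"Password for {user} on gateway: ")
    return password
```

As in the diff, an empty but defined variable is returned as is and does not trigger the prompt.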
Expand All @@ -461,6 +473,7 @@ def dump_component_query(
asset_name: dict[int, str] | str,
skip_exist: bool = True,
prettify: bool = True,
limit_dump=0,
) -> Any:
"""Send GET HTTP request to Slurm-web component pointed by URL and save JSON result
in assets directory."""
Expand Down Expand Up @@ -499,7 +512,10 @@ def dump_component_query(
else:
with open(asset, "w+") as fh:
if asset.suffix == ".json":
fh.write(json.dumps(data, indent=2 if prettify else None))
_data = data
if limit_dump:
_data = _data[:limit_dump]
fh.write(json.dumps(_data, indent=2 if prettify else None))
else:
fh.write(data)
return data
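The `limit_dump` logic above truncates only what is written to the asset file, while the full `data` is still returned to the caller. A standalone sketch of that serialization step, assuming `data` is a list when `limit_dump` is non-zero. `serialize_limited` is a hypothetical name used for illustration:

```python
import json
from typing import Any


def serialize_limited(data: Any, limit_dump: int = 0, prettify: bool = True) -> str:
    """Serialize data to JSON, keeping at most limit_dump leading items.

    A limit of 0 (the default) disables truncation. Slicing assumes data
    is a list; a dict would not support it.
    """
    _data = data
    if limit_dump:
        _data = _data[:limit_dump]
    return json.dumps(_data, indent=2 if prettify else None)
```

For example, `serialize_limited(jobs, limit_dump=100)` keeps the first 100 jobs in the asset. The `slurmrestd` variant in the diff also passes `limit_key="nodes"`, which suggests the list is nested under a key there; that part is not visible in this hunk and is not reproduced here.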
Expand Down Expand Up @@ -568,6 +584,7 @@ def crawl_gateway(cluster: str, infrastructure: str, dev_tmp_dir: Path) -> str:
assets_path,
"jobs",
skip_exist=False,
limit_dump=100,
)

if not (len(jobs)):
Expand All @@ -578,7 +595,7 @@ def crawl_gateway(cluster: str, infrastructure: str, dev_tmp_dir: Path) -> str:
min_job_id = jobs[0]["job_id"]

def dump_job_state() -> None:
if _job["job_state"] == state:
if state in _job["job_state"]:
dump_component_query(
requests_statuses,
url,
Expand Down Expand Up @@ -696,7 +713,7 @@ def crawl_gateway(cluster: str, infrastructure: str, dev_tmp_dir: Path) -> str:

# Save resulting status file
with open(status_file, "w+") as fh:
json.dump(requests_statuses, fh, indent=2)
json.dump(requests_statuses, fh, indent=2, sort_keys=True)
fh.write("\n")

return token
Expand Down Expand Up @@ -742,7 +759,13 @@ def crawl_agent(port: int, token: str, metrics: bool) -> None:
"stats",
)
dump_component_query(
requests_statuses, url, f"/v{get_version()}/jobs", headers, assets_path, "jobs"
requests_statuses,
url,
f"/v{get_version()}/jobs",
headers,
assets_path,
"jobs",
limit_dump=100,
)
nodes = dump_component_query(
requests_statuses,
Expand Down Expand Up @@ -813,7 +836,7 @@ def crawl_agent(port: int, token: str, metrics: bool) -> None:
# FIXME: Download unknown job/node
# Save resulting status file
with open(status_file, "w+") as fh:
json.dump(requests_statuses, fh, indent=2)
json.dump(requests_statuses, fh, indent=2, sort_keys=True)
fh.write("\n")


Expand Down
2 changes: 1 addition & 1 deletion dev/firehpc/conf/tiny/group_vars/all.yml
@@ -1,7 +1,7 @@
common_with_devs_repos: true
common_hpckit_derivatives:
- main
- slurm24.05
- slurm24.11
slurm_with_jwt: false
slurm_params:
PriorityType: priority/multifactor
Expand Down
1 change: 1 addition & 0 deletions docs/antora.yml
Expand Up @@ -8,6 +8,7 @@ asciidoc:
source-language: asciidoc@
table-caption: false
version: 4.0.0
api_version: 0.0.40
nav:
- modules/overview/nav.adoc
- modules/install/nav.adoc
Expand Down
10 changes: 4 additions & 6 deletions docs/modules/conf/examples/agent.ini
Expand Up @@ -81,8 +81,8 @@ socket=/run/slurmrestd/slurmrestd.socket
# rather than end users. Slurm-web is officially tested and validated with
# the default value only.
#
# Default value: 0.0.39
version=0.0.39
# Default value: 0.0.40
version=0.0.40

[filters]

Expand All @@ -100,6 +100,7 @@ version=0.0.39
# - qos
# - cpus
# - node_count
# - nodes
jobs=
job_id
user_name
Expand All @@ -111,6 +112,7 @@ jobs=
qos
cpus
node_count
nodes

# List of slurmdbd job fields selected in slurmrestd API when retrieving a
# unique job, all other fields are filtered out.
Expand All @@ -126,7 +128,6 @@ jobs=
# - partition
# - priority
# - qos
# - required
# - script
# - state
# - steps
Expand All @@ -148,7 +149,6 @@ acctjob=
partition
priority
qos
required
script
state
steps
Expand Down Expand Up @@ -176,7 +176,6 @@ acctjob=
# - standard_error
# - standard_input
# - standard_output
# - state_reason
# - tasks
# - tres_req_str
ctldjob=
Expand All @@ -192,7 +191,6 @@ ctldjob=
standard_error
standard_input
standard_output
state_reason
tasks
tres_req_str

Expand Down
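The `[filters]` parameters list the fields kept for each job, with all other fields filtered out. The idea can be sketched as a dictionary comprehension. This is only an illustration under the assumption that a job is a flat dictionary; `select_fields` is a hypothetical helper and the agent may implement the filtering differently:

```python
def select_fields(item: dict, fields: list[str]) -> dict:
    """Return a copy of item restricted to the selected fields."""
    return {key: value for key, value in item.items() if key in fields}


# Illustrative job with made-up values; "required" is no longer selected.
job = {"job_id": 42, "user_name": "alice", "qos": "normal", "required": {}}
selected = select_fields(job, ["job_id", "user_name", "qos", "nodes"])
```

Fields listed in the filter but absent from the job (here `nodes`) are simply skipped rather than raising an error.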
8 changes: 3 additions & 5 deletions docs/modules/conf/partials/conf-agent.adoc
Expand Up @@ -159,7 +159,7 @@ the default value only.



*Default:* `0.0.39`
*Default:* `0.0.40`

|-

Expand Down Expand Up @@ -207,6 +207,8 @@ jobs, all other fields arefiltered out.

* `node_count`

* `nodes`


|-

Expand Down Expand Up @@ -242,8 +244,6 @@ unique job, all other fields are filtered out.

* `qos`

* `required`

* `script`

* `state`
Expand Down Expand Up @@ -303,8 +303,6 @@ unique job, all other fields are filtered out.

* `standard_output`

* `state_reason`

* `tasks`

* `tres_req_str`
Expand Down
18 changes: 10 additions & 8 deletions docs/modules/install/pages/quickstart.adoc
Expand Up @@ -2,9 +2,9 @@

== Requirements

:fn-slurm-version: footnote:slurm-version[Slurm-web {version} actually requires Slurm REST API v0.0.39 available in Slurm 23.02 and above. Please refer to xref:overview:architecture.adoc#slurm-versions[Slurm REST API versions section] for more details.]
:fn-slurm-version: footnote:slurm-version[Slurm-web {version} actually requires Slurm REST API v{api_version} available in Slurm 23.11 and above. Please refer to xref:overview:architecture.adoc#slurm-versions[Slurm REST API versions section] for more details.]

* Cluster with Slurm >= 23.02 {fn-slurm-version} and
* Cluster with Slurm >= 23.11 {fn-slurm-version} and
https://slurm.schedmd.com/accounting.html[accounting enabled]
* Host installed with a supported GNU/Linux distributions among:
** CentOS
Expand Down Expand Up @@ -132,16 +132,18 @@ Enable and start `slurmrestd` service:

To check `slurmrestd` daemon is properly running, run this command:

[source,console]
[source,console,subs=attributes]
----
# curl --unix-socket /run/slurmrestd/slurmrestd.socket http://slurm/slurm/v0.0.39/diag
# curl --unix-socket /run/slurmrestd/slurmrestd.socket http://slurm/slurm/v{api_version}/diag
{
"meta": {
"plugin": {
"type": "openapi\/v0.0.39",
"name": "Slurm OpenAPI v0.0.39",
"data_parser": "v0.0.39"
},
"type": "openapi\/slurmctld",
"name": "Slurm OpenAPI slurmctld",
"data_parser": "data_parser\/v{api_version}",
"accounting_storage": "accounting_storage\/slurmdbd"
},
}
}
----
Expand Down
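The same check as the `curl` command above can be scripted. The sketch below uses only the Python standard library and is an assumption-based illustration, not code from Slurm-web (which mounts its own adapter on a `requests` session). `UnixHTTPConnection`, `diag_path` and `query_diag` are hypothetical names:

```python
import http.client
import json
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP connection going through a UNIX socket instead of TCP."""

    def __init__(self, socket_path: str):
        # The host name is only used for the Host header.
        super().__init__("slurm")
        self.socket_path = socket_path

    def connect(self) -> None:
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self.socket_path)


def diag_path(api_version: str) -> str:
    """Build the path of the diag endpoint for the given API version."""
    return f"/slurm/v{api_version}/diag"


def query_diag(socket_path: str, api_version: str = "0.0.40") -> dict:
    """Send GET request on diag endpoint and return the decoded JSON."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request("GET", diag_path(api_version))
        return json.loads(conn.getresponse().read())
    finally:
        conn.close()
```

On a host where `slurmrestd` listens on its socket, `query_diag("/run/slurmrestd/slurmrestd.socket")["meta"]` should correspond to the `meta` object shown in the console output.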
16 changes: 8 additions & 8 deletions docs/modules/misc/pages/troubleshooting.adoc
Expand Up @@ -9,17 +9,17 @@ This page contains troubleshooting tips to help find out the reason of issues.
Test Slurm `slurmrestd` API is properly responding on Unix socket with this
command:

[source,console]
[source,console,subs=attributes]
----
$ curl --silent --unix-socket /run/slurmrestd/slurmrestd.socket http://slurm/slurm/v0.0.39/diag | \
$ curl --silent --unix-socket /run/slurmrestd/slurmrestd.socket http://slurm/slurm/v{api_version}/diag | \
jq '.statistics | with_entries(select(.key | startswith("jobs")))'
{
"jobs_submitted": 0,
"jobs_started": 0,
"jobs_completed": 0,
"jobs_submitted": 385,
"jobs_started": 407,
"jobs_completed": 411,
"jobs_canceled": 0,
"jobs_failed": 0,
"jobs_pending": 40,
"jobs_pending": 0,
"jobs_running": 0
}
----
Expand All @@ -29,9 +29,9 @@ cluster.

Test Slurm accounting in REST API with this command:

[source,console]
[source,console,subs=attributes]
----
$ curl --silent --unix-socket /run/slurmrestd/slurmrestd.socket http://slurm/slurmdb/v0.0.39/config | \
$ curl --silent --unix-socket /run/slurmrestd/slurmrestd.socket http://slurm/slurmdb/v{api_version}/config | \
jq .clusters[].nodes
"cn[1-4]"
----
Expand Down
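For readers less familiar with `jq`, the filter `.statistics | with_entries(select(.key | startswith("jobs")))` keeps only the entries of `statistics` whose key starts with `jobs`. An equivalent sketch in Python, where `select_jobs_statistics` is a hypothetical helper and the `server_thread_count` entry is only there to illustrate a key being dropped:

```python
def select_jobs_statistics(diag: dict) -> dict:
    """Keep the entries of diag["statistics"] whose key starts with "jobs"."""
    return {
        key: value
        for key, value in diag["statistics"].items()
        if key.startswith("jobs")
    }


# Sample payload reusing the values of the console output above.
diag = {
    "statistics": {
        "server_thread_count": 1,
        "jobs_submitted": 385,
        "jobs_started": 407,
        "jobs_completed": 411,
        "jobs_canceled": 0,
        "jobs_failed": 0,
        "jobs_pending": 0,
        "jobs_running": 0,
    }
}
```

Calling `select_jobs_statistics(diag)` returns the seven `jobs_*` counters and drops every other key.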