Jobs and resources charts in dashboard page #378

Merged
merged 23 commits into from Nov 11, 2024
Changes from all commits (23 commits)
5d58743
chore(docs): print default URI in conf ref/ex
rezib Nov 4, 2024
bb4d80c
chore(dev): update slurm conf and racksdb for dev
rezib Nov 4, 2024
5e1c385
chore(dev): redirect prometheus to dev clusters
rezib Nov 4, 2024
15544df
feat(agent): boolean to flag metrics in /info
rezib Nov 4, 2024
4f32220
tests(agent): update test to reflect changes
rezib Nov 5, 2024
fb465c6
feat(gateway): metrics flag in /clusters
rezib Nov 4, 2024
b7af5f5
refactor(gateway): generic management query params
rezib Nov 4, 2024
47367e5
refactor(agent): move metrics module in subdir
rezib Nov 4, 2024
3dec1fc
tests(agent): adapt tests after module move
rezib Nov 4, 2024
3fe0d29
feat(conf): add metrics > host agent parameter
rezib Nov 4, 2024
b36f8ef
chore(front): add dep on chart.js w/ luxon adapter
rezib Nov 4, 2024
654e143
feat(conf): add metrics > job agent parameter
rezib Nov 5, 2024
60053ab
docs: update conf references
rezib Nov 4, 2024
33b56ad
feat(agent): query metrics in agent
rezib Nov 5, 2024
705b89c
feat(gateway): proxy metrics endpoint
rezib Nov 5, 2024
e53c211
tests(agent): adapt tests to reflect changes
rezib Nov 5, 2024
10347cb
feat(front): resources/jobs charts in dashboard
rezib Nov 5, 2024
feb485f
chore(dev): prometheus and metrics assets
rezib Nov 6, 2024
2c3c385
tests(front): setup vitest with canvas mock
rezib Nov 6, 2024
019170c
tests(front): cover dashboard resource/jobs charts
rezib Nov 6, 2024
6f7cc93
tests(agent): cover metrics DB requests code
rezib Nov 7, 2024
8ed2a63
chore(assets): add screenshots of dashboard charts
rezib Nov 7, 2024
2889297
docs: mention metrics charts feature
rezib Nov 7, 2024
30 changes: 21 additions & 9 deletions CHANGELOG.md
@@ -9,23 +9,33 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Added
- agent:
- Return RacksDB infrastructure name in `/info` endpoint in complement of
the cluster name.
- Return RacksDB infrastructure name and a boolean to indicate if metrics
feature is enabled in `/info` endpoint, in addition to the cluster name.
- Add optional `/metrics` endpoint with various Slurm metrics in OpenMetrics
format designed to be scraped by Prometheus or compatible (#274).
- Add possibility to query metrics from Prometheus database with
`/v<version>/metrics/<metric>` endpoint.
- gateway:
- Return RacksDB infrastructure name of every clusters in `/clusters`
endpoint.
- Return RacksDB infrastructure name and boolean metrics feature flag of every
clusters in `/clusters` endpoint.
- Return optional markdown login service message as rendered HTML page with
`/messages/login` enpoint.
`/messages/login` endpoint.
- Proxy metrics requests to agent through
`/api/agents/<cluster>/metrics/<metric>` endpoint.
- frontend:
- Request RacksDB with the infrastructure name provided by the gateway (#348).
- Display time limit of running jobs in job details page (#352).
- Display service message below login form if defined (#253).
- Add dependency on _charts.js_ and _luxon_ adapter to draw charts with
timeseries metrics.
- Display charts of resources (nodes/cores) status and jobs queue in dashboard
page based on metrics from Prometheus (#275).
- conf:
- Add `racksdb` > `infrastructure` parameter for the agent.
- Add `metrics` > `enabled` parameter for the agent.
- Add `metrics` > `restrict` parameter for the agent.
- Add `metrics` > `host` parameter for the agent.
- Add `metrics` > `job` parameter for the agent.
- Add `ui` > `templates`, `message_template`, `message_login` parameters for
the gateway.
- Select `alloc_cpus` and `alloc_idle_cpus` nodes fields on `slurmrestd`
@@ -36,9 +46,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
which can either be configuration definition file or site override (#349).
- docs:
- Add manpage for `slurm-web-show-conf` command.
- Add metrics export configuration documentation.
- Mention metrics export optional feature in quickstart guide.
- Mention metrics export feature in overview page.
- Add metrics feature configuration documentation page.
- Mention metrics optional feature in quickstart guide.
- Mention metrics export and charts feature in overview page.
- Mention possible Prometheus integration in architecture page.
- Mention login service message feature in overview page.
- Add page to document _Service Messages_ configuration.
@@ -47,7 +57,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Add requirement on markdown external library for `gateway` extra package.

### Changed
- docs: Update configuration reference documentation.
- docs:
- Update configuration reference documentation.
- Update dashboard screenshot in overview page with example of resource chart.
- conf:
- Convert `[cache]` > `password` agent parameter from string to password type.
- Convert `[ldap]` > `bind_password` gateway parameter from string to password
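The changelog entries above introduce metrics query endpoints on both components, including the gateway proxy at `/api/agents/<cluster>/metrics/<metric>`. As a rough sketch of how a client could exercise it (the gateway base URL `http://localhost:5011` and the bearer token are placeholders, not values from this PR; the cluster name `atlas` and `range=hour` come from the test crawler below):

```python
import urllib.parse


def metrics_url(base: str, cluster: str, metric: str, range_: str = "hour") -> str:
    """Compose the gateway metrics proxy URL added by this PR:
    /api/agents/<cluster>/metrics/<metric>?range=<range>."""
    query = urllib.parse.urlencode({"range": range_})
    return f"{base}/api/agents/{cluster}/metrics/{metric}?{query}"


if __name__ == "__main__":
    # Hypothetical gateway address and JWT; authenticate first to obtain a token.
    import requests

    url = metrics_url("http://localhost:5011", "atlas", "nodes")
    response = requests.get(url, headers={"Authorization": "Bearer <token>"})
    print(response.json())
```

The same path shape, minus the `/api/agents/<cluster>` prefix, applies to the agent's own `/v<version>/metrics/<metric>` endpoint.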
76 changes: 76 additions & 0 deletions assets/screenshots/assemblies/slurm-web_charts.svg
2 changes: 2 additions & 0 deletions assets/screenshots/build.yaml
@@ -12,6 +12,8 @@ assemblies:
- large
slurm-web_responsive.svg:
- large
slurm-web_charts.svg:
- medium
# List raw screenshots for which versions with dropped shadow must be generated.
shadowed:
- screenshot_auth.png
Binary file added assets/screenshots/raw/screenshot_charts.png
Binary file modified assets/screenshots/raw/screenshot_dashboard_tablet.png
Binary file modified assets/screenshots/shadowed/screenshot_dashboard_tablet.png
9 changes: 9 additions & 0 deletions conf/vendor/agent.yml
@@ -373,3 +373,12 @@ metrics:
- ::1/128
doc: |
Restricted list of IP networks permitted to request metrics.
host:
type: uri
default: http://localhost:9090
doc: |
URL of Prometheus server (or compatible) to requests metrics with PromQL.
job:
type: str
default: slurm
doc: Name of Prometheus job which scrapes Slurm-web metrics.
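The `host` and `job` parameters point the agent at a Prometheus-compatible server and select which scrape job's series to query with PromQL. A minimal sketch of the request parameters such a query could carry (the metric name `slurm_nodes` and the 60-second step are illustrative assumptions, not values from this PR):

```python
import time


def prometheus_range_params(job: str, metric: str, hours: int = 1) -> dict:
    """Build query parameters for Prometheus /api/v1/query_range,
    filtering series by the configured scrape job label."""
    now = int(time.time())
    return {
        "query": f'{metric}{{job="{job}"}}',  # e.g. slurm_nodes{job="slurm"}
        "start": now - hours * 3600,
        "end": now,
        "step": "60s",
    }
```

Sent against `<host>/api/v1/query_range`, such a query returns only the timeseries scraped under the `slurm` job configured above.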
3 changes: 3 additions & 0 deletions dev/conf/agent.ini.j2
@@ -27,5 +27,8 @@ enabled={{ cache_enabled }}
port={{ redis_port }}
password={{ redis_password }}

{% if cluster_name != "pocket" %}
[metrics]
enabled=yes
host=http://localhost:{{ prometheus_port }}
{% endif %}
129 changes: 119 additions & 10 deletions dev/crawl-tests-assets
@@ -27,17 +27,84 @@ from rfl.settings.errors import (
)
from racksdb import RacksDB
from slurmweb.slurmrestd.unix import SlurmrestdUnixAdapter
from slurmweb.metrics.db import SlurmwebMetricsDB
from slurmweb.version import get_version

logger = logging.getLogger("crawl-tests-assets")

DEBUG_FLAGS = ["slurmweb", "rfl", "werkzeug", "urllib3"]
DEV_HOST = "firehpc.dev.rackslab.io"
USER = getpass.getuser()
METRICS_PREFERRED_CLUSTER = "emulator"
# Map between infrastructure names and cluster names that are visible in Slurm-web.
MAP_CLUSTER_NAMES = {"emulator": "atlas"}


def slurmweb_cluster_name(infrastructure: str):
if infrastructure in MAP_CLUSTER_NAMES:
return MAP_CLUSTER_NAMES[infrastructure]
return infrastructure


ASSETS = Path(__file__).parent.resolve() / ".." / "slurmweb" / "tests" / "assets"


def crawl_prometheus(url: str, job: str) -> None:
"""Crawl and save test assets from Prometheus."""
# Check assets directory
assets_path = ASSETS / "prometheus"
if not assets_path.exists():
assets_path.mkdir(parents=True)

# Save requests status
status_file = assets_path / "status.json"
if status_file.exists():
with open(status_file) as fh:
requests_statuses = json.load(fh)
else:
requests_statuses = {}

headers = {}
db = SlurmwebMetricsDB(url, job)

for metric in ["nodes", "cores", "jobs"]:
for _range in ["hour"]:
dump_component_query(
requests_statuses,
url,
f"{db.REQUEST_BASE_PATH}{db._query(metric, _range)}",
headers,
assets_path,
f"{metric}-{_range}",
prettify=False,
)

# query unexisting metric
dump_component_query(
requests_statuses,
url,
f"{db.REQUEST_BASE_PATH}{db._query('fail', 'hour')}",
headers,
assets_path,
"unknown-metric",
)

# query unknown API path
dump_component_query(
requests_statuses,
url,
f"{db.REQUEST_BASE_PATH}/fail",
headers,
assets_path,
"unknown-path",
)

# Save resulting status file
with open(status_file, "w+") as fh:
json.dump(requests_statuses, fh, indent=2)
fh.write("\n")


def query_slurmrestd(session: requests.Session, prefix: str, query: str) -> Any:
"""Send GET HTTP request to slurmrestd and return JSON result. Raise RuntimeError in
case of connection error or not JSON result."""
@@ -385,8 +452,9 @@ def dump_component_query(
query: str,
headers: str,
assets_path: Path,
asset_name: dict[int,str]|str,
asset_name: dict[int, str] | str,
skip_exist: bool = True,
prettify: bool = True,
) -> Any:
"""Send GET HTTP request to Slurm-web component pointed by URL and save JSON result
in assets directory."""
@@ -425,17 +493,17 @@
else:
with open(asset, "w+") as fh:
if asset.suffix == ".json":
fh.write(json.dumps(data, indent=2))
fh.write(json.dumps(data, indent=2 if prettify else None))
else:
fh.write(data)
return data


def crawl_gateway(cluster: str, dev_tmp_dir: Path) -> str:
def crawl_gateway(cluster: str, infrastructure: str, dev_tmp_dir: Path) -> str:
"""Crawl and save test assets from Slurm-web gateway component and return
authentication JWT."""
# Retrieve admin user account to connect
user = admin_user(cluster)
user = admin_user(infrastructure)
logger.info("Found user %s in group admin on cluster %s", user, cluster)

# Get gateway HTTP base URL from configuration
@@ -471,9 +539,9 @@ def crawl_gateway(cluster: str, dev_tmp_dir: Path) -> str:
headers,
assets_path,
{
200: "message_login",
404: "message_login_not_found",
500: "message_login_error",
200: "message_login",
404: "message_login_not_found",
500: "message_login_error",
},
)

@@ -592,6 +660,19 @@ def crawl_gateway(cluster: str, dev_tmp_dir: Path) -> str:
"accounts",
)

# metrics
for metric in ["nodes", "cores", "jobs"]:
for _range in ["hour"]:
dump_component_query(
requests_statuses,
url,
f"/api/agents/{cluster}/metrics/{metric}?range={_range}",
headers,
assets_path,
f"metrics-{metric}-{_range}",
prettify=False,
)

# Save resulting status file
with open(status_file, "w+") as fh:
json.dump(requests_statuses, fh, indent=2)
@@ -600,7 +681,7 @@ def crawl_gateway(cluster: str, dev_tmp_dir: Path) -> str:
return token


def crawl_agent(port: int, token: str) -> None:
def crawl_agent(port: int, token: str, metrics: bool) -> None:
"""Crawl and save test assets from Slurm-web agent component."""
# Compose and return the URL to the gateway
url = f"http://localhost:{port}"
@@ -678,6 +759,20 @@ def crawl_agent(port: int, token: str) -> None:
"accounts",
)

# metrics
if metrics:
for metric in ["nodes", "cores", "jobs"]:
for _range in ["hour"]:
dump_component_query(
requests_statuses,
url,
f"/v{get_version()}/metrics/{metric}?range={_range}",
headers,
assets_path,
f"metrics-{metric}-{_range}",
prettify=False,
)

# FIXME: Download unknown job/node
# Save resulting status file
with open(status_file, "w+") as fh:
@@ -714,8 +809,16 @@ def main() -> None:
db = RacksDB.load(db="dev/firehpc/db", schema="../RacksDB/schemas/racksdb.yml")
logger.info("List of clusters: %s", db.infrastructures.keys())

gateway_infrastructure = list(
db.infrastructures.filter(name=METRICS_PREFERRED_CLUSTER)
)[0].name

# Crawl gateway and get bearer token
token = crawl_gateway(list(db.infrastructures.keys())[0], dev_tmp_dir)
token = crawl_gateway(
slurmweb_cluster_name(gateway_infrastructure),
gateway_infrastructure,
dev_tmp_dir,
)

for cluster in db.infrastructures.keys():
# Load agent configuration
@@ -730,8 +833,14 @@ def main() -> None:
logger.critical(err)
sys.exit(1)

crawl_metrics = cluster == METRICS_PREFERRED_CLUSTER

# Crawl agent
crawl_agent(settings.service.port, token)
crawl_agent(settings.service.port, token, metrics=crawl_metrics)

# Crawl prometheus
if crawl_metrics:
crawl_prometheus(settings.metrics.host.geturl(), settings.metrics.job)

# Crawl slurmrestd
try: