[b4.4] slingshot_metrics: don't return an error from sample() when a device's counter lookup fails #1535

morrone · 2024-11-23T03:33:17Z

No description provided.

…s counter lookup fails

morrone · 2024-11-23T03:34:59Z

tom95858 · 2024-11-23T19:54:20Z

@morrone, do we need a log message at cache_cxil_dev_get too?

tom95858 · 2024-11-23T19:56:14Z

@morrone, I guess the point is that when this fails, we need to know where and why so we can pass the issue on the HPE. With this change, we only know about 1 of the 3 failure modes.

morrone · 2024-11-24T00:41:23Z

@morrone, do we need a log message at cache_cxil_dev_get too?

No, when this very issue of a counter lookup fails, I call cxil_close_device(cache_cxil_dev[i]);. It will be normal for the device to be missing from the cache until at least the next cache refresh.

With this change, we only know about 1 of the 3 failure modes.

Assuming cache_cxil_dev_get() was your second failure mode, what is the third that you are wondering about?

I guess the point is that when this fails, we need to know where and why so we can pass the issue on the HPE.

This tells most of what we know. A read of a counter failed for a device. The console log message is going to tell HPE more than LDMS can.

The thing that I was considering is maybe making cache_cxil_device_list_get() non-fatal. But if that fires, it probably means you don't even have the cxi kernel module loaded. Do we print a message over and over again when things are potentially misconfigured? Or do we move in the direction of considering almost nothing fatal? I guess I don't have a strong opinion either way on that.

tom95858 · 2024-11-24T16:13:34Z

@morrone, I don't have answers to any of those questions. I was suggesting them in case you did ;-). @jstile might you be able to test this on your system? It should minimize the time lost between recognizing and restarting a stopped sampler.

johnstile · 2024-11-26T17:06:29Z

@morrone The patch logs the error, but does not include the return code. Would that be helpful to find the cause?

johnstile · 2024-11-26T17:06:53Z

As I watch the logs, after ldms starts the sampler we see a kernel message from cxi_core

kernel: cxi_core 0000:c2:00.0: cxi0[hsn0] starting automatic telemetry cache refresh

After ldmsd logged " 'slingshot_metrics': failed to sample. Stopping the plug-in.", we see the message from cxi_core

kernel: cxi_core 0000:21:00.0: cxi0[hsn0] stopping automatic telemetry cache refresh

These messages from cxi_core are purely informational and just means that some application started using telemetry and then stopped using it on the node.

slingshot_metrics: don't return an error from sample() when a device'…

a5bcca5

…s counter lookup fails

morrone added this to the v4.4.5 milestone Nov 23, 2024

morrone changed the title ~~[b4.4] slingshot_metrics: don't return an error from sample() when a device's count lookup failes~~ [b4.4] slingshot_metrics: don't return an error from sample() when a device's count lookup fails Nov 23, 2024

morrone changed the title ~~[b4.4] slingshot_metrics: don't return an error from sample() when a device's count lookup fails~~ [b4.4] slingshot_metrics: don't return an error from sample() when a device's counter lookup fails Nov 23, 2024

tom95858 merged commit a7cb0c6 into ovis-hpc:b4.4 Nov 24, 2024
14 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[b4.4] slingshot_metrics: don't return an error from sample() when a device's counter lookup fails #1535

[b4.4] slingshot_metrics: don't return an error from sample() when a device's counter lookup fails #1535

morrone commented Nov 23, 2024

morrone commented Nov 23, 2024

tom95858 commented Nov 23, 2024

tom95858 commented Nov 23, 2024

morrone commented Nov 24, 2024

tom95858 commented Nov 24, 2024

johnstile commented Nov 26, 2024

johnstile commented Nov 26, 2024

[b4.4] slingshot_metrics: don't return an error from sample() when a device's counter lookup fails #1535

[b4.4] slingshot_metrics: don't return an error from sample() when a device's counter lookup fails #1535

Conversation

morrone commented Nov 23, 2024

morrone commented Nov 23, 2024

tom95858 commented Nov 23, 2024

tom95858 commented Nov 23, 2024

morrone commented Nov 24, 2024

tom95858 commented Nov 24, 2024

johnstile commented Nov 26, 2024

johnstile commented Nov 26, 2024