Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[b4.4] slingshot_metrics: don't return an error from sample() when a device's counter lookup fails #1535

Merged
merged 1 commit into from
Nov 24, 2024

Conversation

morrone
Copy link
Collaborator

@morrone morrone commented Nov 23, 2024

No description provided.

@morrone morrone added this to the v4.4.5 milestone Nov 23, 2024
@morrone
Copy link
Collaborator Author

morrone commented Nov 23, 2024

Closes #1533

@morrone morrone changed the title [b4.4] slingshot_metrics: don't return an error from sample() when a device's count lookup failes [b4.4] slingshot_metrics: don't return an error from sample() when a device's count lookup fails Nov 23, 2024
@morrone morrone changed the title [b4.4] slingshot_metrics: don't return an error from sample() when a device's count lookup fails [b4.4] slingshot_metrics: don't return an error from sample() when a device's counter lookup fails Nov 23, 2024
@tom95858
Copy link
Collaborator

@morrone, do we need a log message at cache_cxil_dev_get too?

@tom95858
Copy link
Collaborator

@morrone, I guess the point is that when this fails, we need to know where and why so we can pass the issue on the HPE. With this change, we only know about 1 of the 3 failure modes.

@morrone
Copy link
Collaborator Author

morrone commented Nov 24, 2024

@morrone, do we need a log message at cache_cxil_dev_get too?

No, when this very issue of a counter lookup fails, I call cxil_close_device(cache_cxil_dev[i]);. It will be normal for the device to be missing from the cache until at least the next cache refresh.

With this change, we only know about 1 of the 3 failure modes.

Assuming cache_cxil_dev_get() was your second failure mode, what is the third that you are wondering about?

I guess the point is that when this fails, we need to know where and why so we can pass the issue on the HPE.

This tells most of what we know. A read of a counter failed for a device. The console log message is going to tell HPE more than LDMS can.

The thing that I was considering is maybe making cache_cxil_device_list_get() non-fatal. But if that fires, it probably means you don't even have the cxi kernel module loaded. Do we print a message over and over again when things are potentially misconfigured? Or do we move in the direction of considering almost nothing fatal? I guess I don't have a strong opinion either way on that.

@tom95858
Copy link
Collaborator

@morrone, I don't have answers to any of those questions. I was suggesting them in case you did ;-). @jstile might you be able to test this on your system? It should minimize the time lost between recognizing and restarting a stopped sampler.

@tom95858 tom95858 merged commit a7cb0c6 into ovis-hpc:b4.4 Nov 24, 2024
14 checks passed
@johnstile
Copy link

@morrone The patch logs the error, but does not include the return code. Would that be helpful to find the cause?

@johnstile
Copy link

As I watch the logs, after ldms starts the sampler we see a kernel message from cxi_core

kernel: cxi_core 0000:c2:00.0: cxi0[hsn0] starting automatic telemetry cache refresh

After ldmsd logged " 'slingshot_metrics': failed to sample. Stopping the plug-in.", we see the message from cxi_core

kernel: cxi_core 0000:21:00.0: cxi0[hsn0] stopping automatic telemetry cache refresh

These messages from cxi_core are purely informational and just means that some application started using telemetry and then stopped using it on the node.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants