Nvidia Metrics

The nvidia-metrics repository uses the Nvidia Management Library (NVML), a C-based API for monitoring and managing Nvidia GPUs, to collect usage statistics such as temperature, power consumption, and memory usage. It is intended for developers working on high-performance computing, machine learning, and other GPU-intensive tasks. The collected metrics are exported to Prometheus and can be visualized in Grafana. Use the config/metrics.yaml file to define and adjust the metrics and labels that are collected.

Getting Started

These instructions cover installing the prerequisites, building the project, and running the application.

Usage

  -config string
        Path to the configuration file (default "config/metrics.yaml")
  -filelog string
        Enable file logging (default "false")
  -host string
        Host to run the metrics server (default "0.0.0.0")
  -interval string
        Time interval in seconds to scrape metrics (default "5")
  -logfile string
        Log file path (default "logs/gpu-metrics.log")
  -loglevel string
        Log level (debug, info, warn, error, fatal) (default "info")
  -port string
        Port to run the metrics server (default "9500")

Prerequisites

To use this repository, you should have the Nvidia CUDA toolkit installed on your system.

On Debian or Ubuntu, you can install the toolkit via:

sudo apt-get install nvidia-cuda-toolkit

Installing & Running the application

Clone the repository to your local machine.

git clone <repo_link>

Navigate to the cloned directory.

cd nvidia-metrics

Compile the project.

make

After the project has been compiled, run the resulting binary.

./nvidiaMetrics --config config/metrics.yaml

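Alternatively, run the application in a Docker container, passing the settings as environment variables.

docker run -e CONFIG_FILE=/path/to/config.yaml \
           -e LOG_LEVEL=debug \
           -e PORT=8080 \
           -e HOST=0.0.0.0 \
           -e INTERVAL=10 \
           your-image-name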

Built With

NVML - A C-based API for monitoring and managing Nvidia GPUs, used here from Go.
CUDA - A parallel computing platform and programming model developed by Nvidia for general computing on GPUs.

Contributing

Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests.

License

This project is licensed under the MIT License.

Contact

Please feel free to contact the project maintainers if you encounter any issues or have any questions about the repository.

We hope you find this repository useful in your venture!

rupeshtr78/nvidia-metrics-prometheus
