Monitoring CloudVision with Prometheus and Grafana

CVP runs on CentOS, and the Prometheus node-exporter is used to collect various metrics both from the VM (memory and CPU usage, disk latency, reads/writes per second, etc.) and from the application's components. These metrics can be viewed by accessing the Prometheus UI or by creating dashboards in Grafana.

Exposing Prometheus on CloudVision

1. To be able to set up the Prometheus datasource in Grafana, TCP port 9090 has to be allowed in the primary node's firewalld configuration (note that in newer releases this is already enabled). There are two ways to do this:

a) Using the firewall-cmd CLI command:

firewall-cmd --permanent --add-port=9090/tcp --zone=eth0-Zone && firewall-cmd --reload

After running this command, a success message should be printed to stdout.

b) Manually editing /etc/firewalld/zones/eth0-Zone.xml and adding the port the same way the others are added, e.g.:

<port protocol="tcp" port="9090"/>

Make sure to save the file before exiting. The first 15 lines of the file should then look like this:

[root@cvp11 ~]# cat /etc/firewalld/zones/eth0-Zone.xml | head -n 15
<?xml version="1.0" encoding="utf-8"?>
<zone>
  <interface name="eth0"/>
  <service name="dhcpv6-client"/>
  <service name="http"/>
  <service name="ssh"/>
  <service name="https"/>
  <port protocol="tcp" port="9910"/>
  <port protocol="tcp" port="8090"/>
  <port protocol="udp" port="3851"/>
  <port protocol="tcp" port="4433"/>
  <port protocol="tcp" port="8443"/>
  <port protocol="udp" port="161"/>
  <port protocol="tcp" port="9090"/>
  <rule family="ipv4">

2. After this is done, the firewalld configuration has to be reloaded for the new rule to take effect.

firewall-cmd --reload

After running this command, a success message should be printed to stdout.
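Optionally, confirm that the port now shows up in the zone's runtime configuration. A quick sanity check, assuming the zone is named eth0-Zone as above:

firewall-cmd --zone=eth0-Zone --list-ports

9090/tcp should appear in the list of ports.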

Quickly test whether we can connect to that port from the remote end:

nc -zv 192.0.2.100 9090
Connection to 192.0.2.100 port 9090 [tcp/websm] succeeded!

If the response is “Connection refused”, it means that a firewall in the path is still blocking the connection.
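The Prometheus endpoint can also be checked directly with curl. A minimal sketch, assuming the CVP node is reachable at 192.0.2.100 as in the example above and that the Prometheus version on CVP exposes the standard /-/ready endpoint:

curl -sf http://192.0.2.100:9090/-/ready && echo "Prometheus is ready"

The /-/ready endpoint returns HTTP 200 once Prometheus is able to serve queries.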

Building Grafana container

If you don't already have Grafana and Prometheus containers, clone this repo and use docker-compose to bring them up:

1. git clone https://github.com/arista-netdevops-community/cvp-monitoring.git

2. cd cvp-monitoring

3. Run docker-compose up -d.

That should build local Prometheus and Grafana containers preloaded with the dashboards and various metrics.
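A quick way to verify that the containers came up and that Grafana is answering. A minimal sketch; the exact service names depend on the compose file:

docker-compose ps
# Grafana should answer on port 3000 once it has finished starting
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:3000/login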

Note that the local Prometheus container is only needed if you want to inspect CVP data offline; it is not necessary for real-time monitoring.

The next step is to add your Prometheus data source, as described in steps 1-3 of Adding Prometheus to your Grafana instance. The remaining steps (4-10) are only required for existing Grafana instances where the dashboards are not already loaded.

Adding Prometheus to your Grafana instance

1. Access your Grafana UI and add a new data source by clicking the Gear icon on the left pane and selecting Data Sources:

img1

2. Select the Prometheus template and fill in the form:

img2 img3

3. Save & Test. (A CLI alternative to steps 1-3 is sketched at the end of this section.)

4. Now let's create the dashboards. You can build new dashboards or import them from a JSON file; the example below demonstrates importing via JSON. Click the “+” sign on the left pane and select Import:

img4

grafana-prom.json

grafana-apiserver.json

5. You can either import via panel JSON or upload the .json file. Select Upload .json file:

img5

6. Select the JSON file from your local PC/laptop. If the file is valid, you'll be able to set the name of the dashboard, the folder where it will be stored, and the UID of the dashboard, which has to be unique.

img6

Simply changing the UID to an arbitrary value is enough.

7. After clicking Import, the dashboards should show up. In the JSON file the name of the datasource is Prometheus, so if the dashboards are not showing any data and an exclamation mark appears at the top left of each graph like below, we simply need to change the datasource name from Prometheus to the one we chose initially (this can also be done prior to importing the JSON file; see the CLI sketch at the end of this section):

img9

8. To make the modification in Grafana, click on the Gear icon, i.e. Dashboard settings (top right this time):

img7

9. Select JSON Model from the left pane (there's a delay of 5-10 seconds until the JSON model shows up) and replace all instances of "datasource": "Prometheus" with the correct datasource name. In the case below it's "datasource": "mycvp", as mycvp was the name chosen for the datasource in step 2.

img10

10. Hit Save and you should see your dashboards like below:

img8
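For those who prefer the CLI, the same results can be achieved outside the Grafana UI. The sketch below is illustrative and makes a few assumptions: Grafana is reachable at localhost:3000 with the admin/arista credentials from the bundled docker-compose file, the datasource is to be named mycvp (as in step 9), and the CVP node is reachable at 192.0.2.100.

# Create the Prometheus data source via the Grafana HTTP API (equivalent to steps 1-3)
curl -s -u admin:arista -H 'Content-Type: application/json' \
  -X POST http://localhost:3000/api/datasources \
  -d '{"name": "mycvp", "type": "prometheus", "url": "http://192.0.2.100:9090", "access": "proxy"}'

# Rename the datasource in the dashboard files before importing them (the note in step 7), using GNU sed
sed -i 's/"datasource": "Prometheus"/"datasource": "mycvp"/g' grafana-prom.json grafana-apiserver.json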

Loading offline Prometheus data

1. Download this project either using git clone

git clone https://github.com/arista-netdevops-community/cvp-monitoring.git

or using the Download button

downloadzip

2. Get the Prometheus tarball from CVP.

Option 1)

cvpi stop prometheus
cd /data
tar -zcvf prometheus_data.tar.gz prometheus
cvpi start prometheus

Option 2)

cvpi debug --prometheus

This will generate a debug tarball with the Prometheus tarball inside, e.g.: cvpi_debug_all_20220307224437/prometheus/prometheus_data.tar.gz

3. Untar the prometheus folder into the same folder where you saved this project: tar -zxvf prometheus_data.tar.gz

4. Run: docker-compose up -d

5. Access Grafana at localhost:3000 using admin/arista (the default credentials as per the docker-compose file, which can be changed to anything), or Prometheus at localhost:9090.
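To verify that the offline data was picked up by the local Prometheus, its HTTP API can be asked for the metric names it knows about. A minimal sketch (the offline data is historical, so remember to adjust the time range in the UI or in queries accordingly):

curl -s 'http://localhost:9090/api/v1/label/__name__/values'

A long JSON list of metric names (node_load1, hadoop_datanode_bytesread, etc.) indicates the data was loaded.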

Useful metrics

If using an SSH tunnel, e.g. ssh -nNT -L 9090:localhost:9090 root@<CVP_IP>, the following links will work directly. Otherwise, replace localhost with your CloudVision FQDN or IP.

General server metrics

Load Average: node_load1

A better look at CPU usage (%): 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

I/O wait: avg by (instance) (irate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100

Disk writes per second: irate(node_disk_written_bytes_total[2m])

Disk reads per second: irate(node_disk_read_bytes_total[2m])

Hadoop bytes read: irate(hadoop_datanode_bytesread[2m])

Percent disk used: 100 - ((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes) (a filtered variant is sketched after this list)

Available memory: node_memory_MemAvailable_bytes
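The percent-disk-used expression above also counts pseudo-filesystems such as tmpfs. A possible refinement, assuming the standard fstype label exposed by node-exporter, is to filter them out:

100 - ((node_filesystem_avail_bytes{fstype!~"tmpfs|rootfs"} * 100) / node_filesystem_size_bytes{fstype!~"tmpfs|rootfs"})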

Per-process metrics

Memory utilization: process_resident_memory_bytes

CPU utilization: irate(process_cpu_seconds_total[5m]), e.g. http://localhost:9090/graph?g0.range_input=1d&g0.expr=irate(process_cpu_seconds_total%5B5m%5D)&g0.tab=0

Kafka lag for dispatcher: max(kafka_log_log_value{topic="postDB_v2", name="LogEndOffset"}) by (partition, topic) - max(offset_consumed{topic="postDB_v2"}) by (topic, partition). The Kafka lag shows how many messages are in the queue per Kafka partition. A high number of queued messages can lead to performance issues, either because of resource constraints or because more devices are streaming than are supported and the scale limit is being hit.

Flow Data collection

Number of flows per second over the last 5 minutes: sum(rate(clover_ingest_num_notifications_received{type=~"ipfix|int|greent"}[5m]))

Many more are included in the JSON files under ./grafana/provisioning/dashboards/.
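Any of these expressions can also be evaluated outside the UI through the Prometheus HTTP API. A minimal sketch using the CPU expression above (curl's --data-urlencode takes care of encoding the query):

curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'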

Ask a question

The easiest way to get support is to open an issue.
