Daily IOPs spike on AZURE VMs Cluster causing delays in replications #2930

zdravko123 · 2020-06-06T04:34:34Z

Every day we get a spike in IOPS that lasts about 3-5 minutes and delays our replications.
This is a problem for us because we need the databases to synchronize almost instantly, which they generally do. I have thought about upgrading the disks to 5x the IOPS, but I suspect we would still get some spikes, and it's expensive. We could go up to 10x the disk IOPS if we needed to. I have also considered splitting the data out from the views to eliminate some issues. I suspect the cause is related to the BEAM queuing technology: it might be crashing or garbage collecting and dumping things to disk, or perhaps it's the log files. Is there anything I can do to investigate? My Linux skills aren't as good as my Windows skills, but I'm happy to have a poke around along with another developer. I have read on other forums that this causes high CPU, but for us it seems to cause high IOPS and thus outages. It happens on all 4 nodes in the cluster at the same time.

More information about our setup can be found here: #2298

We are in the process of eliminating the need for the synchronization. We currently replicate FULLDB --> MiniDB, and we are changing our application to write directly to the mini database instead of the full one, so it no longer needs real-time replication. That is about 1 month away, and we need an interim fix until then. We cannot guarantee that this won't affect other things, because we also had a 2-3 minute outage when an IOPS spike made the whole cluster unresponsive, i.e. bad gateway errors.

So I guess I'm looking for assistance. What could cause this on a daily cycle? What can we investigate that might be related, and what can we do to fix it, apart from increasing the disk IOPS? We do realize the tech isn't suitable for our implementation, where we use it in a transactional ACID way rather than an eventually consistent one. That is where a fundamental design flaw/assumption was made, which we are trying to rectify. It was just the trade-off at the time for using PouchDB client-side offline storage and replication.

Any help would be much appreciated!

Thanks

Answered by zdravko123

Jun 13, 2020

janl · 2020-06-10T17:48:16Z

Collaborator

This is likely going to take a while to pin down.

The first thing we need to figure out is what happens during the 2-3 minutes.

To narrow that down, how frequently does this happen? Is it consistent (periodic)? Does it coincide with anything special going on in the system? What kind of CouchDB logs are you getting when the spike happens?
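One way to look at the logs around a spike is to pull out only the lines stamped close to the time it started. The following is a minimal sketch, assuming each log line contains an ISO 8601 timestamp in the same time zone as the spike time you pass in. The function name, the window sizes, and the log path in the usage note are hypothetical, not part of CouchDB.

```python
import re
from datetime import datetime, timedelta

# Matches an ISO 8601 timestamp such as 2020-06-06T04:34:34 anywhere in a line.
# Assumption: the log writes timestamps in this shape; adjust if yours differ.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")


def lines_near_spike(lines, spike_start,
                     before=timedelta(minutes=2),
                     after=timedelta(minutes=5)):
    """Return log lines stamped within [spike_start - before, spike_start + after]."""
    window_start = spike_start - before
    window_end = spike_start + after
    selected = []
    for line in lines:
        match = TIMESTAMP.search(line)
        if match is None:
            continue  # no timestamp on this line, e.g. a continuation line
        stamp = datetime.strptime(match.group(0), "%Y-%m-%dT%H:%M:%S")
        if window_start <= stamp <= window_end:
            selected.append(line.rstrip("\n"))
    return selected
```

To use it, open the log file (for example a hypothetical `/path/to/couch.log`), pass the file object as `lines`, and pass the spike start as a `datetime`. Comparing the output from several days should show whether the same kind of entries appear each time.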

Then we need to see if CouchDB is actually the one generating the IOPS. Is there anything else on the machines that could produce them?
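On Linux, one rough way to check whether a given process is responsible is to sample its counters in `/proc/<pid>/io` twice and compare. This is a minimal sketch, not a definitive tool: the function names are made up for illustration, the pid of the Erlang VM (`beam.smp`) has to be looked up separately, and reading another user's `/proc/<pid>/io` usually needs root. Note that these counters are bytes and syscall counts, not device IOPS, so they only indicate whether the process is busy with I/O during the spike.

```python
import time


def parse_proc_io(text):
    """Parse the 'name: value' lines of /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        name, _, value = line.partition(":")
        if value.strip():
            counters[name.strip()] = int(value)
    return counters


def read_proc_io(pid):
    """Read the current I/O counters for one process (Linux only)."""
    with open(f"/proc/{pid}/io") as handle:
        return parse_proc_io(handle.read())


def io_rates(before, after, seconds):
    """Per-second change for the byte and syscall counters between two samples."""
    keys = ("read_bytes", "write_bytes", "syscr", "syscw")
    return {key: (after[key] - before[key]) / seconds for key in keys}


def sample_rates(pid, interval=5.0):
    """Take two samples `interval` seconds apart and return the rates."""
    first = read_proc_io(pid)
    time.sleep(interval)
    return io_rates(first, read_proc_io(pid), interval)
```

Running `sample_rates(pid)` in a loop across the spike window, once for the CouchDB process and once for any other suspect, should show which one is writing.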

Do you run on any kind of virtualisation besides Azure VMs? Is your CPU time fixed on your VMs, or could this be a case of your couch not getting enough CPU time for a bit because another VM is taking it on the host and then your VM gets the CPU and bursts some data to disk?

Are the IOPS going to local disks or to a networked block store? If the latter, basically the same question: are there any network stalls there, or do you get consistent performance out of it?

zdravko123 · 2020-06-13T01:25:09Z

Author

Thanks for that janl,
I think we have tracked this down. We suspect it's related to the system cron jobs that run at this time. They are probably configured in UTC to run out of hours, but for us in the Australian time zone that falls at peak time. Not sure what they do, but we can try to stagger them and get them to run around midnight etc., with each node at 1, 2, 3, and 4 am.
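To check where a UTC-scheduled job lands in local time, and which UTC hours a 1, 2, 3, and 4 am local stagger corresponds to, a small sketch like this may help. It assumes a fixed UTC+10 offset (AEST, no daylight saving) and hypothetical node names; neither is taken from the actual cluster configuration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: local time is a fixed UTC+10 (AEST); daylight saving is ignored.
AEST = timezone(timedelta(hours=10))


def utc_hour_to_local(utc_hour, local_tz=AEST):
    """Local hour of day at which a job scheduled for `utc_hour` UTC runs."""
    moment = datetime(2020, 6, 13, utc_hour, 0, tzinfo=timezone.utc)
    return moment.astimezone(local_tz).hour


def local_hour_to_utc(local_hour, local_tz=AEST):
    """UTC hour of day to schedule a job so that it runs at `local_hour` local time."""
    moment = datetime(2020, 6, 13, local_hour, 0, tzinfo=local_tz)
    return moment.astimezone(timezone.utc).hour


def staggered_utc_hours(node_names, first_local_hour=1, local_tz=AEST):
    """Give each node its own local hour, one hour apart, expressed in UTC."""
    return {
        name: local_hour_to_utc((first_local_hour + index) % 24, local_tz)
        for index, name in enumerate(node_names)
    }
```

With these assumptions, a job set for 02:00 UTC runs at 12:00 local time, and running the four nodes at 1, 2, 3, and 4 am local time means scheduling them at 15:00, 16:00, 17:00, and 18:00 UTC on the previous day.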

We also ran out of space on one of them and had to increase the machine size too.

We are investigating going to version 3. Is there any documentation to help us evaluate what we need to do and to know whether there are any breaking changes? I also need to investigate Couchbase vs CouchDB, as it might have paid or additional support that could come in handy. This can be closed I guess, and a note should be made that these things can impact the cluster in the form of synchronization delays.

Cheers
