Daily IOPs spike on AZURE VMs Cluster causing delays in replications #2930
-
Every day we get a spike in IOPS lasting about 3-5 minutes, which causes delays in replication. More information about our setup can be found here: #2298
We are in the process of eliminating the need for the synchronization. We currently replicate FULLDB --> MiniDB, and we are changing our application to write directly to the mini database rather than the full one, so it no longer needs real-time replication. That is about 1 month away, and we need an interim fix until then. We cannot guarantee that this won't affect other things, because we also had a 2-3 minute outage when an IOPS spike made the whole cluster unresponsive (i.e. Bad Gateway errors).
So I guess I'm looking for assistance. What could cause this on a daily cycle? What can we investigate that might be related to it, and what can we do to fix it, apart from increasing the disk usage?
We do realize the tech isn't suited to our implementation, where we use it in a transactional, ACID way rather than an eventually consistent one. That is where a fundamental design flaw/assumption was made, which we are trying to rectify. It was just the trade-off we made at the time in order to use PouchDB client-side offline storage and replication.
Any help would be much appreciated! Thanks
Replies: 2 comments
-
This is likely going to take a while to pin down. The first thing we need to figure out is what happens during the 2-3 minutes. To narrow that down: how frequently does this happen? Is it consistent (periodic)? Does it coincide with anything special going on in the system? What kind of CouchDB logs are you getting when the spike happens?
Then we need to see whether it is actually CouchDB doing the IOPS. Is there anything else that could produce them? Do you run on any kind of virtualisation besides Azure VMs? Is the CPU time fixed on your VMs, or could this be a case of your Couch not getting enough CPU time for a bit because another VM on the host is taking it, and then your VM gets the CPU back and bursts some data to disk?
Are the IOPS local or to a networked block store? If the latter, basically the same question: any network stalls there, or do you get consistent performance out of it?
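One rough way to check whether CouchDB itself is doing the I/O during the spike is to sample the per-process counters Linux exposes in /proc/<pid>/io. The sketch below is only an illustration under assumptions: it assumes Linux nodes, permission to read the target process's /proc entry, and that you already know the PID to watch (on a typical install CouchDB runs inside the Erlang VM, often a process named beam.smp, but verify that on your nodes). The function names are made up for this example. Note that syscr/syscw count read/write system calls, which is not the same thing as the disk IOPS Azure reports, so treat the numbers as a relative signal.

```python
import time
from pathlib import Path


def parse_proc_io(text):
    """Parse the contents of /proc/<pid>/io into a dict of integer counters."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            counters[key.strip()] = int(value.strip())
    return counters


def io_rate(before, after, seconds):
    """Per-second rate of read/write syscalls and bytes between two samples."""
    keys = ("syscr", "syscw", "read_bytes", "write_bytes")
    return {k: (after[k] - before[k]) / seconds for k in keys}


def sample_pid(pid, interval=5.0):
    """Sample one process twice, `interval` seconds apart (Linux only)."""
    path = Path(f"/proc/{pid}/io")
    before = parse_proc_io(path.read_text())
    time.sleep(interval)
    after = parse_proc_io(path.read_text())
    return io_rate(before, after, interval)
```

Running sample_pid in a loop around the time of the daily spike, once for the CouchDB PID and once for any other suspect process, should show which one's write rate jumps.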
-
Thanks for that, janl.
I think we have tracked this down: we suspect it's related to the system cron jobs that run at this time. They are probably configured in UTC to run out of hours, but for us in the Australian time zone that falls at peak time. We're not sure what they do, but we can try to stagger them so that they run after midnight, e.g. each node at 1, 2, 3, and 4 am.
We also ran out of space on one of the nodes and had to increase the machine size.
We are investigating going to version 3. Is there any documentation to help us evaluate what we need to do and whether there are any breaking changes? I also need to investigate Couchbase vs CouchDB, as it might have paid or more extensive support that could come in handy.
This can be closed, I guess, and a note should be made that these things can impact the cluster in the form of synchronization delays.
Cheers
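For anyone checking the same thing, here is a minimal sketch of the time zone arithmetic behind the cron suspicion above. It assumes a fixed UTC+10 offset (AEST, no daylight saving), standard five-field cron syntax evaluated in UTC, and hypothetical node names; none of that is taken from our actual configuration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC+10 (AEST). Regions that observe daylight saving
# are at UTC+11 for part of the year.
AEST = timezone(timedelta(hours=10), "AEST")


def utc_hour_to_local(utc_hour, tz=AEST):
    """Local hour at which a job scheduled at `utc_hour` UTC actually runs."""
    moment = datetime(2024, 1, 1, utc_hour, tzinfo=timezone.utc)
    return moment.astimezone(tz).hour


def staggered_utc_schedule(nodes, local_hours, tz=AEST):
    """Map each node to cron time fields (in UTC) for its local-time slot."""
    schedule = {}
    for node, local_hour in zip(nodes, local_hours):
        local = datetime(2024, 1, 1, local_hour, tzinfo=tz)
        utc_hour = local.astimezone(timezone.utc).hour
        schedule[node] = f"0 {utc_hour} * * *"
    return schedule


# Hypothetical node names; slots at 1, 2, 3, and 4 am local time.
plan = staggered_utc_schedule(["node1", "node2", "node3", "node4"], [1, 2, 3, 4])
```

Under these assumptions, a job set for 02:00 UTC runs at midday local time, which would explain a peak-time spike, and the staggered 1-4 am local slots correspond to 15:00-18:00 UTC on the previous day.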