
Helix CallbackHandler performance improvement


Introduction

Some customers reported that Helix was not responding to their newly created resources or jobs. During debugging, we noticed that the pending callback queue was extremely long for these customers. The workaround was to trigger an alert when the pending callback queue size exceeded a certain limit and have the on-call engineer restart the controller when needed. To fundamentally address the issue, we conducted a series of efforts to improve callback handling: removing duplicated logic, replacing old time-consuming libraries, reducing lock scope, and several other critical changes outside CallbackHandler. In the latest Helix release, the number of pending callbacks no longer keeps growing under intensive callback traffic, and customers are no longer blocked on new resource or job creation.

Background

Helix uses ZooKeeper as a centralized metadata store for resource status and as a communication channel with participants. The Helix ZkClient subscribes to ZNodes on different paths and receives ZooKeeper callbacks for any ZNode change. When ZkClient gets a ZK callback, it sends the event to a callback handler, which invokes the corresponding business logic based on the ZNode change. Each ZkClient has a separate event-handling thread, and all callback events are processed on this single thread. When callback events arrive faster than this one thread can process them, subsequent events have to wait in the event queue. As pending callback events accumulate in the queue, newly arrived events wait longer and longer. From the customers' perspective, newly created jobs or tasks take longer and longer to be scheduled, which heavily impacts their business use cases.
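The sketch below illustrates the single-consumer structure described above; it is a minimal stand-in, not Helix's actual ZkClient code. With one event-handling thread draining the queue, any burst of callbacks that outpaces the consumer simply accumulates.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class SingleThreadEventLoop {
  private final BlockingQueue<Runnable> eventQueue = new LinkedBlockingQueue<>();
  private final Thread eventThread = new Thread(() -> {
    try {
      while (!Thread.currentThread().isInterrupted()) {
        eventQueue.take().run(); // each callback is handled sequentially on this one thread
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }, "zk-event-thread");

  public SingleThreadEventLoop() {
    eventThread.start();
  }

  public void enqueue(Runnable callback) {
    eventQueue.offer(callback); // the queue grows whenever producers outpace the single consumer
  }
}
```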

Methodology

Hot Spot in callback handling process

This issue (https://github.com/apache/helix/issues/401) shows the call stack and the time spent in callback event handling in our test environment, captured with a profiling tool.

From the profiling results, serialization and deserialization take half of the time. Because they sit on the critical path of ZK callback handling as well as the controller pipeline, fixing this hot spot improves not only callback handling but also other Helix operations.

Remove duplicated logic

Remove duplicated logic to re-subscribe

Re-subscribing the ZK watcher is needed every time we get a callback. In Helix, ZkClient resubscribes for that path right after receiving the callback, and CallbackHandler subscribes for all children ZNodes. Fanning out subscriptions to all children ZNodes is already time consuming. To make it worse, the original code made two fan-out subscribeForChanges calls for each child-change event: one in handleChildChange and another called asynchronously in invoke. To keep the logic simple and clear, CallbackHandler only needs to call subscribeForChanges once when handling a child change, and this can be done in invoke, as long as we resubscribe first and read the new children nodes afterwards, as sketched below.
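A hypothetical sketch of the reworked flow; the class, interface, and method names here are illustrative, not Helix's actual API. The point is the ordering: resubscribe once, then read the fresh child list, both from invoke, with no second fan-out call from handleChildChange.

```java
import java.util.List;

public class ChildChangeInvoker {
  /** Minimal stand-in for the ZK client operations this sketch needs. */
  interface ZkOps {
    void subscribeForChanges(String path);   // re-register watchers on the path and its children
    List<String> getChildren(String path);   // read the current children
  }

  private final ZkOps zk;

  public ChildChangeInvoker(ZkOps zk) {
    this.zk = zk;
  }

  /** Called once per child-change callback. */
  public void invoke(String parentPath) {
    zk.subscribeForChanges(parentPath);               // resubscribe first, exactly once
    for (String child : zk.getChildren(parentPath)) { // then read the post-resubscribe child list
      handleChild(parentPath + "/" + child);          // business logic sees the fresh state
    }
  }

  private void handleChild(String childPath) {
    // listener-specific business logic (omitted)
  }
}
```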

Use batchMode to dedupe event

CallbackHandler can handle each callback event one at a time, or handle them in batch mode. When batch mode is enabled, multiple callback events for the same path are handled only once. Another benefit of batch mode is that it relieves the load on the ZkClient event-handling thread: in non-batch mode, resubscribing and event handling are both done on the one event-handling thread, whereas in batch mode these two steps are processed on separate threads. We advised high-volume users to enable batch mode to reduce duplicate events; a sketch of the deduplication idea follows.
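The following is a minimal sketch of how batched callbacks can be coalesced by path; the names and the single-thread pool are illustrative, not Helix's implementation. Events for a path that arrive while one is already pending collapse into a single handler run.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class DedupingDispatcher {
  private final Set<String> pendingPaths = ConcurrentHashMap.newKeySet();
  private final ExecutorService handlerPool = Executors.newSingleThreadExecutor();

  public void onCallback(String path) {
    // Only enqueue when no event for this path is already waiting,
    // so a burst of callbacks on one path is handled once.
    if (pendingPaths.add(path)) {
      handlerPool.submit(() -> {
        pendingPaths.remove(path); // callbacks arriving after this point enqueue again
        handleChange(path);
      });
    }
  }

  private void handleChange(String path) {
    // resubscribe and run the listener's business logic (omitted)
  }
}
```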

Resolve lock contention

Lock contention on ZKHelixManager

Every listener's most time-consuming function, subscribeForChanges, was synchronized on a single HelixManager object. The rebalance pipeline was also synchronized on the same HelixManager object. This made callback handling and rebalance computation extremely slow, leaving the cluster almost unresponsive. We shrank the scope of code that needs to be synchronized on the HelixManager object, and replaced the synchronization object with a per-path CallbackHandler object instead of the widely shared HelixManager object, as illustrated below.
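An illustrative sketch of the lock-scope change, not the actual Helix code: subscribeForChanges no longer synchronizes on the shared manager object that the rebalance pipeline also locks, so handlers for different paths and the pipeline can proceed in parallel.

```java
public class PerPathCallbackHandler {
  private final String path;
  private final Object handlerLock = new Object(); // per-path lock, not the shared HelixManager

  public PerPathCallbackHandler(String path) {
    this.path = path;
  }

  public void subscribeForChanges() {
    synchronized (handlerLock) {
      // register watchers for this.path and its children only;
      // other handlers and the rebalance pipeline are not blocked by this lock
    }
  }
}
```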

Lock contention on CallbackHandler

The event queue in CallbackHandler requires an exclusive lock during init and reset. After the previous change, this is synchronized on the handler itself (the 'this' keyword). To go one step further, we could hold the queue in an AtomicReference rather than locking the whole object, which would relieve the lock contention on CallbackHandler, as sketched below.
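A minimal sketch of the AtomicReference idea, assuming the handler keeps its pending-event queue behind a single reference; names are illustrative. Init and reset swap in a fresh queue atomically instead of locking the handler.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicReference;

public class HandlerQueueHolder {
  private final AtomicReference<Queue<String>> queueRef =
      new AtomicReference<>(new ConcurrentLinkedQueue<>());

  public void reset() {
    // Swap in a fresh queue atomically; readers holding the old reference
    // finish their work without blocking on a handler-wide lock.
    queueRef.set(new ConcurrentLinkedQueue<>());
  }

  public void enqueue(String event) {
    queueRef.get().offer(event);
  }
}
```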

Reduce number of ZK write request to reduce ZK latency

There are some redundant ZK write IOs in our Task Framework. The workflow and job contexts are written to ZK in every pipeline run, even when they remain unchanged during the pipeline. These ZK writes can be avoided by skipping the context write whenever the context has not changed, as in the sketch below.
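A hedged sketch of the idea with illustrative names (not the Task Framework API): keep a dirty flag on the in-memory context and only write it back to ZK at the end of the pipeline when it was actually mutated.

```java
import java.util.function.Consumer;

public class JobContextSketch {
  private String content = "";      // stand-in for the serialized workflow/job context
  private boolean dirty = false;

  public void setContent(String newContent) {
    if (!newContent.equals(content)) {
      content = newContent;
      dirty = true;                 // mark only on a real change
    }
  }

  /** Called once at the end of the pipeline. */
  public void persistIfChanged(Consumer<String> zkWriter) {
    if (dirty) {                    // unchanged contexts cost zero ZK writes
      zkWriter.accept(content);
      dirty = false;
    }
  }
}
```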

Use thread pool to reduce memory usage caused by too many threads

When batch mode is enabled, each CallbackHandler uses two threads to resubscribe and handle all of its callbacks. Since we have one CallbackHandler object per path, the resulting large number of threads can cause memory issues. We need a thread pool per ZKHelixManager to limit the thread count, along the lines of the sketch below.
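A sketch of the intended fix, with illustrative names and an arbitrary pool size: instead of two dedicated threads per CallbackHandler, all handlers created by one HelixManager share a bounded pool, capping the total thread count regardless of how many paths are watched.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SharedHandlerPool {
  // One pool per HelixManager; the size here is only an example.
  private final ExecutorService pool = Executors.newFixedThreadPool(8);

  public void submitResubscribe(Runnable task) {
    pool.submit(task); // resubscription work from any handler runs here
  }

  public void submitInvoke(Runnable task) {
    pool.submit(task); // callback handling work shares the same bounded pool
  }

  public void shutdown() {
    pool.shutdown();
  }
}
```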

Result

The following chart shows the 99th-percentile pipeline processing time for some Helix-managed clusters.

(Screenshot: 99th-percentile pipeline processing time, captured 2021-10-19.)

Related PR
