Helix CallbackHandler performance improvement
Some customers reported that Helix was not responding to their newly created resources or jobs. While debugging, we noticed that the pending callback queue was extremely long for these customers. The workaround was to trigger an alert when the pending callback queue size exceeded a certain limit and have the on-call engineer restart the controller when needed. To fundamentally address the issue, we made a series of improvements to callback handling: removing duplicated logic, replacing old, time-consuming libraries, reducing lock scope, and some other critical changes outside CallbackHandler. In the latest Helix release, the number of pending callbacks no longer keeps increasing under intensive callbacks, and customers are no longer blocked on new resource or job creation.
Helix uses ZooKeeper as a centralized metadata store for resource status and as a communication channel with participants. Helix ZkClient subscribes to ZNodes on different paths and receives ZooKeeper callbacks for any ZNode change. When ZkClient gets a ZK callback, it sends the event to a callback handler, which invokes the corresponding business logic based on the ZNode change. Each ZkClient has a single _eventHandling thread, and all callback event handling is executed on this thread. When callback events arrive faster than a single thread can process them, subsequent events must wait in the event queue. As pending callback events accumulate in the queue, newly arrived callback events wait longer and longer. From the customers' perspective, newly created jobs or tasks take longer and longer to be scheduled, which heavily impacts their business use cases.
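As a rough illustration of this bottleneck, the sketch below (plain Java, not Helix's actual ZkClient code) models a single event handling thread draining a callback queue; when callbacks are enqueued faster than the worker can run them, the pending count grows without bound.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal sketch of a single-threaded callback queue (illustrative only).
public class SingleThreadEventQueueSketch {
  private final BlockingQueue<Runnable> eventQueue = new LinkedBlockingQueue<>();
  private final Thread eventHandlingThread;

  public SingleThreadEventQueueSketch() {
    eventHandlingThread = new Thread(() -> {
      try {
        while (!Thread.currentThread().isInterrupted()) {
          // All callbacks for this client are handled here, one at a time.
          eventQueue.take().run();
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "zk-event-handling-sketch");
    eventHandlingThread.start();
  }

  // Called from the notification path; returns immediately, so slow handlers
  // only show up as a growing queue rather than as back-pressure.
  public void enqueue(Runnable callback) {
    eventQueue.offer(callback);
  }

  public int pendingCallbacks() {
    return eventQueue.size();
  }
}
```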
This issue (https://github.com/apache/helix/issues/401) shows the call stack and the time spent in callback event handling in our test environment, captured with a profiling tool.
From our profiling results, serialization and deserialization take half of the time. Because they are on the critical path of ZK callback handling as well as our controller pipeline, fixing this hot spot improves not only callback handling but also other Helix operations.
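To show the direction of the fix (the Jackson migration listed in the PRs below), here is a minimal, hedged sketch of serializing a record-like payload with a shared com.fasterxml.jackson ObjectMapper; the Payload class is hypothetical and is not Helix's actual ZNRecord serializer.

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Map;

// Sketch only: reuse one FasterXML ObjectMapper instead of the legacy
// org.codehaus.jackson codec on the serialization hot path.
public class JacksonSerializerSketch {
  // ObjectMapper is thread-safe after configuration; share a single instance.
  private static final ObjectMapper MAPPER = new ObjectMapper();

  public static class Payload {
    public String id;
    public Map<String, String> simpleFields;
  }

  public static byte[] serialize(Payload record) throws Exception {
    return MAPPER.writeValueAsBytes(record);
  }

  public static Payload deserialize(byte[] bytes) throws Exception {
    return MAPPER.readValue(bytes, Payload.class);
  }
}
```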
ZK watchers must be resubscribed every time we get a callback. In Helix, ZkClient resubscribes for that path right after getting the callback, and CallbackHandler subscribes for all children ZNodes. Fanning out subscriptions to all children ZNodes is already time consuming. To make it worse, our original code made two fan-out subscribeForChange calls for each child change event: one in handleChildChange and another called asynchronously in invoke. To keep the logic simple and clear, CallbackHandler only needs to call subscribeForChange when handling a children change, and this can be done in invoke as long as we resubscribe first and read the new children nodes afterward.
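The sketch below illustrates that ordering with simplified, hypothetical types (it is not Helix's CallbackHandler API): re-subscribe first, then read the children and run the business logic, so each child-change callback triggers exactly one fan-out subscription.

```java
import java.util.List;

// Sketch of the resubscribe-then-read ordering for a child-change callback.
public class ChildChangeHandlerSketch {
  // Stand-in for the ZkClient operations used here (not the real interface).
  interface ZkClientLike {
    void subscribeChildChanges(String path, Object listener);
    List<String> getChildren(String path);
  }

  private final ZkClientLike zkClient;
  private final String path;

  public ChildChangeHandlerSketch(ZkClientLike zkClient, String path) {
    this.zkClient = zkClient;
    this.path = path;
  }

  // Invoked once per child-change callback. Re-subscribing exactly once here,
  // instead of once in handleChildChange and again asynchronously in invoke,
  // halves the fan-out subscription cost.
  public void invoke(Object listener) {
    // 1. Re-install the watch first so no change between read and subscribe is lost.
    zkClient.subscribeChildChanges(path, listener);
    // 2. Then read the current children and run the business logic.
    List<String> children = zkClient.getChildren(path);
    processChildren(children);
  }

  private void processChildren(List<String> children) {
    // business logic based on the new set of children (omitted)
  }
}
```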
CallbackHandler can handle each callback event one at a time or handle them in batch mode. When batch mode is enabled, multiple callback events for the same path are handled only once. Another benefit of batch mode is that it relieves the stress on the ZkClient event handling thread: in non-batch mode, resubscribing and event handling are done in the single event handling thread, while in batch mode these two steps are processed in separate threads. We advised high-volume users to enable batch mode to reduce duplicate events.
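The following sketch shows the collapsing idea in plain Java; the token queue, flag, and thread names are illustrative assumptions, not the actual batch-mode implementation.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch: callbacks for the same path that arrive while one is still pending
// are collapsed into a single handling pass on a dedicated thread.
public class BatchedCallbackSketch {
  private static final Object TOKEN = new Object();

  private final AtomicBoolean pending = new AtomicBoolean(false);
  private final BlockingQueue<Object> queue = new LinkedBlockingQueue<>();
  private final Thread batchThread;

  public BatchedCallbackSketch() {
    batchThread = new Thread(this::drainLoop, "callback-batch-sketch");
    batchThread.setDaemon(true);
    batchThread.start();
  }

  // Called from the ZkClient event thread: a burst of callbacks for this path
  // enqueues at most one token, so the burst is handled in a single pass.
  public void onCallback() {
    if (pending.compareAndSet(false, true)) {
      queue.offer(TOKEN);
    }
  }

  private void drainLoop() {
    try {
      while (true) {
        queue.take();
        pending.set(false); // callbacks arriving from here on trigger a new pass
        resubscribeAndHandle();
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  private void resubscribeAndHandle() {
    // re-install watches and run the listener's business logic (omitted)
  }
}
```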
Lock contention on zkHelixManager
Every listener's most time-consuming function, subscribeForChanges, is synchronized on a single HelixManager object. The rebalance pipeline is also synchronized on this HelixManager object. This makes callback handling and rebalance computation extremely slow, and the cluster becomes almost unresponsive. We shrank the scope of the code that needs to be synchronized on the HelixManager object, and we also replaced the synchronization target with a per-path CallbackHandler object instead of the widely shared HelixManager object.
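A hedged before/after sketch of that change follows; the class and method names are illustrative stand-ins rather than the real HelixManager and CallbackHandler code.

```java
// Sketch of shrinking the lock scope and moving from a shared lock to a per-handler lock.
public class SubscribeLockSketch {
  private final Object sharedManager; // stands in for the shared HelixManager

  public SubscribeLockSketch(Object sharedManager) {
    this.sharedManager = sharedManager;
  }

  // Before: one coarse lock shared with the rebalance pipeline.
  public void subscribeForChangesCoarse(String path) {
    synchronized (sharedManager) {
      fanOutSubscribe(path); // slow fan-out runs while holding the shared lock
    }
  }

  // After: each per-path handler guards only its own bookkeeping; the slow
  // fan-out no longer blocks the rebalance pipeline.
  public void subscribeForChangesFine(String path) {
    synchronized (this) {
      markSubscribing(path); // cheap local state update
    }
    fanOutSubscribe(path);   // expensive work done outside the shared lock
  }

  private void markSubscribing(String path) {
    // per-handler bookkeeping (omitted)
  }

  private void fanOutSubscribe(String path) {
    // subscribe to the path and all of its children (omitted)
  }
}
```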
The event queue in CallbackHandler requires an exclusive lock during init and reset; after the previous change, it is synchronized on the ‘this’ keyword. To go one step further, we can use an atomic reference rather than locking the whole object, which relieves the lock contention on CallbackHandler.
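A minimal sketch of that idea, assuming a simple queue holder rather than the real CallbackHandler, looks like this:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicReference;

// Sketch: swap the queue behind an AtomicReference so init/reset can replace it
// atomically without synchronizing on the whole handler object.
public class AtomicQueueSketch<E> {
  private final AtomicReference<Queue<E>> queueRef =
      new AtomicReference<>(new ConcurrentLinkedQueue<>());

  // reset(): swap in a fresh queue atomically; producers racing with the swap
  // simply land in either the old or the new queue, with no handler-wide lock.
  public Queue<E> reset() {
    return queueRef.getAndSet(new ConcurrentLinkedQueue<>());
  }

  public void offer(E event) {
    queueRef.get().offer(event);
  }

  public E poll() {
    return queueRef.get().poll();
  }
}
```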
There are some redundant ZK write IOs in our Task Framework. The workflow and job contexts are written to ZK in every pipeline run even when they remain unchanged during the pipeline. This ZK write IO can be avoided by eliminating those unnecessary context writes.
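The sketch below shows the general compare-before-write pattern with hypothetical types; the actual Task Framework change may track changes differently.

```java
import java.util.Objects;

// Sketch: only persist a workflow/job context back to the metadata store when
// the pipeline run actually changed it.
public class ContextWriteSketch {
  // Stand-in for the ZK-backed store (not the real Task Framework API).
  interface MetadataStore {
    void write(String path, Object context);
  }

  private final MetadataStore store;

  public ContextWriteSketch(MetadataStore store) {
    this.store = store;
  }

  // Compare the context produced by this pipeline run with what was read at the
  // start of the run; skip the ZK write IO when nothing changed.
  public void persistIfChanged(String path, Object before, Object after) {
    if (!Objects.equals(before, after)) {
      store.write(path, after);
    }
  }
}
```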
When batch mode is enabled, each CallbackHandler uses two threads to resubscribe and handle all callbacks. Since we have one CallbackHandler object per path, the resulting large number of threads can cause memory issues. We therefore introduced a thread pool for each zkHelixManager to limit the number of threads.
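A minimal sketch of sharing one bounded pool per manager follows; the pool size and class names are assumptions for illustration, not the values chosen in the actual change.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: one bounded executor per manager, shared by all of its per-path
// handlers, instead of two dedicated threads per handler.
public class SharedCallbackPoolSketch {
  private final ExecutorService callbackPool = Executors.newFixedThreadPool(10);

  // Per-path handlers submit their batched resubscribe-and-handle work here.
  public void submitBatchedCallback(Runnable resubscribeAndHandle) {
    callbackPool.submit(resubscribeAndHandle);
  }

  public void shutdown() {
    callbackPool.shutdown();
  }
}
```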
Result
The following picture shows the 99th-percentile pipeline processing time for some Helix-managed clusters.
- Replace org.codehaus.jackson with FasterXML.jackson : https://github.com/apache/helix/pull/1293
- Remove duplicate subscribe in CallBackHandler : https://github.com/apache/helix/pull/1504
- Lock contention on HelixManager object: https://github.com/apache/helix/pull/1540
- Lock contention on CallbackHandler object : https://github.com/apache/helix/pull/1589
- Eliminate redundant job context writes : https://github.com/apache/helix/pull/1611
- Eliminate redundant workflow context writes : https://github.com/apache/helix/pull/1600
- Use thread pool for batched call back handling events: https://github.com/apache/helix/pull/1645