If you have many nodes, each running many Kafka tables, a rolling restart can lead to a terrible sequence of rebalances:
node1 goes down
every Kafka table on that node shuts down
that triggers rebalances in the consumer groups
a rebalance is a 'stop the world' event in ClickHouse
so all other replicas pause consumption and start the rebalance protocol to redistribute the topics / partitions
it usually takes from a few seconds to dozens of seconds until they get the new assignment
in the meantime node1 comes back online and triggers one more rebalance
then the situation repeats for the other nodes.
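To give a feel for how this adds up, here is a rough back-of-the-envelope sketch. It assumes that every node restart causes exactly two rebalances per consumer group (one when the node leaves, one when it rejoins), as in the sequence above, and that each rebalance pauses consumption for a fixed time. The function names and the numbers in the usage note are illustrative, not measurements.

```python
def count_rebalances(nodes: int, rebalances_per_restart: int = 2) -> int:
    """Rebalances seen by one consumer group during a full rolling restart.

    Assumption: each node restart causes one rebalance when the node leaves
    and one when it rejoins.
    """
    return nodes * rebalances_per_restart


def estimated_pause_seconds(nodes: int, pause_per_rebalance: float) -> float:
    """Total time consumption is paused, assuming a fixed pause per rebalance."""
    return count_rebalances(nodes) * pause_per_rebalance
```

For example, under these assumptions a 10-node cluster with a 10-second pause per rebalance would spend `estimated_pause_seconds(10, 10.0)` seconds, i.e. 200 seconds, not consuming during a single rolling restart.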
Possible solution:
let's introduce a setting like stopStreamingTablesDuringRestarts
when it is enabled, clickhouse-operator should, before restarting the first node, run
DETACH TABLE db.table ON CLUSTER '{cluster}' PERMANENTLY
for every table with engine Kafka (maybe also RabbitMQ and others?)
and store in the state which tables were detached (wouldn't that be too much?)
after that, do the normal reconcile / restarts.
on either success or failure, run ATTACH TABLE for every table stored in the state.
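The proposed flow could be sketched roughly as follows. This is only an illustration of the ordering (detach, restart, always re-attach), not operator code: `execute` and `restart` are hypothetical callbacks standing in for "run a SQL statement on the cluster" and "do the normal reconcile / rolling restart", the in-memory `detached` list stands in for whatever persistent state the operator would keep, and using `ON CLUSTER` on the `ATTACH` is an assumption that mirrors the `DETACH`. The list of tables could be obtained with something like `SELECT database, name FROM system.tables WHERE engine = 'Kafka'`.

```python
from typing import Callable, Iterable, List, Tuple

Table = Tuple[str, str]  # (database, table name)


def detach_sql(db: str, table: str) -> str:
    # Statement taken from the proposal; identifiers are not quoted here.
    return f"DETACH TABLE {db}.{table} ON CLUSTER '{{cluster}}' PERMANENTLY"


def attach_sql(db: str, table: str) -> str:
    # Assumption: ATTACH is also run ON CLUSTER, mirroring the DETACH.
    return f"ATTACH TABLE {db}.{table} ON CLUSTER '{{cluster}}'"


def restart_with_streaming_paused(
    tables: Iterable[Table],
    execute: Callable[[str], None],   # hypothetical: runs one SQL statement
    restart: Callable[[], None],      # hypothetical: normal reconcile / restarts
) -> None:
    detached: List[Table] = []  # stand-in for the operator's stored state
    try:
        for db, table in tables:
            execute(detach_sql(db, table))
            detached.append((db, table))
        restart()
    finally:
        # Re-attach on success and on failure, but only what was detached.
        for db, table in detached:
            execute(attach_sql(db, table))
```

The `try` / `finally` is the important part: the tables come back whether the restart succeeds or fails. A real implementation would also need to persist the list of detached tables so that it survives an operator crash in the middle of the restart.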