Auto-bootstrapping an all-down cluster

zeng_ray · June 5, 2024, 9:51am

For production use, a cluster can greatly increase scalability, but we have come across several unexpected power-off events: the power goes down and comes back within a few seconds or longer.

Each time this happens, all nodes go down at once. Without manual intervention, the whole application becomes unusable or behaves unexpectedly.

Even with manual intervention, it's not that easy! None of the solutions in the manual are that helpful.

First, by the time we notice the power-off event, some time has passed and the power may already be back. The cluster may already have split into multiple independent nodes, and clients have no way of knowing.
Second, we need to check grastate.dat to see whether any node has safe_to_bootstrap set, but in most cases none of them does after a sudden power-off. We then need to compare seqno across nodes, but this is not safe because of the first point: unexpected data bugs could happen.
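To make the manual check concrete, here is a rough Python sketch of what we end up doing by hand. It assumes grastate.dat uses the plain "key: value" layout written by the Galera library (uuid, seqno, safe_to_bootstrap), and that the files from all nodes have already been collected somehow. The names GraState, parse_grastate and pick_bootstrap_node are made up for illustration and are not part of Manticore. The sketch deliberately refuses to choose when any node reports seqno -1, which is what an unclean shutdown typically leaves behind.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class GraState:
    """Fields of interest from one node's grastate.dat (assumed layout)."""
    uuid: str
    seqno: int
    safe_to_bootstrap: bool


def parse_grastate(text: str) -> GraState:
    """Parse 'key: value' lines, skipping blank lines and '#' comments."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return GraState(
        uuid=fields.get("uuid", ""),
        seqno=int(fields.get("seqno", "-1")),
        safe_to_bootstrap=fields.get("safe_to_bootstrap", "0") == "1",
    )


def pick_bootstrap_node(states: Dict[str, GraState]) -> Optional[str]:
    """Return the node name to bootstrap from, or None if it is not safe to decide.

    Hypothetical policy, not Manticore's actual behaviour:
    1. If exactly one node is flagged safe_to_bootstrap, use it.
    2. Otherwise every node must report a known seqno (>= 0) and the same
       cluster uuid; the highest seqno wins, ties broken by node name.
    """
    if not states:
        return None
    flagged = [name for name, s in states.items() if s.safe_to_bootstrap]
    if len(flagged) == 1:
        return flagged[0]
    if len(flagged) > 1:
        return None  # more than one flagged node: ambiguous, needs a human
    if any(s.seqno < 0 for s in states.values()):
        return None  # an unknown seqno could be hiding newer data
    if len({s.uuid for s in states.values()}) != 1:
        return None  # nodes disagree on which cluster state they hold
    best = max(s.seqno for s in states.values())
    return sorted(name for name, s in states.items() if s.seqno == best)[0]
```

In the power-loss case this function would mostly return None, which is exactly the problem: the files on disk alone are often not enough, so the nodes would need to recover their real positions and agree among themselves.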

So a sudden power-off is a real pain in the ass for the whole application. It's almost impossible for us to keep someone on standby to manually check which node in a cluster has the latest data. This has to be done internally, by the Manticore nodes communicating with each other to negotiate a valid node to bootstrap from.

So auto-bootstrapping an all-down cluster is badly needed. With this feature, real production use becomes possible.

tomat · June 5, 2024, 9:58am

It would be better to create a feature request ticket on GitHub and provide the same explanation there as you did here.

zeng_ray · June 5, 2024, 1:23pm

Thank you, here it is:

https://github.com/manticoresoftware/manticoresearch/issues/2284