According to the Manticore manual, when restarting a cluster after an unexpected shutdown, the node that executed the last query should be started first, with the extra --new-cluster flag.
But in most cases a Manticore node runs under systemd or as a service, which means the Manticore process is started automatically after a crash or when power comes back.
So this is not very practical. It's almost impossible for us to have someone standing by to manually check which node handled the last query and then restart the nodes one by one.
Is there any way to make cluster recovery friendlier? Could the nodes communicate with each other to negotiate a valid bootstrap node without manual intervention?
You do not need to start a single node with --new-cluster, or restart it by hand, after a crash. If the cluster still has quorum (some nodes are still alive and form the cluster), the node that restarts just joins the cluster as usual. As the manual says, you need this CLI option only if all of your cluster nodes were shut down and you need to start the whole cluster.
The scenario I mean is not a minority of the cluster nodes going down, but a majority of the nodes, or all of them.
This is quite possible: say, a cluster with 3 nodes in the same room hits an unexpected power outage, and a few seconds later the power comes back.
On Linux the nodes run under systemd, so when the machines power on, the Manticore process is started automatically, and the start order is not guaranteed.
When the cluster has lost quorum, it is impossible for it to tell whether all nodes came back up and have connectivity, or whether most nodes came back up while other nodes are still running separately, in which case you get a split brain.
You could create an external script that checks the nodes periodically. If the nodes crashed and restarted and the cluster got into a non-primary state, but all nodes are available and connected, the script sets pc.bootstrap on any node to fix the cluster, as described in the manual.
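A minimal sketch of the decision such a watchdog script would make, not an official implementation. It assumes the script has already collected, for every node, whether it is reachable and which cluster status it reports. NodeView and bootstrap_statement are hypothetical names, "posts" is only an example cluster name, and the status strings are assumptions. Reachability from the script stands in for "connected"; checking that the nodes can reach each other is left out. The returned statement is the pc.bootstrap command as I understand it from the manual.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NodeView:
    """What the watchdog learned about one node (hypothetical structure)."""
    name: str
    reachable: bool       # could the watchdog connect to this node?
    cluster_status: str   # status the node reports, e.g. "primary" / "non-primary"


def bootstrap_statement(cluster: str, nodes: List[NodeView]) -> Optional[str]:
    """Return the statement to run on one node, or None if acting is not safe."""
    if not nodes:
        return None
    # If any node cannot be reached, it may still be serving elsewhere:
    # bootstrapping now could create a split brain, so do nothing.
    if any(not n.reachable for n in nodes):
        return None
    # If some node is already in a primary component, the others will
    # rejoin on their own and no bootstrap is needed.
    if any(n.cluster_status.lower() == "primary" for n in nodes):
        return None
    # Every node is up but none is primary: re-form the primary component.
    return f"SET CLUSTER {cluster} GLOBAL 'pc.bootstrap' = 1"


if __name__ == "__main__":
    view = [
        NodeView("node-a", True, "non-primary"),
        NodeView("node-b", True, "non-primary"),
        NodeView("node-c", True, "non-primary"),
    ]
    print(bootstrap_statement("posts", view))
```

The function deliberately returns None in every doubtful case, so the script only acts when all nodes are visible and none of them is primary.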
Thank you for pointing that out, but the problem is still there.
A node crash is not that common, and all nodes crashing at the same time should be quite rare. I don't think I will come across that.
But an unexpected power outage, which I have encountered many times, is what concerns me most.
When this happens, I currently have no idea how to locate the node with the most recent data. Could Manticore officially provide an external script that demonstrates how to deal with an unexpected power outage?
We have such a script in the Helm chart manticoresearch-helm. You could try copying the code from there: worker/quorum.php and Core\Manticore\ManticoreConnector->restoreCluster.
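To illustrate just the "which node has the most recent data" step, here is a rough sketch. It is not copied from the Helm chart and may differ from what restoreCluster does. It assumes you can obtain a last-committed sequence number for every node; how you get it (for example from the saved replication state on disk or from a status query) is left out. pick_bootstrap_node and the node names are hypothetical.

```python
from typing import Dict, Optional


def pick_bootstrap_node(last_committed: Dict[str, Optional[int]]) -> Optional[str]:
    """Pick the node with the highest sequence number.

    Returns None when the input is empty or when any node's number is
    unknown (None or negative), because guessing could lose writes.
    """
    if not last_committed:
        return None
    if any(seq is None or seq < 0 for seq in last_committed.values()):
        return None
    # Iterate names in sorted order so that ties are resolved the same
    # way on every run (max keeps the first maximum it sees).
    return max(sorted(last_committed), key=lambda name: last_committed[name])


if __name__ == "__main__":
    seqnos = {"node-a": 120, "node-b": 135, "node-c": 135}
    print(pick_bootstrap_node(seqnos))
```

The chosen node is the one you would start first when the whole cluster is down; the others can then be started normally and join it.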