Safe restart of all cluster nodes in VMs

I have set up a test cluster under Linux that consists of 3 virtual machines (all Ubuntu 22.04 LTS).

The replication works without problems.

In the Manticore Search manual, under Creating a cluster > Setting up replication > Replication cluster status, you explain how to restart the cluster completely.

The situation is that all three VMs are shut down at the same time when I shut down or restart the host.

I don’t know which VM will be shut down last.
How do I restart the cluster without a problem when the host is restarted and all VMs boot at the same time?

Manticore is started automatically on all three VMs via systemd and the config file. I did not find anything about how to put “--new-cluster” or “--new-cluster-force” into the searchd {} section of the config.
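
The closest workaround I can imagine would be a one-off systemd drop-in that overrides ExecStart, roughly like this (a sketch only: the service name manticore and the paths from my setup are assumptions, and this is not an official Manticore feature):

# create a temporary override that adds --new-cluster to the start command
sudo mkdir -p /etc/systemd/system/manticore.service.d
sudo tee /etc/systemd/system/manticore.service.d/bootstrap.conf <<'EOF'
[Service]
ExecStart=
ExecStart=/usr/bin/searchd --config /etc/manticoresearch/manticore.conf --new-cluster
EOF
sudo systemctl daemon-reload
sudo systemctl restart manticore
# remove the override afterwards, otherwise every future start would bootstrap a new cluster
sudo rm /etc/systemd/system/manticore.service.d/bootstrap.conf
sudo systemctl daemon-reload

But that is still manual, per-boot intervention, which is exactly what I want to avoid.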

I could use a little tip for this scenario. Thanks

You could start all nodes as usual; the cluster should build up, but it will end in a non-primary state, i.e. it cannot accept writes but can still perform searches. Then use the steps from Case 5 in the cluster recovery topic.
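
You can check which state the cluster ended up in via the status variables, e.g. (default SQL port 9306 assumed):

# cluster_<name>_status shows 'primary' (writes accepted) or 'non-primary' (searches only)
mysql -h127.0.0.1 -P9306 -e "SHOW STATUS LIKE 'cluster%'"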

Manual intervention as in Case 5 cannot be the solution here; it MUST somehow work automatically.
It must be possible to trigger “--new-cluster-force” automatically (as I understood it).

Why is there the “--new-cluster-force” option for a manual start, but not for the automatic start via the OS and the config?

I am not quite sure how you could change the config for a one-time action, i.e. restarting a node as the leader.

If you change the config so that the node is allowed to start with that option, then whenever the node crashes it will form a new cluster after restarting, because it sees the option in the config.

You could start the daemon via “manticore_new_cluster” on the VM that starts first, then as usual on all the other VMs.
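
For example, a minimal sketch of that order:

# on the VM that starts first:
manticore_new_cluster
# on each of the other VMs, once the first node is up:
sudo systemctl start manticore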

Surely this is not a one-time thing? With every reboot of the host where the 3 VMs are running, all 3 VMs are terminated almost simultaneously and restarted simultaneously after the reboot. You should not have to do anything manually. I would say that this scenario is quite normal in the VM world. VM hosts also have to be restarted here and there (updates etc.). If you have e.g. a Proxmox or VMware server as the host, it may be rebooted at night without the user’s knowledge.

In the “Rebooting the cluster” section I found the following:

“To reinstate a cluster, it needs to be bootstrapped by its most advanced node - which should be, in general, the last node that went off. The information we need for this is stored in the ‘grastate.dat’ file from the ‘data_dir’ folder. There are two variables we need to look at: ‘safe_to_bootstrap’ - the last node to have exited from the cluster should have a value of ‘1’, and the ‘seqno’ number - which should be equal to the highest sequence number.”

Why can’t the Manticore daemon, which starts automatically with the OS, read the “grastate.dat” and then decide for itself whether to set “--new-cluster” for this service or not?

If that is not possible, you should at least be able to set “--new-cluster-force” through the config.

Or have I understood this wrong?
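
What I have in mind is roughly a start wrapper like the following (a sketch only: the data_dir path is an assumption from my setup, it ignores the ‘seqno’ comparison between nodes, and it is not an official Manticore feature):

#!/bin/sh
# hypothetical wrapper: bootstrap only if Galera marked this node safe to do so
GRASTATE=/var/lib/manticore/grastate.dat   # assumed data_dir, adjust to your config
CONF=/etc/manticoresearch/manticore.conf
if grep -q '^safe_to_bootstrap: 1' "$GRASTATE" 2>/dev/null; then
    exec /usr/bin/searchd --config "$CONF" --new-cluster
else
    exec /usr/bin/searchd --config "$CONF"
fi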

Replication is usually used for nodes in different locations, and it is a rare situation when all nodes go down.

However, if you run all the nodes on the same box and do not have network issues, you could use the Galera cluster option pc.bootstrap=1. You can read more about automatic cluster bootstrap and this option in the Galera documentation: Crash Recovery, and PC Recovery and Quorum Reset.
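
For the manual variant of the same idea, Case 5 sets that option at runtime on the node chosen as the reference, something like this (cluster name is a placeholder, default SQL port 9306 assumed):

# run on the node that should become the new primary; replace mycluster with your cluster name
mysql -h127.0.0.1 -P9306 -e "SET CLUSTER mycluster GLOBAL 'pc.bootstrap' = 1"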

However, you might end up with a split-brain situation: when the nodes restart and there is no network connectivity between them, you will get multiple independent clusters. Clients will not be informed about that and could write into separate clusters that do not see each other.

Of course, this rarely happens in production. But for a developer this is a normal situation. When I develop against the cluster, we use the VMs for this. Since all INSERT, REPLACE, DELETE, TRUNCATE commands must be written in the format cluster_name:index_name, we create at least two clusters so we can test under real conditions.

Therefore there should be an automatic mechanism, so that after a restart the complete cluster works again without manual intervention.

Replication, cluster name: adm:

If I test the replication in the VMs and take one VM out and bring it back in again, it works without problems.

If I stop all three VMs manually one after the other, the last VM has the expected status “safe_to_bootstrap: 1” in its “grastate.dat”:


# GALERA saved state
version: 2.1
uuid:    5ed22906-53d7-11ee-b0fa-eb215146403b
seqno:   2
safe_to_bootstrap: 1

If I now restart the Manticore service in this VM (192.168.0.61) with “manticore_new_cluster”, the log shows:

WARNING: cluster 'adm': invalid nodes ''(192.168.0.61:9312,192.168.0.62:9312,192.168.0.63:9312), replication is disabled, error: '192.168.0.62:9312': receiving failure (errno=111, msg=Connection refused)

Well, the other nodes are not up yet, but I had understood that the node with “safe_to_bootstrap: 1” should be started first.

I have also tried it the other way around: first booting all the other nodes, and then finally starting the first node with “manticore_new_cluster”. But this did not work either, with similar error messages.

I did everything as described in Case 3, but I can’t get the cluster to start up again.

What am I doing wrong?

Update:

With “/usr/bin/searchd --config /etc/manticoresearch/manticore.conf --new-cluster” it works and the cluster is available again.

With the command “manticore_new_cluster” as described in the documentation (for Ubuntu) it doesn’t work and error messages like “WARNING: cluster ‘adm’: invalid nodes …” appear in the logfile.

It seems that the “manticore_new_cluster” command is broken under Ubuntu.

This could be caused by: _ADDITIONAL_SEARCHD_PARAMS is not working · Issue #1403 · manticoresoftware/manticoresearch · GitHub
