We have two VM. On the first one we create the cluster and fill some index with about 400.000 documents.
After this we start the second node (which is pre-configured to be joined to the cluster on starting up), however it is not joined and we see the following rows in the searchd.log:
[Sun Oct 4 20:17:53.091 2020] [849] WARNING: ‘192.168.107.11:9312’: query timed out [Sun Oct 4 20:17:53.091 2020] [849] WARNING: cluster ‘manticore_cluster’: no available nodes, replication is disabled, error: ‘192.168.107.11:9312’: query timed out
When we try to join the node to the cluster manually, it takes about 2.5 minutes. Is there a way to increase the timeout to join a cluster when starting VM?
means that one node can not query another node. Could you start your nodes with searchd option --logreplication to collect more information related to replication and check the root of the error?
When we try to join the node to the cluster manually, it takes about 2.5 minutes. Is there a way to increase the timeout to join a cluster when starting VM?
It is also not clear what takes time in this case SST of all index files or CRC32 check of index files chunks. It helps to start daemon with --logreplication and check messages at daemon log.
I restart Manticore and wait for the node to be joined to the cluster. After 15 minutes it changes cluster_manticore_cluster_local_state_comment from Joining: receiving State Transfer to Initialized.
could you upload same log from donor node as from joiner node it seems that donor closes connection or sends empty sync command - not sure how that might happen
at the first node message log is quite short that is why it is not clear what was between 17:05 and 17:10 then JOINER node reports SST failure
the first message at first node was at 17:08:49.787 and it is already said '192.168.107.13:9312': connect timed out seems JOINER already got rejected by that time
Ok, here is what we have. Nodes 192.168.107.12 and 192.168.107.13 are in the cluster. We created new VM (192.168.107.11) with clean Manticore and ran join cluster manticore_cluster at ‘192.168.107.13:9312’; there. Detailed logs from each node are attached here:
It looks over all available ports. When it reaches the last available one, it starts from the beginning like RPL: remote cluster command 1, client 192.168.107.12:48894. So, there is no useful info in 192.168.107.11 logs after 12:25. It is all the same.
could you provide reproducible case with docker images or VM images? or k8s setup that reproduces issue here locally?
There is sensitive data in our indexes. So, unfortunately, we cannot provide you with it. The problem is not reproduced when indexes are empty or partially filled.
could you provide ls -lh of all index files in your cluster?
I see no info about file sizes only folder names. Could you provide ls -lh of all index files? I would like to see index files names, count and files sizes.