Second node timed out on joining a cluster

We have two VMs. On the first one we create the cluster and fill an index with about 400,000 documents.

After this we start the second node (which is pre-configured to join the cluster on startup); however, it does not join, and we see the following lines in searchd.log:

[Sun Oct 4 20:17:53.091 2020] [849] WARNING: '192.168.107.11:9312': query timed out
[Sun Oct 4 20:17:53.091 2020] [849] WARNING: cluster 'manticore_cluster': no available nodes, replication is disabled, error: '192.168.107.11:9312': query timed out

When we try to join the node to the cluster manually, it takes about 2.5 minutes. Is there a way to increase the timeout for joining a cluster when starting the VM?

WARNING: '192.168.107.11:9312': query timed out

means that one node cannot query another node. Could you start your nodes with the searchd option --logreplication to collect more information about replication and track down the root cause of the error?
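For example, you can stop the daemon and restart it with the flag; the config path below is the usual package default and an assumption about your setup:

searchd --stopwait
searchd --config /etc/manticoresearch/manticore.conf --logreplication

The additional replication messages should then show up in searchd.log.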

When we try to join the node to the cluster manually, it takes about 2.5 minutes. Is there a way to increase the timeout for joining a cluster when starting the VM?

It is also not clear what takes the time in this case: the SST of all index files, or the CRC32 check of the index file chunks. It would help to start the daemon with --logreplication and check the messages in the daemon log.

Ok, we will provide you with logs if we experience the problem in the future.

I’ve uploaded logs here: https://pastebin.com/raw/0A1rvbG5

I restarted Manticore and waited for the node to join the cluster. After 15 minutes, cluster_manticore_cluster_local_state_comment changed from 'Joining: receiving State Transfer' to 'Initialized'.
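The join progress can be polled over the MySQL protocol; a minimal sketch, assuming the default SQL port 9306 on the joiner:

mysql -h 127.0.0.1 -P 9306 -e "SHOW STATUS LIKE 'cluster_manticore_cluster%'"

This lists the cluster status variables, including the local_state_comment mentioned above.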

Could you upload the same log from the donor node? Judging from the joiner node's log, it seems that the donor closes the connection or sends an empty sync command; not sure how that might happen.

First node - https://pastebin.com/raw/8Qd30DVn
Second node - https://pastebin.com/raw/rNN8wTUq

The message log at the first node is quite short, so it is not clear what happened between 17:05 and 17:10, when the JOINER node reports the SST failure.

The first message at the first node was at 17:08:49.787, and it already says '192.168.107.13:9312': connect timed out; it seems the JOINER had already been rejected by that time.

It seems something prevents the SST from the first node to the JOINER. We need more logs, or the full logs, from these nodes.

Ok, here is what we have. Nodes 192.168.107.12 and 192.168.107.13 are in the cluster. We created a new VM (192.168.107.11) with a clean Manticore installation and ran join cluster manticore_cluster at '192.168.107.13:9312'; on it. Detailed logs from each node are attached here:

192.168.107.11 - https://pastebin.com/raw/zNQMCR0b
192.168.107.12 - https://pastebin.com/raw/vhZeGacM
192.168.107.13 - https://pastebin.com/raw/xn49p2HM
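For reference, the join on the clean node is a single SQL statement over the MySQL protocol; a minimal sketch, assuming the default SQL port 9306:

mysql -h 192.168.107.11 -P 9306 -e "JOIN CLUSTER manticore_cluster AT '192.168.107.13:9312'"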

From the log files you provided, I see that node11 gets the SST (the RT index files) from node12,

and in the log of node12 I see:

12:44:51.077 2020] [848] DEBUG: AsyncNetInputBuffer_c::AppendData: error 110 (Connection timed out) return -1

However, the log at node11 ends at 12:25:40. Could you provide the log from node11 from 12:35 to 12:55 so we can search for more events at that node?

It is not clear why node12 reports a timeout.

Could you provide a reproducible case with Docker images or VM images, or a k8s setup that reproduces the issue locally?

Could you provide ls -lh of all the index files in your cluster?

Could you provide the log from node11 from 12:35 to 12:55 so we can search for more events at that node?

It is the same as at 12:25:

[Thu Oct 29 12:25:40.735 2020] [852] RPL: remote cluster command 3, client 192.168.107.12:55414

It loops over all the available ports. When it reaches the last available one, it starts again from the beginning, like RPL: remote cluster command 1, client 192.168.107.12:48894. So there is no useful info in the 192.168.107.11 logs after 12:25; it is all the same.

Could you provide a reproducible case with Docker images or VM images, or a k8s setup that reproduces the issue locally?

Our indexes contain sensitive data, so unfortunately we cannot provide them. The problem does not reproduce when the indexes are empty or only partially filled.

Could you provide ls -lh of all the index files in your cluster?

192.168.107.11 - https://pastebin.com/raw/mMJ4p7iX
192.168.107.12 - https://pastebin.com/raw/SnPKZWsC
192.168.107.13 - https://pastebin.com/raw/ZBPFAxsU

P.S. The index rzn is the biggest index in the cluster. Its size is about 6.6G, and it has 1,808,695 documents.

I see no info about the file sizes, only the folder names. Could you provide ls -lh of all the index files? I would like to see the index file names, their count, and their sizes.
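For example (the data directory below is just the common package default; use the path values from your config):

ls -lhR /var/lib/manticore/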

Could you also post the searchd section of the config for nodes 11, 12, and 13?

192.168.107.11 - https://pastebin.com/raw/hpujtPwh
192.168.107.12 - https://pastebin.com/raw/TV9Uwz6V
192.168.107.13 - https://pastebin.com/raw/fBCiYsqC

Could you also post the searchd section of the config for nodes 11, 12, and 13?

https://pastebin.com/raw/mn3sgJeQ

It is the same for all nodes, except for server_id and the IP addresses to listen on.
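For reference, the replication-relevant directives of such a searchd section typically look like the sketch below; the IP, ports, and paths are placeholders rather than the actual values from this setup:

searchd
{
    # MySQL-protocol endpoint for SQL clients
    listen = 9306:mysql41
    # node-to-node API endpoint (the :9312 seen in the logs above)
    listen = 192.168.107.11:9312
    # dedicated port range for replication traffic
    listen = 192.168.107.11:9360-9370:replication
    # must be unique on every node of the cluster
    server_id = 1
    log = /var/log/manticore/searchd.log
    pid_file = /var/run/manticore/searchd.pid
}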

I've just fixed the issue with replication SST for large indexes in master, commit c419a324.

You can use a package from our dev repo, once CI has passed for that commit, to check that the issue is fixed for you.

The problem does not reproduce anymore, thank you! I will let you know if anything goes wrong.