searchd gets stuck after a while

kino505 · November 22, 2023, 8:23am

Hello.
I have deployed Manticore 6.2.12 ARM64. My topology is:

MC-0: a dedicated server for initializing the Manticore cluster
A group of servers (based on an AWS ASG and a Network Load Balancer)

Situation 1:
The MC-0 server set up the replication cluster: CREATE CLUSTER MC… , CREATE TABLE favorite…, ALTER CLUSTER MC ADD favorite
The ASG contains one server.
That server executed the command:
echo "JOIN CLUSTER MC AT 'MC-0:9312'" | mysql -h127.0.0.1 -P36307

My application starts INSERT/UPDATE operations and the cluster works fine!

Situation 2:
When I add another server to the ASG, I run into trouble after a while: sometimes after a few hours, sometimes after a few minutes. One or both ASG members freeze. searchd.log does not show any errors.
I tried turning off the binlog and moving the binlog to a dedicated NVMe device. The cluster still gets stuck every time!
Meanwhile, the MC-0 server is fine and contains all my data. MC-0 never gets stuck!

My config on each server:
indexer
{
mem_limit = 512M
}

searchd
{
listen = 9312
listen = 36307:mysql41
listen = 9306:mysql
listen = 9308:http
listen = $NODE_IP:9320-9528:replication
log = /data/log/searchd.log
query_log = /data/log/query.log
query_log_min_msec = 1000
query_log_format = sphinxql

pid_file                        = /run/manticore/searchd.pid
data_dir                        = /data

rt_flush_period                 = 3600 # 1 hour
binlog_path                     = 
#binlog_flush                    = 2 #Flush every transaction and sync every second.

max_packet_size                 = 32M
net_workers                     = 4 # default is 1

sphinxql_state                  = uservars.sql


seamless_rotate                 = 1
unlink_old                      = 1
collation_server                = utf8_general_ci
watchdog                        = 1
max_filter_values               = 10000
persistent_connections_limit    = 256

}

Example of log:
[root@ip-10-2-14-128 log]# more searchd.log
[Tue Nov 21 19:28:50.159 2023] [1779] watchdog: main process 1780 forked ok
[Tue Nov 21 19:28:50.315 2023] [1780] starting daemon version ‘6.2.12 7b7275e2b@231107 (columnar 2.2.4 5aec342@230822) (secondary 2.2.4 5aec342@230822)’ …
[Tue Nov 21 19:28:50.316 2023] [1780] listening on all interfaces for sphinx and http(s), port=9312
[Tue Nov 21 19:28:50.316 2023] [1780] listening on all interfaces for mysql, port=36307
[Tue Nov 21 19:28:50.316 2023] [1780] listening on all interfaces for mysql, port=9306
[Tue Nov 21 19:28:50.316 2023] [1780] listening on all interfaces for sphinx and http(s), port=9308
[Tue Nov 21 19:28:50.348 2023] [1781] prereading 0 tables
[Tue Nov 21 19:28:50.348 2023] [1781] preread 0 tables in 0.000 sec
[Tue Nov 21 19:28:50.352 2023] [1780] accepting connections
[Tue Nov 21 19:28:52.855 2023] [1781] [BUDDY] started v1.0.18 ‘/usr/share/manticore/modules/manticore-buddy/bin/manticore-buddy --listen=http://0.0.0.0:9312 --threads=1’ at http://127.0.0.1:17661
[Tue Nov 21 19:28:52.869 2023] [1781] [BUDDY] Loaded plugins:
[Tue Nov 21 19:28:52.870 2023] [1781] [BUDDY] core: empty-string, backup, emulate-elastic, insert, select, show, cli-table, plugin, test, insert-mva
[Tue Nov 21 19:28:52.870 2023] [1781] [BUDDY] local:
[Tue Nov 21 19:28:52.870 2023] [1781] [BUDDY] extra:
[Tue Nov 21 19:29:21.319 2023] [1870] WARNING: Member 2.0 (node_10.2.14.128_MY_1781) requested state transfer from ‘any’, but it is impossible to select State Transfer donor: Resource temporarily unavailable

tomat · November 22, 2023, 8:29am

It seems your cluster lost its primary state. Our manual describes how to recover it: see Cluster_recovery

To investigate how the cluster got into a non-primary state, start all nodes with the --logreplication command-line option. Then, once the cluster goes from a working state to non-primary, provide searchd.log from all nodes for investigation.
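
To catch the moment a node leaves the primary component, the replication status variables can be polled over the MySQL port. The following is only a minimal sketch: it assumes the cluster name MC and port 36307 from the config above, `cluster_is_primary` is a hypothetical helper name, and it relies on the status variable following the `cluster_<name>_status` naming pattern with the value `primary` on a healthy node.

```shell
#!/bin/sh
# Hypothetical helper: reads "name<TAB>value" rows (the format printed by
# `mysql -N -B`) on stdin and succeeds only if cluster_<name>_status is "primary".
cluster_is_primary() {
    awk -F '\t' -v key="cluster_$1_status" '
        $1 == key { found = 1; if ($2 == "primary") ok = 1 }
        END { exit !(found && ok) }
    '
}

# Usage against a live node (assumes cluster MC and mysql port 36307):
# mysql -h127.0.0.1 -P36307 -N -B -e "SHOW STATUS LIKE 'cluster%'" \
#     | cluster_is_primary MC && echo "primary" || echo "NOT primary"
```

Running the check periodically on every node and noting the time of the first failure should make it easier to find the relevant part of searchd.log.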

kino505 · November 22, 2023, 10:01am

Thank you for the reply, I'll do that. Could you help me check my understanding of the Manticore replication cluster: can all cluster members be used for UPDATE/INSERT and SELECT operations? In other words, is the replication cluster active-active? Or was I wrong to build this infrastructure on an ASG + Load Balancer?

tomat · November 22, 2023, 10:10am

Yes, all nodes are active-active (master-master).

kino505 · November 22, 2023, 10:25am

Great! I am going to try the same configuration on x86_64.

kino505 · December 1, 2023, 6:54am

The same cluster based on x86_64 has no problems and has been working normally for several days without failures and errors. So, arm64 has a problem.

Sergey · December 1, 2023, 9:42am

We’d appreciate it if you filed a bug report on GitHub with instructions on how to reproduce the issue.