Server A created a posts cluster, but Server B failed to join the replication cluster

fcw · May 29, 2024, 6:03am

Server A created a posts cluster, but Server B failed to join the replication cluster. The failure log is as follows:
/* Wed May 29 13:47:05.391 2024 conn 5 */ JOIN CLUSTER posts AT '8.138.88.168:9312' # error=cluster 'posts', no nodes available(8.138.88.168:9312), error

manticore.conf:
searchd {
listen = 9312
listen = 9306:mysql
listen = 9308:http
listen = 0.0.0.0:9360-9369:replication
log = /var/log/manticore/searchd.log
query_log = /var/log/manticore/query.log
pid_file = /run/manticore/searchd.pid
data_dir = /var/lib/manticore
}

tomat · May 29, 2024, 7:15am

Are your nodes on the same network, in different data centers, or behind NAT?

Could you enable verbose replication logging on all nodes with the SphinxQL statement SET GLOBAL log_level = replication, and then provide the daemon logs from all nodes?

fcw · May 29, 2024, 8:31am

Nodes are on different Alibaba Cloud servers.

node A (8.138.88.168)
tcp 0 0 0.0.0.0:9360 0.0.0.0:* LISTEN 18668/searchd
tcp 0 0 0.0.0.0:9306 0.0.0.0:* LISTEN 18668/searchd
tcp 0 0 0.0.0.0:9308 0.0.0.0:* LISTEN 18668/searchd
tcp 0 0 0.0.0.0:9312 0.0.0.0:* LISTEN 18668/searchd

[Wed May 29 16:29:51.747 2024] [11014] DEBUG: P01: syntax error, unexpected identifier near 'JOIN CLUSTER posts '8.138.88.168:9312' as nodes'
[Wed May 29 16:29:51.763 2024] [11012] RPL: cluster 'posts' wait to finish
[Wed May 29 16:29:51.763 2024] [11012] RPL: cluster 'posts' finished, cluster deleted, lib (nil) unloaded

Sergey · May 29, 2024, 11:41am

Try setting node_address (see the Manticore Search Manual: Server settings > Searchd).

fcw · May 31, 2024, 1:23am

searchd {
listen = 9312
listen = 9306:mysql
listen = 9308:http
listen = 0.0.0.0:9360-9369:replication
log = /var/log/manticore/searchd.log
query_log = /var/log/manticore/query.log
pid_file = /run/manticore/searchd.pid
data_dir = /var/lib/manticore
node_address = 8.138.88.168
}
I have set node_address, but when server B tries to join the replication cluster at 8.138.88.168, an error is reported:
[Fri May 31 09:20:32.600 2024] [11014] DEBUG: P01: syntax error, unexpected identifier near 'JOIN CLUSTER posts AT '8.138.88.168:9312''
[Fri May 31 09:20:32.612 2024] [11012] RPL: cluster 'posts' wait to finish
[Fri May 31 09:20:32.612 2024] [11012] RPL: cluster 'posts' finished, cluster deleted, lib (nil) unloaded

Sergey · May 31, 2024, 4:14am

If you mean “DEBUG: P01: syntax error” - this is just a debug message meaning one of the parsers couldn’t parse the command, you can skip it or disable debug/replication logging (off by default).

fcw · May 31, 2024, 5:42am

I have set node_address:

+------------------------------------------+-------------------------------------------------+
| Counter                                  | Value                                           |
+------------------------------------------+-------------------------------------------------+
| command_cluster                          | 5                                               |
| cluster_name                             | posts                                           |
| cluster_posts_state_uuid                 | 51b15a99-1d7e-11ef-afeb-c6ffb78f4622            |
| cluster_posts_conf_id                    | 1                                               |
| cluster_posts_status                     | primary                                         |
| cluster_posts_size                       | 1                                               |
| cluster_posts_local_index                | 0                                               |
| cluster_posts_node_state                 | synced                                          |
| cluster_posts_nodes_set                  |                                                 |
| cluster_posts_nodes_view                 | 8.138.88.168:9312,8.138.88.168:9360:replication |
| cluster_posts_indexes_count              | 0                                               |
| cluster_posts_indexes                    |                                                 |
| cluster_posts_local_state_uuid           | 51b15a99-1d7e-11ef-afeb-c6ffb78f4622            |
| cluster_posts_protocol_version           | 9                                               |
| cluster_posts_last_applied               | 0                                               |
| cluster_posts_last_committed             | 0                                               |
| cluster_posts_replicated                 | 0                                               |
| cluster_posts_replicated_bytes           | 0                                               |
| cluster_posts_repl_keys                  | 0                                               |
| cluster_posts_repl_keys_bytes            | 0                                               |
| cluster_posts_repl_data_bytes            | 0                                               |
| cluster_posts_repl_other_bytes           | 0                                               |
| cluster_posts_received                   | 2                                               |
| cluster_posts_received_bytes             | 195                                             |
| cluster_posts_local_commits              | 0                                               |
| cluster_posts_local_cert_failures        | 0                                               |
| cluster_posts_local_replays              | 0                                               |
| cluster_posts_local_send_queue           | 0                                               |
| cluster_posts_local_send_queue_max       | 1                                               |
| cluster_posts_local_send_queue_min       | 0                                               |
| cluster_posts_local_send_queue_avg       | 0.000000                                        |
| cluster_posts_local_recv_queue           | 0                                               |
| cluster_posts_local_recv_queue_max       | 2                                               |
| cluster_posts_local_recv_queue_min       | 0                                               |
| cluster_posts_local_recv_queue_avg       | 0.500000                                        |
| cluster_posts_local_cached_downto        | 0                                               |
| cluster_posts_flow_control_paused_ns     | 0                                               |
| cluster_posts_flow_control_paused        | 0.000000                                        |
| cluster_posts_flow_control_sent          | 0                                               |
| cluster_posts_flow_control_recv          | 0                                               |
| cluster_posts_flow_control_interval      | [ 100, 100 ]                                    |
| cluster_posts_flow_control_interval_low  | 100                                             |
| cluster_posts_flow_control_interval_high | 100                                             |
| cluster_posts_flow_control_status        | OFF                                             |
| cluster_posts_cert_deps_distance         | 0.000000                                        |
| cluster_posts_apply_oooe                 | 0.000000                                        |
| cluster_posts_apply_oool                 | 0.000000                                        |
| cluster_posts_apply_window               | 0.000000                                        |
| cluster_posts_commit_oooe                | 0.000000                                        |
| cluster_posts_commit_oool                | 0.000000                                        |
| cluster_posts_commit_window              | 0.000000                                        |
| cluster_posts_local_state                | 4                                               |
| cluster_posts_local_state_comment        | Synced                                          |
| cluster_posts_cert_index_size            | 0                                               |
| cluster_posts_cert_bucket_count          | 2                                               |
| cluster_posts_gcache_pool_size           | 1320                                            |
| cluster_posts_causal_reads               | 0                                               |
| cluster_posts_cert_interval              | 0.000000                                        |
| cluster_posts_open_transactions          | 0                                               |
| cluster_posts_open_connections           | 0                                               |
| cluster_posts_ist_receive_status         |                                                 |
| cluster_posts_ist_receive_seqno_start    | 0                                               |
| cluster_posts_ist_receive_seqno_current  | 0                                               |
| cluster_posts_ist_receive_seqno_end      | 0                                               |
| cluster_posts_incoming_addresses         | 8.138.88.168:9312,8.138.88.168:9360:replication |
| cluster_posts_cluster_weight             | 1                                               |
| cluster_posts_desync_count               | 0                                               |
| cluster_posts_evs_delayed                |                                                 |
| cluster_posts_evs_evict_list             |                                                 |
| cluster_posts_evs_repl_latency           | 0/0/0/0/0                                       |
| cluster_posts_evs_state                  | OPERATIONAL                                     |
| cluster_posts_gcomm_uuid                 | bb21e757-1eeb-11ef-9761-128243f6c959            |
+------------------------------------------+-------------------------------------------------+

but server B still cannot join the replication cluster on 8.138.88.168, and an error is reported:
SQLSTATE[42000]: Syntax error or access violation: 1064 cluster 'posts', no nodes available(8.138.88.168:9312), error:
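
For readers following along, here is a minimal Python sketch of how the status counters above could be sanity-checked before retrying the join. It assumes the counters have already been loaded into a dict (how they are fetched is not shown), and it treats "primary" and "synced" as the healthy values because those are the values reported in this thread. The function names check_donor_status and advertised_endpoints are hypothetical helpers, not part of Manticore.

```python
def check_donor_status(counters, cluster):
    """Return a list of problems found in a node's SHOW STATUS counters.

    Assumption: 'primary' and 'synced' are the healthy values, as seen
    in the status output posted in this thread.
    """
    prefix = f"cluster_{cluster}_"
    problems = []
    status = counters.get(prefix + "status")
    if status != "primary":
        problems.append(f"cluster status is {status!r}, expected 'primary'")
    node_state = counters.get(prefix + "node_state")
    if node_state != "synced":
        problems.append(f"node state is {node_state!r}, expected 'synced'")
    return problems


def advertised_endpoints(nodes_view):
    """Split a nodes_view value into (host, port) pairs.

    A trailing ':replication' marker on an entry is ignored.
    """
    endpoints = []
    for item in nodes_view.split(","):
        item = item.strip()
        if not item:
            continue
        parts = item.split(":")
        endpoints.append((parts[0], int(parts[1])))
    return endpoints


# Counters copied from the status output in this thread.
donor = {
    "cluster_posts_status": "primary",
    "cluster_posts_size": "1",
    "cluster_posts_node_state": "synced",
    "cluster_posts_nodes_view": "8.138.88.168:9312,8.138.88.168:9360:replication",
}

print(check_donor_status(donor, "posts"))
# prints []
print(advertised_endpoints(donor["cluster_posts_nodes_view"]))
# prints [('8.138.88.168', 9312), ('8.138.88.168', 9360)]
```

With the values posted here the donor looks healthy on its own (size 1, primary, synced), which suggests the problem lies between the two servers rather than in the cluster state on 8.138.88.168.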

tomat · May 31, 2024, 6:47am

As I said, you need to enable verbose replication logging on all nodes with the SphinxQL statement SET GLOBAL log_level = replication, and then provide the full daemon logs from all nodes so we can investigate the issue further.

Sergey · June 1, 2024, 3:12am

I get this when I just can’t connect to the donor:

mysql> join cluster clustername at '127.0.0.1:10201';
ERROR 1064 (42000): cluster 'clustername', no nodes available(127.0.0.1:10201), error: '127.0.0.1:10201': retries limit exceeded

so make sure it’s not a connectivity issue. E.g. do this:

telnet 8.138.88.168 9312
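
If telnet is not available on the joining server, the same reachability test can be sketched in Python. This is only an illustration: port_is_open and check_ports are hypothetical helper names, and the port list (9312 plus the replication port 9360 that node A is shown listening on) is an assumption taken from the netstat output earlier in the thread.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_ports(host, ports, timeout=3.0):
    """Map each port to True (reachable) or False (not reachable)."""
    return {port: port_is_open(host, port, timeout) for port in ports}
```

Running check_ports("8.138.88.168", [9312, 9360]) from server B would report which of the two ports accept a connection. Like telnet, this only shows that a TCP connection can be opened; it does not prove that the replication handshake itself will succeed.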

fcw · June 1, 2024, 6:42am

[root@iZ94laeyoplZ bin]# telnet 8.138.88.168 9312
Trying 8.138.88.168...
Connected to 8.138.88.168.
Escape character is '^]'.
Connection closed by foreign host.

8.138.88.168:
tcp 0 0 0.0.0.0:9360 0.0.0.0:* LISTEN 20850/searchd
tcp 0 0 0.0.0.0:9306 0.0.0.0:* LISTEN 20850/searchd
tcp 0 0 0.0.0.0:9308 0.0.0.0:* LISTEN 20850/searchd
tcp 0 0 0.0.0.0:9312 0.0.0.0:* LISTEN 20850/searchd
Port 9312 is open

Sergey · June 1, 2024, 3:04pm

Is this the full error? Is there nothing after "error:"?

fcw · June 4, 2024, 1:34am

yes

Sergey · June 17, 2024, 3:08pm

If you can reproduce it, please create an issue on GitHub. We’d definitely want to fix it since an empty error is no good.

fcw · June 18, 2024, 3:48am

The problem has been resolved; it was caused by a version mismatch between the two servers. Thank you.
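
Since the cause turned out to be mismatched versions, a pre-check along these lines might save time before running JOIN CLUSTER. This is a rough sketch: how the version string of each node is collected is not covered in this thread, the helper names are hypothetical, and the version strings below are placeholders rather than real Manticore releases.

```python
from collections import defaultdict


def group_by_version(versions):
    """Group node names by their reported version string."""
    groups = defaultdict(list)
    for node, version in versions.items():
        groups[version.strip()].append(node)
    return dict(groups)


def versions_match(versions):
    """Return True when every node reports the same version string."""
    return len(group_by_version(versions)) <= 1


# Placeholder version strings, not taken from the thread.
nodes = {"server A": "version-1", "server B": "version-2"}

if not versions_match(nodes):
    for version, names in group_by_version(nodes).items():
        print(f"{version}: {', '.join(names)}")
```

The comparison is a plain string match on purpose, so any difference between the nodes is flagged and left to the operator to judge.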