I managed to crash a node on a cluster (I inserted too much data; the combined total of rt_mem_limit across the tables was larger than the container's resource limit, so the container was forcibly restarted by k8s).
Now the replication is broken. The current service IPs:
manticorert-worker-0: 10.72.38.198
manticorert-worker-1: 10.72.42.238
manticorert-worker-2: 10.72.45.210
manticorert-worker-2 is the node that was restarted, and I think it USED to have the IP 10.72.47.81.
… manticorert-worker-2 won't start, because it can't contact 10.72.47.81 - its old IP!
# php scripts/runsphrt.php "show status like 'uptime'" | grep Value
0: Value: 254864
1: Value: 254933
2: Value: 2334
Log from manticorert-worker-2:
[Wed Aug 24 14:19:26.878 2022] [42] WARNING: cluster 'manticore': no available nodes (10.72.47.81,10.72.43.101,10.72.45.210), replication is disabled, error: '10.72.47.81:9312': connect timed out;'10.72.43.101:9312': connect timed out
Frankly, I'm not sure what 10.72.43.101 is!
And each node has a different list of nodes:
# php scripts/runsphrt.php "show status like 'cluster%node%'"
0: Counter, Value
0: cluster_manticore_node_state, synced
0: cluster_manticore_nodes_set, 10.72.38.198,10.72.42.238,10.72.45.210
0: cluster_manticore_nodes_view, 10.72.42.238:9312,10.72.42.238:9315:replication,10.72.38.198:9312,10.72.38.198:9315:replication
1: Counter, Value
1: cluster_manticore_node_state, synced
1: cluster_manticore_nodes_set, 10.72.47.81,10.72.42.238,10.72.45.210
1: cluster_manticore_nodes_view, 10.72.42.238:9312,10.72.42.238:9315:replication,10.72.38.198:9312,10.72.38.198:9315:replication
2: success but zero rows returned
… so I intend to run UPDATE nodes on instances 0 and 1, and possibly promote one to master for bootstrap.
On 2, I will have to run JOIN CLUSTER. But as it already has local copies of all the indexes, won't JOINing fail? I guess I need to clear out the data folder so it can 'start fresh' (syncing data from either 0 or 1).
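Concretely, the plan above would be something like the following (a sketch only, using the standard Manticore replication statements; the cluster name manticore and the donor address 10.72.38.198:9312 are taken from the output above, and I haven't verified whether DELETE CLUSTER works on a node where replication is currently disabled):

```sql
-- On instances 0 and 1: rewrite the stored node list (nodes_set)
-- from the nodes currently visible in the cluster (nodes_view)
ALTER CLUSTER manticore UPDATE nodes;

-- On instance 2: drop the stale cluster definition (this keeps the
-- local tables), then re-join using a healthy node as the donor
DELETE CLUSTER manticore;
JOIN CLUSTER manticore AT '10.72.38.198:9312';
```

If DELETE CLUSTER isn't possible on the broken node, the alternative I'm considering is clearing its data folder and JOINing fresh, as described above.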