Third node cannot replicate data

Good afternoon.
After yesterday's cluster crash I cannot join the third node.
There are some strange timeouts, although all the ports are open and were working before.
The tables are 10-15 GB in size.
The errors look roughly like this:

On the source node:
[Thu Sep 21 14:29:07.846 2023] [943499] RPL: calculated sha1 of table ‘rt_contractors_search’, files 507, hashes 42211880
[Thu Sep 21 14:29:07.846 2023] [943499] RPL: reserve table ‘rt_contractors_search’ at 1 nodes with timeout 900.000 sec
[Thu Sep 21 14:29:08.962 2023] [943499] RPL: reserved table ‘rt_contractors_search’ - ok
[Thu Sep 21 14:29:08.988 2023] [943499] RPL: sending table ‘rt_contractors_search’
[Thu Sep 21 14:29:08.988 2023] [943499] RPL: sending file rt_contractors_search.0.spi (7) to 10.30.0.168:9312, packets 1, timeout 120.000 sec
[Thu Sep 21 14:29:14.885 2023] [943499] RPL: ‘10.30.0.168:9312’ error when sending data: Broken pipe
[Thu Sep 21 14:29:14.885 2023] [943499] RPL: sending file rt_contractors_search.0.spidx (8) to 10.30.0.168:9312, packets 1, timeout 120.000 sec
[Thu Sep 21 14:29:15.018 2023] [943499] RPL: sending file rt_contractors_search.0.spm (9) to 10.30.0.168:9312, packets 2, timeout 120.000 sec
[Thu Sep 21 14:29:15.023 2023] [943499] RPL: sending file rt_contractors_search.0.spp (10) to 10.30.0.168:9312, packets 3, timeout 120.000 sec
[Thu Sep 21 14:29:26.306 2023] [943499] RPL: ‘10.30.0.168:9312’ error when sending data: Broken pipe
[Thu Sep 21 14:29:26.306 2023] [943499] RPL: sending file rt_contractors_search.0.spt (11) to 10.30.0.168:9312, packets 4, timeout 120.000 sec
[Thu Sep 21 14:29:26.319 2023] [943499] RPL: sending file rt_contractors_search.1.spa (12) to 10.30.0.168:9312, packets 5, timeout 120.000 sec
[Thu Sep 21 14:29:26.904 2023] [943499] RPL: sending file rt_contractors_search.1.spb (13) to 10.30.0.168:9312, packets 6, timeout 120.000 sec
[Thu Sep 21 14:29:32.773 2023] [943499] RPL: ‘10.30.0.168:9312’ error when sending data: Broken pipe
[Thu Sep 21 14:29:32.773 2023] [943499] RPL: sending file rt_contractors_search.1.spd (14) to 10.30.0.168:9312, packets 7, timeout 120.000 sec
[Thu Sep 21 14:29:33.874 2023] [943499] RPL: sending file rt_contractors_search.1.spds (15) to 10.30.0.168:9312, packets 8, timeout 120.000 sec
[Thu Sep 21 14:30:22.451 2023] [943499] WARNING: ‘10.30.0.168:9312’ error when sending data: Broken pipe;‘10.30.0.168:9312’ error when sending data: Broken pipe;‘10.30.0.168:9312’ error when sending data: Broken pipe;‘10.30.0.168:9312’ error when sending data: Broken pipe;‘10.30.0.168:9312’ error when sending data: Broken pipe;‘10.30.0.168:9312’ error when sending data: Broken pipe;‘10.30.0.168:9312’ error when sending data: Broken pipe
[Thu Sep 21 14:30:22.452 2023] [943499] RPL: 0(1) nodes finished well

On the receiver (the new node):
[Thu Sep 21 14:28:08.925 2023] [715077] RPL: /__w/manticoresearch/manticoresearch/build/galera-build/galera_populate-prefix/src/galera_populate/galera/src/replicator_str.cpp:prepare_state_request():604: State gap can’t be serviced using IST. Switching to SST
[Thu Sep 21 14:28:08.925 2023] [715077] RPL: /__w/manticoresearch/manticoresearch/build/galera-build/galera_populate-prefix/src/galera_populate/galera/src/replicator_str.cpp:prepare_state_request():606: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (a9dd5c12-5230-11ee-86fc-b691867bf737): 1 (Operation not permitted)
at /__w/manticoresearch/manticoresearch/build/galera-build/galera_populate-prefix/src/galera_populate/galera/src/replicator_str.cpp:prepare_for_IST():538. IST will be unavailable.
[Thu Sep 21 14:28:08.925 2023] [715077] RPL: /__w/manticoresearch/manticoresearch/build/galera-build/galera_populate-prefix/src/galera_populate/gcs/src/gcs.cpp:gcs_request_state_transfer():1817: ist_uuid[00000000-0000-0000-0000-000000000000], ist_seqno[-1]
[Thu Sep 21 14:28:08.925 2023] [715076] RPL: /__w/manticoresearch/manticoresearch/build/galera-build/galera_populate-prefix/src/galera_populate/gcs/src/gcs_group.cpp:group_select_donor():1354: Member 2.0 (node_10.30.0.168_prodmain01_685933) requested state transfer from ‘any’. Selected 0.0 (node_10.30.0.167_prodmain01_943323)(SYNCED) as donor.
[Thu Sep 21 14:20:43.291 2023] [686484] RPL: /__w/manticoresearch/manticoresearch/build/galera-build/galera_populate-prefix/src/galera_populate/galera/src/replicator_smm.cpp:process_trx():1404: Ignorng trx(487693) due to SST failure
[Thu Sep 21 14:20:43.291 2023] [686484] RPL: /__w/manticoresearch/manticoresearch/build/galera-build/galera_populate-prefix/src/galera_populate/galera/src/replicator_smm.cpp:process_trx():1404: Ignorng trx(487694) due to SST failure
[Thu Sep 21 14:20:43.291 2023] [686484] RPL: /__w/manticoresearch/manticoresearch/build/galera-build/galera_populate-prefix/src/galera_populate/galera/src/replicator_smm.cpp:process_trx():1404: Ignorng trx(487695) due to SST failure
[Thu Sep 21 14:20:43.291 2023] [686484] RPL: /__w/manticoresearch/manticoresearch/build/galera-build/galera_populate-prefix/src/galera_populate/galera/src/gcs_action_source.cpp:dispatch():142: Received SELF-LEAVE. Closing connection.
[Thu Sep 21 14:20:43.291 2023] [686484] RPL: new cluster membership: -1(0), global seqno: 0, status non-primary, gap 0
[Thu Sep 21 14:20:43.291 2023] [686484] RPL:
[Thu Sep 21 14:20:43.291 2023] [686484] RPL: /__w/manticoresearch/manticoresearch/build/galera-build/galera_populate-prefix/src/galera_populate/galera/src/replicator_smm.cpp:async_recv():461: Slave thread exit. Return code: 6
[Thu Sep 21 14:20:43.291 2023] [686484] RPL: receiver prodmain01 done, code 6, error in client connection, must abort
[Thu Sep 21 14:20:43.291 2023] [685913] FATAL: ‘prodmain01’ cluster after join error: ‘10.30.0.168:9312’ error when sending data: Broken pipe;‘10.30.0.168:9312’ error when sending data: Broken pipe;‘10.30.0.168:9312’ error when sending data: Broken pipe, nodes ‘10.30.0.166:9322,10.30.0.167:9320’
[Thu Sep 21 14:20:43.291 2023] [686484] DEBUG: Detached::RemoveThread called for 686484
[Thu Sep 21 14:20:43.291 2023] [685913] RPL: deleting cluster prodmain01
[Thu Sep 21 14:20:43.291 2023] [686484] DEBUG: Terminated thread 686484, ‘prodmain01_repl_0’

It copies 80-200 MB and then dies.
gcache is 4096M.

We need the logs from all the nodes, in a form where the timestamps can be matched across nodes and the events on each node compared. So far it is not clear why sending file to node 10.30.0.168:9312 fails with an error after about a minute, although the timeout is 120 sec. What is logged on node 10.30.0.168:9312 during that time?
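
A minimal sketch of how logs from several nodes could be lined up on one timeline for this kind of comparison. It assumes every entry starts with a timestamp in the format quoted in this thread and that the clocks on the nodes are synchronized. The names parse_log and merge_logs and the node labels "167" and "168" are hypothetical, chosen for illustration; they are not part of Manticore.

```python
import re
from datetime import datetime

# Timestamp prefix as seen in the quoted logs, e.g.
# "[Thu Sep 21 14:29:07.846 2023] [943499] RPL: ..."
LINE_RE = re.compile(r"^\[(\w{3} \w{3} +\d+ [\d:.]+ \d{4})\] (.*)$")
TS_FORMAT = "%a %b %d %H:%M:%S.%f %Y"


def parse_log(node, text):
    """Return (timestamp, node, message) tuples for one node's log.

    Lines without a timestamp are treated as continuations and are
    appended to the previous entry.
    """
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            ts = datetime.strptime(match.group(1), TS_FORMAT)
            entries.append([ts, node, match.group(2)])
        elif entries and line.strip():
            entries[-1][2] += " " + line.strip()
    return [tuple(entry) for entry in entries]


def merge_logs(logs):
    """Merge logs (dict: node label -> log text) into one sorted timeline."""
    merged = []
    for node, text in logs.items():
        merged.extend(parse_log(node, text))
    # Stable sort: entries with equal timestamps keep their per-node order.
    merged.sort(key=lambda entry: entry[0])
    return merged


# A few lines taken from the logs in this thread (quotes changed to ASCII).
donor = """[Thu Sep 21 14:29:08.988 2023] [943499] RPL: sending file rt_contractors_search.0.spi (7) to 10.30.0.168:9312, packets 1, timeout 120.000 sec
[Thu Sep 21 14:29:14.885 2023] [943499] RPL: '10.30.0.168:9312' error when sending data: Broken pipe"""
joiner = """[Thu Sep 21 14:29:14.074 2023] [685944] WARNING: failed to receive API body (client=10.30.0.167:38281(1292), exp=22380223(4332), error='Connection timed out')"""

timeline = merge_logs({"167": donor, "168": joiner})
for ts, node, message in timeline:
    print(ts.time(), node, message)
```

Sorting the merged entries this way puts the "failed to receive API body" warning on 168 between the "sending file" line and the "Broken pipe" line on 167, which is the kind of ordering the comparison needs.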

Here is the log from 168 for the same time:

[Thu Sep 21 14:29:08.459 2023] [685917] RPL: remote cluster command 1, client 10.30.0.167:60926
[Thu Sep 21 14:29:08.521 2023] [685917] RPL: reserve table ‘rt_contractors_search’
[Thu Sep 21 14:29:08.960 2023] [685917] RPL: remote cluster ‘prodmain01’ command 1, client 10.30.0.167:60926 - ok
[Thu Sep 21 14:29:14.074 2023] [685944] WARNING: failed to receive API body (client=10.30.0.167:38281(1292), exp=22380223(4332), error=‘Connection timed out’)
[Thu Sep 21 14:29:14.839 2023] [685922] RPL: switching disk file /var/lib/manticore/rt_contractors_search/rt_contractors_search.0.spi.new (-1>7), restart 0
[Thu Sep 21 14:29:14.999 2023] [685938] RPL: switching disk file /var/lib/manticore/rt_contractors_search/rt_contractors_search.0.spidx.new (-1>8), restart 0
[Thu Sep 21 14:29:15.023 2023] [685920] RPL: switching disk file /var/lib/manticore/rt_contractors_search/rt_contractors_search.0.spm.new (-1>9), restart 0
[Thu Sep 21 14:29:20.061 2023] [685943] WARNING: failed to receive API body (client=10.30.0.167:54280(1308), exp=14636075(4332), error=‘Connection timed out’)
[Thu Sep 21 14:29:25.566 2023] [685919] WARNING: failed to receive API body (client=10.30.0.167:62347(1318), exp=14636075(4332), error=‘Connection timed out’)
[Thu Sep 21 14:29:26.280 2023] [685938] RPL: switching disk file /var/lib/manticore/rt_contractors_search/rt_contractors_search.0.spp.new (-1>10), restart 0
[Thu Sep 21 14:29:26.317 2023] [685910] RPL: switching disk file /var/lib/manticore/rt_contractors_search/rt_contractors_search.0.spt.new (-1>11), restart 0
[Thu Sep 21 14:29:26.832 2023] [685937] RPL: switching disk file /var/lib/manticore/rt_contractors_search/rt_contractors_search.1.spa.new (-1>12), restart 0
[Thu Sep 21 14:29:27.959 2023] [685928] DEBUG: attrflush: doing the check
[Thu Sep 21 14:29:27.959 2023] [685928] DEBUG: attrflush: no dirty tables found
[Thu Sep 21 14:29:31.973 2023] [685910] WARNING: failed to receive API body (client=10.30.0.167:42863(1334), exp=17702870(4332), error=‘Connection timed out’)
[Thu Sep 21 14:29:32.736 2023] [685918] RPL: switching disk file /var/lib/manticore/rt_contractors_search/rt_contractors_search.1.spb.new (-1>13), restart 0
[Thu Sep 21 14:29:33.748 2023] [685917] RPL: switching disk file /var/lib/manticore/rt_contractors_search/rt_contractors_search.1.spd.new (-1>14), restart 0
[Thu Sep 21 14:29:35.163 2023] [685949] RPL: switching disk file /var/lib/manticore/rt_contractors_search/rt_contractors_search.1.spds.new (-1>15), restart 0
[Thu Sep 21 14:29:35.620 2023] [685938] RPL: switching disk file /var/lib/manticore/rt_contractors_search/rt_contractors_search.1.spe.new (-1>16), restart 0
[Thu Sep 21 14:29:35.627 2023] [685928] RPL: switching disk file /var/lib/manticore/rt_contractors_search/rt_contractors_search.1.sph.new (-1>17), restart 0
[Thu Sep 21 14:29:35.630 2023] [685925] RPL: switching disk file /var/lib/manticore/rt_contractors_search/rt_contractors_search.1.sphi.new (-1>18), restart 0
[Thu Sep 21 14:29:40.740 2023] [685936] WARNING: failed to receive API body (client=10.30.0.167:46206(1358), exp=47935680(7228), error=‘Connection timed out’)
[Thu Sep 21 14:29:41.708 2023] [685911] RPL: switching disk file /var/lib/manticore/rt_contractors_search/rt_contractors_search.1.spi.new (-1>19), restart 0
[Thu Sep 21 14:29:46.916 2023] [685940] WARNING: failed to receive API body (client=10.30.0.167:57849(1372), exp=19639170(5780), error=‘Connection timed out’)
[Thu Sep 21 14:29:47.736 2023] [685929] RPL: switching disk file /var/lib/manticore/rt_contractors_search/rt_contractors_search.1.spidx.new (-1>20), restart 0
[Thu Sep 21 14:29:47.794 2023] [685915] RPL: switching disk file /var/lib/manticore/rt_contractors_search/rt_contractors_search.1.spm.new (-1>21), restart 0
[Thu Sep 21 14:29:52.939 2023] [685938] WARNING: failed to receive API body (client=10.30.0.167:63383(1386), exp=44598424(1436), error=‘Connection timed out’)
[Thu Sep 21 14:29:53.889 2023] [685919] RPL: switching disk file /var/lib/manticore/rt_contractors_search/rt_contractors_search.1.spp.new (-1>22), restart 0
[Thu Sep 21 14:29:59.010 2023] [685926] WARNING: failed to receive API body (client=10.30.0.167:58209(1401), exp=1936719(8676), error=‘Connection timed out’)
[Thu Sep 21 14:29:59.519 2023] [685937] RPL: switching disk file /var/lib/manticore/rt_contractors_search/rt_contractors_search.1.spt.new (-1>23), restart 0
[Thu Sep 21 14:30:04.733 2023] [685923] WARNING: failed to receive API body (client=10.30.0.167:32982(1415), exp=88790872(2884), error=‘Connection timed out’)
[Thu Sep 21 14:30:10.833 2023] [685944] WARNING: failed to receive API body (client=10.30.0.167:34958(1430), exp=88790872(4332), error=‘Connection timed out’)
[Thu Sep 21 14:30:16.504 2023] [685914] WARNING: failed to receive API body (client=10.30.0.167:53432(1443), exp=88790872(4332), error=‘Connection timed out’)
[Thu Sep 21 14:30:22.216 2023] [685927] WARNING: failed to receive API body (client=10.30.0.167:61112(1451), exp=88790872(2884), error=‘Connection timed out’)
[Thu Sep 21 14:30:22.343 2023] [685911] RPL: remote cluster command 4, client 10.30.0.167:34713
[Thu Sep 21 14:30:22.343 2023] [685911] RPL: rotating table ‘rt_contractors_search’ content from /var/lib/manticore/rt_contractors_search/rt_contractors_search
[Thu Sep 21 14:30:22.379 2023] [685911] RPL: rolling-back table ‘rt_contractors_search’ into cluster ‘prodmain01’ from /var/lib/manticore/rt_contractors_search/rt_contractors_search
[Thu Sep 21 14:30:22.419 2023] [685911] RPL: remote cluster ‘prodmain01’ command 4, client 10.30.0.167:34713 - ok
[Thu Sep 21 14:30:22.453 2023] [685920] RPL: remote cluster command 5, client 10.30.0.167:49418
[Thu Sep 21 14:30:22.453 2023] [685920] RPL: join sync prodmain01, UID a9dd5c12-5230-11ee-86fc-b691867bf737:488595, sent failed, tables 4, ‘10.30.0.168:9312’ error when sending data: Broken pipe;‘10.30.0.168:9312’ error when sending data: Broken pipe;‘10.30.0.168:9312’ error when sending data: Broken pipe;‘10.30.0.168:9312’ error when sending data: Broken pipe;‘10.30.0.168:9312’ error when sending data: Broken pipe;‘10.30.0.168:9312’ error when sending data: Broken pipe;‘10.30.0.168:9312’ error when sending data: Broken pipe
[Thu Sep 21 14:30:22.453 2023] [685920] RPL: /__w/manticoresearch/manticoresearch/build/galera-build/galera_populate-prefix/src/galera_populate/galera/src/replicator_str.cpp:sst_received():70: SST request was cancelled

Could you show the output of the SphinxQL SHOW STATUS statement on the donor and joiner nodes, both before issuing the join on the joiner node and after the join statement has started running?

Sorry, but the problem was solved by reinstalling Manticore :)
PS: version 6.2.12