Getting different results for KNN query on replication nodes

I’ve found a weird issue. I have an RT index on a 3-node cluster, using the official Manticore Helm chart (v10.1.0).

But KNN queries return different results on each node

$ php scripts/runsphrt.php --namespace=dev "select id, user_id, realname, title, 1 as reference_index, knn_dist() from gridimage_embedding where knn(image_vector, 100, 181780) limit 10"
0: id, user_id, realname, title, reference_index, knn_dist()
0:   875248, 19979, Michael Trolove, House and Barn Conversion, Lingham Lane, 1, 0.08535016
0:   3662336, 4330, Dave Hitchborne, High Street, West Wycombe, 1, 0.08751625
0:   877002, 19979, Michael Trolove, Rectory Farm, Bullock Road, 1, 0.08935702
0:   1495549, 11627, andrew auger, Salisbury Road, Blandford Forum, 1, 0.08986270
0:   3635586, 4330, Dave Hitchborne, Main Road, Dowsby, 1, 0.09074509
0:   73190, 2215, Phil Williams, Lissett, 1, 0.09110427
0:   1724480, 4330, Dave Hitchborne, Cagthorpe, Horncastle, 1, 0.09228814
0:   437560, 4330, Dave Hitchborne, Road Junction at Halton Fenside, 1, 0.09278101
0:   1092160, 19979, Michael Trolove, Main Street Great Gidding, 1, 0.09309936

1: id, user_id, realname, title, reference_index, knn_dist()
1:   90948, 3141, Michael Graham, Tarn Pike o Blisco, 1, -0.00000024
1:   875248, 19979, Michael Trolove, House and Barn Conversion, Lingham Lane, 1, 0.08535016
1:   3662336, 4330, Dave Hitchborne, High Street, West Wycombe, 1, 0.08751625
1:   877002, 19979, Michael Trolove, Rectory Farm, Bullock Road, 1, 0.08935702
1:   1495549, 11627, andrew auger, Salisbury Road, Blandford Forum, 1, 0.08986270
1:   3635586, 4330, Dave Hitchborne, Main Road, Dowsby, 1, 0.09074509
1:   73190, 2215, Phil Williams, Lissett, 1, 0.09110427
1:   1724480, 4330, Dave Hitchborne, Cagthorpe, Horncastle, 1, 0.09228814
1:   437560, 4330, Dave Hitchborne, Road Junction at Halton Fenside, 1, 0.09278101
1:   1092160, 19979, Michael Trolove, Main Street Great Gidding, 1, 0.09309936

2: id, user_id, realname, title, reference_index, knn_dist()
2:   178386, 2215, Phil Williams, Craigmark, 1, -0.00000024
2:   875248, 19979, Michael Trolove, House and Barn Conversion, Lingham Lane, 1, 0.08534968
2:   3662336, 4330, Dave Hitchborne, High Street, West Wycombe, 1, 0.08751625
2:   877002, 19979, Michael Trolove, Rectory Farm, Bullock Road, 1, 0.08935738
2:   1495549, 11627, andrew auger, Salisbury Road, Blandford Forum, 1, 0.08986270
2:   3635586, 4330, Dave Hitchborne, Main Road, Dowsby, 1, 0.09074509
2:   64977, 2215, Phil Williams, The Post Office at Bathford, 1, 0.09110427
2:   1724480, 4330, Dave Hitchborne, Cagthorpe, Horncastle, 1, 0.09228814
2:   437560, 4330, Dave Hitchborne, Road Junction at Halton Fenside, 1, 0.09278101
2:   1092160, 19979, Michael Trolove, Main Street Great Gidding, 1, 0.09309953

This script just runs the query on each node separately and prints the result.
The results are very similar, but as you can see they are not identical. Document 875248 even has a different distance on node 2.
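
To make the differences easier to spot than by eyeballing the lists, here is a minimal sketch that diffs per-node result lists. It assumes the rows have already been parsed into (id, distance) pairs; the function name `diff_node_results` and the tolerance are my own choices, not part of Manticore or of the runsphrt.php script.

```python
from typing import Dict, List, Optional, Tuple

Row = Tuple[int, float]  # (document id, knn_dist() value)


def diff_node_results(results: Dict[str, List[Row]], tol: float = 1e-7):
    """Return (missing, mismatched) for per-node KNN result lists.

    missing:    doc id -> nodes whose result list lacks that id
    mismatched: doc id -> {node: distance} where distances differ by more than tol
    """
    lookup = {node: dict(rows) for node, rows in results.items()}
    all_ids = sorted({doc_id for rows in lookup.values() for doc_id in rows})

    missing: Dict[int, List[str]] = {}
    mismatched: Dict[int, Dict[str, Optional[float]]] = {}
    for doc_id in all_ids:
        per_node = {node: rows.get(doc_id) for node, rows in lookup.items()}
        absent = sorted(node for node, dist in per_node.items() if dist is None)
        if absent:
            missing[doc_id] = absent
        # Only compare distances on the nodes that actually returned the doc.
        present = [dist for dist in per_node.values() if dist is not None]
        if max(present) - min(present) > tol:
            mismatched[doc_id] = per_node
    return missing, mismatched


# First rows of each node's output from the post above.
node_results = {
    "0": [(875248, 0.08535016), (3662336, 0.08751625), (877002, 0.08935702)],
    "1": [(90948, -0.00000024), (875248, 0.08535016), (3662336, 0.08751625),
          (877002, 0.08935702)],
    "2": [(178386, -0.00000024), (875248, 0.08534968), (3662336, 0.08751625),
          (877002, 0.08935738)],
}

missing, mismatched = diff_node_results(node_results)
print("missing:", missing)
print("distance mismatch:", sorted(mismatched))
```

Note that with a LIMIT in the query, an id reported as missing on a node may simply have been pushed off the end of that node's list, so it is worth comparing with a larger limit as well.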

As far as I can see, the nodes are all in sync, and even the vector attributes appear to be identical on each node when queried directly.

Data was just inserted into a worker and allowed to replicate naturally. gridimage_embedding was added to the cluster before it was populated.

php scripts/runsphrt.php --namespace=dev "select id, user_id, realname, title, image_vector from gridimage_embedding where id = 90948 limit 10" 
0:                                       id: 90948
0:                             image_vector: -0.03217115,0.03704010,-0.03127992,0.02554078,-0.00850291,-0.03710286,-0.01276904,0.02601549,0.00510677,0.04268274,0.01847675,0.00375635,0.02503659,0.01960007,0.00948665,-0.04172490,0.07599994

1:                             image_vector: -0.03217115,0.03704010,-0.03127992,0.02554078,-0.00850291,-0.03710286,-0.01276904,0.02601549,0.00510677,0.04268274,0.01847675,0.00375635,0.02503659,0.01960007,0.00948665,-0.04172490,0.07599994

2:                             image_vector: -0.03217115,0.03704010,-0.03127992,0.02554078,-0.00850291,-0.03710286,-0.01276904,0.02601549,0.00510677,0.04268274,0.01847675,0.00375635,0.02503659,0.01960007,0.00948665,-0.04172490,0.07599994

I have checked further, and they are identical right to the end! I checked other docs, too. The output of the attribute is identical on every node!
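
Comparing 512 floats by eye is error-prone, so a fingerprint per node can help. This is only a sketch: it hashes the printed attribute text, the helper names are hypothetical, and identical text only shows that the values match at the printed precision, not that the stored floats are bit-identical.

```python
import hashlib
from typing import Dict


def vector_fingerprint(vector_text: str) -> str:
    """Hash the comma-separated vector text, ignoring whitespace around values."""
    normalized = ",".join(part.strip() for part in vector_text.split(","))
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


def nodes_agree(per_node: Dict[str, str]) -> bool:
    """True if every node printed the same vector for the document."""
    return len({vector_fingerprint(text) for text in per_node.values()}) == 1


# Shortened example values; paste the full image_vector output per node instead.
outputs = {
    "0": "-0.03217115,0.03704010,-0.03127992",
    "1": "-0.03217115, 0.03704010, -0.03127992",
    "2": "-0.03217115,0.03704010,-0.03127992",
}
print(nodes_agree(outputs))
```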

… My only assumption is that the HNSW ‘small world’ network has somehow been built differently on each node, so the links between items have ended up forming a different graph.
I wouldn’t mind if the results were comparable (just swapping similar results), but the results are of vastly different quality. That is hard to see in list form.
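
One way to put a number on "quality" is recall against an exact brute-force search over the exported vectors. The sketch below assumes knn_dist() for COSINE is 1 minus the cosine similarity (which fits the near-zero distance reported for the query document itself); the function names are hypothetical and this is plain Python, not Manticore code.

```python
import math
from typing import Dict, List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity. Raises ZeroDivisionError for an all-zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def exact_knn(vectors: Dict[int, Sequence[float]], query: Sequence[float], k: int) -> List[int]:
    """Exact nearest neighbours by scanning every vector (no HNSW graph involved)."""
    ranked = sorted(vectors, key=lambda doc_id: cosine_distance(vectors[doc_id], query))
    return ranked[:k]


def recall_at_k(returned_ids: Sequence[int], exact_ids: Sequence[int]) -> float:
    """Fraction of the exact neighbours that a node actually returned."""
    exact = set(exact_ids)
    return len(exact & set(returned_ids)) / len(exact)


# Toy 2-d data just to show the calls; real use would load the 512-d vectors.
toy_vectors = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 1.0], 4: [-1.0, 0.0]}
truth = exact_knn(toy_vectors, [1.0, 0.0], k=2)
print(truth, recall_at_k([3, 2], truth))
```

Running each node's returned ids through `recall_at_k` against the same exact list would show whether one node's graph is genuinely worse, rather than just ordering near-ties differently.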

CREATE TABLE gridimage_embedding (
id bigint,
title text,
grid_reference text,
realname text,
user_id integer,
title_vector float_vector knn_type='hnsw' knn_dims='512' hnsw_similarity='COSINE',
image_vector float_vector knn_type='hnsw' knn_dims='512' hnsw_similarity='COSINE'
)

The same issue happens on title_vector: the results are different from searching on the image vector, but they are still different between nodes.

Maybe some nodes have this document flushed to disk, and HNSW is built only for disk chunks, while another node still has that document in memory, where the KNN distance is calculated by hand (our code).

Could you check that by inserting that doc into a new temp index and calling knn_dist() for it, to check the KNN distance value for RAM segments, then flushing the data into a disk chunk and checking knn_dist() again for the HNSW / disk chunk?

If it differs, please create a ticket with the CREATE TABLE statement and the data to post, i.e. an MRE.

I dumped data from one node.
… deleted the whole deployment (deleted the helm release, as well as the pv claims)

Recreated the cluster with helm
… then imported the table back into one node.

Data is now identical on all nodes (i.e. test KNN queries return the same results). So I don’t know how to reproduce it at the moment!

The only difference I can see is that this time it was all imported in one go (piping the dump via the mysql client),
whereas the first time, my own PHP inserted the rows ‘piecemeal’ (inserted 1000 rows, ran some tests, then inserted some more, etc.). Each set might have gone to a different worker node. That seems to have replicated the data fine (all rows present), while the KNN index still ended up different.

I suppose I will have to try doing that again, but it will be harder to turn into a reproducible example!
I will have to log the queries as I go.

Edit: Just to confirm, I’ve now recreated the whole index with my original PHP code, and the resulting index seems fine. KNN queries return the same (good) results on all nodes.
… but I can’t guarantee that the rows were inserted into exactly the same nodes in the same sequence as originally. Possibly I avoided some sort of race condition this time, where originally rows were written to nodes while they were still replicating.

So for now I will write it off as a one-off corruption.