KNN vector search, documents always sorted by their distance

Hello.

The KNN vector search documentation states: “Documents are always sorted by their distance to the search vector. Any additional sorting criteria you specify will be applied after this primary sort condition.”

Why is there such a limitation? Why can’t I control the sorting as usual?
For example, I want the priority of records to always be taken into account, regardless of the relevance of the search. Let’s say we are talking about displaying products in an online store: I want the products that are in stock to always be displayed at the top of the list and the products that are out of stock to be displayed only after them.

For each record, I have a sort_order field that indicates the priority. In this case:
sort_order = 100 - products in stock;
sort_order = 300 - products that are not in the store and cannot be purchased.

As a result, the buyer sees the product they are looking for, but some of the products that cannot be ordered are displayed on the first screen, which is unnecessary information for them.

SELECT product_id, name, sort_order, WEIGHT(), knn_dist() FROM product_index_rt_64453318 WHERE knn ( index_vector, 20, (...) ) ORDER BY sort_order ASC, WEIGHT() DESC LIMIT 0, 20;

+------------+--------------------------------------+------------+----------+------------+
| product_id | name                                 | sort_order | weight() | knn_dist() |
+------------+--------------------------------------+------------+----------+------------+
|    1860046 | Television LG 75QNED86A6A            |        300 |        1 |   0.360425 |
|    1860052 | Television LG 75QNED80A6A            |        100 |        1 | 0.36158258 |
|    1878233 | Television OzoneHD 19HN82T2          |        300 |        1 | 0.37646562 |
|    1882064 | Television 75" LG NanoCell 4K 60Hz   |        100 |        1 | 0.37682074 |
|    1847746 | Television Samsung QE75QN80FAUXUA    |        100 |        1 | 0.38049912 |
|    1879042 | Television 50" LG QNED 4K 120Hz Smart|        100 |        1 | 0.38467413 |

I also cannot use knn_dist() in ORDER BY.
This causes an error. Perhaps it would solve the sorting problem in my case, if it were possible.

ORDER BY sort_order ASC, knn_dist() ASC, WEIGHT() DESC LIMIT 0, 20;
ERROR 1064 (42000): P01: syntax error, unexpected '(', expecting $end near '() ASC, WEIGHT() DESC LIMIT 0, 20'

This is because, when KNN search is used, the top-k documents are retrieved from an HNSW index based on their distance from the query vector. Only then may they be sorted by other criteria. There are two options. We can add pre-filtering to HNSW so that it selects only candidates that fulfill a certain condition (e.g. sort_order = 100); create a feature request if this is what you want. Or we can expose a distance function that bypasses the HNSW index and simply calculates the distance from every document to the query vector. That way your sort condition will work, but the query will likely become very slow.
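To make the pre-filtering idea concrete, here is a minimal Python sketch. It is not Manticore code: the toy data, the `knn` helper and its `predicate` parameter are all hypothetical, and an exact nearest-neighbour scan stands in for the approximate HNSW lookup.

```python
import heapq
import math

# Hypothetical toy data: (product_id, sort_order, vector).
# sort_order follows the thread: 100 = in stock, 300 = unavailable.
DOCS = [
    (1, 300, (0.0, 0.1)),
    (2, 100, (0.0, 0.2)),
    (3, 300, (0.0, 0.3)),
    (4, 100, (0.0, 0.4)),
    (5, 100, (0.0, 0.5)),
]

def knn(docs, query, k, predicate=None):
    """Return the k documents closest to query.

    Exact stand-in for an approximate HNSW lookup. If predicate is given,
    only documents passing it are considered (pre-filtering).
    """
    candidates = [d for d in docs if predicate is None or predicate(d)]
    return heapq.nsmallest(k, candidates, key=lambda d: math.dist(d[2], query))

QUERY = (0.0, 0.0)

# Plain KNN: the closest 3 documents, regardless of availability.
plain = knn(DOCS, QUERY, 3)

# Pre-filtered KNN: the closest 3 documents among those in stock.
prefiltered = knn(DOCS, QUERY, 3, predicate=lambda d: d[1] == 100)

print([d[0] for d in plain])        # [1, 2, 3]
print([d[0] for d in prefiltered])  # [2, 4, 5]
```

In this sketch the plain lookup returns two unavailable products among its three results, while the pre-filtered lookup returns only in-stock ones, because the condition is applied before the k closest candidates are chosen.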

Usually, in a DBMS, you first get a list of records according to specified conditions and then sort them.
Therefore, I assumed that Manticore works on the same principle. That is, we first obtain records using the WHERE knn(index_vector, 20, (…)) condition and then we can sort them by knn_dist(), as was available through the WEIGHT() function when searching using MATCH().

Ideally, this construction should work without any problems:
ORDER BY sort_order ASC, knn_dist() ASC

After all, knn_dist() is available in the SELECT list, but for some reason it is not available in ORDER BY.

Because, as explained above, WHERE knn(index_vector, 20, (…)) always applies ORDER BY knn_dist() ASC, ... first, and you can add any sort conditions of your own after it. However, matching returns only the top k documents. See Searching > KNN | Manticore Search Manual:

Documents are always sorted by their distance to the search vector. Any additional sorting criteria you specify will be applied after this primary sort condition. For retrieving the distance, there is a built-in function called knn_dist().

I.e.:

  • knn matching has primary sort by dist order
  • returns only k documents

You could follow the hybrid search feature request (Hybrid search · Issue #2079 · manticoresoftware/manticoresearch · GitHub) to get full-text matching along with KNN.

knn(…) is not a condition. It means “give me the k documents closest to the query vector”. It returns the approximate best k documents, which is essentially equivalent to calculating the distance to the query vector for every document and sorting by that distance. This means that sorting by knn_dist() is built into the knn(…) clause, and knn_dist() is returned only for those best k documents.

The code can be updated so that “ORDER BY sort_order ASC, knn_dist() ASC” works. However, it would sort only those k documents already found by the KNN search. I.e. if it finds k documents that all have sort_order = 300, sorting by sort_order does nothing. If you want this to work over the whole index, you would need to do a brute-force distance calculation and then sort by it. That means potentially calculating distances for all documents in a table, which is slow.
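The difference between re-sorting the k results and sorting the whole table can be illustrated with a small Python simulation. This is only a sketch under assumptions: the rows and both function names are hypothetical, the distances are given up front, and nothing here reflects Manticore internals.

```python
# Hypothetical toy rows: (product_id, sort_order, distance to the query vector).
ROWS = [
    (1, 300, 0.10),
    (2, 300, 0.20),
    (3, 100, 0.30),
    (4, 100, 0.40),
]

def knn_then_sort(rows, k):
    """Pick the k closest rows first, then re-sort only those k rows."""
    top_k = sorted(rows, key=lambda r: r[2])[:k]
    return sorted(top_k, key=lambda r: (r[1], r[2]))

def brute_force_sort(rows, k):
    """Sort every row by (sort_order, distance), then keep the first k."""
    return sorted(rows, key=lambda r: (r[1], r[2]))[:k]

print([r[0] for r in knn_then_sort(ROWS, 2)])     # [1, 2]
print([r[0] for r in brute_force_sort(ROWS, 2)])  # [3, 4]
```

With k = 2 the first approach never sees the in-stock products, because both of the closest rows have sort_order = 300. The second approach returns them, but only because it looked at the distance of every row.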