Search for similar documents

I have a database composed of small documents (circa 20-200 terms each), and I’d like to use Manticore to help find documents similar to a record already in the index. For example, I would supply Manticore with a document ID, and it would give me a ranked list of similar ones.

I currently use the QUORUM operator to achieve a pseudo-similarity search, but the results are frequently poor. In addition, this operator really hurts search performance, with queries taking up to one second to execute against my index of several million documents.

I see that Manticore recently released the K-nearest neighbor vector search feature, which is probably what I’m looking for. It’s documented here:

However, I’m confused about how I would go about implementing this. The documentation explains how to configure a table for KNN search, but then leaves us with this instruction:

After creating the table, you need to insert your vector data, ensuring it matches the dimensions you specified when creating the table.

How would I go about generating the vector data to insert?
I’m aware that there are multiple ways to represent documents in vector space, but which method does Manticore have in mind with the KNN search? I see that word embeddings such as Word2Vec are mentioned in the documentation. What would my workflow look like in integrating such a library with Manticore?

Any instructions, or a sample case, would be greatly appreciated! Thanks!

How would I go about generating the vector data to insert?

It depends on the programming language you want to use.
E.g. for PHP I would try

GitHub - ankane/onnxruntime-php: Run ONNX models in PHP

If you write in Python, it should be even easier, e.g. (JUST RANDOM EXAMPLES FROM CHATGPT):

import spacy

# Load the language model
nlp = spacy.load("en_core_web_md")  # or another model

# Process your text
text = "Your text goes here"
doc = nlp(text)

# Get a vector for the entire document
doc_vector = doc.vector

# Alternatively, get vectors for each word
word_vectors = [token.vector for token in doc]


from transformers import AutoTokenizer, AutoModel
import torch

# Initialize tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Tokenize and encode text
text = "Your text goes here"
inputs = tokenizer(text, return_tensors="pt")

# Generate embeddings
with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state

# Mean-pool the token embeddings to get a single vector per document,
# which is what you'd store in a KNN column
doc_vector = embeddings.mean(dim=1)
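Whichever library produces the embedding, the resulting vector then has to be turned into the SQL that Manticore expects. A minimal sketch of that last step (the table and column names here are made up for illustration; the `(v1,v2,...)` vector literal and the `knn()` clause follow the Manticore KNN documentation):

```python
# Sketch: turning an embedding (a plain list of floats) into the SQL
# statements for a Manticore table with a float_vector column.
# Table/column names ("docs", "doc_vector") are hypothetical.

def vector_literal(vec):
    """Format a float list as Manticore's (v1,v2,...) vector literal."""
    return "(" + ",".join(f"{v:.6f}" for v in vec) + ")"

def insert_stmt(table, doc_id, title, vec):
    """Build an INSERT for a table declared with a float_vector column."""
    safe_title = title.replace("'", "\\'")
    return (f"INSERT INTO {table} (id, title, doc_vector) "
            f"VALUES ({doc_id}, '{safe_title}', {vector_literal(vec)})")

def knn_stmt(table, vec, k=10):
    """Build a KNN query: the k nearest documents to the given vector."""
    return (f"SELECT id, knn_dist() FROM {table} "
            f"WHERE knn (doc_vector, {k}, {vector_literal(vec)})")

# In real use the vector would come from your embedding model
# (e.g. doc.vector in the spaCy example above), and the statements
# would be sent over Manticore's MySQL protocol (e.g. via pymysql).
embedding = [0.1, 0.2, 0.3, 0.4]
print(insert_stmt("docs", 1, "Your text goes here", embedding))
print(knn_stmt("docs", embedding, k=5))
```

To find documents similar to an existing record, you'd fetch (or recompute) that record's vector and pass it to the `knn()` query in the same way.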

Hi Sergey,

Thanks for the recommendation. It looks like this would be a major undertaking, well outside the scope of Manticore.

I hope that someday we might have document similarity features built right in. It would be very useful!

Yes, indeed having it built-in would make things much simpler. I’ve created a feature request about it - Automatic embeddings generation · Issue #1778 · manticoresoftware/manticoresearch · GitHub

Feel free to subscribe to it to be informed.


I’m very interested in this topic, and looking at using the KNN functionality for similarity searching.

But yes, it’s tricky to get going.

Perhaps the biggest decision is what model to use; there are many to choose from. There are many domain-specific models you could use, as well as ‘general purpose’ ones.
Even just picking the ‘text embedding model’, i.e. not running a full ML model, there are lots of options; as I understand it, lots of full models developed their own text embedding model.

Or in fact, you could just use ‘tokens’ with KNN, conceptually similar to the inverted index Manticore uses natively. But it probably won’t have very good accuracy.
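To illustrate the token idea: the crudest possible document vector is just term frequencies over a fixed vocabulary (a hypothetical sketch; a real setup would at least use TF-IDF or feature hashing to keep the dimensions bounded, and, as noted, accuracy would likely be poor):

```python
# Hypothetical sketch: a document as a term-frequency vector over a
# fixed vocabulary. Each vector could be stored in a float_vector
# column, but this is only a toy illustration of the "tokens" idea.

from collections import Counter

def tf_vector(text, vocabulary):
    """Map a document to term counts, one dimension per vocabulary term."""
    counts = Counter(text.lower().split())
    return [float(counts[term]) for term in vocabulary]

vocab = ["search", "vector", "index", "document"]
vec = tf_vector("vector search over a document index document by document", vocab)
print(vec)  # [1.0, 1.0, 1.0, 3.0]
```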

So any sort of ‘auto’ function would either need to be very opinionated and pick an embedding model (I think Elastic calls it a ‘sentence transformer’)
… or use a framework that allows use of different models.

Anyway, I suppose what I’m saying is that the ONNX runtime might be a bit over the top (you don’t need full training and inference for KNN).

It’s a sentence transformer that’s needed.

Yes, I’ve seen there are many pre-trained models which may work fine on general text. But I’ve got domain-specific, multi-language text, and I’m afraid that using these general models may not be helpful. So, I’m unsure if this is a project worth pursuing unless I want to invest a lot of effort to train my own model.

Yes, I was thinking about this, too. But, like you, I don’t know if it would gain us anything over and above what Manticore’s index offers now.

The sentence transformer you linked looks very interesting. They even have pre-trained multilingual models, which may work for me.
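Whatever model ends up producing the embeddings, the similarity search itself reduces to a nearest-neighbour comparison over the vectors, which is what Manticore's KNN index does at scale. A minimal cosine-similarity sketch, with tiny made-up vectors standing in for real model output:

```python
# Sketch: ranking documents by cosine similarity to a query document.
# The vectors here are made up; in practice they would come from a
# (possibly multilingual) sentence-transformer model, and Manticore's
# KNN index would do this comparison efficiently at scale.

import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

docs = {
    "doc1": [0.9, 0.1, 0.0],
    "doc2": [0.1, 0.9, 0.2],
    "doc3": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]  # the vector of the document we already have

ranked = sorted(docs, key=lambda d: cosine(docs[d], query), reverse=True)
print(ranked)  # ['doc1', 'doc3', 'doc2'] - most similar first
```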

Definitely. There are so many models out now and more will come. So, users should be able to pick the model they want to use. Manticore can make this easier and maybe even pick a default model (but it might not).