Ranking based on unmatched words

gillux · June 2, 2023, 6:29pm

I am looking for advice on how to rank results in a very particular way: based on the index-level frequency of the words in a matched document that did not match the search. Let me explain my use case.

My indexed documents are all very short, just one or two sentences. In fact, they are all example sentences. I use Manticore to find example sentences that contain a particular word and it works great.

However, as more and more example sentences are being added, I am looking for a clever way to rank the results in order to bring the most useful example sentences to the top. So what makes an example sentence useful? It depends on many factors, but here is one I think Manticore may be able to help with: word frequency. When learning a language, you learn the most frequent words first. So if you are learning the language of those example sentences, chances are a sentence composed of frequent words will be easier to understand, so I want to rank it up. I am talking about frequency across the whole index, not within the document.

Assuming that word frequencies in my documents reflect actual language use, I would like a ranker that compares the frequency of the searched keyword with the frequency of all the other, unmatched words of the matched document. I looked in Manticore’s documentation and even in the source code, but I haven’t found a way, even with a custom ranker.

Any help appreciated.

barryhunter · June 4, 2023, 9:19am

Well, I don’t think you could do this with a ‘ranking expression’ on its own anyway. When evaluating the ranking of a single document, it only has data on the matched words; it knows nothing about the ‘rest’ of the words. Read up on how an ‘inverted index’ works.
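To illustrate the point, here is a toy inverted index in Python. It is only a sketch of the general idea, not Manticore's actual data structures; `build_inverted_index` and the sample sentences are made up for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

# Hypothetical example sentences keyed by document id.
docs = {
    1: "the cat sat on the mat",
    2: "a cat chased the mouse",
    3: "the dog barked",
}
index = build_inverted_index(docs)

# A search for "cat" reads only the posting list for "cat".
# Nothing in that list says which other words documents 1 and 2 contain.
matches = sorted(index["cat"])
print(matches)
```

The lookup goes from word to documents, never from document to words, which is why the ranker sees only the matched keywords.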

But ranking expressions do have access to the attributes of each document. So you could ‘precompute’ the popularity rating of each document and store it in an attribute, then use that during ranking to affect the final score.

It would be some sort of ‘aggregate’ score for the whole document/sentence, not specifically for the ‘unmatched’ words. But there is already ranking for the matched words anyway (e.g. IDF).

You might need some sort of ‘stopwords’, e.g. don’t use the ‘really popular’ words (like ‘of’ and ‘the’), to avoid distorting the score.
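A rough sketch of that precompute step in Python, run before indexing. The scoring formula (mean log frequency of the non-stopword words), the stopword list, and all the names here are assumptions chosen for illustration; tune them to your data.

```python
import math
from collections import Counter

# Hypothetical stopword list; in practice, take the top-N most frequent words.
STOPWORDS = {"of", "the", "a", "on"}

def tokenize(text):
    """Naive whitespace tokenizer; a real setup should mirror the index's tokenizer."""
    return text.lower().split()

def corpus_frequencies(docs):
    """Count how often each word occurs across all sentences."""
    counts = Counter()
    for text in docs.values():
        counts.update(tokenize(text))
    return counts

def popularity_score(text, counts, stopwords=STOPWORDS):
    """Mean log frequency of the sentence's non-stopword words (0.0 if none remain)."""
    words = [w for w in tokenize(text) if w not in stopwords]
    if not words:
        return 0.0
    return sum(math.log(1 + counts[w]) for w in words) / len(words)

# Hypothetical example sentences keyed by document id.
docs = {
    1: "the cat sat",
    2: "the cat ran",
    3: "the zebra ran",
}
counts = corpus_frequencies(docs)
scores = {doc_id: popularity_score(text, counts) for doc_id, text in docs.items()}

# Sentences made of more frequent words get a higher score.
for doc_id in sorted(scores, key=scores.get, reverse=True):
    print(doc_id, round(scores[doc_id], 3))
```

Each score would then be stored as a float attribute on its document, so the ranking expression or the sort order can use it. Since the frequencies shift as sentences are added, the scores need recomputing from time to time.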