Ideas how to 'rank' Percolate Results?

barryhunter · September 20, 2019, 4:17pm

I’m using the percolate function to try to find ‘related’ items, and it works pretty well.

But I only get an unordered list of matching queries, i.e. CALL PQ() returns the queries that matched, but they seem to be in plain id order.

It must of course be running the queries in the index against the supplied document, so it could theoretically compute a basic weight. Even a simple ‘word count’ might be good enough (it wouldn’t be able to use full IDF or similar ranking, because there are no corpus term frequencies: the percolate call only supplies the documents).

I.e. I want to find the ‘best’ queries that match the document, not just all of them. Say the top 10.

I can try to compute a weight myself (e.g. CALL PQ lists the matching queries, so I could re-extract the words and cross-reference them against the document(s), maybe even use CALL SNIPPETS). But it seems like something that could be done internally, as Manticore is doing all the tokenization already.
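A minimal sketch of that do-it-yourself approach, assuming the matched queries have already been fetched from CALL PQ as (id, query) pairs. The `tokenize` and `word_count_weight` helpers and the sample rows are hypothetical, and the regex split is only a crude stand-in for Manticore's own tokenization:

```python
import re

def tokenize(text):
    """Lowercase and split on non-alphanumerics (not Manticore's real tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def word_count_weight(query, document):
    """Count how many distinct query words occur in the document."""
    doc_words = set(tokenize(document))
    return sum(1 for word in set(tokenize(query)) if word in doc_words)

# Hypothetical rows, shaped like the (id, query) pairs a percolate call returns.
matched = [(1, "fresh apple"), (2, "orange pear tree")]
document = "dead apple tree near fresh pear and orange tree"

# Best-weighted queries first.
ranked = sorted(matched,
                key=lambda row: word_count_weight(row[1], document),
                reverse=True)
```

The obvious cost is that the script has to tokenize everything a second time, which is why doing it inside the percolate call looks attractive.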

tomat · September 21, 2019, 6:14am

It’s not quite clear how to weight queries in batched mode, i.e. when multiple documents are pushed.

barryhunter · September 26, 2019, 1:15pm

Ah yes, true, for a multi-document query it would be quite complex.

I’m just looking at single-document percolate right now.

tomat · September 26, 2019, 2:09pm

Yes, that way we could keep the regular query-to-document weight and return it in the result set, or sort by it.

Could you create a ticket on GitHub to track its progress?

Sergey · October 1, 2019, 10:46am

Hi @barryhunter

Can you please elaborate more on the use cases when ranked queries make sense?

barryhunter · October 1, 2019, 11:20am

I put a basic example in the GitHub feature request:

github.com/manticoresoftware/manticoresearch

Suggestion: get basic weight with CALL PQ() results

opened 03:44PM - 26 Sep 19 UTC

barryhunter

It would be nice, when running a percolate CALL PQ() query, to get a basic weight for each match. Manticore is of course running each of the queries in the percolate index against the supplied document(s), so it could compute a weight.

This would be useful to 'rank' or sort the returned queries (being able to sort/limit the results in weight order might be nice, but is not necessary). Ultimately the percolate index could contain many fuzzy queries (quorum, or with MAYBE), so I'd want to get only the 'best' matching queries, not necessarily all of them.

Even a simple ‘word count’ weight would probably be good enough (it wouldn’t be able to use full IDF or similar ranking, because there are no term frequencies for a document corpus: the percolate query is just supplying the few document(s)). But lccs (longest common contiguous subsequence) would be useful too.

A somewhat convoluted example:

```
INSERT INTO pqp(id,query) VALUES(1, 'fresh apple');
INSERT INTO pqp(id,query) VALUES(2, 'orange tree');
CALL PQ('pqp','dead apple trees near fresh pear and orange trees', 0 AS docs_json, 1 AS query, 'sum(word_count lccs)' as weight);
+----+-------------+------+---------+--------+
| id | query       | tags | filters | weight |
+----+-------------+------+---------+--------+
| 1  | fresh apple |      |         | 2      |
| 2  | orange tree |      |         | 4      |
+----+-------------+------+---------+--------+
```

In theory 'orange tree' is the better match, because it shares a phrase with the document. 'fresh apple' does still match, but not as a phrase.

If supplying multiple documents, I guess the weight could augment the 'documents' column, 1[34], 2[16] sort of thing: a weight per matching document.
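To make the two factors concrete, here is a rough client-side sketch of a word count and an lccs (the longest run of consecutive query words that also appears consecutively in the document). It is not how Manticore computes its ranking factors: the helper names are made up, and stripping a trailing "s" is only an assumed stand-in for whatever stemming the index uses.

```python
import re

def tokenize(text):
    """Lowercase, split on non-alphanumerics, strip a plural 's' (crude stemming assumption)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w[:-1] if len(w) > 3 and w.endswith("s") and not w.endswith("ss") else w
            for w in words]

def word_count(query, document):
    """Number of distinct query words present in the document."""
    doc_words = set(tokenize(document))
    return sum(1 for w in set(tokenize(query)) if w in doc_words)

def lccs(query, document):
    """Length, in words, of the longest contiguous word run shared by query and document."""
    q, d = tokenize(query), tokenize(document)
    best = 0
    prev = [0] * (len(d) + 1)  # prev[j]: shared run length ending at q[i-2], d[j-1]
    for i in range(1, len(q) + 1):
        cur = [0] * (len(d) + 1)
        for j in range(1, len(d) + 1):
            if q[i - 1] == d[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

document = "dead apple trees near fresh pear and orange trees"
for query in ("fresh apple", "orange tree"):
    print(query, word_count(query, document), lccs(query, document))
# fresh apple 2 1
# orange tree 2 2
```

Both queries match two words, but only 'orange tree' occurs as a phrase, so any weight that includes lccs puts it first.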

… but back to the use case: I’m building a ‘related articles’ function. A slightly simplified example: I have a site with articles about fruit trees, say ‘Fruit Trees’, ‘Apple Fruit Trees’, ‘Growing Espalier Apple Trees’ and ‘Pear Fruit Trees’.

I can put those titles as queries in a percolate index (even just as simple ‘and’ queries, but maybe ultimately as ‘quorum’ queries!).

So now I’m looking for documents ‘related’ to a new document, which is a general article about growing trees. All the above article titles could well match (e.g. it discusses growing apple trees)
… but the apple tree documents should ‘bubble’ up to the top.

Similarly, if focusing on a document about growing pear trees, a related article about pear trees should match highly (better than the one about apples, as the document uses the word Pear more than Apple!).

CALL PQ('pq', '{"title":"Growing Conference Pear Trees",
"content":"This page is all about growing excellent quality Conference Pears. Pears are a soft fruit and not always easy to grow, particularly in relation to Apples"}', 1 AS docs_json);

(Sorry, that is a deliberately convoluted example text!)

Both ‘Apple Fruit Trees’ and ‘Pear Fruit Trees’ will match from the percolate index, but I want the ‘pear’ one HIGHER, because it has more words in common: the searched document mentions ‘pear’ more than ‘apple’.
‘Fruit Trees’ would probably also show up after the more specific Pear one.

Ultimately there might be hundreds of actual matches from the percolate, and I may want to show only, say, the top 30. So I want to rank them.

I haven’t tested whether it will scale, but I have another site where the percolate index could hold hundreds of thousands of queries, so ranking them will be essential.
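Roughly what I have in mind, as a client-side sketch: weight each matched title by how often its words occur in the document, then keep only the top N. The function names are hypothetical, the trailing-"s" stripping is an assumed stand-in for real stemming, and the weight is a plain term-frequency sum rather than anything Manticore provides.

```python
import heapq
import re
from collections import Counter

def tokenize(text):
    """Lowercase, split on non-alphanumerics, strip a plural 's' (crude stemming assumption)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w[:-1] if len(w) > 3 and w.endswith("s") and not w.endswith("ss") else w
            for w in words]

def tf_weight(query, doc_counts):
    """Sum, over the distinct query words, of their occurrence counts in the document."""
    return sum(doc_counts[w] for w in set(tokenize(query)))

def top_matches(queries, document, n):
    """Return the n queries with the highest term-frequency weight, best first."""
    doc_counts = Counter(tokenize(document))
    return heapq.nlargest(n, queries, key=lambda q: tf_weight(q, doc_counts))

document = ("Growing Conference Pear Trees. This page is all about growing excellent "
            "quality Conference Pears. Pears are a soft fruit and not always easy to "
            "grow, particularly in relation to Apples")
titles = ["Fruit Trees", "Apple Fruit Trees", "Pear Fruit Trees"]

print(top_matches(titles, document, 2))
# ['Pear Fruit Trees', 'Apple Fruit Trees']
```

Here ‘pear’ occurs three times and ‘apple’ once, so the pear title scores 5, the apple title 3 and the generic ‘Fruit Trees’ 2.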

Sergey · October 1, 2019, 12:30pm

@barryhunter, got it. The idea is clear now, thanks. I haven’t thought about using PQ for “related documents” implementation. Here’s what I think:

In terms of quality I wouldn’t expect good results here, as even TF-IDF can’t be applied in this case.
TF-IDF/BM15/BM25 can be applied, and hence can give better ranking quality, if you just take each title in turn and use it as a query against @title in the same index. What drawbacks do you see in this approach? I’m afraid the optimizations that come with PQ will be useless anyway if you feed it just a single title rather than a batch of thousands.
Another approach (not implemented yet, though) would be to use not BM25 but a vector space model and the difference between a given title and all the other titles. I’m afraid that for titles alone it will give lower quality than plain BM25 (at least our tests showed that). It might give something interesting for title + description though (like ranking high a document which doesn’t share most of the words with the current document, but is still very relevant to it).

barryhunter · October 1, 2019, 1:00pm

I am wondering about taking the ‘queries’ (i.e. titles) returned by the percolate query and running them again against a normal document index:

SELECT WEIGHT() FROM documents WHERE MATCH('Apple Fruit Trees') AND id=12
SELECT WEIGHT() FROM documents WHERE MATCH('Fruit Trees') AND id=12
SELECT WEIGHT() FROM documents WHERE MATCH('Pear Fruit Trees') AND id=12

(repeating for each query; 12 is the id of the document that was just sent in the CALL PQ!)

… the main drawback with this: there might be thousands of results from the CALL PQ, so it’s running lots of separate small queries.

Whereas internally the CALL PQ has already ‘parsed’ the document and each query (to work out which match!), so some weights could be computed without lots of back and forth between the script and Manticore.

I was planning on using CALL SNIPPETS() to get a weight (sending the document and the query), but ultimately the above query should be faster, as it doesn’t have to reparse the document each time (the document is already in the forward ‘documents’ index).

Sergey · October 1, 2019, 1:05pm

I’m not getting why you can’t just do

SELECT id, title FROM documents WHERE MATCH('@title <title of a new document in fuzzy mode>')

?

barryhunter · October 1, 2019, 2:21pm

I can do that, but I need to send the ‘content’ of the article, not just the title (as in the example above, where Apple is only mentioned in the content)
… and that is in fact what the current version of the widget does. I’m trying to make a better one with Percolate!

I run into issues with the length of the document. I don’t remember the exact wording of the error, but it was something about the call stack.
… I often need to trim the content before making the query. I haven’t seen any restriction (other than packet length) on CALL PQ.
I need to make sure that all the words from the title match. I can get quite good results with something like
OPTION ranker=expr('sum(lcs)+(doc_word_count = title_len)*100')
(simplified, in practice I don’t just use lcs!)
… because when using ‘fuzzy’ (e.g. quorum) matching in the query, a document titled ‘apricot tree’ shouldn’t match, since there is no mention of the word ‘apricot’, even though the document and title have ‘tree’ in common. I can’t just use a quorum of, say, 2, as the lengths of the titles in the index vary. And a decimal quorum doesn’t work, as it is based on the length of the query, not the length of the field!
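The condition I’m after with that bonus term, written out as a plain client-side check rather than a ranker expression: keep a title only if every one of its words occurs in the new article. The helper names and sample data are made up, and the trailing-"s" stripping is an assumed stand-in for real stemming.

```python
import re

def tokenize(text):
    """Lowercase, split on non-alphanumerics, strip a plural 's' (crude stemming assumption)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w[:-1] if len(w) > 3 and w.endswith("s") and not w.endswith("ss") else w
            for w in words]

def title_fully_covered(title, document):
    """True only if every distinct title word occurs somewhere in the document."""
    return set(tokenize(title)) <= set(tokenize(document))

document = "notes on pruning an apple tree in winter"
titles = ["apple tree", "apricot tree", "pruning apple trees"]

# 'apricot tree' shares 'tree' with the document but is dropped: no 'apricot'.
kept = [t for t in titles if title_fully_covered(t, document)]
```

A fixed quorum cannot express this, because the required number of matching words is the length of each individual title.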

Phrase matches are also hard to do, i.e. only matching titles that occur as a phrase in the document. I don’t find lccs particularly useful.

Using percolate should allow for some negative queries (I’ve not fully explored this yet):

INSERT INTO pq(id,query) VALUES (88,'Eating Apples -Cooking')

to exclude that one from matching a document about Cooking Apples. With a forward query I can’t use the full matching syntax.
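The behaviour I expect from such a stored query, as a small sketch: plain terms are all required, and a term prefixed with "-" must be absent. This only imitates that one operator on the client side; the `matches` helper is hypothetical and the tokenization is a crude assumption, not Manticore’s query parser.

```python
import re

def tokenize(text):
    """Lowercase, split on non-alphanumerics, strip a plural 's' (crude stemming assumption)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w[:-1] if len(w) > 3 and w.endswith("s") and not w.endswith("ss") else w
            for w in words]

def matches(query, document):
    """AND semantics for plain terms; any term prefixed with '-' must not occur."""
    doc_words = set(tokenize(document))
    required, excluded = set(), set()
    for term in query.split():
        target = excluded if term.startswith("-") else required
        target.update(tokenize(term))
    return required <= doc_words and not (excluded & doc_words)

query = "Eating Apples -Cooking"
print(matches(query, "a guide to eating apples"))   # True
print(matches(query, "cooking and eating apples"))  # False
```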

Tweaking the queries manually to get better ‘cross matching’ is easier than trying to manually tweak the full document content.

Sergey · October 2, 2019, 10:18am

I see your point now. Thanks a lot!