"Cross Product" between Percolate and Standard Index?

I’m guessing this is effectively a feature request, but is there an easy way to get a merge of the two types of index internally to Manticore? Just checking I’m not missing something.

i.e. get a list of documents in a particular document index that match the queries in a percolate index.

I know I can just do a ‘SELECT’ against the percolate index to get all the queries, and then run those queries against the normal index (effectively one by one).

… or the reverse, get each document from the normal index (subject to the text fields being ‘stored’!) and then run each against the percolate index (or at least in small batches).
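Something along these lines, perhaps (a rough sketch only, assuming a mysqli connection to the SphinxQL port and a single stored text field called content; the names, batch size and CALL PQ options are just placeholders to adapt):

<?php
// Sketch of the batched "reverse" approach: page through stdindex by id,
// build a small batch of JSON docs from the stored field(s), and feed each
// batch to CALL PQ. The field name 'content', batch size and escaping are
// placeholders; exact CALL PQ options can vary between Manticore versions.
$db = new mysqli('127.0.0.1', '', '', '', 9306);

$batchSize = 100;
$lastId    = 0;
while (true) {
    // paging by id keeps every SELECT small, so no huge max_matches is needed
    $res = $db->query("SELECT id, content FROM stdindex WHERE id > $lastId ORDER BY id ASC LIMIT $batchSize");
    if (!$res || $res->num_rows == 0)
        break;

    // one quoted JSON document per row for the CALL PQ docs list
    $docs = [];
    while ($row = $res->fetch_assoc()) {
        $lastId = (int)$row['id'];
        $json   = json_encode(['id' => $lastId, 'content' => $row['content']]);
        $docs[] = "'" . addcslashes($json, "'\\") . "'";   // naive escaping, fine for a sketch
    }

    // 1 as docs / 1 as query asks Manticore to report which docs matched which stored queries
    $pq = $db->query("CALL PQ('pqindex', (" . implode(',', $docs) . "), 1 as docs, 1 as query)");
    while ($match = $pq->fetch_assoc()) {
        // ... accumulate the matches / statistics here
    }
}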

The first is probably more efficient anyway: it only fetches the (relatively small) queries from Manticore, just to send them straight back, rather than having to fetch all the (typically large) documents.

It could be better to create a ticket at GitHub where we could discuss the syntax of the query and how to map the SELECT result set columns to CALL PQ documents.

The more I think about it, the more it sounds like something best done just in ‘application’ code, because then I can tweak the exact params; with some ‘magic’ query syntax there is always going to be a compromise.

While something like

 CALL PQ('pqindex', 'SELECT * FROM stdindex', 1 as docs, 1 as docs_query, 'id' as 'docs_id');

is relatively simple concept-wise (although * could perhaps be replaced with just the list of stored fields, plus the attributes referenced by the pqindex filters)

… it doesn’t sound like the most ‘efficient’ way to execute the overall process. It implies having to fetch and process all the documents, then run the queries against that inverted index. With millions of documents it could need a high max_matches on the inner query, for example.

(Although I admit I’m only guessing at how CALL PQ is actually executed! If CALL PQ just creates a ‘temporary’ RT index from the ‘docs’, then maybe it could use the stdindex files directly rather than that temporary index.)

So it’s application code:

// getAll/quote/query here are the application's own DB helpers (e.g. thin mysqli/PDO wrappers)
// Fetch every stored rule (query text plus filters) from the percolate index...
foreach (getAll("SELECT * FROM pqindex") as $row) {
    // ... and run it as a normal search against the document index
    $where = "MATCH(".quote($row['Query']).")";
    if (!empty($row['Filters']))
        $where .= " AND ".$row['Filters'];
    $result = query("SELECT id FROM stdindex WHERE $where OPTION ranker=none");
    //.. collect/aggregate the matching ids here
}

Coming up with SphinxQL syntax to do that as ‘one query’ seems very contrived.

It’s not quite clear what the scenario is for the first query you mentioned:

CALL PQ('pqindex', 'SELECT * FROM stdindex', 1 as docs, 1 as docs_query, 'id' as 'docs_id');

Why would someone need to process the whole RT index (a full scan) and push it through pqindex every time / on each invocation? As I see it there are no filters or match conditions, and as written it would push only max_matches / LIMIT rows, but I think you mean to push the whole table, i.e. fetch all the stored fields and then push them through pqindex.

It is not quite clear what the usage scenario for such a pipeline is.

Think of a ‘monthly’ report.

You have the normal PQ check of each document, one by one, as you insert it into stdindex.

But occasionally you might want to get statistics, to find how many documents match the percolate index overall.
It will be a slow process, but it is not run very often. Think of OLAP vs OLTP.

Yes, you can keep track of stats every time you insert a new document into the RT index (by saving the results of the CALL PQ), but the PQ index itself may be changing, with new queries being added (or updated) all the time.

I.e. I now want to find all documents that match any of the queries, in particular so I get documents created before the query was added to the percolate index.

You could also run each query against the documents when you add the query to the percolate index. But documents are being created/updated all the time as well, so running these full queries every time a document is updated and every time a percolate query is updated, and saving all the results, is a lot of work.

… so do it as an ‘offline’ batch process, to upload the bulk report.
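Something like this, roughly (a sketch only, again assuming mysqli over the SphinxQL port; for the report only the counts are needed, so COUNT(*) keeps the result sets tiny, and the Query/Filters column names depend on the Manticore version):

<?php
// Offline report sketch: loop over the stored rules and just count the matching
// documents per rule, so the documents themselves never leave Manticore.
$db = new mysqli('127.0.0.1', '', '', '', 9306);

$report = [];
$rules  = $db->query("SELECT * FROM pqindex");
while ($rule = $rules->fetch_assoc()) {
    $where = "MATCH('" . addcslashes($rule['Query'], "'\\") . "')";
    if (!empty($rule['Filters']))
        $where .= " AND " . $rule['Filters'];

    // only the count comes back, never the documents
    $count = $db->query("SELECT COUNT(*) FROM stdindex WHERE $where")->fetch_row();
    $report[$rule['Query']] = (int)$count[0];
}
// $report now maps each stored query to the number of documents matching it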

How it could look in terms of SQL syntax and general functionality:

CALL PQ extension

Just use values from an existing index rather than processing an array of given documents:

CALL PQ('pq', (SELECT <single_field> FROM any_index WHERE conditions), options);

It would be mandatory that the sub-select returns exactly one text/string column. Internally we would also get each document’s id, to display it if the ‘1 as docs’ option is enabled.

MATERIALIZED VIEW + PQ rules

In some other databases there’s what’s called a materialized view, i.e. you have a table and a query and you say “I want this query to be applied to this table continuously (or on demand) and the result put into another table”.
It looks very similar to the idea of this topic. But in our case we already have the concept of a PQ index, i.e. not one query but multiple, so it might look like this:

CREATE MATERIALIZED VIEW index_pq_stream AS
SELECT f text, s string, i int FROM source_index // leave only these fields in the MV
WHERE PQ(pq_index) // PQ(idx) is like MATCH(), but means "use all rules from this PQ index"
AND MATCH(...) AND another_attribute > 123 // defines additional filtering besides the PQ filtering, may be also useful for sharding to parallelize processing etc.

This functionality would be great to have, but given we are quite limited in development resources we would need to make sure that a significant number of users would benefit from this or there’s a company willing to sponsor the development.
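Until then, a rough application-side approximation of the materialized-view idea (everything here is a placeholder: the pq_matches table and its columns, the auto-generated insert id, and the PQ index column names, which vary between versions):

<?php
// Run every stored rule against the source index and write the (rule, document)
// pairs into an ordinary RT table, which then plays the role of the view.
$db = new mysqli('127.0.0.1', '', '', '', 9306);

$rules = $db->query("SELECT * FROM pqindex");
while ($rule = $rules->fetch_assoc()) {
    $where = "MATCH('" . addcslashes($rule['Query'], "'\\") . "')";
    if (!empty($rule['Filters']))
        $where .= " AND " . $rule['Filters'];

    $docs = $db->query("SELECT id FROM stdindex WHERE $where OPTION ranker=none");
    while ($doc = $docs->fetch_assoc()) {
        // assumes pq_matches already exists with rule_id/doc_id attributes and
        // that the server auto-generates the row id on INSERT
        $db->query("INSERT INTO pq_matches (rule_id, doc_id) VALUES ({$rule['id']}, {$doc['id']})");
    }
}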

@barryhunter
If you think the above makes sense and have some more detailed use cases in mind for it, I’d appreciate it if you elaborated on them.

Either of those would probably work, although I think I would want more than a single field in the CALL PQ extension, the same way that multiple fields can be specified in documents in JSON format.

If you haven’t seen any other requests, this sort of ‘bulk running’ of PQs may not be of that much general interest.

In my case I successfully implemented it as a ‘loop’ in the application. I had plenty of control that way anyway. In particular, the result set can be quite large overall (lots of documents matching many percolate queries), so running it as lots of small queries avoids passing around large datasets.