Check which of search values are in MVA

drewblin · December 4, 2019, 1:49pm

I have index idx with full text field and mva.

Now i make select:
SELECT mva_field
FROM idx
WHERE MATCH(‘text’) AND mva_field IN (1,2,3,4)

Then i parse result in my program and get which of requested values (1,2,3,4) were found.
But i want to simplify processing - it can be found bilions of documents by several values.

So i wrote such select:
SELECT @groupby as _x
FROM idx
WHERE MATCH(‘text’) AND mva_field IN (1,2,3,4)
GROUP BY mva_field
HAVING _x IN (1,2,3,4)
OPTION max_matches=1000000000

Now i get in result only that values that i need. But manticore during processing should create greate result (after group by) and only then collapse it to sevelar values. And i should to use max_matches with very large value.

Maybe some one know more elegant way to solve my task?

Sergey · December 6, 2019, 10:36am

By this do you mean ANY(mva_field) IN (1,2,3,4) ?

It should give a warning if you use neither ANY() nor ALL().

drewblin · December 6, 2019, 11:00am

Yes, i mean ANY(mva_field)
I saw warning, but out of habit i regulary forget to write ANY

Sergey · December 6, 2019, 3:34pm

If you use a lower max_matches do you lose the accuracy? I can’t reproduce it.

MySQL [(none)]> select @groupby as _x, count(*) from idx where match('where') and cpi in (99,120,0,224) group by cpi option max_matches=10;
+------+----------+
| _x   | count(*) |
+------+----------+
|    0 |   508706 |
|  224 |    82539 |
|  120 |    89195 |
|   99 |      190 |
+------+----------+
4 rows in set, 1 warning (0.70 sec)

MySQL [(none)]> select @groupby as _x, count(*) from idx where match('where') and cpi in (99,120,0,224) group by cpi option max_matches=100000;
+------+----------+
| _x   | count(*) |
+------+----------+
|    0 |   508706 |
|  224 |    82539 |
|  120 |    89195 |
|   99 |      190 |
+------+----------+
4 rows in set, 1 warning (0.23 sec)

drewblin · December 6, 2019, 5:21pm

In your example index idx in field cpi has only 4 values - 0, 224, 120, 99.
But fill it with other values.
And such query will return not only values you mention in IN list. It will return all values from field cpi that are in matched documents.
And so max_matches will limit number of found rows

Sergey · December 9, 2019, 3:18pm

In your case how many different values do you have in the MVA?

drewblin · December 9, 2019, 3:24pm

About 10 values per record.
About 60000 unique values in index.
Common SELECT matches about 300 documents.
Size of IN list about 50-100 values.

barryhunter · December 10, 2019, 11:30am

Seems, if have about 60,000 unique values, then tat is a good value for max_matches.

1 billion is way too big. Search should be faster if set it realistically, rather than just arbitary large number.

drewblin · December 10, 2019, 12:52pm

Yes, i understand that. But this is small optimization. But i want to find best solution.

barryhunter · December 10, 2019, 2:33pm

ye, can’t think of a better way really. But the INTERSECT function seems plausible to help. Would in theory filter earlier in the process.
Alas seems UDFs can’t return a MVA, so can’t implement it that way.

Only trick can think that MIGHT work is to shard the MVA into separate attributes. run a seperate GROUP BY query against each attribute separately.

Because the number of combinations of all the rows, with a single attibute, balloons. As a worse case, in theory could be about 225k (75 * 300 * 10) - 75 is average between 50&100.
… but sharded intowo three attributes, each query would be 3.3 * 300 * 25 or 24,750. Would need three queries, but that is still only 74k rows.
In real queries the savings might be even better!

tomat · December 10, 2019, 2:39pm

INTERSECT could be implemented however that mean we need create new type of expressions that produces blobs with MVA type then handle this new type at different parts of expressions, sorters, groupers, filters and all these seems not just only an another expression but quite complete feature