Per Field match count?

barryhunter · May 16, 2025, 12:06pm

Quick sanity check: I take it there is no more efficient way of doing something like this?

SELECT COUNT(*) FROM sample WHERE MATCH('@tags keyword');
SELECT COUNT(*) FROM sample WHERE MATCH('@title keyword');
SELECT COUNT(*) FROM sample WHERE MATCH('@comment keyword');
SELECT COUNT(*) FROM sample WHERE MATCH('@realname keyword');

(I actually have over 10 fields to check, and it is complicated by the fact that sample is a distributed index.)

… CALL KEYWORDS() can of course be used for global matches, but I don't think a similar method exists for per-field matches? (And for a multi-word or phrase match etc., I would need to use MATCH() anyway.)

tomat · May 16, 2025, 12:28pm

You could enumerate all matched fields with the fieldmask ranker or the field_mask ranking factor:

select id, weight() as w from fld where match ( '( @(title,body) w1 w2 )' ) group by w option ranker='fieldmask';

The hard part is then splitting the uint weight, which is a bitset, into separate per-field groups.
Maybe BITDOT() could be used to move a particular bit into its own group, or the recently added HISTOGRAM() expression (see "Arrays and conditions functions" in the Manticore Search Manual). However, I doubt it, as we cannot map a single value into multiple groups.
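
To make the "weight as a bitset" reading concrete, here is a minimal sketch of unpacking a fieldmask weight client-side. It assumes that bit i of the weight corresponds to the i-th full-text field in the index schema, which is an assumption and not something confirmed in this thread. `matched_field_bits` is a hypothetical helper name, and Python is used only to keep the sketch self-contained.

```python
def matched_field_bits(weight: int) -> list[int]:
    """Return the positions of the set bits in a fieldmask weight.

    Assumption: bit i stands for the i-th full-text field of the index.
    """
    return [bit for bit in range(weight.bit_length()) if (weight >> bit) & 1]


# A weight of 5 is binary 101, so bits 0 and 2 are set.
print(matched_field_bits(5))  # [0, 2]
```

A document whose weight decodes to several bits matched in several fields at once, which is why a single GROUP BY on the weight cannot put it into more than one per-field group.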

barryhunter · May 16, 2025, 2:11pm

An interesting idea. I haven't really used the fieldmask ranker, but it seems quite useful.

Also, I wouldn't have thought of grouping by the weight, that is a neat trick!

Will need to reaggregate by ‘bit’ in the weight, so need the count

select id,weight() as w,count(*) from sample8 where match('@(title,tags,imageclass,comment,snippets,contexts,groups,terms) construction') group by w limit 1000 option ranker='fieldmask';
+---------+---------+----------+
| id      | w       | count(*) |
+---------+---------+----------+
| 2131999 | 1722752 |        1 |
| 2131871 | 1722624 |        1 |
| 2842725 | 1720704 |        2 |
| 2392448 | 1589504 |        2 |
| 2015373 | 1573120 |        2 |
| 6951603 | 1458560 |        8 |
| 4439732 | 1458432 |        9 |
| 3503622 | 1458304 |       12 |
| 7047168 | 1442176 |        1 |
| 6983140 | 1442048 |        3 |
| 7047145 | 1441920 |        1 |
| 3258184 | 1327488 |       31 |
| 3268752 | 1327360 |       50 |
| 2691327 | 1327232 |      138 |
....
|    4325 |     128 |     1638 |
+---------+---------+----------+
120 rows in set (0.045 sec)

Seems like it might work out reasonably efficient.

Thanks!

tomat · May 16, 2025, 2:20pm

Will need to reaggregate by ‘bit’ in the weight, so need the count

The issue is that with fields 1 and 2 the field mask can be 1, 2 or 3. Once you have the field_mask as w with a counter for each value, it is hard to figure out from w=3 which part of the counter relates to field 1 and which to field 2, i.e. it is hard to decompose the summed counters back into particular fields. For example:

w, count
1, 5
2, 10
3, 3

could be

w, count
1, 7
2, 11

or

w, count
1, 6
2, 12

You would need something like the MVA grouper, which enumerates all values of a single document and produces multiple groups:
field_mask ranker > weight > field_list(uint) > [field_index_x, field_index_y, …] > grouper_like_mva

but we lack operators that turn a single value into multiple values.
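
For what it is worth, the toy table above (w=1: 5, w=2: 10, w=3: 3) can be checked mechanically on the client side. The sketch below assumes that GROUP BY w places every matched document in exactly one group and that the per-group counts are exact. Under those assumptions a per-field total is the sum of the counts of every group whose w has that field's bit set. `per_bit_totals` is a hypothetical helper name, not a Manticore function.

```python
from collections import Counter


def per_bit_totals(groups: dict[int, int]) -> Counter:
    """Sum group counts per set bit; keys of the result are bit values (1, 2, 4, ...)."""
    totals = Counter()
    for mask, count in groups.items():
        bit = 0
        while mask >> bit:  # stop once no higher bits remain
            if (mask >> bit) & 1:
                totals[1 << bit] += count
            bit += 1
    return totals


toy = {1: 5, 2: 10, 3: 3}  # w -> count(*), the numbers from the example above
print(dict(per_bit_totals(toy)))  # {1: 8, 2: 13}
```

The w=3 group is counted once for each of its two bits, so the per-field totals (8 and 13) add up to more than the 18 matched documents. That is expected, because a document matching in both fields belongs to both totals.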

barryhunter · May 16, 2025, 2:43pm

Decoding the ‘w’ certainly seems to work in this PHP code:

https://gist.github.com/barryhunter/bf3fab3a19f24b8147cf3753687d6798

$cols = $sph->getAll("DESCRIBE sample8A");

$fields = array();
foreach ($cols as $row) {
        switch($row['Type']) {
                case 'field': $fields[] = $row['Field']; break;
        }
}

$query = "

(The gist preview is truncated at this point; the full file is at the link above.)

(only prototype quality!)

7 realname = 40056
8 title = 264890
9 comment = 273079
12 imageclass = 56767
18 tags = 80367
19 groups = 257587
20 terms = 119762
21 snippets = 8253
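
Since the gist preview above is cut off, here is a separate minimal sketch of the same re-aggregation idea. It is not the code from the gist. It assumes that bit i of the fieldmask weight maps to the i-th entry of the field list in schema order. The rows are three (w, count(*)) pairs copied from the result set earlier in the thread, while the `field0` … `field20` names and the function name `per_field_counts` are placeholders.

```python
def per_field_counts(rows, fields):
    """Re-aggregate (weight, count) groups into per-field match counts.

    rows   -- iterable of (fieldmask weight, count(*)) pairs from GROUP BY w
    fields -- field names in schema order (assumed: bit i <-> fields[i])
    """
    counts = [0] * len(fields)
    for weight, count in rows:
        for bit in range(len(fields)):
            if weight & (1 << bit):
                counts[bit] += count
    # Keep only fields that matched at least once.
    return [(bit, fields[bit], n) for bit, n in enumerate(counts) if n]


rows = [(1327232, 138), (1327360, 50), (128, 1638)]  # subset of the real output
fields = [f"field{i}" for i in range(21)]            # placeholder names

for bit, name, n in per_field_counts(rows, fields):
    print(f"{bit} {name} = {n}")
```

With only these three rows the totals cover a small part of the result set, so they are not comparable to the full per-field numbers listed above.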