I have an index with billions of rows. Every row has two very low-selectivity integer fields - country and city. Each country and city value may be present in 20-30% of the rows.
I know of two approaches to storing this data.
First - create integer fields and search by them as attributes.
Second - create a full-text field (geo), encode the values there, like 'co12 ci234', and search using MATCH.
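For concreteness, the two layouts could be sketched like this (a hedged example assuming Manticore-style CREATE TABLE syntax; table and attribute names are just illustrative):

```sql
-- Approach 1: plain integer attributes, filtered with WHERE country = ...
CREATE TABLE index1 (attr1 int, attr2 int, country int, city int);

-- Approach 2: a full-text field holding encoded values;
-- e.g. country=12, city=234 is stored as the text 'co12 ci234'
-- and filtered with MATCH('@geo co12')
CREATE TABLE index1_ft (attr1 int, attr2 int, geo text);
```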
So the question is: which approach performs better for selects like these:
select id, attr1, attr2 from index1 where country = 12 group by attr1 order by attr2 asc limit 100;
select id, attr1, attr2 from index1 where match('@geo co12') group by attr1 order by attr2 asc limit 100;
I'm not sure which is better - a full index scan, or a full-text match over 30% of the rows followed by looking those rows up in the index.
Having keys as indexed text is in general much faster (even by 2 orders of magnitude) than attribute filters, even at low selectivity. If the MATCH contains only key filters, add OPTION ranker=none, as there's no point in having the query calculate ranking.
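With that option applied, the second query from the question would look like this (ranker=none only skips the relevance calculation; which rows match is unaffected):

```sql
SELECT id, attr1, attr2 FROM index1
WHERE MATCH('@geo co12')
GROUP BY attr1
ORDER BY attr2 ASC
LIMIT 100
OPTION ranker=none;
```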
Maybe I'm misunderstanding something, but look at my results:
mysql> select count(*) from chara_addr_v2;
+----------+
| count(*) |
+----------+
| 6899519 |
+----------+
mysql> select count(*) from chara_addr_v2 where match('co2');
+----------+
| count(*) |
+----------+
| 5747887 |
+----------+
mysql> select count(*) from chara_addr_v2 where country = 2;
+----------+
| count(*) |
+----------+
| 5747887 |
+----------+
And here are the entries from the query log:
/* Wed Aug 19 20:08:17.369 2020 conn 9 real 0.322 wall 0.322 found 5747887 */ SELECT id FROM chara_addr_v2 WHERE country=2;
/* Wed Aug 19 20:08:27.791 2020 conn 9 real 1.575 wall 1.575 found 5747887 */ SELECT id FROM chara_addr_v2 WHERE MATCH('co2');
The conditions match('co2') and country = 2 match the same rows, yet here the attribute filter is roughly 5x faster (0.322s vs 1.575s).