Different term order in query brings different results

rlattuad · January 4, 2021, 4:20pm

Hi,

I have these 2 queries:

SELECT id, name, weight() FROM index1 WHERE MATCH ('carbimazole macleods tablets 20mg');
+---------+-----------------------------------+----------+
| med_id  | commercial_name                   | weight() |
+---------+-----------------------------------+----------+
| 4650877 | Carbimazole Macleods Tablets 20mg |    10610 |
+---------+-----------------------------------+----------+
1 row in set (0.25 sec)

All is well and the result is correct. Then I run:

SELECT id, name, weight() FROM index1 WHERE MATCH ('carbimazole macleods 20mg tablets');
Empty set (0.00 sec)

Is this the correct result? Since I am not using the phrase operator ("") but the default AND operator, I would expect a different term order to change the weight, but the query should still return a result.

Am I missing something?

Thanks
Roberto

adrian · January 5, 2021, 7:54am

Do you have any word transformations, such as wordforms or regex filters?
Run EXPLAIN QUERY and CALL KEYWORDS on both query strings to see how each query string ends up after processing.

rlattuad · January 5, 2021, 4:28pm

Hi Adrian,
thanks for the reply, here is some further information:

I used CALL KEYWORDS as you suggested and this is what I get:
CALL KEYWORDS ('carbimazole macleods 20mg tablets', 'myHB_index1');
+------+---------------------+---------------------+
| qpos | tokenized           | normalized          |
+------+---------------------+---------------------+
| 1    | carbimazole         | carbimazole         |
| 2    | macleods20mgtablets | macleods20mgtablets |
+------+---------------------+---------------------+
Obviously the tokenization is incorrect.

I have a regex expression that standardizes ‘dosage dosage_units’ terms for medicines, so it converts ‘20 mg’ to ‘20mg’ (the standard format for me is without space between numbers and units).

This is the regex:
regexp_filter = (?i)(?:\s|\b)?(\pN*[.,]?\pN*)(?:\s|\b)?(mg/ml|ml/amp|mg|ml|ui|g|units|iu|mcg|µg)(?:\s|\b) => \1\2

I tested the regex and also use it in other apps, so it should work fine. Could the problem be the whitespace matched before and after the expression?

Is it possible that the regex parser does not handle non-capturing groups like (?:\s|\b) correctly?

Any ideas?

Thanks
Roberto

adrian · January 5, 2021, 9:25pm

Your regex matches the spaces before and after the dosage, and the replacement \1\2 does not put them back, so the captured text gets glued to the adjacent words, resulting in a single token.
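
The gluing effect can be reproduced outside the search engine. The following is a minimal sketch using Python's re module as a stand-in for the engine's regex library, which is an assumption: the two engines may differ in details. Python's re has no \pN class, so \d is substituted, and the names UNITS, PATTERN and apply_filter are made up for this illustration.

```python
import re

# Stand-in for the regexp_filter quoted in the thread.
# Assumption: \d replaces \pN, which Python's re module does not support.
UNITS = r"(mg/ml|ml/amp|mg|ml|ui|g|units|iu|mcg|µg)"
PATTERN = re.compile(
    r"(?i)(?:\s|\b)?(\d*[.,]?\d*)(?:\s|\b)?" + UNITS + r"(?:\s|\b)"
)


def apply_filter(text: str) -> str:
    # The replacement keeps only the number (group 1) and the unit (group 2).
    # Whitespace matched by the non-capturing groups is consumed and lost.
    return PATTERN.sub(r"\1\2", text)


print(apply_filter("carbimazole macleods tablets 20mg"))
# carbimazole macleods tablets20mg
print(apply_filter("carbimazole macleods 20mg tablets"))
# carbimazole macleods20mgtablets
```

Under these assumptions the second string collapses to the same two tokens that CALL KEYWORDS reported. The first string also loses a space, but if the same filter is applied to the indexed text, both sides produce identical tokens, which would explain why only the reordered query returned nothing.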

rlattuad · January 6, 2021, 11:08am

Hi Adrian,

I changed the non-capturing groups to capturing groups and added them to the result string (which now is \1\2\4\5) and all seems to be working fine.
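
As a rough illustration of that change, here is a sketch in Python's re module, again only a stand-in for the engine's regex library, with \d substituted for \pN. The names UNITS, FIXED and normalize_dosage are hypothetical. With all three separator groups capturing, group 1 is the leading separator, group 2 the number, group 3 the separator between number and unit, group 4 the unit, and group 5 the trailing separator.

```python
import re

# Assumption: \d replaces \pN, which Python's re module does not support.
UNITS = r"(mg/ml|ml/amp|mg|ml|ui|g|units|iu|mcg|µg)"
FIXED = re.compile(
    r"(?i)(\s|\b)?(\d*[.,]?\d*)(\s|\b)?" + UNITS + r"(\s|\b)"
)


def normalize_dosage(text: str) -> str:
    # Groups 1 and 5 restore the surrounding whitespace.
    # Group 3 (the space between number and unit) is deliberately dropped.
    return FIXED.sub(r"\1\2\4\5", text)


print(normalize_dosage("carbimazole macleods 20 mg tablets"))
# carbimazole macleods 20mg tablets
print(normalize_dosage("carbimazole macleods 20mg tablets"))
# carbimazole macleods 20mg tablets
```

In this sketch both spellings normalize to the same string and the neighbouring words stay separate tokens.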

Please note that I had added spaces before and after the match, but in the result string: ' \1\2 '. Evidently they were ignored or stripped. The parser keeps spaces between elements, as in '\1 \2', but ignores leading and trailing ones.

May I suggest allowing the regex output to be enclosed in quotes ('') so that spaces can be added at the beginning and end? Otherwise the only solution is to capture them as groups in the regex.

Thanks for your help
Roberto