Regularization of scoring with high variance

I’ve been working on a new manticore search query for a string matching project that I am doing for fun:

SELECT id, WEIGHT() as sim FROM grwghts WHERE match('ocean freight visibility, freightliner logistic software, supply chain software as a service') ORDER BY WEIGHT() DESC LIMIT 3000 OPTION ranker=expr('top(lcs)*1000+bm25'), max_matches=3000;

The issue is that a lot of my scores go into the thousands, but I'm trying to build a similarity algorithm where similarity scores fall between 0.7 and 1. I tried a few normalization techniques, but due to the differences in magnitude across the results, I can't get a decent distribution without writing a custom min-max function. Is there any computationally inexpensive normalization built in that can focus on top(lcs) and give a better distribution of scores (perhaps some form of Gaussian distribution is what I'm asking for, I guess)?
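For context, the custom min-max rescaling I'm trying to avoid looks roughly like this sketch (my own helper, nothing Manticore-specific; it rescales each result set's raw weights into [0.7, 1.0]):

```python
def minmax_rescale(scores, lo=0.7, hi=1.0):
    """Rescale a result set's raw weights into [lo, hi] via min-max."""
    mn, mx = min(scores), max(scores)
    if mx == mn:
        # all scores identical: map everything to the top of the range
        return [hi] * len(scores)
    return [lo + (hi - lo) * (s - mn) / (mx - mn) for s in scores]
```

The drawback is that the mapping depends on the min and max of each particular result set, so the same document can get a different similarity score depending on what else matched.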

Thanks

Is it possible at all with top(lcs)? To normalize it, you have to know the max lcs, which depends on your query and your docs. But if you can, then something like (top(lcs)/12*1000+bm25)/2 can probably help you normalize:

  • /12 - to normalize by the max lcs (this query has 12 keywords, so the max lcs here is 12)
  • *1000 - to align with the bm25 range
  • /2 - to normalize the sum back to "up to 1000"
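As a quick sanity check on the arithmetic, the expression can be mirrored in a few lines (the bm25 value here is not taken from Manticore; it is inferred by working backwards from the 741 weight in the session below):

```python
def ranker(lcs, bm25, max_lcs=12):
    """Mirror of the expression (top(lcs)/max_lcs*1000 + bm25) / 2,
    truncated to an int since Manticore returns integer weights."""
    return int((lcs / max_lcs * 1000 + bm25) / 2)

# exact match: lcs = 12 (full query matched in order), inferred bm25 ~ 482
assert ranker(12, 482) == 741
```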

Then in your app you can divide by 1000 and you’ll get 0.741 and 0.408 in this case:

mysql> drop table if exists grwghts;
Query OK, 0 rows affected (0.01 sec)

mysql> create table grwghts(f text);
Query OK, 0 rows affected (0.00 sec)

mysql> insert into grwghts(f) values('a'),('ocean freight visibility, freightliner logistic software, supply chain software as a service'),('ocean freight visibility, logistic freightliner software, service supply chain software as a');
Query OK, 3 rows affected (0.01 sec)

mysql> SELECT id, WEIGHT() as sim FROM grwghts WHERE match('ocean freight visibility, freightliner logistic software, supply chain software as a service') ORDER BY WEIGHT() DESC LIMIT 3000 OPTION ranker=expr('(top(lcs)/12*1000+bm25)/2'), max_matches=3000;

+---------------------+------+
| id                  | sim  |
+---------------------+------+
| 1515870511710601272 |  741 |
| 1515870511710601273 |  408 |
+---------------------+------+
2 rows in set (0.00 sec)
--- 2 out of 2 results in 0ms ---

If you want to normalize this further, from 0 - 1000 to 700 - 1000, I think that's also easier to do in the app. I believe you can't get the range from 0.7 to 1 right out of the box, since Manticore returns weights as integers.
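The app-side step could be a one-liner; here is a minimal sketch, assuming the weights come back in the 0 - 1000 range produced by the expression above (clamping guards against weights slightly outside that range):

```python
def to_similarity(weight, lo=0.7, hi=1.0, max_weight=1000):
    """Map a Manticore integer weight (0..max_weight) onto [lo, hi]."""
    x = min(max(weight / max_weight, 0.0), 1.0)  # clamp to [0, 1]
    return round(lo + (hi - lo) * x, 4)

# the two weights from the session above
print(to_similarity(741))  # 0.9223
print(to_similarity(408))  # 0.8224
```

Note this is a fixed linear mapping, so the same weight always yields the same similarity, unlike per-result-set min-max rescaling.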