Use custom ranking with proximity_bm25

Carlos_Cabrera_Jimen · April 12, 2021, 12:41pm

I’m trying to customize proximity_bm25. It works perfectly fine with payloads when I run this search:
SELECT id, weight() FROM index WHERE MATCH('query')

I want to customize the ranker formula, so I need to use:
OPTION ranker=expr('sum(lcs*user_weight)+bm25')
However, results are different. I’m using payloads with this content:

id,text,relevance
1,word1,1000
2,word2,2000
3,word1,100
4,word1,300
4,word2,200

One problem is that user_weight is 1 when I use the custom expression, whereas proximity_bm25 picks up the payloads.

Thanks a lot!

Sergey · April 13, 2021, 4:40am

However, results are different

As stated in the manual, the default ranker is sum(lcs*user_weight)*1000+bm25 (note the *1000 factor, which is missing from your expression). With that expression the weights do not differ:

mysql> select *, weight() from t where match('a') option ranker=expr('sum(lcs*user_weight)*1000+bm25');
+---------------------+------+----------+
| id                  | f    | weight() |
+---------------------+------+----------+
| 1514212650895016040 | a    |     1356 |
| 1514212650895016041 | a a  |     1302 |
+---------------------+------+----------+
2 rows in set (0.01 sec)

mysql> select *, weight() from t where match('a');
+---------------------+------+----------+
| id                  | f    | weight() |
+---------------------+------+----------+
| 1514212650895016040 | a    |     1356 |
| 1514212650895016041 | a a  |     1302 |
+---------------------+------+----------+
2 rows in set (0.01 sec)

field_weights also works fine for me:

mysql> select *, weight() from t where match('a') option field_weights=(f=10);
+---------------------+------+----------+
| id                  | f    | weight() |
+---------------------+------+----------+
| 1514212650895016040 | a    |    10356 |
| 1514212650895016041 | a a  |    10302 |
+---------------------+------+----------+
2 rows in set (0.00 sec)

mysql> select *, weight() from t where match('a') option ranker=expr('sum(lcs*user_weight)*1000+bm25'), field_weights=(f=10);
+---------------------+------+----------+
| id                  | f    | weight() |
+---------------------+------+----------+
| 1514212650895016040 | a    |    10356 |
| 1514212650895016041 | a a  |    10302 |
+---------------------+------+----------+
2 rows in set (0.01 sec)
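
The weights in these outputs can be checked by hand against sum(lcs*user_weight)*1000+bm25. Below is a minimal Python sketch of that arithmetic. It assumes lcs is 1 for a single-keyword match and that the bm25 parts are 356 and 302; those two values are read off the last three digits of the reported weights, not reported by the server. The function name expr_weight is made up for this illustration.

```python
def expr_weight(lcs_by_field, field_weights, bm25):
    """Evaluate sum(lcs*user_weight)*1000+bm25 for one document.

    lcs_by_field:  {field name: lcs value} for the fields that matched
    field_weights: {field name: user weight}; unlisted fields weigh 1
    bm25:          the bm25 part of the weight (assumed, see above)
    """
    proximity = sum(
        lcs * field_weights.get(field, 1)
        for field, lcs in lcs_by_field.items()
    )
    return proximity * 1000 + bm25


# Assumed inputs: lcs = 1 in field f, bm25 parts inferred from the output.
bm25_parts = (356, 302)

default = [expr_weight({"f": 1}, {}, b) for b in bm25_parts]
boosted = [expr_weight({"f": 1}, {"f": 10}, b) for b in bm25_parts]

print(default)  # [1356, 1302]
print(boosted)  # [10356, 10302]
```

Under these assumptions the sketch gives the same four weights as the queries above, with and without field_weights=(f=10).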

Carlos_Cabrera_Jimen · April 13, 2021, 8:06am

MySQL [(none)]> select *, weight() from bi_search_boost where match('word2') option ranker=expr('sum(lcs*user_weight)*1000+bm25');
+------+-------------+----------+----------+
| id   | tags        | priority | weight() |
+------+-------------+----------+----------+
|    2 | word2       |     2000 |     2421 |
|    4 | word1 word2 |       50 |     2421 |
|    1 | word2       |     1000 |     1442 |
+------+-------------+----------+----------+
3 rows in set (0.00 sec)

MySQL [(none)]> select *, weight() from bi_search_boost where match('word2');
+------+-------------+----------+----------+
| id   | tags        | priority | weight() |
+------+-------------+----------+----------+
|    2 | word2       |     2000 |  2001421 |
|    4 | word1 word2 |       50 |   201421 |
|    1 | word2       |     1000 |     1442 |
+------+-------------+----------+----------+
3 rows in set (0.00 sec)

These are my results. Maybe it is an issue with the version.

Server version: 3.2.2 afd6046@191226 release

Thanks a lot for your time!

(Edited to change the values to the originals in the post.)

Sergey · April 19, 2021, 10:27am

Can you reproduce it with the latest release?

Carlos_Cabrera_Jimen · April 19, 2021, 10:49am

Yes, I tried with 3.5.4, but the results are similar.

It looks like the problem is using payloads with a custom ranker: I can’t reproduce the proximity_bm25 results that way. Without payloads the results match. However, I want to combine payloads with a custom ranker to get better results for my use case.

Sergey · April 19, 2021, 10:52am

Can you provide a full reproducible case including how you generate the index?

Carlos_Cabrera_Jimen · April 19, 2021, 11:28am

This is my conf:

source base {
    type = mysql
    sql_host = {host}
    sql_user = {user}
    sql_pass = {pass}
    sql_db = {db}
}
source bi_search_boost_source : base {
    sql_query_pre = SET NAMES utf8
    sql_query = SELECT id, tags, priority FROM content
    sql_field_string = tags
    sql_attr_uint = priority
    sql_joined_field   = termweight from payload-query; \
        SELECT id,tag,relevance FROM payload ORDER BY id ASC
}
index bi_search_boost {
    path = /var/lib/manticore
    source = bi_search_boost_source
    type = plain
    html_strip = 1
    stopwords_unstemmed = 1
    phrase_boundary = U+2C
    morphology = libstemmer_en
    min_word_len = 1
    min_prefix_len = 1
    ngram_len = 1
    ignore_chars = -
    charset_table = 0..9, a..z, _, A..Z->a..z
}
indexer
{
    mem_limit = 1024M
}
searchd
{
    listen = 9306:mysql41
    log = /var/log/manticore/searchd.log
    query_log = /var/log/manticore/query.log
    query_log_format = sphinxql
    binlog_path =
    listen_backlog = 128
    read_timeout = 7
    max_children =  3
    seamless_rotate = 1
    preopen_indexes = 1
    unlink_old = 1
    workers = threads
    dist_threads = 2
    pid_file = /var/run/sphinx/searchd.pid
}

I am creating a database in MySQL with the contents of these CSV files:

content.csv:
id,tags,priority
1,word2,1000
1,word1,1000
2,word2,2000
3,word1,100
4,word1 word2,50

payload.csv:

id,tag,relevance
1,word1,1000
2,word2,2000
3,word1,100
4,word1,300
4,word2,200

indexer and searchd were installed from this package:
https://repo.manticoresearch.com/repository/manticoresearch/release/centos/7/x86_64/manticore-3.5.4_201211.13f8d08-1.el7.centos.x86_64.rpm

Sergey · April 20, 2021, 5:50am

I’ve reproduced your results. I’ve also found this in the manual:

Currently, the only method to account for payloads is to use SPH_RANK_PROXIMITY_BM25 ranker

and this code:

        ExtRanker_c * pRanker = nullptr;
        switch ( tQuery.m_eRanker )
        {
                case SPH_RANK_PROXIMITY_BM25:
                        if ( uPayloadMask )
                                pRanker = new ExtRanker_State_T < RankerState_ProximityPayload_fn<true>, true > ( tXQ, tTermSetup, bSkipQCache );
                        else if ( tXQ.m_bSingleWord )
                                pRanker = new ExtRanker_WeightSum_c<WITH_BM25> ( tXQ, tTermSetup, bSkipQCache );
                        else if ( bGotDupes )
                                pRanker = new ExtRanker_State_T < RankerState_Proximity_fn<true,true>, true > ( tXQ, tTermSetup, bSkipQCache );
                        else
                                pRanker = new ExtRanker_State_T < RankerState_Proximity_fn<true,false>, true > ( tXQ, tTermSetup, bSkipQCache );
                        break;

                case SPH_RANK_BM25:
                        pRanker = new ExtRanker_WeightSum_c<WITH_BM25> ( tXQ, tTermSetup, bSkipQCache );
                        break;

                case SPH_RANK_NONE:
                        pRanker = new ExtRanker_None_c ( tXQ, tTermSetup, bSkipQCache );
                        break;

                case SPH_RANK_WORDCOUNT:
                        pRanker = new ExtRanker_State_T < RankerState_Wordcount_fn, false > ( tXQ, tTermSetup, bSkipQCache );
                        break;

                case SPH_RANK_PROXIMITY:
                        if ( tXQ.m_bSingleWord )
                                pRanker = new ExtRanker_WeightSum_c<> ( tXQ, tTermSetup, bSkipQCache );
                        else if ( bGotDupes )
                                pRanker = new ExtRanker_State_T < RankerState_Proximity_fn<false,true>, false > ( tXQ, tTermSetup, bSkipQCache );
                        else
                                pRanker = new ExtRanker_State_T < RankerState_Proximity_fn<false,false>, false > ( tXQ, tTermSetup, bSkipQCache );
                        break;

So I guess the payload is indeed only effective with SPH_RANK_PROXIMITY_BM25, not with custom rankers or any other standard ranker.
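
The weights reported earlier in the thread for match('word2') are consistent with this. Below is a minimal Python sketch of the arithmetic, not the actual ranker code. It assumes that the payload variant of proximity_bm25 adds the matched payload to the lcs of the regular field before multiplying by 1000, while the expression ranker sees the termweight field as an ordinary field with lcs = 1. The bm25 parts (421 and 442) are inferred from the reported weights, and both function names are made up for this illustration.

```python
def builtin_with_payload(lcs_regular, payload_sum, bm25):
    """Assumed model of proximity_bm25 on an index with a payload field."""
    return (lcs_regular + payload_sum) * 1000 + bm25


def custom_expr(lcs_per_field, bm25):
    """sum(lcs*user_weight)*1000+bm25 with every user_weight left at 1."""
    return sum(lcs_per_field) * 1000 + bm25


# id -> (payload for word2 from payload.csv, inferred bm25 part)
docs = {
    2: (2000, 421),
    4: (200, 421),
    1: (0, 442),  # id 1 only has a payload for word1, so word2 adds nothing
}

for doc_id, (payload, bm25) in docs.items():
    # tags always matches word2; termweight matches only if a payload exists
    lcs_fields = [1, 1] if payload else [1]
    print(doc_id,
          builtin_with_payload(1, payload, bm25),
          custom_expr(lcs_fields, bm25))
# 2 2001421 2421
# 4 201421 2421
# 1 1442 1442
```

Under these assumptions the sketch gives the same six weights as the two queries against bi_search_boost: the expression ranker counts the payload field as one more matching field, so the relevance values 2000 and 200 never enter the sum.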

Carlos_Cabrera_Jimen · April 20, 2021, 6:42am

Would it be possible to add it as an improvement for future releases?

Sergey · April 20, 2021, 9:36am

Can you please create a feature request on GitHub, in the manticoresoftware/manticoresearch repository?