CALL KEYWORDS doesn't work as expected

Hello.

I’m currently implementing a “More Like This” algorithm, and I’ve run into unexpected behavior.

In my case, CALL KEYWORDS doesn’t return duplicates if I specify 1 as stats:

MySQL [(none)]> CALL KEYWORDS ('Risky things are risky', 'videos');
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | risky     | riski      |
| 2    | things    | thing      |
| 4    | risky     | riski      |
+------+-----------+------------+
3 rows in set (0.001 sec)

MySQL [(none)]> CALL KEYWORDS ('Risky things are risky', 'videos', 1 AS `stats`);
+------+-----------+------------+-------+-------+
| qpos | tokenized | normalized | docs  | hits  |
+------+-----------+------------+-------+-------+
| 2    | things    | thing      | 47448 | 49831 |
| 4    | risky     | riski      | 27666 | 39158 |
+------+-----------+------------+-------+-------+
2 rows in set (0.003 sec)

Moreover, docs and hits change depending on the number of times the word is repeated:

MySQL [(none)]> CALL KEYWORDS ('Risky things are risky-risky-risky!', 'videos', 1 AS `stats`);
+------+-----------+------------+-------+-------+
| qpos | tokenized | normalized | docs  | hits  |
+------+-----------+------------+-------+-------+
| 2    | things    | thing      | 47448 | 49831 |
| 6    | risky     | riski      | 55332 | 78316 |
+------+-----------+------------+-------+-------+
2 rows in set (0.002 sec)

Such queries work fine in the course console.

I’m on version 3.6.0. What am I missing?
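
The numbers in the two outputs above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the figures are copied from the outputs, the repeat counts are how many times "risky" appears in each query text, and the helper name `per_occurrence` is made up for this example. Whether the per-occurrence value is the true document count is not something these outputs confirm.

```python
# Stats reported by CALL KEYWORDS for the normalized form "riski",
# copied from the two outputs above. Keys are the number of times
# "risky" appears in the query text.
reported = {
    2: {"docs": 27666, "hits": 39158},  # 'Risky things are risky'
    4: {"docs": 55332, "hits": 78316},  # 'Risky things are risky-risky-risky!'
}


def per_occurrence(stats_by_repeats):
    """Divide each reported stat by the repeat count of the keyword."""
    return {
        repeats: {name: value / repeats for name, value in stats.items()}
        for repeats, stats in stats_by_repeats.items()
    }


normalized = per_occurrence(reported)
for repeats, stats in sorted(normalized.items()):
    print(repeats, stats)
```

Both queries give the same per-occurrence values, which suggests the reported stats grow linearly with the number of times the keyword is repeated.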

Seems like a bug.

Could you create a ticket on GitHub and include the reproducible example from this topic?

I’ve now tried to work out which steps are needed to reproduce it, without success. On the same server, I created an identical index and added a couple of arbitrary documents, but it works correctly there. What should I do?

Just in case: I can always repopulate the index (12 GB).

Could you create an empty index with the same settings as the videos index, then issue the queries from this topic and provide the output?

Is the videos index a plain index, an RT index, or a distributed index with agents or local indexes?

Sure. It’s an RT index.

https://pastebin.com/W8dGQ21c

I could reproduce the issue. It has something to do with the number of disk chunks. The simplified case is:

mysql> create table t(f text);
Query OK, 0 rows affected (0.00 sec)

mysql> insert into t values(0,'abc');
Query OK, 1 row affected (0.00 sec)

mysql> flush ramchunk t;
Query OK, 0 rows affected (0.01 sec)

mysql> call keywords('abc', 't', 1 as stats);
+------+-----------+------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+------------+------+------+
| 1    | abc       | abc        | 1    | 1    |
+------+-----------+------------+------+------+
1 row in set (0.00 sec)

mysql> call keywords('abc abc', 't', 1 as stats);
+------+-----------+------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+------------+------+------+
| 2    | abc       | abc        | 2    | 2    |
+------+-----------+------------+------+------+
1 row in set (0.00 sec)

mysql> insert into t values(0,'abc');
Query OK, 1 row affected (0.00 sec)

mysql> flush ramchunk t;
Query OK, 0 rows affected (0.01 sec)

mysql> call keywords('abc', 't', 1 as stats);
+------+-----------+------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+------------+------+------+
| 1    | abc       | abc        | 2    | 2    |
+------+-----------+------------+------+------+
1 row in set (0.00 sec)

mysql> call keywords('abc abc', 't', 1 as stats);
+------+-----------+------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+------------+------+------+
| 2    | abc       | abc        | 4    | 4    |
+------+-----------+------------+------+------+
1 row in set (0.00 sec)

I.e., the stats depend on the number of times you specify a keyword as well as on the number of disk chunks.

And the effect remains even after merging to a single chunk:

mysql> select * from t.status;
+------+----------+--------------------------------+-------------------+---------------+-----------+------------+-------------+--------------------+----------------------+-----------------------------+----------------------+-----------------------------+------------------+
| id   | chunk_id | base_name                      | indexed_documents | indexed_bytes | ram_bytes | disk_bytes | disk_mapped | disk_mapped_cached | disk_mapped_doclists | disk_mapped_cached_doclists | disk_mapped_hitlists | disk_mapped_cached_hitlists | killed_documents |
+------+----------+--------------------------------+-------------------+---------------+-----------+------------+-------------+--------------------+----------------------+-----------------------------+----------------------+-----------------------------+------------------+
|    2 |        1 | /usr/local/var/manticore/t/t.1 |                 1 |             6 |      8296 |        541 |          85 |               4096 |                    0 |                           0 |                    0 |                           0 |                0 |
|    1 |        0 | /usr/local/var/manticore/t/t.0 |                 1 |             3 |      8296 |        541 |          85 |               4096 |                    0 |                           0 |                    0 |                           0 |                0 |
+------+----------+--------------------------------+-------------------+---------------+-----------+------------+-------------+--------------------+----------------------+-----------------------------+----------------------+-----------------------------+------------------+
2 rows in set (0.00 sec)

mysql> optimize index t option cutoff=1, sync=1;
Query OK, 0 rows affected (0.01 sec)

mysql> select * from t.status;
+------+----------+--------------------------------+-------------------+---------------+-----------+------------+-------------+--------------------+----------------------+-----------------------------+----------------------+-----------------------------+------------------+
| id   | chunk_id | base_name                      | indexed_documents | indexed_bytes | ram_bytes | disk_bytes | disk_mapped | disk_mapped_cached | disk_mapped_doclists | disk_mapped_cached_doclists | disk_mapped_hitlists | disk_mapped_cached_hitlists | killed_documents |
+------+----------+--------------------------------+-------------------+---------------+-----------+------------+-------------+--------------------+----------------------+-----------------------------+----------------------+-----------------------------+------------------+
|    1 |        2 | /usr/local/var/manticore/t/t.2 |                 2 |             9 |     20584 |        579 |          93 |              16384 |                    0 |                           0 |                    0 |                           0 |                0 |
+------+----------+--------------------------------+-------------------+---------------+-----------+------------+-------------+--------------------+----------------------+-----------------------------+----------------------+-----------------------------+------------------+
1 row in set (0.00 sec)

mysql> call keywords('abc', 't', 1 as stats); call keywords('abc abc', 't', 1 as stats);
+------+-----------+------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+------------+------+------+
| 1    | abc       | abc        | 2    | 2    |
+------+-----------+------------+------+------+
1 row in set (0.00 sec)

+------+-----------+------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+------------+------+------+
| 2    | abc       | abc        | 4    | 4    |
+------+-----------+------------+------+------+
1 row in set (0.00 sec)

Stats in show meta are correct before and after the OPTIMIZE:

mysql> select * from t where match('abc'); show meta;
+---------------------+------+
| id                  | f    |
+---------------------+------+
| 1514356453580734545 | abc  |
| 1514356453580734546 | abc  |
+---------------------+------+
2 rows in set (0.00 sec)

+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| total         | 2     |
| total_found   | 2     |
| time          | 0.000 |
| keyword[0]    | abc   |
| docs[0]       | 2     |
| hits[0]       | 2     |
+---------------+-------+
6 rows in set (0.00 sec)
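
The comparison above can also be done mechanically. The following is a minimal sketch, assuming the rows have already been fetched by some MySQL client and converted to plain Python values; the helper names `parse_meta` and `mismatches` are hypothetical, and the sample data is copied from the outputs in this post.

```python
def parse_meta(meta_rows):
    """Turn SHOW META rows into {keyword: (docs, hits)}."""
    meta = dict(meta_rows)
    stats = {}
    i = 0
    while f"keyword[{i}]" in meta:
        keyword = meta[f"keyword[{i}]"]
        stats[keyword] = (int(meta[f"docs[{i}]"]), int(meta[f"hits[{i}]"]))
        i += 1
    return stats


def mismatches(keywords_rows, meta_rows):
    """List CALL KEYWORDS rows whose docs/hits differ from SHOW META."""
    expected = parse_meta(meta_rows)
    bad = []
    for row in keywords_rows:
        want = expected.get(row["normalized"])
        got = (int(row["docs"]), int(row["hits"]))
        if want is not None and got != want:
            bad.append((row["normalized"], got, want))
    return bad


# SHOW META output for match('abc'), as (Variable_name, Value) pairs.
meta_rows = [
    ("total", "2"), ("total_found", "2"), ("time", "0.000"),
    ("keyword[0]", "abc"), ("docs[0]", "2"), ("hits[0]", "2"),
]
# CALL KEYWORDS rows after OPTIMIZE for 'abc' and for 'abc abc'.
single = [{"qpos": "1", "tokenized": "abc", "normalized": "abc",
           "docs": "2", "hits": "2"}]
repeated = [{"qpos": "2", "tokenized": "abc", "normalized": "abc",
             "docs": "4", "hits": "4"}]

print(mismatches(single, meta_rows))
print(mismatches(repeated, meta_rows))
```

With this data, the single-keyword call agrees with SHOW META, while the repeated-keyword call is flagged.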

Created an issue about it: "CALL KEYWORDS vs RT index gives wrong stats", Issue #593 in manticoresoftware/manticoresearch on GitHub.
Thanks for pointing this out, @bileslaw !

Sergey, I’m glad to hear that the root of the problem has been found. I hope the fix won’t take much effort and the patch will be available soon. Thank you!
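
Until a fix is available, one possible client-side workaround is to make sure each word appears only once in the text passed to CALL KEYWORDS, since in the simplified reproduction a single occurrence returned the same docs and hits as SHOW META. This is an untested sketch, not an official recommendation: the function name `dedupe_terms` is made up, and the regex split is an assumption that does not replicate Manticore's tokenizer (it ignores charset_table, morphology, and stopwords), so different words that normalize to the same form would still count as separate terms here.

```python
import re


def dedupe_terms(text):
    """Return the query text with each word kept once (case-insensitive).

    Assumption: words are split with a simple regex, which is NOT the
    same as the tokenization the index itself applies.
    """
    seen = set()
    unique = []
    for word in re.findall(r"\w+", text.lower()):
        if word not in seen:
            seen.add(word)
            unique.append(word)
    return " ".join(unique)


print(dedupe_terms("Risky things are risky-risky-risky!"))
```

The deduplicated string can then be passed to CALL KEYWORDS in place of the original text. If the “More Like This” scoring needs the term frequency within the source text, that would have to be counted separately on the client.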