Diacritics-insensitive search

Does Manticore support it? For which languages, and how?

Hi. By default Manticore does accent-insensitive tokenization with

charset_table=non_cjk

which is also the default in RT mode.

For example:

mysql> create table t(f text);
Query OK, 0 rows affected (0.00 sec)

mysql> call keywords('café, cliché, façade, Chloë, Brontë, pādā for payday, São Paulo', 't');
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | cafe      | cafe       |
| 2    | cliche    | cliche     |
| 3    | facade    | facade     |
| 4    | chloe     | chloe      |
| 5    | bronte    | bronte     |
| 6    | pada      | pada       |
| 7    | for       | for        |
| 8    | payday    | payday     |
| 9    | sao       | sao        |
| 10   | paulo     | paulo      |
+------+-----------+------------+
10 rows in set (0.00 sec)

If you want to change this behaviour, you can specify your own charset_table. See the Manticore Search Manual for details.

Thanks, interesting. What about Greek Extended? It does not appear to work here:
https://tatoeba.org/en/sentences/search?from=&query=Αιγυπτον&to=

https://tatoeba.org/en/sentences/search?from=&query=Αἴγυπτον&to=

Not sure what charset_table Tatoeba is using (I remember they have a custom one), but here's a one-liner you can use to test how Manticore normalizes text:

snikolaev@dev:~$ docker stop manticore; docker rm manticore; docker run --name manticore --rm -d manticoresearch/manticore:latest && docker exec -it manticore mysql -e "create table t(f text); call keywords('Αἴγυπτον', 't');" && docker stop manticore
Error response from daemon: No such container: manticore
Error: No such container: manticore
a87f45eee602919422f244594aa1d99bd85ef317b0dc5f1e9cd6e3d16b2fca63
+------+------------------+------------------+
| qpos | tokenized        | normalized       |
+------+------------------+------------------+
| 1    | αιγυπτον         | αιγυπτον         |
+------+------------------+------------------+
manticore
snikolaev@dev:~$ docker stop manticore; docker rm manticore; docker run --name manticore --rm -d manticoresearch/manticore:latest && docker exec -it manticore mysql -e "create table t(f text); call keywords('Αιγυπτον', 't');" && docker stop manticore
Error response from daemon: No such container: manticore
Error: No such container: manticore
9141e9c6b4c5f374690427efa2491c01ce13eceed5b5ab5a3bac059bad05c634
+------+------------------+------------------+
| qpos | tokenized        | normalized       |
+------+------------------+------------------+
| 1    | αιγυπτον         | αιγυπτον         |
+------+------------------+------------------+
manticore

The md5 sums of the two normalized tokens are the same:

snikolaev@dev:~$ echo αιγυπτον|md5sum
7c08eb91835831d37af45b977dac64af  -
snikolaev@dev:~$ echo αιγυπτον|md5sum
7c08eb91835831d37af45b977dac64af  -
snikolaev@dev:~$

Interesting. If that is the default, then you could update the description to say that it supports Greek Extended. It would be safer to test all Greek Extended characters first to see whether they are normalized. You can find them here: Ancient Greek polytonic letters/characters (accented, non-accented, lowercase, uppercase, capitals)
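One way to prepare such a test is sketched below. It lists every letter in the Unicode Greek Extended block (U+1F00 to U+1FFF) and pairs it with the base letter you would expect after folding. Note the assumption: the expected value comes from Python's own Unicode decomposition (NFD, then dropping combining marks), not from Manticore's charset_table, so it is only a reference to compare against what call keywords actually returns. The helper names are made up for illustration.

```python
import unicodedata

# Unicode block "Greek Extended": U+1F00..U+1FFF (some code points are unassigned).
GREEK_EXTENDED = range(0x1F00, 0x2000)


def strip_marks(text: str) -> str:
    """Decompose to NFD, drop combining marks (category Mn), then lowercase.

    This is a reference folding based on Unicode data only; it is an
    assumption that Manticore's non_cjk table should agree with it.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn").lower()


def greek_extended_letters():
    """Yield (char, expected_base) for every assigned letter in the block."""
    for cp in GREEK_EXTENDED:
        ch = chr(cp)
        # Skip unassigned code points and standalone accent symbols (category Sk).
        if unicodedata.category(ch).startswith("L"):
            yield ch, strip_marks(ch)


pairs = list(greek_extended_letters())

# Space-separated input that can be pasted into: call keywords('<here>', 't');
test_string = " ".join(ch for ch, _ in pairs)

if __name__ == "__main__":
    for ch, base in pairs:
        print(f"U+{ord(ch):04X} {ch} -> {base}")
```

Feeding test_string to call keywords and comparing each returned token with the expected base letter would show which characters, if any, the default table does not fold.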

By the way, is it possible to have a setting that toggles between diacritics-insensitive and diacritics-sensitive modes, so that users can decide for themselves?

charset_table is an index option that is set when the index (table) is created, and the user cannot change it at runtime.

It would be better to create separate indexes, each with its own settings, and query whichever one you need.
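On the application side, the toggle could then be a simple choice of which table to query. Below is a minimal sketch, assuming two tables that hold the same documents but were created with different charset_table settings. The table names t_folded and t_exact, the function name, and the %s placeholder style (as used by common Python MySQL clients) are all hypothetical.

```python
# Hypothetical table names: both are assumed to contain the same documents,
# but to have been created with different charset_table settings.
TABLES = {
    False: "t_folded",  # diacritics-insensitive, e.g. the default charset_table
    True: "t_exact",    # diacritics-sensitive, a custom charset_table
}


def build_search(query: str, diacritics_sensitive: bool = False):
    """Return (sql, params) aimed at the table that matches the user's choice."""
    table = TABLES[diacritics_sensitive]
    # The table name comes from a fixed mapping, never from user input;
    # the search text is passed as a parameter for the client to escape.
    return f"SELECT * FROM {table} WHERE MATCH(%s)", (query,)


if __name__ == "__main__":
    print(build_search("café"))
    print(build_search("café", diacritics_sensitive=True))
```

The returned pair can be handed to whatever MySQL-protocol client the application already uses. The cost of this approach is that every document is indexed twice.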