How do I configure tokenization so that Chinese text is split into single characters?

dtb · January 8, 2024, 11:03am

Where should ngram_len=1 be set?
When creating the table, or in the configuration file (manticore.conf)?

dtb · January 8, 2024, 12:20pm

CREATE TABLE book (
title text,
auth text
) charset_table='cjk,non_cjk' morphology='icu_chinese' charset_type='utf-8' ngram_len='1' ngram_chars='U+3000..U+2FA1F';
When I create the table this way, I get an error:
ERROR 1064 (42000): error adding table 'book': 'ngram_chars': ngram characters must not be referenced anywhere else (code=U+3041)
I want to specify at table creation time that Chinese text is split into single characters.

Sergey · January 12, 2024, 12:51pm

Try:

charset_table='non_cjk' charset_type='utf-8' ngram_len='1' ngram_chars='cjk'

Dawei_Zhao · March 19, 2025, 3:23am

How can I split Chinese text by individual character, English words by letter, and numbers and symbols by single character?
For example:
searching for o should return book
searching for 书 should return 书籍

Similar to SQL LIKE '%o%' or LIKE '%书%' in MySQL or SQLite.

Sergey · March 21, 2025, 11:16am

How to split Chinese characters by individual character, English words by letter, and numbers and symbols by single character?

Experiment with charset_table and ngram_chars, e.g.:

mysql> drop table if exists t; create table t(f text) charset_table='english' ngram_chars='chinese,0..9' ngram_len='1'; call keywords('abc 123 英文按字母', 't');
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.01 sec)

--------------
create table t(f text) charset_table='english' ngram_chars='chinese,0..9' ngram_len='1'
--------------

Query OK, 0 rows affected (0.01 sec)

--------------
call keywords('abc 123 英文按字母', 't')
--------------

+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | abc       | abc        |
| 2    | 1         | 1          |
| 3    | 2         | 2          |
| 4    | 3         | 3          |
| 5    | 英        | 英         |
| 6    | 文        | 文         |
| 7    | 按        | 按         |
| 8    | 字        | 字         |
| 9    | 母        | 母         |
+------+-----------+------------+
9 rows in set (0.00 sec)
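
To make the split above easier to reason about, here is a minimal Python sketch of the same idea. It is a toy model, not Manticore's tokenizer: `is_word_char`, `is_ngram_char` and `tokenize` are hypothetical helpers, ASCII letters stand in for charset_table='english', and the U+4E00..U+9FFF range is a simplified assumption standing in for the 'chinese' alias.

```python
# Toy model only: NOT Manticore's tokenizer, just an illustration of how
# word characters and 1-gram characters can be treated differently.

def is_word_char(ch):
    # Assumption: stands in for charset_table='english' (ASCII letters only).
    return ch.isascii() and ch.isalpha()

def is_ngram_char(ch):
    # Assumption: stands in for ngram_chars='chinese,0..9'. The CJK range
    # here is a simplified stand-in, not the real 'chinese' alias.
    return ("0" <= ch <= "9") or ("\u4e00" <= ch <= "\u9fff")

def tokenize(text):
    tokens, word = [], []
    for ch in text:
        if is_word_char(ch):
            word.append(ch.lower())   # word characters accumulate into one token
            continue
        if word:                      # any other character ends the current word
            tokens.append("".join(word))
            word = []
        if is_ngram_char(ch):         # ngram_len=1: one token per character
            tokens.append(ch)
    if word:                          # flush a trailing word
        tokens.append("".join(word))
    return tokens

print(tokenize("abc 123 英文按字母"))
# ['abc', '1', '2', '3', '英', '文', '按', '字', '母']
```

In this model a character is routed to exactly one of the two branches, which is why the letters stay together as abc while each digit and each Chinese character becomes its own token.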

Dawei_Zhao · March 21, 2025, 12:14pm

CREATE TABLE IF NOT EXISTS videos (
id bigint,
collection_id string,
video_id string,
name text
)
charset_table = 'russian'
charset_type = 'utf-8'
ngram_len = '1'
ngram_chars = 'cjk,english,0..9'
min_word_len = '1'
index_exact_words = '1'

Like this. Thanks!
Can’t charset table and ngram chars intersect?

Dawei_Zhao · March 21, 2025, 1:21pm

CREATE TABLE IF NOT EXISTS videos (
id bigint,
collection_id string,
video_id string,
name text
)
charset_table = 'U+0020..U+002F, U+003A..U+0040, U+005B..U+0060, U+007B..U+007E'
charset_type = 'utf-8'
ngram_len = '1'
ngram_chars = 'cjk,english,0..9'
min_word_len = '1'
index_exact_words = '1'

(This version is from ChatGPT.)

Sergey · March 26, 2025, 11:38am

Can’t charset table and ngram chars intersect?

You are right, they can’t intersect. Each character can be either a normal character or an ngram character, but not both.
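
As a rough way to check this rule before creating a table, the sketch below looks for the first code point shared by two range lists written in the U+XXXX..U+YYYY style. It is a toy check, not Manticore code: `parse_ranges` and `first_overlap` are hypothetical helpers, aliases such as cjk or english are not expanded, and the hiragana range U+3041..U+3096 is only an assumed stand-in for part of what the cjk alias covers (the error earlier in the thread reports U+3041 as the conflicting code point).

```python
# Toy check, not Manticore code: find the lowest code point shared by two
# sets of "U+XXXX..U+YYYY" ranges. Aliases (cjk, english, ...) are NOT expanded.

def parse_ranges(spec):
    """Parse 'U+0041..U+005A, U+0061' into a list of (start, end) ints."""
    ranges = []
    for part in spec.split(","):
        lo, _, hi = part.strip().partition("..")
        start = int(lo.strip()[2:], 16)            # drop the 'U+' prefix
        end = int(hi.strip()[2:], 16) if hi else start
        ranges.append((start, end))
    return ranges

def first_overlap(charset_spec, ngram_spec):
    """Return the lowest shared code point as 'U+XXXX', or None if disjoint."""
    shared = [
        max(a_start, b_start)
        for a_start, a_end in parse_ranges(charset_spec)
        for b_start, b_end in parse_ranges(ngram_spec)
        if max(a_start, b_start) <= min(a_end, b_end)
    ]
    return f"U+{min(shared):04X}" if shared else None

# Assumption: U+3041..U+3096 (hiragana) stands in for part of the 'cjk' alias.
print(first_overlap("U+3041..U+3096", "U+3000..U+2FA1F"))  # U+3041

# Punctuation ranges from the thread vs. ASCII digits and letters
# (assumed stand-ins for '0..9' and 'english').
punct = "U+0020..U+002F, U+003A..U+0040, U+005B..U+0060, U+007B..U+007E"
alnum = "U+0030..U+0039, U+0041..U+005A, U+0061..U+007A"
print(first_overlap(punct, alnum))  # None
```

The first call mirrors the conflict from the original CREATE TABLE attempt, where the same characters appeared in both settings. The second shows that punctuation-only ranges do not collide with ASCII digits and letters under the stated assumptions.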