Where should ngram_len=1 be set?
Is it set when creating the table, or in the configuration file (manticore.conf)?
CREATE TABLE book (
title text,
auth text
) charset_table='cjk,non_cjk' morphology='icu_chinese' charset_type='utf-8' ngram_len='1' ngram_chars='U+3000..U+2FA1F';
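Creating the table this way fails with:
ERROR 1064 (42000): error adding table 'book': 'ngram_chars': ngram characters must not be referenced anywhere else (code=U+3041)
I want Chinese text to be split into single characters, specified when the table is created.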
Try:
charset_table='non_cjk' charset_type='utf-8' ngram_len='1' ngram_chars='cjk'
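Applied to the book table from the question, the suggestion could look like the sketch below. It assumes that morphology and charset_type can simply be left out for single-character splitting (an assumption, not something the reply states), and the sample string passed to call keywords is made up for illustration.

```sql
-- Sketch only: CJK characters are listed solely in ngram_chars,
-- everything else stays in charset_table, so the two sets do not overlap.
create table book (
title text,
auth text
) charset_table='non_cjk' ngram_len='1' ngram_chars='cjk';

-- Inspect how a mixed string is tokenized (sample text is illustrative).
call keywords('manticore 书籍', 'book');
```

If the options behave as intended, each Chinese character should come back as its own token while the Latin word stays whole.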
How can I split Chinese text by single character, English words by letter, and digits and symbols by single character?
For example:
a search for o should return book
a search for 书 should return 书籍
the same way LIKE '%o%' or LIKE '%书%' works in MySQL or SQLite
Experiment with charset_table and ngram_chars, e.g.:
mysql> drop table if exists t; create table t(f text) charset_table='english' ngram_chars='chinese,0..9' ngram_len='1'; call keywords('abc 123 英文按字母', 't');
--------------
drop table if exists t
--------------
Query OK, 0 rows affected (0.01 sec)
--------------
create table t(f text) charset_table='english' ngram_chars='chinese,0..9' ngram_len='1'
--------------
Query OK, 0 rows affected (0.01 sec)
--------------
call keywords('abc 123 英文按字母', 't')
--------------
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1 | abc | abc |
| 2 | 1 | 1 |
| 3 | 2 | 2 |
| 4 | 3 | 3 |
| 5 | 英 | 英 |
| 6 | 文 | 文 |
| 7 | 按 | 按 |
| 8 | 字 | 字 |
| 9 | 母 | 母 |
+------+-----------+------------+
9 rows in set (0.00 sec)
CREATE TABLE IF NOT EXISTS videos (
id bigint,
collection_id string,
video_id string,
name text
)
charset_table = 'russian'
charset_type = 'utf-8'
ngram_len = '1'
ngram_chars = 'cjk,english,0..9'
min_word_len = '1'
index_exact_words = '1'
Like this. Thanks!
Can't charset_table and ngram_chars intersect?
CREATE TABLE IF NOT EXISTS videos (
id bigint,
collection_id string,
video_id string,
name text
)
charset_table = 'U+0020..U+002F, U+003A..U+0040, U+005B..U+0060, U+007B..U+007E'
charset_type = 'utf-8'
ngram_len = '1'
ngram_chars = 'cjk,english,0..9'
min_word_len = '1'
index_exact_words = '1'
(The definition above is from ChatGPT.)
Can't charset_table and ngram_chars intersect?
You are right, they can't intersect. Each character can be either a regular character or an ngram character, not both.
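A small contrast may make the rule concrete. The table names t_overlap and t_disjoint are hypothetical, and the first statement has not been run here; it is only expected to be rejected with the same kind of 'ngram_chars' error shown earlier in the thread, because 0..9 appears in both lists.

```sql
-- Expected to fail (assumption): digits are referenced in both
-- charset_table and ngram_chars.
create table t_overlap(f text) charset_table='english, 0..9' ngram_chars='chinese,0..9' ngram_len='1';

-- Same options as the working example earlier in the thread:
-- digits appear only in ngram_chars, so the two sets are disjoint.
create table t_disjoint(f text) charset_table='english' ngram_chars='chinese,0..9' ngram_len='1';
```

So to have a class of characters indexed one by one, list it in ngram_chars and make sure nothing in charset_table covers the same code points.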