About 4 billion records with ngram_len=1 tokenization: queries are too slow

We have around 4 billion records and use ngram_len=1 tokenization. When a MATCH query hits many keywords it is far too slow; a single query can take several minutes.
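For reference, a minimal sketch of the kind of table setup described above, assuming placeholder table and field names rather than the original schema:

```sql
-- Sketch only: table/field names are placeholders.
-- ngram_len=1 indexes every CJK character as its own token;
-- ngram_chars defines which characters that applies to.
CREATE TABLE docs (
    title TEXT,
    body TEXT
) charset_table = 'non_cjk' ngram_chars = 'cjk' ngram_len = '1';
```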

Did you try morphology=icu_chinese?
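A sketch of what that suggestion might look like, assuming a placeholder table name, with ICU-based Chinese segmentation instead of per-character ngrams:

```sql
-- Sketch only: table/field names are placeholders.
-- morphology=icu_chinese segments Chinese text into words via ICU
-- instead of indexing each character separately.
CREATE TABLE docs_icu (
    title TEXT,
    body TEXT
) charset_table = 'non_cjk' morphology = 'icu_chinese';
```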

icu_chinese doesn't fit our use case: some queries would return no results.

Using SET profiling=1; I found that get_docs shows a Duration of more than 150.01.
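For context, the profiling flow that produces this figure, with the query and table name as placeholders:

```sql
-- Query/table are placeholders for illustration.
SET profiling = 1;
SELECT id FROM docs WHERE MATCH('查询关键字') LIMIT 20;
SHOW PROFILE;  -- lists each query stage (e.g. get_docs) with its Duration
```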


There’s an issue about adding support for Jieba: ICU is not a good choice for chinese · Issue #931 · manticoresoftware/manticoresearch · GitHub

Pull requests are welcome. If it’s mission critical for you and you are ready to sponsor the development, the core team can prioritize this task. Let me know then.

Then, about the SET profiling=1; debugging: it shows get_docs with a Duration of more than 150.01. What is causing it to take so long?

It may just be a very heavy computation: with ngram_len=1 each character becomes a separate token, so Manticore has to find all documents for each token and then intersect/combine the lists, sort them, etc.
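One way to see this effect is to check the per-keyword statistics after a full-text query; with ngram_len=1 each Chinese character appears as its own keyword, often with a very large document list (query and table name are placeholders):

```sql
-- Placeholders for illustration: each character of the query
-- is reported as a separate keyword when ngram_len=1.
SELECT id FROM docs WHERE MATCH('北京大学') LIMIT 20;
SHOW META;  -- per-keyword docs[N]/hits[N] show how large each token's doc list is
```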

Thank you very much, I understand.

Hi, is there a way to contact you? I have some questions I'd like to ask.

My email: dingtianbao1983@gmail.com

I’m not sure who you meant: @xyls82 or me, but you can contact me at contact@manticoresearch.com