About 4 billion records with ngram_len=1 tokenization: queries are too slow

We have around 4 billion records and use ngram_len=1 tokenization. When a MATCH query hits many keywords it is far too slow; a single query can take several minutes.
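For reference, a minimal sketch of the kind of table setup described above, assuming placeholder table and field names rather than the original schema:

```sql
-- Sketch only: table/field names are placeholders.
-- ngram_len=1 indexes every CJK character as its own token;
-- ngram_chars defines which characters that applies to.
CREATE TABLE docs (
    title TEXT,
    body TEXT
) charset_table = 'non_cjk' ngram_chars = 'cjk' ngram_len = '1';
```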

Did you try morphology=icu_chinese?
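A sketch of what that suggestion might look like, assuming a placeholder table name, with ICU-based Chinese segmentation instead of per-character ngrams:

```sql
-- Sketch only: table/field names are placeholders.
-- morphology=icu_chinese segments Chinese text into words via ICU
-- instead of indexing each character separately.
CREATE TABLE docs_icu (
    title TEXT,
    body TEXT
) charset_table = 'non_cjk' morphology = 'icu_chinese';
```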

icu_chinese doesn't fit our use case: some queries would return no results.

Using SET profiling=1; I found that get_docs shows a Duration of more than 150.01.
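For context, the profiling flow that produces this figure, with the query and table name as placeholders:

```sql
-- Query/table are placeholders for illustration.
SET profiling = 1;
SELECT id FROM docs WHERE MATCH('查询关键字') LIMIT 20;
SHOW PROFILE;  -- lists each query stage (e.g. get_docs) with its Duration
```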


There’s an issue about adding support for Jieba: ICU is not a good choice for chinese · Issue #931 · manticoresoftware/manticoresearch · GitHub

Pull requests are welcome. If it’s mission critical for you and you are ready to sponsor the development, the core team can prioritize this task. Let me know then.

Then, about the SET profiling=1; debugging: it shows get_docs with a Duration of more than 150.01. What is causing it to take so long?

It may just be a very heavy computation: with ngram_len=1 each character becomes a separate token, so Manticore has to find all documents for each token and then intersect/combine the lists, sort them, etc.
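One way to see this effect is to check the per-keyword statistics after a full-text query; with ngram_len=1 each Chinese character appears as its own keyword, often with a very large document list (query and table name are placeholders):

```sql
-- Placeholders for illustration: each character of the query
-- is reported as a separate keyword when ngram_len=1.
SELECT id FROM docs WHERE MATCH('北京大学') LIMIT 20;
SHOW META;  -- per-keyword docs[N]/hits[N] show how large each token's doc list is
```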

Thank you very much, I understand.

Hi, is there a way to contact you? I have some questions I'd like to ask.

My email: dingtianbao1983@gmail.com

I’m not sure who you meant: @xyls82 or me, but you can contact me at contact@manticoresearch.com