Is it possible to use icu_chinese and ngram_len=1 settings simultaneously?

I noticed that with icu_chinese segmentation, even a small change in word order or a missing character can cause relevant results not to show up. With single-character segmentation, on the other hand, the ranking isn’t very accurate: word combinations don’t get proper weight based on how rare they are.

I would really appreciate your advice on a few questions:

1. Can I use both settings in one index, or do I need to create two indexes, search separately, and then combine the results?

2. If I need to create two indexes, could you suggest a ranking mode that would make it easier to combine the results?

3. I’m thinking about making a tool that creates a weighted segmentation dictionary based on word frequency in a corpus. If this is useful, do you think it could be added to the main branch? Where would be a good place to start in the code?

  1. I believe not, since a Chinese character either has to become a separate token (if it’s in the ngram_chars list) or gets combined with one or more other characters into a single token (which is what the icu_chinese morphology does).
  2. Do you mean you’d have the same documents with the same IDs in two tables with different tokenization settings and then would query them both at the same time (using a distributed table or tbl1, tbl2)? See the sketch after this list.
  3. We are currently working on integrating the Jieba library (Jieba integration · Issue #931 · manticoresoftware/manticoresearch). We are not experts in its quality compared to ICU, but we’ve been told it makes sense; hopefully it will improve Chinese search quality. As for the “weighted segmentation dictionary based on word frequency in a corpus”, I’m not sure how it would work. Please elaborate more on the idea.
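For reference, a minimal sketch of that two-table setup, assuming Manticore’s MySQL protocol on port 9306 and the pymysql client; the table and field names (docs_icu, docs_ngram, docs_all, title, body) are invented for illustration, and the exact charset/tokenization settings, as well as how duplicate IDs from the two tables are ranked together, should be checked against the documentation:

```python
# Rough sketch of the two-table setup discussed above: one table tokenized
# with icu_chinese, one with single-character ngrams, plus a distributed
# table that queries both. Table/field names are made up.
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=9306, user="", password="")
with conn.cursor() as cur:
    # ICU-based Chinese segmentation
    cur.execute(
        "CREATE TABLE docs_icu (title TEXT, body TEXT) "
        "charset_table='non_cjk' morphology='icu_chinese'"
    )
    # Single-character (1-gram) segmentation
    cur.execute(
        "CREATE TABLE docs_ngram (title TEXT, body TEXT) "
        "charset_table='non_cjk' ngram_len='1' ngram_chars='cjk'"
    )
    # Distributed table over both local tables
    cur.execute(
        "CREATE TABLE docs_all type='distributed' "
        "local='docs_icu' local='docs_ngram'"
    )
    # Documents would be inserted into both local tables with the same IDs,
    # then searched through the distributed table:
    cur.execute("SELECT id, WEIGHT() FROM docs_all WHERE MATCH('南京大桥') LIMIT 10")
    print(cur.fetchall())
conn.close()
```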
  1. thanks
  2. yes
  3. Jieba is an excellent engine for word segmentation. Thank you.

Regarding the third point:
I am referring to the “Markov model” approach, which performs word segmentation by computing Bayesian probabilities over a word dictionary and picking the most probable split.

  1. Allow users to provide a word segmentation dictionary in the form of “phrase => weight”. By default, the engine would use the segmentation with the highest total score.
  2. Provide a tool for generating such a dictionary: given a corpus, it would automatically produce the weight dictionary described in 1.

I think this would solve most problems and be convenient to use, while also giving users a way to handle corner cases.
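To make points 1 and 2 more concrete, here is a minimal sketch of the idea, independent of Manticore’s code: build_weight_dict turns corpus word frequencies into log-probability weights (the “phrase => weight” dictionary), and segment picks the split with the highest total weight via dynamic programming. The function names, the unknown-character penalty, and the toy corpus are all made up for illustration.

```python
# Minimal sketch of the "phrase => weight" proposal: weights are
# log-probabilities estimated from corpus frequencies, and the segmenter
# chooses the split with the highest total score.
import math
from collections import Counter

def build_weight_dict(corpus_words):
    """Turn a list of corpus words into a 'phrase => weight' dictionary.

    The weight is the log of the word's relative frequency, so rarer
    words contribute smaller (more negative) scores.
    """
    counts = Counter(corpus_words)
    total = sum(counts.values())
    return {w: math.log(c / total) for w, c in counts.items()}

def segment(text, weights, max_len=8, unknown_penalty=-20.0):
    """Return the segmentation of `text` with the highest total weight.

    best[i] holds (score, segmentation) for the prefix text[:i];
    single characters not in the dictionary fall back to `unknown_penalty`.
    """
    best = [(0.0, [])] + [(float("-inf"), [])] * len(text)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - max_len), i):
            piece = text[j:i]
            w = weights.get(piece)
            if w is None:
                if len(piece) > 1:
                    continue          # only known phrases or single characters
                w = unknown_penalty   # unseen single character
            score = best[j][0] + w
            if score > best[i][0]:
                best[i] = (score, best[j][1] + [piece])
    return best[len(text)][1]

# Toy usage with a made-up corpus; a real dictionary would come from a large
# segmented corpus.
weights = build_weight_dict(["南京市", "长江", "大桥", "长江大桥", "市长", "南京"])
print(segment("南京市长江大桥", weights))
```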

=============== detail ===================
Sorry, I haven’t been able to find an example that conveys this accurately to non-native speakers of Chinese. Chapters 4 and 5 of the book “The Beauty of Mathematics” cover “the evolution of Chinese word segmentation” and “hidden Markov models”, which are the sources of my knowledge.

  1. A given Chinese text can be segmented in multiple ways.
  2. Different usage scenarios need different kinds of segmentation:
    a) Segmentation that computes Bayesian probabilities over a word dictionary usually produces splits that match the context.
    b) Single-character segmentation, or several segmentation methods used together, better matches what people expect from “search” and also partially covers cases where users pick the “wrong” keywords.
  3. Therefore, I think it would be more convenient if multiple segmentation methods could run in parallel, with their results merged and sorted by a weighted rank.
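As a sketch of what “merged and sorted by a weighted rank” could look like on the application side, assuming each segmentation method (or table) returns (doc_id, score) pairs; the function name and the per-method weights are invented and would need tuning:

```python
# Minimal sketch of merging results from two search strategies by a weighted
# score, assuming each strategy returns (doc_id, score) pairs.
from collections import defaultdict

def merge_ranked(result_sets, weights):
    """Combine several (doc_id, score) lists into one ranked list.

    result_sets: list of iterables of (doc_id, score)
    weights:     one multiplier per result set
    """
    combined = defaultdict(float)
    for results, w in zip(result_sets, weights):
        for doc_id, score in results:
            combined[doc_id] += w * score
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

# Toy usage: results from the icu_chinese table get more weight than results
# from the single-character table.
icu_hits = [(1, 2.5), (2, 1.9)]
ngram_hits = [(2, 3.1), (3, 2.2)]
print(merge_ranked([icu_hits, ngram_hits], weights=[0.7, 0.3]))
```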