Use of N-gram or morphology for CJK languages

rlattuad · March 9, 2020, 1:59pm

Hi,

I am trying out the new Chinese ICU morphology available in version 3.1.
Previously I used ngram_len = 1 and ngram_chars = cjk.

My question is: if I set morphology = icu_chinese, do I need to disable ngram?
What about other languages: Japanese, Korean, etc.?

Thanks
Roberto

Sergey · March 10, 2020, 8:42am

Yes, just use charset_table = cjk, non_cjk
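
As a rough sketch of what that could look like in the config (the index name cjk_example is just a placeholder, and the rest of the index definition is left out), with the ngram settings from your previous setup removed:

```
index cjk_example
{
    # type, path and other index settings omitted in this sketch
    charset_table = cjk, non_cjk
    morphology    = icu_chinese

    # previous N-gram settings, no longer used with icu_chinese:
    # ngram_len   = 1
    # ngram_chars = cjk
}
```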

BTW, here’s an interactive course on the basics of that: Tokenization of Chinese texts

Only Chinese is supported for now.

rlattuad · March 10, 2020, 9:09am

Hi Sergey,

I think my setup is a bit more complicated: I need to support all CJK languages (not just Chinese). I have seen the course, but I need to understand how I can also handle Japanese and Korean if I disable ngram.

Can I use morphology for Chinese and ngram for Japanese and Korean?

Which setting takes precedence? I need a bit of insight into how these settings are actually handled during the indexing phase.

Can you suggest an optimal setup to handle all CJK languages, not just Chinese?

Thanks
Roberto