I am looking for design suggestions on how to add proper word segmentation support for languages with continuous scripts. Currently Manticore can segment only Chinese, with the help of ICU or Jieba. I would like to be able to segment other languages as well, such as Japanese, Korean, Thai, Tibetan, and other Chinese languages.
I have looked at how this could be done using plugins. Segmentation can be implemented at indexing time with a custom plugin, because an index-time plugin can output more than one token for a single input token via the xxx_get_extra_token() callback. However, it looks like a query-time plugin can only output a single token for each token produced by the base tokenizer, which prevents doing segmentation at query time as well.
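For reference, this is (very roughly) the index-time pattern I have in mind. The myseg_* names are placeholders and the signatures are only my approximation of the push_token / get_extra_token callbacks from the plugin docs, so please treat this as a sketch rather than working code:

```c
#include <stdlib.h>
#include <string.h>

#define MAX_PIECES 16

/* Per-index state; in the real plugin this would be allocated in the init
   callback and handed back to the other callbacks as userdata. */
typedef struct {
    char *pieces[MAX_PIECES];  /* segments produced for the current source token */
    int   count;               /* number of segments produced */
    int   next;                /* next segment to hand out via get_extra_token */
} seg_state_t;

/* Stub segmenter: this is where a call into MeCab/Sudachi/etc. would go.
   As a placeholder it just keeps the token whole. */
static void segment_token(seg_state_t *st, const char *token)
{
    for (int i = 0; i < st->count; i++)
        free(st->pieces[i]);
    st->count = 0;
    st->next = 1;                      /* piece 0 is returned by push_token itself */
    st->pieces[st->count++] = strdup(token);
}

/* Called once per token produced by the base tokenizer. Returns the first
   segment and reports via *extra that more segments are pending. */
char * myseg_push_token(void *userdata, char *token, int *extra, int *delta)
{
    seg_state_t *st = (seg_state_t *)userdata;
    segment_token(st, token);
    *extra = st->count - 1;            /* how many extra tokens will follow */
    *delta = 1;
    return st->pieces[0];
}

/* Called repeatedly after push_token() until it returns NULL. */
char * myseg_get_extra_token(void *userdata, int *delta)
{
    seg_state_t *st = (seg_state_t *)userdata;
    if (st->next >= st->count)
        return NULL;                   /* no more segments for this source token */
    *delta = 1;                        /* each segment gets its own position */
    return st->pieces[st->next++];
}
```

At query time there is no equivalent of get_extra_token, so the same one-token-in / many-tokens-out pattern cannot be expressed.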
This is a problem because in continuous scripts word boundaries are ambiguous, so a typical search query can easily contain text that the indexer would have split into several tokens, preventing a match. Example in Japanese: 食べた is reasonably considered a single word, but most tokenizers (such as MeCab or Sudachi) split it into two tokens, 食べ and た.
Another requirement I have is to be able to query multiple indexes at once, with each index configured to use a different language-specific tokenizer, something like SELECT * FROM japanese_index, korean_index WHERE MATCH('here could be Japanese or Korean'). However, I cannot do that because the query-time plugin has to be specified as OPTION token_filter='mylib.so:mylib', i.e. it is set per query and is not index-dependent.
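To make the limitation concrete, this is the shape of the query; the second comment describes the behaviour I would need, not existing syntax:

```sql
-- What works today: one token_filter for the whole query, whatever indexes are listed.
SELECT * FROM japanese_index, korean_index
WHERE MATCH('here could be Japanese or Korean')
OPTION token_filter='mylib.so:mylib';

-- What I would need: the filter resolved per index (the way charset_table or
-- morphology already are), so that japanese_index and korean_index could each
-- apply their own language-specific segmenter at query time.
```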