Custom word segmentation for languages with continuous scripts

I am looking for design suggestions on how to add proper word segmentation support for languages with continuous scripts. Currently, Manticore can only segment Chinese, with the help of ICU or Jieba. I want to segment other languages such as Korean, Japanese, Thai, Tibetan, and other Chinese languages.

I have looked at how this could be done using plugins. Segmentation can be implemented at indexing time with a custom plugin, because a plugin can output more than one token for a single input token, via the xxx_get_extra_token() callback. However, it looks like a query-time plugin can only output a single token for each token produced by the base tokenizer, which prevents doing segmentation at query time as well.
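To make the index-time part concrete, here is a minimal sketch of the kind of token filter I have in mind. I am assuming the callback shape used by index-time token_filter plugins (as in lemmatizer-uk): push_token returns the first output token and flags that more are queued, and get_extra_token is then called until it returns NULL. The myseg_* names and the toy '|'-splitting segmenter are placeholders of mine, so the exact signatures should be checked against the real plugin headers.

```c
// Sketch of an index-time token filter that splits one incoming token into
// several output tokens. Signatures are assumed from the lemmatizer-uk-style
// index-time token_filter API and may differ from the real headers.

#include <string.h>

#define MAX_PARTS 16
#define MAX_TOKEN 64

typedef struct {
    char parts[MAX_PARTS][MAX_TOKEN];  // segments produced for the current token
    int  count;                        // how many segments are queued
    int  next;                         // index of the next segment to hand back
} SegState;

// Kept global only for brevity; real code should keep this in userdata,
// since indexing may run in several threads.
static SegState g_state;

// Placeholder segmenter: splits on '|' to keep the sketch self-contained;
// a real plugin would call MeCab, Sudachi, a Korean analyzer, etc.
static void segment(const char *token)
{
    g_state.count = 0;
    g_state.next = 0;
    const char *start = token;
    for (const char *p = token; ; p++) {
        if (*p == '|' || *p == '\0') {
            int len = (int)(p - start);
            if (len > 0 && len < MAX_TOKEN && g_state.count < MAX_PARTS) {
                memcpy(g_state.parts[g_state.count], start, len);
                g_state.parts[g_state.count][len] = '\0';
                g_state.count++;
            }
            if (*p == '\0')
                break;
            start = p + 1;
        }
    }
}

// Called once per token coming from the base tokenizer. Returns the first
// segment and signals via *extra that more tokens should be pulled with
// myseg_get_extra_token().
char * myseg_push_token(void *userdata, char *token, int *extra, int *delta)
{
    (void)userdata;
    segment(token);
    *delta = 1;                          // position increment of the returned token
    *extra = (g_state.count > 1) ? 1 : 0;
    if (g_state.count == 0)
        return token;                    // nothing to split, pass through
    g_state.next = 1;
    return g_state.parts[0];
}

// Called repeatedly after push_token, emitting the remaining segments one by
// one until it returns NULL.
char * myseg_get_extra_token(void *userdata, int *delta)
{
    (void)userdata;
    if (g_state.next >= g_state.count)
        return NULL;                     // no more segments for this token
    *delta = 1;
    return g_state.parts[g_state.next++];
}
```

With something like this, an indexed 食べた could be stored as the two tokens 食べ and た, but the same splitting would still be missing at query time.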

This is a problem because in continuous scripts word boundaries are inherently ambiguous, so a typical search query can easily contain text that the indexer would have split into several tokens, preventing a match. An example in Japanese: 食べた is reasonably considered a single word, but most tokenizers (such as MeCab or Sudachi) split it into two tokens, 食べ and た.

Another requirement I have is to be able to query multiple indexes at once, with each index configured to use a different language-specific tokenizer, something like SELECT * FROM japanese_index, korean_index WHERE MATCH('here could be Japanese or Korean'). However, I cannot do that, because the query-time plugin has to be specified as OPTION token_filter='mylib.so:mylib', i.e. it applies to the whole query rather than to a particular index.

Hi @gillux,

I’ve talked about this with the dev team, and here’s what we think:

  • You can check out how the manticoresoftware/lemmatizer-uk plugin (the UK lemmatizer for Manticore Search, on GitHub) is implemented, especially its xxx_get_extra_token() function.
  • When we worked on Chinese segmentation, we had to move it from the tokenizer to the preprocessor level. It would be great to have a plugin that works at the preprocessing stage, which you could use.
  • Unfortunately, there’s no single library that works well for all CJK languages. But there are C++ libraries for Japanese and Korean that can be integrated. You can use Jieba as a reference. If you’d like to open a PR for this, we’d be happy to review it and help with suggestions.

Hi @Sergey,

Thank you very much for your feedback. I understand that my needs require some modification of Manticore itself, either by adding new plugin hooks or by implementing support for each language in Manticore. I could give it a try.

I believe it makes more sense for our project to go with the plugin solution, because we need flexibility. Our platform, Tatoeba, includes content in 400+ languages (11 of them with continuous scripts), and we want to segment and lemmatize most of them, ideally all of them. In terms of maintenance and flexibility, I think it would be easier for me to keep segmentation/lemmatization completely separate from Manticore's core and have it handled by a separate, actively developed library that supports as many languages as possible, a bit like Snowball but with segmentation and more languages. I am thinking about https://spacy.io/, because accuracy is more important for me than speed.

My idea is to make a single Manticore plugin that just takes a language ISO code as a parameter and does all the NLP work. This way, I can add support for any new language simply by adding a line to my Manticore configuration, without having to modify or update Manticore itself.
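As a rough illustration of that dispatch-by-language idea, here is a sketch of an init callback that reads a lang= option from the index settings and selects a backend. This assumes the option string from something like index_token_filter = myseg.so:myseg:lang=ja is passed through to init, and the signature, the error-buffer size, and the myseg_* names are all my assumptions, not the documented API:

```c
// Sketch only: one plugin, configured per index with an option string such as
// "lang=ja" or "lang=ko", picking a language-specific segmenter backend.
// The init signature and error conventions below are assumptions modeled on
// the index-time token_filter API.

#include <stdio.h>
#include <string.h>

typedef enum { SEG_NONE, SEG_JA, SEG_KO, SEG_TH } SegBackend;

typedef struct {
    SegBackend backend;   // which language-specific segmenter to use
} PluginState;

// Parse "lang=xx" out of the option string passed in the index settings.
static SegBackend backend_from_options(const char *options)
{
    if (!options)
        return SEG_NONE;
    const char *lang = strstr(options, "lang=");
    if (!lang)
        return SEG_NONE;
    lang += 5;
    if (!strncmp(lang, "ja", 2)) return SEG_JA;  // e.g. MeCab / Sudachi
    if (!strncmp(lang, "ko", 2)) return SEG_KO;  // e.g. a Korean analyzer
    if (!strncmp(lang, "th", 2)) return SEG_TH;  // e.g. a Thai word breaker
    return SEG_NONE;
}

// Assumed init callback: remember the backend in per-index state so that
// push_token / get_extra_token can route tokens to the right segmenter.
int myseg_init(void **userdata, int num_fields, const char **field_names,
               const char *options, char *error_message)
{
    (void)num_fields; (void)field_names;
    static PluginState state;            // kept static only for brevity
    state.backend = backend_from_options(options);
    if (state.backend == SEG_NONE) {
        snprintf(error_message, 256,     // 256 = assumed error buffer size
                 "myseg: unknown or missing lang= option");
        return 1;                        // non-zero = failure (assumed convention)
    }
    *userdata = &state;
    return 0;
}
```

The per-language libraries themselves (or a bridge to something like spaCy) would then live entirely behind this single plugin, outside Manticore's core.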

Please let me know if that makes sense. Also, in terms of development work, I am not sure how hard it would be to add preprocessor plugin hooks, compared to adding support for, say, Japanese with a specific library. Maybe I could add support for Japanese first, to get my feet wet, and then try to add plugin hooks? Any advice is much appreciated!