Recommended multilanguage setup

Hi,

we are reviewing our Manticore index and search setup and wonder if we could improve on what we have now.

Current setup:

  • documents in 50+ languages
  • no use of morphology
  • all stopwords files used together (so a stop word in one language may in theory be a “good” term in another)
  • no language-specific regex rules (we have some generic ones)

Desired solution

  • use of morphology
  • each language with its own set of stop words and regex
  • in short, we would like to improve search quality with language-specific settings; currently all languages are indexed and searched together (we do have a language attribute, so we know which language each document is in).

Our idea

  • define language-specific indices (each with its own morphology, stop words and regex rules)
  • this would mean having 50+ indices but also being able to tune each language better
  • search an index depending on the language(s) used (potentially 1 to 50+ indices would need to be searched)
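A minimal sketch of the routing step in the last bullet, assuming one table per language named with a hypothetical `docs_<code>` scheme (the prefix, the helper names and the language-code pattern are illustrative only, not something Manticore prescribes). It only turns a list of language codes into the list of tables to search:

```python
import re

# Hypothetical naming scheme: one table per language, e.g. docs_en, docs_de.
TABLE_PREFIX = "docs_"
# Assumption: languages are stored as 2- or 3-letter lowercase codes.
LANG_CODE = re.compile(r"^[a-z]{2,3}$")


def tables_for(languages):
    """Map language codes to table names, dropping duplicates and keeping order."""
    tables = []
    for lang in languages:
        code = lang.strip().lower()
        if not LANG_CODE.match(code):
            raise ValueError(f"unexpected language code: {lang!r}")
        name = TABLE_PREFIX + code
        if name not in tables:
            tables.append(name)
    return tables


def from_clause(languages):
    """Build the comma-separated table list for the FROM part of a query."""
    tables = tables_for(languages)
    if not tables:
        raise ValueError("at least one language is required")
    return ", ".join(tables)
```

The result would go into the FROM part of a query, for example `SELECT id FROM docs_en, docs_de WHERE MATCH(...)`. Validating the codes before building table names keeps arbitrary input out of the query text.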

This is as far as we have got with the idea. We are still trying to get a grasp of its pros and cons, whether it would work at all, and what we are missing.

Any ideas or experience of actual multi-language setups that you can share with us would definitely help.

What do you think? Do other multi-language implementations use a per-language index or one catch-all index?

Thanks for any input.

Roberto

I think that if filtering by language is natural in your app (and it's uncommon to search texts in many languages at once), the above is a must: each query then touches only the indexes for the languages involved, so it should be much more efficient, at least performance-wise. In addition, you can benefit from different NLP settings per language.

Once you have all the sub-indexes, if you still need to query all of them at once, you can combine them in a distributed index.
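As a rough illustration of that last point, here is a sketch that renders a distributed index stanza for a plain-mode configuration file from a list of per-language indexes. The index names (`docs_all`, `docs_en`, ...) and the helper itself are hypothetical; the stanza assumes the usual `type = distributed` plus one `local = ...` line per sub-index, so check it against the documentation for your version before relying on it:

```python
def distributed_index_config(name, local_indexes):
    """Render a config stanza that groups local indexes under one distributed index."""
    if not local_indexes:
        raise ValueError("a distributed index needs at least one local index")
    lines = [f"index {name}", "{", "    type = distributed"]
    # One "local" line per per-language sub-index.
    lines += [f"    local = {index}" for index in local_indexes]
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical names: a catch-all index over three per-language indexes.
    print(distributed_index_config("docs_all", ["docs_en", "docs_de", "docs_fr"]))
```

With 50+ languages, generating this stanza from the same list that drives index creation avoids the two drifting apart.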