Recommended multilanguage setup

rlattuad · November 24, 2020, 4:00pm

Hi,

We are reviewing our Manticore index and search setup and wonder whether we could improve on what we have now.

Current setup:

documents in 50+ languages
no use of morphology
all stopwords files used together (so a stop word in one language may in theory be a “good” term in another)
no language-specific regex rules (we have some generic ones)

Desired solution

use of morphology
each language with its own set of stop words and regex
Basically, we would like to improve our search quality by defining language-specific parameters. Currently all languages are indexed and searched together, although we do have a language attribute, so we know which language each document is in.

Our idea

define language-specific indices (each with its own morphology, stop words and regex rules)
this would mean having 50+ indices but also being able to tune each language better
search an index depending on the language(s) used (potentially 1 to 50+ indices would need to be searched)
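
To make the idea concrete, here is a minimal sketch in Python that renders one index definition per language. It is only an illustration under assumptions: the index names (docs_en, ...), the file paths, the field and attribute names, and the sample regex rule are hypothetical, and the morphology values depend on which stemmers your Manticore build includes. The directives themselves (morphology, stopwords, regexp_filter) are the standard per-index settings.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LanguageSettings:
    """Per-language tuning knobs (all values below are illustrative)."""
    code: str                      # language code, also used in the index name
    morphology: str                # e.g. stem_en or libstemmer_de, if available
    stopwords_path: str            # stop word file for this language only
    regexp_filters: List[str] = field(default_factory=list)


def render_index(lang: LanguageSettings, data_dir: str = "/var/lib/manticore") -> str:
    """Render a config block for one language-specific real-time index."""
    name = f"docs_{lang.code}"     # hypothetical naming scheme
    lines = [
        f"index {name}",
        "{",
        "    type = rt",
        f"    path = {data_dir}/{name}",
        "    rt_field = content",           # hypothetical full-text field
        "    rt_attr_string = language",    # hypothetical language attribute
        f"    morphology = {lang.morphology}",
        f"    stopwords = {lang.stopwords_path}",
    ]
    # One regexp_filter line per rule; languages without rules get none.
    lines += [f"    regexp_filter = {rule}" for rule in lang.regexp_filters]
    lines.append("}")
    return "\n".join(lines)


# Hypothetical settings for two of the 50+ languages.
LANGUAGES = [
    LanguageSettings("en", "stem_en", "/etc/manticore/stopwords/en.txt",
                     [r'(\d+)\" => \1 inch']),
    LanguageSettings("de", "libstemmer_de", "/etc/manticore/stopwords/de.txt"),
]

if __name__ == "__main__":
    print("\n\n".join(render_index(lang) for lang in LANGUAGES))
```

Generating the blocks from one table of settings keeps 50+ definitions manageable, since adding a language becomes a one-line change.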

This is as far as we have got with our idea for a new implementation. We are still trying to get a grasp of its pros and cons, whether it would work at all, and what we are missing.

Any ideas or experience of actual multi-language setups that you can share with us would definitely help.

What do you think? Do other multi-language document implementations use a per-language index or one catch-all index?

Thanks for any input.

Roberto

Sergey · November 27, 2020, 4:22am

I think that if it is natural in your app to filter by language (and it is uncommon to search across texts in many languages at once), the above is a must: it should be much more efficient, at least performance-wise. In addition, you can benefit from language-specific NLP settings.

Once you have all the sub-indexes, if you still need to query all of them at once, you can combine them in a distributed index.
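
As a rough sketch of how that could look (not a definitive setup), the Python below renders a distributed index that lists every per-language index as a local, and picks the FROM target for a query: the specific sub-indexes when the languages are known, or the distributed index otherwise. The names docs_all and docs_<code>, and the set of known languages, are assumptions made for the example.

```python
from typing import Iterable, List

# Hypothetical set of languages that have their own index.
KNOWN_LANGUAGES = {"en", "de", "fr"}
DISTRIBUTED_INDEX = "docs_all"     # hypothetical catch-all name


def index_name(code: str) -> str:
    """Map a language code to its per-language index name."""
    return f"docs_{code}"


def render_distributed(name: str, local_indexes: Iterable[str]) -> str:
    """Render a distributed index that fans out to the given local indexes."""
    lines = [f"index {name}", "{", "    type = distributed"]
    lines += [f"    local = {idx}" for idx in local_indexes]
    lines.append("}")
    return "\n".join(lines)


def from_clause(languages: List[str]) -> str:
    """Return the index list to put after FROM in a search query.

    With no languages given, fall back to the distributed index so the
    query reaches every sub-index.
    """
    if not languages:
        return DISTRIBUTED_INDEX
    unknown = [code for code in languages if code not in KNOWN_LANGUAGES]
    if unknown:
        # Rejecting unknown codes also keeps arbitrary text out of the query.
        raise ValueError(f"no index for language(s): {', '.join(unknown)}")
    return ", ".join(index_name(code) for code in languages)


if __name__ == "__main__":
    locals_ = [index_name(code) for code in sorted(KNOWN_LANGUAGES)]
    print(render_distributed(DISTRIBUTED_INDEX, locals_))
    print(f"SELECT id FROM {from_clause(['en', 'de'])} WHERE MATCH('query')")
    print(f"SELECT id FROM {from_clause([])} WHERE MATCH('query')")
```

Note that a query sent to several indexes is tokenized by each index with its own settings, which is what lets the per-language morphology and stop words apply at search time as well.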