Hi,
We are reviewing our Manticore index and search setup and wondering whether we can improve on what we have now.
Current setup:
- documents in 50+ languages
- no use of morphology
- all stopwords files used together (so a stop word in one language may in theory be a “good” term in another)
- no language-specific regex rules (we have some generic ones)
Desired solution:
- use of morphology
- each language with its own set of stop words and regex
- in short, we would like to improve search quality by defining language-specific parameters. Currently all languages are indexed and searched together, although every document has a language attribute, so we know what language it is in.
Our idea:
- define language specific indices (each with its own morphology, stop words and regex rules)
- this would mean having 50+ indices but also being able to tune each language better
- search an index depending on the language(s) used (potentially 1 to 50+ indices would need to be searched)
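To make the idea more concrete, here is a rough Python sketch of how we imagine generating the per-language index definitions and choosing which indices to search. Everything named here is a placeholder for illustration: the `docs_<lang>` table names, the `content` field, the stop word file paths and the helper functions are hypothetical. The morphology values and the comma-separated table list in `FROM` are what we understand Manticore to accept, but we have not verified them for every language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageSettings:
    """Per-language indexing parameters (all values are illustrative)."""
    morphology: str          # stemmer/lemmatizer name; check against the Manticore docs
    stopwords: str           # hypothetical path to this language's stop word file
    regexp_filter: str = ""  # optional rule; assumed to contain no single quotes


# Hypothetical settings for two of the 50+ languages.
SETTINGS = {
    "en": LanguageSettings("stem_en", "/etc/manticore/stopwords/en.txt"),
    "de": LanguageSettings("libstemmer_de", "/etc/manticore/stopwords/de.txt"),
}


def table_name(lang: str) -> str:
    """Naming convention for language-specific indices (our assumption)."""
    return f"docs_{lang}"


def create_table_sql(lang: str) -> str:
    """Render a CREATE TABLE statement for one language-specific index."""
    s = SETTINGS[lang]
    options = [f"morphology='{s.morphology}'", f"stopwords='{s.stopwords}'"]
    if s.regexp_filter:
        options.append(f"regexp_filter='{s.regexp_filter}'")
    return f"CREATE TABLE {table_name(lang)}(content text) " + " ".join(options)


def build_search_sql(languages: list[str]) -> str:
    """Build one query over the indices of the requested languages.

    Languages without a configured index are skipped. The query text is left
    as a %s placeholder to be bound by the MySQL client library.
    """
    tables = [table_name(lang) for lang in languages if lang in SETTINGS]
    if not tables:
        raise ValueError("no index configured for the requested languages")
    return f"SELECT id FROM {', '.join(tables)} WHERE MATCH(%s) LIMIT 20"
```

For example, `build_search_sql(["en", "de"])` would target both `docs_en` and `docs_de` in a single statement, while a query known to be in English would only touch `docs_en`. One open question with this approach is how relevance scores compare across indices that use different morphology and stop words.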
This is as far as we have got with the idea of a new implementation. We are still trying to get a grasp of its pros and cons, whether it would work at all, and what we are missing.
Any ideas or experience with actual multi-language setups that you can share with us would definitely help.
What do you think? Do other multi-language implementations use one index per language, or a single catch-all index?
Thanks for any input.
Roberto