Chinese and other national languages


#1

How do I support search in Chinese and other national languages, such as Japanese …


#2

By default, only English and Russian characters are indexed.
You need to create a custom charset_table that includes the Chinese or other language characters you need.
For example, to also include Swedish characters, the charset table should look like this:

charset_table = 0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451, U+C5->U+E5, U+E5, U+C4->U+E4, U+E4, U+D6->U+F6, U+F6

The Sphinx Wiki has a page with charset tables for various languages: http://sphinxsearch.com/wiki/doku.php?id=charset_tables
For CJK languages you might want to use the ngram feature (for unsegmented texts).
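
A minimal sketch of an ngram setup for unsegmented CJK text could look like the block below. The index name, source and path are hypothetical placeholders. The ngram_chars range is the one commonly shown in the Sphinx documentation and may need adjusting for your data. This sketch keeps the CJK characters out of charset_table and lists them only in ngram_chars; check the documentation for your version on whether the two lists may overlap.

```
index test_cjk
{
	# hypothetical source and path, replace with your own
	source        = src1
	path          = /path/to/data/test_cjk

	# non-CJK characters are handled as usual
	charset_table = 0..9, A..Z->a..z, _, a..z

	# index each character in the ngram_chars range as a separate token
	ngram_len     = 1
	ngram_chars   = U+3000..U+2FA1F
}
```

With ngram_len = 1, each CJK character is indexed on its own, so a query for a multi-character string should match documents that contain those characters next to each other, without needing a word segmenter.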


#3

I tried to import the CJK table, but it does not work:
select * from test1 where match('@title =网站');
index test1
{

	source			= src1
	path			= manticore/data/test1
	ngram_len = 1
	charset_table   = 0..9, english,U+F900->U+8C48, U+F901->U+66F4, U+F902->U+8ECA, U+F903->U+8CC8, U+F904->U+6ED1, U+F905->U+4E32, \
		U+F906->U+53E5, U+F907->U+9F9C, U+F908->U+9F9C, U+F909->U+5951, U+F90A->U+91D1, U+F90B->U+5587, U+F90C->U+5948, U+F90D->U+61F6, \
		U+F90E->U+7669, U+F90F->U+7F85, U+F910->U+863F, U+F911->U+87BA, U+F912->U+88F8, U+F913->U+908F, U+F914->U+6A02, U+F915->U+6D1B, \
		U+F916->U+70D9, U+F917->U+73DE, U+F918->U+843D, U+F919->U+916A, U+F91A->U+99F1, U+F91B->U+4E82, U+F91C->U+5375, U+F91D->U+6B04, \
...
..

I don't need word segmentation. I only need to be able to find the text.


#4

Hi, just FYI: we have a new article about using Manticore with CJK languages. I hope it is useful: https://bit.ly/2Ll9cyJ