Error initializing ICU:

rlattuad · February 15, 2022, 12:07pm

While trying to index some Chinese content I get the following:
FATAL: index ‘myhb_zh’: Error initializing ICU: Unable to initialize ICU break iterator: U_MISSING_RESOURCE_ERROR. Make sure ICU data file is accessible (using ‘/usr/share/manticore/icu’ folder)

I thought ICU was built into the binaries.

I am running the latest version (4.2.0).

tomat · February 15, 2022, 12:22pm

There is a manticore-icudata package that you have to install since the 322 version, as described here.
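
A quick way to confirm that the data file is actually in place is sketched below. It assumes the folder from the error message above and the usual ICU naming pattern (icudt<version><endianness>.dat); has_icu_data is a hypothetical helper, not part of Manticore.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if DIR contains at least one ICU data file.
# Assumes ICU's usual naming pattern: icudt<version><endianness>.dat
has_icu_data() {
    dir=$1
    for f in "$dir"/icudt*.dat; do
        # If the glob matched nothing, $f is the literal pattern and -f fails.
        [ -f "$f" ] && return 0
    done
    return 1
}

# Folder taken from the error message in the first post.
if has_icu_data /usr/share/manticore/icu; then
    echo "ICU data found"
else
    echo "ICU data missing; the manticore-icudata package is probably not installed"
fi
```

If the check reports the data as missing, installing manticore-icudata through the same package manager used for the main package should put the file in that folder.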

rlattuad · February 15, 2022, 4:08pm

I installed the package, and now the indexer starts but crashes in ICUPreprocessor:
-------------Main thread:
#0 0x00007f949055b60c in waitpid () from /lib64/libc.so.6
#1 0x00000000004d34db in sphDumpGdb(int, char const, char const) ()
#2 0x00000000004d3b0a in sphBacktrace(int, bool) ()
#3 0x0000000000419e2f in sigsegv(int) ()
#4
#5 0x00007f94904cc387 in raise () from /lib64/libc.so.6
#6 0x00007f94904cda78 in abort () from /lib64/libc.so.6
#7 0x00007f949050ef67 in __libc_message () from /lib64/libc.so.6
#8 0x00007f9490518b36 in _int_malloc () from /lib64/libc.so.6
#9 0x00007f949051b78c in malloc () from /lib64/libc.so.6
#10 0x0000000000c13c17 in utext_setup_65.part.19 ()
#11 0x0000000000c15035 in utext_openUTF8_65 ()
#12 0x0000000000bca196 in ICUPreprocessor_c::ProcessBufferICU(unsigned char const, int) ()
#13 0x0000000000bc9c7c in ICUPreprocessor_c::AddTextChunk(unsigned char const, int, sph::Vector_T<unsigned char, sph::DefaultCopy_T, sph::DefaultRelimit, sph::DefaultStorage_T >&, bool, bool) ()
#14 0x0000000000bc9beb in ICUPreprocessor_c::Process(unsigned char const, int, sph::Vector_T<unsigned char, sph::DefaultCopy_T, sph::DefaultRelimit, sph::DefaultStorage_T >&, bool) ()
#15 0x0000000000bca46d in FieldFilterICU_c::Apply(unsigned char const, int, sph::Vector_T<unsigned char, sph::DefaultCopy_T, sph::DefaultRelimit, sph::DefaultStorage_T >&, bool) ()
#16 0x0000000000baff79 in CSphSource::IterateDocument(bool&, CSphString&) ()
#17 0x0000000000433edc in CSphIndex_VLN::Build(sph::Vector_T<CSphSource, sph::DefaultCopy_T<CSphSource>, sph::DefaultRelimit, sph::DefaultStorage_T<CSphSource> > const&, int, int, CSphIndexProgress&) ()
#18 0x0000000000417bb2 in DoIndex(CSphConfigSection const&, char const, CSphOrderedHash<CSphConfigSection, CSphString, CSphStrHashFunc, 256> const&, _IO_FILE*) ()
#19 0x000000000041ccc5 in main ()

Two questions:

1. ICU should have been installed as part of the main package (I checked with yum and it is there), but in my case it was missing. Mine was an upgrade, though. Should I do a reinstall?
2. Is ICU version-specific, or does the manticore-icudata package work with any version?

Thanks

Sergey · February 16, 2022, 2:54am

What’s your OS?

rlattuad · February 16, 2022, 10:55am

CentOS 7

Sergey · February 16, 2022, 10:56am

How do we reproduce the crash?

rlattuad · February 16, 2022, 11:26am

We can start by looking at the conf file, just to make sure there is nothing odd in the configuration. This is the call to indexer:
sudo -H -u manticore /usr/bin/indexer --config /etc/manticoresearch/manticore_langs.conf myhb_zh

Here is the conf.

#############################################################################
## index definition
#############################################################################
index common
{
type = plain
dict = keywords

wordforms		= /var/lib/manticore/wordforms.txt

min_word_len		= 2

min_infix_len		= 3

expand_keywords		= 1

blend_chars		= +, &, U+0023, -, U+002F

blend_mode		= trim_none, skip_pure

html_strip		= 1

html_remove_elements	= style, script, title, head

preopen			= 1

    index_exact_words     = 1

# Special terms equivalence
regexp_filter		= (β) => beta
regexp_filter		= (α) => alpha
regexp_filter		= (percent) => %

#
# Dosages common form: bring all dosages into common form
# i.e. 150 mg to 150mg etc... with no space between numeric and units parts
#
regexp_filter = (?i)(\s|\b)+(\pN*[.,]?\pN*)(\s|\b)+(mg\/ml|ml\/amp|mg|ml|ui|g|units|iu|mcg|µg)(\s|\b)+ => \1\2\4\5

}

# Chinese

index myhb_zh:common
{
source = zh_products
path = /var/lib/manticore/myhb_zh
stopwords = zh

# Use ICU
charset_table 		= chinese 
morphology		= icu_chinese

# or use ngram
#ngram_chars		= chinese
#ngram_len		= 1

}

#############################################################################
## indexer settings
#############################################################################

indexer
{
# memory limit, in bytes, kilobytes (16384K) or megabytes (256M)
# optional, default is 128M, max is 2047M, recommended is 256M to 1024M
mem_limit = 1024M

# how to handle IO errors in file fields
# known values are 'ignore_field', 'skip_document', and 'fail_index'
# optional, default is 'ignore_field'
#
on_file_field_error = ignore_field

# lemmatizer cache size
# improves the indexing time when the lemmatization is enabled
# optional, default is 256K
#
lemmatizer_cache 	= 128M

max_file_field_buffer   = 128M

}

--eof--

rlattuad · February 16, 2022, 11:28am

After this I can prepare a test database so that you can try to replicate the problem at your end.

Or you can send me a development version of indexer and I can send you any output/info I get.

Whatever is easier for you.

Sergey · February 16, 2022, 1:15pm

The config looks OK. Please prepare a test config that reproduces the crash, ideally in a minimal form like this:

snikolaev@dev:~$ cat csv_min.conf
searchd {
    listen = 9315:mysql41
    log = searchd.log
    pid_file = searchd.pid
    binlog_path =
}

source src {
    type = csvpipe
    csvpipe_command = echo "1,abc" && echo "2,abc" && echo "3,abc abc"
    csvpipe_field = f
}

index idx {
    type = plain
    source = src
    path = idx
}