malloc(): Issues when Indexing large plain tables

ryker · August 14, 2024, 2:43pm

Hi there!
I have spent quite a while trying to get a reliable setup running, but apparently I’m missing something.
I’m trying to set up Manticore with plain tables. We used an older version of sphinxsearch before but want to upgrade to Manticore.
I’ve managed to create a working setup, but it fails with the large tables we actually have.

The Setup:
I use the manticore:6.3.0 image and build on it to add a custom CMD script. The script generates manticore.conf with specific parameters for the MariaDB connection and for which tables to include. If the main tables are not already present in the /var/lib/manticore folder, the indexer gets executed, and finally searchd gets started like this: "searchd -c /etc/manticoresearch/manticore.conf --nodetach --logdebugv"
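The startup logic described above could look roughly like the sketch below. It is not the actual script from this setup: the MAIN_TABLES list is a hypothetical placeholder, and checking for a table's .sph header file is only one assumed way to decide whether a main table is already present.

```shell
#!/bin/sh
# Sketch of the container start logic described above (not the original script).
# CONF and DATA_DIR mirror paths from the post; MAIN_TABLES is a hypothetical list.
CONF="${CONF:-/etc/manticoresearch/manticore.conf}"
DATA_DIR="${DATA_DIR:-/var/lib/manticore}"
MAIN_TABLES="${MAIN_TABLES:-main_table1 main_table2}"

# Returns 0 if at least one main table has no header file (.sph) yet.
needs_initial_index() {
    for table in $MAIN_TABLES; do
        [ -f "$DATA_DIR/$table.sph" ] || return 0
    done
    return 1
}

start() {
    if needs_initial_index; then
        # searchd is not running yet, so the tables are built without --rotate.
        indexer -c "$CONF" --all || return 1
    fi
    # Replace the shell with searchd so it receives signals directly.
    exec searchd -c "$CONF" --nodetach --logdebugv
}
```

Calling start as the last line of the CMD script would give the behavior described: index only when the data volume is empty, then run searchd in the foreground.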

This docker container works with smaller data sizes in a local docker compose environment. The manticore user runs both the indexing (we use a main+delta scheme) and the searchd process.

For the Kubernetes setup I use a StatefulSet and moved the /var/lib/manticore/ folder onto a volume to persist the data should the container crash.

For the indexer I created Kubernetes CronJobs, and from those pods I kubectl exec into the running pods to execute the indexers. The main indexer runs once every day and the delta indexer runs every few minutes.
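As a rough illustration of what such a CronJob pod might run, here is a minimal sketch. The pod name, namespace, and table names are hypothetical placeholders, and passing --rotate is an assumption based on searchd already serving the tables at that point.

```shell
#!/bin/sh
# Sketch of a CronJob command. POD and NAMESPACE are hypothetical names.
POD="${POD:-manticore-0}"
NAMESPACE="${NAMESPACE:-default}"
CONF="${CONF:-/etc/manticoresearch/manticore.conf}"

# Run indexer inside the searchd pod for the given tables.
# --rotate makes indexer write .new files and signal searchd to swap them in.
run_indexer() {
    kubectl exec -n "$NAMESPACE" "$POD" -- indexer -c "$CONF" --rotate "$@"
}
```

The delta job would then call something like run_indexer delta_table1 delta_table2 every few minutes, and the main job run_indexer main_table6 once a day.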

The failure behavior I currently notice is that the initial deployment works fine every time: the container starts up, the indexer gets executed first without a running searchd process, and afterwards the searchd process gets started.

But on subsequent attempts to execute any indexer I get errors like the following, though not really consistently:
DEBUG: will rotate GENERIC_TABLENAME1
double free or corruption (out)
DEBUG: TaskRotation starts with 11 deferred tables
DEBUG: seamless rotate local table GENERIC_TABLENAME2
rotating table GENERIC_TABLENAME2: started
Crash!!! Handling signal 6
DEBUG: prealloc enough RAM and lock new table
…
DEBUG: all went fine; swap them
…
rotating table ‘GENERIC_TABLENAME2’: success
…
rotating table ‘GENERIC_TABLENAME3’: started
DEBUG: prealloc enough RAM and lock new table
malloc(): unaligned tcache chunk detected

I can’t really pinpoint this to any specific reason. The main indexer process takes up to 30 GiB of memory while executing, but the memory limit I have is far greater than that. The volume also has enough space for the data I’m trying to save.

My manticore.conf has the following settings for searchd:
listen = 0.0.0.0:9313:sphinx
listen = 0.0.0.0:9306:mysql
listen = 0.0.0.0:9308:http
log = /var/lib/manticore/searchd.log
network_timeout = 5
pid_file = /var/run/manticore/searchd.pid
seamless_rotate = 1
preopen_tables = 0
secondary_indexes = 0
unlink_old = 1
threads = 4
net_workers = 4
binlog_path = # disable logging
My indexer has the following:
mem_limit = 1024M
lemmatizer_cache = 256M
write_buffer = 256M

I have the feeling the issue comes up after the SIGHUP from the indexer process to the searchd process, but I can’t find the root cause.

tomat · August 16, 2024, 7:06am

Could you check the index that causes this error with indextool to make sure the index is valid?

tomat · August 16, 2024, 7:09am

That looks like memory corruption due to bad index data or unusual daemon behavior.

ryker · August 21, 2024, 7:14am

I’ll do that. Anything specific to look out for?

tomat · August 21, 2024, 7:38am

just run indextool -c your.conf --check table_name

ryker · August 21, 2024, 8:04am

Looks like I have some issues in one table with doc-ids:

checking doc-id lookup…
FAILED, invalid docid delta 0 at row 760544, checkpoint 0, doc 3, last docid 5
FAILED, invalid docid 0(5) at row 760544, checkpoint 0, doc 3, last docid 5
FAILED, invalid docid 2(7) at row 3, checkpoint 0, doc 4, last docid 0

And around 60 more of those in this table.

tomat · August 21, 2024, 8:15am

It seems your index data is invalid. You need to reindex your data from scratch or restore it from the backup.

ryker · August 21, 2024, 8:38am

I see…
Does that hint at a faulty index definition in the manticore.conf?

tomat · August 21, 2024, 9:07am

I’m sure you cannot define an invalid index in the config; you can run indextool right after indexer to make sure.

ryker · August 22, 2024, 7:19am

Since the issue occurs quite reliably, even after clearing all the indexes, I think it is caused by something in my setup.

Yesterday I noticed that the value I use to define which data goes into the main or delta table is problematic when multiple services are up at the same time: the main indexing updates this timestamp, and since it runs in three services, the timestamp gets changed three times, which causes some of the indexes to have gaps.

So if I have 3 services which run the main indexing once a day at 1, 3, and 5 am, the timestamp I use gets changed for all services at those three times, which creates data gaps between the delta and main indexing for the services.

Could something like this cause the indexes to become invalid?

tomat · August 22, 2024, 7:54am

If you are not sure where the issue happens, you could run indextool right after every indexer operation to make sure your indexes are valid and to catch the step that breaks the index.
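A minimal sketch of such a check loop, assuming the indextool invocation given earlier in this thread. The table names are placeholders; treating a non-zero exit status as a failure is an assumption, so the sketch also looks for the FAILED marker seen in the report above.

```shell
#!/bin/sh
# Sketch: check each table with indextool and stop at the first broken one.
CONF="${CONF:-/etc/manticoresearch/manticore.conf}"

check_tables() {
    for table in "$@"; do
        status=0
        report=$(indextool -c "$CONF" --check "$table" 2>&1) || status=$?
        # Assumption: a failed check shows up as a non-zero exit status
        # and/or as a line containing FAILED in the report.
        if [ "$status" -ne 0 ] || printf '%s\n' "$report" | grep -q 'FAILED'; then
            printf '%s\n' "$report" >&2
            echo "check failed: $table" >&2
            return 1
        fi
        echo "check passed: $table"
    done
}
```

Running check_tables main_table6 delta_table1 after each indexer run should show which step leaves a table in a bad state. Keep in mind that after an indexer run with rotation, the freshly built files only become the live table once searchd has swapped them in.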

ryker · August 22, 2024, 8:01am

I’ll try that. Thanks for your help, by the way =)

ryker · August 22, 2024, 9:26am

indextool seems to indicate that all the indexes are alright. The searchd process seems to die while rotating. The initial indexing runs through, then searchd gets started, and the first delta indexing already crashes my service. The output before dying looks something like this:

DEBUG: will rotate delta_table1
DEBUG: will rotate delta_table2
DEBUG: will rotate delta_table3
DEBUG: will rotate delta_table4
DEBUG: will rotate delta_table5
DEBUG: will rotate main_table6
1900K … … … … … 100% 4.34M=0.2s
double free or corruption (out)
DEBUG: TaskRotation starts with 6 deferred tables
DEBUG: seamless rotate local table delta_table1
rotating table ‘delta_table1’: started
Crash!!! Handling signal 6
DEBUG: prealloc enough RAM and lock new table
DEBUG: Locking the table via file /var/lib/manticore/delta_table1.new.spl
DEBUG: lock /var/lib/manticore/delta_table1.new.spl success
DEBUG: CSphIndex_VLN::Preread invoked ‘delta_table1’(/var/lib/manticore/delta_table1.new)
DEBUG: Preread successfully finished
DEBUG: activate new table
RW-idx for rename to .old, acquiring…
RW-idx for rename to .old, acquired…
DEBUG: rotating table ‘delta_table1’: applying other tables killlists
DEBUG: rotating table ‘delta_table1’: applying other tables killlists… DONE
DEBUG: rotating table ‘delta_table1’: apply killlist from this table to other tables (killlist_target)
DEBUG: rotating table ‘delta_table1’: apply killlist from this table to other tables (killlist_target)… DONE
DEBUG: all went fine; swap them
DEBUG: unlink /var/lib/manticore/delta_table1.old
DEBUG: Unlocking the table (lock /var/lib/manticore/delta_table1.old.spl)
DEBUG: File ID ok, closing lock FD 98, unlinking /var/lib/manticore/delta_table1.old.spl
rotating table ‘delta_table1’: success
DEBUG: seamless rotate local table delta_table2
rotating table ‘delta_table2’: started
DEBUG: prealloc enough RAM and lock new table
Crash!!! Handling signal 11

I can also provide a crash dump if needed.

tomat · August 22, 2024, 9:49am

Yes, it would be better to create a ticket on GitHub and upload there the crash log, the indextool report from checking the index files, and the crash dump as described in the manual.

ryker · August 22, 2024, 9:50am

I’ll do so.
Thank you for your help and patience!