High CPU usage after upgrade to 6.2.12

After upgrading to the latest release we noticed growing CPU usage for searchd. After it reached 800% (100% utilization of 8 cores) we had to restart it, but after a while the same thing happened. We are running MariaDB 10.3.39.
Our Manticore configuration uses local distributed indices.
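
For context, by "local distributed indices" I mean a distributed table whose members are all local tables, roughly along these lines (the table names are just placeholders):

index products_dist
{
    type  = distributed
    local = products_part1
    local = products_part2
}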

We started playing with the searchd configuration and after disabling “pseudo_sharding” the problem disappeared.

It seemed that pseudo_sharding was causing searchd to keep cpu-cores constantly busy running a connection to the MariaDB server and never releasing it.

Sorry, this is not a great explanation; I am interested in investigating the issue further but need some pointers or suggestions for tests.

Any ideas?
Roberto

keep cpu-cores constantly busy running a connection to the MariaDB server and never releasing it.

Unlikely, since the indexer doesn't care about pseudo_sharding.

Any ideas?

From 6.2.0 changelog:

  • Enabled multithreaded execution of queries containing secondary indexes, with the number of threads limited to the count of physical CPU cores. This should considerably improve the query execution speed.
  • pseudo_sharding has been adjusted to be limited to the number of free threads. This update considerably enhances the throughput performance.
  • The query optimizer has been enhanced to support full-text queries, significantly improving search efficiency and performance.

In short: Manticore 6.2.x can utilize the CPU more intensively to make queries faster. Is that not what you are seeing? Can you show graphs demonstrating that the CPU load increased a lot after the upgrade while the response time didn't change?

we had to restart it, but after a while the same thing happened.

Does the CPU load graph look like a gradual increase after each restart?

Need to correct some stuff I wrote:

  • the problem is only related to searchd
  • the busy connections are the searchd threads (there is no connection to the DB)

How does the allocation/release of threads work with pseudo-sharding? Is it possible that once a thread is allocated it is not released, maybe following some error condition on the communication channel (MySQL protocol)?

The CPU load increases gradually after each restart; sorry, I do not know much more at this stage. We are trying to work out where to look, but disabling pseudo-sharding brings CPU usage back to its old pattern.

How does the allocation/release of threads work with pseudo-sharding?

It allocates as many workers as possible. If some workers are already busy it doesn’t use them.

Not by design; only if there's a bug that causes that.

The CPU load increases gradually after each restart; sorry, I do not know much more at this stage. We are trying to work out where to look, but disabling pseudo-sharding brings CPU usage back to its old pattern.

If you don’t have performance graphs, can you please do the following? (A rough shell sketch of these steps is given after the list.)

  • run dstat -at 60 in the background (separate terminal window)
  • restart Manticore
  • wait until the load is high
  • run show threads option format=all
  • run select * from @@system.threads
  • run show status
  • provide the outputs
  • provide the searchd log
  • provide the query log if possible

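Roughly, assuming a systemd-based install and the default SQL port 9306 (adjust paths and ports as needed), that would be:

dstat -at 60 > dstat.out &
sudo systemctl restart manticore
# ...wait until the CPU load is high again, then collect:
mysql -h0 -P9306 -e "SHOW THREADS OPTION format=all" > threads.txt
mysql -h0 -P9306 -e "SELECT * FROM @@system.threads" > system_threads.txt
mysql -h0 -P9306 -e "SHOW STATUS" > status.txt
# ...then share these files plus dstat.out, searchd.log and, if possible, the query log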

Try setting
pseudo_sharding = 0
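
That is, in the searchd section of your config, something like:

searchd
{
    # ...other searchd settings...
    pseudo_sharding = 0
}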

Reference link suggesting possible performance problems with the default pseudo_sharding setting in version 6.2.12:

This is exactly what I did to fix the issue: setting pseudo_sharding = 0 fixed the CPU problem without any noticeable query response degradation. We also set a limit on the number of threads that can be used. There is obviously something different between 6.0 (our previous version) and 6.2 (when we started seeing the problem). I will collect some data and share it, hoping it will provide some clues as to what is happening.
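
For reference, a thread cap like that can be set with the searchd-level threads option; the value below is only an example and should be tuned to the host:

searchd
{
    pseudo_sharding = 0
    # example cap; by default threads equals the number of CPU cores
    threads = 8
}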

We also set a limit on the number of threads that can be used.

Of note, we are also doing this.

We'd appreciate it if anyone could tell us how exactly we can reproduce it, e.g. by providing their table files, query log and config, sending them to our write-only S3 - Manticore Search Manual: Reporting bugs

I’m afraid without that we won’t be able to fix it since our tests didn’t show any overload/performance issues.

Arrived here after googling for my newfound problem after a new installation, which seems to be the same as described by others. Here is the output of top on an otherwise not busy server; it seems like Manticore went through the roof at some point and doesn't come back down. Trying to restart Manticore hangs. Will report via bugs when I get a bit of time with more info. Thanks.

top - 10:27:35 up 1 day, 14:22, 1 user, load average: 7.10, 7.18, 7.14
Tasks: 207 total, 1 running, 206 sleeping, 0 stopped, 0 zombie
%Cpu(s): 87.4 us, 0.1 sy, 0.0 ni, 12.4 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
MiB Mem : 15610.5 total, 5423.8 free, 5120.4 used, 5066.3 buff/cache
MiB Swap: 4096.0 total, 2191.7 free, 1904.2 used. 10143.6 avail Mem

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
882 mantico+  20   0 1661664 381492 265008 S 699.3   2.4  10996:51 searchd

1010 mysql 20 0 3479460 1.7g 19228 S 0.7 10.9 64:02.00 mysqld
797 root 20 0 8095304 300040 14440 S 0.3 1.9 2:12.51 java
177615 root 20 0 0 0 0 I 0.3 0.0 0:00.15 kworker/u16:1-events_power_efficient
178107 root 20 0 10496 3968 3360 R 0.3 0.0 0:00.01 top

These are the last lines before trying to restart Manticore; the query log doesn't show anything abnormal:

[Sat Dec 9 10:24:17.323 2023] [889] rotating table 'spc_304001001_delta': success
[Sat Dec 9 10:24:17.323 2023] [889] rotating table: all tables done
[Sat Dec 9 10:27:41.162 2023] [882] caught SIGTERM, shutting down
[Sat Dec 9 10:27:44.174 2023] [882] WARNING: still 5 alive tasks during shutdown, after 3.008 sec
[Sat Dec 9 10:30:41.574 2023] [178267] watchdog: main process 178268 forked ok
[Sat Dec 9 10:30:41.577 2023] [178268] FATAL: failed to lock '/var/lib/manticore//binlog.lock': 11 'Resource temporarily unavailable'
[Sat Dec 9 10:30:41.578 2023] [178267] watchdog: main process 178268 exited cleanly (exit code 1), shutting down

I have the same problem. System load gradually increases up to 50 (limited by thread count, I suppose) because some queries hang. These queries are batch or simple, but what they have in common is that a filter is used on an MVA attribute.
Even very simple queries like this one hang:

SELECT id FROM post WHERE MATCH('(@(tags,user_screen_name,text,user_name) video)') AND host_id = 5 LIMIT 0, 1 OPTION max_matches = 1

What's interesting is that a given search term hangs with some MVA attribute values and works fine with others, and always with the same values, not randomly.

It is also interesting that queries stop hanging when there are few free threads left in the system.

Also unusual: when I run the queries manually, one at a time, from the mysql client, they hang; but when I test with mysqlslap at concurrency = 1 they work fine, and they hang at higher concurrency values.

This is the command I run to put load on Manticore. I ran it several times until all the threads were clogged: the first 4-5 attempts did not reproduce the problem, but subsequent attempts did.

mysqlslap --verbose --query=/query/hang.log --port=9306 --host=manticore --concurrency=2 --number-of-queries=20 --detach=1 --delimiter="\n"

Mysqlslap with these queries works only with concurrency = 1, and gets stuck if concurrency > 1:

SELECT id,host_id FROM post WHERE MATCH('(@(tags,user_screen_name,text,user_name) video)') and any(host_id) = 7 LIMIT 0, 3000 OPTION max_matches = 3000;
SELECT id,host_id FROM post WHERE MATCH('(@(tags,user_screen_name,text,user_name) video)') and any(host_id) = 3 LIMIT 0, 3000 OPTION max_matches = 3000;
SELECT id,host_id FROM post WHERE MATCH('(@(tags,user_screen_name,text,user_name) tiktok)') and any(host_id) = 3 LIMIT 0, 3000 OPTION max_matches = 3000;
SELECT id,host_id FROM post WHERE MATCH('(@(tags,user_screen_name,text,user_name) tiktok)') and any(host_id) = 9 LIMIT 0, 3000 OPTION max_matches = 3000;

Mysqlslap with these queries works with any concurrency value:

SELECT id,host_id FROM post WHERE MATCH('(@(tags,user_screen_name,text,user_name) video)') and any(host_id) = 7 LIMIT 0, 3000 OPTION max_matches = 3000;
SELECT id,host_id FROM post WHERE MATCH('(@(tags,user_screen_name,text,user_name) tiktok)') and any(host_id) = 3 LIMIT 0, 3000 OPTION max_matches = 3000;

The query log shows that the first 25 requests are processed successfully, but then the system becomes unresponsive until most of the threads are stuck; after that Manticore starts functioning again (but the stuck threads keep the system load average high).

Testing without pseudo sharding works up to concurrency values of no more than 24. But that is probably a different issue, because I can't see hung queries in the threads list in this scenario.

Also I noticed that the environment variable searchd_pseudo_sharding=0 does not work for the Docker container, although some other variables do. However, this is also a separate issue.
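
For context, the variable is passed to the container in the usual Docker way; a standalone-container equivalent would look roughly like this (image tag and port mapping are just an example):

docker run --name manticore -p 9306:9306 -e searchd_pseudo_sharding=0 manticoresearch/manticore:6.2.12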

I use Manticore 6.2.12 dc5144d35@230822 (columnar 2.2.4 5aec342@230822) (secondary 2.2.4 5aec342@230822) on docker swarm.

I've attached the logs and some debug output at the link, but it would probably be better to create an issue on GitHub.

I can confirm that our queries also use MVA attributes.

@maxim can you check if it works fine in the latest dev version? Manticore Search Manual

I confirm that the problem with clogged threads does not exist in dev-6.2.13-27c3259.

I also confirm that the problem with the searchd_pseudo_sharding environment variable not working is fixed in dev-6.2.13-27c3259.

In general, Manticore works fine when tested with mysqlslap with the following parameters: --concurrency=40 --number-of-queries=2000. If I increase the concurrency or the number of queries, only a fraction of the queries complete (86-92% of the queries are executed and then they stop executing), but the problem of queries hanging in threads is gone. Probably some buffer overflow is happening.
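
That is, the same mysqlslap invocation as above, just with the higher values, along the lines of:

mysqlslap --verbose --query=/query/hang.log --port=9306 --host=manticore --concurrency=40 --number-of-queries=2000 --detach=1 --delimiter="\n"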

Thank you for your product 🙂

Did the fix in dev-6.2.13-27c3259 make it into 6.3?
Does anyone have details of the bug/fix/patch?

Thanks

According to the commit log, Bump backup version to: 1.3.5-24022217-d6cd26d · manticoresoftware/manticoresearch@27c3259 · GitHub is in the 6.3.0 release.

I can confirm that this problem was fixed in release 6.3.0.

CPU load on release 6.2.12

CPU load on release 6.3.0
