High CPU usage after upgrade to 6.2.12

rlattuad · November 23, 2023, 10:19am

After upgrading to the latest release we noticed a growing CPU usage for searchd, after it reached 800% (100% utilization of 8 cores) we had to restart it but after a while the same happened. We are running MariaDB 10.3.39.
Our manticore configuration uses local distributed indices.

We started playing with the searchd configuration and after disabling “pseudo_sharding” the problem disappeared.

It seemed that pseudo_sharding was causing searchd to keep cpu-cores constantly busy running a connection to the MariaDB server and never releasing it.

Sorry this is not a great explanation, I am interested in investigating the issue further but need some pointers or suggestions for tests.

Any ideas ?
Roberto

Sergey · November 24, 2023, 2:00am

keep cpu-cores constantly busy running a connection to the MariaDB server and never releasing it.

Unlikely since indexer doesn’t care about pseudo_sharding

Any ideas ?

From 6.2.0 changelog:

Enabled multithreaded execution of queries containing secondary indexes, with the number of threads limited to the count of physical CPU cores. This should considerably improve the query execution speed.

pseudo_sharding has been adjusted to be limited to the number of free threads. This update considerably enhances the throughput performance.

The query optimizer has been enhanced to support full-text queries, significantly improving search efficiency and performance.

In short: Manticore 6.2.x can utilize CPU more intensively to make queries faster. Is it not the case in your case? Can you show the graphs showing that the CPU load increased a lot after the upgrade, but the response time didn’t change?

we had to restart it but after a while the same happened.

Does the CPU load graph look like a gradual increase after each restart?

rlattuad · November 24, 2023, 1:34pm

Need to correct some stuff I wrote:

the problem is only related to searchd
the busy connections are the searchd threads (there is no connection to the DB)

How does the allocation/release of threads work with pseudo-sharding ? is it possible that once a thread is allocated it is not released ? Maybe following some error condition on the communication channel (mysql protocol) ?

The CPU load is a gradual increase after each restart, sorry I do not know much more at this stage. We are trying to work out where to look, but disabling pseudo-sharding brings back cpu usage to old patterns.

Sergey · November 24, 2023, 3:12pm

How does the allocation/release of threads work with pseudo-sharding ?

It allocates as many workers as possible. If some workers are already busy it doesn’t use them.

Not by design. If only there’s a bug which causes that.

The CPU load is a gradual increase after each restart, sorry I do not know much more at this stage. We are trying to work out where to look, but disabling pseudo-sharding brings back cpu usage to old patterns.

If you don’t have performance graphs, can you please:

run dstat -at 60 in the background (separate terminal window)
restart Manticore
wait until the load is high
run show threads option format=all
run select * from @@system.threads
run show status
provide the outputs
provide the searchd log
provide the query log if possible

?

scalingsearch · November 30, 2023, 9:35am

Try setting
pseudo_sharding = 0

Reference link suggesting possible performance problems with default pseudo_sharding setting in
version 6.2.12:

github.com/manticoresoftware/manticoresearch

Upgrading manticore from 6.0.4_230314.1a3a4ea82-1.el8 to 6.2.12_230822.dc5144d35-1.el8 results in significant cpu usage increase

opened 02:35AM - 05 Nov 23 UTC

digirave

waiting

**Describe the bug** Upgrading manticore from 6.0.4_230314.1a3a4ea82-1.el8 to 6….2.12_230822.dc5144d35-1.el8 results in significant cpu usage increase **To Reproduce** Steps to reproduce the behavior: 1. Have many search queries using ngram searches using plain indexes with 20 shards on same local server 2. conf file has max_threads_per_query = 20 threads = 200 **Expected behavior** A clear and concise description of what you expected to happen. Similar performance is expected **Describe the environment:** Linux [REDACTED] 4.18.0-477.27.1.el8_8.x86_64 #1 SMP Thu Aug 31 10:29:22 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux # rpm -qa|grep manti manticore-repo-0.1.0-1.noarch manticore-buddy-1.0.18_23080408.2befdbe-1.noarch manticore-common-6.2.12_230822.dc5144d35-1.el8.noarch manticore-extra-0.7.8_23082209.810d7d3-1.noarch manticore-columnar-lib-2.2.4_230822.5aec342-1.el8.x86_64 manticore-server-6.2.12_230822.dc5144d35-1.el8.x86_64 manticore-server-6.0.2_230210.89c7a5139-1.el8.x86_64 manticore-6.2.12_230822.dc5144d35-1.el8.x86_64 manticore-icudata-5.0.3_221123.d2d9e5e56-1.el7.centos.noarch manticore-executor-0.7.8_23082210.810d7d3-1.x86_64 manticore-server-6.0.4_230314.1a3a4ea82-1.el8.x86_64 manticore-server-core-6.2.12_230822.dc5144d35-1.el8.x86_64 manticore-devel-6.2.12_230822.dc5144d35-1.el8.noarch manticore-tools-6.2.12_230822.dc5144d35-1.el8.x86_64 manticore-backup-1.0.8_23080408.f7638f9-1.noarch **Messages from log files:** Messages from searchd.log and query.log (if applicable). **Additional context** We have two servers with exact hardware and exact same setup with manticore. They create indexes independently of each but have the same setup. They have queries sent to them in the same amount. ServerA was upgraded from manticore server 6.0.4_230314.1a3a4ea82-1.el8 to 6.2.12_230822.dc5144d35-1.el8 during the black lines. ServerB is manticore-server-6.0.4_230314.1a3a4ea82-1.el8.x86_64 in the whole graph You can easily see ServerA has increased CPU load. Query response time seem overall similar. One difference between the servers is that ServerA was booted while ServerB has same rpm's for everything other than manticore but was not booted. serverA: ![serverA](https://github.com/manticoresoftware/manticoresearch/assets/6678253/42a67f2a-9d11-4d77-9ebc-ee42e4ed2152) serverB: ![serverB](https://github.com/manticoresoftware/manticoresearch/assets/6678253/2a5cd518-d12a-43b6-85cc-cbad2f655db7)

rlattuad · November 30, 2023, 9:54am

This is exactly what I did to fix the issue, setting pseudo_sharding = 0 fixed the CPU problem without any noticeable query response degradation. We also set a limit on the number a threads that can be used. There is obviously something different between 6.0 (our previous version) and 6.2 (when we started seeing the problem). I will collect some data and share hoping it will provide some clues as to what is happening.

scalingsearch · November 30, 2023, 10:39am

We also set a limit on the number a threads that can be used.

Of note, we are also doing this.

Sergey · November 30, 2023, 10:58am

We’ll appreciate if anyone tells us how exactly we can reproduce it, e.g. provide their table files, query log and config by sending them to our write-only S3 - Manticore Search Manual: Reporting bugs

I’m afraid without that we won’t be able to fix it since our tests didn’t show any overload/performance issues.

WDA · December 9, 2023, 10:42am

Arrived here after googling for my new found problem after a new installation, which seems to be the same as described by others. Here is the output of top, on an otherwise not busy server, seems like manticore went through the roof at some point and doesnt get off there. Trying to restart manticore hangs. Will report via bugs when i get a bit of time with more info. thanks.

top - 10:27:35 up 1 day, 14:22, 1 user, load average: 7.10, 7.18, 7.14
Tasks: 207 total, 1 running, 206 sleeping, 0 stopped, 0 zombie
%Cpu(s): 87.4 us, 0.1 sy, 0.0 ni, 12.4 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
MiB Mem : 15610.5 total, 5423.8 free, 5120.4 used, 5066.3 buff/cache
MiB Swap: 4096.0 total, 2191.7 free, 1904.2 used. 10143.6avail Mem

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
882 mantico+  20   0 1661664 381492 265008 S 699.3   2.4  10996:51 searchd

1010 mysql 20 0 3479460 1.7g 19228 S 0.7 10.9 64:02.00 mysqld
797 root 20 0 8095304 300040 14440 S 0.3 1.9 2:12.51 java
177615 root 20 0 0 0 0 I 0.3 0.0 0:00.15 kworker/u16:1-events_power_efficient
178107 root 20 0 10496 3968 3360 R 0.3 0.0 0:00.01 top

This are the last lines before trying to restart manticore, query log doesnt show anything abnormal:

[Sat Dec 9 10:24:17.323 2023] [889] rotating table ‘spc_304001001_delta’: success
[Sat Dec 9 10:24:17.323 2023] [889] rotating table: all tables done
[Sat Dec 9 10:27:41.162 2023] [882] caught SIGTERM, shutting down
[Sat Dec 9 10:27:44.174 2023] [882] WARNING: still 5 alive tasks during shutdown, after 3.008 sec
[Sat Dec 9 10:30:41.574 2023] [178267] watchdog: main process 178268 forked ok
[Sat Dec 9 10:30:41.577 2023] [178268] FATAL: failed to lock ‘/var/lib/manticore//binlog.lock’: 11 ‘Resource temporarily unavailable’
[Sat Dec 9 10:30:41.578 2023] [178267] watchdog: main process 178268 exited cleanly (exit code 1), shutting down

maxim · February 22, 2024, 6:52pm

I have the same problem. System load gradually increases up to 50 (limited by thread count I suppose) because some queries hang. These queries are batch or simple, but what they have in common is that a filter is used on mva attribute.
The very simple queries like these hangs.

SELECT id FROM post WHERE MATCH('(@(tags,user_screen_name,text,user_name) video)') AND host_id = 5 LIMIT 0, 1 OPTION max_matches = 1

What’s interesting is that different search terms hang with some mva attribute values and work fine with others, and all the time with the same values, not randomly.

It is also interesting that queries stop hanging when there are few free threads left in the system.

Also unusual is that when I run queries manually, one at a time, from the mysql client, they hang, but when I test with mysqlslap with concurrency value = 1, they work fine and hang at higher concurrency values.

The command I run to give a load on manticore. I ran it several times until all the threads were clogged. The initial 4-5 attempts did not yield any results, but subsequent attempts were successful.

mysqlslap --verbose --query=/query/hang.log --port=9306 --host=manticore --concurrency=2 --number-of-queries=20 --detach=1 --delimiter="\n"

Mysqlslap with these queries works only with concurrency value = 1, and stucks if concurrency > 1

SELECT id,host_id FROM post WHERE MATCH('(@(tags,user_screen_name,text,user_name) video)') and any(host_id) = 7 LIMIT 0, 3000 OPTION max_matches = 3000;
SELECT id,host_id FROM post WHERE MATCH('(@(tags,user_screen_name,text,user_name) video)') and any(host_id) = 3 LIMIT 0, 3000 OPTION max_matches = 3000;
SELECT id,host_id FROM post WHERE MATCH('(@(tags,user_screen_name,text,user_name) tiktok)') and any(host_id) = 3 LIMIT 0, 3000 OPTION max_matches = 3000;
SELECT id,host_id FROM post WHERE MATCH('(@(tags,user_screen_name,text,user_name) tiktok)') and any(host_id) = 9 LIMIT 0, 3000 OPTION max_matches = 3000;

Mysqlslap with these queries works with any concurrency value

SELECT id,host_id FROM post WHERE MATCH('(@(tags,user_screen_name,text,user_name) video)') and any(host_id) = 7 LIMIT 0, 3000 OPTION max_matches = 3000;
SELECT id,host_id FROM post WHERE MATCH('(@(tags,user_screen_name,text,user_name) tiktok)') and any(host_id) = 3 LIMIT 0, 3000 OPTION max_matches = 3000;

The query log shows that the first 25 requests are processed successfully, but then the system becomes unresponsive until most of the threads are stucked and Manticore starts functioning (but stucked threads provides high system LA).

Testing without pseudo sharding works up to concurrency values no more then 24. But it should be different issue because I can’t see hung queries in threads in this scenario.

Also I noticed that the environment variable searchd_pseudo_sharding=0 does not work for the Docker container, although some other variables do. However, this is also a separate issue.

I use Manticore 6.2.12 dc5144d35@230822 (columnar 2.2.4 5aec342@230822) (secondary 2.2.4 5aec342@230822) on docker swarm.

I’ve attached the logs and some debug output at the link, but it would probably be better to create an issue on github.

rlattuad · February 23, 2024, 11:13am

I can confirm that our queries also use attributes (mva)

Sergey · February 23, 2024, 11:25am

@maxim can you check if it works fine in the latest dev version? Manticore Search Manual

maxim · February 23, 2024, 2:57pm

I confirm that the problem with clogged threads does not exist in dev-6.2.13-27c3259.

I also confirm that the problem with the searchd_pseudo_sharding environment variable not working is fixed in dev-6.2.13-27c3259.

In general, manticore works fine when tested with mysqlslap with the following parameters: --concurrency=40 --number-of-queries=2000. If I increase the concurrency or the number of queries, only a fraction of the queries will work (86-92% of the queries are executed and then they stop executing). But there is no problem when queries hang in threads. Probably some buffer overflow is happening.

Thank you for your product

rlattuad · June 5, 2024, 10:32am

Did the fix in dev-6.2.13-27c3259 make it to 6.3 ?
Anyone has details of the bug/fix/patch ?

Thanks

tomat · June 5, 2024, 11:28am

according to the commit log Bump backup version to: 1.3.5-24022217-d6cd26d · manticoresoftware/manticoresearch@27c3259 · GitHub is in the 6.3.0 release

Andrey3 · June 24, 2024, 7:38am

I can confirm that this problem was fixed in release 6.3.0.

CPU load on release 6.2.12

CPU load on release 6.3.0