Performance goes down after increasing CPU cores and memory

I can’t reproduce this:

For testing, I inserted about 450k rows of text into a table on a 4-core 4 GB VPS, then did a search test for its full text search performance and got a decent result of about 1700 queries/second.
However, when I moved exactly the same data to an 8-core 16 GB VPS, the search performance went down to 1200 queries/second.

with the data and queries you provided. Here’s my load command and the results:

8 cores - ~ 360 qps:

root@8cores-32gb:~# (while true; do shuf -n 1 test_query.log; done) | pv -l | parallel --round-robin -j24 --progress "mysql -P9306 -h 0 -e {} > /dev/null"

...

local:24/25512/100%/0.0s 25.5k 0:01:10 [ 362 /s]

4 cores - ~185 qps:

root@4cores-16gb:~# (while true; do shuf -n 1 test_query.log; done) | pv -l | parallel --round-robin -j8 --progress "mysql -P9306 -h 0 -e {} > /dev/null"

...

local:8/35630/100%/0.0s 35.6k 0:03:01 [ 184 /s]

I changed nothing in your:

  • config
  • data
  • queries

I used VPSes from Hetzner with dedicated CPUs.

The OS I used was Debian 11.

Manticore version:

root@8cores-32gb:~# searchd -v
Manticore 6.3.9 dd29aca47@24120423 dev (columnar 2.3.1 edadc69@24112219) (secondary 2.3.1 edadc69@24112219) (knn 2.3.1 edadc69@24112219)

You mentioned “for its full text search performance” while the queries you provided are not full-text:

root@8cores-32gb:~# grep -i match test_query.log
root@8cores-32gb:~#

Perhaps if you provide different queries I can reproduce your issue.

What I mentioned in the first post was full-text search; it had the same issue as the KNN search in the query log I provided, but full-text search is much faster than KNN, so the difference is not as noticeable there.

In your test, are the clients and the server running on the same machine? Perhaps we need to separate them.

Getting only 362 qps for my KNN queries on 8 cores / 32 GB is far behind the expected result, because in my test on a 4-core 4 GB VPS I can get over 800 qps when the CPUs are fully utilized.

I will check it again later

Yes

Getting only 362 qps for my KNN queries on 8 cores / 32 GB is far behind the expected result, because in my test on a 4-core 4 GB VPS I can get over 800 qps when the CPUs are fully utilized.

Maybe:

  • your CPUs run at a higher frequency,
  • or the loading script should be moved to another server,
  • or it needs optimization to reduce overhead,
  • or the test should be run with a different level of concurrency (see the sketch after this list).
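
On the last point, a rough way to compare concurrency levels is to send a fixed batch of queries per level and time each run. A sketch only, mirroring the pipeline above and assuming the same test_query.log, searchd listening on port 9306, and GNU parallel installed:

for j in 4 8 12 16 24; do
  # run a fixed batch of random queries at this concurrency level and time it
  echo "== concurrency $j =="
  time ( for i in $(seq 5000); do shuf -n 1 test_query.log; done \
         | parallel --round-robin -j"$j" "mysql -P9306 -h 0 -e {} > /dev/null" )
done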

My focus wasn’t on the absolute numbers but rather on the difference in throughput between the two VPSes.

I will check it again later

Great! You should be able to fully replicate my tests using the exact same VPSs. I believe I’ve shared all the necessary details.

Stressing the Manticore server this way, the client reconnects to the server every time it executes a line from query.log, and the server CPU cannot get loaded beyond 50%, so now I know why my test results are different.
I need to do some more experiments to pin this down. Persistent connections lead to higher qps,
and maybe they take advantage of the query cache? So perhaps part of the performance boost comes from the query cache?
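
One way to check whether the query cache is actually involved would be to watch its counters before and after a test run. A sketch, assuming the standard qcache_* counters are exposed via SHOW STATUS in this version (and note that only queries slower than the qcache_thresh_msec threshold get cached at all):

# qcache_hits staying at zero between runs would mean the boost is not coming from the query cache
mysql -P9306 -h 0 -e "SHOW STATUS LIKE 'qcache%'"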

After some more tests, this issue definitely still exists. Maybe what I encountered cannot be described as a single thing, so let me address one part first:

A tricky way to improve query performance (maybe a bug):

Server: a four-core VPS with 4 GB RAM running on Proxmox.
Table: a text field and 768-dimension vector fields, 450k rows (which can be restored from manticore/write-only/issue-cpu-load-bug/backup.zip).
Test queries: 8 MySQL clients keep hitting the server, each randomly choosing one of the 24 queries to run a KNN search (the query log is in manticore/write-only/issue-cpu-load-bug/test_query.log).

Test result before the trick is applied (about 280 qps):

root@debian12:~# (while true; do shuf -n 1 test_query.log; done) | pv -l | parallel --round-robin -j12 --progress "mysql -P9306 -h 192.168.1.31 -e {} > /dev/null"

local:4/9123/100%/0.0s 9.13k 0:00:31 [ 280 /s]

cpu usage:

%Cpu(s): 60.2 us,  2.8 sy,  0.0 ni, 35.6 id,  0.0 wa,  0.0 hi,  1.4 si,  0.0 st

Now we apply the trick. Run this SQL command:

ALTER TABLE users optimize_cutoff='16';

Wait for the optimization to finish, then run this command:

ALTER TABLE users optimize_cutoff='8';

to change it back to the default disk chunk count. Then we run the query test again and get better performance (about 317 qps) with lower CPU load:

root@debian12:~# (while true; do shuf -n 1 test_query.log; done) | pv -l | parallel --round-robin -j12 --progress "mysql -P9306 -h 192.168.1.31 -e {} > /dev/null"

local:5/1531/100%/0.0s 1.54k 0:00:05 [ 317 /s]

cpu load:

%Cpu(s): 42.7 us,  3.0 sy,  0.0 ni, 53.2 id,  0.0 wa,  0.0 hi,  1.1 si,  0.0 st

This test result is reproducible; I can provide a video later. It only reflects the performance of a non-persistent-connection client; with persistent connections the performance difference is even bigger, nearly a 90% boost.

Please check the actual number of disk chunks using SHOW TABLE users STATUS. I don’t fully understand why ALTER TABLE users optimize_cutoff='16' would make any changes in your case. If you already had 8 disk chunks, it shouldn’t create 16 of them — it should do nothing unless it was merging from > 16 chunks to 16.

What might be happening is that during your first test, there was background merging in progress, and the disk chunk count was higher than 8 (and possibly even higher than 16), which could have reduced the QPS. After waiting for the optimization to finish and running the test again, the QPS might have improved because there was no merging load and the disk chunk count was lower.

To ensure accurate results in your tests, please pay attention to these two things:

  • Check if merging is running (look for SYSTEM OPTIMIZE in SHOW THREADS OPTION FORMAT=ALL).
  • Monitor the disk chunk count (SHOW TABLE users STATUS LIKE 'disk_chunks'), as in the quick checks below.
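
For example, both checks can be run from the shell before and during a test (a sketch, using the same port and table name as above; adjust the host to your setup):

# empty output means no background optimize/merge is currently running
mysql -P9306 -h 0 -e "SHOW THREADS OPTION format=all" | grep -i "SYSTEM OPTIMIZE"
# watch how disk_chunks changes between test runs
mysql -P9306 -h 0 -e "SHOW TABLE users STATUS LIKE 'disk_chunks'"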

That’s interesting. Usually, a persistent connection improves response time only slightly. If you can reproduce this with a simple example, I’d be very interested in looking into it further.

I’ve done quite a few tests and paid attention to the table status. Before I played that trick, the table had 8 disk chunks, which I could confirm in “show table users status”, and optimization was 0, so no background merging was happening.

I will add a video later

Now I’ve reproduced it with an easy example and added the reproduce_detail.avi video at manticore/write-only/issue-cpu-load-bug/reproduce_detail.avi.

Please check that video; I think this bug can be tracked down now.

Hi @zeven,

Thank you for sharing the recording. It’s really impressive, and it does seem like something unusual is happening.

The most surprising part was when you ran alter table ... optimize_cutoff='16'. It seemed to trigger some kind of merging process since it was using around one CPU core. But why would it do that if there were already 8 chunks? Could you check the searchd log to see what’s happening while alter table ... optimize_cutoff='16' is being processed?

Also, I didn’t quite understand how this connects to persistent connections. You mentioned it in the beginning of the recording, but didn’t provide an example. Or were you referring to mysqlslap using persistent connections and how that impacts performance compared to parallel + mysql, which I used?

Yes, something was running in the background after running that command, but I’m not sure whether it was merging, because the background process ran for only about one minute and then stopped.
After that there were still 8 chunks. I could confirm it by looking at the actual disk chunk files in “/var/lib/manticore/…”: the file count and file sizes were not changed.

As I said in the previous post, it’s not only persistent connections that benefit from the trick (bug); my test shows parallel + mysql also gets better performance, and server CPU usage drops.

I used mysqlslap for the showcase only because the client process does not reconnect to the server after each executed SQL line, so it doesn’t waste resources on socket I/O, which makes the difference in the test results more obvious.

@zeven can you please share the searchd log to check yourself what’s going on there at the moment you do alter table ... optimize_cutoff='16'?

here it is:

[Tue Dec 10 12:44:25.952 2024] [468] prereading 2 tables
[Tue Dec 10 12:44:25.975 2024] [468] preread 2 tables in 0.023 sec
[Tue Dec 10 12:44:26.017 2024] [466] accepting connections
[Tue Dec 10 12:44:26.160 2024] [470] [BUDDY] started v3.0.1 '/usr/share/manticore/modules/manticore-buddy/bin/manticore-buddy --listen=http://127.0.0.1:9305 --bind=127.0.0.1 --disable-telemetry --threads=4' at http://127.0.0.1:39479
[Tue Dec 10 12:44:26.160 2024] [470] [BUDDY] Loaded plugins:
[Tue Dec 10 12:44:26.160 2024] [470] [BUDDY]   core: empty-string, backup, emulate-elastic, fuzzy, create, drop, insert, alias, select, show, cli-table, plugin, test, alter-column, alter-distributed-table, alter-rename-table, modify-table, knn, replace, queue, sharding, update, autocomplete
[Tue Dec 10 12:44:26.160 2024] [470] [BUDDY]   local: 
[Tue Dec 10 12:44:26.160 2024] [470] [BUDDY]   extra: 
[Tue Dec 10 13:21:53.311 2024] [469] rt: table users: diskchunk 94(9), segments 26 forcibly saved in 17.189663 (17.191345) sec, RAM saved/new 112844514/0 ratio 0.950000 (soft limit 127506841, conf limit 134217728)
[Tue Dec 10 13:23:55.968 2024] [469] rt: table users: merged chunks /var/lib/manticore/users/users.74 and /var/lib/manticore/users/users.94 to /var/lib/manticore/users/users.95 in 2m (progressive mode). Remaining chunk count: 8
[Tue Dec 10 13:23:57.671 2024] [469] rt: table users: optimized progressive chunk(s) 9 ( left 8 ) in 2m 4.4s

But strangely, before I ran that alter table ... optimize_cutoff='16', I had 8 disk chunks.
When the command was executed, a new disk chunk (94) appeared and was then merged with the first disk chunk (74) into a new disk chunk (95).

Before running the command:

"chunk_names":[74,77,80,83,86,89,92,93],

After running the command:

"chunk_names":[77,80,83,86,89,92,93,95],

But the total file size stayed the same.

@zeven I’ve reproduced your test results locally. The reason it’s faster after applying the trick is clear: when you run alter table ... optimize_cutoff='16', it flushes the RAM chunk. You can see this in the log as well:

[Tue Dec 10 13:21:53.311 2024] [469] rt: table users: diskchunk 94(9), segments 26 forcibly saved in 17.189663 (17.191345) sec, RAM saved/new 112844514/0 ratio 0.950000 (soft limit 127506841, conf limit 134217728)

This flushing, not chunk merging, is what makes the difference. For example:

snikolaev@dev2:~/issue-cpu-load-bug$ mysql -P29315 -h0 -e "show table users status like '%chunk%'"
+--------------------------+-----------+
| Variable_name            | Value     |
+--------------------------+-----------+
| ram_chunk                | 112856370 |
| ram_chunk_segments_count | 26        |
| disk_chunks              | 8         |
+--------------------------+-----------+

snikolaev@dev2:~/issue-cpu-load-bug$ mysqlslap -h0 -P29315 -c 6 --create-schema=users --query="test_query.log" --number-of-queries=1000 -i 3
Benchmark
	Average number of seconds to run all queries: 2.144 seconds
	Minimum number of seconds to run all queries: 2.130 seconds
	Maximum number of seconds to run all queries: 2.157 seconds
	Number of clients running queries: 6
	Average number of queries per client: 166

snikolaev@dev2:~/issue-cpu-load-bug$ mysql -P29315 -h0 -e "flush ramchunk users"

snikolaev@dev2:~/issue-cpu-load-bug$ mysql -P29315 -h0 -e "show table users status like '%chunk%'"
+--------------------------+-------+
| Variable_name            | Value |
+--------------------------+-------+
| ram_chunk                | 0     |
| ram_chunk_segments_count | 0     |
| disk_chunks              | 9     |
+--------------------------+-------+

snikolaev@dev2:~/issue-cpu-load-bug$ mysqlslap -h0 -P29315 -c 6 --create-schema=users --query="test_query.log" --number-of-queries=1000 -i 3
Benchmark
	Average number of seconds to run all queries: 1.551 seconds
	Minimum number of seconds to run all queries: 1.544 seconds
	Maximum number of seconds to run all queries: 1.558 seconds
	Number of clients running queries: 6
	Average number of queries per client: 166

As you can see, after flushing, the RAM chunk becomes empty, the disk chunk count increases by one, and the average response time improves significantly. This happens because the RAM chunk can sometimes underperform. There’s a related issue on GitHub: “how to search fast with ramchunk (rt table) in High-Frequency Write Systems” (manticoresoftware/manticoresearch#2787). However, since that issue has its own specifics, feel free to open a new one. This might motivate the core team to address it sooner.

Thank you for the explanation. However, as you said, it’s only temporary; right after flushing, a merge happened.

So, after running alter table ... optimize_cutoff='16', there were still 8 disk chunks, but does the RAM chunk remain empty even after some new records are inserted?

So, as of now, how can I get around this RAM chunk low-performance issue? Is there any way to disable using the RAM chunk?

It’s only temporary; right after flushing, a merge happened.

Of course, if your chunk count is exceeded, auto-merging starts.

So, after running alter table ... optimize_cutoff='16', there were still 8 disk chunks, but does the RAM chunk remain empty even after some new records are inserted?

No, if some records were inserted it shouldn’t have been empty.

So, as of now, how can I get around this RAM chunk low-performance issue? Is there any way to disable using the RAM chunk?

No way to disable it, but you can experiment with its size (rt_mem_limit), the performance degradation it causes and the merging overhead. Increasing optimize_cutoff may also make sense to lower the merging overhead.
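
For example, something along these lines could be worth experimenting with. A sketch only: the values are illustrative, and I’m assuming rt_mem_limit can be changed via ALTER TABLE the same way optimize_cutoff was changed above:

# a smaller RAM chunk gets flushed to a disk chunk sooner
mysql -P9306 -h 0 -e "ALTER TABLE users rt_mem_limit='64M'"
# allow more disk chunks before auto-merging kicks in, to reduce merging overhead
mysql -P9306 -h 0 -e "ALTER TABLE users optimize_cutoff='12'"
# verify the effect on ram_chunk / disk_chunks
mysql -P9306 -h 0 -e "SHOW TABLE users STATUS LIKE '%chunk%'"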