Hey everyone,
I have been working with Manticore Search for a while and wanted to discuss some best practices for optimizing full-text search performance. I am handling a dataset with millions of records, and while the default settings work well, I am looking for ways to improve query speed and efficiency.
Here are some areas I am focusing on:
Index Optimization: Besides real-time indexes, would it be better to use plain indexes for heavy read workloads?
Query Performance: How effective are ranker options like proximity_bm25 vs expr for large datasets?
Sharding & Replication: At what point does horizontal scaling become necessary for high-concurrency workloads?
Best Storage Engine: Would using columnar over row-based indexing make a significant difference in certain scenarios?
If you have implemented any of these optimizations or have insights, I would love to hear about your experience!
Let’s make Manticore even more efficient.
Looking forward to your thoughts!
With Regards,
DerekFlutter
Index Optimization:
In general, if you call OPTIMIZE on an RT index, it should offer performance comparable to a plain index. OPTIMIZE merges disk chunks, so you effectively end up with large chunks comparable to building the index with indexer.
(On the flip side, RT indexes give you automatic "sharding" for free via disk chunks, which can help with performance; with plain indexes you have to shard manually.)
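For reference, merging disk chunks is a single statement. A minimal sketch, assuming a recent Manticore version and a hypothetical RT table named `products`:

```sql
-- Merge the disk chunks of the (hypothetical) RT table 'products'.
-- OPTION cutoff limits how far merging goes (down to N chunks).
OPTIMIZE TABLE products OPTION cutoff=4;
```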
Query Performance:
Hard to answer in a generic sense, as it will depend a lot on your requirements. E.g. with expr you might be able to use a "simple" expression that is cheaper (hence quicker) than the built-in rankers.
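To illustrate the comparison, here is a sketch against a hypothetical `products` table. The expression shown is the documented equivalent of the built-in proximity_bm25 ranker; dropping terms you don't need is where an expression ranker can win:

```sql
-- Built-in ranker:
SELECT id FROM products WHERE MATCH('laptop')
OPTION ranker=proximity_bm25;

-- Same formula as an expression ranker; trimming it down
-- (e.g. to just 'bm25') can be cheaper if that's all you need:
SELECT id FROM products WHERE MATCH('laptop')
OPTION ranker=expr('sum(lcs*user_weight)*1000+bm25');
```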
Sharding & Replication:
Again, hard to answer in a generic sense, as it will vary a lot depending on many factors. A lot will depend on how "selective" your queries are (i.e. very specific queries that return a small result set, vs. queries that return lots of results, with lots of data to sort): some workloads are better to shard, others may benefit more from replication. Any experience I have with this is unlikely to transfer directly to a different workload.
… although I would say sharding is particularly effective if you can pre-filter, such that you don't need to query all shards (e.g. split the index into shards such that many queries only need one shard, even if some queries still need all shards).
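For what it's worth, a manually sharded setup is just local indexes plus a distributed index on top. A minimal plain-config sketch (index names and paths are hypothetical):

```
index products_0
{
    type = plain
    path = /var/lib/manticore/products_0
    # source = ... (one source per shard, e.g. filtered by id ranges)
}

index products_1
{
    type = plain
    path = /var/lib/manticore/products_1
}

# Query this one; it fans out to the local shards.
index products_dist
{
    type  = distributed
    local = products_0
    local = products_1
}
```

If you can pre-filter, query the individual shard directly instead of products_dist.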
Best Storage Engine:
I don't actually have any experience with columnar, but as I understand it, it only really helps (performance-wise) in very specific circumstances, particularly "full table scans" where the column is part of the filtering, ordering or grouping (i.e. the initial part of the query can be done with just that one column). There are also some more niche features that rely on it, like KNN.
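For completeness, switching engines is a per-table setting. A sketch, assuming the Manticore Columnar Library is installed and using a hypothetical table:

```sql
-- Hypothetical table using columnar storage for its attributes.
CREATE TABLE products (title TEXT, price FLOAT) engine='columnar';
```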
By the way, one thing not mentioned: benchmarking is typically one of the best ways to find good settings.
What I found really helps is to create two key scripts. The first is one that can generate a config file for a sharded index, with an arbitrary number of shards (and create the distributed index to query them!). You can then use indexer to recreate the indexes on demand, and test many different scenarios quickly.
(Even if you actually use RT indexes in "production", indexer offers a very simple, quick way to recreate indexes with various splits. You can create plain indexes which mimic an RT index very closely. From a query perspective, an auto-sharded RT index behaves very similarly to a similarly sharded plain index, if you just ignore the RAM chunk, which is often negligible in the grand scheme.)
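The config-generator script can be very small. A sketch of the idea in Python; the source query, table names, and paths are hypothetical, and the modulo split is just one way to partition by id:

```python
# Generate a Manticore plain config with N shards plus a distributed
# index that fans out to all of them. Names/paths are placeholders.

SHARD_TPL = """
source src_products_{i}
{{
    type      = mysql
    sql_query = SELECT id, title FROM products WHERE id % {n} = {i}
}}

index products_{i}
{{
    source = src_products_{i}
    path   = /var/lib/manticore/products_{i}
}}
"""

def make_config(n_shards: int) -> str:
    """Return a config string defining n_shards plain indexes
    and one distributed index covering them all."""
    parts = [SHARD_TPL.format(i=i, n=n_shards) for i in range(n_shards)]
    locals_ = "\n".join(f"    local = products_{i}" for i in range(n_shards))
    parts.append(
        "index products_dist\n{\n    type = distributed\n" + locals_ + "\n}\n"
    )
    return "\n".join(parts)

if __name__ == "__main__":
    print(make_config(4))
```

Regenerate with a different shard count, rerun indexer, and you have a new scenario to benchmark in seconds.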
The second script is one that can "replay" a query log and compare the results, i.e. rerun the same queries, time them, and see the difference from a "baseline". The query log used could be one captured from production.
… The only tricky bit is that you may want to mimic concurrent queries (i.e. you don't want to just replay queries one by one). An ideal system would be multithreaded, and could perhaps look at timestamps in the log to mimic the real concurrency.
… To be honest, I just took a simpler approach and "sharded" the query log: figure out a realistic concurrency level, e.g. 4, then run 4 processes in parallel and consolidate the results.
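The simple "sharded replay" approach can be sketched like this. `run_query` is a stub: in practice it would send the query to Manticore (e.g. over its MySQL-protocol port, typically 9306); everything here is an illustration, not a drop-in tool:

```python
# Replay a captured query log across N workers and collect timings.
import time
from concurrent.futures import ThreadPoolExecutor

def run_query(query: str) -> float:
    """Run one query and return its wall-clock time in seconds.
    Stub: replace the body with a real call to searchd."""
    start = time.perf_counter()
    # ... send `query` to Manticore here ...
    return time.perf_counter() - start

def replay(queries: list, concurrency: int = 4) -> dict:
    """Split the log across `concurrency` workers; return summary stats
    to compare against a baseline run."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        timings = list(pool.map(run_query, queries))
    return {
        "queries": len(timings),
        "total_s": sum(timings),
        "max_s": max(timings) if timings else 0.0,
    }
```

Run it once against the current config as the baseline, then again after each change, and diff the summaries.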