Manticore or ElasticSearch >> Decision info needed please?

p8092040 · April 29, 2020, 8:13am

Hello,

I am architecting/designing a new ecommerce project that will be a multi-tenant marketplace (multiple merchants and product catalogs). I am making a decision between Manticore and Elasticsearch (on my own cloud servers). I would very much appreciate your responses and experiences – with one or both of these search engines. Please answer one or more of my questions below. Thanks.

Project info: Initially (Initial_Deployment), there will be 20 web stores, each having between 200 and 20,000 products with up to 2,000 characters of text describing the product in fields such as MerchantCode, ProductCode, ProductName, ProductCategories (list), short description, long description, price and fields for association with facets. Facets will be used heavily in the product search pages. For the Initial_Deployment we expect 40 MB of text representing the initial products. Eventually, we hope this marketplace will grow to hundreds and thousands of merchants. For the sake of this post, let's say the Eventual_Deployment will have 40 GB of text that must be indexed.

#1. Intel CPU Resources on typical cloud based Virtual Private Server (“VPS”) – Given the Initial_Deployment and the Eventual_Deployment scenarios, (a) how many Initial CPU(s); and (b) Eventual CPU(s) will be required for sub 0.5 second responses for product searches using multiple keywords and 5 facets?

#2. Number of Servers (VPS(s)) Required – I am assuming ONE server that’s capable of having up to 16 CPUs can handle the Initial and Eventual workloads, yes?

#3. RAM for the VPS server – What is the (a) minimum recommended RAM? and the (b) ideal RAM required for each of the Initial_Deployment and the Eventual_Deployment scenarios? If more than one server is required, please adjust your answer accordingly.

#4. SSD Disk Space – Given the (a) Initial_Deployment of 40 MB text to index; and (b) Eventual_Deployment of 40 GB of text to index, how much disk space will be needed for the Search Engine?

#5. Recommended Server Configurations — I favour Ubuntu 18.x operating systems. Given the above info and questions what server specs are recommended for:
(a) Initial_Deployment. CPUs: __, RAM: __, SSD Disk Space: __ GB
(b) Eventual_Deployment. CPUs: __, RAM: __, SSD Disk Space: __ GB

#6. I believe that Manticore and Elasticsearch are both capable of quickly updating indexes as new or modified text is available for indexing. For instance, as merchants update product info, prices, descriptions… the search engine will need to immediately index the new/changed text such that customer searches give results with the new text.
(a) Can such quick (partial?) indexing be done efficiently (without rebuilding the entire index)?
(b) How quick are such index updates?
(c) How much load is put on the server?

#7. One of my main concerns about Manticore, when compared to Elasticsearch, is that Manticore index file sizes seem to be at least as big as the text that is being indexed. My experience with other databases typically has index sizes at 50% or less of the data size. I read that Elasticsearch index sizes are about 30% to 40% of the text being indexed.
(a) Please confirm index file sizes for optimal search performance?
(b) How much of the index(s) must be in RAM, if SSD drives are used?
(c) What kind of tuning is available? and what impact on performance?

#8. Are there any (a) must-have features; and (b) should-have features missing from Manticore for an ecommerce marketplace – that you are aware of ?

#9. It seems that there are a lot of resources (books, consultants, programmers, example code… ) available for Elasticsearch. My concern is about my learning curve to design and implement the search engine myself; and my ability to find subcontractors if/when I need help.
(a) Are there sufficient learning resources available for Manticore for (i) an entry-level programmer?; (ii) an experienced systems analyst and programmer who is competent in C#?; (iii) a competent PHP programmer?
(b) Are there available programmers that are competent with Manticore?

#10. I saw a C# interface from a 3rd Party for Manticore. Does anybody have any experience with it? Is it a complete implementation? Recommendations?

#11. What are the scenarios where Manticore is NOT the best choice, where Elasticsearch might be a better choice?

Thank you for your responses to one or more of my questions.

Regards, Peter

Sergey · April 29, 2020, 11:38am

@p8092040 Hi

Disclaimer: I’m a member of the Manticore core team.

Please find my comments below.

#1. Intel CPU Resources on typical cloud based Virtual Private Server (“VPS”) – Given the Initial_Deployment and the Eventual_Deployment scenarios, (a) how many Initial CPU(s); and (b) Eventual CPU(s) will be required for sub 0.5 second responses for product searches using multiple keywords and 5 facets?

I ran a test on an index whose raw data is about 40G (37G in my case). Some info about it:

Schema:

  rt_attr_uint = int1
  rt_attr_uint = int2:8
  rt_attr_bigint = bigint1
  rt_attr_uint = int3:8
  rt_attr_uint = int4:8
  rt_attr_uint = int5:31
  rt_attr_uint = int6:1
  rt_attr_uint = int7
  rt_attr_uint = int8
  rt_attr_uint = int9:3
  rt_attr_uint = int10:1
  rt_attr_uint = int11:22
  rt_attr_uint = int12:24
  rt_attr_uint = int13:3
  rt_attr_uint = int14:1
  rt_attr_timestamp = timestamp1
  rt_attr_timestamp = timestamp2
  rt_attr_timestamp = timestamp3
  rt_attr_multi = multi1
  rt_field = text1
  rt_field = text2
  rt_field = text3
  rt_field = text4
  rt_field = text5
  rt_field = text6
  rt_field = text7

Sizes without docstore (37G of raw data becomes 7.26G of indexed data, of which 2.2G has to be kept in RAM):

root@perf /perf/test_brse02 # ls -lah 3m.xml
-rw-r--r-- 1 snikolaev snikolaev 37G Feb 25 20:25 3m.xml

root@perf /perf/test_brse02 # ls -lah v3/3m.*
-rw-r--r-- 1 root root 1.6G Feb 26 16:28 v3/3m.spa
-rw-r--r-- 1 root root 146M Feb 26 16:28 v3/3m.spb
-rw-r--r-- 1 root root 4.0G Feb 26 16:33 v3/3m.spd
-rw-r--r-- 1 root root 113M Feb 26 16:33 v3/3m.spe
-rw-r--r-- 1 root root  46K Feb 26 16:33 v3/3m.sph
-rw-r--r-- 1 root root  48K Feb 26 16:28 v3/3m.sphi
-rw-r--r-- 1 root root 381M Feb 26 16:33 v3/3m.spi
-rw-r--r-- 1 root root 2.1M Feb 26 16:28 v3/3m.spk
-rw-r--r-- 1 root root 3.1M Feb 26 16:28 v3/3m.spm
-rw-r--r-- 1 root root 983M Feb 26 16:33 v3/3m.spp
-rw-r--r-- 1 root root 151M Feb 26 16:28 v3/3m.spt

Then I tested the performance of queries like this against it:

mix 1000x select * from idx where match('{d(100,200)} ({d(1000,2000)} | {d(1000,2000)}) -{d(100,200)} -{d(100,200)}') facet author_id facet table_group facet country_id facet isthread facet mod_is

{d(x,y)} means the keyword is taken from positions x to y in the list of most frequent keywords. A few examples:

select * from idx where match('sure (screen | down.) -xfr482240612 -them') facet author_id facet table_group facet country_id facet isthread facet mod_is
select * from idx where match('only (banned | before.) -made -said') facet author_id facet table_group facet country_id facet isthread facet mod_is
select * from idx where match('better (drink | perfect) -see -still') facet author_id facet table_group facet country_id facet isthread facet mod_is
select * from idx where match('us (write | wow) -very -xst814616') facet author_id facet table_group facet country_id facet isthread facet mod_is
select * from idx where match('5a445d710ae24cd276062b0c84850838 (xfr480275246 | reasons) -even -it.') facet author_id facet table_group facet country_id facet isthread facet mod_is
select * from idx where match('never (pull | же) -4 -60d2f4fe0275d790764f40abc6734499') facet author_id facet table_group facet country_id facet isthread facet mod_is
select * from idx where match('xst814616 (except | songs) -take -xst670102704') facet author_id facet table_group facet country_id facet isthread facet mod_is
select * from idx where match('o (mods | hasn) -were -being') facet author_id facet table_group facet country_id facet isthread facet mod_is
select * from idx where match('4 (hotels | opinions) -could -60d2f4fe0275d790764f40abc6734499') facet author_id facet table_group facet country_id facet isthread facet mod_is
select * from idx where match('did (reality | 00) -even -back') facet author_id facet table_group facet country_id facet isthread facet mod_is
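
The {d(x,y)} placeholder can be read as "pick a keyword whose frequency rank lies between x and y". A minimal Python sketch of that expansion follows. It assumes a hypothetical `keywords_by_freq` list sorted from most to least frequent and treats the bounds as 1-based and inclusive; the stress tester's real implementation is not shown in this thread and may differ.

```python
import random
import re

# Matches placeholders such as {d(100,200)}
PLACEHOLDER = re.compile(r"\{d\((\d+),\s*(\d+)\)\}")

def expand_template(template, keywords_by_freq, rng=random):
    """Replace each {d(x,y)} with a keyword ranked x..y (1-based, inclusive; assumed)."""
    def pick(match):
        lo, hi = int(match.group(1)), int(match.group(2))
        return rng.choice(keywords_by_freq[lo - 1:hi])
    return PLACEHOLDER.sub(pick, template)

# Hypothetical vocabulary: "w1" is the most frequent keyword, "w2" the next, and so on.
vocab = [f"w{i}" for i in range(1, 3001)]
template = "{d(100,200)} ({d(1000,2000)} | {d(1000,2000)}) -{d(100,200)} -{d(100,200)}"
query = expand_template(template, vocab, random.Random(42))
print(f"select * from idx where match('{query}')")
```

Each placeholder is drawn independently, so the two {d(1000,2000)} slots in the template usually end up with different keywords.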

I tested with concurrency 1 and 8. The results are:

Concurrency 1:

root@perf /perf/test_brse02 # ../tools/stress_tester/test.php -c=1 --plugin=plugin_test.php --data=/tmp/selects.sql
Time elapsed: 0 sec, throughput (curr / from start): 0 / 0 rps, 0 children running, 1000 elements left
Time elapsed: 1.001 sec, throughput (curr / from start): 10 / 10 rps, 1 children running, 988 elements left
Time elapsed: 2.002 sec, throughput (curr / from start): 12 / 11 rps, 1 children running, 975 elements left
...
Time elapsed: 72.045 sec, throughput (curr / from start): 11 / 13 rps, 1 children running, 30 elements left
Time elapsed: 73.045 sec, throughput (curr / from start): 13 / 13 rps, 1 children running, 16 elements left
Time elapsed: 74.046 sec, throughput (curr / from start): 14 / 13 rps, 1 children running, 1 elements left

FINISHED. Total time: 74.151 sec, throughput: 13 rps
Latency stats:
	count: 1000 latencies analyzed
	avg: 73.635 ms
	median: 72.908 ms
	95p: 106.697 ms
	99p: 165.452 ms

Concurrency 8:

root@perf /perf/test_brse02 # ../tools/stress_tester/test.php -c=8 --plugin=plugin_test.php --data=/tmp/selects.sql
Time elapsed: 0 sec, throughput (curr / from start): 0 / 0 rps, 0 children running, 1000 elements left
Time elapsed: 1.001 sec, throughput (curr / from start): 64 / 64 rps, 8 children running, 926 elements left
Time elapsed: 2.001 sec, throughput (curr / from start): 82 / 73 rps, 8 children running, 844 elements left
Time elapsed: 3.002 sec, throughput (curr / from start): 88 / 78 rps, 8 children running, 755 elements left
Time elapsed: 4.003 sec, throughput (curr / from start): 79 / 79 rps, 8 children running, 675 elements left
Time elapsed: 5.012 sec, throughput (curr / from start): 82 / 79 rps, 8 children running, 592 elements left
Time elapsed: 6.012 sec, throughput (curr / from start): 85 / 80 rps, 8 children running, 505 elements left
Time elapsed: 7.013 sec, throughput (curr / from start): 94 / 82 rps, 8 children running, 411 elements left
Time elapsed: 8.013 sec, throughput (curr / from start): 86 / 83 rps, 8 children running, 324 elements left
Time elapsed: 9.014 sec, throughput (curr / from start): 80 / 83 rps, 8 children running, 243 elements left
Time elapsed: 10.014 sec, throughput (curr / from start): 84 / 83 rps, 8 children running, 158 elements left
Time elapsed: 11.015 sec, throughput (curr / from start): 89 / 83 rps, 8 children running, 68 elements left

FINISHED. Total time: 11.887 sec, throughput: 84 rps
Latency stats:
	count: 1000 latencies analyzed
	avg: 93.085 ms
	median: 91.15 ms
	95p: 142.235 ms
	99p: 232.58 ms

Plugin's output:
	Count: 1000
	Results: 54048

Overall, 54048 rows were fetched during the test, i.e. the queries did find something: about 54 rows per query on average.
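
As a sanity check, the reported throughput is consistent with the latencies: with N clients that each wait for a response before sending the next query, throughput is roughly N divided by the average latency (Little's law). A small Python sketch using only the numbers printed above; it ignores any client-side overhead in the stress tester, so the estimate comes out slightly higher than the measured value.

```python
def expected_rps(concurrency, avg_latency_ms):
    """Closed-loop throughput estimate: clients / average latency in seconds."""
    return concurrency / (avg_latency_ms / 1000.0)

# (concurrency, average latency in ms, measured rps) taken from the two runs above
runs = [(1, 73.635, 13), (8, 93.085, 84)]
for concurrency, avg_ms, measured in runs:
    estimate = expected_rps(concurrency, avg_ms)
    print(f"c={concurrency}: estimated {estimate:.1f} rps, measured {measured} rps")
```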

So to answer your question about the number of CPUs needed for sub 0.5 second latency - it looks like one CPU should be enough, but it of course depends on:

your schema
the complexity of your queries
hardware (including RAM)
concurrency

#2. Number of Servers (VPS(s)) Required – I am assuming ONE server that’s capable of having up to 16 CPUs can handle the Initial and Eventual workloads, yes?

Yes. 500ms is actually a lot, and unless you need HA, one server with 16 CPUs should be enough for an index built from 40G of raw data.

#3. RAM for the VPS server – What is the (a) minimum recommended RAM? and the (b) ideal RAM required for each of the Initial_Deployment and the Eventual_Deployment scenarios? If more than one server is required, please adjust your answer accordingly.

It depends on your schema and requires calculation. In my case 2.2G is required, but I have quite a lot of attributes, as you can see above.

#4. SSD Disk Space – Given the (a) Initial_Deployment of 40 MB text to index; and (b) Eventual_Deployment of 40 GB of text to index, how much disk space will be needed for the Search Engine?

In my case the index for 37G of raw data takes 7.26G, but I would reserve 2x that, i.e. about 15G more, for temporary data. This does not include storing the original values of the full-text fields.
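
For a rough first estimate at other dataset sizes, the ratio from this test can be scaled linearly. This is only a sketch: it assumes the data compresses like the test data above (which depends on the schema and the text), applies the same 2x reserve for temporary data, and does not include the docstore.

```python
RAW_GB = 37.0      # raw data size in the test above
INDEX_GB = 7.26    # resulting index size without docstore

def estimate_disk_gb(raw_gb, reserve_factor=2.0):
    """Return (index size, reserve for temporary data) in GB, scaled linearly."""
    index_gb = raw_gb * INDEX_GB / RAW_GB
    return index_gb, index_gb * reserve_factor

for label, raw_gb in [("Initial_Deployment", 0.04), ("Eventual_Deployment", 40.0)]:
    index_gb, reserve_gb = estimate_disk_gb(raw_gb)
    print(f"{label}: ~{index_gb:.2f} GB index + ~{reserve_gb:.2f} GB reserve")
```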

I ran another test on a smaller dataset, with the same schema, to see how much it would take WITH the docstore: 4.3G of raw data requires 1.48G of disk and 442M of RAM:

root@bench /bench/perf/test_indexation # ls -lah data/3M.xml
-rw-r--r-- 1 root root 4.3G Apr 13 06:20 data/3M.xml

root@bench /bench/perf/test_indexation # ls -lt data/idx_plain.sp*
-rw-r--r-- 1 root root  11435388 Apr 29 11:42 data/idx_plain.spe
-rw-r--r-- 1 root root 443584382 Apr 29 11:42 data/idx_plain.spd
-rw-r--r-- 1 root root     49347 Apr 29 11:42 data/idx_plain.sph
-rw-r--r-- 1 root root 231663452 Apr 29 11:42 data/idx_plain.spi
-rw-r--r-- 1 root root 117301629 Apr 29 11:42 data/idx_plain.spp
-rw-r--r-- 1 root root     48288 Apr 29 11:40 data/idx_plain.sphi
-rw-r--r-- 1 root root  18670784 Apr 29 11:40 data/idx_plain.spt
-rw-r--r-- 1 root root 551928776 Apr 29 11:40 data/idx_plain.spds
-rw-r--r-- 1 root root    375000 Apr 29 11:40 data/idx_plain.spm
-rw-r--r-- 1 root root  18131080 Apr 29 11:40 data/idx_plain.spb
-rw-r--r-- 1 root root 195000192 Apr 29 11:40 data/idx_plain.spa
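
The totals quoted above can be reproduced from the listing. The Python sketch below sums the file sizes. Grouping `.spa`, `.spb`, `.spi` and `.spt` as the RAM-resident part is an assumption made here because those four files add up to the quoted 442M; which files are actually held in RAM depends on the index settings.

```python
# File sizes in bytes, copied from the `ls -lt` output above
sizes = {
    "spe": 11435388, "spd": 443584382, "sph": 49347, "spi": 231663452,
    "spp": 117301629, "sphi": 48288, "spt": 18670784, "spds": 551928776,
    "spm": 375000, "spb": 18131080, "spa": 195000192,
}

# Assumed RAM-resident files (an assumption, see the lead-in)
RAM_FILES = ("spa", "spb", "spi", "spt")

total_bytes = sum(sizes.values())
ram_bytes = sum(sizes[ext] for ext in RAM_FILES)

print(f"disk: {total_bytes / 2**30:.2f}G")
print(f"RAM:  {ram_bytes / 2**20:.0f}M")
```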

#5. Recommended Server Configurations — I favour Ubuntu 18.x operating systems.

Manticore works fine on Ubuntu 18. We run our regular performance tests on this OS.

Given the above info and questions what server specs are recommended for:
(a) Initial_Deployment. CPUs: __, RAM: __, SSD Disk Space: __ GB

I would say: 1 CPU and a little bit of RAM and SSD. 40MB of data is very little; 2GB of RAM should probably be more than enough for the index and the OS.

(b) Eventual_Deployment. CPUs: __, RAM: __, SSD Disk Space: __ GB

It depends on the concurrency, but 1-2 CPUs and 8GB of RAM should be a realistic recommendation for 40GB of raw data, and should leave some reserve for further growth. It’s better, though, to make a more precise estimate based on your schema/data.

#6. I believe that Manticore and Elasticsearch are both capable of quickly updating indexes as new or modified text is available for indexing. For instance, as merchants update product info, prices, descriptions… the search engine will need to immediately index the new/changed text such that customer searches give results with the new text.
(a) Can such quick (partial?) indexing be done efficiently (without rebuilding the entire index)?

Yes, Manticore supports real-time indexes. Here is a course about that: “Manticore Introduction in RealTime tables”.

(b) How quick are such index updates?

As you may know, Lucene-based systems (including Elasticsearch) have a “refresh interval”, which is not zero by default, so updates may not be visible to “SELECTs” for some time. In Manticore it’s different: it’s truly real-time, and all updates are visible as soon as they are made. A single document insert/update is very quick.

(c) How much load is put on the server?

What puts load on the server is segment merging (both in Elasticsearch and in Manticore). The load depends heavily on the number of inserts/updates.

#7. One of my main concerns about Manticore, when compared to Elasticsearch, is that Manticore index file sizes seem to be at least as big as the text that is being indexed.

That’s not true. Most of the data is stored compressed. See the stats from my test above.

My experience with other databases typically has index sizes at 50% or less of the data size. I read that Elasticsearch index sizes are about 30% to 40% of the text being indexed.
(a) Please confirm index file sizes for optimal search performance?

See some stats from my test above.

(b) How much of the index(s) must be in RAM, if SSD drives are used?

All fields that you want to be able to filter by or sort by should be in RAM. In your case that seems to be MerchantCode, ProductCode, ProductName, ProductCategories (list), price and the fields for association with facets. Full-text fields may be kept on disk and may be “stored” (which doesn’t require RAM).

(c) What kind of tuning is available? and what impact on performance?

You can store everything in RAM if you like (see “Creating a table > Local tables” in the Manticore Search Manual)
You can use only N bits for some fields which may lower the RAM consumption significantly
For lower latency (in case of idling CPUs) you can split your index into multiple ones and then use a distributed index to search among them in parallel (what is called “sharding” in Elasticsearch)
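
To illustrate the N-bit option in the list above: the integer attributes in the test schema earlier in this reply declare bit widths such as `int2:8`. The rough Python sketch below counts declared bits only. Manticore's actual in-memory packing and alignment are not modelled, so treat the result as an approximation rather than exact RAM usage; the document count is a hypothetical example value.

```python
# Declared widths of the rt_attr_uint attributes in the schema above
# (attributes without ":N" use the full 32 bits)
uint_bits = {
    "int1": 32, "int2": 8, "int3": 8, "int4": 8, "int5": 31, "int6": 1,
    "int7": 32, "int8": 32, "int9": 3, "int10": 1, "int11": 22, "int12": 24,
    "int13": 3, "int14": 1,
}

packed_bits = sum(uint_bits.values())   # bits per document with bit widths
full_bits = 32 * len(uint_bits)         # bits per document if all were 32-bit

docs = 10_000_000  # hypothetical number of documents
saved_mb = (full_bits - packed_bits) * docs / 8 / 2**20
print(f"{packed_bits} vs {full_bits} bits per document, ~{saved_mb:.0f} MB saved")
```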

#8. Are there any (a) must-have features; and (b) should-have features missing from Manticore for an ecommerce marketplace – that you are aware of ?

None we are aware of.

#9. It seems that there are a lot of resources (books, consultants, programmers, example code… ) available for Elasticsearch. My concern is about my learning curve to design and implement the search engine myself; and my ability to find subcontractors if/when I need help.
(a) Are there sufficient learning resources available for Manticore for (i) entry level programmer?;

We have a unique platform for learning: interactive courses at https://play.manticoresearch.com/ which cover many topics. Elasticsearch doesn’t have that. You can also refer to resources about Sphinx (books, Q&A on Stack Overflow, etc.), as Manticore still has a lot in common with Sphinx.

It’s also worth mentioning that Manticore Search supports a JSON interface, as Elasticsearch does (it’s actually quite similar to Elastic’s), but Manticore’s native language is SQL, which is much easier for getting started and designing your queries.

(ii) an experienced systems analyst and programmer who is competent in C#?; (iii) a competent PHP programmer?
(b) Are there available programmers that are competent with Manticore?

Yes, besides ourselves there are some, but of course you can find far more Elasticsearch experts.

#10. I saw a C# interface from a 3rd Party for Manticore. Does anybody have any experience with it? Is it a complete implementation? Recommendations?

You probably mean https://www.sphinxconnector.net/. I don’t have experience with it, but most likely it’s not a complete implementation, as we are constantly adding new functionality.
We provide SQL (via the MySQL protocol) and JSON-over-HTTP protocols and official clients for a few programming languages, but C# is not on the list yet.

#11. What are the scenarios where Manticore is NOT the best choice, where Elasticsearch might be a better choice?

It’s hard for me to answer this since I’m biased.

If you decide to give Manticore a try, drop us an email at contact@manticoresearch.com and we’ll be glad to have a call with you to answer your other questions and see whether we can be of any help with your project.