new to Manticore, some questions about distributed table

Hi, just recently came across this amazing DB, which has a lot of awesome features that make me want to try it out in my next projects, but I have some questions

distributed table look cool, but parallelly requesting sth like ”LIMIT n, m“ to shards will make 1000 items being fetched and recalculated,it seems, distributed table is quite heavy if not in a necessary scenario, so my question is:

1、how big size of data can a non-distributed columnar table hold before it goes into a slow response time and high latency ? say, for a billion rows of data, do I have to make a distributed table for them?

2、distributed shards has such ”LIMIT n, m“ drawback, so, in deep paging scenario, will it be a bad choice ?

3、when I start some thing small,a non-distributed table is good,but when data grows, can I alter this non-distributed table to a distributed one with some few more shards. and when data keeps growing on, could I dynamicly add more shards to it? or do I have to re-define a new table and port those data to it?

4、for distributed table, there are two terms, shards and mirrors, in my understanding, different shards hold different part of a logic table, and they do not contain duplicate data of each other; but mirrors are just duplicates of certain shards for data redundancy. am I right?

5、for shards and mirrors, how can I differentiate them? only by table name? say, in a distributed table, child tables “table1” “table2” are shards of parent table, but remote agent also has a child table “table1”, is a mirror of table1?

thx so much for these answers

Hi

1、how big size of data can a non-distributed columnar table hold before it goes into a slow response time and high latency ? say, for a billion rows of data, do I have to make a distributed table for them?

This benchmark includes 1.7 billion documents across 32 shards all on the same physical server, but theoretically, performance without physical shards should be similar since Manticore implements pseudo sharding to utilize all CPU resources. Review the results and consider their relevance to your scenario. If you have more data or require better performance, you might need to add more nodes and use a distributed table with remote agents to search across all nodes.

2、distributed shards has such ”LIMIT n, m“ drawback, so, in deep paging scenario, will it be a bad choice ?

You can manage this by increasing the max_matches setting, which determines how many documents a distributed table will receive from a remote node. More details are available at Manticore Search Manual.

3、when I start some thing small,a non-distributed table is good,but when data grows, can I alter this non-distributed table to a distributed one with some few more shards. and when data keeps growing on, could I dynamicly add more shards to it? or do I have to re-define a new table and port those data to it?

Currently, this functionality is not available, but autosharding is under development:

5、for shards and mirrors, how can I differentiate them? only by table name? say, in a distributed table, child tables “table1” “table2” are shards of parent table, but remote agent also has a child table “table1”, is a mirror of table1?

No, a table doesn’t become a mirror of another table automatically. You have to specify all in a distributed table. You can also use different table names when defining a mirror, e.g.

mysql> drop table if exists table1; drop table if exists table2; drop table if exists table3; create table table1; create table table2; create table table3; drop table if exists dist; create table dist type='distributed' local='table1' agent='127.0.0.1:9312:table2|127.0.0.1:9312:table3'; desc dist;
--------------
drop table if exists table1
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
drop table if exists table2
--------------

Query OK, 0 rows affected (0.01 sec)

--------------
drop table if exists table3
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
create table table1
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
create table table2
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
create table table3
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
drop table if exists dist
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
create table dist type='distributed' local='table1' agent='127.0.0.1:9312:table2|127.0.0.1:9312:table3'
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
desc dist
--------------

+-----------------------+-------------------+
| Agent                 | Type              |
+-----------------------+-------------------+
| table1                | local             |
| 127.0.0.1:9312:table2 | remote_1_mirror_1 |
| 127.0.0.1:9312:table3 | remote_1_mirror_2 |
+-----------------------+-------------------+
3 rows in set (0.00 sec)
create table dist type='distributed' local='table1' agent='127.0.0.1:9312:table2|127.0.0.1:9312:table3'
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
desc dist
--------------

+-----------------------+-------------------+
| Agent                 | Type              |
+-----------------------+-------------------+
| table1                | local             |
| 127.0.0.1:9312:table2 | remote_1_mirror_1 |
| 127.0.0.1:9312:table3 | remote_1_mirror_2 |
+-----------------------+-------------------+
3 rows in set (0.00 sec)

ok, I get this, so mirrors can only be defined in “agent=” statement? is there a possible type like “local_1_mirror_2” for local copies of shards like in this case, table2 and table3 are actually in the same instance of server

I found it possible to include a child table to different distributed table, is this valid?

 create table dist type='distributed' agent='127.0.0.1:9312:table1|127.0.0.1:9312:table2' agent='127.0.0.1:9312:table3|127.0.0.1:9312:table4';

create table dist2 type='distributed' agent='127.0.0.1:9312:table1|127.0.0.1:9312:table2|127.0.0.1:9312:table3';

is there a possible type like “local_1_mirror_2” for local copies of shards like in this case, table2 and table3 are actually in the same instance of server

If you mean smth like agent = local_table|remote_host:port:remote_table, no, it’s not possible since it makes little sense:

  • if local_table exists: it makes sense to use it
  • if it for some reason disappears, most likely the distributed table will be inaccessible too

I found it possible to include a child table to different distributed table, is this valid?

Yes, you can have multiple distributed tables. You can also combine local = and agent = in the same table.

thank you for these answers!

and there is another one that I’m a little confused according to Manuel :

” All transactions such as INSERT , REPLACE , DELETE , TRUNCATE on any percolate or real-time table that belongs to a cluster are replicated to all the other nodes in that cluster. Replication is multi-master, so writes to any node or multiple nodes simultaneously will work just as well“

so I think cluster writes support transactions。

but for distributed table, Manuel says :

“Manticore supports basic transactions for deleting and inserting data into real-time and percolate tables, except when attempting to write to a distributed table which includes a real-time or percolate table.”

so, here it say distributed table does not have write transaction?

why cluster supports it but distributed table does‘nt ?

why cluster supports it but distributed table does‘nt ?

Because replication and distributed tables are two absolutely different things. It’s like comparing apples and oranges. What’s meant in the documentation is that:

  1. You do START TRANSACTION; ... ; COMMIT and once you commit - the whole transaction is propagated to other nodes of the replication cluster. Until then it doesn’t.
  2. Distributed tables don’t support transactions. Distributed tables also don’t support writes (related task is auto-sharding: distributed writes · Issue #1087 · manticoresoftware/manticoresearch · GitHub). You can only DELETE or UPDATE through them.

thank you, I get this.

still questions about it:

1、what if I need handling big data as well as requiring transaction feature ? distributed table is not a choice, since it does not support transaction.

2、For transaction feature, Putting a billion rows in a non-distributed table, is it practical in this way? if pseudo sharding is enabled, will this non-distributed table support transaction?

what if I need handling big data as well as requiring transaction feature ? distributed table is not a choice, since it does not support transaction.

Do you mean distributed tables with remote agents, right? Distributed transactions are not a common feature of databases, especially search engines. If it’s crucial, you should look for databases supporting ‘distributed transactions’. For example, MySQL provides documentation on this topic: https://dev.mysql.com/doc/refman/8.0/en/xa.html.

2、For transaction feature, Putting a billion rows in a non-distributed table, is it practical in this way?

If there’s enough hardware in terms of performance and resource consumption (both depend on your data) - yes, why not.

if pseudo sharding is enabled, will this non-distributed table support transaction?

Yes, it will, transactions don’t care about the number of rows.