Large datasets setup

Hello,

We currently have a distributed index with about 100 plain indexes spread out over 4 nodes.
The indexes are between 20GB and 90GB (all index files combined) each.

We have been having some issues lately with system memory and were wondering what the most efficient setup would be:

  • a small number of very large indexes on a few large nodes
  • a large number of smaller (20 GB each) indexes on a larger number of lighter nodes

What would be considered an optimal size for a plain index?

Thanks in advance

In theory, fewer shards may be beneficial, as the full-text dictionaries would be duplicated fewer times. In practice, RAM overconsumption is often caused by the attributes. It’s worth considering whether switching to columnar storage would help, or whether reviewing and optimizing the schema is the better approach.
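If you want to check whether the attributes really are the main consumer before changing anything, here is a minimal sketch. It assumes a typical plain-table layout where each table’s files share a prefix under one data directory and that .spa/.spb are the attribute-related files; the path is a placeholder, so adjust it for your setup:

```python
#!/usr/bin/env python3
"""Rough per-table breakdown: attribute files vs. total index footprint.

Assumptions (adjust for your setup):
  * plain tables live as groups of files sharing a prefix, e.g. /var/lib/manticore/mytable.sp*
  * .spa and .spb are the attribute-related files of each table
"""
import os
from collections import defaultdict

DATA_DIR = "/var/lib/manticore"   # placeholder path
ATTR_EXTS = {".spa", ".spb"}      # attribute files (assumed roles)

attr_bytes = defaultdict(int)
total_bytes = defaultdict(int)

for name in os.listdir(DATA_DIR):
    base, ext = os.path.splitext(name)
    if not ext.startswith(".sp"):
        continue                  # skip non-index files (.lock, .meta, logs, ...)
    size = os.path.getsize(os.path.join(DATA_DIR, name))
    total_bytes[base] += size
    if ext in ATTR_EXTS:
        attr_bytes[base] += size

for table in sorted(total_bytes, key=total_bytes.get, reverse=True):
    total = total_bytes[table]
    attrs = attr_bytes[table]
    share = 100.0 * attrs / total if total else 0.0
    print(f"{table:30s} total {total / 2**30:6.1f} GiB, "
          f"attributes {attrs / 2**30:6.1f} GiB ({share:.0f}%)")
```

If the attribute files dominate the footprint, columnar storage is more likely to pay off; if they are only a small share, the savings from switching will be limited.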

Thank you for the quick reply. I took a look at columnar storage, but it’s not completely clear what is meant by ‘when the index is larger than RAM’: the size of a single index, or the combined size of all indexes on the node?

Also, what kind of performance decrease could we expect when switching to columnar storage?

The size of a single index, or the combined size of all indexes on the node?

What matters is the combined size of the .spa, .spb and .spi files of the tables that are used in searches.
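That is, the total across every table that actually serves searches on the node, not a single table. A minimal sketch of that check follows; the data directory and table names are placeholders, and total RAM is read from /proc/meminfo, so it assumes Linux:

```python
import os

DATA_DIR = "/var/lib/manticore"                    # placeholder path
SEARCHED_TABLES = ["products_1", "products_2"]     # hypothetical table names
EXTS = (".spa", ".spb", ".spi")                    # files counted toward the RAM comparison

# Sum the sizes of the relevant files across all tables that serve searches.
combined = sum(
    os.path.getsize(os.path.join(DATA_DIR, f"{table}{ext}"))
    for table in SEARCHED_TABLES
    for ext in EXTS
    if os.path.exists(os.path.join(DATA_DIR, f"{table}{ext}"))
)

# Total physical RAM on Linux: MemTotal is reported in kB in /proc/meminfo.
with open("/proc/meminfo") as f:
    mem_total_kb = next(int(line.split()[1]) for line in f if line.startswith("MemTotal"))

print(f"combined .spa/.spb/.spi of searched tables: {combined / 2**30:.1f} GiB")
print(f"total RAM:                                  {mem_total_kb / 2**20:.1f} GiB")
```

If the combined figure exceeds the node’s RAM, those files can no longer be fully cached, which matches the memory pressure described above.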

Also, what kind of performance decrease could we expect when switching to columnar storage?

Please review the benchmarks on https://db-benchmarks.com/. For example: