Advice on scaling Manticore Search for ~500M documents and ~100TB dataset

Hello Manticore community,

I’m currently evaluating Manticore for a large-scale search platform and would appreciate advice from people who have experience running Manticore at very large scale.


Dataset Size

Projected dataset:

  • Documents: 400–500 million

  • Total indexed size: ~100 TB

  • Daily ingestion: ~2–5 million new documents

  • Documents contain:

    • metadata fields

    • text body fields

    • timestamps

    • user identifiers

Search is primarily full-text search combined with structured filters.

Typical queries include:

  • keyword search in subject/body

  • filtering by sender/recipient

  • date range filters

  • metadata filtering


Table Design (Testing)

We are currently testing columnar tables.

Example structure:

CREATE TABLE documents_xxxx
(
    id BIGINT,
    doc_uid STRING,
    subject STRING,
    body_text TEXT,
    attachment_text TEXT,
    sender STRING,
    recipients STRING,
    created_date TIMESTAMP,
    category STRING
)
ENGINE='columnar';

The goal is to balance:

  • storage efficiency

  • ingestion speed

  • full-text search performance


Sharding Strategy

Because of the dataset size, we are considering time-based partitioning.

For example:

documents_2023
documents_2024
documents_2025
documents_2026

And querying via a distributed table:

documents_all
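For reference, a sketch of how the distributed table could tie the yearly shards together, using Manticore's `type='distributed'` syntax (table names from the example above):

```sql
CREATE TABLE documents_all type='distributed'
    local='documents_2023'
    local='documents_2024'
    local='documents_2025'
    local='documents_2026';
```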

Questions

1. Shard Size

For a dataset of roughly 500M documents / 100TB, what shard size is recommended?

Would yearly tables be appropriate, or would monthly tables be better for:

  • query performance

  • index maintenance

  • ingestion speed


2. Columnar vs RT Tables

For large-scale full-text search workloads, is columnar storage the recommended approach?

Or would RT tables provide better performance for mixed workloads (continuous ingestion + search)?


3. Query Optimization Across Shards

If using multiple time-based shards with a distributed table, will queries automatically avoid scanning irrelevant shards when filtering by date?

Example:

SELECT * FROM documents_all
WHERE created_date >= '2025-01-01'
AND MATCH('example search')
LIMIT 100

Will only relevant shards be searched?


4. Large Text Fields

Some documents contain large text bodies and attachment text.

Is it recommended to:

  • store the full text in the index

  • store only truncated text

  • or apply other optimization strategies?


5. Cluster Topology

For this scale, what cluster layout would typically work well?

Example hardware per node:

  • 8–16 CPU cores

  • 32–64 GB RAM

  • NVMe SSD


Any advice or examples of large-scale deployments would be greatly appreciated.

Thank you!

Hello @truezjz

Documents: 400–500 million

Total indexed size: ~100 TB

~214 KB per document?

keyword search in subject/body

`subject STRING` won’t work for keyword search. You need `subject TEXT`, or `subject TEXT indexed attribute` if you want to perform keyword search on it and also sort/filter by it.
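As a sketch, the field definition would change like this (assuming Manticore's `indexed`/`attribute` text-field properties; adjust to your version):

```sql
-- subject as a full-text field; `indexed attribute` makes it both
-- full-text searchable and available as a string attribute for sorting
subject TEXT indexed attribute
```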

Because of the dataset size, we are considering time-based partitioning.

And querying via a distributed table:

documents_all

If your use case includes several date filtering modes (e.g., last week / month / year, etc.), then having multiple corresponding distributed tables might make sense so you can avoid querying older data entirely.
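For example (a sketch; the per-range table names are hypothetical):

```sql
-- One distributed table per typical date window, so shards that
-- can't match the filter are never queried at all
CREATE TABLE documents_recent type='distributed'
    local='documents_2026';

CREATE TABLE documents_last_3y type='distributed'
    local='documents_2024' local='documents_2025' local='documents_2026';

CREATE TABLE documents_all type='distributed'
    local='documents_2023' local='documents_2024'
    local='documents_2025' local='documents_2026';
```

Your application then picks the narrowest table that covers the requested date range.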

For a dataset of roughly 500M documents / 100TB, what shard size is recommended?

I would use manticore-load (GitHub: manticoresoftware/manticore-load, the Manticore Load Emulator) to test that.

For large-scale full-text search workloads, is columnar storage the recommended approach?

Columnar storage is not related to full-text search. Full-text search works the same way. With your schema, only these fields would be stored in columnar format:

doc_uid STRING,
subject STRING,
sender STRING,
recipients STRING,
created_date TIMESTAMP,
category STRING

At first glance, since you will have a large amount of data, I would also try columnar storage to save RAM.

If using multiple time-based shards with a distributed table, will queries automatically avoid scanning irrelevant shards when filtering by date?

No. As I mentioned above, it may make sense to create multiple distributed tables if possible.

Is it recommended to: store the full text in the index

If you don’t need the original text, you can save a lot of space by not storing it in the index. For example:

body_text text indexed

instead of:

body_text text
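Applied to your schema, a space-saving variant might look like this (a sketch; it assumes you keep the original email text elsewhere, since the original of an `indexed`-only field is not stored and can't be returned in results):

```sql
CREATE TABLE documents_xxxx
(
    id BIGINT,
    doc_uid STRING,
    subject TEXT,                  -- searchable, original stored
    body_text TEXT indexed,        -- searchable, original not stored
    attachment_text TEXT indexed,  -- searchable, original not stored
    sender STRING,
    recipients STRING,
    created_date TIMESTAMP,
    category STRING
)
ENGINE='columnar';
```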

For this scale, what cluster layout would typically work well?

I don’t like guessing. We have a great tool for experimenting with different loads — manticore-load. I highly recommend testing it on at least part of your data and then extrapolating.

Thanks for the prompt response.

~214 KB per document?

A: Yes, that's the average size; we need to store and index full email messages.

If your use case includes several date filtering modes (e.g., last week / month / year, etc.), then having multiple corresponding distributed tables might make sense so you can avoid querying older data entirely.

Q: If we only allow date-range search (e.g. 01/01/2010 – 01/01/2015), and most queries are keyword searches plus filters on header information such as sender and recipients, would one distributed table be enough?

Q: Regarding columnar storage (`ENGINE='columnar'`): my understanding is that it will save space on TEXT fields like body_text and attachment_text, which take the most space. Is that correct?

Q: This specific query will still scan all tables linked in documents_all, correct?

Q: If we only allow date-range search (e.g. 01/01/2010 – 01/01/2015), and most queries are keyword searches plus filters on header information such as sender and recipients, would one distributed table be enough?

It may be enough, but if you’re looking for ultimate performance and you always filter by date, then skipping tables that will definitely return nothing using multiple distributed tables still makes sense.

Q: Regarding columnar storage (`ENGINE='columnar'`): my understanding is that it will save space on TEXT fields like body_text and attachment_text, which take the most space. Is that correct?

No, columnar storage has nothing in common with full-text fields.

Q: This specific query will still scan all tables linked in documents_all, correct?

Yes