Advice on scaling Manticore Search for ~500M documents and ~100TB dataset

Hello Manticore community,

I’m currently evaluating Manticore for a large-scale search platform and would appreciate advice from people who have experience running Manticore at very large scale.


Dataset Size

Projected dataset:

  • Documents: 400–500 million

  • Total indexed size: ~100 TB

  • Daily ingestion: ~2–5 million new documents

  • Documents contain:

    • metadata fields

    • text body fields

    • timestamps

    • user identifiers

Search is primarily full-text search combined with structured filters.

Typical queries include:

  • keyword search in subject/body

  • filtering by sender/recipient

  • date range filters

  • metadata filtering


Table Design (Testing)

We are currently testing columnar tables.

Example structure:

CREATE TABLE documents_xxxx
(
    id BIGINT,
    doc_uid STRING,
    subject STRING,
    body_text TEXT,
    attachment_text TEXT,
    sender STRING,
    recipients STRING,
    created_date TIMESTAMP,
    category STRING
)
ENGINE='columnar';

The goal is to balance:

  • storage efficiency

  • ingestion speed

  • full-text search performance


Sharding Strategy

Because of the dataset size, we are considering time-based partitioning.

For example:

documents_2023
documents_2024
documents_2025
documents_2026

And querying via a distributed table:

documents_all
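For reference, a sketch of how the distributed table could tie the yearly shards together, using Manticore's `type='distributed'` syntax (table names from the example above):

```sql
CREATE TABLE documents_all type='distributed'
    local='documents_2023'
    local='documents_2024'
    local='documents_2025'
    local='documents_2026';
```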

Questions

1. Shard Size

For a dataset of roughly 500M documents / 100TB, what shard size is recommended?

Would yearly tables be appropriate, or would monthly tables be better for:

  • query performance

  • index maintenance

  • ingestion speed


2. Columnar vs RT Tables

For large-scale full-text search workloads, is columnar storage the recommended approach?

Or would RT tables provide better performance for mixed workloads (continuous ingestion + search)?


3. Query Optimization Across Shards

If using multiple time-based shards with a distributed table, will queries automatically avoid scanning irrelevant shards when filtering by date?

Example:

SELECT * FROM documents_all
WHERE created_date >= '2025-01-01'
AND MATCH('example search')
LIMIT 100

Will only relevant shards be searched?


4. Large Text Fields

Some documents contain large text bodies and attachment text.

Is it recommended to:

  • store the full text in the index

  • store only truncated text

  • or apply other optimization strategies?


5. Cluster Topology

For this scale, what cluster layout would typically work well?

Example hardware per node:

  • 8–16 CPU cores

  • 32–64 GB RAM

  • NVMe SSD


Any advice or examples of large-scale deployments would be greatly appreciated.

Thank you!

Hello @truezjz

Documents: 400–500 million

Total indexed size: ~100 TB

~214 KB per document?

keyword search in subject/body

`subject STRING` won’t work for keyword search. You need `subject TEXT`, or `subject TEXT indexed attribute` if you want to perform keyword search on it and also sort/filter by it.
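As a sketch, the field definition would change like this (assuming Manticore's `indexed`/`attribute` text-field properties; adjust to your version):

```sql
-- subject as a full-text field; `indexed attribute` makes it both
-- full-text searchable and available as a string attribute for sorting
subject TEXT indexed attribute
```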

Because of the dataset size, we are considering time-based partitioning.

And querying via a distributed table:

documents_all

If your use case includes several date filtering modes (e.g., last week / month / year, etc.), then having multiple corresponding distributed tables might make sense so you can avoid querying older data entirely.
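For example (a sketch; the per-range table names are hypothetical):

```sql
-- One distributed table per typical date window, so shards that
-- can't match the filter are never queried at all
CREATE TABLE documents_recent type='distributed'
    local='documents_2026';

CREATE TABLE documents_last_3y type='distributed'
    local='documents_2024' local='documents_2025' local='documents_2026';

CREATE TABLE documents_all type='distributed'
    local='documents_2023' local='documents_2024'
    local='documents_2025' local='documents_2026';
```

Your application then picks the narrowest table that covers the requested date range.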

For a dataset of roughly 500M documents / 100TB, what shard size is recommended?

I would use manticore-load (GitHub: manticoresoftware/manticore-load, the Manticore Load Emulator) to test that.

For large-scale full-text search workloads, is columnar storage the recommended approach?

Columnar storage is not related to full-text search. Full-text search works the same way. With your schema, only these fields would be stored in columnar format:

doc_uid STRING,
subject STRING,
sender STRING,
recipients STRING,
created_date TIMESTAMP,
category STRING

At first glance, since you will have a large amount of data, I would also try columnar storage to save RAM.

If using multiple time-based shards with a distributed table, will queries automatically avoid scanning irrelevant shards when filtering by date?

No. As I mentioned above, it may make sense to create multiple distributed tables if possible.

Is it recommended to: store the full text in the index

If you don’t need the original text, you can save a lot of space by not storing it in the index. For example:

body_text text indexed

instead of:

body_text text
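Applied to your schema, a space-saving variant might look like this (a sketch; it assumes you keep the original email text elsewhere, since the original of an `indexed`-only field is not stored and can't be returned in results):

```sql
CREATE TABLE documents_xxxx
(
    id BIGINT,
    doc_uid STRING,
    subject TEXT,                  -- searchable, original stored
    body_text TEXT indexed,        -- searchable, original not stored
    attachment_text TEXT indexed,  -- searchable, original not stored
    sender STRING,
    recipients STRING,
    created_date TIMESTAMP,
    category STRING
)
ENGINE='columnar';
```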

For this scale, what cluster layout would typically work well?

I don’t like guessing. We have a great tool for experimenting with different loads — manticore-load. I highly recommend testing it on at least part of your data and then extrapolating.

Thanks for the prompt response.

~214 KB per document?

A: Yes, that's the average size; we need to store and index full email messages.

If your use case includes several date filtering modes (e.g., last week / month / year, etc.), then having multiple corresponding distributed tables might make sense so you can avoid querying older data entirely.

Q: If we only allow date-range search (e.g. 01/01/2010 – 01/01/2015), and most queries are keyword searches plus filters on header information such as sender and recipients, would one distributed table be enough?

Q: Regarding columnar storage (`ENGINE='columnar'`): my understanding is that it will save space on TEXT fields like body_text and attachment_text, which take the most space. Is that correct?

Q: This specific query will still scan all tables linked in documents_all, correct?

Q: If we only allow date-range search (e.g. 01/01/2010 – 01/01/2015), and most queries are keyword searches plus filters on header information such as sender and recipients, would one distributed table be enough?

It may be enough, but if you’re looking for ultimate performance and you always filter by date, then skipping tables that will definitely return nothing using multiple distributed tables still makes sense.

Q: Regarding columnar storage (`ENGINE='columnar'`): my understanding is that it will save space on TEXT fields like body_text and attachment_text, which take the most space. Is that correct?

No, columnar storage has nothing in common with full-text fields.

Q: This specific query will still scan all tables linked in documents_all, correct?

Yes