Hello Manticore community,
I’m currently evaluating Manticore for a large-scale search platform and would appreciate advice from anyone with experience running Manticore at this scale.
Dataset Size
Projected dataset:

- Documents: 400–500 million
- Total indexed size: ~100 TB
- Daily ingestion: ~2–5 million new documents
- Documents contain:
  - metadata fields
  - text body fields
  - timestamps
  - user identifiers
-
Search is primarily full-text search combined with structured filters.
Typical queries include:

- keyword search in subject/body
- filtering by sender/recipient
- date range filters
- metadata filtering
Table Design (Testing)
We are currently testing columnar tables.
Example structure:
CREATE TABLE documents_xxxx
(
id BIGINT,
doc_uid STRING,
subject STRING,
body_text TEXT,
attachment_text TEXT,
sender STRING,
recipients STRING,
created_date TIMESTAMP,
category STRING
)
engine='columnar';
The goal is to balance:

- storage efficiency
- ingestion speed
- full-text search performance
Sharding Strategy
Because of the dataset size, we are considering time-based partitioning.
For example:
documents_2023
documents_2024
documents_2025
documents_2026
And querying via a distributed table:
documents_all
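Roughly like this, assuming the standard type='distributed' syntax (the yearly table names are just our placeholders):

CREATE TABLE documents_all
type='distributed'
local='documents_2023'
local='documents_2024'
local='documents_2025'
local='documents_2026';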
Questions
1. Shard Size
For a dataset of roughly 500M documents / 100TB, what shard size is recommended?
Would yearly tables be appropriate, or would monthly tables be better for:

- query performance
- index maintenance
- ingestion speed
2. Columnar vs RT Tables
For large-scale full-text search workloads, is columnar storage the recommended approach?
Or would RT tables provide better performance for mixed workloads (continuous ingestion + search)?
3. Query Optimization Across Shards
If using multiple time-based shards with a distributed table, will queries automatically avoid scanning irrelevant shards when filtering by date?
Example:
SELECT * FROM documents_all
WHERE created_date >= '2025-01-01'
AND MATCH('example search')
LIMIT 100
Will only relevant shards be searched?
4. Large Text Fields
Some documents contain large text bodies and attachment text.
Is it recommended to:

- store the full text in the index
- store only truncated text
- or apply other optimization strategies?
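For context, one option we were considering (if we've understood Manticore's stored_fields correctly) is indexing the large fields without storing their original content, so they remain searchable but are not retrievable. Field names below are from our schema:

CREATE TABLE documents_xxxx
(
id BIGINT,
doc_uid STRING,
subject STRING,
body_text TEXT,
attachment_text TEXT,
sender STRING,
recipients STRING,
created_date TIMESTAMP,
category STRING
)
engine='columnar'
stored_fields='subject';

Here only subject would keep its original text; body_text and attachment_text would be full-text searchable but their originals would not be stored in the table. Is this a reasonable strategy at this scale?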
5. Cluster Topology
For this scale, what cluster layout would typically work well?
Example hardware per node:

- 8–16 CPU cores
- 32–64 GB RAM
- NVMe SSD
Any advice or examples of large-scale deployments would be greatly appreciated.
Thank you!