Suggestions for an email archive

I develop an email archiving application that currently uses Sphinx with a main + delta + delta-delta scheme, and I’m investigating the option of switching to Manticore.

However, I’m confused about which index options I should use: real-time or plain index type? RT mode or plain mode?

Note that some users want to delete aged emails (e.g. after 5 or 7 years) from the archive (and from the index as well), while other users want to keep their archived emails forever.

If the scheme you use works fine for you, why don’t you keep using it? Plain indexes are supported only in plain mode (the only one which existed in open source versions of Sphinx), so you should be familiar with it and the migration should go smoothly.

Well, it works, sure. However, there are a few drawbacks. Several archives store millions of messages, and the main index grows to several tens of GB over time, which makes the delta merge more and more time-consuming. Also, when aged emails are purged from the store, the kill list grows over time, though I assume both Sphinx and Manticore handle it efficiently. Deleting a main index and then reindexing is time-consuming, so it’s a last resort. And finally, some users need to see emails instantly in the GUI, and they don’t want to wait until a cron job runs the delta indexer.

So that’s why I thought I’d give the RT index a shot, but I wanted to ask for a review of the idea, and frankly I was somewhat lost about the relation between index type (RT or plain) and index mode (RT or plain).

One thing to be aware of: for the most part, ‘RT’ indexes are just ‘managed’ main+delta indexes under the hood, although the main is often sharded. They still use things like kill lists to manage deletions.

(although there is the ‘RAM chunk’, which is a small delta index maintained in memory, so individual inserts can be tacked on to the end quickly)

So in general switching to RT won’t give you much benefit. It’s the same stuff, just handled differently.

Although there is something to be said for making your ‘delta-delta’ an RT index, so you can quickly insert new records - as you say, without waiting for the cron job.
(that being said, moving records from the RT delta to a plain index is not easy (but there are some tricks) - so you might be best off just converting to a whole RT index to avoid the issue, and let Manticore manage the sharding.)

As for mode (the above is just about index type): RT mode only allows RT indexes, and they are defined dynamically at runtime, not in the ‘.conf’ file, whereas ‘plain’ mode supports both RT and plain indexes, but the indexes are defined in a ‘static’ config file.
(frankly, I think the modes would be better called static and dynamic config, so as not to confuse them with RT/plain indexes, but that’s a separate discussion!)
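
To make the ‘dynamic’ part concrete: in RT mode a table is created by sending a statement to the running daemon rather than by editing the config. Below is a minimal Python sketch that only builds such a statement for a per-customer table. The helper name, the table name and the columns (subject, body, sent_at) are hypothetical, and the exact CREATE TABLE options should be checked against the Manticore documentation before relying on this.

```python
import re


def create_table_sql(table: str, rt_mem_limit: str = "256M") -> str:
    """Build a CREATE TABLE statement for one customer's RT table.

    Hypothetical helper: the column set is made up for illustration,
    and passing rt_mem_limit as a table option is an assumption to
    verify against the Manticore docs.
    """
    # Table names cannot be bound as query parameters, so validate strictly.
    if not re.fullmatch(r"[a-z_][a-z0-9_]*", table):
        raise ValueError(f"unsafe table name: {table!r}")
    if not re.fullmatch(r"[0-9]+[KMG]?", rt_mem_limit):
        raise ValueError(f"unexpected rt_mem_limit: {rt_mem_limit!r}")
    return (
        f"CREATE TABLE {table}(subject text, body text, sent_at timestamp) "
        f"rt_mem_limit='{rt_mem_limit}'"
    )


print(create_table_sql("customer_42"))
```

The resulting string would then be sent through whatever MySQL-protocol client you already use to talk to searchd.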

Thank you, Barry, it makes sense now. I think I’ll give the dynamic config a try; I like the idea of not having to restart the searchd daemon when I add a new customer with their own index.

However, one more question. Let’s say I have 2 GB of memory in the host, and I set rt_mem_limit = 256 MB per index (let’s neglect the memory requirements of the OS, etc. for now). Does it mean that I can create only 8 customers on the host (i.e. 8 * 256 MB = 2 GB)? Or is it just the upper limit for the chunk kept in memory? In other words: can I create even 50 index tables with rt_mem_limit = 256 MB if the host has only 2 GB of physical memory?

It does mean that each individual index would allow its RAM chunk to grow to 256 MB before it is ‘flushed’ and turned into a ‘disk’ chunk. So yes, 8 indexes could grow to 2 GB, depending on the insert patterns.

Each index won’t immediately take the full 256 MB, that’s just the limit, so in practice more than 8 would fit. But there is a risk of overflow as each index grows.

(don’t forget the disk chunks still use memory too. rt_mem_limit is just the limit for the RAM chunk portion; the index as a whole can take more memory!)
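
The worst case discussed above is simple arithmetic, sketched here in Python. The function names are made up for illustration, and the calculation deliberately counts only the RAM chunks: as noted, disk chunks and the OS need memory on top of this, so the real budget is tighter.

```python
def worst_case_ram_chunks_mb(num_indexes: int, rt_mem_limit_mb: int) -> int:
    """Upper bound on RAM-chunk memory if every index fills its RAM chunk."""
    return num_indexes * rt_mem_limit_mb


def fits_in_host(num_indexes: int, rt_mem_limit_mb: int, host_mb: int) -> bool:
    """True if all RAM chunks could be full at once without exceeding host_mb.

    Ignores disk-chunk memory and OS overhead (an explicit simplification).
    """
    return worst_case_ram_chunks_mb(num_indexes, rt_mem_limit_mb) <= host_mb


HOST_MB = 2048   # 2 GB host from the question
LIMIT_MB = 256   # rt_mem_limit per index

for n in (8, 50):
    total = worst_case_ram_chunks_mb(n, LIMIT_MB)
    print(f"{n} indexes: up to {total} MB of RAM chunks, "
          f"fits in {HOST_MB} MB: {fits_in_host(n, LIMIT_MB, HOST_MB)}")
```

So 50 indexes can be created, since the limit is not reserved up front, but nothing stops them from collectively outgrowing the host if enough of them fill their RAM chunks at the same time.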