Get stats on documents which were never matched in an index

Vishnu_shettigar · December 26, 2019, 2:53am

Is there any way to find out which documents were never matched? It would really help in filtering useful data out of large indexes.

Sergey · December 26, 2019, 2:57am

Hi. No, there’s no built-in way to do that now. But it’s an interesting topic.
Can you elaborate on what you would do with data consisting of the # of matches for each doc?

Vishnu_shettigar · December 26, 2019, 3:28am

Consider that I have many large RT indexes which in total take a few hundred GB of RAM. I know for sure that clients won’t be interested in all the data that is indexed, so my indexes contain documents which are irrelevant to my clients. If there was a way to find all the documents which were never matched (or all the documents which were matched), I could significantly decrease my index size by deleting the irrelevant documents or by not indexing them at all.
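Since there is no built-in way to do this, one rough workaround is to count matches on the client side. The sketch below is only an illustration, not a Manticore feature: `MatchTracker`, `record` and `never_matched` are hypothetical names, and it assumes the application can hand over the document IDs that each query returned. It only sees IDs actually returned to the client, so a document that matched but fell outside the query's result limit would still be reported as never matched.

```python
from collections import Counter


class MatchTracker:
    """Hypothetical client-side counter of how often each document ID is returned."""

    def __init__(self, all_ids):
        # IDs of every document known to be in the index (assumed to be available).
        self.all_ids = set(all_ids)
        # Number of times each ID has appeared in a result set.
        self.hits = Counter()

    def record(self, matched_ids):
        # Call once per query with the IDs found in its result set.
        self.hits.update(matched_ids)

    def never_matched(self):
        # IDs that have not appeared in any recorded result set.
        return self.all_ids - set(self.hits)


tracker = MatchTracker(all_ids=range(1, 7))
tracker.record([1, 3])   # result set of a first query
tracker.record([3, 4])   # result set of a second query
print(sorted(tracker.never_matched()))  # prints [2, 5, 6]
```

The set returned by `never_matched()` would be the candidate list for deletion, or for skipping at indexing time.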

Sergey · December 26, 2019, 5:19am

Interesting idea, we’ll discuss it. If RAM is the only problem, then unless you use access_*=mlock there is a chance that your OS can partly do what you want by itself. But it heavily depends on how fragmented the documents that don’t need to be in RAM are. Of course, the effect would be greater if this was under Manticore’s control.

Vishnu_shettigar · December 26, 2019, 5:38am

Thanks @Sergey. I do have another query, which is more of a feature request. As I mentioned above, if I were able to find the irrelevant documents I would delete them manually. But would it be possible to have a feature in manticoresearch where, if documents are not queried or matched for n days or hours, manticoresearch gradually moves them out of RAM and keeps them in disk storage only, and later, if these documents get matched by any query, puts them back into RAM?

Sergey · December 26, 2019, 6:20am

I had the same idea (which I wanted to discuss with the team). What we could do is add support for .spa(c,h), .spb(c,h) and so on files, where “c” stands for cold and “h” for hot. When an OPTIMIZE happens, searchd would create those .spac, .spah etc. files. Then .spah would be mlocked (for best performance) and preread, while .spac would be just mmaped w/o preread. As long as the documents in .spac are still not requested, the OS would keep them mostly on disk. Once they are requested, they would appear in .spah the next time, some others would be moved to .spac, and so on.
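The hot/cold split described above can be illustrated with a small sketch. It is not how searchd works or would work: `split_hot_cold`, `last_matched` and `max_idle_seconds` are hypothetical names, and the sketch assumes a timestamp of the last match is available per document. It only shows the partitioning decision that an OPTIMIZE-time pass could make before writing hot and cold files.

```python
import time


def split_hot_cold(last_matched, all_ids, max_idle_seconds, now=None):
    """Partition document IDs into (hot, cold) sets by time since last match.

    last_matched maps doc ID -> UNIX timestamp of its most recent match
    (a hypothetical statistic; documents missing from it were never matched).
    """
    now = time.time() if now is None else now
    hot, cold = set(), set()
    for doc_id in all_ids:
        ts = last_matched.get(doc_id)
        if ts is not None and now - ts <= max_idle_seconds:
            hot.add(doc_id)    # matched recently: candidate for the "h" files
        else:
            cold.add(doc_id)   # idle too long or never matched: "c" files
    return hot, cold


# Doc 1 matched 100 s ago, doc 2 matched 700 s ago, doc 3 never matched.
hot, cold = split_hot_cold({1: 1000.0, 2: 400.0}, [1, 2, 3],
                           max_idle_seconds=300, now=1100.0)
print(sorted(hot), sorted(cold))  # prints [1] [2, 3]
```

Rerunning the split at each pass lets a cold document that gets matched again move back to the hot set, which mirrors the "put them back into RAM" behaviour asked about earlier in the thread.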

Vishnu_shettigar · December 26, 2019, 7:18am

That sounds great, @Sergey. This could solve the problem that I am having, thank you.