Where Value IN Search

hatemjaber · February 25, 2022, 3:29pm

Hi Everyone,

I was looking in the documentation but possibly not wording it correctly… I’m looking for a way to match n of items that exact match the passed in array of values. For example, I have an index that has (id, url) and I want to send an array of urls to see if they exist in the index. Is there a “WHERE IN [a, b, c, …]” type of search? I would be passing in anywhere from 25 to 100 urls at a time and would like to pass in as an array if that is possible.

The other question is: does the “url” column have to be a full-text column or can it be a attribute column since we just want to match against it?

Thank you in advance for your feedback!

L4G · March 1, 2022, 7:51pm

Using full text search on multiple items at once will make your query very slow. Especially going over 25 as you need to. Full text search is most effective (when compared to attribute search) when you are searching for single (or just 2-3) items. In some cases, depending on data it can even be faster using full text search up to 80 items. But over that number - it is equal or faster to use attributes.

https://manual.manticoresearch.com/Creating_an_index/Data_types

has some explanation you need.

Your best bet is to assign an ID to each URL and use that ID as an attribute. You may wish to read a bit about MVA attributes as well which would allow you using second, third and so on attribute on each URL to provide further sorting and grouping.

barryhunter · March 4, 2022, 10:06am

In theory yes, the URL could be an attribute - the IN(…) should work with string attribute.

But attribute queries tend to not be efficient. Most of the time its a ‘full table scan’ (but still typically quick, because attributes are held entirely in RAM!)

So using a full text filed (to exploit the inverted index) might be more efficient. But with so many terms, might be marginal (ie the overhead of so many queryes) might negate the benefit, and atttributes quicker.

Frankly would say, try both, and benchmark!

There also is

which aims to make attribute queries more efficient. While still full-table scans, being column based in theory is more efficient.
(Although not entirely clear if string queries are supported by columnar)

Sergey · March 4, 2022, 10:15am

(Although not entirely clear if string queries are supported by columnar)

They are supported and there are some skip-block algorithms for strings that indeed can decrease latency in some cases, but as with attributes vs full-text it makes sense to benchmark, since the skip-block logic can skip a block, but non-skipped blocks will have to be read from disk and uncompressed while with the row-wise storage the attributes are already in memory and uncompressed. So what will work best depends on data cardinality, data volume and other things.