How to effiently index multi-word tags

barryhunter · December 5, 2022, 1:21pm

I have a bunch documents that have ‘tags’ - multiword text labels, and there can be multiple tags per document.

At the moment I index then with in a full text field (also stored as a string attribute, but will change to doc-store at some point!)

| id      | tags                                                               | tag_ids     |
| 1192526 | _SEP_ listed building _SEP_                                        | 581         |
| 1624955 | _SEP_ crescent _SEP_                                               | 8434        |
|  230677 | _SEP_ near:Wickham Market _SEP_ taken from:Loudham Hall Road _SEP_ | 28270,28595 |
|  753854 | _SEP_ Snow _SEP_                                                   | 1678        |
| 1063848 | _SEP_ dual carriageway _SEP_                                       | 407         |
| 1133884 | _SEP_ at:Martlesham Heath _SEP_ Tesco Extra Superstore _SEP_       | 35312,37013 |
| 1134128 | _SEP_ Blackpool Tower _SEP_                                        | 54725       |
| 1378565 | _SEP_ park _SEP_ floral display _SEP_                              | 672,2571    |

But also put the tag-ids into a MVA attribute.

This works in that an filter like match('@tags "_SEP_ Tesco Extra Superstore _SEP_"')
(use this seperator as DONT want say searching [Tesco Extra] to match [Tesco Extra Superstore] - they are differnt tags!)

But its a bit slow, because the _SEP_ exists on every document. and so searchd is loading the doc-list for all documents.

Of course can do where ANY(tag_ids) = 37013 which works but can end up even slower. Particully if the query is complex (ie there are other keyword matches, or want to combine with OR etc!)

Field start/end modifiers would almost be perfect, but there are multiple tags per image,

The index is built with GROUP_CONCAT

 CONCAT('_SEP_ ',GROUP_CONCAT(DISTINCT tag ORDER BY tag_id SEPARATOR ' _SEP_ '),' _SEP_') AS tags

As I understand it start/end modifiers work by including extra words in the keyword index, with extra control characters. Seems like if could ‘add’ them manually to the input string.

So could do something like SEPARATOR '\\b . \\b'
… so the \b would be added an an extra field end modifer to te last word in the tag, and also as to the first word in the tag. (the actual start/end would get the real one)
Don’t think this will get thought charset_table (or many all control chars are treated as seperator)

Or maybe could configure indexer to treat each line as a full string, and hence add start/end tokens for each ‘line’ of the input text. Kinda like the MultiLine flag in regular expressions.

btw, don’t want to index tags as one word, as still want to be able to do part word matches. (eg match any tag with the word ‘blackpool’ int it, so it DOES match ‘blackpool tower’)

THe only concession I can think is to use fake single words. Eg index as tag35312 tag37013, then can search for they keyword tag37013, without worrying about matching parts of other tags.
(just then don’t get the morphology on the keyword matching)

Thanks for any thoughts!

tomat · December 5, 2022, 3:01pm

could you try secondary index with a tags (MVA attributes) filtering?
as SI has built in optimizer and query hints
these allow to use only some of filter values for SI fetching, ie
MVA from such words Tesco Extra Superstore should match well for filter

... WHERE tags IN ( APP_GET_INT_LIST_FROM_STRINGS ( 'Tesco Extra Superstore' ) )`

barryhunter · December 5, 2022, 3:27pm

Ah, I havent really tried columnar yet, (production instance is tool old - and am experiementing with 5.x via the helm chart, but havent got columnar working yet. I dont think it supported yet)

… but using an attribute query will work sometimes, as say have a tag_ids multi-attribute that stores the ids of the tags.

But also need to support ‘OR’ queries. Sometimes it would work - if just a list of tags searching, but other times, its a OR with some other field.

MATCH(‘@title blackpool | @tags “SEP Blackpool _SEP”’)

Also use tag filters with MAYBE (to boost results that have certain tags)

tomat · December 5, 2022, 3:38pm

You need to install MCL but no need to use columnar storage as secondary index works either with row and columnar storages.

However there is no way to use MATCH OR FILTER for now but by using OR operator at match \ filter I not sure that any optimization could work as OR fetch both nodes and intersects them.

barryhunter · December 5, 2022, 4:11pm

My main issue is trying to avoid they keyword _SEP_ which matches every document, so does load a ‘node’ that contains all documents, only to discard most as dont contain other words from the

Ideally would be some way to index a ‘multiple value’ string, as seperate fields or so
Ie instead of indexing _SEP_ at:Martlesham Heath _SEP_ Tesco Extra Superstore _SEP_
woyuld index something like at:Martlesham Heath$ ^Tesco Extra Superstore
so that the document would match ^at Martlesham Heath and/or ^Tesco Extra Superstore$

Ie make ^Tesco match, even though it not the first word in teh field, its the first word of a tag.

_SEP_at Martlesham Heath_SEP_ . _SEP_Tesco Extra Superstore_SEP_

which would allow a query "_SEP_Tesco Extra Superstore_SEP_" as _SEP_Tesco node would be much more selective keyword

But then harder to match ‘tesco’ on its own, nor as part of phrase match.

barryhunter · December 5, 2022, 4:16pm

I suppose I’ve kinda answered my own question really, and should just do

CONCAT('_SEP_ ',GROUP_CONCAT(DISTINCT tag ORDER BY tag_id SEPARATOR ' _SEP_ '),' _SEP_',
' tag',GROUP_CONCAT(DISTINCT tag_id ORDER BY tag_id SEPARATOR ' tag')
) AS tags

Which would give me fake keywords (as well as the full text version, for partial matching, keep the sep just in case too)

Would create

_SEP_ at:Martlesham Heath _SEP_ Tesco Extra Superstore _SEP_ tag35312 tag37013

So can match

 MATCH(‘@title blackpool | @tags tag37013’)

tag37013 guarantees only to match the specific tag, but a much more selective query, and can still exploit the inverted keyword index.