I have a bunch of documents that have ‘tags’: multi-word text labels, and there can be multiple tags per document.
At the moment I index them in a full-text field (also stored as a string attribute, but that will change to docstore at some point!)
| id | tags | tag_ids |
| --- | --- | --- |
| 1192526 | _SEP_ listed building _SEP_ | 581 |
| 1624955 | _SEP_ crescent _SEP_ | 8434 |
| 230677 | _SEP_ near:Wickham Market _SEP_ taken from:Loudham Hall Road _SEP_ | 28270,28595 |
| 753854 | _SEP_ Snow _SEP_ | 1678 |
| 1063848 | _SEP_ dual carriageway _SEP_ | 407 |
| 1133884 | _SEP_ at:Martlesham Heath _SEP_ Tesco Extra Superstore _SEP_ | 35312,37013 |
| 1134128 | _SEP_ Blackpool Tower _SEP_ | 54725 |
| 1378565 | _SEP_ park _SEP_ floral display _SEP_ | 672,2571 |
I also put the tag IDs into an MVA attribute (tag_ids above).
This works, in that a filter like match('@tags "_SEP_ Tesco Extra Superstore _SEP_"') matches only documents with that exact tag.
(I use this separator because I DON'T want, say, a search for [Tesco Extra] to match [Tesco Extra Superstore]; they are different tags!)
But it's a bit slow, because _SEP_ exists in every document, and so searchd is loading the doc-list for all documents.
Of course I can do where ANY(tag_ids) = 37013, which works but can end up even slower, particularly if the query is complex (i.e. there are other keyword matches, or I want to combine with OR etc.!)
Field start/end modifiers would almost be perfect, but there are multiple tags per image, so only the first and last tag sit at a field boundary.
The index is built with GROUP_CONCAT:
CONCAT('_SEP_ ',GROUP_CONCAT(DISTINCT tag ORDER BY tag_id SEPARATOR ' _SEP_ '),' _SEP_') AS tags
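To make the behaviour concrete for anyone reading along, here is a rough Python sketch of what that GROUP_CONCAT produces and why the separator stops [Tesco Extra] matching [Tesco Extra Superstore]. The helper names (`build_tags_field`, `phrase_matches`, `exact_tag_phrase`) are made up for illustration, and the matcher is only a naive whitespace-token phrase check, not what searchd actually does internally.

```python
SEP = "_SEP_"

def build_tags_field(tags):
    """Mimic CONCAT('_SEP_ ', GROUP_CONCAT(... SEPARATOR ' _SEP_ '), ' _SEP_').

    Assumes `tags` is already de-duplicated and ordered by tag_id.
    """
    return f"{SEP} " + f" {SEP} ".join(tags) + f" {SEP}"

def tokenize(text):
    # Naive stand-in for the real tokenizer: lowercase, split on whitespace.
    return text.lower().split()

def phrase_matches(field, phrase):
    """True if the tokens of `phrase` occur consecutively in `field`."""
    hay, needle = tokenize(field), tokenize(phrase)
    n = len(needle)
    return n > 0 and any(hay[i:i + n] == needle for i in range(len(hay) - n + 1))

def exact_tag_phrase(tag):
    # Wrap the tag in separators so the phrase can only match a whole tag.
    return f"{SEP} {tag} {SEP}"

field = build_tags_field(["at:Martlesham Heath", "Tesco Extra Superstore"])
print(field)
print(phrase_matches(field, exact_tag_phrase("Tesco Extra Superstore")))  # whole tag
print(phrase_matches(field, exact_tag_phrase("Tesco Extra")))             # not a whole tag
print(phrase_matches(field, "Tesco Extra"))                               # part-tag match
```

The last call shows that a plain phrase without separators still does part-tag matching, which is the behaviour I want to keep.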
As I understand it, start/end modifiers work by including extra words in the keyword index, with extra control characters. It seems like I could ‘add’ them manually to the input string.
So I could do something like SEPARATOR '\\b . \\b' … so that the \b would be added as an extra field-end modifier to the last word in each tag, and also as a field-start modifier to the first word in the next tag. (The actual start/end of the field would get the real one.)
I don’t think this will get through charset_table, though (or maybe all control chars are treated as separators).
Or maybe I could configure indexer to treat each line as a full string, and hence add start/end tokens for each ‘line’ of the input text. Kinda like the MultiLine flag in regular expressions.
btw, I don’t want to index tags as one word, as I still want to be able to do part-word matches (e.g. match any tag with the word ‘blackpool’ in it, so it DOES match ‘blackpool tower’).
The only concession I can think of is to use fake single words. E.g. index as tag35312 tag37013, then I can search for the keyword tag37013 without worrying about matching parts of other tags.
(It's just that I then don’t get the morphology on the keyword matching.)
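The fake-keyword idea could be sketched like this in Python. It is only an illustration: the function names and the `tag_keys` field name are hypothetical, and it assumes the charset_table keeps letters and digits together in one token, so that `tag37013` is indexed as a single keyword.

```python
def tag_id_keywords(tag_ids, prefix="tag"):
    """Turn tag ids into fake single-word keywords, e.g. 37013 -> 'tag37013'."""
    # int() rejects anything that is not a plain integer id.
    return " ".join(f"{prefix}{int(tag_id)}" for tag_id in tag_ids)

def tag_id_query(tag_id, field="tag_keys", prefix="tag"):
    """Build a full-text query string for one tag id.

    `tag_keys` is a hypothetical extra full-text field holding the fake keywords.
    """
    return f"@{field} {prefix}{int(tag_id)}"

print(tag_id_keywords([35312, 37013]))
print(tag_id_query(37013))
```

The existing tags field would stay alongside this one for part-word matches such as ‘blackpool’; the fake keywords would only be used for exact-tag lookups.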
Thanks for any thoughts!