How to filter on data indexed using "html_index_attrs" setting?

Hi,
I’m new to Manticore and I’m working on some research to find out if Manticore features will work for my project.
Basically, what I’m looking for is

  1. can I index the text between xml/html tags
  2. can I index certain custom attributes on those xml/html tags
  3. can I do a full text match query to fetch results based on data indexed
  4. can I filter using the indexed data from html_index_attrs

I was able to get the first three working. However, I can't find any documentation on how to filter results using the attributes listed under “html_index_attrs”.

For example:
CREATE TABLE test4(int_id int, doc_id int, content text, name string, title text indexed) index_zones = 'html, body, a' html_index_attrs = 'a=href,title; body=title' html_strip = '1' morphology = 'libstemmer_en, libstemmer_es';

insert into test4(int_id,doc_id,content,name,title) values (2407481, 96401, '<html><body title="This is Body 1"><a href="https://google.com" title="Google this">This is inside anchor.</a></body></html>', 'Google this name', 'Google this title');

insert into test4(int_id,doc_id,content,name,title) values (2407482, 96402, '<html><body title="This is Body 2"><a href="https://google.com" title="Google that">This is inside anchor.</a></body></html>', 'Google that name', 'Google that title');

insert into test4(int_id,doc_id,content,name,title) values (2407483, 96403, '<html><body title="This is Body 2"><a href="https://google.com" title="Google this and that">This is inside anchor.</a></body></html>', 'Google this and that name', 'Google this and that title');

As per above, what query syntax do I use to fetch me only those rows that have a body title of “This is Body 2”?

You do need to enable indexing of ‘zones’

https://manual.manticoresearch.com/Creating_an_index/NLP_and_tokenization/Advanced_HTML_tokenization#index_zones

But I think it only allows you to scope the query to a particular tag, not to a specific attribute. html_index_attrs just defines which attributes are indexed (because by default attributes are stripped).

So I think:

 ZONE:body This is Body 2

Would find the documents with ‘This is Body 2’ in the <body>, but the match could be in the content and/or the (retained) attributes, rather than specifically in the attribute(s).

It's possible (but I'm not sure) that the word ‘title’ is actually indexed as a word, so you could do

 ZONE:body "title This is Body 2"

for an exact phrase match, but frankly I'm not sure if that will work!

Hi @barryhunter,
Thanks for replying back!

As per the original example that I posted, I am enabling indexing of “zones” (index_zones = 'html, body, a').
But when I query using the ZONE operator, it just returns an empty result.

As per the documentation provided, setting the “html_index_attrs” option does index the attributes specified inside that option. I verified this by using the highlight option.

Just wondering if it’s possible to filter using those attribute values.

Sorry, I missed that you had already specified index_zones!

‘highlight()’ is not a good indication of what is indexed as such. It works from a ‘stored’ copy of the text (in the DocStore), not from the data in the actual ‘full text index’. The actual keywords index is separate.

That being said, I don't know then. Maybe the Zone operator is tuned for searching the content of tags, not the attributes.

Just to prove it, have you tried:

ZONE:a This is inside anchor

to see if you can query the text inside the anchor. (beyond a normal unspecified query)

Hi @barryhunter,
Yes, I’ve tried querying the text inside the anchor using the Zone operator.

But I’ve been unable to use the Zone operator to filter on the tag attributes. Like you said, maybe the Zone operator is limited to searching the content of the tags and not the attributes.
I’m wondering whether there is any other way of querying those attributes. What would be the purpose of the “html_index_attrs” option indexing the attributes if we cannot query on them?

Sorry, I don't have any other ideas.

As for “html_index_attrs”: you can query the attribute text, e.g. with just “This is Body 2”, but you can't query specifically the attribute. I.e. you would get matches even if the words appear in the content somewhere.

The words are still indexed, just not in a way that can tell the words from the attribute apart from the rest (other than by NOT matching the zone!).

Thanks @barryhunter, I really appreciate your help.
@staff,
Any suggestions as to what I can do for my use-case?

It looks like index_zones disables html_index_attrs and it seems wrong:

index_zones + html_index_attrs:

mysql> create table t(f text) html_strip='1' html_index_attrs='a=href' index_zones='a'; 

insert into t values(0, '<a href="href1" title="title1">a1</a> between <a href="href2">a2</a>'),(0,'<a href="href3">between</a>');

mysql> select * from t where match('href1');
Empty set (0.00 sec)

mysql> select * from t where match('ZONE:a a1');
+---------------------+----------------------------------------------------------------------+
| id                  | f                                                                    |
+---------------------+----------------------------------------------------------------------+
| 1514175226210943028 | <a href="href1" title="title1">a1</a> between <a href="href2">a2</a> |
+---------------------+----------------------------------------------------------------------+
1 row in set (0.01 sec)

mysql> select * from t where match('ZONE:a href2');
Empty set (0.00 sec)

mysql> select * from t where match('ZONE:a href1');
Empty set (0.00 sec)

I.e. there seems to be no way to find href1, href2.

Without index_zones:

mysql> create table t(f text) html_strip='1' html_index_attrs='a=href'; 

insert into t values(0, '<a href="href1" title="title1">a1</a> between <a href="href2">a2</a>'),(0,'<a href="href3">between</a>');

mysql> select * from t where match('href1');
+---------------------+----------------------------------------------------------------------+
| id                  | f                                                                    |
+---------------------+----------------------------------------------------------------------+
| 1514175226210943030 | <a href="href1" title="title1">a1</a> between <a href="href2">a2</a> |
+---------------------+----------------------------------------------------------------------+
1 row in set (0.00 sec)

It looks like a bug to me. I’ll discuss it with the team to make sure I’m not missing something.

In terms of the design: it was never intended that each attribute of each tag would be indexed as a separate full-text field that you could address individually. As Barry said, html_index_attrs just lets you keep the attributes you specify from being skipped during indexing.
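
Given that design, one possible workaround (not proposed by anyone in this thread, so treat it as an assumption) is to pull the attribute value out of the HTML before inserting and keep it in its own column, so it can be compared separately from the full-text content. A minimal Python sketch using the standard-library html.parser; the `body_title` column name and the `extract_attr` / `build_row` helpers are hypothetical and not part of Manticore:

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collects every value of one attribute on one tag."""

    def __init__(self, wanted_tag, wanted_attr):
        super().__init__()
        self.wanted_tag = wanted_tag
        self.wanted_attr = wanted_attr
        self.values = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names already lowercased.
        if tag != self.wanted_tag:
            return
        for name, value in attrs:
            if name == self.wanted_attr and value is not None:
                self.values.append(value)


def extract_attr(html, tag, attr):
    """Return all values of `attr` found on `tag` elements, in document order."""
    collector = AttrCollector(tag, attr)
    collector.feed(html)
    collector.close()
    return collector.values


def build_row(int_id, doc_id, content):
    """Build a row dict with a hypothetical extra `body_title` column."""
    titles = extract_attr(content, "body", "title")
    return {
        "int_id": int_id,
        "doc_id": doc_id,
        "content": content,
        # Assumption: only the first <body title="..."> matters; empty if absent.
        "body_title": titles[0] if titles else "",
    }
```

With the second sample document from the original question, `build_row` would set `body_title` to `This is Body 2`. Whether that value is then stored in a string attribute or a separate text field is a schema decision this sketch leaves open.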

Hi @Sergey,
Thanks for looking into this. It would be really beneficial to have html_index_attrs function as a filter rather than allowing for full-text matching on their values.

I like how you can filter using a combination of index_zones and the ZONE/ZONESPAN operators. Something similar to that maybe …

Following up on my earlier note that this looks like a bug: we discussed it. It could be by design and not a bug, and it requires deeper research. There is now an issue about it: "index_zones vs html_index_attrs", Issue #520 in manticoresoftware/manticoresearch on GitHub.