search in html text and attributes

hi, suppose we have such html in each field of a table:

<sentence>
  <word lemma="they">They</word>
  <word lemma="stretch">stretched</word>
  <word lemma="their">their</word>
  <word lemma="hand">hands</word>
  <word lemma="toward">towards</word>
  <word lemma="different">different</word>
  <word lemma="food">foods</word>
</sentence>

We want to allow the user to search as in the example below:
word: foods & (lemma: hand | lemma: air)


In more advanced mode, we can allow the user to use the expression as in the example below:
word: "different foods" & (lemma: hand | lemma: air)


we want to highlight corresponding words in that html field
for example, such result:

<sentence>
  <word lemma="they">They</word>
  <word lemma="stretch">stretched</word>
  <word lemma="their">their</word>
  <word lemma="hand"><b>hands</b></word>
  <word lemma="toward">towards</word>
  <word lemma="different"><b>different</b></word>
  <word lemma="food"><b>foods</b></word>
</sentence>

What should we do for table (field) design, indexing and querying and highlighting?

You should parse the html, normalize it and insert it as a series of docs split into the fields you need:

  • word
  • lemma

Manticore can process your html using:

but it will still be the same field while what your are looking for looks like queries against multiple fields. Perhaps the ZONE operator could help, but it seems to conflict with the html_index_attrs:

mysql> drop table if exists t; create table t(f text) html_strip='1' html_index_attrs='word=lemma;'; insert into t values(1, '<sentence><word lemma="hand">hands</word></sentence>'); select * from t where match('hand');
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.01 sec)

--------------
create table t(f text) html_strip='1' html_index_attrs='word=lemma;'
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
insert into t values(1, '<sentence><word lemma="hand">hands</word></sentence>')
--------------

Query OK, 1 row affected (0.01 sec)

--------------
select * from t where match('hand')
--------------

+------+------------------------------------------------------+
| id   | f                                                    |
+------+------------------------------------------------------+
|    1 | <sentence><word lemma="hand">hands</word></sentence> |
+------+------------------------------------------------------+
1 row in set (0.00 sec)
--- 1 out of 1 results in 0ms ---
mysql> drop table if exists t; create table t(f text) html_strip='1' html_index_attrs='word=lemma;' index_zones='word'; insert into t values(1, '<sentence><word lemma="hand">hands</word></sentence>'); select * from t where match('hand');
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.01 sec)

--------------
create table t(f text) html_strip='1' html_index_attrs='word=lemma;' index_zones='word'
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
insert into t values(1, '<sentence><word lemma="hand">hands</word></sentence>')
--------------

Query OK, 1 row affected (0.00 sec)

--------------
select * from t where match('hand')
--------------

Empty set (0.00 sec)
--- 0 out of 0 results in 0ms ---

I am not saying that the data must be structured in html and be like this. I just want to say that we have a set of words where each word has its own lemma. The user may want to search for a phrase in words and may want to search in lemma. with the possibility of using & and | operators.
I would be grateful if you could tell me with what data structure this work can be done.

mysql> drop table if exists t; create table t(word text, lemma text); insert into t values(1, 'They stretched their hands towards different foods', 'they stretch their hand toward different food'); select highlight() from t where match('@word foods @lemma hand|air'); select highlight() from t where match('@word "different foods" @lemma hand | air');
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
create table t(word text, lemma text)
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
insert into t values(1, 'They stretched their hands towards different foods', 'they stretch their hand toward different food')
--------------

Query OK, 1 row affected (0.00 sec)

--------------
select highlight() from t where match('@word foods @lemma hand|air')
--------------

+------------------------------------------------------------------------------------------------------------------+
| highlight()                                                                                                      |
+------------------------------------------------------------------------------------------------------------------+
| They stretched their hands towards different <b>foods</b> | they stretch their <b>hand</b> toward different food |
+------------------------------------------------------------------------------------------------------------------+
1 row in set (0.01 sec)
--- 1 out of 1 results in 0ms ---

--------------
select highlight() from t where match('@word "different foods" @lemma hand | air')
--------------

+------------------------------------------------------------------------------------------------------------------+
| highlight()                                                                                                      |
+------------------------------------------------------------------------------------------------------------------+
| They stretched their hands towards <b>different foods</b> | they stretch their <b>hand</b> toward different food |
+------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
--- 1 out of 1 results in 0ms ---

Thank you very much for your reply.
But the word is highlighted separately and the lemma is also highlighted separately.
I want if a lemma is found, the corresponding word for that lemma will be highlighted.

For example, the answer should look like this:
They stretched their <b>hands</b> towards <b>different foods</b>

cat << 'EOF' > /tmp/wf
They > they
stretched > stretch
their > their
hands > hand
towards > toward
different > different
foods > food
EOF

➜  ~ mysql -P9306 -h0 -v -e "drop table if exists t; create table t(f text) wordforms='/tmp/wf'; insert into t values(1, 'They stretched their hands towards different foods'); select highlight() from t where match('foods hand|air'); select highlight() from t where match('\"different foods\" hand | air');"
--------------
drop table if exists t
--------------

--------------
create table t(f text) wordforms='/tmp/wf'
--------------

--------------
insert into t values(1, 'They stretched their hands towards different foods')
--------------

--------------
select highlight() from t where match('foods hand|air')
--------------

+------------------------------------------------------------------+
| highlight()                                                      |
+------------------------------------------------------------------+
| They stretched their <b>hands</b> towards different <b>foods</b> |
+------------------------------------------------------------------+
--------------
select highlight() from t where match('"different foods" hand | air')
--------------

+------------------------------------------------------------------+
| highlight()                                                      |
+------------------------------------------------------------------+
| They stretched their <b>hands</b> towards <b>different foods</b> |
+------------------------------------------------------------------+

Thank you.