No spaces between some block elements when highlighting with snippet_boundary set to sentence or paragraph

Perrschy · October 20, 2023, 9:22am

I use html_strip=1 and index_sp=1 in my RT index. I retrieve search results using the highlight() function like this:

->highlight([], [
 'around'=>7,
 'snippet_boundary'=>'sentence', // or paragraph
 'before_match'=>'<strong>',
 'after_match'=>'</strong>',
])

If the document contains list items or headings (e.g. li, h2), which often do not end with a punctuation mark, and their tags are stripped (because html_strip=1), no space is inserted between the text of adjacent elements in the highlight. If snippet_boundary is not set (the default), the spaces are present as expected.

Example with snippet_boundary = sentence:

<ul>
  <li>Item 1</li>
  <li>Item 2</li>
</ul>

When searching for “item”, this results in a highlight such as:
“Item 1Item 2” (no space between “1” and “Item”)

When snippet_boundary is not set, searching for “item” results in this highlight instead:
"Item 1 Item 2 " (note the space between “1” and “Item”)

According to the Manticore Search Manual (Creating a table > NLP and tokenization > Advanced HTML tokenization),
it should insert spaces between block elements such as li and h2:

“Paragraph boundaries are detected at every block-level HTML tag, including: ADDRESS, BLOCKQUOTE, CAPTION, CENTER, DD, DIV, DL, DT, H1, H2, H3, H4, H5, LI, MENU, OL, P, PRE, TABLE, TBODY, TD, TFOOT, TH, THEAD, TR, and UL.”

How can I fix this?

tomat · October 22, 2023, 8:21pm

It would be better to create a ticket on GitHub and include a reproducible example there, i.e. the daemon config and a query that reproduces the issue locally.
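
A reproducible example of the kind suggested above could be sketched roughly as follows, using the PHP client that the question's snippet appears to come from. The index name highlight_repro, the field name content, the host/port, and the helper highlightOptions() are all placeholders made up for illustration, and the script assumes the manticoresearch-php package is installed via Composer and a daemon is listening locally. It is untested against a live server, so treat it as a starting point rather than a verified reproduction.

```php
<?php
// Repro sketch. Index name, field name, host and port are placeholders.

/**
 * Hypothetical helper: the highlight options from the question.
 * Passing null leaves snippet_boundary unset (the default behaviour).
 */
function highlightOptions(?string $boundary): array
{
    $options = [
        'around' => 7,
        'before_match' => '<strong>',
        'after_match' => '</strong>',
    ];
    if ($boundary !== null) {
        $options['snippet_boundary'] = $boundary;
    }
    return $options;
}

// Load the Composer autoloader if it is present next to this script.
$autoload = __DIR__ . '/vendor/autoload.php';
if (is_file($autoload)) {
    require_once $autoload;
}

// Only talk to the daemon when the client library is actually available.
if (class_exists('\Manticoresearch\Client')) {
    $client = new \Manticoresearch\Client(['host' => '127.0.0.1', 'port' => 9308]);
    $index = $client->index('highlight_repro');

    // Start from a clean RT index with the settings from the question.
    $index->drop(true);
    $index->create(
        ['content' => ['type' => 'text']],
        ['html_strip' => '1', 'index_sp' => '1']
    );
    $index->addDocument(
        ['content' => "<ul>\n  <li>Item 1</li>\n  <li>Item 2</li>\n</ul>"],
        1
    );

    // Run the same query with snippet_boundary unset, sentence, and paragraph.
    foreach ([null, 'sentence', 'paragraph'] as $boundary) {
        $results = $index->search('item')
            ->highlight([], highlightOptions($boundary))
            ->get();
        foreach ($results as $hit) {
            echo ($boundary ?? 'default') . ': '
                . json_encode($hit->getHighlight()) . PHP_EOL;
        }
    }
}
```

Printing the three highlights next to each other makes it easy to paste the differing output into the ticket together with the Manticore version and the table settings.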