I use html_strip=1 and index_sp=1 in my rt index. I retrieve search results using the highlight() function like this:
->highlight([], [
'around'=>7,
'snippet_boundary'=>'sentence', // or paragraph
'before_match'=>'<strong>',
'after_match'=>'</strong>',
])
Now, if the document contains list items or headings (e.g. li, h2), which often do not end with a punctuation mark, and their tags are stripped (because of html_strip=1), no space is inserted between those elements in the highlight. If snippet_boundary is not set (the default), it works as expected.
Example with snippet_boundary = sentence:
<ul>
<li>Item 1</li>
<li>Item 2</li>
</ul>
When searching for “item”, this results in a highlight such as:
“Item 1Item 2” (no space between “1” and “Item”)
When snippet_boundary is not set, the same search results in this highlight instead:
"Item 1 Item 2 " (note the space between “1” and “Item”)
According to the Manticore Search Manual (Creating a table > NLP and tokenization > Advanced HTML tokenization), block elements such as li and h2 are treated as boundaries, so I would expect a space to be inserted between them:
“Paragraph boundaries are detected at every block-level HTML tag, including: ADDRESS, BLOCKQUOTE, CAPTION, CENTER, DD, DIV, DL, DT, H1, H2, H3, H4, H5, LI, MENU, OL, P, PRE, TABLE, TBODY, TD, TFOOT, TH, THEAD, TR, and UL.”
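For reference, the example above should be reproducible over the SQL interface with something like the sketch below. The table name test and the field name content are placeholders rather than my real schema, the list is written on one line instead of several, and I am assuming the HIGHLIGHT() options map one-to-one to the options I pass through the PHP client.

```sql
-- Placeholder table with the same settings as my RT index.
CREATE TABLE test (content text) html_strip = '1' index_sp = '1';

-- The list from the example above, as a single document.
INSERT INTO test (id, content) VALUES
(1, '<ul><li>Item 1</li><li>Item 2</li></ul>');

-- With snippet_boundary set: this is the case where the items run together.
SELECT HIGHLIGHT({around=7, snippet_boundary='sentence',
                  before_match='<strong>', after_match='</strong>'})
FROM test WHERE MATCH('item');

-- Without snippet_boundary (default): this is the case where the space is kept.
SELECT HIGHLIGHT({around=7,
                  before_match='<strong>', after_match='</strong>'})
FROM test WHERE MATCH('item');
```

The only difference between the two queries is the snippet_boundary option, which is what seems to trigger the missing space.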
How can I fix this?