Problem indexing a term with multiple blend_chars

wordforms · February 2, 2021, 12:30am

I’m using Sphinx 2.2.11 and am having trouble with how Sphinx (and probably also Manticore) indexes terms that contain more than one instance of a blend character.

For example, I have the hyphen and period set as blend_chars:

blend_chars = ., -

Let’s say I have a term in the database as follows:

part1-part2.part3

I would expect that Sphinx would index this term in all possible combinations for each blend_char. For example:

Variant 1: part1-part2.part3
Variant 2: part1 part2.part3
Variant 3: part1-part2 part3
Variant 4: part1 part2 part3

However, that doesn’t seem to be the case.

If I search for:

part2.part3

I don’t find the record containing the term part1-part2.part3.

However, if I search for:

part2 part3

OR

part1 part2 part3

I do find the record.

This suggests to me that Sphinx does not index all possible combinations of the blend_chars. Instead, it appears to index just two versions:

part1-part2.part3 (with blend_chars intact)
part1 part2 part3 (with blend_chars ignored, treated as whitespace)

The documentation suggests this is true, especially in the entry for blend_mode:

https://manual.manticoresearch.com/Creating_an_index/NLP_and_tokenization/Low-level_tokenization#blend_mode

To quote:

By default, tokens that mix blended and non-blended characters get indexed in there [sic] entirety. For instance, when both at-sign and an exclamation are in blend_chars , “@dude!” will get result in two tokens indexed: “@dude!” (with all the blended characters) and “dude” (without any). Therefore “@dude” query will not match it.

So, that’s bad news indeed. But it confirms what I’m seeing.

I explored using blend_mode to fix this in the hope that it would create multiple tokens for each term. However, it seems to help only in situations where the blend_char is at the beginning or end of the term (as in the examples in the docs). In my example, however, the blend_chars are in the middle of the search term, so it doesn’t help to trim them.

Can anyone confirm that they are seeing the same behavior? And can anyone suggest tips on how to fix or work around it?

Thanks very much!

Sergey · February 2, 2021, 9:21am

When you have blend_chars = ., - and search for part2.part3 or part1-part2, Sphinx and Manticore both keep each of those as a single token; they don’t convert them to part2 AND part3 or part1 AND part2.

BUT when you index part1-part2.part3, it generates 4 tokens: part1-part2.part3, part1, part2 and part3. That’s why you can’t find the record with either part1-part2 or part2.part3.

The solution is to not use blended chars in your query. If you want to automate this, you can run CALL KEYWORDS before your search query to see how the text would be tokenized at indexing time, and then use the results to modify your query, e.g.:

mysql> call keywords('part1-part2.part3', 'blend');
+------+-------------------+-------------------+
| qpos | tokenized         | normalized        |
+------+-------------------+-------------------+
| 1    | part1-part2.part3 | part1-part2.part3 |
| 1    | part1             | part1             |
| 2    | part2             | part2             |
| 3    | part3             | part3             |
+------+-------------------+-------------------+
4 rows in set (0.00 sec)
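
To make the mismatch concrete, here is a minimal Python sketch of the behaviour described above. It is a simplified model and not the real tokenizer: it assumes that index time stores the whole blended token plus each separated part, and that a query term containing blend chars stays a single token. The function names are hypothetical.

```python
import re

BLEND_CHARS = ".-"  # mirrors blend_chars = ., - from this thread


def index_tokens(term, blend_chars=BLEND_CHARS):
    """Approximate the tokens stored for one whitespace-free term.

    Simplified model: the full blended token, followed by the parts
    obtained when blend chars are treated as separators.
    """
    parts = [p for p in re.split("[" + re.escape(blend_chars) + "]+", term) if p]
    if parts == [term]:
        # No blend chars in the term, so there is only one token.
        return [term]
    return [term] + parts


def query_token_matches(query_token, indexed_term):
    """A query token is kept whole, so it must equal a stored token."""
    return query_token in index_tokens(indexed_term)


print(index_tokens("part1-part2.part3"))
print(query_token_matches("part2.part3", "part1-part2.part3"))  # partial blended term
print(query_token_matches("part2", "part1-part2.part3"))        # plain part
```

Under this model, part2.part3 is not among the stored tokens, which is consistent with the CALL KEYWORDS output above.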

wordforms · February 2, 2021, 12:51pm

Thank you very much for your quick reply, Sergey. I now understand a bit more about how the tokenizer works.

You suggest that I need to make some changes to queries that contain blend_chars. However, if I simply strip out the blended characters, then there are many cases where this will substantially reduce the quality of the search results.

Let me suggest another example. My database includes inventory numbers in the text, and I want those to be searchable. One of the records has the number 123-456.7. I would like my users to be able to find this record if they search for:

456.7

However, if I remove the blend character and transform the query from 456.7 to 456 7, then my search results will include all records with the terms 456 AND 7, no matter where they appear in the document. That will surely return hundreds of irrelevant results, because these terms are common.

In my application, if someone uses a blended character in the query, then it should exist that way in the results.

I’m afraid I don’t see a way to modify my queries so that I can do what I need: find partial matches involving blend_chars, while also getting highly relevant results.

Do you know how I can modify my queries to do this?

Thank you again for your help. I really appreciate it!

Sergey · February 2, 2021, 1:19pm

Will the phrase operator be helpful in your case, i.e. not 456 7, but "456 7"?

Another thing you can do is write (or hire someone to write) your own index token filter plugin that converts x-y.z to what you want, both at indexing time and at search time (with the help of SELECT ... OPTION token_filter='plugin.so:query'). Then you can fully customize the blend_chars behaviour.

wordforms · February 2, 2021, 4:46pm

Thank you very much again, Sergey.

Will the phrase operator be helpful in your case, i.e. not 456 7, but "456 7"?

Yes indeed, I think that works correctly.

I already have a rather complex query pre-parser in my application, so I just need to see if I can work in this change without causing any unexpected behavior.

My procedure is as follows, and I’d appreciate it if you could have a quick look to see if you can spot any obvious problems with it.

1. Tokenize (break up) the original query on word boundaries (i.e., whitespace or any characters not appearing in my blend chars or charset_table).
2. Iterate through those tokens.
3. If a token contains a blended character, replace that character with a space using regex. Then put quotes around the new “phrase token.”
4. Re-assemble the tokens by appending them to each other, and finally submit them to Sphinx.
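
The steps above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: it splits on whitespace alone (a real pre-parser would also split on characters outside charset_table), it assumes the query contains no quotes or search operators yet, and the function name rewrite_query is hypothetical.

```python
import re

BLEND_CHARS = ".-"  # mirrors blend_chars = ., - from this thread


def rewrite_query(query, blend_chars=BLEND_CHARS):
    """Turn tokens with inner blend chars into quoted phrases."""
    blend_class = "[" + re.escape(blend_chars) + "]+"
    rewritten = []
    # Steps 1 and 2: split on whitespace (simplification) and iterate.
    for token in query.split():
        parts = [p for p in re.split(blend_class, token) if p]
        if len(parts) > 1:
            # Step 3: blend chars become spaces, and the result is quoted
            # so that it is searched as a phrase.
            rewritten.append('"' + " ".join(parts) + '"')
        else:
            # No blend char between two parts: leave the token unchanged.
            rewritten.append(token)
    # Step 4: re-assemble the query.
    return " ".join(rewritten)


print(rewrite_query("456.7"))
print(rewrite_query("blue part2.part3 widget"))
```

Tokens that only start or end with a blend char are left unchanged in this sketch, so they would need separate handling (for example with blend_mode, or by stripping them in the pre-parser).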

Does that sound reasonable?

Thank you again for your insight! You are obviously a master when it comes to knowing how Sphinx works.

Sergey · February 3, 2021, 12:34pm

Your query preparation pipeline looks good to me. Just one note:

Then put quotes around the new “phrase token.”

Not sure what happens if you have to combine a few “blended” tokens into a phrase, but you might benefit from using the quorum search operator https://mnt.cr/quorum to enable fuzzy search. But it’s up to you.
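
For reference, a quorum query is written as a quoted list of words followed by /N and matches documents that contain at least N of those words. A minimal sketch of building one in a pre-parser could look like this; the helper name quorum_query is hypothetical, and the threshold is something you would need to choose for your own data.

```python
def quorum_query(words, threshold):
    """Build a quorum expression such as "a b c"/2 from a list of words."""
    if not 1 <= threshold <= len(words):
        raise ValueError("threshold must be between 1 and the number of words")
    return '"' + " ".join(words) + '"/' + str(threshold)


print(quorum_query(["part1", "part2", "part3"], 2))
```

Note that quorum matching does not require the words to be adjacent or in order, so it is looser than the phrase operator and may bring back the irrelevant matches you were trying to avoid with inventory numbers.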

In general I would just do thorough functional testing to make sure it works as you wish. As long as you are worried mostly about matching and not ranking, it shouldn’t be a big deal.