Indexing compound word parts

Hi,
I’m currently evaluating Manticore for my company. As far as I’ve gotten, it seems pretty good and straightforward. But I got stuck with the simple add-document example.
In the example, a document is added whose title contains the word ‘microfiber’. But if I search for ‘micro’ or ‘fiber’ I get no results. All I found on this matter is the wordbreaker app, but that is a console app (which might still be useful).
So what is the standard way to get compound words indexed with all their parts? Do I have to create my own dict.txt? Any hint or help is appreciated.

You can enable infixes in the index definition and then use wildcards when searching.
Normally you search for whole words, but you can use wildcards or any of the extended query syntax to get more results.
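
For example, a minimal sketch in plain SQL (the products table and sample row are made up for illustration):

CREATE TABLE products(title text) min_infix_len='3';
INSERT INTO products(title) VALUES('microfiber bathrobe');
SELECT * FROM products WHERE MATCH('*fiber*');

The last query should find the ‘microfiber’ document, since min_infix_len='3' enables infix searching on substrings of at least 3 characters.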

Thanks for the reply; setting min_infix_len to 3 did the job for now.

Is it possible to go further and do this in reverse? Meaning, if I search for “microfiber”, can I also get all results with “micro” and “fiber” (obviously ranked lower than “microfiber”, but still viable in my opinion)?

The obvious solution would be to use wordbreaker to parse the user’s input and generate a custom query, so in our “microfiber” case, it would produce something like:

SELECT * FROM {index} WHERE MATCH('microfiber|micro|fiber|micro*|*micro|fiber*|*fiber');

Or is there some way to automate this? This becomes even more pronounced in languages like German or Swedish, where half the dictionary is built out of smaller words put together (sometimes changing the form of the words, but I believe the morphology setting takes care of that).

Well, wordbreaker IS the way to automate it.

That’s kind of the point of the app: you can run a word through it, and it will tell you whether it thinks the word can be split.

OK, it’s not ‘ideal’ since it’s a command-line app, but most programming languages can execute binaries, so it serves as a ‘minimum viable’ API.

I.e. what you call ‘reverse’ is the intended use for wordbreaker! For the ‘forward’ direction, infix searching is the solution.
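
For instance, assuming wordbreaker has split ‘microfiber’ into ‘micro’ and ‘fiber’, the application could build the ‘reverse’ query itself (the products table here is hypothetical):

SELECT * FROM products WHERE MATCH('microfiber | ("micro fiber")');

The phrase operator keeps the split parts adjacent, while the first alternative still matches the original compound.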

OK, everything makes sense so far. We just have to implement these things ourselves. I’m just wondering if it might be worth it to add a config option that does infixing and “outfixing” (using wordbreaker to derive more terms from compound words) automatically? So that if we do:

SELECT * FROM {index} WHERE MATCH('micro fiber bathmat');

manticore would first give us the standard matches (exact phrase, exact words), then use wordbreaker to also look for bath and mat, and then use infixing to look for micro*, fiber* and bathmat*, then *micro, *fiber and *bathmat, and lastly *micro*, *fiber* and *bathmat*.
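
As a sketch, such an auto-generated query might end up looking something like this (purely hypothetical; how each tier would be ranked relative to the previous one is a separate question):

SELECT * FROM products WHERE MATCH('"micro fiber bathmat" | (micro fiber bathmat) | (micro fiber bath mat) | (micro* fiber* bathmat*) | (*micro* *fiber* *bathmat*)');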

It probably wouldn’t make sense to do this for all searches, but I imagine it would be quite important once we’ve reached the end of the normal results. For example, maybe our dataset has zero results for “bathmat” but tons for “bath mat”, so when we get 0 results for “micro fiber bathmat”, we could set a flag or add a parameter to tell manticore to go to the next level and automatically do some fuzzy searching.

Or do you think this behavior is something that really should be left up to each developer since each case is different?

In general I suppose it would be nice if manticore had this ‘auto splitting’ as a built-in function, akin to the ‘expand_keywords’ option (which already does the infix/prefix expansion you mention).
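
For reference, expand_keywords can be enabled as a table setting (a sketch; the table is made up):

CREATE TABLE products(title text) min_infix_len='3' expand_keywords='1';

With it on, a query term like fiber is internally expanded along the lines of ( fiber | *fiber* | =fiber ), depending on which other settings (infixes, index_exact_words) are enabled, so the stars don’t have to be typed manually.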

Manticore Search uses GitHub to track feature requests, so you can make your request there.

(It could even do ‘joining’, e.g. MATCH('micro fiber') could automatically match “microfiber” too, instead of relying only on ‘micro fiber’ to match.)

…but yes, it’s something that you can implement yourself. A benefit of doing the splitting/joining externally is that you have control over the ‘wordlists’ used. While just using the output of dumpdict might be enough in most cases, having an intermediate data file means you could potentially tweak the wordlists to deal with edge cases.


I’m not sure it would be easy to use exactly the same method as in wordbreaker, since:

  1. for proper functioning we would need to apply it not only at search time but during indexation too
  2. the wordbreaker method is based on a ready dictionary
  3. but during indexation the dictionary is not ready yet, so we would first need to prepare it and then do some kind of re-tokenization; we could call those the training and tokenization phases.

I.e. it should be possible, but it would introduce a significant change in the design of indexation.

Also, since wordbreaker was made, other techniques have appeared, e.g. GitHub - google/sentencepiece: Unsupervised text tokenizer for Neural Network-based text generation, GitHub - VKCOM/YouTokenToMe: Unsupervised text tokenizer focused on computational efficiency, the Hugging Face tokenizers, etc. They all also require training, but may be more optimal performance-wise since they don’t require the whole dictionary.


for proper functioning we would need to apply it not only at search time but during indexation too

I would think it’s only needed at query time, or at indexation, but not both. If it’s done during indexing, then the queries would match anyway; or do it during the query, because the index doesn’t have the words split.

the wordbreaker method is based on a ready dictionary

By doing it at query time, in theory it could work directly from the built-in dictionary (not sure if dict=crc is possible anymore).

Oh, you mean that:

  • we would add a new setting expand_keywords=wordbreaker
  • then, when we get the query microfiber, we would use the wordbreaker algorithm, which would split it into micro and fiber, and we would expand the query microfiber to microfiber|("micro fiber")

Yes, indeed, it may work out. Not sure why I thought it wouldn’t yesterday :man_facepalming: I just missed that we can do the expansion at query time. I’ll discuss it with the team.
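
To illustrate the proposal end to end (entirely hypothetical, the setting does not exist yet): with a table created as

CREATE TABLE products(title text) expand_keywords='wordbreaker';

the user query

SELECT * FROM products WHERE MATCH('microfiber');

would internally be executed as if it were

SELECT * FROM products WHERE MATCH('microfiber | ("micro fiber")');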
