Using wordforms in addition to prefix search

wordforms · March 23, 2022, 7:55pm

I’m using Sphinx 2.2.11 (haven’t yet migrated to Manticore), and I’m having issues with using the wordforms feature in addition to a prefix search (i.e., term expansion).

In my specific use case, I apply term expansion to all query terms. In other words, I append a star (asterisk) to the end of every search term as part of my query parsing. This happens before the query is sent to Sphinx.

Here’s an example:

Let’s say my wordforms file has the following entry:

US > United States

My document has the term:

USA

My users, knowing that all terms are expanded automatically, will assume that a search for US will match USA. However, this doesn’t happen.

Let’s look at how the query is parsed:

Query:
US

My custom query parser will look up the term in the wordforms file. It finds the term “US”, takes the right-hand term “United States”, and transforms it into the following query to send to Sphinx:

(United States*|United States)

Based on how Sphinx works, this query will not match my document. Only the right-hand term in the wordforms file is used in the term expansion, because that is the term substituted for US in both the document and the query. But my users expect that their query will be transformed into:

US*

Is there a way I can take advantage of the wordforms feature while also allowing term expansion to function on the original search term, and not only on the right-hand wordforms entry?

Thanks a lot for any tips!

Sergey · March 24, 2022, 4:05pm

I think the only way to do this is to map a wordform to itself. Neither Sphinx nor Manticore handles the left-hand side of a wordforms entry the way you want, because it goes against the concept: once a left-hand entry is found in a document or search query, it is replaced with the right-hand entry. After that, the tokens from the left-hand side are completely forgotten, so they can’t serve as a basis for infix search, etc.
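To make the replacement step concrete, here is a toy model in Python. It is only a sketch of the idea described above, not how Sphinx or Manticore is implemented: the function names (parse_wordforms, apply_wordforms, prefix_match) are hypothetical, tokenization is a plain whitespace split, and it does not reproduce how expand_keywords interacts with wordforms in the real engine.

```python
def parse_wordforms(lines):
    """Parse 'left > right' lines into {left_tokens: right_tokens}."""
    rules = {}
    for line in lines:
        left, right = line.lower().split(">", 1)
        rules[tuple(left.split())] = right.split()
    return rules


def apply_wordforms(text, rules):
    """Replace every left-hand token sequence with its right-hand tokens."""
    tokens = text.lower().split()
    out, i = [], 0
    while i < len(tokens):
        # Try longer left-hand entries first so multi-word forms win.
        for left in sorted(rules, key=len, reverse=True):
            if tuple(tokens[i:i + len(left)]) == left:
                out.extend(rules[left])
                i += len(left)
                break
        else:
            # No entry starts at this position: keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


def prefix_match(query, document, rules):
    """True if every normalized query token is a prefix of some document token."""
    query_tokens = apply_wordforms(query, rules)
    document_tokens = apply_wordforms(document, rules)
    return all(
        any(doc_token.startswith(q) for doc_token in document_tokens)
        for q in query_tokens
    )


# The mapping from the original question.
rules = parse_wordforms(["US > United States"])
print(apply_wordforms("US", rules))      # ['united', 'states']
print(apply_wordforms("USA", rules))     # ['usa']
print(prefix_match("US", "USA", rules))  # False
```

In this model the query token "us" no longer exists after normalization, so nothing is left to expand into a prefix that could reach "usa".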

What I would try first is to change the mappings like this:

➜  ~ cat wordforms
USA > US
United States > US
United States of America > US

and then it can work like this:

➜  ~ mysql -P9306 -h0 -v
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 164
Server version: 4.2.1 d9f4b9c96@220118 dev git branch master...origin/master

mysql> drop table if exists t; create table t(f text) wordforms='/Users/snikolaev/wordforms' expand_keywords='1' min_infix_len='2'; insert into t(f) values('USA'),('US'),('United States'),('United States of America'),('usdt'); select * from t; select * from t where match('us'); select * from t where match('usa'); select * from t where match('united states');
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.01 sec)

--------------
create table t(f text) wordforms='/Users/snikolaev/wordforms' expand_keywords='1' min_infix_len='2'
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
insert into t(f) values('USA'),('US'),('United States'),('United States of America'),('usdt')
--------------

Query OK, 5 rows affected (0.00 sec)

--------------
select * from t
--------------

+---------------------+--------------------------+
| id                  | f                        |
+---------------------+--------------------------+
| 1514698463906890074 | USA                      |
| 1514698463906890075 | US                       |
| 1514698463906890076 | United States            |
| 1514698463906890077 | United States of America |
| 1514698463906890078 | usdt                     |
+---------------------+--------------------------+
5 rows in set (0.00 sec)

--------------
select * from t where match('us')
--------------

+---------------------+--------------------------+
| id                  | f                        |
+---------------------+--------------------------+
| 1514698463906890078 | usdt                     |
| 1514698463906890074 | USA                      |
| 1514698463906890075 | US                       |
| 1514698463906890076 | United States            |
| 1514698463906890077 | United States of America |
+---------------------+--------------------------+
5 rows in set (0.00 sec)

--------------
select * from t where match('usa')
--------------

+---------------------+--------------------------+
| id                  | f                        |
+---------------------+--------------------------+
| 1514698463906890074 | USA                      |
| 1514698463906890075 | US                       |
| 1514698463906890076 | United States            |
| 1514698463906890077 | United States of America |
+---------------------+--------------------------+
4 rows in set (0.00 sec)

--------------
select * from t where match('united states')
--------------

+---------------------+--------------------------+
| id                  | f                        |
+---------------------+--------------------------+
| 1514698463906890078 | usdt                     |
| 1514698463906890074 | USA                      |
| 1514698463906890075 | US                       |
| 1514698463906890076 | United States            |
| 1514698463906890077 | United States of America |
+---------------------+--------------------------+
5 rows in set (0.00 sec)

wordforms · March 24, 2022, 4:36pm

Hi Sergey,

Thank you very much for your time and for the comprehensive answer (as always).

Your suggestion certainly does solve the issue with the specific example I used, but unfortunately I don’t see how I can generalize it.

If my users expect that tokens in their original query will always be the basis for a prefix search, then I don’t believe I can structure the wordforms file to achieve this. I can’t put all terms on the right-hand side, as this wouldn’t be allowed. So perhaps wordforms simply don’t work in my use case?

What do you think about abandoning the wordforms feature entirely and instead using the wordforms file as a basis for query expansion handled by my own code? Basically, I would tokenize the user’s query, search the wordforms file for matches, and then append those to the query with an OR operator.

Or is there a way I can handle this more gracefully with Sphinx/Manticore directly?

Thank you very much again for your help!
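
A minimal sketch of the client-side expansion proposed above, in Python. It assumes the wordforms directive has been removed from the index configuration, so the engine no longer substitutes terms itself. The names load_wordforms and expand_query are hypothetical, only the "left > right" line format shown in this thread is parsed, only single-token left-hand entries are looked up, and query tokens are not escaped for the extended query syntax.

```python
def load_wordforms(lines):
    """Parse 'left > right' lines into {lowercased left: [right, ...]}."""
    wordforms = {}
    for line in lines:
        if ">" not in line:
            continue  # skip blank or malformed lines
        left, right = (part.strip() for part in line.split(">", 1))
        if not left or not right:
            continue
        targets = wordforms.setdefault(left.lower(), [])
        if right not in targets:
            targets.append(right)
    return wordforms


def expand_query(query, wordforms):
    """Append '*' to every token and OR in any wordforms match as a phrase."""
    parts = []
    for token in query.split():
        # The user's original token always stays, with prefix expansion.
        alternatives = [token + "*"]
        for target in wordforms.get(token.lower(), []):
            alternatives.append('"' + target + '"')
        if len(alternatives) == 1:
            parts.append(alternatives[0])
        else:
            parts.append("(" + " | ".join(alternatives) + ")")
    return " ".join(parts)


wordforms = load_wordforms(["US > United States"])
print(expand_query("US", wordforms))          # (US* | "United States")
print(expand_query("us economy", wordforms))  # (us* | "United States") economy*
```

Because the original token is kept and expanded, a search for US would be sent as US* OR the phrase "United States", which is the behavior the users expect.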