Giving extra weight for certain field strings

zaxxon_32 · September 27, 2020, 6:16pm

Is there any way to give extra ranking weight to a match if it contains a certain string?.

If there are multiple matching results which all contain the exact query string I would like to rank one of the rows higher than the others if a field contains a certain string.

For example, If I run the following querys I would like the United States to rank higher.

MySQL [(none)]> SELECT * FROM geolocations WHERE MATCH('@(country,state,county,city,state_code) "^san francisco$"/0.75') limit 10 OPTION field_weights=(country=10, state=9, county=4, city=2, state_code=9), ranker=SPH04;
+--------+--------------------+-------------------------+--------+---------------+--------------+------------+------------+-------------+
| id     | country            | state                   | county | city          | country_code | state_code | latitude   | longitude   |
+--------+--------------------+-------------------------+--------+---------------+--------------+------------+------------+-------------+
|  84493 | Dominican Republic | Provincia de El Seibo   |        | San Francisco | DO           | 8          |  18.850000 |  -69.233299 |
|  85583 | El Salvador        | Departamento de Morazan |        | San Francisco | SV           | MO         |  13.697800 |  -88.096497 |
|  85839 | Costa Rica         | Provincia de Heredia    |        | San Francisco | CR           | H          |  10.003500 |  -84.072098 |
|  86016 | Venezuela          | Zulia                   |        | San Francisco | VE           | V          |  10.650000 |  -71.699997 |
|  86498 | Colombia           | Cundinamarca            |        | San Francisco | CO           | CUN        |   4.977600 |  -74.292900 |
|  87487 | Argentina          | Cordoba                 |        | San Francisco | AR           | X          | -31.428600 |  -61.914299 |
|  88691 | Mexico             | Nayarit                 |        | San Francisco | MX           | NAY        |  20.902000 | -105.412804 |
| 106257 | United States      | California              |        | San Francisco | US           | CA         |  37.779400 | -122.417603 |
+--------+--------------------+-------------------------+--------+---------------+--------------+------------+------------+-------------+
8 rows in set (0.001 sec)

MySQL [(none)]> SELECT * FROM geolocations WHERE MATCH('@(country,state,county,city,state_code) "^san diego$"/0.75') limit 10 OPTION field_weights=(country=10, state=9, county=4, city=2, state_code=9), ranker=SPH04;
+--------+---------------+------------------------+--------+-----------+--------------+------------+-----------+-------------+
| id     | country       | state                  | county | city      | country_code | state_code | latitude  | longitude   |
+--------+---------------+------------------------+--------+-----------+--------------+------------+-----------+-------------+
|  86019 | Venezuela     | Carabobo               |        | San Diego | VE           | G          | 10.264800 |  -67.951698 |
|  86499 | Colombia      | Departamento del Cesar |        | San Diego | CO           | CES        | 10.339600 |  -73.181000 |
|  96370 | United States | Texas                  |        | San Diego | US           | TX         | 27.768700 |  -98.248299 |
| 106254 | United States | California             |        | San Diego | US           | CA         | 32.799400 | -117.168602 |
+--------+---------------+------------------------+--------+-----------+--------------+------------+-----------+-------------+
4 rows in set (0.005 sec)

Similarly, I would like Ireland to rank higher with this query.

MySQL [(none)]> SELECT * FROM geolocations WHERE MATCH('@(country,state,county,city,state_code) "^dublin$"/0.75') limit 10 OPTION field_weights=(country=10, state=9, county=4, city=2, state_code=9), ranker=SPH04;
+--------+---------------+----------------------+--------+--------+--------------+------------+-----------+-------------+
| id     | country       | state                | county | city   | country_code | state_code | latitude  | longitude   |
+--------+---------------+----------------------+--------+--------+--------------+------------+-----------+-------------+
|   4222 | Belarus       | Homyel’ Voblasc’     |        | Dublin | BY           | HO         | 51.735001 |   30.327999 |
|  46584 | Ireland       | Leinster             |        | Dublin | IE           | L          | 53.333099 |   -6.248900 |
|  90439 | United States | Georgia              |        | Dublin | US           | GA         | 32.458099 |  -82.936600 |
|  95912 | United States | Texas                |        | Dublin | US           | TX         | 32.090900 |  -98.345497 |
|  96666 | United States | Virginia             |        | Dublin | US           | VA         | 37.105701 |  -80.685303 |
| 101562 | United States | New Hampshire        |        | Dublin | US           | NH         | 42.895100 |  -72.053398 |
| 103230 | United States | Ohio                 |        | Dublin | US           | OH         | 40.110401 |  -83.113098 |
| 103772 | United States | Pennsylvania         |        | Dublin | US           | PA         | 40.368198 |  -75.201401 |
| 105810 | United States | California           |        | Dublin | US           | CA         | 37.720100 | -121.918999 |
| 109482 | Canada        | Ontario              |        | Dublin | CA           | ON         | 43.507900 |  -81.274696 |
+--------+---------------+----------------------+--------+--------+--------------+------------+-----------+-------------+
10 rows in set (0.004 sec)

barryhunter · September 28, 2020, 6:43pm

In your example, of United States to rank higher, how is that determined? What are you basing the decision on?

There is the MAYBE operator, so could do something like

MATCH('@(country,state,county,city,state_code) "^san francisco$"/0.75 MAYBE "United States" ')

for example.

But you would need to know in advance what country you want to rank highest. Its not clear how you doing that.

zaxxon_32 · September 28, 2020, 8:20pm

I think figuring out how to determine which country to rank higher is part of the problem.

The below query is what I would like to do although you can’t use MAYBE this way.

 SELECT * FROM geolocations WHERE MATCH('@(country,state,county,city,state_code) "^san francisco$"/0.75  MAYBE "United States", "united kingdom", "Australia"') limit 10 OPTION field_weights=(country=10, state=9, county=4, city=2, state_code=9), ranker=SPH04;

So, if there was a query like "san Francisco" it would rank the San Francisco from the United States first. If there was a San Francisco in the United Kingdom it would rank that second and Australia third. Then any other matches would be after those 3 countries because I haven’t given them any extra weight.

I only need to give extra weight to a few countries like the United States.

I was looking at creating a custom ranker for this but couldn’t figure it out.

I just need to give extra weight to rows containing the United States, United Kingdom, etc. If there are multiple equal matches from different countries.

barryhunter · September 28, 2020, 10:36pm

Well you can combine multiple MAYBE’s…

MATCH(' "^san francisco$"/0.75  MAYBE "United States" MAYBE "united kingdom" MAYBE "Australia" ')

I dont know where you got your ‘geolocations’ database, but maybe could get one with a ‘population’ value?

Then could sort by biggest population. In that example the San Fransciso in the US, probably has the biggest population. That would at least give you a basis for sorting?

zaxxon_32 · September 29, 2020, 9:51am

Combining multiple MAYBE’s doesn’t work.

MySQL [(none)]> SELECT * FROM geolocations WHERE MATCH('@(country,state,county,city,state_code) "^manchester$"/0.75 MAYBE "United kingdom" MAYBE "united states"') limit 10 OPTION field_weights=(country=10, state=9, county=4, city=2, state_code=9), ranker=SPH04;
+--------+----------------+----------+------------+-------------+--------------+------------+-----------+------------+
| id     | country        | state    | county     | city        | country_code | state_code | latitude  | longitude  |
+--------+----------------+----------+------------+-------------+--------------+------------+-----------+------------+
|  27100 | United Kingdom | England  | Manchester | Manchester  | GB           | ENG        | 53.466301 |  -2.134200 |
|  25482 | United Kingdom | England  | Manchester | Wythenshawe | GB           | ENG        | 53.400002 |  -2.266700 |
|  26008 | United Kingdom | England  | Manchester | Swinton     | GB           | ENG        | 53.500000 |  -2.350000 |
|  26816 | United Kingdom | England  | Manchester | Northenden  | GB           | ENG        | 53.400501 |  -2.244300 |
|  79639 | United Kingdom | England  | Manchester |             | GB           | ENG        | 53.504002 |  -2.211700 |
| 113629 | United Kingdom | England  | Manchester | Didsbury    | GB           | ENG        | 53.414200 |  -2.228000 |
| 113760 | United Kingdom | England  | Manchester | Failsworth  | GB           | ENG        | 53.505100 |  -2.161900 |
| 113762 | United Kingdom | England  | Manchester | Heywood     | GB           | ENG        | 53.588402 |  -2.218800 |
|  90558 | United States  | Georgia  |            | Manchester  | US           | GA         | 32.872101 | -84.631203 |
|  92004 | United States  | Kentucky |            | Manchester  | US           | KY         | 37.151100 | -83.779297 |
+--------+----------------+----------+------------+-------------+--------------+------------+-----------+------------+
10 rows in set (0.028 sec)

MySQL [(none)]> SELECT * FROM geolocations WHERE MATCH('@(country,state,county,city,state_code) "^manchester$"/0.75 MAYBE "United States" MAYBE "united kingdom"') limit 10 OPTION field_weights=(country=10, state=9, county=4, city=2, state_code=9), ranker=SPH04;
+--------+----------------+----------+------------+-------------+--------------+------------+-----------+------------+
| id     | country        | state    | county     | city        | country_code | state_code | latitude  | longitude  |
+--------+----------------+----------+------------+-------------+--------------+------------+-----------+------------+
|  27100 | United Kingdom | England  | Manchester | Manchester  | GB           | ENG        | 53.466301 |  -2.134200 |
|  25482 | United Kingdom | England  | Manchester | Wythenshawe | GB           | ENG        | 53.400002 |  -2.266700 |
|  26008 | United Kingdom | England  | Manchester | Swinton     | GB           | ENG        | 53.500000 |  -2.350000 |
|  26816 | United Kingdom | England  | Manchester | Northenden  | GB           | ENG        | 53.400501 |  -2.244300 |
|  79639 | United Kingdom | England  | Manchester |             | GB           | ENG        | 53.504002 |  -2.211700 |
| 113629 | United Kingdom | England  | Manchester | Didsbury    | GB           | ENG        | 53.414200 |  -2.228000 |
| 113760 | United Kingdom | England  | Manchester | Failsworth  | GB           | ENG        | 53.505100 |  -2.161900 |
| 113762 | United Kingdom | England  | Manchester | Heywood     | GB           | ENG        | 53.588402 |  -2.218800 |
|  90558 | United States  | Georgia  |            | Manchester  | US           | GA         | 32.872101 | -84.631203 |
|  92004 | United States  | Kentucky |            | Manchester  | US           | KY         | 37.151100 | -83.779297 |
+--------+----------------+----------+------------+-------------+--------------+------------+-----------+------------+
10 rows in set (0.034 sec)

Even if MAYBE did work it wouldn’t be suitable anyway because it’s more than 20 times slower.

I like the idea of adding a population column. It’ll be easy enough to add one.

What would the MATCH query look like for this?

If there was a query containing "manchester, united states", I wouldn’t want to use the population ranking because the country is already provided. But if the query only contained "manchester" then I need to rank them by population.

barryhunter · September 29, 2020, 10:31am

Well the main thing would be to put the population into a attribute, then can easily use it for ordering…

Couple of different ways of ‘using’ the ordering, but if just put is a numeric attribute, can literally just add ORDER BY population DESC to the query,

But can do compound ordering, if want to factor in other things…

SELECT id,WEIGHT()*population AS myorder FROM ... ORDER BY myorder DESC

That’s probably not very good ordering, the population will probablly ‘blow out’ the ranking way.

Might so something like

SELECT id,INTEGER(ln(WEIGHT())+ln(population)) AS myorder FROM ... ORDER BY myorder DESC

Could also use the population attribute inside a custom ranking expression (rather than the exprssion in SELECT part)

These are untested queries, just of the top of my head as an example.

zaxxon_32 · September 29, 2020, 3:17pm

I’ve added a population column and in manticore.conf assigned the attribute sql_attr_uint = population.

My query in manticore.conf is sql_query = SELECT id, country, state, county, city, country_code, state_code, latitude, longitude, population FROM geo_locations

When I run queries I get the required matches e.g largest population first.

MySQL [(none)]> SELECT * FROM geolocations WHERE MATCH('@(country,state,county,city,state_code) "^dublin$"/0.75') ORDER by population DESC limit 10 OPTION field_weights=(country=10, state=9, county=4, city=2, state_code=9), ranker=SPH04;
+--------+---------------+----------------------+--------+--------+--------------+------------+-----------+-------------+------------+
| id     | country       | state                | county | city   | country_code | state_code | latitude  | longitude   | population |
+--------+---------------+----------------------+--------+--------+--------------+------------+-----------+-------------+------------+
|  46584 | Ireland       | Leinster             |        | Dublin | IE           | L          | 53.333099 |   -6.248900 |    1024027 |
| 105810 | United States | California           |        | Dublin | US           | CA         | 37.720100 | -121.918999 |      57721 |
| 103230 | United States | Ohio                 |        | Dublin | US           | OH         | 40.110401 |  -83.113098 |      45098 |
|  90439 | United States | Georgia              |        | Dublin | US           | GA         | 32.458099 |  -82.936600 |      16197 |
|  95912 | United States | Texas                |        | Dublin | US           | TX         | 32.090900 |  -98.345497 |       3664 |
|  96666 | United States | Virginia             |        | Dublin | US           | VA         | 37.105701 |  -80.685303 |       2686 |
| 103772 | United States | Pennsylvania         |        | Dublin | US           | PA         | 40.368198 |  -75.201401 |       2169 |
| 101562 | United States | New Hampshire        |        | Dublin | US           | NH         | 42.895100 |  -72.053398 |       1572 |
|   4222 | Belarus       | Homyel’ Voblasc’     |        | Dublin | BY           | HO         | 51.735001 |   30.327999 |          0 |
| 109482 | Canada        | Ontario              |        | Dublin | CA           | ON         | 43.507900 |  -81.274696 |          0 |
+--------+---------------+----------------------+--------+--------+--------------+------------+-----------+-------------+------------+
10 rows in set (0.004 sec)

MySQL [(none)]> SELECT * FROM geolocations WHERE MATCH('@(country,state,county,city,state_code) "^manchester$"/0.75') ORDER by population DESC limit 10 OPTION field_weights=(country=10, state=9, county=4, city=2, state_code=9), ranker=SPH04;
+--------+----------------+---------------+------------+------------+--------------+------------+-----------+------------+------------+
| id     | country        | state         | county     | city       | country_code | state_code | latitude  | longitude  | population |
+--------+----------------+---------------+------------+------------+--------------+------------+-----------+------------+------------+
|  27100 | United Kingdom | England       | Manchester | Manchester | GB           | ENG        | 53.466301 |  -2.134200 |     395515 |
| 101626 | United States  | New Hampshire |            | Manchester | US           | NH         | 42.994801 | -71.458000 |     110229 |
|  84432 | Jamaica        | Manchester    |            | Mandeville | JM           | 12         | 18.033300 | -77.500000 |      47115 |
|  97475 | United States  | Connecticut   |            | Manchester | US           | CT         | 41.776001 | -72.528297 |      30577 |
| 113762 | United Kingdom | England       | Manchester | Heywood    | GB           | ENG        | 53.588402 |  -2.218800 |      28024 |
| 113760 | United Kingdom | England       | Manchester | Failsworth | GB           | ENG        | 53.505100 |  -2.161900 |      20555 |
|  26008 | United Kingdom | England       | Manchester | Swinton    | GB           | ENG        | 53.500000 |  -2.350000 |      20000 |
| 113629 | United Kingdom | England       | Manchester | Didsbury   | GB           | ENG        | 53.414200 |  -2.228000 |      14292 |
|  95504 | United States  | Tennessee     |            | Manchester | US           | TN         | 35.498600 | -86.084801 |      10517 |
|  97951 | United States  | Iowa          |            | Manchester | US           | IA         | 42.486099 | -91.445198 |       5073 |
+--------+----------------+---------------+------------+------------+--------------+------------+-----------+------------+------------+
10 rows in set (0.002 sec)

The only problem though, is I don’t have a population for every row. The geonames databases for population are not complete. Some of the rows contains 0 for population.

So, if I run a query for "Sheffield, new zealand".I get:

MySQL [(none)]> SELECT * FROM geolocations WHERE MATCH('@(country,state,county,city,state_code) "^sheffield, new zealand$"/0.75') ORDER by population DESC limit 10 OPTION field_weights=(country=10, state=9, county=4, city=2, state_code=9), ranker=SPH04;
+-------+-------------+-------------------+--------+------------------+--------------+------------+------------+------------+------------+
| id    | country     | state             | county | city             | country_code | state_code | latitude   | longitude  | population |
+-------+-------------+-------------------+--------+------------------+--------------+------------+------------+------------+------------+
| 20470 | New Zealand | Auckland          |        | Auckland         | NZ           | AUK        | -36.850601 | 174.767899 |     417910 |
| 20248 | New Zealand | Wellington        |        | Wellington       | NZ           | WGN        | -41.286598 | 174.775604 |     381900 |
| 20459 | New Zealand | Canterbury        |        | Christchurch     | NZ           | CAN        | -43.533298 | 172.633301 |     363926 |
| 20384 | New Zealand | Auckland          |        | Manukau          | NZ           | AUK        | -36.965801 | 174.799393 |     362000 |
| 20361 | New Zealand | Auckland          |        | North Shore      | NZ           | AUK        | -36.801701 | 174.753296 |     207865 |
| 20436 | New Zealand | Waikato           |        | Hamilton         | NZ           | WKO        | -37.798000 | 175.271698 |     152641 |
| 20447 | New Zealand | Otago             |        | Dunedin          | NZ           | OTA        | -45.858398 | 170.550095 |     114347 |
| 20589 | New Zealand | Bay of Plenty     |        | Tauranga         | NZ           | BOP        | -37.684601 | 176.162003 |     110338 |
| 20394 | New Zealand | Wellington        |        | Lower Hutt       | NZ           | WGN        | -41.216702 | 174.916702 |     101194 |
| 20336 | New Zealand | Manawatu-Wanganui |        | Palmerston North | NZ           | MWT        | -40.376099 | 175.580902 |      75996 |
+-------+-------------+-------------------+--------+------------------+--------------+------------+------------+------------+------------+
10 rows in set (0.006 sec)

This is because Sheffield in New Zealand has 0 in the population field.

Running the same query as above, but without the ORDER BY option, it correctly gets sheffield.

MySQL [(none)]> SELECT * FROM geolocations WHERE MATCH('@(country,state,county,city,state_code) "^sheffield, new zealand$"/0.75') limit 10 OPTION field_weights=(country=10, state=9, county=4, city=2, state_code=9), ranker=SPH04;
+--------+-------------+-------------------+--------+--------------+--------------+------------+------------+------------+------------+
| id     | country     | state             | county | city         | country_code | state_code | latitude   | longitude  | population |
+--------+-------------+-------------------+--------+--------------+--------------+------------+------------+------------+------------+
|  20304 | New Zealand | Canterbury        |        | Sheffield    | NZ           | CAN        | -43.386002 | 172.012604 |          0 |
|  20367 | New Zealand | Taranaki          |        | New Plymouth | NZ           | TKI        | -39.070400 | 174.065201 |      49168 |
| 111438 | New Zealand | Auckland          |        | New Lynn     | NZ           | AUK        | -36.906898 | 174.686996 |          0 |
|  20239 | New Zealand | Southland         |        | Wyndham      | NZ           | STL        | -46.333328 | 168.850006 |        586 |
|  20240 | New Zealand | Waikato           |        | Whitianga    | NZ           | WKO        | -36.833302 | 175.699997 |       3367 |
|  20241 | New Zealand | Waikato           |        | Whatawhata   | NZ           | WKO        | -37.799999 | 175.149994 |          0 |
|  20242 | New Zealand | Auckland          |        | Whangaparaoa | NZ           | AUK        | -36.633301 | 174.766693 |          0 |
|  20243 | New Zealand | Taranaki          |        | Whangamomona | NZ           | TKI        | -39.150002 | 174.733307 |          0 |
|  20244 | New Zealand | Waikato           |        | Whangamata   | NZ           | WKO        | -37.196098 | 175.866898 |       4253 |
|  20245 | New Zealand | Manawatu-Wanganui |        | Whakarongo   | NZ           | MWT        | -40.333302 | 175.666702 |          0 |
+--------+-------------+-------------------+--------+--------------+--------------+------------+------------+------------+------------+
10 rows in set (0.007 sec)

I’m not sure what I can do about this. Maybe I could give ORDER BY less power somehow.

If I could only use ORDER BY if there are matching results. For example if there are multiple Sheffield’s in New Zealand, then determine which one to rank first by population. If there are not multiple sheffield’s in new zealand then there’s no need to use ORDER BY.

Any ideas?

zaxxon_32 · September 30, 2020, 1:10pm

I’ve managed to work out a query which seems to rank the way I want most of the time,

SELECT *, INTEGER((WEIGHT()*1000)+ln(population)) AS myorder FROM geolocations WHERE MATCH('@(country,state,county,city,state_code) "^sheffield, new zealand$"/0.75') ORDER BY myorder DESC limit 20 OPTION field_weights=(country=10, state=9, county=8, city=9, state_code=9), ranker=sph04;

This allows me to give weight() plenty of ranking power, but also take population into account when ranking.