Best approach to indexing and searching relational data

dkersten · April 2, 2024, 6:30pm

I am trying to figure out the best way to use Manticore for my needs. I have product data that is relatively static and singular to each product, such as descriptions, brands, etc. Then I have attribute and other data that can have 0-50 values per product. When searching, I want to be looking at all this data. On top of this, my products are assigned to warehouses, and I need to be able to filter by warehouse as well as text search some data within the warehouse record (warehouse specific product data). What is the best approach for this situation without losing all the performance of Manticore by having to run multiple queries and step through and query each one? It seems like aggregating the data is the answer, but then I multiply my data by a factor of several thousand, ending up with upwards of 90 iterations of the same data, which is sometimes thousands of characters per line already.

Here is an example:
I have a product record like “wx-202”. This has various descriptive data that I have compiled from different sources. This product is set up in 50 different warehouses, and each warehouse potentially has a different vendor for this product, a different vendor part number, and other whse specific data like prices and bin locations. This product also has 35 attributes (attribute name, attribute value, attribute unit of measure). I also have 7 bullet points for this product, and a table with 4 levels of categorization for this product. I have a user from a specific warehouse or group of warehouses and they should only see products from their warehouse(s). They may also choose to filter by category, brand, or also by any particular attribute (volume, size, color, etc.).

Should I aggregate all this data into one index that has a key of product+whse (since that is the only unique identifier for any aggregated product), or should I create separate indexes for each part and relate them to each other in some manner? I imagine this is a fairly common use case for a search engine, but I can’t seem to find any answer that is reliable and can be done without either heavy processing of multiple queries or massive indexes filled with duplicated data.

Any advice here is appreciated. Thanks

Sergey · April 3, 2024, 5:02am

Here’s what seemingly can help in your case:

JSON attributes - Manticore Search Manual: Creating a table > Data types . You can put any arbitrary meta data into it and then filter/group by it (but you can’t do full-text search in a json attribute).
The JOIN functionality which is available in the dev version (Manticore Search Manual) and is not documented yet. You can find query examples here manticoresearch/test/test_278/test.xml at master · manticoresoftware/manticoresearch · GitHub and the nuances of how it works here INNER/LEFT JOIN · Issue #1673 · manticoresoftware/manticoresearch · GitHub

dkersten · April 3, 2024, 2:12pm

Thank you, I will have to check out the dev version. For now I settled on aggregating text data and using MVA’s to store the related data for filtering (warehouses, companies, etc.). It seems to be working pretty well.

dkersten · April 5, 2024, 11:19pm

I tried the dev version and had little luck with the joins. I could join tables and select data that was related but couldn’t figure out the filtering. So I switched to trying json attributes. For each product in my table I have a json attribute field called “warehouse_data”, and in this json is an array called “warehouses”. Each record in this array looks like this and there could be up to 90 of these per product, one for each warehouse:

	{
		"whse": "102",
		"qoh": "37.00",
		"bin": "NN02B06",
		"hwcompany": 100,
		"statustype": "s",
		"nonstockty": ""
	},

I have a case where I need to filter my results based on combinations of the statystype, qoh value, and bin, while only looking at the data in my other filters, which will always be hwcompany, and sometimes is whse as well.

So my query without this filter would be like this:
SELECT id , ANY(x.hwcompany=200 FOR x IN warehouse_data.warehouses) AS hwcompany , WEIGHT() AS relevance FROM prod WHERE MATCH('*pump*') AND hwcompany=1 AND pcat NOT IN ('0106', '547', '548', '600') ORDER BY relevance DESC LIMIT 10000 OPTION max_matches=10000, ranker=sph04;

If I have a specific whse selected, it looks like this:
SELECT id, ANY(x.whse='201' FOR x IN warehouse_data.warehouses) AS whse, WEIGHT() AS relevance FROM prod WHERE MATCH('*pump*') AND pcat NOT IN ('0106', '547', '548', '600') AND whse=1 ORDER BY relevance DESC LIMIT 25 OPTION max_matches=10000, ranker=sph04;

So my use case is where I now need to add to this to only return results if the following conditions are met with the json data for either the specific warehouse or company. In this example I will just use hwcompany = 200 and format it how I would do it in SQL:
WHERE hwcompany=200 AND ((statustype='s' AND bin!='zzzzz') or (qoh>0 AND statustype IN('s','o'))

Any idea how I would do this using the json attributes? I tried using similar OR logic in the ANY condition statement but got errors and no results.

manx · April 11, 2024, 5:30pm

The way I handled it was to create some views from maria/mysql tables, then added the view as a source, then defined the index in the config file. It works quite well, although a thorough inspection for mariadb columns that needed integer indexes (and adding them) definitely helped performance.

dkersten · April 11, 2024, 8:06pm

I ended up using MySQL to build a value from 0 to 2 that fits the criteria, then I insert that into my index. Since I put the value in the json attribute, I can use it in combination with company or whse data to get only results pertaining to those filters. I’ve since added more json attributes that I use to filter data, such as my product attributes, and it works well.

I went back and tried more with joins, but every time I use a join, the query log shows it just removes that part from my query and pretends I didn’t query any of the data (and returns NULL for the columns I selected). I could query “SELECT id, attr.name FROM prod LEFT JOIN attr ON attr.prod_id = prod.id” and the manticore query log shows I queried “SELECT id, attr.name FROM prod” and the attr.name field is all NULLs. Very odd behavior. JSON attributes is by far the better way to go. I can both filter by the data and return the data I need in the query.

manx · April 11, 2024, 8:29pm

I wasn’t talking about manticore joins, I was talking about a mysql view (with joins). .then importing the mysql view as a source, which I guarantee works just fine.

I’m not sure I understand your issue well-enough to say something intelligent about it, but declaring a field sql_attr_uint seems to be a bugbear in certain situations. I’ve used it for things I want to be able to use the @ in queries… like … “keyword” @document_series_id 10 … kind of short hand… but then it tends to affect other queries. [unexpected results, short results…, etc…]

dkersten · April 11, 2024, 9:51pm

I understood what you are doing and it’s similar to how I handled the situation in my case. I just figured I would throw in my findings on the previously mentioned solution of using the dev version and joins, which didn’t work well for me. Ultimately, the combination of json attributes with some creative aggregation of data prior to inserting into the Manticore Index was the solution. In the specific case I was trying to solve, I used logic in my MySQL query to come up with a range of values, each representing a different logical case, and added that to my json attributes. Originally I was just trying to use the same query logic to filter my results, but the Manticore query language just didn’t support it. It would be nice if Manticore had better support for complex relational (many to many) data that doesn’t aggregate well or needs complex logic to filter, but with some creating thinking it seems I was able to overcome each challenge I had with using Manticore, at least so far.