Can you GROUP BY over HTTP?

Sembiance · February 2, 2024, 6:13pm

Is it possible to do GROUP BY <field> with an HTTP JSON request?
Something like SELECT fileid, filehash FROM files WHERE MATCH('view') GROUP BY filehash to remove duplicate files and only return 1 match per filehash
This works great as SQL, but I’m calling manticore with a HTTP JSON request and don’t see any way in the manual to do a simple ‘GROUP BY ’

Sergey · February 3, 2024, 3:49am

Please see JSON examples in the manual - Manticore Search Manual: Searching > Grouping

Sembiance · February 3, 2024, 4:52am

Yah, I tried that, but it didn’t reduce the quantity of .hits that were returned, the grouped/duplicate filehashes were still returned. I used aggs : {filehash : {terms : {field : “filehash”}}}
Is there some way to specify for a given agg that it should omit all but 1 of the duplicated/grouped hits? I tried size:1 and size:0, no luck.
I’m already using several aggs for faceted search, where I don’t want the hits to change, just interested in the buckets results.
I guess I’m confused how ‘GROUP BY’ and ‘FACET’ from the SQL can some how both be handled by the same ‘aggs’ JSON arg

Sergey · February 3, 2024, 4:31pm

I get the same results via SQL and JSON. Can you modify this example to exemplify the issue.

mysql> drop table if exists files; create table files(fileid int, filehash int, f text); insert into files(f, fileid, filehash) values('view', 1, 1),('view', 2, 1),('view', 3, 2); SELECT fileid, filehash FROM files WHERE MATCH('view') GROUP BY filehash;
--------------
drop table if exists files
--------------

Query OK, 0 rows affected (0.01 sec)

--------------
create table files(fileid int, filehash int, f text)
--------------

Query OK, 0 rows affected (0.01 sec)

--------------
insert into files(f, fileid, filehash) values('view', 1, 1),('view', 2, 1),('view', 3, 2)
--------------

Query OK, 3 rows affected (0.00 sec)

--------------
SELECT fileid, filehash FROM files WHERE MATCH('view') GROUP BY filehash
--------------

+--------+----------+
| fileid | filehash |
+--------+----------+
|      1 |        1 |
|      3 |        2 |
+--------+----------+
2 rows in set (0.01 sec)
--- 2 out of 2 results in 0ms ---

➜  ~ curl -sX POST http://localhost:9308/search -d '
    {
     "index" : "files",
     "query" : {"match" : {"_all": "view"} },
     "limit": 0,
     "aggs" :
     {
        "filehash_group" :
         {
            "terms" :
             {
              "field": "filehash",
              "size": 100
             }
         }
     }
    }
'|jq .
{
  "took": 0,
  "timed_out": false,
  "hits": {
    "total": 3,
    "total_relation": "eq",
    "hits": []
  },
  "aggregations": {
    "filehash_group": {
      "buckets": [
        {
          "key": 2,
          "doc_count": 1
        },
        {
          "key": 1,
          "doc_count": 2
        }
      ]
    }
  }
}

Sembiance · February 3, 2024, 6:37pm

Here is a simplified test case as a bash script that shows the issue with Manticore 6.2.12 dc5144d35@230822 (columnar 2.2.4 5aec342@230822) (secondary 2.2.4 5aec342@230822):

#!/bin/bash

server="http://127.0.0.1:17000"

cd /tmp/test || exit

rm -rf test.config data
mkdir data

echo 'searchd {
	data_dir = /tmp/test/data
	listen = 127.0.0.1:17000:http
	pid_file = /tmp/test/searchd.pid }' > test.config

searchd --config ./test.config > /dev/null

curl -sX POST "$server"/cli_json -d 'CREATE TABLE test(
	name string attribute indexed,
	filehash string, 
	itemid int, 
	family bit(8))' > /dev/null

curl -sX POST "$server"/insert -d '{"index":"test","doc":{"name":"view exe","family":4,"filehash":"8e8e4fa0065f64324f6a8d831f7012aed880533ee7b9514a06d07f91bd0a2d83"}}' > /dev/null
curl -sX POST "$server"/insert -d '{"index":"test","doc":{"name":"view exe","family":4,"filehash":"8e8e4fa0065f64324f6a8d831f7012aed880533ee7b9514a06d07f91bd0a2d83"}}' > /dev/null
curl -sX POST "$server"/insert -d '{"index":"test","doc":{"name":"view exe","family":4,"filehash":"8e8e4fa0065f64324f6a8d831f7012aed880533ee7b9514a06d07f91bd0a2d83"}}' > /dev/null
curl -sX POST "$server"/insert -d '{"index":"test","doc":{"name":"view bat","family":3,"filehash":"3ac7b63d959005467a1a61e7b5365bbb6143bfbc914b28a08fc2b35120a1d613"}}' > /dev/null
curl -sX POST "$server"/insert -d '{"index":"test","doc":{"name":"dl view doc","family":7,"filehash":"cf76c8502dad15688c0152dacc3f583196b5e76be731376d17b8d0336fd35390"}}' > /dev/null
curl -sX POST "$server"/insert -d '{"index":"test","doc":{"name":"dl view exe","family":4,"filehash":"767d31736dc77f04e87ca2b31af2e63d4f2c2b0beee1dfe134f6d2d1123f9509"}}' > /dev/null
curl -sX POST "$server"/insert -d '{"index":"test","doc":{"name":"dl view hlp","family":3,"filehash":"45cc89b62219f8a637b0bb37bb135a528b38d441467618b2461c5477e66dead7"}}' > /dev/null
curl -sX POST "$server"/insert -d '{"index":"test","doc":{"name":"view exe","family":4,"filehash":"f94480cf396bfd33e3aeca5afae250d299bd02a5ee528e74d9a1747f8232c3c6"}}' > /dev/null
curl -sX POST "$server"/insert -d '{"index":"test","doc":{"name":"view hlp","family":3,"filehash":"45cc89b62219f8a637b0bb37bb135a528b38d441467618b2461c5477e66dead7"}}' > /dev/null
curl -sX POST "$server"/insert -d '{"index":"test","doc":{"name":"bbview","family":0,"filehash":"c39d814717c46d4a07ae9a9dbec79efa1a274e01df57091abf564e90d40a51f8"}}' > /dev/null
curl -sX POST "$server"/insert -d '{"index":"test","doc":{"name":"bbview doc","family":3,"filehash":"4567609ee9a33f8de234e5fc6d79cdd22fe4f415cd3101d180e56c172938c25a"}}' > /dev/null
curl -sX POST "$server"/insert -d '{"index":"test","doc":{"name":"bbview exe","family":4,"filehash":"e2373ab662ec2b6002dce40d9d1053b988b9314027d19895c738d1baf9225903"}}' > /dev/null
curl -sX POST "$server"/insert -d '{"index":"test","doc":{"name":"bbview ico","family":2,"filehash":"f7b5136171af416e8758d9da31db49b96d0299bc0a654fa311cb1b53db50d065"}}' > /dev/null


echo -e "SQL query returns 5 hits:"
curl -sX POST "$server"/cli_json -d "SELECT name, filehash FROM test WHERE MATCH('view') AND family=4" | jq '.[0].total'

echo -e "\nHTTP query 5 hits:"
curl -sX POST "$server"/search -d '{"index":"test","query":{"match":{"_all":"view"},"equals":{"family":4}}}' | jq '.hits.total'

echo -e "\n\nSQL query with GROUP BY return 3 hits:"
curl -sX POST "$server"/cli_json -d "SELECT name, filehash FROM test WHERE MATCH('view') AND family=4 GROUP BY filehash" | jq '.[0].total'
echo -e "This is the behavior I want and expected, where the 'GROUPED' hits are reduced so only 1 hit per filehash is returned."

echo -e "\nHTTP query with the 'aggs' method returns all 5 hits still (unexpected):"
curl -sX POST "$server"/search -d '{"index":"test","query":{"match":{"_all":"view"},"equals":{"family":4}},"aggs":{"filehash_group":{"terms":{"field":"filehash"}}}}' | jq '.hits.total'

echo -e "\nHOW can I do an HTTP query that returns the same number of hits with the same behavior as the SQL query with GROUP BY?"

searchd --config ./test.config --stopwait > /dev/null
rm -rf test.config data

Sergey · February 7, 2024, 7:02am

Why don’t you do jq '.aggregations.filehash_group.buckets | length' instead of jq '.hits.total' ?

Sembiance · February 7, 2024, 1:15pm

Your suggestion just gives me the hit count/length. I’m not after just the ‘length’, I’m after the ACTUAL hit data itself. In this simplified test I only have 4 columns but in the real search I have 20 or so and want a lot more of the column data. The buckets don’t contain any data other than what you aggregate on. Thus the only place to retrieve that real data is in the hits result data, not the buckets. Sadly the flaw is that SQL GROUP BY returns just the 3 entries which is the behavior I want to replicate in the JSON QUERY where it’s incorrectly returning 5 entries.

This is likely a fundamental flaw in the ‘search via JSON’ API and Manticore likely doesn’t support this capability via the JSON query API due to how it tries to combine 2 different concepts GROUP BY and FACET into the same query interface.

I can re-engineer my code to formulate ‘SQL’ statements for search instead of using the JSON API. I was just really hoping to avoid that because I think properly ‘escaping’ the user data will be much more of a security nightmare when forming the SQL statement vs submitting a JSON query.

Sergey · February 8, 2024, 3:44am

in the JSON QUERY where it’s incorrectly returning 5 entries

“hits” returns 5 while you need not “hits”, but “aggregations”. But I see your point - you want what’s called top_hits in Elasticsearch:

➜  ~ curl -sX DELETE -H 'Content-Type: application/json' http://localhost:9200/files > /dev/null
curl -sX PUT -H 'Content-Type: application/json' http://localhost:9200/files/_doc/1  -d '{"f": "view", "fileid": 1, "filehash": 1}' > /dev/null
curl -sX PUT -H 'Content-Type: application/json' http://localhost:9200/files/_doc/2  -d '{"f": "view", "fileid": 2, "filehash": 1}' > /dev/null
curl -sX PUT -H 'Content-Type: application/json' http://localhost:9200/files/_doc/3  -d '{"f": "view", "fileid": 3, "filehash": 2}' > /dev/null

➜  ~ curl -sX GET -H 'Content-Type: application/json' http://localhost:9200/files/_search -d '{
  "size": 0,
  "aggs": {
     "agg_filehash": {
       "terms": {
         "field": "filehash",
         "size": 10
       },
       "aggs": {
          "agg_top": {
            "top_hits": {
              "_source":["*"],
              "size": 1
            }
          }
       }
     }
  }
}'|jq .

{
  "took": 10,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "agg_filehash": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": 1,
          "doc_count": 2,
          "agg_top": {
            "hits": {
              "total": {
                "value": 2,
                "relation": "eq"
              },
              "max_score": 1,
              "hits": [
                {
                  "_index": "files",
                  "_id": "1",
                  "_score": 1,
                  "_source": {
                    "f": "view",
                    "fileid": 1,
                    "filehash": 1
                  }
                }
              ]
            }
          }
        },
        {
          "key": 2,
          "doc_count": 1,
          "agg_top": {
            "hits": {
              "total": {
                "value": 1,
                "relation": "eq"
              },
              "max_score": 1,
              "hits": [
                {
                  "_index": "files",
                  "_id": "3",
                  "_score": 1,
                  "_source": {
                    "f": "view",
                    "fileid": 3,
                    "filehash": 2
                  }
                }
              ]
            }
          }
        }
      ]
    }
  }
}

Right?

Unfortunately it’s indeed not available via JSON yet. Feel free to create a feature request on GitHub.

Sembiance · February 8, 2024, 6:35pm

Yup, that top_hits thing appears to be what I’m after. I re-coded my code to generate a SQL style query instead of the JSON API and everything works great now, so no worries

TallSteve · February 13, 2024, 12:10pm

Why not ditch the JSON nonsense and use the SQL endpoint; sending your SQL query directly via HTTP. You’ll need to url encode the query, but most languages have a function for this.
https://manual.manticoresearch.com/Connecting_to_the_server/HTTP#SQL-over-HTTP

Note the raw output option too, which may be useful.

I am converting from Elasticsearch and not having to use that nasty machine readable JSON stuff is a joy.