Getting quick overview of attributes matching a query?


#1

Any idea how to quickly summerise the attributes matching the query?

Something LIKE this:

sphinxQL>select count(*),count(distinct takenday),count(distinct takenyear),count(distinct group_ids),count(distinct term_ids) from sample8A where match('@grid_reference TQ3838');

the above doesnt work, as can only do one COUNT(DISTINCT)

ERROR 1064 (42000): sphinxql: too many COUNT(DISTINCT) clauses near '),count(distinct group_ids),count(distinct term_ids) from sample8A where match('@grid_reference TQ3838')'

As complication is that it’s a distributed index, so COUNT(DISTINCT) actully isnt very accurate! Attributes may be MVA!

Doing lots of GROUP BY’s

SELECT takenday,count(*) FROM sample8A where match('@grid_reference TQ3838') GROUP BY takenday
SELECT takenyear,count(*) FROM sample8A where match('@grid_reference TQ3838') GROUP BY takenyear
SELECT GROUPBY(),count(*) FROM sample8A where match('@grid_reference TQ3838') GROUP BY group_ids

etc., does work, but need lots of queries (have over 30 attributes!)

Can be optimized a bit with FACET. But need to set a high LIMIT on each facet, to get the total count, because total_found (via SHOW META) for each FACET is not available.

Just wondering (hoping!) someone has a bright idea. Its for a faceting browsing system to get a summery of the facets available can then click to expand a single one, but getting idea how many of each first would be useful.


#2

you already could group by multiple values like

select count(*) from sample8A where match('@grid_reference TQ3838') group by takenday, takenyear, group_ids, term_ids;

we also have feature request in backlog to add similar feature to distinct expression like count(distinct attr1, attr2, other13)
in case this will fit your needs you might create ticket at issue tracker to be informed then feature will be pushed into master.


#3

ERROR 1064 (42000): index sample8A: MVA values can’t be used in multiple group-by

:slight_smile:

But in general looking for the count per attribute,

Contributors (7)
Years (5)
Days (34)
Groups (5)

Clicking one of them would then show you the 5 different years. And count per year, with the foll

Did wonder about

select GROUP_CONCAT(takenday),GROUP_CONCAT(takenyear),GROUP_CONCAT(term_ids) from sample8A where match('@grid_reference TQ3838') ;

then ‘count’ the number of distinct values in the result. But seems GROUP_CONCAT only works with plain numeric attributes, not strings nor MVA :frowning: - aside from the risk of the text being really long!

Been reading the source for Table Functions, seems thats the most likely avenue, but isnt formalized into a plugable UDFs :frowning:


#4

ISphTableFunc \ CSphTableFuncRemoveRepeats has lot of C++ arguments and depends on many library functions and types, ie embed result set into V8 scripting would be easier than make ISphTableFunc with environment accessible via UDF


#5

FACET has count per attribute and count distinct and I sure run only one query
Why

select id, count(*) from facetdemo2 facet Contributors facet Years facet Days facet Groups

does not work for you?


#6

Yes, FACET does work to get breakdown, but I just want the number of groups per facet, not the data. ie its the ‘total_found’ from the individual queries that actully interested in.

Kinda like this:

$attributes = array('takenday','takenyear','group_ids','term_ids','tag_ids','user_id'); //actully the full list is 30 attributes
$stat = array();
foreach ($attributes as $attribute ) {
   //dont need the data from this
   $conn->Execute("SELECT id FROM sample8 WHERE MATCH($q) GROUP BY $attribute LIMIT 0"); 

    //interested in the number of distinct values. 
   $meta = $conn->getAssoc("SHOW META");
   $stat[$attribute] = $meta['total_found']; 
}

which works, but is running lots of separate queries.

Could also add do it with facet, but to get all the rows, just to get the count would need a big limit

 select id,title FROM sample8 where match('@grid_reference TQ3838') facet takenyear limit 1000 facet takenyear limit 1000 facet group_ids limit 1000;

and get back LOTS of rows, when don’t need the actual rows. Just the number of each. ie cant get the total_found from the individual rows.

If show meta had a total_found from each facet that might work!

> select id,title FROM sample8 where match('@grid_reference TQ3838') facet takenyear limit 0 facet takenday limit 0 facet group_ids limit 0;
> show meta;
 +-------------------------+--------+
 | Variable_name           | Value  |
 +-------------------------+--------+
 | total                   | 166    |
 | total_found             | 166    |
 | total_found[takenyear]  |  16    |
 | total_found[takenday]   |  64    |
 | total_found[groups_ids] |  36    |
 | time                    | 0.520  |

#7

why not just limit 1 in case you need the only count row per query?

select id,title FROM sample8 where match('@grid_reference TQ3838') facet takenyear limit 1 facet takenyear limit 1 facet group_ids limit 1;

#8

Its just the number of rows/groups I want, not the individual counts data, need the high limit to get all the rows … so the client can count them. ie just number of distinct values in in the attribute.

sphinxQL>pager grep rows
PAGER set to 'grep rows'
sphinxQL>select id,title FROM sample8 where match('@grid_reference TQ3838') facet takenyear limit 1000 facet takenday limit 1000 facet group_ids limit 1000;
20 rows in set (0.19 sec)

16 rows in set (0.20 sec)

64 rows in set (0.20 sec)

36 rows in set (0.20 sec)

Would then be used by the application

Years (16)
Days (64)
Groups (36)

Depending on the query there could be hundreds of distinct values, which is why trying to avoid downloading all the data to the client to count it (and then immediately discard it!).


#9

ok now I see that you need to write count (distinct attr at right part of FACET or put multiple count (distinct attr at select list and there is no way to do it at daemon.

Could you write regular UDF that get raw attribute value, put it in a std::set then return std::set length and use that UDF result in outer select as described in subselect section https://manticoresearch.gitlab.io/dev/sphinxql_reference/select_syntax.html#subselects


#10

my bad seems you have to keep values in std::set at fast_rank then pass std::set into slow_rank and that is impossible for now too


#11

or maybe that will work

select id, my_count(years), my_count(days), my_count(groups) from sample8 match('@grid_reference TQ3838') ORDER BY id DESC LIMIT 1

and my_count is something like

DLLEXPORT sphinx_int64_t my_count ( SPH_UDF_INIT * init, SPH_UDF_ARGS * args, char * error_flag )
{
	std::set * q = (std::set*)init->func_data);
	int count = 0;
	if ( args->arg_count )
	{
		q->insert ( args->arg_values[0] );
		count = q->size();
	}
	return count;
}

this way for each document your UDF returns count of uniq values stored so far and with ORDER BY id DESC you will get last row with all values passed via your UDF’s


#12

Could you write regular UDF that get raw attribute value, put it in a std::set then return std::set length

Oh, interesting. Actully have a UDF tat does something VERY similar.

http://svn.geograph.org.uk/svn/modules/trunk/sphinx/uniqueserial.cpp

Possibly can adapt that.

Thank you!

(edited post, as was misremembering what the UDF did, and thought it broken)


#13

yes max(udf .. should be same as count(distinct ... or limit 1 order by id desc