HOWTO: Perform SQL type grouping and aggregation using Tamin

It seems that quite a few of us initially struggle with XQUERY when we want to perform typical relational database operations such as data aggregation e.g. GROUP BY.

I thought it would be a good idea to document my findings in the hope that it saves other Tamino users the initial learning curve with certain aspects of XQUERY.

The example below will perform a typical RDBMS group by operation on documents within a tamino schema. The example uses Tamino 4.2.x.x.

Lets say I have a doctype called enquiry. It contains all enquries raised within a call center application. Each enquiry holds details of the business service the enquiry was against.

Lets say the data looks like this (many of the elements in this doc type have been removed as they are not relevant)

<enquiry>
 <id>1</id>
 <service-description>applications</service-description>
</enquiry>
<enquiry>
 <id>2</id>
 <service-description>applications</service-description>
</enquiry>
<enquiry>
 <id>3</id>
 <service-description>finance</service-description>
</enquiry>
</pre><BR><BR>Its typical that we may want to produce a report that gives the count of enquries for each service.  In a relational database this query would be as follows:<BR><BR><pre class="ip-ubbcode-code-pre">
SELECT service_description, count(*) from enquiry group by service_description;
</pre><BR><BR>We can achieve the same output from tamino by using the follow XQUERY:<BR><BR><pre class="ip-ubbcode-code-pre">
for $service in distinct-values(input()/enquiry/service-description/text())
let $svccount := count(input()/enquiry[service-description = $service])
return
<service>
<name>{$service}</name>
<count>{$svccount}</count>
</service>
</pre><BR><BR>The based upon the above XML data the result would be:<BR><BR><pre class="ip-ubbcode-code-pre">
<service>
 <name>applications</name>
 <count>2</count>
</service>
<service>
 <name>finance</name>
 <count>1</count>
</service>



Note: One thing that should also be taken into account is indexing on the doctype. My sample set of data has 8100 enquiry documents (containing 45 distinct services). Without indexing the service-description element the query took approx 2 minutes to run. If I create a standard index on the service-description element the query time drops significantly down to 5 seconds.

Maybe the if anyone else comes up with useful guides the forum moderator can make them sticky threads. It could give people a quick start and stop the posting of similar threads.

Regards

Ian Gratton

[This message was edited by Ian Gratton on 05 October 2004 at 11:18.]

[This message was edited by Ian Gratton on 05 October 2004 at 12:03.]

Now for todays installment…

The example below shows how you can aggregate 2 levels of data from the same document type. It builds on the data used in the previous example

<enquiry>
 <id>1</id>
 <group-name>General</group-name>
 <service-description>applications</service-description>
</enquiry>
<enquiry>
 <id>2</id>
 <group-name>General</group-name>
 <service-description>applications</service-description>
</enquiry>
<enquiry>
 <id>3</id>
 <group-name>General</group-name>
 <service-description>finance</service-description>
</enquiry>
<enquiry>
 <id>4</id>
 <group-name>Departments</group-name>
 <service-description>finance</service-description>
</enquiry>
</pre><BR><BR>If I wanted to group by group-name and service-description I would write the following in SQL:<BR><BR><pre class="ip-ubbcode-code-pre">
SELECT
      group_name,
      service_description,
      count(*) as enquiry_count
FROM
      enquiry
GROUP BY
      group_name,
      service_description
ORDER BY
      group_name,
      service_description
</pre><BR><BR>In XQUERY we can obtain the same results like so:<BR><BR><pre class="ip-ubbcode-code-pre">
for $group in distinct-values(input()/enquiry/group-name)
return
	for $service in distinct-values(input()/enquiry[group-name = $group]/service-description)
	let $svccount := count(input()/enquiry[group-name = $group and service-description = $service])
	return
	<row>
		<group>{$group}</group>
		<service>{$service}</service>
		<enquiry-count>{$svccount}</enquiry-count>
	</row>
</pre><BR><BR>This would give us the following results<BR><BR><pre class="ip-ubbcode-code-pre">
<row>
 <group>General</group>
 <service>applications</service>
 <enquiry-count>2</enquiry-count>
</row>
<row>
 <group>General</group>
 <service>finance</service>
 <enquiry-count>1</enquiry-count>
</row>
<row>
 <group>Departments</group>
 <service>finance</service>
 <enquiry-count>1</enquiry-count>
</row>



Regards

Ian Gratton

Hi,

Your grouping examples are very nice. I would suggest dropping the unnecessary text() node test from the path expression that is the argument of the distinct-values() function in your first grouping example:

for $service in distinct-values(input()/enquiry/service-description/text())
let $svccount := count(input()/enquiry[service-description = $service])
return

{$service}
{$svccount}

.
The reason is that the node test will avoid an index-based processing of the distinct-values() function.

For your complex grouping example:

for $group in distinct-values(input()/enquiry/group-name)
return
for $service in distinct-values(input()/enquiry[group-name = $group]/service-description)
let $svccount := count(input()/enquiry[group-name = $group and service-description = $service])
return

{$group}
{$service}
{$svccount}



I would suggest the following rewriting:

for $group in distinct-values(input()/enquiry/group-name)
for $service in distinct-values(input()/enquiry/service-description)
let $partition := input()/enquiry[group-name = $group and service-description = $service]
let $svccount := count($partition)
where $partition
return

{$group}
{$service}
{$svccount}

If there is a standard index on ?group-name? and ?service-description? the rewritten query should show a better performance. The reason is that the Tamino 4.2 XQuery Processor only does an index based processing of the distinct-values() function, if it is applied on an unfiltered path expression.

I?m afraid we do not support ?sticky? threads but I will think about a different way to collect interesting posts to provide a quick start to XQuery beginners.

Best regards,

Thorsten Fiebig

Hi Thorsten,

I’ve just run your re-worked query for the second example and it executes quicker than my original.

I think the only problem I have is that your query actually returns additional rows of data. These additional rows contain all combinations of group and service, even if they do not have any enquiries (i.e. enquiry-count of 0). The group by in the relational database equivalent would not have any data to build these results from.

I modified the query to include an additional where clause like so:

for $group in distinct-values(input()/enquiry/group-name)
for $service in distinct-values(input()/enquiry/service-description)
let $partition := input()/enquiry[group-name = $group and service-description = $service]
let $svccount := count($partition)
where $partition and $svccount != 0
return
<row>
   <group>{$group}</group>
   <service>{$service}</service>
   <enquiry-count>{$svccount}</enquiry-count>
</row>



The query is now approximately twice as quick as my original. Thankyou very much.

I am very new to XQUERY myself so your advice is extremely welcome.

Regards

Ian Gratton

Hi,

There seems to be a bug in the Tamino XQuery processor because the check $svccount != 0 is redundant to the non-empty check on $partition.
To reproduce your problem I tried a similar query and it shows the same behavior.

Best regards,

Thorsten Fiebig

Part #3. Grouping by 3 fields.

Yesterday we managed to group data by two fields. Today I needed to be able to group data by 3 fields. The third field I want to group by the status of the enquiry.

In SQL

SELECT
      group_name as `group`,
      service_description as `service`,
      status as 'status',
      count(*) as `enquiry_count`
FROM
      enquiry
GROUP BY
      group_name,
      service_description,
      status
</pre><BR><BR>Based upon Thorsten's tips yesterday this is a very trivial task.  The XQUERY to group by the additional element is:<BR><BR><pre class="ip-ubbcode-code-pre">
for $group in distinct-values(input()/enquiry/group-name)
for $service in distinct-values(input()/enquiry/service-description)
for $statusdesc in distinct-values(input()/enquiry/status)
let $partition := input()/enquiry[group-name = $group and service-description = $service and status = $statusdesc]
let $svccount := count($partition)
where $partition and $svccount != 0
return
<row>
   <group>{$group}</group>
   <service>{$service}</service>
   <status>{$statusdesc}</status>
   <enquiry-count>{$svccount}</enquiry-count>
</row>
</pre><BR><BR>This would give us the following output:<BR><BR><pre class="ip-ubbcode-code-pre">
<row>
 <group>General</group>
 <service>applications</service>
 <status>open</status>
 <enquiry-count>2</enquiry-count>
</row>
<row>
 <group>General</group>
 <service>finance</service>
 <status>closed</status>
 <enquiry-count>1</enquiry-count>
</row>
<row>
 <group>General</group>
 <service>finance</service>
 <status>open</status>
 <enquiry-count>5</enquiry-count>
</row>
<row>
 <group>Departments</group>
 <service>finance</service>
 <status>open</status>
 <enquiry-count>1</enquiry-count>
</row>
<row>
 <group>General</group>
 <service>applications</service>
 <status>closed</status>
 <enquiry-count>2</enquiry-count>
</row>
</pre><BR><BR>So far so good.  The only problem you will notice with the results is that they are not ordered as they would be from a SQL database.  Without us saying so the SQL will automatically keep fields ordered by group, service and then status.  <BR><BR>In my case this order is important as the results will be fed into a reporting tool.<BR><BR>I believe Tamino is not doing this as the query has to 'join' the data back together by the 3 sets of distinct values (which are not retrieved in any specific order).  To ensure the data is returned in the order expected the query should be changed to include a <B>sort by</B> operation like so:<BR><BR><pre class="ip-ubbcode-code-pre">
for $group in distinct-values(input()/enquiry/group-name)
for $service in distinct-values(input()/enquiry/service-description)
for $statusdesc in distinct-values(input()/enquiry/status)
let $partition := input()/enquiry[group-name = $group and service-description = $service and status = $statusdesc]
let $svccount := count($partition)
where $partition and $svccount != 0
return
<row>
   <group>{$group}</group>
   <service>{$service}</service>
   <status>{$statusdesc}</status>
   <enquiry-count>{$svccount}</enquiry-count>
</row>
sort by (group, service, status)



There is no noticeable difference in performance after this change. As always - to help query performance insure the element you intend to group by has a standard index defined.

Tomorrow I will take the above query and filter the results on enquries between given a given start and end date.

Regards

Ian Gratton